The Impact of Cluster Virtualization on HPC

By Nicole Hemsoth

December 1, 2006

Is virtualization the next “big thing” to impact high performance computing or is much of the buzz just hype? Of its many definitions, which technologies or virtualization strategies are actually solving real HPC challenges today? The adoption of Linux and Windows clusters, the rise of multi-core processors, and the spread of various virtualization technologies offer a great deal of potential for organizations to increase capabilities and reduce costs.

In part one of a two-part interview, HPC luminary Don Becker, CTO of Penguin Computing and co-inventor of the Beowulf clustering model, and Pauline Nist, senior vice president of product development and management for Penguin Computing, discuss how cluster virtualization can enable large pools of servers to appear and act as a single, unified system. According to them, a cluster virtualization strategy is one of the most practical and cost-effective methodologies for reducing the complexity and overall administrative burden of clustered computing and getting the most out of your server resources.

HPCwire: There is a tremendous amount of buzz surrounding virtualization these days, and the term seems to be applied to many different things. Can you explain a bit about what virtualization is/means and what its benefits are?

Donald Becker: There is a lot of buzz, but it’s not necessarily hype. However there are multiple definitions of “virtualization” – which can cause confusion. Essentially, the two different types boil down to 1). Making many look like one and 2). Making one machine work as many machines. There is a lot of activity and noise right now around virtualization as it applies to virtual machine technology. Because of this there is a tendency to think of VMs as the only example of virtualization.

Machine virtualization or virtual machine technology (a la VMware and Xen) is where you divide up the resources of a single server into multiple execution environments, or multiple virtual machines. It’s the exact opposite of making many look like one — this is making one work as many separate instances. This concept has been around for 40+ years in proprietary IBM machines but much of the recent buzz in the marketplace today is around the proliferation of this capability on commodity servers and operating systems.

HPCwire: Why is there so much interest right now in the virtual machine technology?

Becker: The key driving factor is proliferation of applications in the enterprise and the fact that this technology is now available on commodity x86 platforms prolific in the enterprise computing today. Many of these applications have strict requirements for the environment they run in, sometimes mutually exclusive requirements with other apps, and many are business critical functions that cannot be compromised.

As a result, apps became silo’d across racks of servers, each getting about 15-20 percent utilization. This is really inefficient and very costly to scale. What’s appealing about virtual machine technology is consolidation of many applications on to a single server, such that each app thinks it has the machine to itself. Each virtual machine is effectively a logical instance of the complete machine environment with its own CPU, memory and I/O. The net effect is fewer physical servers with much higher per-server utilization. This reduces capital costs on hardware and operational costs related to power, real estate, etc. It does not however take away the fact that you still have the same, or perhaps even larger, pools of logical servers that must be provisioned and managed.

Pauline Nist: The major enterprise server vendors, Intel and AMD are rolling out ongoing mechanisms at the processor level to better enable virtual machine technology and position themselves as optimal platforms for this new wave of innovations. Solutions like VMware have built a tremendous business in just a few short years bringing a VM solution to both Linux and Windows platforms. First with a capability that sits on top of the standard OS, and then with a variant that sits right on top of the hardware and launches OS environments. This is where it gets very interesting.

And now, you have Microsoft asserting its intentions to offer core virtual machine capability in Windows. The major enterprise Linux distros, Red Hat and Novell, working with the open source company XenSource, maintainers of the Linux VM technology Xen.

These OS platform providers are all making announcements and clearly jockeying for position to stake their claim as the “socket” on the hardware into which software plugs in. Meanwhile, VMware maintains such a strong presence on both Windows and Linux that the OS vendors will be of course working closely with them as this all progresses. It will be interesting to see how this all shakes out.

HPCwire: So we’ve heard about one definition of virtualization: “Making one machine work as many machines.” Tell us more about the other definition: “Making many look like one.”

Nist: Historically, the broader application of the term virtualization has been applied to the abstraction of physical devices from their clients to provide better availability and mobility while reducing the complexity of dealing with larger and larger pools of resources.

For example, virtualization is a well-known concept in networking, from virtual channels in ATM, to virtual private networks, virtual LANs and virtual IP addresses.

Storage virtualization is the pooling of physical storage from multiple network storage devices into what appears to be a single storage device that is managed from a central console — for example, a storage area network or SAN.

Likewise, clustered and Grid computing enables the “virtualization” of increasingly large pools of servers, along with their interconnects and storage needs, to reduce the complexity of managing and coordinating these resources for the purpose of distributed computing.

Becker: Organizations should be taking a closer look at their server utilization and determining if a cluster virtualization strategy would work for them. The financial and efficiency benefits are extremely compelling — and today’s tools and solutions offer very easy integration, making cluster virtualization one of the most practical and cost-effective methodologies for reducing the complexity and overall administrative burden of clustered computing.

This is where Penguin Computing plays — Our virtualized cluster solutions are driven by Scyld ClusterWare, our unique Linux software, which makes large pools of Linux servers appear and act like a single , consistent, virtual system. Through a single point of command/control, or “Master,” thousands of systems can be managed as if they were a single, consistent, virtual system.

This dramatically simplifies deployment and management and significantly improves datacenter resource utilization as well as server performance and scalability. Our aim is to make Linux clustering as powerful and easy to use as expensive symmetric multiprocessing (SMP) environments, at a Linux price point.

By combining the economics of Open Source with the simplicity and manageability of cluster virtualization, we’re able to provide an extremely compelling financial benefit to our customers by removing many of the on-going costs associated with complex management and administration of competitive offerings.

HPCwire: Where and how did this concept of cluster virtualization start? What was the driving force and how has it impacted HPC?

Becker: Cluster computing has become the hottest hardware trend in HPC, and is poised to ride the wave of “mainstream adoption.” Scientific and academic organizations have broadly embraced clustered system architectures as a cost effective alternative to monolithic UNIX SMP systems for computationally intensive simulations. Commercial businesses have more recently adopted Linux clusters for edge-of-the network applications such as Web hosting.

The tipping point that enabled the initial move to Linux and clustered solutions in HPC was driven predominantly by the compelling order-of-magnitude cost savings offered by commodity hardware and/or open source software. As the market has matured and more complete solutions have come available, the value proposition for Linux clusters has expanded as well to include significant TCO savings spanning hardware, software, and support costs; flexibility in configuration and upgrade options; freedom from constraints of single vendor development schedules and support; greater flexibility through open source software customization; and rapid performance advancements in industry standard processor, storage, and networking technologies.

The catalyst behind the evolution of cluster virtualization was the ongoing complexity of administration and management of traditional, roll-your-own Linux clusters. This complexity added significant dollars to the overall cost of a Linux cluster. In the midst of the rapid adoption of Linux clusters, IDC conducted a research study in 2003 asking users to identify the greatest challenges they faced in implementing clusters. The most frequently cited response was system management, which was selected by 48 percent of respondents. Breaking the issues of manageability down further, we see a number of challenges that cluster virtualization was designed to address.

Ad hoc, do-it-yourself clusters are cumbersome to use and generally adapted only to batch processing. Usage habits change. Clusters in HPC have been effectively adapted for batch computing environments, but until recently have not been well suited for interactive use.

This is why batch schedulers have been so prevalent as the issue for users is that they are presented in a traditional cluster with 100 separate servers for a 100-node cluster. Managing jobs and their related processes, data etc. means manually interacting with all of those machines. It is not only tedious and time consuming to do so, it is also error prone.

Complexity of administration and management can be both time-intensive and costly. This complexity is even worse for the admin who must set up and maintain the cluster. The admin is presented with 100 separate machines that must be loaded with the OS and cluster management software and this can take 5 to 30 minutes for each machine if everything goes well. Then all of the servers must be configured for access, users, security and coordinating as a cluster. Extensive scripting is usually required and these scripts must be maintained by different people over time adding overhead and cost.

What’s more, many DIY solutions are simply open source projects lacking a bit in terms of formal support and documentation. It’s true that open source has zero upfront costs compared to professionally developed cluster solutions. However, there are many associated costs — in terms of time and resources — that open source carries over the lifetime of the cluster.

It is far more cost effective to use a commercially developed and fully supported solution that can simplify the management of large pools of servers.

System stability is highly dependent on software consistency across compute nodes. Even a simple inconsistency can cripple a parallel program. An effective management software solution is needed to handle the underlying OS distribution, communication libraries such as MPI, as well as revision control for third-party software.

When software provisioning involves full Linux distributions and supporting software to each cluster node’s disk, the chances of inconsistency are inherently much higher. With HPC jobs that sometimes takes weeks or months to complete, the risk of wasted work is very costly. It is far better to use stateless provisioning of software — direct to memory — as it is safer and orders of magnitude faster to reprovision on demand.

Security threats can be introduced with each server deployed. A full install of Linux on each cluster node introduces the software components that hackers typically target. Full user shells and network services expose multiple attack points and should not actually be required if the cluster were architected properly.

HPCwire: Does cluster virtualization address these challenges?

Nist: Cluster virtualization enables large pools of servers to act and appear like a single, consistent virtual machine. Practically, it makes a Linux cluster so easy to use that you don’t have to be a sys admin to run one.

At Penguin, our clusters are application-ready, so our customers can be up and running — fast. They are driven by Scyld ClusterWare which is a productive, simple and hardware-agnostic HPC system that enables administrators to install, monitor and manage the cluster as a single system, from a single node — the Master.

It has a powerful unified process space so end users can easily and intuitively deploy, manage and run complex applications from the Master, as if it were a single virtual machine. What this means is that the users and the admin are presented with what appears to be a single server that can scale up or down with virtually any number of processors behind it and still be as simple to use and manage as a single server. The compute nodes exist only to run applications specified by the Master node.

Because the compute nodes run an initial lightweight, in-memory distribution, they can be provisioned rapidly so users can flexibly add or delete nodes on demand, in seconds, making the cluster extraordinarily scalable and resilient. The compute nodes can at any time, in this architecture, provision precisely the additional dependencies required for applications. The lightweight compute environment is stripped of any unnecessary system services and associated vulnerabilities, making the cluster nearly impossible to attack, thus inherently more secure.

Finally, with single point provisioning and consistency control mechanisms, there are no version skews, so the system is intrinsically more reliable. Compared to legacy technologies, our virtualized clusters offer a more efficient, secure way to manage servers that delivers the productivity, reliability, scalability and lower total cost of ownership that high-performance business computing demands.

Read Part Two of the “The Impact on Cluster Virtualization“ interview.

—–

Donald Becker is the CTO of Penguin Computing and co-inventor of Beowulf clusters. Donald is an internationally recognized operating system developer and the original inventor of Beowulf clustering. In 1999 he founded Scyld Computing and led the development of the next-generation Beowulf cluster operating system. Prior to founding Scyld, Donald started the Beowulf Parallel Workstation project at NASA Goddard Space Flight Center. He is the co-author of How to Build a Beowulf: A Guide to the Implementation and Application of PC Clusters. With colleagues from the California Institute of Technology and the Los Alamos National Laboratory, he was the recipient of the IEEE Computer Society 1997 Gordon Bell Prize for Price/Performance.

Pauline Nist is the SVP of Product Development and Management at Penguin Computing. Before joining Penguin Computing, Pauline served as vice president of Quality for HP’s Enterprise Storage and Servers Division and immediately prior to that, as vice president and general manager for HP’s NonStop Enterprise Division, where she was responsible for the development, delivery, and marketing of the NonStop family of servers, database, and middleware software. Prior to the NonStop Enterprise Division (formerly known as Tandem Computers), Pauline served as vice president of the Alpha Servers business unit at Digital Equipment Corporation.

Topics: Middleware, Systems

Kathy Yelick on Post-Exascale Challenges

April 18, 2024

With the exascale era underway, the HPC community is already turning its attention to zettascale computing, the next of the 1,000-fold performance leaps that have occurred about once a decade. With this in mind, the ISC Read more…

2024 Winter Classic: Texas Two Step

April 18, 2024

Texas Tech University. Their middle name is ‘tech’, so it’s no surprise that they’ve been fielding not one, but two teams in the last three Winter Classic cluster competitions. Their teams, dubbed Matador and Red Read more…

2024 Winter Classic: The Return of Team Fayetteville

April 18, 2024

Hailing from Fayetteville, NC, Fayetteville State University stayed under the radar in their first Winter Classic competition in 2022. Solid students for sure, but not a lot of HPC experience. All good. They didn’t Read more…

Software Specialist Horizon Quantum to Build First-of-a-Kind Hardware Testbed

April 18, 2024

Horizon Quantum Computing, a Singapore-based quantum software start-up, announced today it would build its own testbed of quantum computers, starting with use of Rigetti’s Novera 9-qubit QPU. The approach by a quantum Read more…

2024 Winter Classic: Meet Team Morehouse

April 17, 2024

Morehouse College? The university is well-known for their long list of illustrious graduates, the rigor of their academics, and the quality of the instruction. They were one of the first schools to sign up for the Winter Read more…

MLCommons Launches New AI Safety Benchmark Initiative

April 16, 2024

MLCommons, organizer of the popular MLPerf benchmarking exercises (training and inference), is starting a new effort to benchmark AI Safety, one of the most pressing needs and hurdles to widespread AI adoption. The sudde Read more…

Kathy Yelick on Post-Exascale Challenges

April 18, 2024

With the exascale era underway, the HPC community is already turning its attention to zettascale computing, the next of the 1,000-fold performance leaps that ha Read more…

Software Specialist Horizon Quantum to Build First-of-a-Kind Hardware Testbed

April 18, 2024

Horizon Quantum Computing, a Singapore-based quantum software start-up, announced today it would build its own testbed of quantum computers, starting with use o Read more…

MLCommons Launches New AI Safety Benchmark Initiative

April 16, 2024

MLCommons, organizer of the popular MLPerf benchmarking exercises (training and inference), is starting a new effort to benchmark AI Safety, one of the most pre Read more…

Exciting Updates From Stanford HAI’s Seventh Annual AI Index Report

April 15, 2024

As the AI revolution marches on, it is vital to continually reassess how this technology is reshaping our world. To that end, researchers at Stanford’s Instit Read more…

Intel’s Vision Advantage: Chips Are Available Off-the-Shelf

April 11, 2024

The chip market is facing a crisis: chip development is now concentrated in the hands of the few. A confluence of events this week reminded us how few chips Read more…

The VC View: Quantonation’s Deep Dive into Funding Quantum Start-ups

April 11, 2024

Yesterday Quantonation — which promotes itself as a one-of-a-kind venture capital (VC) company specializing in quantum science and deep physics — announce Read more…

Nvidia’s GTC Is the New Intel IDF

April 9, 2024

After many years, Nvidia's GPU Technology Conference (GTC) was back in person and has become the conference for those who care about semiconductors and AI. I Read more…

Google Announces Homegrown ARM-based CPUs

April 9, 2024

Google sprang a surprise at the ongoing Google Next Cloud conference by introducing its own ARM-based CPU called Axion, which will be offered to customers in it Read more…

Nvidia H100: Are 550,000 GPUs Enough for This Year?

August 17, 2023

The GPU Squeeze continues to place a premium on Nvidia H100 GPUs. In a recent Financial Times article, Nvidia reports that it expects to ship 550,000 of its lat Read more…

Synopsys Eats Ansys: Does HPC Get Indigestion?

February 8, 2024

Recently, it was announced that Synopsys is buying HPC tool developer Ansys. Started in Pittsburgh, Pa., in 1970 as Swanson Analysis Systems, Inc. (SASI) by John Swanson (and eventually renamed), Ansys serves the CAE (Computer Aided Engineering)/multiphysics engineering simulation market. Read more…

Intel’s Server and PC Chip Development Will Blur After 2025

January 15, 2024

Intel's dealing with much more than chip rivals breathing down its neck; it is simultaneously integrating a bevy of new technologies such as chiplets, artificia Read more…

Choosing the Right GPU for LLM Inference and Training

December 11, 2023

Accelerating the training and inference processes of deep learning models is crucial for unleashing their true potential and NVIDIA GPUs have emerged as a game- Read more…

Baidu Exits Quantum, Closely Following Alibaba’s Earlier Move

January 5, 2024

Reuters reported this week that Baidu, China’s giant e-commerce and services provider, is exiting the quantum computing development arena. Reuters reported � Read more…

Comparing NVIDIA A100 and NVIDIA L40S: Which GPU is Ideal for AI and Graphics-Intensive Workloads?

October 30, 2023

With long lead times for the NVIDIA H100 and A100 GPUs, many organizations are looking at the new NVIDIA L40S GPU, which it’s a new GPU optimized for AI and g Read more…

Google Addresses the Mysteries of Its Hypercomputer

December 28, 2023

When Google launched its Hypercomputer earlier this month (December 2023), the first reaction was, "Say what?" It turns out that the Hypercomputer is Google's t Read more…

How AMD May Get Across the CUDA Moat

October 5, 2023

When discussing GenAI, the term "GPU" almost always enters the conversation and the topic often moves toward performance and access. Interestingly, the word "GPU" is assumed to mean "Nvidia" products. (As an aside, the popular Nvidia hardware used in GenAI are not technically... Read more…

Meta’s Zuckerberg Puts Its AI Future in the Hands of 600,000 GPUs

January 25, 2024

In under two minutes, Meta's CEO, Mark Zuckerberg, laid out the company's AI plans, which included a plan to build an artificial intelligence system with the eq Read more…

DoD Takes a Long View of Quantum Computing

December 19, 2023

Given the large sums tied to expensive weapon systems – think $100-million-plus per F-35 fighter – it’s easy to forget the U.S. Department of Defense is a Read more…

China Is All In on a RISC-V Future

January 8, 2024

The state of RISC-V in China was discussed in a recent report released by the Jamestown Foundation, a Washington, D.C.-based think tank. The report, entitled "E Read more…

AMD’s Horsepower-packed MI300X GPU Beats Nvidia’s Upcoming H200

December 7, 2023

AMD and Nvidia are locked in an AI performance battle – much like the gaming GPU performance clash the companies have waged for decades. AMD has claimed it Read more…

Nvidia’s New Blackwell GPU Can Train AI Models with Trillions of Parameters

March 18, 2024

Nvidia's latest and fastest GPU, codenamed Blackwell, is here and will underpin the company's AI plans this year. The chip offers performance improvements from Read more…

Eyes on the Quantum Prize – D-Wave Says its Time is Now

January 30, 2024

Early quantum computing pioneer D-Wave again asserted – that at least for D-Wave – the commercial quantum era has begun. Speaking at its first in-person Ana Read more…

GenAI Having Major Impact on Data Culture, Survey Says

February 21, 2024

While 2023 was the year of GenAI, the adoption rates for GenAI did not match expectations. Most organizations are continuing to invest in GenAI but are yet to Read more…

The GenAI Datacenter Squeeze Is Here

February 1, 2024

The immediate effect of the GenAI GPU Squeeze was to reduce availability, either direct purchase or cloud access, increase cost, and push demand through the roof. A secondary issue has been developing over the last several years. Even though your organization secured several racks... Read more…

Click Here for More Headlines

HPCwire is a registered trademark of Tabor Communications, Inc. Use of this site is governed by our Terms of Use and Privacy Policy.

Reproduction in whole or in part in any form or medium without express written permission of Tabor Communications, Inc. is prohibited.

Leading Solution Providers

Off The Wire

Industry Headlines

April 18, 2024

April 17, 2024

April 16, 2024

Subscribe to HPCwire's Weekly Update!

Kathy Yelick on Post-Exascale Challenges

2024 Winter Classic: Texas Two Step

2024 Winter Classic: The Return of Team Fayetteville

Software Specialist Horizon Quantum to Build First-of-a-Kind Hardware Testbed

2024 Winter Classic: Meet Team Morehouse

MLCommons Launches New AI Safety Benchmark Initiative

Kathy Yelick on Post-Exascale Challenges

Software Specialist Horizon Quantum to Build First-of-a-Kind Hardware Testbed

MLCommons Launches New AI Safety Benchmark Initiative

Exciting Updates From Stanford HAI’s Seventh Annual AI Index Report

Intel’s Vision Advantage: Chips Are Available Off-the-Shelf

The VC View: Quantonation’s Deep Dive into Funding Quantum Start-ups

Nvidia’s GTC Is the New Intel IDF

Google Announces Homegrown ARM-based CPUs

Nvidia H100: Are 550,000 GPUs Enough for This Year?

Synopsys Eats Ansys: Does HPC Get Indigestion?

Intel’s Server and PC Chip Development Will Blur After 2025

Choosing the Right GPU for LLM Inference and Training

Baidu Exits Quantum, Closely Following Alibaba’s Earlier Move

Comparing NVIDIA A100 and NVIDIA L40S: Which GPU is Ideal for AI and Graphics-Intensive Workloads?

Google Addresses the Mysteries of Its Hypercomputer

How AMD May Get Across the CUDA Moat

Leading Solution Providers

Contributors

Tiffany Trader

Editorial Director

Douglas Eadline

Managing Editor

John Russell

Senior Editor

Kevin Jackson

Contributing Editor

Ali Azhar

Contributing Editor

Alex Woodie

Contributing Editor

Addison Snell

Contributing Editor

Drew Jolly

Assistant Editor

Meta’s Zuckerberg Puts Its AI Future in the Hands of 600,000 GPUs

DoD Takes a Long View of Quantum Computing

China Is All In on a RISC-V Future

AMD’s Horsepower-packed MI300X GPU Beats Nvidia’s Upcoming H200

Nvidia’s New Blackwell GPU Can Train AI Models with Trillions of Parameters

Eyes on the Quantum Prize – D-Wave Says its Time is Now

GenAI Having Major Impact on Data Culture, Survey Says

The GenAI Datacenter Squeeze Is Here

The Information Nexus of Advanced Computing and Data systems for a High Performance World

Share

Copy short link