The Impact of Cluster Virtualization on HPC

By Nicole Hemsoth

December 1, 2006

Is virtualization the next “big thing” to impact high performance computing or is much of the buzz just hype? Of its many definitions, which technologies or virtualization strategies are actually solving real HPC challenges today? The adoption of Linux and Windows clusters, the rise of multi-core processors, and the spread of various virtualization technologies offer a great deal of potential for organizations to increase capabilities and reduce costs.

In part one of a two-part interview, HPC luminary Don Becker, CTO of Penguin Computing and co-inventor of the Beowulf clustering model, and Pauline Nist, senior vice president of product development and management for Penguin Computing, discuss how cluster virtualization can enable large pools of servers to appear and act as a single, unified system. According to them, a cluster virtualization strategy is one of the most practical and cost-effective methodologies for reducing the complexity and overall administrative burden of clustered computing and getting the most out of your server resources.

HPCwire: There is a tremendous amount of buzz surrounding virtualization these days, and the term seems to be applied to many different things. Can you explain a bit about what virtualization is/means and what its benefits are?

Donald Becker: There is a lot of buzz, but it’s not necessarily hype. However, there are multiple definitions of “virtualization,” which can cause confusion. Essentially, the two types boil down to (1) making many machines look like one and (2) making one machine work as many. There is a lot of activity and noise right now around virtualization as it applies to virtual machine technology, so there is a tendency to think of VMs as the only example of virtualization.

Machine virtualization, or virtual machine technology (à la VMware and Xen), is where you divide up the resources of a single server into multiple execution environments, or virtual machines. It’s the exact opposite of making many look like one: it makes one machine work as many separate instances. The concept has been around for more than 40 years in proprietary IBM machines, but much of the recent buzz is around the proliferation of this capability on commodity servers and operating systems.

HPCwire: Why is there so much interest right now in the virtual machine technology?

Becker: The key driving factors are the proliferation of applications in the enterprise and the fact that this technology is now available on the commodity x86 platforms that are prevalent in enterprise computing today. Many of these applications have strict requirements for the environment they run in, sometimes requirements that are mutually exclusive with those of other apps, and many are business-critical functions that cannot be compromised.

As a result, apps became siloed across racks of servers, each getting about 15-20 percent utilization. This is really inefficient and very costly to scale. What’s appealing about virtual machine technology is the consolidation of many applications onto a single server, such that each app thinks it has the machine to itself. Each virtual machine is effectively a logical instance of the complete machine environment with its own CPU, memory and I/O. The net effect is fewer physical servers with much higher per-server utilization. This reduces capital costs on hardware and operational costs related to power, real estate, etc. It does not, however, change the fact that you still have the same, or perhaps even larger, pools of logical servers that must be provisioned and managed.
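As a rough illustration of the consolidation arithmetic described above, the following sketch estimates how many physical hosts 100 siloed applications would need once they share machines. Only the 15-20 percent utilization figure comes from the interview; the count of 100 applications, the 70 percent target utilization, and the function name `hosts_needed` are assumptions chosen for illustration, and the model naively assumes that load adds linearly with no hypervisor overhead.

```python
def hosts_needed(app_count, per_app_pct, target_pct):
    """Estimate physical hosts needed after consolidating apps into VMs.

    per_app_pct: percent of one server each app uses today (integer).
    target_pct:  percent utilization we are willing to run each host at.
    Assumes load adds linearly and ignores hypervisor overhead.
    """
    total_load = app_count * per_app_pct
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-total_load // target_pct)


# 100 apps, one per server today; 70% target is a hypothetical planning figure.
for pct in (15, 20):
    print(f"{pct}% per app: {hosts_needed(100, pct, 70)} hosts instead of 100")
```

Under these assumptions the physical server count drops substantially, but the number of logical servers to provision and manage is still 100, which is the point made above.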

Pauline Nist: The major enterprise server vendors, Intel and AMD are rolling out ongoing mechanisms at the processor level to better enable virtual machine technology and position themselves as optimal platforms for this new wave of innovations. Solutions like VMware have built a tremendous business in just a few short years by bringing a VM solution to both Linux and Windows platforms, first with a capability that sits on top of the standard OS, and then with a variant that sits right on top of the hardware and launches OS environments. This is where it gets very interesting.

And now you have Microsoft asserting its intention to offer core virtual machine capability in Windows, and the major enterprise Linux distributors, Red Hat and Novell, working with the open source company XenSource, maintainer of the Linux VM technology Xen.

These OS platform providers are all making announcements and clearly jockeying for position to stake their claim as the “socket” on the hardware into which software plugs. Meanwhile, VMware maintains such a strong presence on both Windows and Linux that the OS vendors will of course be working closely with them as this all progresses. It will be interesting to see how this all shakes out.

HPCwire: So we’ve heard about one definition of virtualization: “Making one machine work as many machines.” Tell us more about the other definition: “Making many look like one.”

Nist: Historically, the term virtualization has been applied more broadly to the abstraction of physical devices from their clients, to provide better availability and mobility while reducing the complexity of dealing with larger and larger pools of resources.

For example, virtualization is a well-known concept in networking, from virtual channels in ATM, to virtual private networks, virtual LANs and virtual IP addresses.

Storage virtualization is the pooling of physical storage from multiple network storage devices into what appears to be a single storage device that is managed from a central console — for example, a storage area network or SAN.

Likewise, clustered and grid computing enable the “virtualization” of increasingly large pools of servers, along with their interconnects and storage needs, to reduce the complexity of managing and coordinating these resources for the purpose of distributed computing.

Becker: Organizations should be taking a closer look at their server utilization and determining if a cluster virtualization strategy would work for them. The financial and efficiency benefits are extremely compelling — and today’s tools and solutions offer very easy integration, making cluster virtualization one of the most practical and cost-effective methodologies for reducing the complexity and overall administrative burden of clustered computing.

This is where Penguin Computing plays. Our virtualized cluster solutions are driven by Scyld ClusterWare, our unique Linux software, which makes large pools of Linux servers appear and act like a single, consistent, virtual system. Through a single point of command and control, or “Master,” thousands of systems can be managed as if they were one.

This dramatically simplifies deployment and management and significantly improves datacenter resource utilization as well as server performance and scalability. Our aim is to make Linux clustering as powerful and easy to use as expensive symmetric multiprocessing (SMP) environments, at a Linux price point.

By combining the economics of open source with the simplicity and manageability of cluster virtualization, we’re able to provide an extremely compelling financial benefit to our customers by removing many of the ongoing costs associated with the complex management and administration of competitive offerings.

HPCwire: Where and how did this concept of cluster virtualization start? What was the driving force and how has it impacted HPC?

Becker: Cluster computing has become the hottest hardware trend in HPC, and is poised to ride the wave of “mainstream adoption.” Scientific and academic organizations have broadly embraced clustered system architectures as a cost-effective alternative to monolithic UNIX SMP systems for computationally intensive simulations. Commercial businesses have more recently adopted Linux clusters for edge-of-the-network applications such as Web hosting.

The initial move to Linux and clustered solutions in HPC was driven predominantly by the compelling order-of-magnitude cost savings offered by commodity hardware and open source software. As the market has matured and more complete solutions have become available, the value proposition for Linux clusters has expanded to include significant TCO savings spanning hardware, software, and support costs; flexibility in configuration and upgrade options; freedom from the constraints of single-vendor development schedules and support; greater flexibility through open source software customization; and rapid performance advancements in industry-standard processor, storage, and networking technologies.

The catalyst behind the evolution of cluster virtualization was the ongoing complexity of administration and management of traditional, roll-your-own Linux clusters. This complexity added significant dollars to the overall cost of a Linux cluster. In the midst of the rapid adoption of Linux clusters, IDC conducted a research study in 2003 asking users to identify the greatest challenges they faced in implementing clusters. The most frequently cited response was system management, which was selected by 48 percent of respondents. Breaking the issues of manageability down further, we see a number of challenges that cluster virtualization was designed to address.

Ad hoc, do-it-yourself clusters are cumbersome to use and generally adapted only to batch processing. Usage habits change. Clusters in HPC have been effectively adapted for batch computing environments, but until recently have not been well suited for interactive use.

This is why batch schedulers have been so prevalent: in a traditional cluster, users are presented with 100 separate servers for a 100-node cluster. Managing jobs and their related processes, data, etc. means manually interacting with all of those machines. Doing so is not only tedious and time-consuming, it is also error-prone.

Complexity of administration and management can be both time-intensive and costly. This complexity is even worse for the admin who must set up and maintain the cluster. The admin is presented with 100 separate machines that must be loaded with the OS and cluster management software, which can take 5 to 30 minutes for each machine if everything goes well. Then all of the servers must be configured for access, users, and security, and coordinated as a cluster. Extensive scripting is usually required, and these scripts must be maintained by different people over time, adding overhead and cost.
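To put the installation figures above in perspective, here is a minimal sketch of the arithmetic for a 100-node cluster at 5 to 30 minutes per machine. It assumes the machines are installed strictly one after another, which the interview does not state; parallel installs would shorten the wall-clock time. The function name `serial_install_hours` is hypothetical.

```python
def serial_install_hours(node_count, minutes_per_node):
    """Wall-clock hours to install every node, one after another."""
    return node_count * minutes_per_node / 60


# Figures from the text: 100 machines, 5 to 30 minutes each.
best_case = serial_install_hours(100, 5)
worst_case = serial_install_hours(100, 30)
print(f"Serial install time: {best_case:.1f} to {worst_case:.1f} hours")
```

This covers only the initial load of the OS and cluster software. The configuration and scripting work described above comes on top of it.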

What’s more, many DIY solutions are simply open source projects lacking a bit in terms of formal support and documentation. It’s true that open source has zero upfront costs compared to professionally developed cluster solutions. However, there are many associated costs — in terms of time and resources — that open source carries over the lifetime of the cluster.

It is far more cost effective to use a commercially developed and fully supported solution that can simplify the management of large pools of servers.

System stability is highly dependent on software consistency across compute nodes. Even a simple inconsistency can cripple a parallel program. An effective management software solution is needed to handle the underlying OS distribution, communication libraries such as MPI, as well as revision control for third-party software.

When software provisioning involves installing full Linux distributions and supporting software to each cluster node’s disk, the chances of inconsistency are inherently much higher. With HPC jobs that sometimes take weeks or months to complete, the risk of wasted work is very costly. It is far better to use stateless provisioning of software, direct to memory, as it is safer and orders of magnitude faster to reprovision on demand.
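The kind of inconsistency described above can be illustrated with a small sketch that compares per-node software manifests and reports any package whose version differs between nodes. This is not how Scyld ClusterWare or any particular tool works; the function name, the node names, the package names, and the version strings are all hypothetical, and the sketch assumes the manifests have already been collected into a dictionary.

```python
from collections import defaultdict


def find_version_skew(manifests):
    """Report packages whose version differs, or is missing, across nodes.

    manifests: {node name: {package name: version string}}
    Returns:   {package name: {version string: [node names]}}
               containing only packages with more than one distinct version.
    """
    all_packages = set()
    for packages in manifests.values():
        all_packages.update(packages)

    by_package = defaultdict(lambda: defaultdict(list))
    for node, packages in sorted(manifests.items()):
        for package in all_packages:
            version = packages.get(package, "<missing>")
            by_package[package][version].append(node)

    return {
        package: dict(versions)
        for package, versions in by_package.items()
        if len(versions) > 1
    }


# Hypothetical manifests: node003 carries an older MPI library.
manifests = {
    "node001": {"mpi": "1.2.7", "kernel": "2.6.18"},
    "node002": {"mpi": "1.2.7", "kernel": "2.6.18"},
    "node003": {"mpi": "1.2.6", "kernel": "2.6.18"},
}
skew = find_version_skew(manifests)
print(skew)
```

With disk-based installs, a check like this has to be run and acted on for every node. Provisioning every node from a single image removes the opportunity for the manifests to diverge in the first place.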

Security threats can be introduced with each server deployed. A full install of Linux on each cluster node introduces the software components that hackers typically target. Full user shells and network services expose multiple attack points and should not be required if the cluster is architected properly.

HPCwire: Does cluster virtualization address these challenges?

Nist: Cluster virtualization enables large pools of servers to act and appear like a single, consistent virtual machine. Practically, it makes a Linux cluster so easy to use that you don’t have to be a sys admin to run one.

At Penguin, our clusters are application-ready, so our customers can be up and running fast. They are driven by Scyld ClusterWare, a productive, simple and hardware-agnostic HPC system that enables administrators to install, monitor and manage the cluster as a single system from a single node, the Master.

It has a powerful unified process space so end users can easily and intuitively deploy, manage and run complex applications from the Master, as if it were a single virtual machine. What this means is that the users and the admin are presented with what appears to be a single server that can scale up or down with virtually any number of processors behind it and still be as simple to use and manage as a single server. The compute nodes exist only to run applications specified by the Master node.

Because the compute nodes run an initial lightweight, in-memory distribution, they can be provisioned rapidly, so users can flexibly add or remove nodes on demand, in seconds, making the cluster extraordinarily scalable and resilient. In this architecture, the compute nodes can at any time provision precisely the additional dependencies that applications require. The lightweight compute environment is stripped of any unnecessary system services and their associated vulnerabilities, making the cluster nearly impossible to attack and thus inherently more secure.

Finally, with single point provisioning and consistency control mechanisms, there are no version skews, so the system is intrinsically more reliable. Compared to legacy technologies, our virtualized clusters offer a more efficient, secure way to manage servers that delivers the productivity, reliability, scalability and lower total cost of ownership that high-performance business computing demands.

Read Part Two of “The Impact of Cluster Virtualization on HPC” interview.

-----

Donald Becker is the CTO of Penguin Computing and co-inventor of Beowulf clusters. Donald is an internationally recognized operating system developer and the original inventor of Beowulf clustering. In 1999 he founded Scyld Computing and led the development of the next-generation Beowulf cluster operating system. Prior to founding Scyld, Donald started the Beowulf Parallel Workstation project at NASA Goddard Space Flight Center. He is the co-author of How to Build a Beowulf: A Guide to the Implementation and Application of PC Clusters. With colleagues from the California Institute of Technology and the Los Alamos National Laboratory, he was the recipient of the IEEE Computer Society 1997 Gordon Bell Prize for Price/Performance.

Pauline Nist is the SVP of Product Development and Management at Penguin Computing. Before joining Penguin Computing, Pauline served as vice president of Quality for HP’s Enterprise Storage and Servers Division and immediately prior to that, as vice president and general manager for HP’s NonStop Enterprise Division, where she was responsible for the development, delivery, and marketing of the NonStop family of servers, database, and middleware software. Prior to the NonStop Enterprise Division (formerly known as Tandem Computers), Pauline served as vice president of the Alpha Servers business unit at Digital Equipment Corporation.
