The Impact of Cluster Virtualization on HPC
Is virtualization the next “big thing” to impact high performance computing or is much of the buzz just hype? Of its many definitions, which technologies or virtualization strategies are actually solving real HPC challenges today? The adoption of Linux and Windows clusters, the rise of multi-core processors, and the spread of various virtualization technologies offer a great deal of potential for organizations to increase capabilities and reduce costs.
In part one of a two-part interview, HPC luminary Don Becker, CTO of Penguin Computing and co-inventor of the Beowulf clustering model, and Pauline Nist, senior vice president of product development and management for Penguin Computing, discuss how cluster virtualization can enable large pools of servers to appear and act as a single, unified system. According to them, a cluster virtualization strategy is one of the most practical and cost-effective methodologies for reducing the complexity and overall administrative burden of clustered computing and getting the most out of your server resources.
HPCwire: There is a tremendous amount of buzz surrounding virtualization these days, and the term seems to be applied to many different things. Can you explain a bit about what virtualization is/means and what its benefits are?
Donald Becker: There is a lot of buzz, but it’s not necessarily hype. However, there are multiple definitions of “virtualization,” which can cause confusion. Essentially, the two types boil down to (1) making many machines look like one and (2) making one machine work as many machines. There is a lot of activity and noise right now around virtualization as it applies to virtual machine technology. Because of this, there is a tendency to think of VMs as the only example of virtualization.
Machine virtualization, or virtual machine technology (a la VMware and Xen), is where you divide up the resources of a single server into multiple execution environments, or multiple virtual machines. It’s the exact opposite of making many look like one: this is making one work as many separate instances. The concept has been around for 40+ years in proprietary IBM machines, but much of the recent buzz in the marketplace is around the proliferation of this capability on commodity servers and operating systems.
HPCwire: Why is there so much interest right now in the virtual machine technology?
Becker: The key driving factor is the proliferation of applications in the enterprise and the fact that this technology is now available on the commodity x86 platforms prevalent in enterprise computing today. Many of these applications have strict requirements for the environment they run in, sometimes mutually exclusive with the requirements of other apps, and many are business-critical functions that cannot be compromised.
As a result, apps became siloed across racks of servers, each getting about 15-20 percent utilization. This is really inefficient and very costly to scale. What’s appealing about virtual machine technology is the consolidation of many applications onto a single server, such that each app thinks it has the machine to itself. Each virtual machine is effectively a logical instance of the complete machine environment with its own CPU, memory and I/O. The net effect is fewer physical servers with much higher per-server utilization. This reduces capital costs on hardware and operational costs related to power, real estate, etc. It does not, however, take away the fact that you still have the same, or perhaps even larger, pools of logical servers that must be provisioned and managed.
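The consolidation arithmetic described above can be made concrete with a rough sketch. The Python below assumes identically sized servers and a hypothetical target utilization of 75 percent per consolidated host. Neither assumption comes from the interview, and the function name is illustrative only.

```python
def hosts_after_consolidation(num_servers: int, avg_util_pct: int, target_util_pct: int) -> int:
    """Estimate how many equally sized hosts can carry the same total load.

    Utilization values are whole percentages (for example 15 for 15 percent).
    This ignores hypervisor overhead, memory limits and I/O contention.
    """
    if num_servers < 1:
        raise ValueError("num_servers must be at least 1")
    if not 0 <= avg_util_pct <= 100:
        raise ValueError("avg_util_pct must be between 0 and 100")
    if not 0 < target_util_pct <= 100:
        raise ValueError("target_util_pct must be between 1 and 100")

    total_load_pct = num_servers * avg_util_pct
    # Integer ceiling division: round up to a whole number of hosts.
    hosts = (total_load_pct + target_util_pct - 1) // target_util_pct
    return max(1, hosts)


if __name__ == "__main__":
    # 100 siloed servers at the 15 and 20 percent figures quoted in the interview,
    # consolidated onto hosts run at an assumed 75 percent.
    for util in (15, 20):
        print(util, hosts_after_consolidation(100, util, 75))
```

Under these assumptions, 100 siloed servers shrink to roughly 20 to 27 physical hosts, while the number of logical servers to provision and manage stays at 100.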
Pauline Nist: The major enterprise server vendors, as well as Intel and AMD, are rolling out ongoing mechanisms at the processor level to better enable virtual machine technology and position themselves as optimal platforms for this new wave of innovations. Vendors like VMware have built a tremendous business in just a few short years bringing a VM solution to both Linux and Windows platforms, first with a capability that sits on top of the standard OS, and then with a variant that sits right on top of the hardware and launches OS environments. This is where it gets very interesting.
And now you have Microsoft asserting its intentions to offer core virtual machine capability in Windows, and the major enterprise Linux distros, Red Hat and Novell, working with the open source company XenSource, maintainers of the Linux VM technology Xen.
These OS platform providers are all making announcements and clearly jockeying for position to stake their claim as the “socket” on the hardware into which software plugs. Meanwhile, VMware maintains such a strong presence on both Windows and Linux that the OS vendors will, of course, be working closely with them as this all progresses. It will be interesting to see how this all shakes out.
HPCwire: So we’ve heard about one definition of virtualization: “Making one machine work as many machines.” Tell us more about the other definition: “Making many look like one.”
Nist: Historically, the term virtualization has been applied more broadly to the abstraction of physical devices from their clients to provide better availability and mobility while reducing the complexity of dealing with larger and larger pools of resources.
For example, virtualization is a well-known concept in networking, from virtual channels in ATM, to virtual private networks, virtual LANs and virtual IP addresses.
Storage virtualization is the pooling of physical storage from multiple network storage devices into what appears to be a single storage device that is managed from a central console — for example, a storage area network or SAN.
Likewise, clustered and Grid computing enable the “virtualization” of increasingly large pools of servers, along with their interconnects and storage needs, to reduce the complexity of managing and coordinating these resources for the purpose of distributed computing.
Becker: Organizations should be taking a closer look at their server utilization and determining if a cluster virtualization strategy would work for them. The financial and efficiency benefits are extremely compelling — and today’s tools and solutions offer very easy integration, making cluster virtualization one of the most practical and cost-effective methodologies for reducing the complexity and overall administrative burden of clustered computing.
This is where Penguin Computing plays. Our virtualized cluster solutions are driven by Scyld ClusterWare, our unique Linux software, which makes large pools of Linux servers appear and act like a single, consistent, virtual system. Through a single point of command and control, or “Master,” thousands of systems can be managed as one.
This dramatically simplifies deployment and management and significantly improves datacenter resource utilization as well as server performance and scalability. Our aim is to make Linux clustering as powerful and easy to use as expensive symmetric multiprocessing (SMP) environments, at a Linux price point.
By combining the economics of Open Source with the simplicity and manageability of cluster virtualization, we’re able to provide an extremely compelling financial benefit to our customers by removing many of the ongoing costs associated with the complex management and administration of competing offerings.
HPCwire: Where and how did this concept of cluster virtualization start? What was the driving force and how has it impacted HPC?
Becker: Cluster computing has become the hottest hardware trend in HPC, and is poised to ride the wave of “mainstream adoption.” Scientific and academic organizations have broadly embraced clustered system architectures as a cost-effective alternative to monolithic UNIX SMP systems for computationally intensive simulations. Commercial businesses have more recently adopted Linux clusters for edge-of-the-network applications such as Web hosting.
The tipping point that enabled the initial move to Linux and clustered solutions in HPC was driven predominantly by the compelling order-of-magnitude cost savings offered by commodity hardware and/or open source software. As the market has matured and more complete solutions have become available, the value proposition for Linux clusters has expanded as well to include significant TCO savings spanning hardware, software, and support costs; flexibility in configuration and upgrade options; freedom from the constraints of single-vendor development schedules and support; greater flexibility through open source software customization; and rapid performance advancements in industry-standard processor, storage, and networking technologies.
The catalyst behind the evolution of cluster virtualization was the ongoing complexity of administration and management of traditional, roll-your-own Linux clusters. This complexity added significant dollars to the overall cost of a Linux cluster. In the midst of the rapid adoption of Linux clusters, IDC conducted a research study in 2003 asking users to identify the greatest challenges they faced in implementing clusters. The most frequently cited response was system management, which was selected by 48 percent of respondents. Breaking the issues of manageability down further, we see a number of challenges that cluster virtualization was designed to address.
Ad hoc, do-it-yourself clusters are cumbersome to use and generally adapted only to batch processing. Usage habits change. Clusters in HPC have been effectively adapted for batch computing environments, but until recently have not been well suited for interactive use.
This is why batch schedulers have been so prevalent. The issue for users is that a traditional 100-node cluster presents them with 100 separate servers. Managing jobs and their related processes, data, etc. means manually interacting with all of those machines. Doing so is not only tedious and time-consuming, it is also error-prone.
Complexity of administration and management can be both time-intensive and costly. This complexity is even worse for the admin who must set up and maintain the cluster. The admin is presented with 100 separate machines that must be loaded with the OS and cluster management software, which can take 5 to 30 minutes per machine if everything goes well. Then all of the servers must be configured for access, users, security, and coordination as a cluster. Extensive scripting is usually required, and these scripts must be maintained by different people over time, adding overhead and cost.
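To put the installation figures in perspective, here is a minimal back-of-the-envelope sketch in Python. It uses the 100-node cluster and the 5 to 30 minutes per machine quoted above. The idea of installing several machines at once, and the `parallel_installs` parameter, are assumptions added for illustration rather than anything stated in the interview.

```python
import math


def sequential_install_hours(num_nodes: int, minutes_per_node: float) -> float:
    """Total hours when machines are installed one after another."""
    return num_nodes * minutes_per_node / 60


def batched_install_hours(num_nodes: int, minutes_per_node: float, parallel_installs: int) -> float:
    """Total hours when a fixed number of machines install at the same time.

    parallel_installs is a hypothetical knob; real limits depend on staff,
    network bandwidth and the install server.
    """
    if parallel_installs < 1:
        raise ValueError("parallel_installs must be at least 1")
    batches = math.ceil(num_nodes / parallel_installs)
    return batches * minutes_per_node / 60


if __name__ == "__main__":
    for minutes in (5, 30):
        print(minutes, round(sequential_install_hours(100, minutes), 1))
```

Installed one at a time, 100 machines take somewhere between about 8 and 50 hours before any configuration or scripting begins, and that is the optimistic case where every install succeeds.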
What’s more, many DIY solutions are simply open source projects lacking a bit in terms of formal support and documentation. It’s true that open source has zero upfront costs compared to professionally developed cluster solutions. However, there are many associated costs — in terms of time and resources — that open source carries over the lifetime of the cluster.
It is far more cost effective to use a commercially developed and fully supported solution that can simplify the management of large pools of servers.
System stability is highly dependent on software consistency across compute nodes. Even a simple inconsistency can cripple a parallel program. An effective management software solution is needed to handle the underlying OS distribution, communication libraries such as MPI, as well as revision control for third-party software.
When software provisioning involves installing full Linux distributions and supporting software on each cluster node’s disk, the chances of inconsistency are inherently much higher. With HPC jobs that sometimes take weeks or months to complete, the risk of wasted work is very costly. It is far better to use stateless provisioning of software, direct to memory, as it is safer and orders of magnitude faster to reprovision on demand.
Security threats can be introduced with each server deployed. A full install of Linux on each cluster node introduces the software components that hackers typically target. Full user shells and network services expose multiple attack points and would not be required if the cluster were architected properly.
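The kind of inconsistency described here can be illustrated with a small sketch. The Python below compares per-node software manifests and reports any package whose version differs, or is missing, on some nodes. The node names, package names and version strings are made up for the example, and this is not a description of how Scyld ClusterWare or any particular tool performs consistency control.

```python
from collections import defaultdict
from typing import Dict, List


def find_version_skew(manifests: Dict[str, Dict[str, str]]) -> Dict[str, Dict[str, List[str]]]:
    """Return {package: {version: [nodes]}} for packages that are not uniform.

    manifests maps a node name to its {package: version} listing.
    A package absent from a node is reported under the version "<missing>".
    """
    all_packages = {pkg for manifest in manifests.values() for pkg in manifest}
    nodes_by_version: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))

    for node, manifest in manifests.items():
        for pkg in all_packages:
            version = manifest.get(pkg, "<missing>")
            nodes_by_version[pkg][version].append(node)

    # Keep only packages that appear with more than one version across nodes.
    return {
        pkg: {version: sorted(nodes) for version, nodes in by_version.items()}
        for pkg, by_version in nodes_by_version.items()
        if len(by_version) > 1
    }


if __name__ == "__main__":
    # Hypothetical three-node cluster where one node carries an older MPI library.
    example = {
        "n0": {"mpi": "1.2.7", "kernel": "2.6.9"},
        "n1": {"mpi": "1.2.7", "kernel": "2.6.9"},
        "n2": {"mpi": "1.2.6", "kernel": "2.6.9"},
    }
    print(find_version_skew(example))
```

With disk-based installs, a check like this has to be run and acted on for every node. Provisioning all nodes from a single image sidesteps the problem, because there is only one copy of the software to keep consistent.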
HPCwire: Does cluster virtualization address these challenges?
Nist: Cluster virtualization enables large pools of servers to act and appear like a single, consistent virtual machine. Practically, it makes a Linux cluster so easy to use that you don’t have to be a sys admin to run one.
At Penguin, our clusters are application-ready, so our customers can be up and running fast. They are driven by Scyld ClusterWare, which is a productive, simple and hardware-agnostic HPC system that enables administrators to install, monitor and manage the cluster as a single system, from a single node: the Master.
It has a powerful unified process space so end users can easily and intuitively deploy, manage and run complex applications from the Master, as if it were a single virtual machine. What this means is that the users and the admin are presented with what appears to be a single server that can scale up or down with virtually any number of processors behind it and still be as simple to use and manage as a single server. The compute nodes exist only to run applications specified by the Master node.
Because the compute nodes run an initial lightweight, in-memory distribution, they can be provisioned rapidly so users can flexibly add or delete nodes on demand, in seconds, making the cluster extraordinarily scalable and resilient. The compute nodes can at any time, in this architecture, provision precisely the additional dependencies required for applications. The lightweight compute environment is stripped of any unnecessary system services and associated vulnerabilities, making the cluster nearly impossible to attack, thus inherently more secure.
Finally, with single point provisioning and consistency control mechanisms, there are no version skews, so the system is intrinsically more reliable. Compared to legacy technologies, our virtualized clusters offer a more efficient, secure way to manage servers that delivers the productivity, reliability, scalability and lower total cost of ownership that high-performance business computing demands.
Read Part Two of the interview, “The Impact of Cluster Virtualization on HPC.”
Donald Becker is the CTO of Penguin Computing and co-inventor of Beowulf clusters. Donald is an internationally recognized operating system developer and the original inventor of Beowulf clustering. In 1999 he founded Scyld Computing and led the development of the next-generation Beowulf cluster operating system. Prior to founding Scyld, Donald started the Beowulf Parallel Workstation project at NASA Goddard Space Flight Center. He is the co-author of How to Build a Beowulf: A Guide to the Implementation and Application of PC Clusters. With colleagues from the California Institute of Technology and the Los Alamos National Laboratory, he was the recipient of the IEEE Computer Society 1997 Gordon Bell Prize for Price/Performance.
Pauline Nist is the SVP of Product Development and Management at Penguin Computing. Before joining Penguin Computing, Pauline served as vice president of Quality for HP’s Enterprise Storage and Servers Division and immediately prior to that, as vice president and general manager for HP’s NonStop Enterprise Division, where she was responsible for the development, delivery, and marketing of the NonStop family of servers, database, and middleware software. Prior to the NonStop Enterprise Division (formerly known as Tandem Computers), Pauline served as vice president of the Alpha Servers business unit at Digital Equipment Corporation.