The Impact of Cluster Virtualization on HPC

By Nicole Hemsoth

December 1, 2006

Is virtualization the next “big thing” to impact high performance computing or is much of the buzz just hype? Of its many definitions, which technologies or virtualization strategies are actually solving real HPC challenges today? The adoption of Linux and Windows clusters, the rise of multi-core processors, and the spread of various virtualization technologies offer a great deal of potential for organizations to increase capabilities and reduce costs.

In part one of a two-part interview, HPC luminary Don Becker, CTO of Penguin Computing and co-inventor of the Beowulf clustering model, and Pauline Nist, senior vice president of product development and management for Penguin Computing, discuss how cluster virtualization can enable large pools of servers to appear and act as a single, unified system. According to them, a cluster virtualization strategy is one of the most practical and cost-effective methodologies for reducing the complexity and overall administrative burden of clustered computing and getting the most out of your server resources.

HPCwire: There is a tremendous amount of buzz surrounding virtualization these days, and the term seems to be applied to many different things. Can you explain a bit about what virtualization means and what its benefits are?

Donald Becker: There is a lot of buzz, but it’s not necessarily hype. However, there are multiple definitions of “virtualization,” which can cause confusion. Essentially, the two types boil down to (1) making many machines look like one and (2) making one machine work as many machines. There is a lot of activity and noise right now around virtualization as it applies to virtual machine technology. Because of this, there is a tendency to think of VMs as the only example of virtualization.

Machine virtualization, or virtual machine technology (a la VMware and Xen), is where you divide up the resources of a single server into multiple execution environments, or multiple virtual machines. It’s the exact opposite of making many look like one: this is making one work as many separate instances. This concept has been around for 40+ years in proprietary IBM machines, but much of the recent buzz in the marketplace is around the proliferation of this capability on commodity servers and operating systems.

HPCwire: Why is there so much interest right now in the virtual machine technology?

Becker: The key driving factor is the proliferation of applications in the enterprise and the fact that this technology is now available on the commodity x86 platforms that are prevalent in enterprise computing today. Many of these applications have strict requirements for the environment they run in, sometimes mutually exclusive with the requirements of other apps, and many are business-critical functions that cannot be compromised.

As a result, apps became siloed across racks of servers, each getting about 15-20 percent utilization. This is really inefficient and very costly to scale. What’s appealing about virtual machine technology is the consolidation of many applications onto a single server, such that each app thinks it has the machine to itself. Each virtual machine is effectively a logical instance of the complete machine environment with its own CPU, memory and I/O. The net effect is fewer physical servers with much higher per-server utilization. This reduces capital costs on hardware and operational costs related to power, real estate, etc. It does not, however, take away the fact that you still have the same, or perhaps even larger, pools of logical servers that must be provisioned and managed.
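To make the consolidation arithmetic concrete, here is a minimal back-of-the-envelope sketch. It is an illustration only: the function name, the 100-server fleet and the 70 percent target utilization are assumptions, not figures from the interview, and the estimate ignores hypervisor overhead as well as memory and I/O limits.

```python
def hosts_after_consolidation(num_servers, avg_util_pct, target_util_pct):
    """Estimate how many physical hosts are needed after consolidation.

    All three parameters are hypothetical inputs chosen for illustration.
    Utilization is given in whole percent so the arithmetic stays exact.
    """
    if num_servers < 0:
        raise ValueError("num_servers must be non-negative")
    if not (0 < avg_util_pct <= 100 and 0 < target_util_pct <= 100):
        raise ValueError("utilization must be in the range 1..100 percent")
    # Total work, measured in "percent of one server" units.
    total_load = num_servers * avg_util_pct
    # Ceiling division: a partially filled host still counts as a host.
    return -(-total_load // target_util_pct)


if __name__ == "__main__":
    for util in (15, 20):
        hosts = hosts_after_consolidation(100, util, 70)
        print(f"{util}% average utilization: 100 servers -> {hosts} hosts "
              f"(still 100 logical servers to manage)")
```

Under these assumptions the physical server count drops to roughly a quarter, while the number of logical servers that must be provisioned and managed stays at 100, which is the point made above.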

Pauline Nist: The major enterprise server vendors, Intel, and AMD are continuing to roll out mechanisms at the processor level to better enable virtual machine technology and to position themselves as optimal platforms for this new wave of innovations. Solutions like VMware have built a tremendous business in just a few short years by bringing a VM solution to both Linux and Windows platforms, first with a capability that sits on top of the standard OS, and then with a variant that sits directly on the hardware and launches OS environments. This is where it gets very interesting.

And now you have Microsoft asserting its intentions to offer core virtual machine capability in Windows. The major enterprise Linux distros, Red Hat and Novell, are working with the open source company XenSource, maintainer of the Linux VM technology Xen.

These OS platform providers are all making announcements and clearly jockeying to stake their claim as the “socket” on the hardware into which software plugs. Meanwhile, VMware maintains such a strong presence on both Windows and Linux that the OS vendors will of course be working closely with them as this progresses. It will be interesting to see how it all shakes out.

HPCwire: So we’ve heard about one definition of virtualization: “Making one machine work as many machines.” Tell us more about the other definition: “Making many look like one.”

Nist: Historically, the term virtualization has been applied more broadly to the abstraction of physical devices from their clients, which provides better availability and mobility while reducing the complexity of dealing with larger and larger pools of resources.

For example, virtualization is a well-known concept in networking, from virtual channels in ATM, to virtual private networks, virtual LANs and virtual IP addresses.

Storage virtualization is the pooling of physical storage from multiple network storage devices into what appears to be a single storage device that is managed from a central console — for example, a storage area network or SAN.

Likewise, clustered and Grid computing enables the “virtualization” of increasingly large pools of servers, along with their interconnects and storage needs, to reduce the complexity of managing and coordinating these resources for the purpose of distributed computing.

Becker: Organizations should be taking a closer look at their server utilization and determining if a cluster virtualization strategy would work for them. The financial and efficiency benefits are extremely compelling — and today’s tools and solutions offer very easy integration, making cluster virtualization one of the most practical and cost-effective methodologies for reducing the complexity and overall administrative burden of clustered computing.

This is where Penguin Computing plays. Our virtualized cluster solutions are driven by Scyld ClusterWare, our unique Linux software, which makes large pools of Linux servers appear and act like a single, consistent, virtual system. Through a single point of command and control, or “Master,” thousands of systems can be managed as one.

This dramatically simplifies deployment and management and significantly improves datacenter resource utilization as well as server performance and scalability. Our aim is to make Linux clustering as powerful and easy to use as expensive symmetric multiprocessing (SMP) environments, at a Linux price point.

By combining the economics of open source with the simplicity and manageability of cluster virtualization, we’re able to provide an extremely compelling financial benefit to our customers by removing many of the ongoing costs associated with the complex management and administration of competing offerings.

HPCwire: Where and how did this concept of cluster virtualization start? What was the driving force and how has it impacted HPC?

Becker: Cluster computing has become the hottest hardware trend in HPC, and is poised to ride the wave of “mainstream adoption.” Scientific and academic organizations have broadly embraced clustered system architectures as a cost-effective alternative to monolithic UNIX SMP systems for computationally intensive simulations. Commercial businesses have more recently adopted Linux clusters for edge-of-the-network applications such as Web hosting.

The tipping point that enabled the initial move to Linux and clustered solutions in HPC was the compelling order-of-magnitude cost savings offered by commodity hardware and/or open source software. As the market has matured and more complete solutions have become available, the value proposition for Linux clusters has expanded to include significant TCO savings spanning hardware, software, and support costs; flexibility in configuration and upgrade options; freedom from the constraints of single-vendor development schedules and support; greater flexibility through open source software customization; and rapid performance advancements in industry-standard processor, storage, and networking technologies.

The catalyst behind the evolution of cluster virtualization was the ongoing complexity of administration and management of traditional, roll-your-own Linux clusters. This complexity added significant dollars to the overall cost of a Linux cluster. In the midst of the rapid adoption of Linux clusters, IDC conducted a research study in 2003 asking users to identify the greatest challenges they faced in implementing clusters. The most frequently cited response was system management, which was selected by 48 percent of respondents. Breaking the issues of manageability down further, we see a number of challenges that cluster virtualization was designed to address.

Ad hoc, do-it-yourself clusters are cumbersome to use and generally adapted only to batch processing. Usage habits change. Clusters in HPC have been effectively adapted for batch computing environments, but until recently have not been well suited for interactive use.

This is why batch schedulers have been so prevalent: in a traditional cluster, users are presented with 100 separate servers for a 100-node cluster. Managing jobs and their related processes, data, etc. means manually interacting with all of those machines. Doing so is not only tedious and time-consuming but also error-prone.

Complexity of administration and management can be both time-intensive and costly. This complexity is even worse for the admin who must set up and maintain the cluster. The admin is presented with 100 separate machines that must be loaded with the OS and cluster management software, which can take 5 to 30 minutes per machine if everything goes well. Then all of the servers must be configured for access, users, security and coordination as a cluster. Extensive scripting is usually required, and these scripts must be maintained by different people over time, adding overhead and cost.
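The install figures above add up quickly. The following sketch works through the arithmetic for the 100-machine example; the function name and the `parallel_installs` parameter are hypothetical, and the default assumes machines are loaded one after another, which the interview does not state.

```python
import math


def install_time_minutes(num_nodes, minutes_per_node, parallel_installs=1):
    """Total wall-clock minutes to load the OS on every node.

    Assumes nodes are installed in equal-sized batches of
    `parallel_installs` (1 = strictly one at a time) and that every
    install succeeds on the first attempt.
    """
    if num_nodes < 0 or minutes_per_node < 0 or parallel_installs < 1:
        raise ValueError("invalid input")
    batches = math.ceil(num_nodes / parallel_installs)
    return batches * minutes_per_node


if __name__ == "__main__":
    best = install_time_minutes(100, 5)    # 5 minutes per machine
    worst = install_time_minutes(100, 30)  # 30 minutes per machine
    print(f"Serial install of 100 nodes: "
          f"{best / 60:.1f} to {worst / 60:.1f} hours")
```

Done serially, this comes to somewhere between about a working day and more than two full days of installation time before any access, user or security configuration begins. Installing in parallel shortens the wall-clock time but not the number of machines that must be configured.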

What’s more, many DIY solutions are simply open source projects lacking a bit in terms of formal support and documentation. It’s true that open source has zero upfront costs compared to professionally developed cluster solutions. However, there are many associated costs — in terms of time and resources — that open source carries over the lifetime of the cluster.

It is far more cost-effective to use a commercially developed and fully supported solution that can simplify the management of large pools of servers.

System stability is highly dependent on software consistency across compute nodes. Even a simple inconsistency can cripple a parallel program. An effective management software solution is needed to handle the underlying OS distribution, communication libraries such as MPI, as well as revision control for third-party software.
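As an illustration of the kind of consistency check this implies, the sketch below compares per-node software manifests and reports any package whose version differs from the majority. It is not how Scyld ClusterWare or any particular product works; the node names, package names and version strings are invented for the example, and gathering the manifests from real nodes is left out.

```python
from collections import Counter


def find_version_skew(manifests):
    """Report packages whose version is not identical on every node.

    `manifests` maps node name -> {package name: version string}.
    Returns {package: {node: version}} listing only the nodes that
    differ from the most common version (None means the package is
    missing on that node).
    """
    packages = set()
    for installed in manifests.values():
        packages.update(installed)

    skew = {}
    for pkg in sorted(packages):
        versions = {node: installed.get(pkg)
                    for node, installed in manifests.items()}
        counts = Counter(versions.values())
        if len(counts) > 1:
            majority_version, _ = counts.most_common(1)[0]
            skew[pkg] = {node: v for node, v in versions.items()
                         if v != majority_version}
    return skew


if __name__ == "__main__":
    # Hypothetical manifests; real ones would be collected from each node.
    example = {
        "node0": {"kernel": "2.6.9", "mpi-lib": "1.2.7"},
        "node1": {"kernel": "2.6.9", "mpi-lib": "1.2.7"},
        "node2": {"kernel": "2.6.9", "mpi-lib": "1.2.6"},
    }
    print(find_version_skew(example))
```

With full per-node installs, a check like this has to be run and acted on for every node; provisioning every node from a single image avoids the skew in the first place.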

When software provisioning involves installing full Linux distributions and supporting software on each cluster node’s disk, the chances of inconsistency are inherently much higher. With HPC jobs that sometimes take weeks or months to complete, wasted work is very costly. It is far better to use stateless provisioning of software, direct to memory, as it is safer and orders of magnitude faster to reprovision on demand.

Security threats can be introduced with each server deployed. A full install of Linux on each cluster node introduces the software components that hackers typically target. Full user shells and network services expose multiple attack points and would not be required if the cluster were architected properly.

HPCwire: Does cluster virtualization address these challenges?

Nist: Cluster virtualization enables large pools of servers to act and appear like a single, consistent virtual machine. Practically, it makes a Linux cluster so easy to use that you don’t have to be a sys admin to run one.

At Penguin, our clusters are application-ready, so our customers can be up and running fast. They are driven by Scyld ClusterWare, which is a productive, simple and hardware-agnostic HPC system that enables administrators to install, monitor and manage the cluster as a single system, from a single node: the Master.

It has a powerful unified process space so end users can easily and intuitively deploy, manage and run complex applications from the Master, as if it were a single virtual machine. What this means is that the users and the admin are presented with what appears to be a single server that can scale up or down with virtually any number of processors behind it and still be as simple to use and manage as a single server. The compute nodes exist only to run applications specified by the Master node.

Because the compute nodes run an initial lightweight, in-memory distribution, they can be provisioned rapidly, so users can flexibly add or delete nodes on demand, in seconds, making the cluster extraordinarily scalable and resilient. In this architecture, the compute nodes can at any time provision precisely the additional dependencies that applications require. The lightweight compute environment is stripped of any unnecessary system services and their associated vulnerabilities, making the cluster nearly impossible to attack and thus inherently more secure.

Finally, with single point provisioning and consistency control mechanisms, there are no version skews, so the system is intrinsically more reliable. Compared to legacy technologies, our virtualized clusters offer a more efficient, secure way to manage servers that delivers the productivity, reliability, scalability and lower total cost of ownership that high-performance business computing demands.

Read Part Two of the “Impact of Cluster Virtualization on HPC” interview.

—–

Donald Becker is the CTO of Penguin Computing and co-inventor of Beowulf clusters. Donald is an internationally recognized operating system developer and the original inventor of Beowulf clustering. In 1999 he founded Scyld Computing and led the development of the next-generation Beowulf cluster operating system. Prior to founding Scyld, Donald started the Beowulf Parallel Workstation project at NASA Goddard Space Flight Center. He is the co-author of How to Build a Beowulf: A Guide to the Implementation and Application of PC Clusters. With colleagues from the California Institute of Technology and the Los Alamos National Laboratory, he was the recipient of the IEEE Computer Society 1997 Gordon Bell Prize for Price/Performance.

Pauline Nist is the SVP of Product Development and Management at Penguin Computing. Before joining Penguin Computing, Pauline served as vice president of Quality for HP’s Enterprise Storage and Servers Division and immediately prior to that, as vice president and general manager for HP’s NonStop Enterprise Division, where she was responsible for the development, delivery, and marketing of the NonStop family of servers, database, and middleware software. Prior to the NonStop Enterprise Division (formerly known as Tandem Computers), Pauline served as vice president of the Alpha Servers business unit at Digital Equipment Corporation.
