The Impact of Cluster Virtualization on HPC

By Nicole Hemsoth

December 1, 2006

Is virtualization the next “big thing” to impact high performance computing, or is much of the buzz just hype? And among its many definitions, which virtualization technologies and strategies are actually solving real HPC challenges today? The adoption of Linux and Windows clusters, the rise of multi-core processors, and the spread of various virtualization technologies offer a great deal of potential for organizations to increase capabilities and reduce costs.

In part one of a two-part interview, HPC luminary Don Becker, CTO of Penguin Computing and co-inventor of the Beowulf clustering model, and Pauline Nist, senior vice president of product development and management for Penguin Computing, discuss how cluster virtualization can enable large pools of servers to appear and act as a single, unified system. According to them, a cluster virtualization strategy is one of the most practical and cost-effective methodologies for reducing the complexity and overall administrative burden of clustered computing and getting the most out of your server resources.

HPCwire: There is a tremendous amount of buzz surrounding virtualization these days, and the term seems to be applied to many different things. Can you explain a bit about what virtualization is/means and what its benefits are?

Donald Becker: There is a lot of buzz, but it’s not necessarily hype. However, there are multiple definitions of “virtualization,” which can cause confusion. Essentially, the two types boil down to 1) making many machines look like one and 2) making one machine work as many. There is a lot of activity and noise right now around virtualization as it applies to virtual machine technology, so there is a tendency to think of VMs as the only example of virtualization.

Machine virtualization or virtual machine technology (à la VMware and Xen) is where you divide the resources of a single server into multiple execution environments, or multiple virtual machines. It’s the exact opposite of making many look like one: this is making one machine work as many separate instances. The concept has been around for more than 40 years on proprietary IBM machines, but much of the buzz in the marketplace today is around the proliferation of this capability on commodity servers and operating systems.

HPCwire: Why is there so much interest right now in the virtual machine technology?

Becker: The key driving factor is the proliferation of applications in the enterprise and the fact that this technology is now available on the commodity x86 platforms that dominate enterprise computing today. Many of these applications have strict requirements for the environment they run in, sometimes requirements that are mutually exclusive with those of other applications, and many are business-critical functions that cannot be compromised.

As a result, applications became siloed across racks of servers, each running at about 15-20 percent utilization. This is really inefficient and very costly to scale. What’s appealing about virtual machine technology is the consolidation of many applications onto a single server, such that each application thinks it has the machine to itself. Each virtual machine is effectively a logical instance of the complete machine environment with its own CPU, memory and I/O. The net effect is fewer physical servers with much higher per-server utilization. This reduces capital costs on hardware and operational costs related to power, real estate and so on. It does not, however, change the fact that you still have the same, or perhaps even larger, pool of logical servers that must be provisioned and managed.
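
As a rough, back-of-the-envelope illustration of that consolidation math (the 15-20 percent utilization figure comes from the interview; the 40-server pool and the 60 percent per-host target below are purely hypothetical), a short Python sketch:

    import math

    # Back-of-the-envelope consolidation math; all numbers are illustrative.
    siloed_servers = 40        # hypothetical: one application silo per server
    avg_utilization = 0.175    # roughly the 15-20 percent cited above
    target_utilization = 0.60  # hypothetical per-host target after consolidation

    # Total work expressed in "fully busy server" equivalents.
    busy_equivalents = siloed_servers * avg_utilization                    # 7.0

    # Hosts needed if each one is driven to the target utilization.
    consolidated_hosts = math.ceil(busy_equivalents / target_utilization)  # 12

    print(f"{siloed_servers} siloed servers -> {consolidated_hosts} consolidated hosts")

The same logical servers still exist as virtual machines, which is exactly the provisioning and management point made above.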

Pauline Nist: The major enterprise server vendors, Intel and AMD, are rolling out ongoing mechanisms at the processor level to better enable virtual machine technology and to position themselves as optimal platforms for this new wave of innovation. Solutions like VMware have built a tremendous business in just a few short years bringing a VM solution to both Linux and Windows platforms, first with a capability that sits on top of the standard OS, and then with a variant that sits directly on top of the hardware and launches OS environments. This is where it gets very interesting.

And now you have Microsoft asserting its intention to offer core virtual machine capability in Windows, and the major enterprise Linux distributions, Red Hat and Novell, working with the open source company XenSource, maintainer of the Linux VM technology Xen.

These OS platform providers are all making announcements and clearly jockeying for position to stake their claim as the “socket” on the hardware into which software plugs in. Meanwhile, VMware maintains such a strong presence on both Windows and Linux that the OS vendors will of course be working closely with them as this all progresses. It will be interesting to see how this all shakes out.

HPCwire: So we’ve heard about one definition of virtualization: “Making one machine work as many machines.” Tell us more about the other definition: “Making many look like one.”

Nist: Historically, the term virtualization has been applied more broadly to the abstraction of physical devices from their clients, providing better availability and mobility while reducing the complexity of dealing with larger and larger pools of resources.

For example, virtualization is a well-known concept in networking, from virtual channels in ATM, to virtual private networks, virtual LANs and virtual IP addresses.

Storage virtualization is the pooling of physical storage from multiple network storage devices into what appears to be a single storage device that is managed from a central console — for example, a storage area network or SAN.

Likewise, clustered and Grid computing enables the “virtualization” of increasingly large pools of servers, along with their interconnects and storage needs, to reduce the complexity of managing and coordinating these resources for the purpose of distributed computing.

Becker: Organizations should be taking a closer look at their server utilization and determining if a cluster virtualization strategy would work for them. The financial and efficiency benefits are extremely compelling — and today’s tools and solutions offer very easy integration, making cluster virtualization one of the most practical and cost-effective methodologies for reducing the complexity and overall administrative burden of clustered computing.

This is where Penguin Computing plays. Our virtualized cluster solutions are driven by Scyld ClusterWare, our unique Linux software, which makes large pools of Linux servers appear and act like a single, consistent, virtual system. Through a single point of command and control, or “Master,” thousands of systems can be managed as if they were one machine.

This dramatically simplifies deployment and management and significantly improves datacenter resource utilization as well as server performance and scalability. Our aim is to make Linux clustering as powerful and easy to use as expensive symmetric multiprocessing (SMP) environments, at a Linux price point.

By combining the economics of open source with the simplicity and manageability of cluster virtualization, we’re able to provide an extremely compelling financial benefit to our customers, removing many of the ongoing costs associated with the complex management and administration of competing offerings.

HPCwire: Where and how did this concept of cluster virtualization start? What was the driving force and how has it impacted HPC?

Becker: Cluster computing has become the hottest hardware trend in HPC and is poised to ride the wave of “mainstream adoption.” Scientific and academic organizations have broadly embraced clustered system architectures as a cost-effective alternative to monolithic UNIX SMP systems for computationally intensive simulations. Commercial businesses have more recently adopted Linux clusters for edge-of-the-network applications such as Web hosting.

The tipping point that enabled the initial move to Linux and clustered solutions in HPC was driven predominantly by the compelling order-of-magnitude cost savings offered by commodity hardware and open source software. As the market has matured and more complete solutions have become available, the value proposition for Linux clusters has expanded as well to include significant TCO savings spanning hardware, software and support costs; flexibility in configuration and upgrade options; freedom from the constraints of single-vendor development schedules and support; greater flexibility through open source software customization; and rapid performance advancements in industry-standard processor, storage and networking technologies.

The catalyst behind the evolution of cluster virtualization was the ongoing complexity of administration and management of traditional, roll-your-own Linux clusters. This complexity added significant dollars to the overall cost of a Linux cluster. In the midst of the rapid adoption of Linux clusters, IDC conducted a research study in 2003 asking users to identify the greatest challenges they faced in implementing clusters. The most frequently cited response was system management, which was selected by 48 percent of respondents. Breaking the issues of manageability down further, we see a number of challenges that cluster virtualization was designed to address.

Ad hoc, do-it-yourself clusters are cumbersome to use and generally suited only to batch processing. Usage habits change, however: clusters in HPC have been effectively adapted for batch computing environments, but until recently they have not been well suited for interactive use.

This is why batch schedulers have been so prevalent. The issue for users is that a traditional 100-node cluster presents them with 100 separate servers. Managing jobs and their related processes, data and so on means manually interacting with all of those machines, which is not only tedious and time consuming but also error prone.
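
To make that tedium concrete, here is a hypothetical Python sketch of what merely locating one user’s processes looks like when every node must be touched individually. The node names, the user name jsmith and the use of ssh are illustrative assumptions, not any particular cluster’s tooling:

    import subprocess

    # Hypothetical 100-node cluster: node001 ... node100.
    nodes = [f"node{i:03d}" for i in range(1, 101)]

    # Find every process belonging to user "jsmith": one ssh round trip per node.
    for node in nodes:
        try:
            out = subprocess.run(
                ["ssh", node, "ps", "-u", "jsmith", "-o", "pid=,comm="],
                capture_output=True, text=True, timeout=15, check=False,
            )
        except subprocess.TimeoutExpired:
            print(f"{node}: unreachable")   # every hung node means manual follow-up
            continue
        for line in out.stdout.splitlines():
            print(f"{node}: {line}")

A cluster that presents itself as a single system collapses all of this into one command on one machine, which is the contrast drawn later in the interview.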

Complexity of administration and management can be both time-intensive and costly, and it is even worse for the admin who must set up and maintain the cluster. The admin is presented with 100 separate machines that must each be loaded with the OS and cluster management software, which can take 5 to 30 minutes per machine if everything goes well. Then all of the servers must be configured for access, users, security and coordination as a cluster. Extensive scripting is usually required, and these scripts must be maintained by different people over time, adding overhead and cost.
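
Taking the 5-to-30-minutes-per-machine figure at face value, the arithmetic for setting up that hypothetical 100-node cluster one machine at a time is sobering:

    nodes = 100
    best_case_hours = nodes * 5 / 60     # about 8.3 hours of installs
    worst_case_hours = nodes * 30 / 60   # 50 hours of installs
    print(f"Serial installs alone: {best_case_hours:.1f} to {worst_case_hours:.1f} hours,"
          " before any access, user or security configuration")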

What’s more, many DIY solutions are simply open source projects with little in the way of formal support and documentation. It’s true that open source has zero upfront license costs compared to professionally developed cluster solutions. However, there are many associated costs, in terms of time and resources, that open source carries over the lifetime of the cluster.

It is far more cost effective to use a commercially developed and fully supported solution that can simplify the management of large pools of servers.

System stability is highly dependent on software consistency across compute nodes. Even a simple inconsistency can cripple a parallel program. An effective management software solution is needed to handle the underlying OS distribution, communication libraries such as MPI, as well as revision control for third-party software.
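
Purely as an illustration of the kind of consistency audit that otherwise has to be scripted and maintained by hand, here is a hypothetical Python sketch that fingerprints one layer of the stack (the MPI launcher) on every node and flags skew. The node names, the ssh transport and the choice of mpirun --version as the fingerprint are assumptions for the example, not a reference to any specific management product:

    import subprocess
    from collections import defaultdict

    nodes = [f"node{i:03d}" for i in range(1, 101)]   # hypothetical node names

    # Group nodes by the first line of `mpirun --version` reported on each one.
    versions = defaultdict(list)
    for node in nodes:
        out = subprocess.run(["ssh", node, "mpirun", "--version"],
                             capture_output=True, text=True, check=False)
        fingerprint = out.stdout.splitlines()[0] if out.stdout else "unknown"
        versions[fingerprint].append(node)

    if len(versions) > 1:
        print("Version skew detected:")
        for fingerprint, members in sorted(versions.items()):
            print(f"  {fingerprint}: {len(members)} node(s)")

In practice the same audit would need to cover the OS distribution, communication libraries and third-party software mentioned above, which is why a management layer that guarantees consistency beats after-the-fact checking.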

When software provisioning involves installing a full Linux distribution and supporting software to each cluster node’s disk, the chances of inconsistency are inherently much higher. With HPC jobs that sometimes take weeks or months to complete, the risk of wasted work is very costly. It is far better to use stateless provisioning of software, direct to memory, as it is safer and orders of magnitude faster to reprovision on demand.

Security threats can be introduced with each server deployed. A full install of Linux on each cluster node introduces the software components that hackers typically target. Full user shells and network services expose multiple attack points and would not be required if the cluster were architected properly.

HPCwire: Does cluster virtualization address these challenges?

Nist: Cluster virtualization enables large pools of servers to act and appear like a single, consistent virtual machine. Practically, it makes a Linux cluster so easy to use that you don’t have to be a sys admin to run one.

At Penguin, our clusters are application-ready, so our customers can be up and running fast. They are driven by Scyld ClusterWare, a productive, simple and hardware-agnostic HPC system that enables administrators to install, monitor and manage the cluster as a single system, from a single node: the Master.

It has a powerful unified process space so end users can easily and intuitively deploy, manage and run complex applications from the Master, as if it were a single virtual machine. What this means is that the users and the admin are presented with what appears to be a single server that can scale up or down with virtually any number of processors behind it and still be as simple to use and manage as a single server. The compute nodes exist only to run applications specified by the Master node.
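
As a rough sketch of what that single-system view means for an end user, everything below is typed on the Master and the user never logs into a compute node. The application name, rank count, user name and the generic mpirun launcher are hypothetical; this is not a depiction of Scyld’s own commands:

    import subprocess

    # 1. Launch a (hypothetical) 64-rank MPI job across the compute nodes,
    #    exactly as if the Master were one large server.
    job = subprocess.Popen(["mpirun", "-np", "64", "./my_simulation", "input.dat"])

    # 2. Monitor it with ordinary single-machine tools: in a unified process
    #    space, a plain `ps` on the Master also shows ranks running on the
    #    compute nodes.
    subprocess.run(["ps", "-u", "jsmith", "-o", "pid,comm"], check=False)

    job.wait()   # block until the whole job completes

The point is not the specific launcher but that nothing in the workflow requires logging into, or even knowing about, individual compute nodes.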

Because the compute nodes run an initial lightweight, in-memory distribution, they can be provisioned rapidly, so users can flexibly add or delete nodes on demand, in seconds, making the cluster extraordinarily scalable and resilient. In this architecture, the compute nodes can at any time provision precisely the additional dependencies required by their applications. The lightweight compute environment is stripped of unnecessary system services and their associated vulnerabilities, making the cluster nearly impossible to attack and thus inherently more secure.

Finally, with single-point provisioning and consistency control mechanisms, there is no version skew, so the system is intrinsically more reliable. Compared to legacy technologies, our virtualized clusters offer a more efficient, secure way to manage servers that delivers the productivity, reliability, scalability and lower total cost of ownership that high-performance business computing demands.

Read Part Two of the “The Impact of Cluster Virtualization on HPC” interview.

—–

Donald Becker is the CTO of Penguin Computing and co-inventor of Beowulf clusters. An internationally recognized operating system developer, he founded Scyld Computing in 1999 and led the development of the next-generation Beowulf cluster operating system. Prior to founding Scyld, Donald started the Beowulf Parallel Workstation project at NASA Goddard Space Flight Center. He is the co-author of How to Build a Beowulf: A Guide to the Implementation and Application of PC Clusters. With colleagues from the California Institute of Technology and the Los Alamos National Laboratory, he received the IEEE Computer Society 1997 Gordon Bell Prize for Price/Performance.

Pauline Nist is the SVP of Product Development and Management at Penguin Computing. Before joining Penguin Computing, Pauline served as vice president of Quality for HP’s Enterprise Storage and Servers Division and immediately prior to that, as vice president and general manager for HP’s NonStop Enterprise Division, where she was responsible for the development, delivery, and marketing of the NonStop family of servers, database, and middleware software. Prior to the NonStop Enterprise Division (formerly known as Tandem Computers), Pauline served as vice president of the Alpha Servers business unit at Digital Equipment Corporation.
