Cluster Management: Are We Making Progress?

By Dr. Robert Panoff, Shodor Foundation

February 10, 2005

I was invited by the editors of HPCwire to address some issues of cluster management, but others have recently done an admirable job of that. In a recent interview in HPCwire, for example, Tom Quinn, Director of Government Business Development at Linux Networx, spoke of the need to accurately measure the real performance of a system, focusing on true productivity, not just raw speed.

This reminded me of my own efforts as a post-doc in the 1980s to replace MIPS or MFLOPS as a comparison measure with a more meaningful unit I called MYPS: a direct measure of how fast a VAX-like mini or early supercomputer ran MY PROGRAM. As a many-body physicist, I wanted to get real physics done, not just keep a processor chugging along. As we return, it seems, to the primitive days of “build your own” supercomputing, I am not sure we have learned the lessons of high performance computing history. And so I want to address not just cluster management, but the computational science to be achieved in a well-managed cluster-computing environment.

Twenty years of advances in high performance computing have ushered in three significant changes in the conduct of computational science. First, for the most part, we have been able to concentrate more on modeling instead of programming. Second, given the large volumes of data that are available and necessary to describe complex systems, we emphasize visualization instead of graphing. Third, individual desktop computing is “good enough” for many problems, at least in education, which once required allocations of time on a national supercomputer.

At the same time, the problems we really want to solve exhaust the combined power of the world's fastest supercomputers. As a simple example, just to initialize the spatial coordinates for a computation of a drop of water at the molecular level would take about a decade on the Earth Simulator! Go ahead, do the math; this can be a good exercise for your students learning units conversion. And while the problems at the forefront of computational science are as large and complex as ever, the replacement of large-scale stable systems with build-your-own clusters suggests we may be backpedaling on the advances that made computational platforms usable by computational scientists themselves.
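For those who want to assign the exercise, the arithmetic can be sketched in a few lines of Python. The drop size, the per-coordinate operation count, and the machine speed below are my own illustrative assumptions, not figures from this article:

```python
# Back-of-the-envelope version of the "drop of water" exercise.
# All constants below are assumptions chosen for illustration.

AVOGADRO = 6.022e23          # molecules per mole
DROP_GRAMS = 0.05            # a small drop of water, roughly 0.05 mL
MOLAR_MASS_WATER = 18.0      # grams per mole

molecules = DROP_GRAMS / MOLAR_MASS_WATER * AVOGADRO   # ~1.7e21 molecules
coordinates = 3 * molecules                            # x, y, z for each

OPS_PER_COORDINATE = 2       # assumed floating-point work to place one coordinate
EARTH_SIM_FLOPS = 36e12      # Earth Simulator peak, roughly 36 Tflop/s

seconds = coordinates * OPS_PER_COORDINATE / EARTH_SIM_FLOPS
years = seconds / (3600 * 24 * 365)
print(f"{molecules:.2e} molecules, about {years:.1f} years just to initialize")
```

With these assumptions the answer lands near a decade; letting students vary the per-coordinate cost, and watch the answer swing by orders of magnitude, is itself a useful part of the exercise.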

But what about these home-grown cluster computing platforms that will be at the heart of any advances in computational science? Twenty years ago, frustrated by the lack of access to true supercomputing, computational physicists such as Mal Kalos at NYU and Norman Christ at Columbia doubled as computer scientists, designing and building their own computing environments because they didn't trust the marketplace to provide the power they needed to advance the science. One memorable comment by Kalos summed up the zeitgeist of that era: “The temptation to retire from physics and become a computer scientist is strong,” he mused. “Why try to perform hard calculations when, for less work and more money, one can simply talk about them?”

By the early 1990s, access to national and even state supercomputing centers – along with significant advances in trustworthy compilers that harnessed a significant fraction of the computing power – allowed many of us to concentrate on being scientists. Yet now, in the age of commodity computing and reconfigurable clusters, it seems we have cycled back to an “if you build it, it will hum” approach to computational science. While we were computing, someone convinced administrators and funding agencies that only a select few needed and were using the high performance computing centers, and that in a time of budget cuts most scientists should now be able to “get by” with a self-assembled and self-managed cluster. North Carolina dismantled its state supercomputing center in favor of a yet-to-be-realized promise of a state-wide grid of campus-based clusters. In the meantime, at least in the opinion of many, less science is being done.

Many years ago, Plato posed the dilemma of the philosopher king thusly: “Inasmuch as philosophers only are able to grasp the eternal and unchangeable, and those who wander in the region of the many and variable are not philosophers, I must ask you which of the two classes should be the rulers of our State? And how can we rightly answer that question?”

More recently, a similar debate has been raging in the computational science literature over the appropriate education of a new generation of computational biologists, physicists, engineers, and chemists. Would it be better for well-grounded biologists to learn a little math, or for well-prepared mathematicians to learn a little biology? Is it better to teach a physicist enough about clusters and their management to get some real science done, or should we be trying to teach a clusterist just enough physics to be dangerous?

Ultimately, the crucial question of computational science is: how do you know if the computation is to be believed? We ask whether the computation is verified – are the results reproducible? – and whether it is validated – is good science being computed? In order to take advantage of various configurations of processors and networks, the physical problems to be studied usually need to be approximated or expressed in different ways, yet many of these “uncontrolled approximations” degrade the science in the process of improving system “performance.” Isn't this the real question Quinn raised about productivity? Computational scientists want machines that work, without having to change the science too much to get them to work well. Measuring the performance and productivity of a cluster or grid of clusters still comes down to the quality of the science that the system will enable.

When I talk to other physicists using home-grown clusters, the conversation invariably descends into lamentations over the lack of compilers that let us maximize performance AND productivity. Most of my colleagues have returned to writing their own MPI versions of their codes for both computation and data handling (sorry, OpenMP just isn't “there” yet). It's back to being programmers instead of modelers. And what about my colleagues who have been able to get their codes to run on the clusters? They tend to spend their time at meetings discussing tricks to minimize network latency in a cluster, rather than insights into the physics coming off of that cluster. Grid computing over national networks complicates the communication-to-computation latency and magnifies the challenges of resource allocation. This is progress?

One program funded by the National Science Foundation through the ATE program is looking at ways of training the technicians needed to set up and manage clusters through a consortium of community colleges (see http://highperformancecomputing.org), but they are realizing that “cluster management” requires a degree of understanding of the science in the codes to be run. While new tools (BCCD from Paul Gray at the University of Northern Iowa, along with earlier tools such as OSCAR and ROCKS) make it easier to set up clusters, to manage them, and to provide a moderate level of system security, even the simplest examples of real science codes that run on clusters exceed the science and math preparation of the technician. Exemplary templates are being developed to make it easier to explore cluster computing paradigms in real science applications; some of these are accessible at the new Pathway project of the National Science Digital Library, the Computational Science Education Reference Desk (http://cserd.nsdl.org). These approaches can help.

And what of the issue of how to educate the current and future generations of scientists and engineers, starting at the undergraduate level, to use these resources effectively? Unless we are satisfied to allow education to lag research by ten or more years, we need to start looking at ways of making the computational experience part of the science training. Life and physical scientists need to be able to communicate with computer scientists in such a way that their collaboration gives rise to a productive computing infrastructure for the science to advance. This is one area that needs new ideas, collective effort, software development, and – I would expect – considerable funding. A renewed effort should be undertaken to develop computational science problem-solving environments that make the underlying computing as transparent as possible, while allowing direct performance monitoring of cluster and grid resources for validation and verification purposes.

One example of a content domain that has recognized the need to “let scientists be scientists” is computational chemistry, where there are well-tested applications such as Gaussian and GAMESS. The Computational Chemistry Grid Project (https://www.gridchem.org/project/faq.htm) has taken on the task of porting important applications to a cluster/grid environment for the community, and this has enabled chemists to stay chemists. Physics and biology have yet to develop a similar set of common applications, and so we seem to be left taking care of our own code and cluster management.

George Santayana is often quoted as saying that those who fail to learn from history are doomed to repeat it. Another George (Bernard Shaw) said that one thing we have learned from history is that we have learned nothing from history. Perhaps by the time the history of high performance computing is written computer scientists and computational scientists will have found a way to work with each other and to learn from each other to the advancement of both. We can always hope.


Dr. Robert M. Panoff is founder and Executive Director of The Shodor Education Foundation, Inc., a non-profit education and research corporation dedicated to reform and improvement of mathematics and science education by appropriate incorporation of computational and communication technologies.

Dr. Panoff has been a consultant at several national laboratories and is a frequent presenter at NSF-sponsored workshops on visualization, supercomputing, and networking. He has served on the advisory panel for the Applications of Advanced Technology program at NSF, and is a founding partner of the NSF-affiliated Corporate and Foundation Alliance.

Dr. Panoff received his B.S. in physics from the University of Notre Dame and his M.A. and Ph.D. in theoretical physics from Washington University in St. Louis, undertaking both pre- and postdoctoral work at the Courant Institute of Mathematical Sciences at New York University.

 
