CLUSTERED HPC: AN INTERVIEW WITH THOMAS STERLING

November 8, 2000

by Alan Beck, editor in chief, LIVEwire

Dallas, Texas — Thomas Sterling holds a joint appointment with NASA’s Jet Propulsion Laboratory (JPL) and the California Institute of Technology (Caltech), serving as Principal Scientist in JPL’s High Performance Computing group and Faculty Associate in Caltech’s Center for Advanced Computing Research. He received his Ph.D. as a Hertz Fellow from MIT in 1984.

For the last 20 years, Sterling has engaged in applied research in parallel processing hardware and software systems for high-performance computing. He was a developer of the Concert shared-memory multiprocessor, the YARC static dataflow computer, and the Associative Template Dataflow computer concept, and has conducted extensive studies of distributed shared-memory cache-coherent systems. In 1994, Sterling led the NASA Goddard Space Flight Center team that developed the first Beowulf-class PC clusters. Since 1994, he has been a leader in the national Petaflops initiative. He is the Principal Investigator for the interdisciplinary Hybrid Technology Multithreaded (HTMT) architecture research project sponsored by NASA, NSA, NSF, and DARPA, which involves a collaboration of more than a dozen cooperating research institutions. Dr. Sterling holds six patents, and was one of the winners of the 1997 Gordon Bell Prize for Price/Performance.

Sterling gave a state-of-the-field talk on COTS Cluster Systems for High-Performance Computing at SC2000; HPCwire talked with him to obtain a better perspective on his views:

HPCwire: Your work in clustered supercomputing has literally revolutionized HPC in the last few years. But surely there is a limit to what is possible for this type of technology — or is there? What are the most serious factors currently circumscribing the capabilities of clustered HPC? Are any solutions on the horizon?

STERLING: The rate of growth in numbers, scale, and diversity of the implementation and application of clusters in HPC, including (but not limited to) Beowulf-class systems, has been extraordinary. But my work with Don Becker on the early Beowulf systems succeeded in no small part because of much previous and continuing good work accomplished by many others in the distributed computing community in hardware and software systems. Workstation clusters (e.g. COW, NOW), message-passing libraries (e.g. PVM, MPI), operating systems (e.g. BSD, Linux), middleware (e.g. Condor, Maui Scheduler, PBS, the Scyld scalable cluster distribution), and advanced networking (e.g. Myrinet, QSW, cLAN) are only a few examples of the ideas, experiences, and components that contributed to the synthesis of Beowulf-class PC clusters and continue to push cluster computing forward at an accelerating rate. And driving all of that enabling technology are the computational scientists adapting their distributed application algorithms to the not always friendly operational properties of successive generations of Beowulf platforms.
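
For readers new to the model Sterling describes, the message-passing libraries he cites expose a cluster as a set of communicating processes. A minimal sketch using the standard MPI C interface (the ranks, tag, and payload below are illustrative choices, not drawn from the interview):

    /* Minimal MPI sketch: rank 0 sends a value to rank 1.
       Build with an MPI C compiler (e.g. mpicc) and run with
       at least two processes, e.g. mpirun -np 2 ./a.out */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;  /* arbitrary payload */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d from rank 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }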

It has been a pleasure to play a role in the Beowulf phenomenon, but it is the accomplishment of many, not just a few. Many government organizations have contributed to this, including a number of NASA and DOE labs, with valuable tools disseminated to the community as open source software by some of them (e.g. Argonne, Oak Ridge, Goddard, Ames). And this is being paralleled by the more recent work in large NT-based clusters of PCs as well (e.g. at NCSA, CTC, UCSD). Of course, the field of Beowulf computing has now matured such that it is partnered with industry, large and small, in hardware (e.g. Compaq, IBM, VA Linux, HPTI, Microway) and software (e.g. TurboLinux, Red Hat, SuSE, Scyld), providing improved functionality, performance, and robustness at reasonable (usually) cost. As a result, many tasks in academia, industry, government, and commerce are now performed on this class of systems, providing a stable architecture family for both ISVs and applications programmers to target with confidence while riding the Moore wave through future generations of advanced technology. Indeed, many of our computer science students have their first experiences with parallel computing on small Beowulfs.

How far clusters in general and Beowulf-class systems in particular can go is a tantalizing question. The challenges today may be seen in three dimensions: 1) bandwidth and latency of communications, 2) usability and generality of system environments, and 3) availability and robustness for industrial-grade operation. The first is now being addressed by industry, perhaps starting with the pathfinding work of Chuck Seitz with Myrinet. Improvements in both latency and bandwidth of one and two orders of magnitude over the baseline Fast Ethernet LAN are being achieved, with such consortium drivers as VIA and InfiniBand. Bandwidths beyond 10 Gbps and real latencies approaching a microsecond are on the horizon as zero-copy software and optical channels become mainstream for future system area networks.
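
Latency and bandwidth figures of the kind Sterling cites are conventionally measured with a ping-pong test: two nodes bounce a message back and forth and the round-trip time is halved. A minimal sketch, with the message size and repetition count chosen arbitrarily for illustration:

    /* Minimal MPI ping-pong sketch for one-way latency and bandwidth.
       The 1 MB message size and 100 repetitions are illustrative. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NBYTES (1 << 20)
    #define REPS   100

    int main(int argc, char **argv)
    {
        int rank;
        char *buf = malloc(NBYTES);
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double t0 = MPI_Wtime();
        for (int i = 0; i < REPS; i++) {
            if (rank == 0) {
                MPI_Send(buf, NBYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, NBYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, NBYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, NBYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        /* round trip halved gives the one-way time */
        double one_way = (MPI_Wtime() - t0) / (2.0 * REPS);

        if (rank == 0)
            printf("one-way time %g s, bandwidth %g MB/s\n",
                   one_way, NBYTES / one_way / 1e6);

        MPI_Finalize();
        free(buf);
        return 0;
    }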

A number of groups in the US, Japan, and Europe are developing tools to establish acceptable environments for managing, administering, and applying these systems to real-world workloads. This will take some time to shake out, although significant progress is finally being made. Various efforts to collect representative tools into usable distributions (e.g. Oscar, Scyld, Grendel) and make them available involve collaborations across many institutions. While such systems may never be easy to program or truly transparent or seamless in their supervision, they may prove sufficient within the bounds of practical necessity.

Finally, the issue of reliability is one that appears to vary dramatically. One hears horror stories of nodes dying every few hours, and others where complete systems stay up for more than half a year. At Caltech, our Beowulf “Naegling” has seen a worst-case interval between node failures of about 80 days and a best of almost 200 days. This is after surviving the usual burn-in period. Infant mortality is always part of the experience, and certain types of components (e.g. fans, power supplies, disks, NICs) tend to fail within the first few weeks. Then the systems stabilize. A similar process occurs with the software environments; bugs in the installation and configuration are exposed early on and have to be eliminated one by one, sometimes painfully. But industry investment in the mass-market nodes and networks and their recent efforts in system integration are showing results in improving availability and robustness. More work is needed in limiting the downtime of a system when an individual component dies. There are severe challenges in even detecting when wrong results are produced while a system keeps running. These are expected to receive increasing attention as a real market, especially in commerce, is found for systems as large as thousands of processors.
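
The first line of defense against the silent node deaths Sterling describes is a simple liveness probe. A minimal sketch, assuming each node exposes some TCP service that can be polled (the address and port below are hypothetical placeholders; a production monitor would add timeouts and result-checking):

    /* Minimal liveness probe: attempt a TCP connection to a node.
       Host and port are hypothetical; real cluster monitors also
       watch response times and cross-check computed results. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    int node_alive(const char *ip, int port)
    {
        struct sockaddr_in addr;
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) return 0;

        memset(&addr, 0, sizeof addr);
        addr.sin_family = AF_INET;
        addr.sin_port = htons(port);
        inet_pton(AF_INET, ip, &addr.sin_addr);

        int ok = connect(fd, (struct sockaddr *)&addr, sizeof addr) == 0;
        close(fd);
        return ok;
    }

    int main(void)
    {
        /* hypothetical compute-node address and sshd port */
        const char *node = "192.168.1.10";
        printf("node %s is %s\n", node,
               node_alive(node, 22) ? "alive" : "unreachable");
        return 0;
    }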

We are approaching the milestone (albeit somewhat arbitrary) of being able to assemble a Teraflops-scale Beowulf-class system for one million dollars. But the cost of running and maintaining such a system is non-trivial and has to be accounted for. And industry (e.g. Sun, Compaq, SGI, IBM) is playing an increasingly important role in making such systems accessible. Another area that is lagging is distributed mass storage and generalized parallel file servers. Systems oriented around the storage and fetching of mass data sets are likely to drive the commercial customer base for clusters and to play an important role in scientific computing as well. While some early systems are being employed (e.g. PPFS, PVFS), much work has yet to be done in this area. With system-on-a-chip (SOC) technology allowing multiple processors and their integrated caches to be implemented on a single die, and clock rates slowly increasing through the single-digit GHz regime, performance density is likely to continue to advance at a steady pace. Will we see a Petaflops Beowulf by 2010, as possibly implied by the Top500 list? It is not out of the question, although personally I hope we find a better way. Beowulf was always about picking the low-hanging fruit and has consistently shown that where there is a way, there is a will.
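
To put the Teraflops-for-a-million-dollars milestone in concrete terms, a back-of-the-envelope sizing calculation. The per-node figures below are illustrative assumptions for circa-2000 commodity hardware, not quoted prices:

    /* Back-of-the-envelope Beowulf sizing; all inputs are
       illustrative assumptions, not vendor figures. */
    #include <stdio.h>

    int main(void)
    {
        double target_flops = 1e12;   /* 1 Tflops goal */
        double node_flops   = 1e9;    /* ~1 Gflops per PC node (assumed) */
        double node_cost    = 1000.0; /* ~$1000 per node (assumed) */

        double nodes = target_flops / node_flops;
        printf("nodes needed: %.0f, hardware cost: $%.0f\n",
               nodes, nodes * node_cost);
        /* prints: nodes needed: 1000, hardware cost: $1000000 */
        return 0;
    }

Under these assumptions the hardware alone hits the $1M mark; as Sterling notes, operation and maintenance come on top of that.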

HPCwire: Within the last year several firms have emerged that are solely focused upon exploiting computing power from large networks of Internet-connected PCs. How do you view these efforts? What will ultimately determine the success or failure of such ventures?

STERLING: This is a new frontier in distributed computing, and one based on the perceived opportunity of an untried business model. What I call “cottage computing” is unique and has no analogue in other domains of economy or production (that I can think of) since the beginnings of the industrial revolution in the mid-18th century. The SETI@home experience is tantalizing and stimulates consideration of the broader applications driving these new enterprises. But I am extremely uncertain of the outcome. It will ultimately be determined by the complex interplay of factors including the difficulties of achieving adequate security in both directions, the relative value of diffuse computing cycles, and the competing alternatives. While I am not yet convinced of a favorable outcome, this is an exciting process with some very sharp people heavily engaged. Its evolution will be very interesting to watch over the next 18 months.

HPCwire: As Principal Investigator for the interdisciplinary Hybrid Technology Multithreaded (HTMT) architecture research project, you have a unique insight into the characteristics of these fascinating technologies. Please share some of your thoughts and observations with us.

STERLING: The multi-institution, interdisciplinary HTMT architecture research project is a four-year effort to explore a synthesis of alternative technologies and architectural structures to enable practical general-purpose computing in the trans-Petaflops regime. The genesis of this advanced exploratory investigation was catalyzed by the initial findings of the National Petaflops Initiative, a community-wide process, and is aligned with the strong recommendations of the President’s Information Technology Advisory Committee (PITAC) on high performance computing research directions. Through HTMT, significant insights have been acquired revealing the potential of aggressively exploiting non-conventional strategies to achieve ultra-scale performance. Perhaps the most important was the value of inter-relating system structure and disparate technologies to accomplish a synergy of complementary technology characteristics. Much of the public attention and controversy has focused on the technologies themselves, which pushed the capabilities of logic speed, storage capacity, and communications throughput to extremes.

While the project was not committed to any particular device, it studied specific example technologies in detail, in some cases contributing to their advancement. Among these, the innovative packet-switched Data Vortex optical network, exploiting both time-division and wavelength-division multiplexing, may have near-term impact for a wide range of high-end systems. Optical holographic storage was shown to provide one possible means of supplying a high-density, high-throughput memory layer between primary and secondary storage. The merger of semiconductor DRAM cells and CMOS logic was shown to enable Processor-in-Memory (PIM) smart memory structures that may make possible new relationships between high-speed processors and a memory hierarchy imbued with extended functionality. The most controversial aspect of the project was its investigation of superconductor rapid single flux quantum (RSFQ) logic. Conventional wisdom holds that earlier experiences at IBM and in Japan demonstrated that computers built from superconductor electronics were infeasible, and that the cooling requirements made them impractical in any case.

The findings of the HTMT project are that niobium-based RSFQ logic is both feasible and practical and affords unique opportunities in the design of very high-speed processors with clock rates between 50 GHz and 150 GHz. However, within the constraints of existing fabrication facilities and industrial/government investment, the likelihood of realizing such components is remote. Even more significant than the technologies within HTMT is the architecture that would incorporate them. HTMT explored the potential of a dynamic adaptive resource management methodology called “percolation” that employs smart memories to determine when tasks are to be performed and to pre-stage all information related to task execution proactively using low-cost in-memory logic. The conclusion is that such small processors can remove the combined problems of overhead and latency from the main processors while performing many of the low-locality, data-intensive operations in the memories themselves. The result would be highly efficient operation even on those algorithms that have proven difficult to optimize in the past. The overall result of the HTMT project is a strong case for increased investment in high performance computer system research, which stands to capture significant benefits as yet unexploited.
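
Sterling’s description of percolation — memory-side logic deciding when a task is ready and pre-staging its operands before a fast processor ever touches it — can be sketched as a two-stage pipeline. This is a conceptual illustration only; the task structure and staging step below are invented for the example and are not taken from the HTMT design:

    /* Conceptual sketch of percolation-style scheduling: "smart
       memory" logic pre-stages a task's operands into a fast buffer,
       and the main processor runs only fully staged tasks. All
       structures here are invented for illustration. */
    #include <stdio.h>
    #include <string.h>

    #define OPERANDS 4

    struct task {
        double slow_mem[OPERANDS];  /* operands in slow memory */
        double staged[OPERANDS];    /* copy staged near the processor */
        int ready;                  /* set by memory-side logic */
    };

    /* Memory-side (PIM) step: gather operands ahead of execution. */
    void percolate(struct task *t)
    {
        memcpy(t->staged, t->slow_mem, sizeof t->staged);
        t->ready = 1;
    }

    /* Processor step: touches only staged data, so it never stalls
       on long memory latencies. */
    double execute(const struct task *t)
    {
        double sum = 0.0;
        for (int i = 0; i < OPERANDS; i++)
            sum += t->staged[i];
        return sum;
    }

    int main(void)
    {
        struct task t = { .slow_mem = {1.0, 2.0, 3.0, 4.0} };
        percolate(&t);  /* overhead absorbed by the smart memory */
        if (t.ready)
            printf("result = %g\n", execute(&t));
        return 0;
    }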

HPCwire: What are the most important issues facing HPC today? What are the best ways those within the community can pursue creative solutions?

STERLING: The dominant strategic issues today are: first, whether HPC is important, and second, whether all future HPC systems must be limited to COTS clusters and their equivalents. While the first issue may appear silly to some, there is a real threat to HPC and supercomputing as a goal and discipline, with some respected colleagues publicly stating that performance as a research goal is no longer important. This is in part driven by the excitement about the potential of the Internet, Web, and Grids, which are perceived as a more attractive and lucrative area of pursuit than HPC systems development. There is also an apparent malaise derived from the perception of a small and shrinking HPC market, the Moore’s law juggernaut, lack of funding, the diminishing glamour, and the poor track record of such research in the past. For this reason, where HPC is really needed, both industry and academia in many cases perceive clusters, including but not limited to Beowulf-class systems, to be an easy, relatively low-cost way out, with short-term difficulties to be rectified to an adequate degree, it is presumed, by future developments in distributed system software.

Our work with Beowulfs has shown us that in many cases this is an acceptable solution, and that the contributions being made by Becker and many others will reduce, although probably not close, the gap between cheap hardware and needed user environments. But my work on HTMT and Petaflops-scale computing has revealed both the need for and the opportunity of devising innovative new structures for attacking major computational challenges at performance levels orders of magnitude beyond what is being implemented today. The early work by IBM on its BlueGene project suggests the same conclusions. Solutions to problems of controlled fusion, molecular protein folding and drug design, high-confidence climate modeling, complex aerospace vehicle design optimization, and brain modeling, while perhaps not as enticing as real-time video games and e-commerce, would nonetheless revolutionize human existence in the 21st century.

From a technical perspective, the dual challenges of good price-performance for scalable systems and latency management for acceptable system efficiency are matched by the vaguer goals of programmability and generality. These are nothing new in the field of parallel processing, but their impact is of increasing significance as system scale extends beyond 10,000 coarse-grain nodes (e.g. ASCI White) or even a million fine-grain nodes (e.g. IBM BlueGene) and as more complex interdisciplinary applications are pursued. Cost is important, and the need to devise structures other than COTS cluster techniques that can be realized at low cost is critical. PIM is one very real possibility here, but the architectures, while retaining simplicity, must be advanced well beyond current examples.

In my view (and others may disagree), dynamic adaptive resource methodologies, most likely exploiting PIM smart memories, may address the key problems of latency (perhaps through percolation), overhead, and load balancing while simplifying both hardware and software development. But in the long term, even as it pains me to say so, I see a need for advanced parallel languages that are not constrained by assumptions of conventional underlying hardware components and organizations. I believe a new decision-tree model for resource management is required; one that revises the notion of what a computer knows and when it knows it in determining the allocation of resources to tasks in time and space. These questions are both significant and tantalizing. It remains only for the combined high performance research community to revitalize its commitment to their pursuit and ultimate resolution.
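
One way to picture the kind of resource-to-task decision Sterling alludes to is a placement rule driven by what the system knows about a task’s data footprint. The rule and thresholds below are invented purely for illustration, not drawn from HTMT or any proposed model:

    /* Toy placement decision: keep low-locality, data-intensive work
       in the smart memory; ship compute-bound work to the fast
       processor. The 1.0 flops-per-byte threshold is invented. */
    #include <stdio.h>

    enum where { MAIN_PROCESSOR, SMART_MEMORY };

    enum where place(double bytes_touched, double flops)
    {
        double intensity = flops / bytes_touched; /* flops per byte */
        return intensity < 1.0 ? SMART_MEMORY : MAIN_PROCESSOR;
    }

    int main(void)
    {
        printf("stream-like task  -> %s\n",
               place(1e9, 1e8) == SMART_MEMORY ? "smart memory"
                                               : "main processor");
        printf("compute-bound task -> %s\n",
               place(1e6, 1e9) == SMART_MEMORY ? "smart memory"
                                               : "main processor");
        return 0;
    }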

HPCwire: How would you characterize the current interrelationship between national policy, corporate policy, and leading-edge HPC research? Should this be modified? If so, how?

STERLING: It is difficult to characterize “national policy” as it pertains to HPC research. The PITAC recommendations on future directions in HPC research were clear and specific, and I adhere to them both in principle and in their explicit proposed actions. These recommendations were not addressed by the Federal agencies for FY01, although many other important areas in IT considered by PITAC did receive attention. There is real interest in many quarters to do so, but at the moment aggressive pursuit of these ideals remains dormant. Corporate policy quite reasonably focuses on the sweet spot of the market, and the cluster approach lends itself well to this strategy, providing a degree of scalability without investing in unique systems for the high end. The risks are too high for industry alone to attack them while the perception is that the market is too small to provide adequate financial return. My guess is that the latent market is much greater, but not at the price-performance point of the older supercomputers and MPPs. Of course, many applications today routinely run on the desktop at performance levels that supercomputer applications required a decade ago. That should be a strong signal that the opportunities for much greater performance systems are plentiful. However, the community either has not gotten the hint, or rather it uses the same experience to justify waiting: Petaflops will come to those who wait; Moore or less.

From my previous comments, yes, I believe the apparent policies and interrelationships should be modified. The partnership between national policy and corporate policy in HPC research should be one of mutual and complementary strengths. The DOE ASCI program performed well in working with industry to develop pace-setting high performance systems through the extension of conventional means, and in so doing demonstrated the value of advanced capability systems for exploring the frontiers of science and technology through computation. But no counterbalancing, non-incremental advanced research of significance has been undertaken or sponsored to explore over-the-horizon regimes. Yes, quantum computing and other exotic forms of processing are being supported under basic research. But there is a major gap between these and today’s conventional distributed systems. I would like to see the PITAC recommendations in HPC carried out, and a partnership developed between industry, government, and the academic community to explore innovative opportunities and reduce the risk, so that a truly new class of parallel computer system can emerge to escape the current cul-de-sac in which we are trapped and deliver a revolutionary new tool with which to build the world habitat of the 21st century.
