The New Limits on High Performance Computing

By John L. Gustafson, PhD

June 30, 2006

Seymour Cray once quipped that he never made any money until he became a plumber.

We laugh because it's a case of very low-tech trumping very high-tech. This kind of irony is plentiful lately, as PhDs steeped in computational science find themselves wrestling, not with whether to interchange the level of loop nesting, or whether conjugate gradient solvers are superior to direct solvers, but with whether their power company is charging 23 cents per kilowatt-hour or merely 12 cents. Something profound has happened to the art of architecting supercomputing systems, and it has a lot more to do with issues from the era of GE's Thomas Edison than from the era of JPL's Thomas Sterling.

The Limit of Power Dissipation

Suppose you dream of building the world's first system capable of achieving a petaflop on Linpack (one quadrillion floating-point operations per second solving dense systems of linear equations). The excitement builds as you sketch the possibility on the restaurant placemat — a few thousand nodes, each at a sustained speed of a few teraflops — and then you note what that kind of cluster will require in the way of power dissipation and real estate (square footage), and the “gulp” sound can be heard all over the restaurant.

Walk into any Wal-Mart and you will have no problem finding a Windows-based PC (that you can reconfigure to run Linux in a matter of hours) for a few hundred dollars. If you put it in a cluster, how much will you spend on electricity over its useful life of about three years?

Over a thousand dollars. Hmm… can that be right? If a server consumes just 300 watts and you keep it on for a year, how much does the electricity cost? At 12 cents per kilowatt-hour, that comes to about $315 per year, and over a three-year life, once you add the power needed to remove the heat, the total comfortably exceeds $1,000. Google's biggest single line-item expense is the power bill for their enormous server farms.
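
For the skeptical, here is that arithmetic as a small Python sketch. The 300-watt draw, 12-cent rate, three-year life, and 20 percent cooling overhead are the round numbers used in this article, not measurements of any particular machine.

    # Back-of-the-envelope electricity cost for one cluster node.
    # Assumptions (from the figures in this article): 300 W continuous draw,
    # $0.12 per kWh, three-year life, 20% extra power to remove the heat.
    power_kw = 0.300
    hours_per_year = 24 * 365          # 8,760 hours
    rate_per_kwh = 0.12

    annual_cost = power_kw * hours_per_year * rate_per_kwh
    lifetime_cost = annual_cost * 3 * 1.20

    print(f"Annual electricity: ${annual_cost:,.0f}")      # about $315
    print(f"Three-year total:   ${lifetime_cost:,.0f}")    # about $1,135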

Electricity ranges from about 5 cents per kilowatt-hour in places like DOE's Pacific Northwest National Laboratory (which sits next to a surplus of nuclear power as well as plenty of hydroelectric power) to about 23 cents per kilowatt-hour at the Maui High Performance Computing Center (MHPCC). As the price of hardware falls with Moore's Law, the price of the energy to flip all those ever-denser bits keeps rising with inflation and the price of a barrel of oil.

The chip and system suppliers have definitely gotten the message. Intel and AMD have moved away from advertising their gigahertz rates and made quite a point about performance-per-watt. Having spent years convincing the typical consumer that speed and clock frequency are the same thing, their marketing departments now face the task of preaching an entirely new figure of merit. It might not be a tough sell to anyone who's run out of laptop battery power on a long plane trip, or has noticed how laptop computers have almost gotten too hot to keep on your lap for very long.

The POWER architecture folks at IBM may be amused by all this, since for the last several years the POWER-based entries at or near the top of the TOP500 list have had very high performance per watt, and that did not happen by luck. POWER-based systems have long left people scratching their heads when they win benchmarks despite running at much lower clock rates than their competition.

But speaking of POWER-based designs, the Xbox 360 has set some records for the wattage consumed by a game console — about 170 watts. That is more than double the wattage of the original Xbox, and far more than the early video game consoles. As with supercomputing, the push for performance has resulted in a collision with the practical limits of power supply.

Unfortunately, in multi-gigahertz chips, the clock is itself a big part of the problem. Pumping the clock signal up and down takes 25 to 50 percent of the power sent to the chip. And most of those frenetic cycles are simply spent waiting for some much slower part of the system, like the dynamic RAM or the disk or the network. It is a lot like revving a car engine at a stoplight. Chip designers used to think power consumption simply went up linearly with clock speed, but in practice it rises much faster than that.
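
To see why, recall that the dynamic power of a CMOS chip goes roughly as capacitance times the square of the supply voltage times the clock frequency, and that the supply voltage generally must rise along with the clock to preserve timing margins. The Python sketch below makes the point with purely illustrative constants; none of them comes from a real datasheet.

    # Why power rises faster than linearly with clock rate (illustrative only).
    # Dynamic power is roughly P = a * C * V^2 * f; if V must scale with f,
    # then P grows roughly as f cubed.
    def dynamic_power(freq_ghz, activity=0.2, capacitance=1.0, volts_per_ghz=0.4):
        v = volts_per_ghz * freq_ghz   # assumed supply voltage at this clock
        return activity * capacitance * v**2 * freq_ghz   # arbitrary units

    for f in (1.0, 2.0, 3.0):
        ratio = dynamic_power(f) / dynamic_power(1.0)
        print(f"{f:.0f} GHz -> {ratio:.0f}x the power of 1 GHz")
    # Prints 1x, 8x, 27x: cubic growth, not linear.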

For the physicists among us, there is an interesting consequence of making performance per watt a figure of merit. Performance is computational work per second. A watt is a unit of physical work (one joule) per second. (If you do not have a gut feel for what a “joule” is, 3.6 million of them make one kilowatt-hour.) So floating-point operations per second divided by watts is simply floating-point operations per joule. As we approach the physical limits of computation, the metrics used by computing people and those used by physicists may converge.
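
A toy example, with made-up numbers chosen only to show that the units work out:

    # Performance per watt is the same thing as operations per joule.
    # The node figures below are hypothetical, for illustration only.
    sustained_flops = 2.0e12     # 2 Tflop/s
    power_watts = 500.0          # 500 W = 500 joules per second

    ops_per_joule = sustained_flops / power_watts
    print(f"{ops_per_joule:.1e} floating-point operations per joule")   # 4.0e9

    # One kilowatt-hour is 3.6 million joules:
    print(f"{ops_per_joule * 3.6e6:.1e} operations per kilowatt-hour")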

For the largest supercomputing clusters, the procurement planners now must entangle a power budget with the hardware budget. It is a bit simplistic to equate the power budget with money. It usually does not work that way. The facility has some maximum ability to supply power to the computer room, and increasing it by several megawatts is a major engineering project. The money for doing that comes from a different pot, if the money is even obtainable for that purpose.

At an early hypercube conference, Intel discovered there was not enough electric power in the conference area of the Knoxville Hilton to power their iPSC system, so they rigged a rather audible diesel generator in the courtyard to get their demonstrations running. At a presentation later that day, the Intel speaker got a catcall from the audience: “Hey, how many megaflops per gallon are you getting?” The laughter lasted several minutes, but in 2006, it is no joke for those seeking the leading edge of HPC.

The Limit of Floor Space

When I first proposed to LANL that they fill a building with racks of microprocessor-based units to reach new supercomputing performance, they just about laughed me out of the room. That was in 1984, and supercomputing floor space was whatever it took for a love seat-shaped mainframe from Cray Research and some boxes full of proprietary disk drives, not an entire building.

Now it is generally accepted that if you want to do supercomputing, you will be filling a building with racks. The issue is how many you can accommodate.

Even if your facility has enough floor space, the communication fabric might limit the maximum distance between racks. And even if the electrical specification does not limit the distance, the speed of light will start to add to the message-passing latency between cabinets if they get too far apart. MPI latency has lately been about one microsecond, which so far has masked plenty of speed-of-light delay; light in a vacuum travels about a thousand feet in a microsecond. But for a system that occupies over 10,000 square feet, connecting the opposite corners of a 100 by 100 foot room will, at best, be done with an optical cable in which signals travel at only 70 percent of the speed of light. Moreover, the cable does not run line-of-sight between the corners, so allow perhaps 180 feet of cable between the two points. That adds roughly 250 nanoseconds to the MPI latency, a significant penalty for latency-sensitive applications.
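
The arithmetic behind that estimate, using the same assumptions stated above (signals at 70 percent of the speed of light, 180 feet of routed cable):

    # Extra one-way latency from running cable between far corners of a
    # 100 x 100 foot machine room, per the figures in the text.
    feet_per_us_vacuum = 983.6       # light in a vacuum, feet per microsecond
    cable_fraction_of_c = 0.70       # signal speed in the optical cable
    cable_feet = 180.0               # routed length, not line-of-sight

    delay_us = cable_feet / (feet_per_us_vacuum * cable_fraction_of_c)
    print(f"Added delay: {delay_us * 1000:.0f} ns")   # roughly 260 ns
    # Against an MPI latency of about 1 microsecond, that is a 25 to 30
    # percent increase before the message even reaches the far cabinet.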

The first thing to do, of course, is to pack each cabinet with as much computing capability as possible to reduce the number of racks. It may be easy to use a 25-inch rack instead of a 19-inch rack, but increasing the height introduces an objection from an unexpected source: those concerned with safety. Taller racks have a greater chance of injuring people if they tip over, and any system that asks administrative staff to stand on stools or ladders incurs a nasty reaction from the insurance company.

Packing more general-purpose computers into a rack, however, intensifies the heat generated. That forces the ugly choice between using more floor space and introducing liquid cooling. Air cooling hits its limit at about 70 watts per liter near sea level. For a place like Los Alamos National Laboratory, at 7,500 feet above sea level, forced-air cooling is only about half as effective. Hence, air-cooled racks hit a limit of about 20 to 30 kilowatts. Besides blowing the heat away from the processors, blowing it out of the computing room is becoming arduous as well: the Thunder system at Lawrence Livermore National Laboratory requires an air current under the raised floor moving at 60 miles per hour. A typical guideline is that the power to remove the heat adds 20 percent to the total power required for the system.
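
As a rough illustration of what those rules of thumb imply for a whole machine room, here is a sketch with an assumed 200 racks at 25 kilowatts apiece; the rack count is hypothetical and chosen only to show the scale involved.

    # Facility power estimate for an air-cooled cluster (illustrative).
    racks = 200                  # hypothetical machine-room size
    kw_per_rack = 25.0           # near the practical air-cooled ceiling
    cooling_overhead = 0.20      # heat removal adds roughly 20%

    it_power_mw = racks * kw_per_rack / 1000.0
    facility_power_mw = it_power_mw * (1.0 + cooling_overhead)
    print(f"Compute load:        {it_power_mw:.1f} MW")        # 5.0 MW
    print(f"Including cooling:   {facility_power_mw:.1f} MW")  # 6.0 MW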

Finally, the limit of floor space is like the limit of power dissipation, in that it does not simply translate into cost. The floor space may not be available, at any price. At a national laboratory, creating new building space is literally a line item for Congress to approve. In financial centers like Manhattan and London, where HPC drives financial modeling, the space is not only expensive but unlikely to be for sale right where it is needed.

So, What Should We Do?

Things are not as dismal as they sound. Perhaps the most visible ray of hope is the emergence of multi-core processor chips running at lower clock speeds. Every time a microprocessor vendor doubles the number of processing elements while lowering the clock speed, power consumption drops yet effective performance rises. The ClearSpeed chip is currently the most radical example of this, with 96 processing elements running at only 250 megahertz. The result is a chip that is simultaneously the fastest at 64-bit floating-point arithmetic (about 25 gigaflops sustained) yet one of the lowest in power consumption (under 10 watts). The ClearSpeed chip is a true coprocessor and depends on a general-purpose processor as a host. If you need double-precision floating-point performance, you can add it without adding to the facilities burdens of power demand or floor space.
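
Using the figures just quoted, and a deliberately hypothetical general-purpose node for comparison, the performance-per-watt gap is easy to see:

    # Performance per watt (equivalently, Gflops per joule of energy).
    # ClearSpeed figures are from the text; the generic node is hypothetical.
    chips = {
        "ClearSpeed coprocessor": (25.0, 10.0),    # (sustained Gflops, watts)
        "Generic server node":    (10.0, 300.0),   # assumed for comparison
    }
    for name, (gflops, watts) in chips.items():
        print(f"{name:24s}: {gflops / watts:.2f} Gflops per watt")
    # ClearSpeed coprocessor  : 2.50 Gflops per watt
    # Generic server node     : 0.03 Gflops per watt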

The new frontier in chip design is finding clever ways to compute with less energy. A startup company named MultiGig Inc. has a good example of the coming innovations to reduce power demand — a terahertz clock generator that slashes the power required to run the clock on digital chips, through a differential transmission line twisted into a Möbius loop. They and other chip designers are looking at adiabatic switching technologies that promise dramatic reductions in power per gate. DARPA's HPCS program has helped Sun Microsystems explore a method for inter-chip communication that promises to reduce energy per bit moved by over two orders of magnitude. This is exciting stuff, and it will directly benefit everyone from the consumer of video games to the scientists and engineers pursuing the fastest computing possible.

Using chips optimized for an application regime in combination with standard, generic processors is something we can do right now to mitigate facilities costs. This “hybrid” computing approach is very reminiscent of hybrid cars: two kinds of motors for two kinds of driving, yielding much higher efficiency than trying to use one kind of engine for everything. The resurgence of interest in coprocessors may remind people of the old days of attached processors from Floating Point Systems, but the FPS approach never had anything to do with saving electricity or space in the computing room. Coprocessors are no longer separate cabinets; they now arrive as plug-in boards that occupy an otherwise empty expansion slot in a server or workstation.

The Tokyo Institute of Technology has used this approach to create the fastest supercomputer in Asia with a power budget of less than a megawatt. Los Alamos intends to create a hybrid petaflop computer with its Roadrunner project. You do not have to become a plumber to liquid-cool a supercomputer in 2006; you simply use coprocessors so you do not generate all that heat in the first place!

—–

John Gustafson, CTO, HPC, ClearSpeed Technology Inc.

John Gustafson joined ClearSpeed in 2005 after leading high performance computing efforts at Sun Microsystems. He has 32 years of experience using and designing compute-intensive systems, including the first matrix algebra accelerator and the first commercial massively-parallel cluster while at Floating Point Systems. His pioneering work on a 1024-processor nCUBE at Sandia National Laboratories created a watershed in parallel computing, for which he received the inaugural Gordon Bell Award. He also has received three R&D 100 Awards for innovative performance models, including the model commonly known as Gustafson's Law, or Scaled Speedup. He received his B.S. degree from Caltech and his M.S. and Ph.D. degrees from Iowa State University, all in Applied Mathematics.
