Forget Zettascale, Trouble is Brewing in Scaling Exascale Supercomputers

By Agam Shah

November 14, 2023

In 2021, Intel famously declared its goal to get to zettascale supercomputing by 2027, which would mean scaling today’s Exascale computers by a factor of 1,000.

Fast forward to 2023, and attendees at the Supercomputing 2023 conference, which is being held in Denver, said it is a challenge to scale performance even within the Exaflops range.

The move to CPU-GPU architectures has helped scale performance, but other concerns, such as architectural limitations and sustainability issues, are making further gains difficult, officials at Top500 said.

In fact, at the current rate, supercomputers may not reach 10 Exaflops of performance by 2030. Performance growth has also fallen off over the last couple of years, despite new Exascale systems entering the Top500 list.
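To put that projection in perspective, a back-of-the-envelope calculation helps. The sketch below assumes simple compound annual growth and takes Frontier's roughly 1.1 Exaflops in 2023 as the starting point. The function name `required_annual_growth` is purely illustrative, and this is not how Top500 produces its own projections.

```python
def required_annual_growth(start_flops: float, target_flops: float, years: float) -> float:
    """Return the compound annual growth rate needed to go from
    start_flops to target_flops in the given number of years.

    A result of 0.5 means performance must grow 50% per year.
    """
    if start_flops <= 0 or target_flops <= 0 or years <= 0:
        raise ValueError("all arguments must be positive")
    return (target_flops / start_flops) ** (1.0 / years) - 1.0


if __name__ == "__main__":
    # Assumption: top system at about 1.1 Exaflops in 2023, target of 10 Exaflops in 2030.
    rate_10ef = required_annual_growth(1.1, 10.0, 2030 - 2023)
    print(f"10 Exaflops by 2030 needs about {rate_10ef:.0%} growth per year")

    # Assumption: a 1,000x jump (Exascale to zettascale) between 2021 and 2027.
    rate_zetta = required_annual_growth(1.0, 1000.0, 2027 - 2021)
    print(f"1,000x in six years needs about {rate_zetta:.0%} growth per year")
```

Under these assumptions, reaching 10 Exaflops by 2030 would take roughly 37% growth every year for seven years, while a 1,000-fold jump in six years would require performance to more than triple annually.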

“Unless we change how we approach computing, our growth in the future can be substantially smaller than they have been in the past,” said Erich Strohmaier, cofounder of Top500, during a press conference.

The end of two fundamental scaling principles, Dennard Scaling and Moore’s Law, has created challenges in scaling performance.

“The end of Moore’s Law is coming, there’s no doubt about that,” Strohmaier said.

The number of systems submitted to Top500 has progressively declined since 2017. The average performance of systems has also been declining over the last couple of years.

The slowdown is also related to the inability to grow system sizes due to architectural limitations and sustainability issues.

“Our data centers cannot grow much larger than they are. So we cannot increase the numbers of … CPU sockets,” Strohmaier said.

Optical I/O has been identified as a technology to help reach zettascale. However, a U.S. Department of Energy (DoE) official said that optical I/O was not on their roadmap because of the cost and the energy required to operate optical I/O to connect circuits over short distances at the motherboard level. By comparison, copper is cheap and plentiful.

HPC systems are also staying in service longer. The average age of a Top500 system was about 15 months in 2018-2019 and had doubled to 30 months by 2023.

SC 23 Top500 Average System Age

The top seven systems on the November Top500 list have as much performance as the remaining 493. The upcoming systems will create an even bigger divide, with an even higher ratio of performance coming from the top 10 systems.
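The concentration figure above can be computed for any ranked list of benchmark scores. The sketch below is a minimal illustration: the function name `top_k_matching_rest` is hypothetical, and the sample scores are made-up toy values, not Top500 data.

```python
def top_k_matching_rest(scores):
    """Return the smallest k such that the k highest scores sum to at
    least as much as all remaining scores combined.

    Returns 0 for an empty list.
    """
    ranked = sorted(scores, reverse=True)
    total = sum(ranked)
    running = 0.0
    for k, score in enumerate(ranked, start=1):
        running += score
        # Compare the top-k sum against the sum of everything below rank k.
        if running >= total - running:
            return k
    return 0


if __name__ == "__main__":
    # Toy values only: two large systems followed by four small ones.
    toy_scores = [10, 8, 1, 1, 1, 1]
    print(top_k_matching_rest(toy_scores))
```

A smaller result for a list of the same length means performance is more concentrated at the top, which is the trend the Top500 officials expect to continue.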

At the same time, some exciting new Exascale machines will be making their way to the Top500 list. There may be many lead changes as multiple supercomputers come online and are optimized to perform faster.

Two new systems, Aurora this year and El Capitan next year, could take the top Top500 positions in the coming years. The systems will scale to two Exaflops.

There was no change in the leader of the Top500 supercomputing list issued this week, with Frontier at Oak Ridge National Laboratory retaining its top spot. The system delivered 1.1 Exaflops of performance and remained the only Exascale system on the list.

“I would say that the machine is really stable right now, and it’s performing exceptionally well,” said Lori Diachin, project director for the Exascale Computing Project at the U.S. Department of Energy.

But Frontier could soon be replaced by the second-fastest system, Aurora, installed at Argonne National Laboratory. Aurora has only been partially benchmarked and delivered a performance of 585.34 petaflops. The system has Intel 4th Gen Xeon server CPUs, code-named Sapphire Rapids, and Data Center GPU Max chips, code-named Ponte Vecchio.

Argonne submitted benchmarks for half the system size, and its performance will only go up when fully benchmarked, Strohmaier said.

“It’s questionable if Frontier will stay the number one system for much longer,” Strohmaier said.

Diachin’s team has had limited access to the system since July and is seeing great performance.

“We’re really looking forward to getting full access to that system, hopefully later this month,” Diachin said.

The third Exascale supercomputer, El Capitan, will be deployed in mid to late 2024 at the Lawrence Livermore National Laboratory.

The system will likely take the top spot on Top500 when its benchmark is released, but it is unclear when that will happen.

“There’ll be a brief early science period for that machine before it’s turned over to classified use for stockpile stewardship for the NNSA,” Diachin said.

Additionally, many Top500-class Exaflop systems may be hiding in plain sight, especially in the cloud facilities of vendors that have not bothered to submit results. Google’s A3 supercomputer can accommodate up to 26,000 Nvidia H100 GPUs, but Google has not submitted any results for it.

But one submission, Microsoft’s Azure AI supercomputer called Eagle, unexpectedly landed in the third spot of this year’s Top500, and Nvidia’s bare metal Eos was in the ninth spot.

A past contributor, China, has gone off the map and isn’t submitting results to Top500. One submission for the Gordon Bell awards ran on a Chinese Exascale system, but that system’s performance was not submitted to the Top500.

Beyond raw horsepower, DoE’s Diachin is also trying new ways to scale performance within the current hardware limitations.

One such idea is using mixed precision and a wider implementation of accelerated computing. Also, DoE is looking at incorporating AI into large multiphysics models and enveloping that into classical computing to reach faster results.
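One well-known form of mixed precision in numerical computing is iterative refinement: do the expensive solve in low precision, then correct the answer using residuals computed in high precision. The article does not say which techniques DoE applies, so the NumPy sketch below is only an illustration of the general idea. The function name `solve_mixed_precision` and the test matrix are assumptions, and a production solver would factor the matrix once and reuse the factors rather than calling `np.linalg.solve` repeatedly.

```python
import numpy as np


def solve_mixed_precision(A, b, iters=5):
    """Solve A x = b using float32 solves plus float64 residual correction.

    A and b are expected in float64. The costly solves run in float32,
    and the cheap residual is evaluated in float64 to recover accuracy.
    """
    A32 = A.astype(np.float32)

    # Initial low-precision solution, promoted to float64 for refinement.
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)

    for _ in range(iters):
        # Residual in full precision: how far A x is from b.
        r = b - A @ x
        # Correction from another low-precision solve.
        dx = np.linalg.solve(A32, r.astype(np.float32))
        x = x + dx.astype(np.float64)
    return x


if __name__ == "__main__":
    # Hypothetical well-conditioned test system (diagonally boosted).
    rng = np.random.default_rng(0)
    n = 200
    A = rng.standard_normal((n, n)) + n * np.eye(n)
    b = rng.standard_normal(n)

    x_ref = np.linalg.solve(A, b)  # all-float64 reference
    x_lo = np.linalg.solve(A.astype(np.float32), b.astype(np.float32))
    x_mp = solve_mixed_precision(A, b)

    print("float32-only error: ", np.linalg.norm(x_lo - x_ref))
    print("mixed-precision error:", np.linalg.norm(x_mp - x_ref))
```

On well-conditioned problems like this one, the refined result approaches the accuracy of the all-float64 solve while most of the arithmetic runs in float32. That trade is the appeal on accelerators, where lower-precision arithmetic is typically much faster.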

“From our perspective, one of the things we’re really looking toward is some of these algorithmic improvements and broader incorporation of those kinds of technologies to accelerate applications while keeping the power footprint manageable,” Diachin said.

Many labs are also looking at their old code written in languages like Fortran 77 and rewriting and recompiling it for accelerated computing environments.

This approach “will help future-proof many of these codes by extracting layers that are specific to different kinds of hardware and allowing them to be more performance portable with less work,” Diachin said.

Hardware and algorithmic improvements delivered speedups mostly in the 200x to 300x range and “as much as even several 1000 times improvement,” Diachin said.

The labs typically rely on E4S, or the Extreme-scale Scientific Software Stack, which comprises debugging, runtime, math, visualization, and compression tools. It has more than 115 packages and is being pushed out to academia, scientific organizations, and other U.S. government agencies.

SC23 Top500 List Highlights