In 2021, Intel famously declared its goal to get to zettascale supercomputing by 2027, or scaling today’s Exascale computers by 1,000 times.
Fast-forward to 2023: at the Supercomputing 2023 conference, held in Denver, attendees said the challenge now is scaling up performance even within the Exaflop range.
The move to CPU-GPU architecture has helped scale performance, but other concerns — such as architectural limitations and sustainability issues — are making it difficult to scale performance, officials at Top500 said.
In fact, at the current rate, supercomputers may not reach 10 Exaflops of performance by 2030. Performance growth has also fallen off over the last couple of years despite new Exascale systems entering the Top500 list.
“Unless we change how we approach computing, our growth in the future can be substantially smaller than they have been in the past,” said Erich Strohmaier, cofounder of Top500, during a press conference.
The end of two fundamental trends — Dennard scaling and Moore’s Law — has created challenges in scaling performance.
“The end of Moore’s Law is coming, there’s no doubt about that,” Strohmaier said.
The number of systems submitted to Top500 has progressively declined since 2017. The average performance of systems has also been declining over the last couple of years.
The slowdown is also related to the inability to grow system sizes due to architectural limitations and sustainability issues.
“Our data centers cannot grow much larger than they are. So we cannot increase the numbers of … CPU sockets,” Strohmaier said.
Optical I/O has been identified as a technology to help reach zettascale. However, a U.S. Department of Energy (DoE) official said that optical I/O was not on their roadmap because of the cost and the energy required to operate optical I/O to connect circuits over short distances at the motherboard level. By comparison, copper is cheap and plentiful.
HPC systems are also staying in service longer. The average age of a Top500 system was about 15 months in 2018-2019 and doubled to 30 months in 2023.
The top seven systems on the November Top500 list have as much performance as the remaining 493 combined. The upcoming systems will create an even bigger divide, with an even higher share of performance coming from the top 10 systems.
At the same time, some exciting new Exascale machines will be making their way to the Top500 list. There may be many lead changes as multiple supercomputers come online and are optimized to perform faster.
Two new systems — Aurora this year and El Capitan next year — could take the top Top500 positions in the coming years. Both are expected to scale to two Exaflops.
There was no change at the top of the Top500 supercomputing list issued this week, with Frontier at Oak Ridge National Laboratory retaining its top spot. The system delivered 1.1 Exaflops of performance and remained the only Exascale system on the list.
“I would say that the machine is really stable right now, and it’s performing exceptionally well,” said Lori Diachin, project director for the Exascale Computing Project at the U.S. Department of Energy.
But Frontier could soon be overtaken by the second-fastest system, Aurora, installed at Argonne National Laboratory. Aurora delivered 585.34 petaflops and has only been partially benchmarked. The system pairs Intel’s 4th Gen Xeon server CPUs, code-named Sapphire Rapids, with Data Center GPU Max chips, code-named Ponte Vecchio.
Argonne submitted benchmarks for half the system, and its performance will only go up when it is fully benchmarked, Strohmaier said.
“It’s questionable if Frontier will stay the number one system for much longer,” Strohmaier said.
Diachin’s team has had limited access to the system since July and is seeing great performance.
“We’re really looking forward to getting full access to that system, hopefully later this month,” Diachin said.
The third Exascale supercomputer, El Capitan, will be deployed in mid to late 2024 at the Lawrence Livermore National Laboratory.
The system will likely take the top spot on the Top500 when its benchmark is released, but it is not clear when that will happen.
“There’ll be a brief early science period for that machine before it’s turned over to classified use for stockpile stewardship for the NNSA,” Diachin said.
Additionally, many Top500-class Exaflop systems may be hiding in plain sight, especially in the cloud facilities of vendors who have not bothered to submit results. Google’s A3 supercomputer can accommodate up to 26,000 Nvidia H100 GPUs, but Google has not submitted any results.
But one submission, Microsoft’s Azure AI supercomputer called Eagle, unexpectedly landed in the third spot of this year’s Top500, and Nvidia’s bare-metal Eos took the ninth spot.
A past contributor, China, has gone off the map and is no longer submitting results to the Top500. One Gordon Bell award submission ran on a Chinese Exascale system, but no performance results for that system were submitted to the Top500.
Beyond raw horsepower, DoE’s Diachin is also exploring new ways to scale performance within current hardware limitations.
One such idea is using mixed precision and wider adoption of accelerated computing. The DoE is also looking at incorporating AI into large multiphysics models and folding that into classical computing workflows to reach results faster.
“From our perspective, one of the things we’re really looking toward is some of these algorithmic improvements and broader incorporation of those kinds of technologies to accelerate applications while keeping the power footprint manageable,” Diachin said.
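The mixed-precision idea can be sketched in a few lines: do the expensive solve in low precision, where accelerators are much faster, then recover full accuracy with cheap high-precision refinement steps. The NumPy sketch below shows classic iterative refinement on a made-up, well-conditioned system; it is an illustration of the general technique, not DoE’s actual implementation.

```python
import numpy as np

# Illustrative only: a small, well-conditioned linear system Ax = b.
rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n)) + n * np.eye(n)   # diagonally dominant
b = rng.standard_normal(n)

# Do the expensive solve in float32 (fast on accelerators) ...
A32 = A.astype(np.float32)
x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)

# ... then refine in float64: each cheap step solves only for the residual.
for _ in range(3):
    r = b - A @ x                                  # residual in high precision
    x += np.linalg.solve(A32, r.astype(np.float32)).astype(np.float64)

# Relative residual ends up near double-precision accuracy even though
# every factorization/solve ran in single precision.
print(np.linalg.norm(b - A @ x) / np.linalg.norm(b))
```

The payoff on real hardware comes from the low-precision solve running on tensor or matrix units at several times the float64 rate, while the refinement loop costs only a few matrix-vector products.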
Many labs are also looking at their old code written in languages like Fortran 77 and rewriting and recompiling it for accelerated computing environments.
This approach “will help future-proof many of these codes by extracting layers that are specific to different kinds of hardware and allowing them to be more performance portable with less work,” Diachin said.
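One way to read “extracting layers that are specific to different kinds of hardware” is a thin dispatch layer that hides backend-specific kernels behind a single portable interface, in the spirit of frameworks like Kokkos or RAJA. The sketch below is a hypothetical Python illustration with only a CPU backend registered, so it runs anywhere; a GPU backend (e.g. CuPy) would plug in under the same signature.

```python
import numpy as np

# Hypothetical sketch: hardware-specific kernels live behind one interface,
# so application code never mentions a particular device.
_BACKENDS = {}

def register(name):
    def deco(fn):
        _BACKENDS[name] = fn
        return fn
    return deco

@register("cpu")
def _saxpy_cpu(a, x, y):
    return a * x + y                    # NumPy on the host

# A GPU backend (e.g. CuPy) would register under "gpu" with the same
# signature; it is omitted here so the sketch runs on any machine.

def saxpy(a, x, y, backend="cpu"):
    """Portable entry point: application code calls this, not a backend."""
    return _BACKENDS[backend](a, x, y)

x = np.arange(4.0)
y = np.ones(4)
print(saxpy(2.0, x, y))                 # [1. 3. 5. 7.]
```

Porting to new hardware then means writing one new backend rather than touching every application code — which is the “less work” in the performance-portability pitch.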
Hardware and algorithmic improvements gave performance improvements mostly in the 200x to 300x range and “as much as even several 1000 times improvement,” Diachin said.
The labs typically rely on E4S, the Extreme-scale Scientific Software Stack, which comprises debugging, runtime, math, visualization, and compression tools. It has more than 115 packages and is being pushed out to academia, scientific organizations, and other U.S. government agencies.