The makers of the Aurora supercomputer, which is housed at the Argonne National Laboratory, gave some reasons why the system didn’t make the top spot on the Top500 list of the fastest supercomputers in the world.
At slightly over 1 exaflop of performance, Aurora remained in second place behind Frontier, which took the top spot at around 1.2 exaflops. Aurora is only the second system to pass the exaflop barrier.
The organizers seemed to prioritize the utility of the system over performance. Only time will tell whether the hundreds of millions of dollars in taxpayer money spent on the system were worth it.
ANL, HPE, and Intel, the main organizations behind the system, explained why Aurora isn’t complete and answered benchmarking questions.
Aurora is built on a blade Intel calls the “Exascale Compute Blade,” which pairs the 52-core Xeon CPU Max with the Intel Data Center GPU Max.
The Benchmark Isn’t Complete
The Aurora system is still being installed, and more performance can be squeezed out of the system.
The High-Performance LINPACK run covered only about 80% to 90% of the overall system, so a run on the complete system may still be able to topple Frontier from the top spot.
Aurora’s theoretical peak performance was estimated to be 2 exaflops. The measured real-world performance is about 55% of this, which is in line with the 50% to 70% achieved on other supercomputers.
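The relationship between the peak and measured figures is simple arithmetic. A minimal Python sketch, assuming the rounded numbers quoted above (a 2-exaflop theoretical peak and roughly 55% efficiency); the function names are illustrative only:

```python
def sustained_exaflops(peak_exaflops: float, efficiency: float) -> float:
    """Sustained (measured) performance implied by a peak figure and an efficiency ratio."""
    return peak_exaflops * efficiency


def efficiency(measured_exaflops: float, peak_exaflops: float) -> float:
    """Fraction of the theoretical peak that a benchmark run actually delivers."""
    return measured_exaflops / peak_exaflops


# Rounded figures from the article, not official Top500 values.
aurora_peak = 2.0
aurora_efficiency = 0.55

implied = sustained_exaflops(aurora_peak, aurora_efficiency)
print(f"Implied sustained performance: {implied:.2f} exaflops")
print(f"Within the typical 50%-70% range: {0.50 <= aurora_efficiency <= 0.70}")
```

With these inputs the implied sustained figure is about 1.1 exaflops, consistent with a result slightly over 1 exaflop.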
The System Wasn’t Built for High-performance LINPACK Runs
The hardware choices for Aurora indicate the system wasn’t built expressly to top the High-Performance LINPACK benchmark.
Instead, the system was built to balance scientific and AI computing. It achieved 10.6 exaflops in mixed-precision computing in a benchmark run on a limited portion of the system.
Scientific computing is shifting toward mixed-precision computing, and ANL, HPE, and Intel are looking ahead with Aurora. At the same time, Aurora meets the needs of conventional scientific applications that require double-precision computing.
The Aurora builders deliberately decided not to include processing units that drive up the main Top500 benchmark.
The System Has More AI Hardware Parts
Aurora’s designers chose to dedicate more silicon area and power budget to AI and mixed-precision hardware than Frontier’s designers did.
For example, the Intel GPU in Aurora, called Ponte Vecchio, does not have dedicated matrix engines for FP64 (double-precision computing). By comparison, AMD’s MI250X in Frontier has dedicated parts for faster and more power-efficient FP64 matrix math calculations.
The dedicated FP64 engines give MI250X an advantage in LINPACK benchmarking. However, most scientific calculations cannot take advantage of MI250X’s double-precision matrix units.
“It was a deliberate design decision to not use silicon for a matrix unit for double precision. We put that extra silicon into accelerating lower precision. In bfloat16, for example, we have a lot more performance. So that’s the technical reason,” said Rick Stevens, who leads the exascale efforts at Argonne National Laboratory.
That puts Aurora at a disadvantage in terms of power consumption when running high-performance LINPACK benchmarks, but Aurora gains more efficiency on low-precision computing.
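To make the precision trade-off concrete, here is a minimal Python sketch of what bfloat16 gives up relative to higher-precision formats. It assumes bfloat16 can be emulated by keeping only the top 16 bits of a float32 encoding (plain truncation, whereas real hardware typically rounds); the helper name is hypothetical and this is not how Aurora's GPUs are programmed:

```python
import struct


def to_bfloat16(x: float) -> float:
    """Emulate bfloat16 by truncating the float32 encoding to its top 16 bits.

    Assumption: plain truncation; real hardware typically rounds to nearest.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


# bfloat16 keeps 7 explicit mantissa bits, so steps smaller than 2**-7 near 1.0 are lost.
fine_step = 1.0 + 2.0 ** -10    # representable in FP64 (and FP32)
coarse_step = 1.0 + 2.0 ** -7   # smallest step above 1.0 that bfloat16 can hold

print(to_bfloat16(fine_step))    # 1.0
print(to_bfloat16(coarse_step))  # 1.0078125
```

The format carries far fewer significant digits, which is acceptable for many AI workloads but not for conventional simulations that need double precision.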
Stress Testing the System
The Aurora builders are focused on the long-term stability of Aurora, which takes precedence over the benchmarks.
“This is a brand new GPU architecture on the market at scale. And it’s somewhat unusual to build a very large system as the first instance of technology because not only do you have to understand individual GPUs and the microarchitecture and how to get performance out of that, but you’re also debugging at scale, which pushes hard on the reliability of individual components,” Stevens said.
ANL is running stress tests to reach the target MTBF (mean time between failures), with the goal of a system stable enough that components can run for a decade or more without failure.
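The point about reliability at scale can be illustrated with the textbook series-system approximation: if components fail independently at a constant rate, the system-level MTBF is the component MTBF divided by the number of components. A minimal Python sketch under that assumption; the component count and component MTBF below are hypothetical placeholders, not Aurora's figures:

```python
HOURS_PER_YEAR = 24 * 365


def system_mtbf_hours(component_mtbf_hours: float, n_components: int) -> float:
    """System MTBF assuming independent components with constant failure rates.

    Any single component failure counts as a system-level failure event.
    """
    if n_components <= 0:
        raise ValueError("n_components must be positive")
    return component_mtbf_hours / n_components


# Hypothetical inputs for illustration only.
component_mtbf = 10 * HOURS_PER_YEAR   # each part averages 10 years between failures
n_components = 60_000                  # placeholder count, not Aurora's actual inventory

hours = system_mtbf_hours(component_mtbf, n_components)
print(f"{hours:.2f} hours between failures somewhere in the system")
```

Under these placeholder numbers, a failure occurs somewhere in the machine roughly every hour and a half, even though each individual part averages a decade between failures.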
El Capitan Is Coming
Bets favor the El Capitan supercomputer at Lawrence Livermore National Laboratory ultimately taking the supercomputing title in the coming years. A portion of the system, which is still under construction, is ranked 47th on the Top500 and is expected to climb the charts with more benchmarking.
The 20-petaflop portion showcases AMD’s hardware: the 24-core 4th Gen Epyc CPU and Instinct MI300A GPU.