With the Linpack exaflops milestone achieved by the Frontier supercomputer at Oak Ridge National Laboratory, the United States is turning its attention to the next crop of exascale machines, some 5-10x more performant than Frontier. At least one such system is being planned for the 2025-2030 timeline, and the DOE is soliciting input from the vendor community to inform the design and procurement process.
A request for information (RFI) was issued today by the Department of Energy, seeking feedback from computing hardware and software vendors, system integrators, and other entities to assist the DOE National Laboratories in planning for next-gen exascale systems. The RFI says responses will “inform one or more DOE system acquisition RFPs, which will describe requirements for system deliveries in the 2025–2030 timeframe.” This could include the successor to Frontier (aka OLCF-6), the successor to Aurora (aka ALCF-5), the successor to Crossroads (aka ATS-5), the successor to El Capitan (aka ATS-6) as well as a future NERSC system (possibly NERSC-11). Note that of the “predecessor systems,” only Frontier has been installed so far.
Here’s an excerpt from the RFI:
“DOE is interested in the deployment of one or more supercomputers that can solve scientific problems 5 to 10 times faster – or solve more complex problems, such as those with more physics or requirements for higher fidelity – than the current state-of-the-art systems. These future systems will include associated networks and data hierarchies. A capable software stack will meet the requirements of a broad spectrum of applications and workloads, including large-scale computational science campaigns in modeling and simulation, machine intelligence, and integrated data analysis. We expect these systems to operate within a power envelope of 20–60 MW. These systems must be sufficiently resilient to hardware and software failures, in order to minimize requirements for user intervention. As the technologies evolve, we anticipate increased attention to resilience in other supercomputing system developments.”
While the RFI states a desired overall performance increase of 5-10x, the notice sharpens the estimate to systems of 10–20+ FP64 exaflops in the 2025+ timeframe and 100+ FP64 exaflops in the 2030+ timeframe, achieved “through hardware and software acceleration mechanisms.”
“This is roughly 8 times more than 2022 systems in 2026 and 64 times more in 2030,” the RFI states. For lower-precision AI, rates are expected to be at least 8 to 16 times the FP64 rates.
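To make the arithmetic behind those figures concrete, here is a minimal Python sketch that uses only the numbers quoted above. It rests on a few assumptions that are not in the RFI: that "times more" means a plain multiplier over 2022 systems, that growth between 2022 and the target year is smoothly exponential, and that the 100+ exaflops figure can be treated as a floor. The function and variable names are illustrative only.

```python
import math

# Figures quoted in the article from the RFI (FP64 exaflops).
# The 2030+ entry is a floor ("100+"), so low and high are set equal here.
FP64_TARGETS_EF = {"2025+": (10.0, 20.0), "2030+": (100.0, 100.0)}
AI_MULTIPLE = (8, 16)               # lower-precision rate as a multiple of FP64
GROWTH = {2026: 8.0, 2030: 64.0}    # "times more than 2022 systems"
BASE_YEAR = 2022


def doubling_time_months(factor, years):
    """Months per doubling, assuming smooth exponential growth (an assumption)."""
    return 12.0 * years / math.log2(factor)


def ai_range_ef(fp64_range, multiple=AI_MULTIPLE):
    """Lower-precision range: low FP64 x low multiple, high FP64 x high multiple."""
    low, high = fp64_range
    return low * multiple[0], high * multiple[1]


def implied_baseline_ef(target_ef, factor):
    """FP64 rate of a '2022 system' implied by a target and its multiplier."""
    return target_ef / factor


if __name__ == "__main__":
    for year, factor in GROWTH.items():
        months = doubling_time_months(factor, year - BASE_YEAR)
        print(f"{factor:.0f}x by {year}: one doubling every {months:.0f} months")

    for label, fp64_range in FP64_TARGETS_EF.items():
        low, high = ai_range_ef(fp64_range)
        print(f"{label}: lower-precision rates of roughly {low:.0f} to {high:.0f} exaflops")

    # Pairing the 2025+ range with the 2026 multiplier is also an assumption.
    for target in FP64_TARGETS_EF["2025+"]:
        base = implied_baseline_ef(target, GROWTH[2026])
        print(f"{target:.0f} EF / 8 implies a 2022 baseline of {base:.2f} EF")
```

Under these assumptions, both multipliers describe the same pace: 8x over four years and 64x over eight years each work out to one doubling every 16 months.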
A section on “mission need” stresses the importance of data-driven modeling and simulation to the nation’s science, energy and security priorities. “[T]he United States must continue to push strategic advancements in HPC – bringing about a grand convergence of modeling and simulation, data analytics, deep learning, artificial intelligence (AI), quantum computing, and other emerging capabilities – across integrated infrastructures in computational ecosystems,” the RFI states.
As such, “these systems are expected to solve emerging data science, artificial intelligence, edge deployments at facilities, and science ecosystem problems, in addition to the traditional modeling and simulation applications.”
The ideal future system will also be more agile, modular and extensible.
“We also wish to explore the development of an approach that moves away from monolithic acquisitions toward a model for enabling more rapid upgrade cycles of deployed systems, to enable faster innovation on hardware and software. One possible strategy would include increased reuse of existing infrastructure so that the upgrades are modular. A goal would be to reimagine systems architecture and an efficient acquisition process that allows continuous injection of technological advances to a facility (e.g., every 12–24 months rather than every 4–5 years),” asserts the RFI.
A key thrust of the DOE supercomputing strategy is the creation of an Advanced Computing Ecosystem (ACE) that enables “integration with other DOE facilities, including light source, data, materials science, and advanced manufacturing.”
“The next generation of supercomputers will need to be capable of being integrated into an ACE environment that supports automated workflows, combining one or more of these facilities to reduce the time from experiment and observation to scientific insight,” the document states.
The information collected in response to the RFI will support next-generation system planning and decision-making at Oak Ridge National Laboratory, Lawrence Berkeley National Laboratory, Lawrence Livermore National Laboratory, Argonne National Laboratory, Los Alamos National Laboratory, and Sandia National Laboratories. The labs will use the information to update their advanced system roadmaps and to draft future RFPs (requests for proposals) for those systems.
The RFI published today shares some hallmarks with the one issued by the Collaboration of Oak Ridge, Argonne and Livermore – aka CORAL – in 2012. However, a couple of people I spoke with at Oak Ridge National Laboratory last week said they don’t expect there to be another CORAL program, partly because the cadence is off on account of the rewritten Aurora contract. Delays and reconceptualizations moved that system from the CORAL-1 timeline to the CORAL-2 timeline.
The original CORAL contract called for three pre-exascale systems (~100-200 petaflops each) with at least two different architectures to manage risk. Only two systems – Summit at Oak Ridge and Sierra at Livermore – were completed in the intended timeframe, using nearly the same heterogeneous IBM-Nvidia architecture. CORAL-2 took a similar tack, calling for two or three exascale-class systems with at least two distinct architectures. The program is procuring two systems – Frontier and El Capitan – both based on a similar heterogeneous HPE AMD+AMD architecture. The redefined Aurora – which is based on the heterogeneous HPE Intel+Intel architecture – becomes the “architecturally-diverse” third system (although it technically still belongs to the first CORAL contract).