Early Access Systems at LLNL Mark Progress Toward El Capitan

Oct. 21, 2021 — Though the arrival of the exascale supercomputer El Capitan at Lawrence Livermore National Laboratory (LLNL) is still almost two years away, teams of code developers are busy working on predecessor systems to ensure critical applications are ready for Day One.

Delivered in February, the “RZNevada” early-access system is providing experts at the National Nuclear Security Administration (NNSA) labs (LLNL, Los Alamos and Sandia) and their counterparts at industry partners Hewlett Packard Enterprise (HPE) and Advanced Micro Devices Inc. (AMD) with their first look at the nodes and software stack, anticipating what El Capitan will eventually sport.

RZNevada, the second El Capitan “early-access” machine sited at LLNL, has served for months as a testbed for Livermore Computing and the NNSA Tri-Labs to port and develop applications in preparation for El Capitan’s arrival in 2023.

In addition to RZNevada, LLNL installed an earlier testbed system, “Hetchy,” that is being used by system administrators. More advanced EAS-3 testbeds will contain nodes with next-generation CPUs and GPUs, close to the technology that will be installed in the nation’s first exascale supercomputer at Oak Ridge National Laboratory (ORNL).

“Successful large-scale systems require extensive planning, from system design and implementation, to siting and, most importantly, use,” said LLNL’s Chief Technology Officer for Livermore Computing Bronis de Supinski. “This planning includes deployment of precursor systems that enable application teams to be ready as soon as the full-scale system is available. For this reason, the El Capitan project includes several generations of early-access systems, including RZNevada and three systems that will be architecturally similar to the ORNL Frontier system, and made available to our application teams early next year.”

When delivered, El Capitan will be NNSA’s first exascale (a quintillion floating-point operations per second) machine and is currently projected to be the world’s most powerful supercomputer. As with any new machine, it takes a village to get existing codes optimized for advancements in hardware and software environments, enlisting the efforts of hundreds of employees.

LLNL computational physicist David Richards heads the El Capitan Center of Excellence, a collaboration of developers and experts at the NNSA labs, HPE and AMD who work together on RZNevada to help guarantee applications will perform well on El Capitan from the get-go. “It’s a very tight relationship,” Richards said, and the center provides a mechanism for granting access to codes that HPE and AMD developers otherwise couldn’t see.

“The advantage of a Center of Excellence is that we get access to a lot of [non-disclosure agreement] material from the vendors,” Richards explained. “We get to have their experts give webinar presentations and provide information to our developers. Also, their developers get access to our codes; they can sit right next to each other and work on these codes to understand the performance bottlenecks and what can we do to resolve them.”

A precursor to the technology that will make up El Capitan, RZNevada has 24 advanced AMD CPU/GPU compute nodes, each containing one 64-core AMD EPYC 7702 CPU and one AMD Instinct MI100 accelerator. The hardware is connected via HPE’s Slingshot network. While the system tops out at just a few hundred teraflops, the focus of RZNevada wasn’t speed; the Lab specifically chose fewer GPUs to provide the most nodes for code teams to work with, according to El Capitan Integration Lead Adam Bertsch.
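The “few hundred teraflops” figure can be sanity-checked with simple arithmetic. The sketch below is illustrative only: the node and GPU counts come from the article, while the 11.5-teraflop double-precision peak for a single MI100 is an assumption taken from AMD’s published specifications, not from LLNL. It counts GPUs only and ignores both the CPUs and real-world efficiency.

```python
# Assumed figure (not from the article): AMD's published peak FP64 rate for one MI100.
MI100_PEAK_FP64_TFLOPS = 11.5

NODES = 24          # from the article
GPUS_PER_NODE = 1   # from the article


def system_peak_tflops(nodes: int, gpus_per_node: int, gpu_peak_tflops: float) -> float:
    """Theoretical GPU-only peak; ignores CPU contribution and achieved efficiency."""
    return nodes * gpus_per_node * gpu_peak_tflops


def fraction_of_exaflop(tflops: float) -> float:
    """One exaflop is 10**18 FLOP/s, which equals 10**6 teraflops."""
    return tflops / 1_000_000


peak = system_peak_tflops(NODES, GPUS_PER_NODE, MI100_PEAK_FP64_TFLOPS)
print(f"GPU-only peak: {peak:.0f} TFLOPS")
print(f"Share of one exaflop: {fraction_of_exaflop(peak):.4%}")
```

Under that assumption the GPU-only peak works out to 276 teraflops, consistent with the article’s description and a tiny fraction of an exaflop.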

“These early-access systems require a little bit of imagination to see how they relate to the final system, because they’re typically a lot smaller and the architecture looks a little different,” said Bertsch. “There’s a logical progression moving where we are to El Cap, but it’s certainly not the same stuff. That’s one of the things we have to deal with when we’re buying technology that doesn’t exist yet. We have to have this progression to move our applications from where we are to where we need to go.”

Bertsch said that, thanks to the Lab’s previous experience in refactoring applications for heterogeneous CPU/GPU systems with the IBM/NVIDIA supercomputer Sierra, El Capitan will be a “less painful transition” for developers.

“We know the applications will run well because the conceptual architecture is close to Sierra, so we’ve already done a lot of work on our codes that we can leverage that leads us right to El Cap,” Bertsch said. “The whole notion of packaging up your work and being very efficient about what you copy over to the GPU and not going back and forth too much, this whole design architecture for heterogeneity is something we added in, and that’s going to pay dividends for several generations of systems.”
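The point about limiting traffic between CPU and GPU can be illustrated with a toy cost model. This is a minimal sketch, not a description of how any LLNL code works: the latency, bandwidth, data size and step count are hypothetical values chosen only for illustration, and the model ignores compute time and overlap of transfers with computation.

```python
def transfer_seconds(num_bytes: float, latency_s: float, bandwidth_bytes_per_s: float) -> float:
    """Simple model of one host<->GPU copy: fixed latency plus size over bandwidth."""
    return latency_s + num_bytes / bandwidth_bytes_per_s


def round_trip_every_step(steps: int, num_bytes: float, latency_s: float, bandwidth: float) -> float:
    """Copy the data to the GPU and back to the host on every time step."""
    return steps * 2 * transfer_seconds(num_bytes, latency_s, bandwidth)


def keep_resident(num_bytes: float, latency_s: float, bandwidth: float) -> float:
    """Copy in once, run every step on the GPU, copy out once at the end."""
    return 2 * transfer_seconds(num_bytes, latency_s, bandwidth)


# Hypothetical values, for illustration only.
STEPS = 100
DATA_BYTES = 1e9        # 1 GB of simulation state
LATENCY_S = 1e-5        # 10 microseconds per copy
BANDWIDTH = 2.5e10      # 25 GB/s host-to-device link

naive = round_trip_every_step(STEPS, DATA_BYTES, LATENCY_S, BANDWIDTH)
resident = keep_resident(DATA_BYTES, LATENCY_S, BANDWIDTH)
print(f"copy every step: {naive:.3f} s of transfer time")
print(f"keep resident:   {resident:.3f} s of transfer time")
```

In this simplified model the transfer cost grows linearly with the number of steps when data bounces back and forth, but stays constant when the data remains on the GPU.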

Though a fraction of the size and scope of El Capitan, RZNevada is also providing researchers with an early gauge of application performance. Developers determine the best “speed-of-light” performance that could be expected out of the codes, then identify aspects of the hardware that are constraining performance and factor in the projected hardware advancements that will be made in the next two years, according to Richards.
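One common way to estimate this kind of upper bound is the roofline model, in which attainable performance is the lesser of the compute peak and memory bandwidth multiplied by a kernel’s arithmetic intensity. The article does not say which method the teams use, so the sketch below is only an illustration of the general idea; the device peaks and kernel figures are hypothetical.

```python
def speed_of_light_gflops(intensity_flop_per_byte: float,
                          peak_gflops: float,
                          peak_bandwidth_gbs: float) -> float:
    """Roofline bound: min(compute peak, memory bandwidth * arithmetic intensity)."""
    return min(peak_gflops, peak_bandwidth_gbs * intensity_flop_per_byte)


def limiting_resource(intensity_flop_per_byte: float,
                      peak_gflops: float,
                      peak_bandwidth_gbs: float) -> str:
    """Report which hardware limit caps the kernel under the roofline model."""
    ridge_point = peak_gflops / peak_bandwidth_gbs  # intensity where the two limits meet
    return "memory bandwidth" if intensity_flop_per_byte < ridge_point else "compute"


# Hypothetical device and kernel, for illustration only.
PEAK_GFLOPS = 10_000.0       # compute peak
PEAK_BANDWIDTH_GBS = 1_000.0  # memory bandwidth
KERNEL_INTENSITY = 0.25       # floating-point operations per byte moved
MEASURED_GFLOPS = 150.0       # what the kernel actually achieved

bound = speed_of_light_gflops(KERNEL_INTENSITY, PEAK_GFLOPS, PEAK_BANDWIDTH_GBS)
efficiency = MEASURED_GFLOPS / bound
print(f"speed-of-light bound: {bound:.0f} GFLOP/s")
print(f"limited by: {limiting_resource(KERNEL_INTENSITY, PEAK_GFLOPS, PEAK_BANDWIDTH_GBS)}")
print(f"achieved fraction of bound: {efficiency:.0%}")
```

Swapping in projected peaks for future hardware gives a rough sense of how the same kernel’s bound would shift, which is the kind of projection the paragraph above describes.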

“It’s like moving into a house that has the outside walls and a roof, but it doesn’t have the carpeting or drywall and a nice coat of paint — all those features that really make it a home. Those are things that we expect will be developed over the next couple of years,” Richards said. “It looks kind of like the house we’ll move into eventually, but it’s not all there yet. That’s what the COE is working with the vendor to do, making sure we’re choosing the right carpeting, picking the paint colors we like and making sure that house will work for us and the family.”

The real benefit of RZNevada, Richards said, is that it gives developers a way to address the biggest challenges of standing up a new system — evaluating how the Lab’s codes communicate with the AMD CPUs and GPUs, the HPE/AMD tool chain (including libraries, debugging tools and compilers) and the prototype fourth-generation Tri-Laboratory Operating System Stack (TOSS).

“Like any system that comes along, there are just so many details and it takes time to get them right,” Richards said. “We’re getting very close to having many of our major codes through the first level of readiness. The next step is being able to use the debugging tools to identify bottlenecks, and then addressing those bottlenecks.”

Since stockpile stewardship is El Capitan’s No. 1 mission, the Center of Excellence is prioritizing multi-physics codes similar to the classified codes used in support of the nuclear stockpile, which also require many different physics packages to run effectively on the same platform. These include codes like MARBL, Ares and HYDRA, which are used in support of inertial confinement fusion research and modeling of experiments relevant for stockpile stewardship. They will eventually move on to preparing open science codes for El Capitan’s future unclassified companion system, nicknamed Tuolumne.

Since June, LLNL computer scientist Brian Ryujin and his team have been porting the Ares code to RZNevada, comparing its performance to that on Sierra and familiarizing themselves with AMD’s tool chain. Ryujin’s team has leaned on its experience porting codes for Sierra, and Ryujin said that although the fundamental designs of the two systems are similar, there are subtle differences in architecture that require tweaks.

“It was a leap for us to get to Sierra, when we had only been supporting CPU architectures up until then,” Ryujin said. “Now that we have more experience with supporting multiple GPU architectures, we’ve had to fine-tune our abstraction layers (from Sierra). Fortunately, our abstractions have been good enough so that we don’t require a large change in the code again, so our Sierra work is definitely paying dividends. Still, it has been a reminder of how important practical experience on multiple architectures is to reach a reasonable degree of performance portability.”

The three yet-to-be-named EAS-3 systems are due to be delivered at LLNL in early 2022 and will present better platforms for comparing application performance, researchers said. They will also provide the first opportunity for LLNL to work with RABBIT, a near-node local storage solution co-developed with HPE that will provide extremely fast access to storage and significantly improve throughput on El Capitan.


Source: LLNL