Early Access Systems at LLNL Mark Progress Toward El Capitan

October 21, 2021

Oct. 21, 2021 — Though the arrival of the exascale supercomputer El Capitan at Lawrence Livermore National Laboratory (LLNL) is still almost two years away, teams of code developers are busy working on predecessor systems to ensure critical applications are ready for Day One.

RZNevada System

Delivered in February, the “RZNevada” early-access system is giving experts at the National Nuclear Security Administration (NNSA) labs (LLNL, Los Alamos and Sandia) and their counterparts at industry partners Hewlett Packard Enterprise (HPE) and Advanced Micro Devices Inc. (AMD) their first look at nodes and a software stack that anticipate what El Capitan will eventually sport.

RZNevada is the second El Capitan “early-access” machine sited at LLNL. For months it has served as a testbed for Livermore Computing and the NNSA Tri-Labs to port and develop applications in preparation for El Capitan’s arrival in 2023.

In addition to RZNevada, LLNL installed an earlier testbed system, “Hetchy,” that is being used by system administrators. More advanced EAS-3 testbeds will contain nodes with next-generation CPUs and GPUs, close to the technology that will be installed in the nation’s first exascale supercomputer at Oak Ridge National Laboratory (ORNL).

“Successful large-scale systems require extensive planning, from system design and implementation, to siting and, most importantly, use,” said LLNL’s Chief Technology Officer for Livermore Computing Bronis de Supinski. “This planning includes deployment of precursor systems that enable application teams to be ready as soon as the full-scale system is available. For this reason, the El Capitan project includes several generations of early-access systems, including RZNevada and three systems that will be architecturally similar to the ORNL Frontier system, and made available to our application teams early next year.”

When delivered, El Capitan will be NNSA’s first exascale machine (capable of a quintillion floating-point operations per second) and is currently projected to be the world’s most powerful supercomputer. As with any new machine, it takes a village to optimize existing codes for new hardware and software environments, and the effort enlists hundreds of employees.

LLNL computational physicist David Richards heads the El Capitan Center of Excellence, a collaboration of developers and experts at the NNSA labs, HPE and AMD who work together on RZNevada to help guarantee applications will perform well on El Capitan from the get-go. “It’s a very tight relationship,” Richards said, one that also provides a mechanism for granting HPE and AMD developers access to codes they otherwise couldn’t see.

“The advantage of a Center of Excellence is that we get access to a lot of [non-disclosure agreement] material from the vendors,” Richards explained. “We get to have their experts give webinar presentations and provide information to our developers. Also, their developers get access to our codes; they can sit right next to each other and work on these codes to understand the performance bottlenecks and what can we do to resolve them.”

RZNevada, a precursor to the technology that will make up El Capitan, has 24 AMD CPU/GPU compute nodes, each containing one 64-core AMD EPYC 7702 processor and one AMD Instinct MI100 accelerator. The hardware is connected via HPE’s Slingshot network. While the system tops out at just a few hundred teraflops, speed wasn’t the focus of RZNevada: the Lab specifically chose fewer GPUs to provide the most nodes for code teams to work with, according to El Capitan Integration Lead Adam Bertsch.

“These early-access systems require a little bit of imagination to see how they relate to the final system, because they’re typically a lot smaller and the architecture looks a little different,” said Bertsch. “There’s a logical progression moving where we are to El Cap, but it’s certainly not the same stuff. That’s one of the things we have to deal with when we’re buying technology that doesn’t exist yet. We have to have this progression to move our applications from where we are to where we need to go.”

Bertsch said that, thanks to the Lab’s previous experience in refactoring applications for heterogeneous CPU/GPU systems with the IBM/NVIDIA supercomputer Sierra, the move to El Capitan will be a “less painful transition” for developers.

“We know the applications will run well because the conceptual architecture is close to Sierra, so we’ve already done a lot of work on our codes that we can leverage that leads us right to El Cap,” Bertsch said. “The whole notion of packaging up your work and being very efficient about what you copy over to the GPU and not going back and forth too much, this whole design architecture for heterogeneity is something we added in, and that’s going to pay dividends for several generations of systems.”

Though a fraction of the size and scope of El Capitan, RZNevada is also providing researchers with an early gauge of application performance. Developers determine the best “speed-of-light” performance that could be expected out of the codes, then identify aspects of the hardware that are constraining performance and factor in the projected hardware advancements that will be made in the next two years, according to Richards.
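As a rough illustration of that kind of projection, the sketch below uses a simple roofline-style bound: a kernel can finish no faster than its arithmetic or its memory traffic allows, whichever takes longer. This is not the Center of Excellence’s actual methodology. The function names and every number in it (kernel size, peak rates, projected hardware) are hypothetical placeholders, not RZNevada or El Capitan figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hardware:
    """Hypothetical peak rates for one device (illustrative values only)."""
    peak_flops: float      # floating-point operations per second
    peak_bandwidth: float  # bytes per second to and from device memory


def speed_of_light_seconds(flops: float, bytes_moved: float, hw: Hardware) -> float:
    """Lower bound on run time: the slower of pure compute and pure data movement."""
    compute_time = flops / hw.peak_flops
    memory_time = bytes_moved / hw.peak_bandwidth
    return max(compute_time, memory_time)


def limiting_resource(flops: float, bytes_moved: float, hw: Hardware) -> str:
    """Name the hardware aspect that constrains the kernel under this model."""
    compute_time = flops / hw.peak_flops
    memory_time = bytes_moved / hw.peak_bandwidth
    return "memory bandwidth" if memory_time > compute_time else "compute"


def projected_speedup(flops: float, bytes_moved: float,
                      current: Hardware, future: Hardware) -> float:
    """Ratio of the current bound to the bound on projected future hardware."""
    return (speed_of_light_seconds(flops, bytes_moved, current)
            / speed_of_light_seconds(flops, bytes_moved, future))


# Hypothetical kernel: 1e12 floating-point operations touching 4e12 bytes.
KERNEL_FLOPS = 1e12
KERNEL_BYTES = 4e12

# Hypothetical devices; NOT vendor specifications.
TODAY = Hardware(peak_flops=10e12, peak_bandwidth=1e12)
PROJECTED = Hardware(peak_flops=40e12, peak_bandwidth=2e12)

bound_today = speed_of_light_seconds(KERNEL_FLOPS, KERNEL_BYTES, TODAY)
bottleneck = limiting_resource(KERNEL_FLOPS, KERNEL_BYTES, TODAY)
speedup = projected_speedup(KERNEL_FLOPS, KERNEL_BYTES, TODAY, PROJECTED)
```

In this made-up case the kernel is limited by memory bandwidth, so quadrupling peak compute while only doubling bandwidth yields a projected speedup of two, which is why identifying the constraining resource comes before factoring in hardware advances.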

“It’s like moving into a house that has the outside walls and a roof, but it doesn’t have the carpeting or drywall and a nice coat of paint — all those features that really make it a home. Those are things that we expect will be developed over the next couple of years,” Richards said. “It looks kind of like the house we’ll move into eventually, but it’s not all there yet. That’s what the COE is working with the vendor to do, making sure we’re choosing the right carpeting, picking the paint colors we like and making sure that house will work for us and the family.”

The real benefit of RZNevada, Richards said, is that it gives developers a way to address the biggest challenges of standing up a new system — evaluating how the Lab’s codes communicate with the AMD CPUs and GPUs, the HPE/AMD tool chain (including libraries, debugging tools and compilers) and the prototype fourth-generation Tri-Laboratory Operating System Stack (TOSS).

“Like any system that comes along, there are just so many details and it takes time to get them right,” Richards said. “We’re getting very close to having many of our major codes through the first level of readiness. The next step is being able to use the debugging tools to identify bottlenecks, and then addressing those bottlenecks.”

Since stockpile stewardship is El Capitan’s No. 1 mission, the Center of Excellence is prioritizing multi-physics codes similar to the classified codes used in support of the nuclear stockpile, which also require many different physics packages to run effectively on the same platform. These include codes like MARBL, Ares and HYDRA, which are used in support of inertial confinement fusion research and modeling of experiments relevant for stockpile stewardship. The teams will eventually move on to preparing open science codes for El Capitan’s future unclassified companion system, nicknamed Tuolumne.

Since June, LLNL computer scientist Brian Ryujin and his team have been porting the Ares code to RZNevada, comparing its performance with that on Sierra and familiarizing themselves with AMD’s tool chain. The team has leaned on its experience porting codes for Sierra, and Ryujin said that although the fundamental designs of the two systems are similar, there are subtle differences in architecture that require tweaks.

“It was a leap for us to get to Sierra, when we had only been supporting CPU architectures up until then,” Ryujin said. “Now that we have more experience with supporting multiple GPU architectures, we’ve had to fine-tune our abstraction layers (from Sierra). Fortunately, our abstractions have been good enough so that we don’t require a large change in the code again, so our Sierra work is definitely paying dividends. Still, it has been a reminder of how important practical experience on multiple architectures is to reach a reasonable degree of performance portability.”

The three yet-to-be-named EAS-3 systems are due to be delivered to LLNL in early 2022 and will present better platforms for comparing application performance, researchers said. They will also give LLNL its first opportunity to work with RABBIT, a near-node local storage solution co-developed with HPE that will provide extremely fast access to storage and significantly improve throughput on El Capitan.

For more on El Capitan, visit the web.


Source: LLNL
