Early Access Systems at LLNL Mark Progress Toward El Capitan

October 21, 2021

Oct. 21, 2021 — Though the arrival of the exascale supercomputer El Capitan at Lawrence Livermore National Laboratory (LLNL) is still almost two years away, teams of code developers are busy working on predecessor systems to ensure critical applications are ready for Day One.

RZNevada System

Delivered in February, the “RZNevada” early-access system is providing experts at the National Nuclear Security Administration (NNSA) labs (LLNL, Los Alamos and Sandia) and their counterparts at industry partners Hewlett Packard Enterprise (HPE) and Advanced Micro Devices Inc. (AMD) with their first look at the nodes and software stack, anticipating what El Capitan will eventually sport.

The second El Capitan “early-access” machine sited at LLNL, RZNevada has served for months as a testbed for Livermore Computing and the NNSA Tri-Labs to port and develop applications in preparation for El Capitan’s arrival in 2023.

In addition to RZNevada, LLNL installed an earlier testbed system, “Hetchy,” that is being used by system administrators. More advanced EAS-3 testbeds will contain nodes with next-generation CPUs and GPUs, close to the technology that will be installed in the nation’s first exascale supercomputer at Oak Ridge National Laboratory (ORNL).

“Successful large-scale systems require extensive planning, from system design and implementation, to siting and, most importantly, use,” said LLNL’s Chief Technology Officer for Livermore Computing Bronis de Supinski. “This planning includes deployment of precursor systems that enable application teams to be ready as soon as the full-scale system is available. For this reason, the El Capitan project includes several generations of early-access systems, including RZNevada and three systems that will be architecturally similar to the ORNL Frontier system, and made available to our application teams early next year.”

When delivered, El Capitan will be NNSA’s first exascale (a quintillion floating-point operations per second) machine and is currently projected to be the world’s most powerful supercomputer. As with any new machine, it takes a village to get existing codes optimized for advancements in hardware and software environments, enlisting the efforts of hundreds of employees.

LLNL computational physicist David Richards heads the El Capitan Center of Excellence, a collaboration of developers and experts at the NNSA labs, HPE and AMD who work together on RZNevada to help guarantee applications will perform well on El Capitan from the get-go. “It’s a very tight relationship,” Richards said, adding that the center provides a mechanism for granting access to codes that HPE and AMD developers otherwise couldn’t see.

“The advantage of a Center of Excellence is that we get access to a lot of [non-disclosure agreement] material from the vendors,” Richards explained. “We get to have their experts give webinar presentations and provide information to our developers. Also, their developers get access to our codes; they can sit right next to each other and work on these codes to understand the performance bottlenecks and what can we do to resolve them.”

A precursor to the technology that will compose El Capitan, RZNevada has 24 advanced AMD CPU/GPU compute nodes, each containing one 64-core AMD EPYC 7702 CPU and one AMD Instinct MI100 accelerator. The hardware is connected via HPE’s Slingshot network. While the system tops out at just a few hundred teraflops, the focus of RZNevada wasn’t speed: the Lab specifically chose fewer GPUs to provide the most nodes for code teams to work with, according to El Capitan Integration Lead Adam Bertsch.

“These early-access systems require a little bit of imagination to see how they relate to the final system, because they’re typically a lot smaller and the architecture looks a little different,” said Bertsch. “There’s a logical progression moving where we are to El Cap, but it’s certainly not the same stuff. That’s one of the things we have to deal with when we’re buying technology that doesn’t exist yet. We have to have this progression to move our applications from where we are to where we need to go.”

Bertsch said that, thanks to the Lab’s previous experience in refactoring applications for heterogeneous CPU/GPU systems with the IBM/NVIDIA supercomputer Sierra, El Capitan will be a “less painful transition” for developers.

“We know the applications will run well because the conceptual architecture is close to Sierra, so we’ve already done a lot of work on our codes that we can leverage that leads us right to El Cap,” Bertsch said. “The whole notion of packaging up your work and being very efficient about what you copy over to the GPU and not going back and forth too much, this whole design architecture for heterogeneity is something we added in, and that’s going to pay dividends for several generations of systems.”

Though a fraction of the size and scope of El Capitan, RZNevada is also providing researchers with an early gauge of application performance. Developers determine the best “speed-of-light” performance that could be expected out of the codes, then identify aspects of the hardware that are constraining performance and factor in the projected hardware advancements that will be made in the next two years, according to Richards.
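The estimate Richards describes can be illustrated with a simple roofline-style bound. What follows is a minimal sketch, not the Center of Excellence's actual method: it assumes a kernel's attainable rate is capped by the lesser of peak compute and memory bandwidth multiplied by the kernel's arithmetic intensity. The device figures, the arithmetic intensity, the measured rate and the projected improvement factors are all hypothetical placeholders, not numbers for RZNevada or El Capitan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Hypothetical device description; all values are placeholders."""
    name: str
    peak_tflops: float  # peak floating-point rate, in TFLOP/s
    mem_bw_tbs: float   # peak memory bandwidth, in TB/s


def speed_of_light(device: Device, flops_per_byte: float) -> float:
    """Roofline upper bound in TFLOP/s for a kernel with the given intensity."""
    return min(device.peak_tflops, device.mem_bw_tbs * flops_per_byte)


def limiting_resource(device: Device, flops_per_byte: float) -> str:
    """Name the hardware limit that caps the kernel under this simple model."""
    bandwidth_bound = device.mem_bw_tbs * flops_per_byte
    return "memory bandwidth" if bandwidth_bound < device.peak_tflops else "compute"


def project(device: Device, compute_gain: float, bandwidth_gain: float) -> Device:
    """Scale today's device by assumed (hypothetical) future improvements."""
    return Device(
        name=device.name + " (projected)",
        peak_tflops=device.peak_tflops * compute_gain,
        mem_bw_tbs=device.mem_bw_tbs * bandwidth_gain,
    )


if __name__ == "__main__":
    today = Device("early-access GPU", peak_tflops=10.0, mem_bw_tbs=1.0)
    intensity = 0.5        # hypothetical kernel: 0.5 flops per byte moved
    measured_tflops = 0.3  # hypothetical measured rate on today's device

    bound = speed_of_light(today, intensity)
    print("bound today:", bound, "TFLOP/s, limited by", limiting_resource(today, intensity))
    print("fraction of bound achieved:", measured_tflops / bound)

    future = project(today, compute_gain=4.0, bandwidth_gain=2.0)
    print("projected bound:", speed_of_light(future, intensity), "TFLOP/s")
```

In this toy model a low-intensity kernel is bounded by memory bandwidth, so a projected gain in compute alone would not raise its ceiling; only the assumed bandwidth improvement does.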

“It’s like moving into a house that has the outside walls and a roof, but it doesn’t have the carpeting or drywall and a nice coat of paint — all those features that really make it a home. Those are things that we expect will be developed over the next couple of years,” Richards said. “It looks kind of like the house we’ll move into eventually, but it’s not all there yet. That’s what the COE is working with the vendor to do, making sure we’re choosing the right carpeting, picking the paint colors we like and making sure that house will work for us and the family.”

The real benefit of RZNevada, Richards said, is that it gives developers a way to address the biggest challenges of standing up a new system — evaluating how the Lab’s codes communicate with the AMD CPUs and GPUs, the HPE/AMD tool chain (including libraries, debugging tools and compilers) and the prototype fourth-generation Tri-Laboratory Operating System Stack (TOSS).

“Like any system that comes along, there are just so many details and it takes time to get them right,” Richards said. “We’re getting very close to having many of our major codes through the first level of readiness. The next step is being able to use the debugging tools to identify bottlenecks, and then addressing those bottlenecks.”

Since stockpile stewardship is El Capitan’s No. 1 mission, the Center of Excellence is prioritizing multi-physics codes similar to the classified codes used in support of the nuclear stockpile, which also require many different physics packages to run effectively on the same platform. These include codes like MARBL, Ares and HYDRA, which are used in support of inertial confinement fusion research and modeling of experiments relevant for stockpile stewardship. The center’s developers will eventually move on to preparing open science codes for El Capitan’s future unclassified companion system, nicknamed Tuolumne.

Since June, LLNL computer scientist Brian Ryujin and his team have been porting the Ares code to RZNevada, comparing its performance to that on Sierra and familiarizing themselves with AMD’s tool chain. The team has leaned on its experience porting codes for Sierra, and Ryujin said that although the fundamental designs of the two systems are similar, there are subtle differences in architecture that require tweaks.

“It was a leap for us to get to Sierra, when we had only been supporting CPU architectures up until then,” Ryujin said. “Now that we have more experience with supporting multiple GPU architectures, we’ve had to fine-tune our abstraction layers (from Sierra). Fortunately, our abstractions have been good enough so that we don’t require a large change in the code again, so our Sierra work is definitely paying dividends. Still, it has been a reminder of how important practical experience on multiple architectures is to reach a reasonable degree of performance portability.”

The three yet-to-be-named EAS-3 systems are due to be delivered to LLNL in early 2022 and will present better platforms for comparing application performance, researchers said. They will also give LLNL its first opportunity to work with RABBIT, a near-node local storage solution co-developed with HPE that will provide extremely fast access to storage and significantly improve throughput on El Capitan.

For more on El Capitan, visit the web.


Source: LLNL
