Lawrence Livermore National Laboratory (LLNL) is one of several national labs working with the National Nuclear Security Administration (NNSA), which manages the military applications of nuclear science – that is, the United States’ nuclear weapons stockpile. The NNSA doesn’t actually conduct weapons tests, though: it simulates them. To do this, the NNSA – and its partner labs – use in-house HPC systems. This week, LLNL highlighted one of the latest additions to its computing arsenal: Magma.
The NNSA’s core mission might sound straightforward, but simulating nuclear weapons is a deeply multidimensional task. “The high-performance computing aspects of that mission involve the development of predictive physics-based models,” explained Matt Leininger, deputy for advanced technology projects at LLNL, in a webinar. “Those […] are models for such areas as materials science, molecular dynamics, particle transport, hydrodynamics, mathematical solvers and other areas.”
LLNL researchers first run multi-physics applications – applications incorporating, for instance, hydrodynamics, particle transport and complex geometries – then run individual science-based applications to drill down into the uncertainties produced by the multi-physics models. Using those results, researchers then revisit the multi-physics applications, iterating the process until they achieve what Leininger calls “predictive science capability.” Many of those models run along various spectra of resolution, dimensionality, timescales and more, adding up to produce an enormous appetite for computing capacity.
To sate this appetite, LLNL calls on the Commodity Technology Systems contract (CTS-1), an NNSA grant awarded to LLNL and its two sister laboratories, Sandia National Laboratories and Los Alamos National Laboratory. Magma, which was shipped in November 2019, is the latest procurement under the CTS-1 umbrella following an award in 2016.
Magma is a Penguin Computing “Relion” system comprising 752 nodes with Intel Xeon Platinum 9242 (Cascade Lake-AP) processors. The cluster has 293 terabytes of memory, liquid cooling provided by CoolIT Systems and an Intel Omni-Path interconnect. It delivered 3.24 Linpack petaflops out of a theoretical peak of 5.31 petaflops, placing it 69th on the latest Top500 list of the world’s most powerful supercomputers. On a per-node basis, Leininger told HPCwire, the Cascade Lake processors delivered “about three to three and a half” times the performance of the Broadwell processors deployed earlier in the CTS program.
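For readers who want to put those headline figures in proportion, the short Python sketch below derives a few ratios from them. Only the four input numbers come from the article; the variable names are illustrative, and the per-node values are simple averages using decimal units, not specifications published by LLNL or Penguin Computing.

```python
# Figures quoted in the article.
LINPACK_PFLOPS = 3.24      # measured Linpack performance
PEAK_PFLOPS = 5.31         # theoretical peak performance
NODES = 752                # number of compute nodes
TOTAL_MEMORY_TB = 293      # total system memory

# Derived ratios (averages only; decimal units, 1 TB = 1000 GB).
efficiency = LINPACK_PFLOPS / PEAK_PFLOPS
linpack_tflops_per_node = LINPACK_PFLOPS * 1000 / NODES
memory_gb_per_node = TOTAL_MEMORY_TB * 1000 / NODES

print(f"Linpack efficiency:        {efficiency:.1%}")
print(f"Linpack teraflops per node: {linpack_tflops_per_node:.2f}")
print(f"Memory per node (GB):       {memory_gb_per_node:.0f}")
```

By this arithmetic, Magma sustains roughly 61 percent of its theoretical peak on Linpack, or a little over 4 teraflops per node on average.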
Magma has no distinct storage capacity of its own, Leininger said, because it connects to several different Lustre file systems, which give it access to “many, many petabytes” of storage. In terms of its footprint, Leininger explained that LLNL clusters are designed in “scalable units” that act like LEGO bricks, allowing researchers to scale a system from as few as 20 nodes to several thousand nodes. Magma is about four scalable units, making it physically around the size of “half a tennis court.”
What Magma brings to the table
Leininger was especially excited about a few new elements of Magma. The interconnect, he said, was “particularly critical.” “You can’t just solve [the models] on a single server,” he explained. “You really have to break up the problem and distribute it across thousands of servers and then use that high performance interconnect to tie the pieces back together again.” Thanks to that high-performance interconnect, he said, tasks that used to be impossible on a single server now take a couple of days. Leininger also emphasized the memory bandwidth per node (which he called “tremendous”), noting that typical workloads were even more intensive on memory bandwidth than on the network.
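The “break up the problem, then tie the pieces back together” pattern Leininger describes is commonly known as domain decomposition. The toy Python sketch below illustrates the idea on a one-dimensional diffusion problem: the grid is split into chunks, each chunk borrows one boundary value (a “halo” cell) from each neighbor before every step, and the chunks are stitched together at the end. It is not drawn from any LLNL code. The function names and the diffusion problem are assumptions chosen for illustration, and the halo exchange that would travel over the interconnect on a real cluster (typically via a message-passing library) is simulated here with plain list copies in a single process.

```python
def diffuse_serial(u, steps, alpha=0.25):
    """Reference solution: explicit diffusion on one 'server', end values fixed."""
    u = list(u)
    for _ in range(steps):
        interior = [u[i] + alpha * (u[i - 1] - 2 * u[i] + u[i + 1])
                    for i in range(1, len(u) - 1)]
        u = [u[0]] + interior + [u[-1]]
    return u


def diffuse_decomposed(u, steps, parts, alpha=0.25):
    """Same computation with the grid split into `parts` chunks (len(u) >= parts)."""
    n = len(u)
    chunks = [list(u[p * n // parts:(p + 1) * n // parts]) for p in range(parts)]
    for _ in range(steps):
        # "Halo exchange": each chunk receives its neighbors' edge values.
        # On a cluster, this is the traffic carried by the interconnect.
        halos = [(chunks[p - 1][-1] if p > 0 else None,
                  chunks[p + 1][0] if p < parts - 1 else None)
                 for p in range(parts)]
        new_chunks = []
        for chunk, (left, right) in zip(chunks, halos):
            padded = (([left] if left is not None else []) + chunk +
                      ([right] if right is not None else []))
            updated = list(padded)
            # Purely local update; the first and last padded cells are either
            # halo cells or the fixed global boundary, so they are not updated.
            for i in range(1, len(padded) - 1):
                updated[i] = padded[i] + alpha * (
                    padded[i - 1] - 2 * padded[i] + padded[i + 1])
            start = 1 if left is not None else 0
            end = len(updated) - (1 if right is not None else 0)
            new_chunks.append(updated[start:end])  # drop halo cells
        chunks = new_chunks
    # "Tie the pieces back together again."
    return [value for chunk in chunks for value in chunk]


field = [0.0] * 7 + [100.0] + [0.0] * 8   # a hot spot in a cold 16-cell rod
serial = diffuse_serial(field, steps=10)
split = diffuse_decomposed(field, steps=10, parts=4)
print(max(abs(a - b) for a, b in zip(serial, split)))
```

The decomposed run reproduces the single-domain answer, but each chunk only ever needs its own cells plus two borrowed values per step. That small, frequent exchange is why interconnect performance matters so much once the chunks live on separate servers.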
Crucially, and unlike many of LLNL’s Broadwell-based systems, Magma uses liquid cooling – specifically, liquid cooling focused on the CPUs and memory modules, to which Leininger credits much of Magma’s high density. “When you have a gigantic machine like Sierra that’s liquid-cooled, and then you put a big cluster in the corner that’s air-cooled, it’s challenging facilities-wise to make sure all that cold air is going in that right spot,” Leininger said in an earlier interview with HPCwire. “And it’s often a very human-intensive thing to optimize for all that, and it ends up just being easier and much more cost-effective to just move to liquid cooling on these solutions. So we knew we wanted to do that as well.”
Leininger also stressed that memory errors are a large portion of overall computing errors at LLNL and suggested that the direct liquid cooling may help. “We’re looking forward to reducing the operating temperature of the DIMMs and hopefully therefore reducing the overall number of memory errors we see over the system lifetime,” Leininger said, adding that the cooling system was designed for easy serviceability.
How Magma fits into the NNSA computing landscape
Magma is currently in the final stages of installation at LLNL, after which it will undergo testing and enter full production within the next month. Magma exists alongside several CTS-1 comrades (also supplied by Penguin Computing), including Corona (another LLNL system) and Attaway, which is housed at Sandia. Unlike Magma, Corona is the first of the “A+A” systems: AMD CPUs and AMD GPUs (specifically, AMD Naples CPUs and a 50-50 mix of MI25 and MI60 GPUs). This A+A structure makes Corona an early precursor to the forthcoming exascale Frontier system at Oak Ridge National Laboratory and the forthcoming El Capitan system at LLNL itself. While Magma serves problems more related to materials science, Corona’s GPUs make it more suitable for tasks such as machine learning and AI applications, Leininger explained to HPCwire. Attaway, meanwhile, uses Intel Skylake processors and placed 94th in the most recent Top500.
Leininger claims that LLNL has no plans to sunset any of its other systems once Magma reaches full production, saying that “all the CTS-1 systems we’ve procured over the last four years now, including Magma, will continue to deliver HPC cycles to our users over the next several years.” In fact, he explained, those systems remain in “very heavy use” and LLNL is facing demand beyond even its new capabilities.
To that end, LLNL is ready to move beyond CTS-1. “We are preparing for our next round of CTS procurements that’ll occur starting in late 2021,” Leininger said, “and that’ll be under the second round of the CTS procurements, called CTS-2.” Leininger said an RFP would be issued this summer and a contract would be awarded late in the calendar year as part of a push to deliver systems to NNSA labs from the second half of 2021 through 2024. Of course, he emphasized, there are still a few more systems to deliver before that point.
In general, Leininger said, the CTS-1 systems are “everyday workhorses,” intended to take the load off the Advanced Technology System (ATS) supercomputers. “Commodity-based systems take on the bulk of day-to-day computing, leaving the larger advanced technology capability systems available for only the most demanding problems across the Tri-Lab community,” said Mark Anderson, director of the NNSA’s Advanced Simulation and Computing Program. The current ATS flagship is the Trinity supercomputer at Los Alamos, which is scheduled to reach end-of-life in 2021. At that point, Trinity will be replaced by a new ATS system, called Crossroads.
Tiffany Trader contributed to this report.