Ahead of Frontier’s Deployment This Year, 1.5 Cabinet ‘Crusher’ Serves Science

By Tiffany Trader

March 28, 2022

The Frontier supercomputer was installed at the Department of Energy’s Oak Ridge National Laboratory in 2021, with the final cabinet rolled into place in October. While shakeout of the full 2-exaflops peak system continues – we have heard off the record about troubles with the interconnect technology – the Frontier project is running with a smaller testbed system of the same core design.

Clocking in at about 40 petaflops peak double-precision, “Crusher” is a 1.5-cabinet iteration of the Cray EX Frontier supercomputer. Crusher will serve early science users while integration and testing of the full 74-cabinet Frontier system continues. The Frontier system is on track to be the United States’ first exascale system sometime this year, and will enter full user operations on January 1, 2023, according to Oak Ridge National Laboratory.

Crusher consists of 192 HPE Cray EX nodes – each with one AMD “Trento” 7A53 Epyc CPU and four AMD Instinct MI250X GPUs (for a total of 768 GPUs). Trento uses the same Zen 3 cores as Milan, optimized for better memory efficiencies. Nodes are connected by HPE’s Slingshot-11 interconnect. Each node sports 512GiB of DDR4 memory on the CPU and 512GiB of HBM2e (128GiB per GPU) with coherent memory across the node.
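The per-node figures above roll up into system totals with simple multiplication. The following is a minimal sketch of that arithmetic; the constant and function names are illustrative and do not come from OLCF, and the totals assume every node is configured identically.

```python
# Per-node figures as quoted in the article for Crusher.
NODES = 192
GPUS_PER_NODE = 4
DDR4_GIB_PER_NODE = 512
HBM_GIB_PER_GPU = 128


def crusher_totals(nodes: int = NODES) -> dict:
    """Aggregate GPU count and memory, assuming identical nodes."""
    hbm_gib_per_node = GPUS_PER_NODE * HBM_GIB_PER_GPU
    return {
        "gpus": nodes * GPUS_PER_NODE,
        "hbm_gib_per_node": hbm_gib_per_node,
        # 1 TiB = 1024 GiB
        "ddr4_tib": nodes * DDR4_GIB_PER_NODE / 1024,
        "hbm_tib": nodes * hbm_gib_per_node / 1024,
    }


if __name__ == "__main__":
    for name, value in crusher_totals().items():
        print(f"{name}: {value}")
```

Under these assumptions, the 768-GPU total and the 512GiB of HBM2e per node both follow directly from the four-GPU node design.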

By contrast, the full-size Frontier is slated to deliver 2 exaflops of peak double-precision performance in 74 cabinets within a 29MW power envelope. Occupying a 372 m² footprint at the Oak Ridge Leadership Computing Facility (OLCF), Frontier spans 9,408 nodes aggregating 9.2 petabytes of memory (4.6 petabytes of DDR4 and 4.6 petabytes of HBM2e). Total GPU count: 37,632. There are 37 petabytes of node-local storage, and access to 716 petabytes of center-wide storage.
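As a rough consistency check, the Frontier totals can be reproduced from the node count. The sketch below assumes each Frontier node carries the same four GPUs and 512GiB of each memory type as a Crusher node (the article says only that the two share a core design), and that peak performance scales linearly with node count. All names are illustrative.

```python
# Figures quoted in the article.
FRONTIER_NODES = 9408
CRUSHER_NODES = 192
FRONTIER_PEAK_PFLOPS = 2000.0  # 2 exaflops peak

# Assumption: Frontier nodes match the Crusher node configuration.
GPUS_PER_NODE = 4
GIB_PER_NODE_PER_MEMORY_TYPE = 512  # DDR4 and HBM2e alike


def frontier_gpu_count() -> int:
    return FRONTIER_NODES * GPUS_PER_NODE


def gib_to_pib(gib: float) -> float:
    """Convert GiB to PiB (binary units: 1 PiB = 1024**2 GiB)."""
    return gib / 1024**2


def scaled_peak_pflops(nodes: int) -> float:
    """Peak for a partition, assuming linear scaling with node count."""
    return FRONTIER_PEAK_PFLOPS * nodes / FRONTIER_NODES


if __name__ == "__main__":
    per_type_gib = FRONTIER_NODES * GIB_PER_NODE_PER_MEMORY_TYPE
    print(f"GPUs: {frontier_gpu_count():,}")
    print(f"Memory per type: {gib_to_pib(per_type_gib):.1f} PiB")
    print(f"Crusher-sized share of peak: "
          f"{scaled_peak_pflops(CRUSHER_NODES):.1f} petaflops")
```

Under these assumptions the GPU count matches the stated 37,632, the 4.6-petabyte memory figures line up when read in binary units, and a 192-node slice of the 2-exaflops peak comes to roughly 40 petaflops, in line with the figure quoted for Crusher.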

The HPE Olympus racks used in the Frontier architecture are entirely liquid-cooled, including the DIMMs and NICs. Each cabinet (when dry) weighs 3,630 kilograms. The full Frontier system has a total of 81,000 cables.

The facility was built to host a 100+ cabinet system, but HPE and AMD hit the 2 exaflops peak design target in only 74 cabinets (source: Al Geist, SuperComputingAsia 2022 keynote).

Crusher, said Oak Ridge, is ready to “crush” science, although we suspect the name might also be a nod to the chief medical officer from the television series Star Trek: The Next Generation. By extension, the full configuration would be the “Final Frontier.”

A visualization of an outflow of galactic wind at a single point in time using Cholla. Source: OLCF.

Four projects have already had their codes successfully optimized for Crusher and thus Frontier as well. They are the CANcer Distributed Learning Environment, or CANDLE, project; the Computational hydrodynamics on ∥ (parallel) architectures, or Cholla, project; the Locally Self-Consistent Multiple Scattering, or LSMS, project; and the Nuclear Coupled-Cluster Oak Ridge, or NuCCOR, project. Some of these codes date back to OLCF’s first hybrid-architecture system, the decommissioned 27-petaflop Cray XK7 Titan supercomputer that also employed CPU+GPU nodes and which was stood up in 2012.

Highlights of early results: 

  • The CANDLE team has successfully run one of their Transformer models (for natural language processing) on Crusher, achieving an 80 percent speedup on a Crusher node from previous systems.
  • Cholla, one of the first astrophysics codes to be rewritten for Frontier, is seeing 15-fold speedups on Crusher.
  • A materials code – LSMS – that can perform large-scale calculations of up to 100,000 atoms has been successfully deployed on Crusher.
  • NuCCOR, a nuclear physics code that can perform massive simulations of nuclei, is seeing 8-fold speedups on Crusher.

“Crusher is the latest in a long line of test and development systems we have deployed for early users of OLCF platforms and is easily the most powerful of these we have ever provided,” said ORNL’s Bronson Messer, OLCF director of science. “The results these code teams are realizing on the machine are very encouraging as we look toward the dawn of the exascale era with Frontier.”

“Taking up only 44 square feet of floor space, Crusher is 1/100th the size of the previous Titan supercomputer but faster than the entire 4,352-square-foot system was, packing a massive computing punch for its small size,” further reported the Oak Ridge announcement.

Frontier blade on display at SC21. Each node has one AMD Epyc CPU and four Instinct GPUs; however, given the dual-GPU die design of the MI200-series accelerators, there are eight logical GPUs available to applications.

Frontier was originally scheduled to be deployed in the back half of 2021 and accepted in 2022. Delays of some kind or another are typical with supercomputing systems of this scope and scale, and Frontier is the first implementation of the AMD A+A architecture in addition to being one of the world’s first exascale machines. It remains to be seen whether Frontier will be ready in time for the late-May (not June this year) Top500 list as had been widely anticipated (given that the system was fully installed prior to the release of the November 2021 list). Oak Ridge did not offer a precise timeline for Frontier’s deployment and acceptance other than stating it will happen in 2022, followed by full operations commencing on January 1, 2023.

One challenge that Oak Ridge and their vendor partners have already overcome pertains to Covid-spurred supply chain shortages. Speaking at SCA22 earlier this month, ORNL Corporate Research Fellow Al Geist said that of Frontier’s 59 million parts, there were about 2 million parts that the regular manufacturers could not supply. “There was a heroic effort by the HPE and AMD teams calling up electronics warehouses and […] other manufacturers and [sourcing the missing parts.]”

A leadership-class facility (it’s in the name), OLCF is the home of Summit, another heterogeneous CPU-GPU system that debuted in 2018. Delivering 149 Linpack petaflops, the IBM-built machine is currently the number two system on the twice-yearly Top500 list of fastest computers. The title of world’s fastest supercomputer is officially held by the Riken Arm-based Fujitsu system (442 Linpack petaflops), but China is thought to have two exascale systems that were withheld from the list for political reasons.

Two other exascale systems are on deck in the United States: Aurora at Argonne National Laboratory and El Capitan at Lawrence Livermore National Laboratory. Aurora, having had several resets and setbacks, is slated to be stood up at Argonne later this year. The Intel-HPE collaboration is now targeting more than 2-exaflops peak performance. On the face of it, Frontier’s slowed rollout could conceivably put those timelines in contention; however, Frontier is already on the floor and Aurora isn’t. The Ponte Vecchio GPU for the Aurora supercomputer won’t be delivered until later this year, Intel recently reported. Meanwhile, preparation for El Capitan is well underway at Livermore; the system – to be built by HPE using an architecture similar to Frontier’s – is slated for delivery in 2023, promising greater than 2-exaflops peak performance.

Read the OLCF press release for more details on the science codes that are running on Crusher.
