Cray, AMD to Extend DOE’s Exascale Frontier

By Tiffany Trader

May 7, 2019

Cray and AMD are coming back to Oak Ridge National Laboratory to partner on the world’s largest and most expensive supercomputer. The Department of Energy’s Oak Ridge National Laboratory has selected American HPC company Cray, along with its technology partner AMD, to provide the lab with its first exascale supercomputer for 2021 deployment.

The $600 million award marks the first system announcement to come out of the second CORAL (Collaboration of Oak Ridge, Argonne and Livermore) procurement process (CORAL-2). Poised to deliver “greater than 1.5 exaflops of HPC and AI processing performance,” Frontier (ORNL-5) will be based on Cray’s new Shasta architecture and Slingshot interconnect and will feature future-generation AMD Epyc CPUs and Radeon Instinct GPUs.

In a media briefing ahead of today’s announcement at Oak Ridge, the partners revealed that Frontier will span more than 100 Shasta supercomputer cabinets, each supporting 300 kilowatts of computing. Single-socket nodes will consist of one CPU and four GPUs, connected by AMD’s custom high-bandwidth, low-latency coherent Infinity Fabric.

Oak Ridge Director Thomas Zacharia indicated that 40 MW of power, the maximum power draw set out in the CORAL-2 RFP, would be available for Frontier.
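As a rough cross-check of the cabinet and power figures quoted above, the sketch below multiplies the cabinet count by the per-cabinet rating and compares the result with the 40 MW said to be available. It treats "more than 100 cabinets" as exactly 100 and assumes every cabinet draws its full 300 kW, so the output is an illustrative floor, not a published specification. The function and constant names are hypothetical.

```python
def cabinet_power_mw(cabinets: int, kw_per_cabinet: float) -> float:
    """Total power in megawatts if every cabinet draws its full rating."""
    return cabinets * kw_per_cabinet / 1000.0


# Figures from the briefing; 100 is a floor ("more than 100 cabinets").
CABINETS = 100
KW_PER_CABINET = 300.0
AVAILABLE_MW = 40.0  # power Zacharia said would be available for Frontier

compute_mw = cabinet_power_mw(CABINETS, KW_PER_CABINET)
headroom_mw = AVAILABLE_MW - compute_mw

print(f"Cabinet power at full rating: {compute_mw:.1f} MW")
print(f"Headroom under the 40 MW ceiling: {headroom_mw:.1f} MW")
```

Under these assumptions the cabinets alone account for about three quarters of the available power; any additional cabinets would eat into the remainder.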

“Cray’s Slingshot system interconnect ties together this massive supercomputer and a new system software stack fuses the best of high performance computing and cloud capabilities,” said Cray CEO Pete Ungaro. “We worked together with AMD to design a new high density heterogeneous computing blade for Shasta and new programming environment for this new CPU-GPU node.”

Frontier will use a custom AMD Epyc processor based on a future generation of AMD’s Zen cores (beyond Rome and Milan). “[The future-gen Epycs] will have additional instructions in the microarchitecture as well as in the architecture itself for both optimization of AI as well as supercomputing workloads,” said AMD CEO Lisa Su, adding that the new Radeon Instinct GPU incorporates “extensive optimization for the AI and the computing performance, [with] mixed-precision operations for optimum deep learning performance, and high bandwidth memory for the best latency.”

The CPU and GPUs will be linked by AMD’s new coherent Infinity Fabric, and each GPU will be able to talk directly to the Slingshot network, enabling each node “to get the optimum performance for both supercomputing as well as AI,” said Su. All these components were designed for Frontier but will be available to enterprise applications after the system debuts, according to AMD.

Frontier marks a return for Cray and AMD to Oak Ridge, home to another Cray-AMD system, Titan. Benchmarked at 17.6 Linpack petaflops, Titan was the number one system in the world when it debuted (as an upgrade to Jaguar) in 2012. With Titan set to be decommissioned on August 1, 2019, and Frontier scheduled to be deployed in the back half of 2021 and accepted in 2022, Oak Ridge won’t be without a Cray-AMD machine for too long. While Titan used AMD (Opteron) CPUs and Nvidia (K20X) GPUs, Frontier will rely on AMD for all its in-node processing elements.

Frontier is Oak Ridge’s third machine to use a heterogeneous design. In addition to the aforementioned Titan, Oak Ridge is of course home to Summit, which became the world’s fastest supercomputer in June 2018. Its 143.5 GPU-accelerated Linpack petaflops are owed to 9,216 Power9 22-core CPUs and 27,648 Nvidia Tesla V100 GPUs.
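The Summit figures above imply a GPU-to-CPU ratio that can be set beside Frontier's planned one-CPU, four-GPU node. The sketch below works through that arithmetic. Dividing the whole Linpack result by the GPU count is a deliberate simplification (it ignores the CPUs' contribution), and the constant names are hypothetical.

```python
# Summit figures as quoted in the article.
SUMMIT_CPUS = 9_216
SUMMIT_GPUS = 27_648
SUMMIT_LINPACK_PF = 143.5

# Frontier's planned node layout, per the media briefing.
FRONTIER_GPUS_PER_CPU = 4

summit_gpus_per_cpu = SUMMIT_GPUS / SUMMIT_CPUS

# Simplification: attribute the entire Linpack score to the GPUs.
summit_tf_per_gpu = SUMMIT_LINPACK_PF * 1000 / SUMMIT_GPUS

print(f"Summit GPUs per CPU: {summit_gpus_per_cpu:.0f}")
print(f"Frontier GPUs per CPU (planned): {FRONTIER_GPUS_PER_CPU}")
print(f"Summit Linpack teraflops per GPU (approx.): {summit_tf_per_gpu:.2f}")
```

By this count Summit pairs three GPUs with each CPU, so Frontier's planned nodes lean further toward the accelerators.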

“Since Titan, Oak Ridge has pioneered this idea of having GPU accelerators along with CPUs,” said Zacharia. “Frontier will be the third generation of supercomputing system built around this architecture and it will be the second generation AI machine.”

Frontier will be used for future application simulations for quantum computers, nuclear energy systems, fusion reactors, and precision medicines, said Zacharia, adding “Frontier finally gets us to the point where we can actually design new materials.”

“We are approaching a revolution in how we can design and analyze materials,” said Tom Evans, Oak Ridge National Laboratory technical lead for the Energy Applications Focus Area, Exascale Computing Project. “We can look and carefully characterize the electronic structure of fairly simple atoms and very simple molecules right now. But with exascale computing on Frontier, we’re trying to stretch that to molecules that consist of thousands of atoms. The more we understand about the electronic structure, the more we’re able to actually manufacture and use exotic materials for things like very small, high tensile strength materials and buildings to make them more energy efficient. At the end of the day, everything in some sense comes down to materials.”
AMD’s Forrest Norrod and Cray’s Pete Ungaro on stage at AMD’s Next Horizon event in November 2018.

In terms of number-one system bragging rights, the DOE has previously stated, and recently confirmed, that Aurora (aka Aurora21, the revised CORAL-1 system that Intel is contracted to deliver to Argonne) is on track to be the first exascale system in the United States, and possibly the world, in 2021. Since that messaging has not changed, we believe the DOE intends to deliver on that goal. However, even if Intel keeps to its timeline and Aurora is deployed and benchmarked first, Frontier is slated to be stood up on a very similar timeline and, according to publicly stated performance goals, will provide roughly 50 percent more flops capability.
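The "roughly 50 percent more" comparison can be reproduced with a one-line calculation. In the sketch below, Frontier's 1.5 exaflops comes from the announcement, while the 1.0 exaflops used for Aurora is an assumption: the article does not state Aurora's target, and the value is simply the round figure consistent with the 50 percent claim. Both are stated goals (floors), not measured results, and the names are hypothetical.

```python
def percent_more(a: float, b: float) -> float:
    """How much larger a is than b, as a percentage of b."""
    return (a - b) / b * 100.0


FRONTIER_EF = 1.5  # "greater than 1.5 exaflops", per the announcement
AURORA_EF = 1.0    # assumption: not stated in the article

advantage = percent_more(FRONTIER_EF, AURORA_EF)
print(f"Frontier's stated goal exceeds the assumed Aurora goal by {advantage:.0f}%")
```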

Asked to comment on the “competitive” timelines for Frontier and Aurora, Zacharia said he could only comment on Frontier.

“I don’t know all the details of Aurora procurement because that information has not been publicly released, but we do know that Frontier will be the largest system by far that the DOE has procured,” he said.

“We know that Oak Ridge has experience with Summit and Titan previously in using CPU-GPU systems. We also know that the pre-exascale system that the scientific community is using today to develop all their applications and system software is on our system Summit, which is the largest machine available to anybody…. If there is any competition between the labs, it’s just competition for ideas, which is what scientists should do, but otherwise this is truly a DOE lab system effort to ensure the United States maintains the forefront of this important technology, not only because it drives technology innovation in the IT computing space but it also drives economic competition and creates jobs.”

Zacharia further noted that the goals for Frontier are aligned and consistent with the White House AI initiative as well as the National Council for the American Worker, which is creating new jobs using AI and scientific computing in manufacturing and other spaces.

As for that $600-million-plus price tag, it is “by far the most expensive single machine that [the DOE has] ever procured,” said Zacharia. It’s also Cray’s largest contract ever.

The total amount includes the system build contract for “over $500 million,” as well as the development contract for “over $100 million” that will, according to Ungaro, be used to develop some of the core technologies for the machine, as well as a new programming environment that will enhance GPU programmability via extensions for Radeon Open Compute Platform (ROCm).

“The Cray Programming Environment (Cray PE)…will see a number of enhancements for increased functionality and scale,” said Cray. “This will start with Cray working with AMD to enhance these tools for optimized GPU scaling with extensions for Radeon Open Compute Platform (ROCm). These software enhancements will leverage low-level integrations of AMD ROCmRDMA technology with Cray Slingshot to enable direct communication between the Slingshot NIC to read and write data directly to GPU memory for higher application performance.”

To support the converged use of analytics, AI, and HPC at extreme scale, “Cray PE will be integrated with a full machine learning software stack with support for the most popular tools and frameworks.”

Shasta cabinet detail

Frontier marks Cray’s third major contract award for the Shasta architecture and Slingshot interconnect. Previous awards were for the National Energy Research Scientific Computing Center’s NERSC-9 pre-exascale Perlmutter system (with partners AMD and Nvidia) and the Argonne National Laboratory’s Aurora exascale system (with Intel as the prime).

Frontier is the first CORAL-2 award, announced nearly 13 months after the RFP was released. As laid out in the program’s RFP, CORAL-2 seeks to fund up to three exascale-class systems: Frontier at Oak Ridge, El Capitan at Livermore and a potential third system at Argonne if the lab chooses to make an award under the RFP and if funding is available. Like the original CORAL program, which kicked off in 2012, CORAL-2 has a mandate to field architecturally diverse machines in a way that manages risk during a period of rapid technological evolution. The stipulation indicates that “the systems residing at or planned to reside at ORNL and ANL must be diverse from one another.” The program does, however, allow the Oak Ridge and Livermore labs to employ the same architecture if they choose to do so, as in the case of Summit and Sierra, which employ very similar IBM-Nvidia architectures.

The CORAL-2 effort is part of the U.S. Exascale Computing Initiative. The ECI has two components: one is the hardware delivery and the other is application readiness. The latter is the domain of the Exascale Computing Project (see HPCwire’s recent coverage to read about the latest progress), which is investing $1.7 billion to ensure there’s an exascale-ready software ecosystem to get the most from exascale hardware when it arrives.

“ECP Software Technology is excited to be a part of preparing the software stack for Frontier,” said Sandia’s Mike Heroux, director of software technology for the Exascale Computing Project. “We are already on our way, using Summit and Sierra as launching pads. Working with [Oak Ridge Leadership Computing Facility], Cray, and AMD, we look forward to providing the programming environments and tools, and math, data and visualization libraries that will unlock the potential of Frontier for producing the countless scientific achievements we expect from such a powerful system. We are privileged to be part of the effort.”

ORNL’s Center for Accelerated Application Readiness is accepting proposals from scientists to prepare their codes to run on Frontier. Check with the Frontier website for additional information.
