US Plans $1.8 Billion Spend on DOE Exascale Supercomputing

By Tiffany Trader

April 11, 2018

On Monday, the United States Department of Energy announced its intention to procure up to three exascale supercomputers at a cost of up to $1.8 billion with the release of the much-anticipated CORAL-2 request for proposals (RFP). Although funding is not yet secured, the anticipated budget range for each system is significant: $400 million to $600 million per machine including associated non-recurring engineering (NRE).

CORAL of course refers to the joint effort to procure next-generation supercomputers for Department of Energy’s National Laboratories at Oak Ridge, Argonne, and Livermore. The fruits of the original CORAL RFP include Summit and Sierra, ~200 petaflops systems being built by IBM in partnership with Nvidia and Mellanox for Oak Ridge and Livermore, respectively, and “A21,” the retooled Aurora contract with prime Intel (and partner Cray), destined for Argonne in 2021 and slated to be the United States’ first exascale machine.

The heavyweight supercomputers are required to meet the mission needs of the Advanced Scientific Computing Research (ASCR) Program within the DOE’s Office of Science and the Advanced Simulation and Computing (ASC) Program within the National Nuclear Security Administration.

The CORAL-2 collaboration specifically seeks to fund non-recurring engineering and up to three exascale-class systems: one at Oak Ridge, one at Livermore and a potential third system at Argonne if it chooses to make an award under the RFP and if funding is available. The Exascale Computing Project (ECP), a joint DOE-NNSA effort, has been organizing and leading R&D in the areas of the software stack, applications, and hardware to ensure “capable,” i.e., productively usable, exascale machines that can solve science problems 50x faster (or more complex) over today’s ~20-petaflops DOE systems (i.e., Sequoia and Titan). In terms of peak Linpack, 1.3 exaflops is the “desirable” target set by the DOE.

Like the original CORAL program, which kicked off in 2012, CORAL-2 has a mandate to field architecturally diverse machines in a way that manages risk during a period of rapid technological evolution. “Regardless of which system or systems are being discussed, the systems residing at or planned to reside at ORNL and ANL must be diverse from one another,” notes the CORAL-2 RFP cover letter [PDF]. Sharpening the point, that means the Oak Ridge system must be distinct from A21 and from a potential CORAL-2 machine at Argonne. It is conceivable, then, that this RFP may result in one, two or three different architectures, depending of course on the selections made by the labs and whether Argonne’s CORAL-2 machine comes to fruition.

“Diversity,” according to the RFP documents, “will be evaluated by how much the proposed system(s) promotes a competition of ideas and technologies; how much the proposed system(s) reduces risk that may be caused by delays or failure of a particular technology or shifts in vendor business focus, staff, or financial health; and how much the proposed system(s) diversity promotes a rich and healthy HPC ecosystem.”

Here is a listing of current and future CORAL machines:

Proposals for CORAL-2 are due in May with bidders to be selected later this year. Acquisition contracts are anticipated for 2019.

If Argonne takes delivery of A21 in 2021 and deploys an additional machine (or upgrade) in the third quarter of 2022, it would be fielding two exascale machines/builds in less than two years.

“Whether CORAL-2 winds up being two systems or three may come down to funding, which is ‘expected’ at this point, but not committed,” commented HPC veteran and market watcher Addison Snell, CEO of Intersect360 Research. “If ANL does not fund an exascale system as part of CORAL-2, I would nevertheless expect an exascale system there in a similar timeframe, just possibly funded separately.”

Several HPC community leaders we spoke with shared more pointed speculation on what the overture for a second exascale machine at Argonne so soon on the heels of A21 may indicate, insofar as there may be doubt about whether Intel’s “novel architecture” will satisfy the full scope of DOE’s needs. Given the close timing and the reality of lengthy procurement cycles, the decision on a follow-on will have to be made without the benefit of experience with A21.

Argonne’s Associate Laboratory Director for Computing, Environment and Life Sciences Rick Stevens, commenting for this piece, underscored the importance of technology diversity and shined a light on Argonne’s thinking. “We are very interested in getting as broad range of responses as possible to consider for our planning. We would love to have multiple choices to consider for the DOE landscape including exciting options for potential upgrades to Aurora,” he said.

If Intel, working with Cray, is able to fulfill the requirements for a 1-exaflops A21 machine in 2021, the pair may be in a favorable position to fulfill the more rigorous “capable exascale” requirements outlined by ECP and CORAL-2.

The overall bidding pool for CORAL-2 is likely to include IBM, Intel, Cray and Hewlett Packard Enterprise (HPE); upstart system-maker Nvidia may also have a hand to play. HPE could come in with a GPU-based machine or an implementation of its memory-centric architecture, known as The Machine. In IBM’s court, the successor architectures to Power9 are no doubt being looked at as candidates.

And while it’s always fun dishing over the sexy processing elements (with flavors from Intel, Nvidia, AMD and IBM on the tasting menu), Snell pointed out it is perhaps more interesting to prospect the interconnect topologies in the field. “Will we be looking at systems based on an upcoming version of a current technology, such as InfiniBand or OmniPath, or a future technology like Gen-Z, or something else proprietary?” he pondered.

Stevens weighed in on the many technological challenges still at hand, ranging from memory capacity, power consumption, and systems balance, but he noted that, fortunately, the DOE has been investing in many of these issues for many years, through the PathForward program and its predecessors, created to foster the technology pipeline needed for extreme-scale computing. It’s no accident or coincidence that we’ve already run through all the names in the current “Forward” program: AMD, Cray, HPE, IBM, Intel, and Nvidia.

“Hopefully the vendors will have some good options for us to consider,” said Stevens, adding that Argonne is seeking a broad set of responses from as many vendors as possible. “This RFP is really about opening up the aperture to new architectural concepts and to enable new partnerships in the vendor landscape. I think it’s particularly important to notice that we are interested in systems that can support the integration of simulation, data and machine learning. This is reflected in both the technology specifications as well as the benchmarks outlined in the RFP.”

Other community members also shared their reactions.

“It is good to see a commitment to high-end computing by DOE, though I note that the funding has not yet been secured,” said Bill Gropp, director of the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign (home to the roughly 13-petaflops Cray “Blue Waters” supercomputer). “What is needed is a holistic approach to HEC; this addresses the next+1 generation of systems but not work on applications or algorithms.”

“What stands out [about the CORAL-2 RFP] is that it doesn’t take advantage of the diversity of systems to encourage specialization in the hardware to different data structure/algorithm choices,” Gropp added. “Once you decide to acquire several systems, you can consider specialization. Frankly, for example, GPU-based systems are specialized; they run some important algorithms very well, but are less effective at others. Rather than deny that, make it into a strength. There are hints of this in the way the different classes of benchmarks are described and the priorities placed on them [see page 23 of the RFP’s Proposal Evaluation and Proposal Preparation Instructions], but it could be much more explicit.

“Also, this line on page 23 stands out: “The technology must have potential commercial viability.” I understand the reasoning behind this, but it is an additional constraint that may limit the innovation that is possible. In any case, this is an indirect requirement. DOE is looking for viable technologies that it can support at reasonable cost. But this misses the point that using commodity (which is how commercial viability is often interpreted) technology has its own costs, in the part of the environment that I mentioned above and that is not covered by this RFP.”

Gropp, who is awaiting the results of the NSF Track 1 RFP that will award the follow-on to Blue Waters, also pointed out that NSF has only found $60 million for the next-generation system, and has (as of November 2017) cut the number of future track 2 systems to one. “I hope that DOE can convince Congress to not only appropriate the funds for these systems, but also for the other science agencies,” he said.

Adding further valuable insight into the United States’ strategy to field next-generation leadership-class supercomputers especially with regard to the “commercial viability” precept is NNSA Chief Scientist Dimitri Kusnezov. Interviewed by the Supercomputing Frontiers Europe 2018 conference in Warsaw, Poland, last month (link to video), Kusnezov characterized DOE and NNSA’s $258 million funding of the PathFoward program as “an investment with the private sector to buy down risk in next-generation technologies.”

“We would love to simply buy commercial,” he said. “It would be more cost-effective for us. We’d run in the cloud if that was the answer for us, if that was the most cost-effective way, because it’s not about the computer, it’s about the outcomes. The $250 million [spent on PathForward] was just a piece of ongoing and much larger investments we are making to try and steer, on the sides, vendor roadmaps. We have a sense where companies are going. They share with us their technology investments, and we ask them if there are things we can build on those to help modify it so they can be more broadly serviceable to large scalable architectures.

“$250 million dollars is not a lot of money in the computer world. A billion dollars is not a lot of money in the computer world, so you have to have measured expectations on what you think you can actually impact. We look at impacting the high-end next-generation roadmaps of companies where we can, to have the best output. The best outcome for us is we invest in modifications, lower-power processors, memory closer to the processor, AI-injected into the CPUs in some way, and, in the best case, it becomes commercial, and there’s a market for it, a global market ideally because then the price point comes down and when we build something there, it’s more cost-effective for us. We’re trying to avoid buying special-purpose, single-use systems because they’re too expensive and it doesn’t make a lot of sense. If we can piggyback on where companies want to go by having a sense of what might ultimately have market value for them, we leverage a lot of their R&D and production for our value as well.

“This investment we are doing buys down risk. If other people did it for us that would even be better. If they felt the urgency and invested in the areas we care about, we’d be really happy. So we fill in the gaps where we can. …But ultimately it’s not about the computer, it’s really about the purpose…the problems you are solving and do they make a difference.”

Subscribe to HPCwire's Weekly Update!

Be the most informed person in the room! Stay ahead of the tech trends with industy updates delivered to you every week!

Graphcore Introduces Larger-Than-Ever IPU-Based Pods

October 22, 2021

Graphcore and its “Intelligent Processing Units” (IPUs) emerged from stealth in 2016 and launched its second-generation IPU in 2020. While the company has also launched its IPUs in a variety of form factors over the Read more…

Quantum Chemistry Project to Be Among the First on EuroHPC’s LUMI System

October 22, 2021

Finland’s CSC has just installed the first module of LUMI, a 550-peak petaflops system supported by the European Union’s EuroHPC Joint Undertaking. While LUMI -- pictured in the header -- isn’t slated to complete i Read more…

Killer Instinct: AMD’s Multi-Chip MI200 GPU Readies for a Major Global Debut

October 21, 2021

AMD’s next-generation supercomputer GPU is on its way – and by all appearances, it’s about to make a name for itself. The AMD Radeon Instinct MI200 GPU (a successor to the MI100) will, over the next year, begin to power three massive systems on three continents: the United States’ exascale Frontier system; the European Union’s pre-exascale LUMI system; and Australia’s petascale Setonix system. Read more…

D-Wave Embraces Gate-Based Quantum Computing; Charts Path Forward

October 21, 2021

Earlier this month D-Wave Systems, the quantum computing pioneer that has long championed quantum annealing-based quantum computing (and sometimes taken heat for that approach), announced it was expanding into gate-based Read more…

LLNL Prepares the Water and Power Infrastructure for El Capitan

October 21, 2021

When it’s (ostensibly) ready in early 2023, El Capitan is expected to deliver in excess of two exaflops of peak computing power – around four times the power of Fugaku, the current top-ranked supercomputer in the wor Read more…

AWS Solution Channel

Royalty-free stock illustration ID: 537899029

Running GROMACS on GPU instances

Comparing the performance of real applications across different Amazon Elastic Compute Cloud (Amazon EC2) instance types is the best way we’ve found for finding optimal configurations for HPC applications here at AWS. Read more…

Faster Optical Switch that Operates at ‘Room Temp’ Developed by IBM, Skolkovo Researchers

October 19, 2021

Optical switching technology holds great promise for many applications but hot operating temperatures have been a key obstacle slowing progress. Now, a new optical switching device that can operate at room temperatures a Read more…

Killer Instinct: AMD’s Multi-Chip MI200 GPU Readies for a Major Global Debut

October 21, 2021

AMD’s next-generation supercomputer GPU is on its way – and by all appearances, it’s about to make a name for itself. The AMD Radeon Instinct MI200 GPU (a successor to the MI100) will, over the next year, begin to power three massive systems on three continents: the United States’ exascale Frontier system; the European Union’s pre-exascale LUMI system; and Australia’s petascale Setonix system. Read more…

D-Wave Embraces Gate-Based Quantum Computing; Charts Path Forward

October 21, 2021

Earlier this month D-Wave Systems, the quantum computing pioneer that has long championed quantum annealing-based quantum computing (and sometimes taken heat fo Read more…

LLNL Prepares the Water and Power Infrastructure for El Capitan

October 21, 2021

When it’s (ostensibly) ready in early 2023, El Capitan is expected to deliver in excess of two exaflops of peak computing power – around four times the powe Read more…

Intel Reorgs HPC Group, Creates Two ‘Super Compute’ Groups

October 15, 2021

Following on changes made in June that moved Intel’s HPC unit out of the Data Platform Group and into the newly created Accelerated Computing Systems and Graphics (AXG) business unit, led by Raja Koduri, Intel is making further updates to the HPC group and announcing... Read more…

Quantum Workforce – NSTC Report Highlights Need for International Talent

October 13, 2021

Attracting and training the needed quantum workforce to fuel the ongoing quantum information sciences (QIS) revolution is a hot topic these days. Last week, the U.S. National Science and Technology Council issued a report – The Role of International Talent in Quantum Information Science... Read more…

Eni Returns to HPE for ‘HPC4’ Refresh via GreenLake

October 13, 2021

Italian energy company Eni is upgrading its HPC4 system with new gear from HPE that will be installed in Eni’s Green Data Center in Ferrera Erbognone (a provi Read more…

The Blueprint for the National Strategic Computing Reserve

October 12, 2021

Over the last year, the HPC community has been buzzing with the possibility of a National Strategic Computing Reserve (NSCR). An in-utero brainchild of the COVID-19 High-Performance Computing Consortium, an NSCR would serve as a Merchant Marine for urgent computing... Read more…

UCLA Researchers Report Largest Chiplet Design and Early Prototyping

October 12, 2021

What’s the best path forward for large-scale chip/system integration? Good question. Cerebras has set a high bar with its wafer scale engine 2 (WSE-2); it has 2.6 trillion transistors, including 850,000 cores, and was fabricated using TSMC’s 7nm process on a roughly 8” x 8” silicon footprint. Read more…

Enter Dojo: Tesla Reveals Design for Modular Supercomputer & D1 Chip

August 20, 2021

Two months ago, Tesla revealed a massive GPU cluster that it said was “roughly the number five supercomputer in the world,” and which was just a precursor to Tesla’s real supercomputing moonshot: the long-rumored, little-detailed Dojo system. Read more…

Esperanto, Silicon in Hand, Champions the Efficiency of Its 1,092-Core RISC-V Chip

August 27, 2021

Esperanto Technologies made waves last December when it announced ET-SoC-1, a new RISC-V-based chip aimed at machine learning that packed nearly 1,100 cores onto a package small enough to fit six times over on a single PCIe card. Now, Esperanto is back, silicon in-hand and taking aim... Read more…

US Closes in on Exascale: Frontier Installation Is Underway

September 29, 2021

At the Advanced Scientific Computing Advisory Committee (ASCAC) meeting, held by Zoom this week (Sept. 29-30), it was revealed that the Frontier supercomputer is currently being installed at Oak Ridge National Laboratory in Oak Ridge, Tenn. The staff at the Oak Ridge Leadership... Read more…

Ahead of ‘Dojo,’ Tesla Reveals Its Massive Precursor Supercomputer

June 22, 2021

In spring 2019, Tesla made cryptic reference to a project called Dojo, a “super-powerful training computer” for video data processing. Then, in summer 2020, Tesla CEO Elon Musk tweeted: “Tesla is developing a [neural network] training computer... Read more…

Intel Completes LLVM Adoption; Will End Updates to Classic C/C++ Compilers in Future

August 10, 2021

Intel reported in a blog this week that its adoption of the open source LLVM architecture for Intel’s C/C++ compiler is complete. The transition is part of In Read more…

Intel Reorgs HPC Group, Creates Two ‘Super Compute’ Groups

October 15, 2021

Following on changes made in June that moved Intel’s HPC unit out of the Data Platform Group and into the newly created Accelerated Computing Systems and Graphics (AXG) business unit, led by Raja Koduri, Intel is making further updates to the HPC group and announcing... Read more…

Hot Chips: Here Come the DPUs and IPUs from Arm, Nvidia and Intel

August 25, 2021

The emergence of data processing units (DPU) and infrastructure processing units (IPU) as potentially important pieces in cloud and datacenter architectures was Read more…

AMD-Xilinx Deal Gains UK, EU Approvals — China’s Decision Still Pending

July 1, 2021

AMD’s planned acquisition of FPGA maker Xilinx is now in the hands of Chinese regulators after needed antitrust approvals for the $35 billion deal were receiv Read more…

Leading Solution Providers

Contributors

HPE Wins $2B GreenLake HPC-as-a-Service Deal with NSA

September 1, 2021

In the heated, oft-contentious, government IT space, HPE has won a massive $2 billion contract to provide HPC and AI services to the United States’ National Security Agency (NSA). Following on the heels of the now-canceled $10 billion JEDI contract (reissued as JWCC) and a $10 billion... Read more…

Quantum Roundup: IBM, Rigetti, Phasecraft, Oxford QC, China, and More

July 13, 2021

IBM yesterday announced a proof for a quantum ML algorithm. A week ago, it unveiled a new topology for its quantum processors. Last Friday, the Technical Univer Read more…

The Latest MLPerf Inference Results: Nvidia GPUs Hold Sway but Here Come CPUs and Intel

September 22, 2021

The latest round of MLPerf inference benchmark (v 1.1) results was released today and Nvidia again dominated, sweeping the top spots in the closed (apples-to-ap Read more…

10nm, 7nm, 5nm…. Should the Chip Nanometer Metric Be Replaced?

June 1, 2020

The biggest cool factor in server chips is the nanometer. AMD beating Intel to a CPU built on a 7nm process node* – with 5nm and 3nm on the way – has been i Read more…

Julia Update: Adoption Keeps Climbing; Is It a Python Challenger?

January 13, 2021

The rapid adoption of Julia, the open source, high level programing language with roots at MIT, shows no sign of slowing according to data from Julialang.org. I Read more…

Intel Unveils New Node Names; Sapphire Rapids Is Now an ‘Intel 7’ CPU

July 27, 2021

What's a preeminent chip company to do when its process node technology lags the competition by (roughly) one generation, but outmoded naming conventions make i Read more…

Frontier to Meet 20MW Exascale Power Target Set by DARPA in 2008

July 14, 2021

After more than a decade of planning, the United States’ first exascale computer, Frontier, is set to arrive at Oak Ridge National Laboratory (ORNL) later this year. Crossing this “1,000x” horizon required overcoming four major challenges: power demand, reliability, extreme parallelism and data movement. Read more…

Quantum Computer Market Headed to $830M in 2024

September 13, 2021

What is one to make of the quantum computing market? Energized (lots of funding) but still chaotic and advancing in unpredictable ways (e.g. competing qubit tec Read more…

  • arrow
  • Click Here for More Headlines
  • arrow
HPCwire