How the United States Invests in Supercomputing

By Maciej Chojnowski

November 14, 2018

At the end of October, the U.S. Department of Energy unveiled Sierra – the second leadership-class supercomputer delivered as a result of the CORAL collaboration between Lawrence Livermore, Oak Ridge and Argonne national laboratories. Earlier this year, the first CORAL system, Summit, was launched and became the world’s fastest system in June. In this contributed interview, conducted ahead of SC18, Maciej Chojnowski, editor with the Interdisciplinary Centre for Mathematical and Computational Modelling at the University of Warsaw, discusses the details of the CORAL project with Dr. Dimitri Kusnezov from the U.S. Department of Energy.

Disclaimer: The views expressed in the responses are personal and do not necessarily represent the view of the U.S. Department of Energy or the United States.

Maciej Chojnowski: At the moment, the CORAL systems Summit and Sierra are the number one and number three [now number two] most powerful supercomputers on the planet. In an interview during the Supercomputing Frontiers Europe 2018 conference in Warsaw, you said that what counts most in HPC is not computers themselves but the purpose we try to achieve using them. What is the purpose of these systems? For what sort of work will they be harnessed?

Dimitri Kusnezov, Supercomputing Frontiers Europe 2018 in Warsaw, Poland

Dimitri Kusnezov: You are asking the right questions. We use public money to support the development and delivery of these remarkable supercomputers. But they are just tools – so we need to be sure that it is the right tool for the problems people care most about, and that the return on the investment (ROI) far exceeds the cost of these systems – increasingly in the hundreds of millions of dollars. The ranking of the systems should not be interpreted as a measure of the success or overall utility of such a tool. It can be a distraction that drives systems to perform against metrics that do not accurately measure the needed workflow you may otherwise optimize your computer design to.

We do look hard, years in advance, at who will use these systems, whether the cloud is a more effective option, whether a collection of small systems is more cost effective, and so forth. It is all based on the classes of problems you believe are worth that scale of investment. These large systems take years to plan, co-design and deliver, and we develop our computer codes during this period so we can test performance on early prototypes prior to full delivery. So to your question, what is the ROI and what will these systems do? Of our two big systems, one is for a well-defined set of nuclear security questions and the other one is for the open scientific and technology community. Interestingly, fundamental and applied science issues are common to both systems; they just answer different types of questions.

I think of the ROI through several measures. For Sierra, we have classes of problems that impact billion-dollar-class decisions. So it is not hard to make that calculation. For decisions based on simulation, the confidence in your predictions is very important. We call this ‘uncertainty quantification’ or UQ and it can be computationally demanding – more than we can accommodate even at this scale. I described some of this in my keynote last winter at SCFE2018 (which I wrote up in https://arxiv.org/pdf/1804.11002.pdf). For applied science and technology, the ROI is on the order of 500 to 1. That is, for every dollar invested, you have a return of somewhere around 500 dollars in productivity or market value or similar. For pure science, you can measure the impact of the work produced and assess whether the results have had fundamental changes in the understanding of key problems: are they the top discovery, in the top 5 or top 10 or less impactful, for example. An objective look can help gauge your effectiveness in using these systems. But this type of return depends on planning the use cases in advance in order to maximize the impact of the supercomputer during its typical 4 to 5 year lifetime in service.

Chojnowski: The CORAL system designs in Sierra and Summit create “a new breed of computer,” according to Nvidia CEO Jensen Huang this summer. It is expected to let scientists harness AI hand in hand with simulation. What are the key technological components that make it possible and why is it important?

Kusnezov: We are in a remarkable period of technology change today. Artificial intelligence is of increasing relevance to all our activities, across the spectrum from sensor data and detection, through learning methods such as machine learning (ML), to decisions based on searching, planning and proving, to actions such as autonomy or human/AI interfaces. When we started down the exascale path some years back, it was not with AI in mind. But as technologies have developed, we have found that these hybrid architectures are well suited to helping us better understand the ML piece of AI in more detail. For these particular systems, we did push in a number of directions, from the design of the motherboards and water cooled compute nodes, to pushing limits for the GPU resilience, scheduling, diagnostics, burst buffers, switch based collectives, GPFS performance & scalability and so forth. But these are productivity enablers for the larger purpose of taking initial small steps in starting to drive AI into model based prediction. We will begin to study how to augment computer simulation with learning based methods, recognizing that we have a user base invested in traditional computer simulation. So a gentle turn is needed in our large systems. As tools, these systems can help us understand ML in different ways, but we know they are not specifically optimized for that. Today, there is a remarkable global industry developing in novel AI based hardware, designed and fabbed for AI which can offer remarkable speed-ups. We are certainly pursuing that as well.

Chojnowski: IBM’s architecture for the CORAL systems integrates the data analysis capabilities of IBM Power9 CPUs with the deep learning capabilities of GPUs. What are the expected results of this architecture?

Kusnezov: Aside from the power efficiency, network advantage of the fat nodes and the complex memory hierarchies, what really caught my eye some years back was the coherent memory space on the nodes. That and the number of PCIe slots. It opened the door for exploration of how neuromorphic or machine learning technologies could coexist with our more traditional approaches to computer simulation. I do believe that the future of predictive simulation will require a bold step into data centered AI approaches and these architectures are suited for taking some first steps in understanding how you integrate machine learning methods with the more traditional approaches to model-based prediction. That is what I really find exciting.

Why is this important? Our department should better be called the Department of Hard Problems, or the Department of Modelling. We drive computing and simulation not as a means to an end, but as a tool to help us answer questions that are consequential and important to get right. From nuclear security to the energy sector and cyber, there are decisions we have to make with limited funds to ensure against situations we hope never happen. Simulation provides us with a means to understand problems we face and provide options. But simulations without rigorous bounds on their validity are neither predictions nor actionable. For that reason we have been pushing validation, verification, uncertainty quantification (UQ) and the many methods needed to help bracket our confidence in any prediction. These architectures move us closer to those which will ultimately be able to help us with this – more intelligent, better able to deal with the deluge of experimental and numerical data, and more cognitive. These architectures, while ‘smarter’, will help us start this transition to tackling the UQ problem, which I believe to be NP-Hard, in the sense of complexity theory, and consequently problematic on any von Neumann architecture anyway.

Chojnowski: In April this year, it was announced that DOE intends to spend $1.8 billion on building two, or possibly three, exascale supercomputers under the CORAL-2 program. Both the CORAL and CORAL-2 programs mandate architecturally diverse machines: CORAL systems are the results of the collaboration of IBM, Nvidia and Mellanox, while A21 will be produced by Intel and Cray. Why this diversity?

Kusnezov: The diversity of the industry and the competition that emerges when we challenge the technology sector with new designs helps drive innovation and is an important part of the technology development cycle. We look for approaches where industry can collaborate on solutions together, leveraging their strengths and product development paths, to allow for novel architectures and hardware/software approaches to otherwise hard problems through such cost and risk sharing. We also do quite a bit of non-recoverable engineering (NRE) work with the industry to help develop technologies that would not otherwise be available for us, and which might be aligned with their technology roadmaps. The larger the pool of companies out there, the richer the set of ideas that emerge.

Dr. Dimitri Kusnezov was the keynote speaker at Supercomputing Frontiers Europe 2018 in Warsaw, Poland. You can watch his talk and interview with him in the MEDIA section on the SCFE website.

Registration for SCFE2019 is already open.

This article originally appeared on the ICM website.
