Momentum Builds for US Exascale

By Alex R. Larzelere

January 9, 2018

2018 looks to be a great year for the U.S. exascale program. The last several months of 2017 revealed a number of important developments that help put the U.S. quest for exascale on a solid foundation. In my last article, I provided a description of the elements of the High Performance Computing (HPC) ecosystem and its importance for advancing and sustaining this strategically important technology. It is good to report that the U.S. exascale program seems to be hitting the full range of ecosystem elements.

As a reminder, the National Strategic Computing Initiative (NSCI) assigned the U.S. Department of Energy (DOE) Office of Science (SC) and the National Nuclear Security Administration (NNSA) to execute a joint program to deliver capable exascale computing that emphasizes sustained performance on relevant applications and analytic computing to support their missions. The overall DOE program is known as the Exascale Computing Initiative (ECI) and is funded by the SC Advanced Scientific Computing Research (ASCR) program and the NNSA Advanced Simulation and Computing (ASC) program. Elements of the ECI include the procurement of exascale class systems and the facility investments in site preparations and non-recurring engineering. Also, ECI includes the Exascale Computing Project (ECP) that will conduct the Research and Development (R&D) in the areas of middleware (software stack), applications, and hardware to ensure that exascale systems will be productively usable to address Office of Science and NNSA missions.

In the area of hardware, the last part of 2017 revealed a number of important developments. First and most visible is the initial installation of the SC Summit system at Oak Ridge National Laboratory (ORNL) and the NNSA Sierra system at Lawrence Livermore National Laboratory (LLNL). Both systems are being built by IBM using Power9 processors with Nvidia GPU co-processors. The machines will have two Power9 CPUs per system board and will use a Mellanox InfiniBand interconnection network.

Beyond that, the architecture of each machine is slightly different. The ORNL Summit machine will use six Nvidia Volta GPUs per two Power9 CPUs on a system board and will use NVLink to connect to 512 GB of memory. The Summit machine will use a combination of air and water cooling. The LLNL Sierra machine will use four Nvidia Voltas and 256 GB of memory connected with the two Power9 CPUs per board. The Sierra machine will use only air cooling. As was reported by HPCwire in November 2017, the peak performance of the Summit machine will be about 200 petaflops and the Sierra machine is expected to be about 125 petaflops.

Installation of both the Summit and Sierra systems is currently underway with about 279 racks (without system boards) and the interconnection network already installed at each lab. Now that IBM has formally released the Power9 processors, the racks will soon start being populated with the boards that contain the CPUs, GPUs and memory. Once that is completed, the labs will start their acceptance testing, which is expected to be finished later in 2018.

Another important piece of news about the DOE exascale program is the clarification of the status of the Argonne National Laboratory (ANL) Aurora machine. This system was part of the collaborative CORAL procurement that also selected the Sierra and Summit machines. The Aurora system is being manufactured by Intel with Cray Inc. acting as the system integrator. The machine was originally scheduled to be an approximately 180 peak petaflops system using the Knights Hill third generation Phi processors. However, during SC17, we learned that Intel is removing the Knights Hill chip from its roadmap. This explains why, during the September ASCR Advisory Committee (ASCAC) meeting, Barb Helland, the Associate Director of the ASCR office, announced that the Aurora system would be delayed to 2021 and upgraded to 1,000 petaflops (aka 1 exaflops).

The full details of the revised Aurora system are still under wraps. We have learned that it is going to use “novel” processor technologies, but exactly what that means is unclear. The ASCR program subjected the new Aurora design to an independent outside review. It found, “The hardware choices/design within the node is extremely well thought through. Early projections suggest that the system will support a broad workload.” The review committee even suggested that, “The system as presented is exciting with many novel technology choices that can change the way computing is done.” The Aurora system is in the process of being “re-baselined” by the DOE. Hopefully, once that is complete, we will get a better understanding of the meaning of “novel” technologies. If things go as expected, the changes to Aurora will allow the U.S. to achieve exascale by 2021.

An important, but sometimes overlooked, aspect of the U.S. exascale program is the number of computing systems that are being procured, tested and optimized by the ASCR and ASC programs as part of the buildup to exascale. Other systems involved in this “pre-exascale” buildup include the 8.6 petaflops Mira computer at ANL and the 14 petaflops Cori system at Lawrence Berkeley National Lab (LBNL). The NNSA also has the 14.1 petaflops Trinity system at Los Alamos National Lab (LANL). Up to 20 percent of these precursor machines will serve as testbeds to enable the computing science R&D needed to ensure that the U.S. exascale systems will be able to productively address important national security and discovery science objectives.

The last, but certainly not least, bit of hardware news is that the ASCR and ASC programs are expected to start their next computer system procurement processes in early 2018. During her presentation to the U.S. Consortium for the Advancement of Supercomputing (USCAS), Barb Helland told the group that she expects that the Request for Proposals (RFP) will soon be released for the follow-ons to the Summit and Sierra systems. These systems, to be delivered in the 2021-2023 timeframe, are expected to provide in excess of an exaFLOP/s of performance. The procurement process to be used will be similar to the CORAL procurement and will be a collaboration between the DOE-SC ASCR and NNSA ASC programs. The ORNL exascale system will be called Frontier and the LLNL system will be known as El Capitan.

2017 also saw significant developments for the people element of the U.S. HPC ecosystem. As was previously reported, at last September’s ASCAC meeting, Paul Messina announced that he would be stepping down as the ECP Director on October 1st. Doug Kothe, who was previously the applications development lead, was announced as the new ECP Director. Upon taking the Director job, Kothe, with his deputy, Stephen Lee of LANL, instituted a process to review the organization and management of the ECP. At the December ASCAC conference call, Kothe reported that the review had been completed and resulted in a number of changes. These included paring down ECP from five to four components (applications development, software technology, hardware and integration, and project management). He also reported that ECP has implemented a more structured management approach that includes a revised work breakdown structure (WBS), additional milestones, new key performance parameters and risk management approaches. Finally, the new ECP Director reported that they had established an Extended Leadership Team with a number of new faces.

Another important element of the HPC ecosystem is the people doing the R&D and other work needed to keep the ecosystem going. The DOE ECI involves a huge number of people. Last year, about 500 researchers attended the ECP Principal Investigator meeting, and there are many more involved in other DOE/NNSA programs and from industry. The ASCR and ASC programs are involved with a number of programs to educate and train future members of the HPC ecosystem. Such programs include the ASCR and ASC co-funded Computational Science Graduate Fellowship (CSGF) and the Early Career Research Program. The NNSA offers similar opportunities. Both the ASCR and ASC programs continue to coordinate with National Science Foundation educational programs to ensure that America’s top computational science talent continues to flow into the ecosystem.

Finally, in addition to people and hardware, the U.S. program continues to develop the software stack (aka middleware) and end users’ applications to ensure that exascale will be used productively. Doug Kothe reported that ECP has adopted standard Software Development Kits (SDKs). These SDKs are designed to support the goal of building a comprehensive, coherent software stack that enables application developers to productively write highly parallel applications that effectively target diverse exascale architectures. Kothe also reported that ECP is making good progress in developing applications software. This includes the implementation of innovative approaches, including machine learning, to utilize the GPUs that are part of the future exascale computers.

All in all, the last several months of 2017 have set the stage for a very exciting 2018 for the U.S. exascale program. It has been about 5 years since the ORNL Titan supercomputer came onto the stage at #1 on the TOP500 list. Over that time, other more powerful DOE computers have come online (Trinity, Cori, etc.) but they were overshadowed by Chinese and European systems. It remains unclear whether or not the upcoming exascale systems will put the U.S. back on the top of the supercomputing world. However, the recent developments help to reassure us that the country is not going to give up its computing leadership position without a fight. That is great news because for more than 60 years, the U.S. has sought leadership in high performance computing for the strategic value it provides in the areas of national security, discovery science, energy security, and economic competitiveness.

About the Author

Alex Larzelere is a senior fellow at the U.S. Council on Competitiveness, the president of Larzelere & Associates Consulting and HPCwire’s policy editor. He is currently a technologist, speaker and author on a number of disruptive technologies that include: advanced modeling and simulation; high performance computing; artificial intelligence; the Internet of Things; and additive manufacturing. Alex’s career has included time in federal service (working closely with DOE national labs), private industry, and as founder of a small business. Throughout that time, he led programs that implemented the use of cutting edge advanced computing technologies to enable high resolution, multi-physics simulations of complex physical systems. Alex is the author of “Delivering Insight: The History of the Accelerated Strategic Computing Initiative (ASCI).”
