Roughly a year ago, Doug Kothe was appointed director of the U.S. Exascale Computing Project (ECP). He stepped in for ECP founding director Paul Messina, who returned to Argonne National Laboratory after two years at the helm of ECP. Yesterday, Kothe, who is also one of HPCwire’s 2018 People to Watch, issued an update on ECP progress and directions.
A fair amount has happened since Kothe took office last September. The standing up of the pre-exascale machine Summit by the ORNL/OLCF staff may be the most visible event (see HPCwire article, ORNL Summit Supercomputer Is Officially Here). Just to be clear, while the ECP doesn’t procure or stand up the systems at the national labs, the project team does work closely with the HPC facilities at the labs to define software and application readiness, and it will use Summit to prepare for the future exascale platforms. Last month Kothe announced the appointment of Lori Diachin from Lawrence Livermore National Laboratory (LLNL) as the new ECP Deputy Director. And in March, the ECP, which focuses on exascale software, refined its mission slightly to the main points bulleted below.
- Parallelism: Exascale systems will have parallelism (also referred to as concurrency) a thousand-fold greater than petascale systems. Developing systems and applications software is already challenging at the petascale, and increasing concurrency by a factor of a thousand will make software development efforts even more difficult. To mitigate this complexity, a portion of the project’s R&D investments will be on tools that improve the programmability of exascale systems.
- Memory and Storage: In today’s HPC systems, moving data from computer memory into the CPU consumes the greatest amount of time (compared to basic math operations). This data movement challenge is already an issue in petascale systems, and it will become a critical issue in exascale systems. R&D is required to develop memory and storage architectures that provide timely access to and storage of information at the anticipated computational rates.
- Reliability: Exascale systems will contain significantly more components than today’s petascale systems. Achieving system-level reliability, especially with designs based on projected reductions in power, will require R&D so that systems can adapt dynamically to a potentially constant stream of transient and permanent component failures, and so that applications remain resilient to system and device failures and still produce accurate results.
- Energy Consumption: To state the obvious, the operating cost of an exascale system built on current technology would be prohibitive. Through pre-ECP programs like Fast Forward and Design Forward and current ECP elements like PathForward, engineering improvements identified with the vendor partners have the potential to significantly reduce the power required. Current estimates indicate initial exascale systems could operate in the range of 20-40 megawatts (MW). Achieving this efficiency level by the mid-2020s requires R&D beyond what the industry vendors had projected on their product roadmaps.
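To put the 20-40 MW estimate in perspective, the short sketch below converts a power budget into the energy efficiency it implies. It assumes, for illustration only, a machine delivering exactly one exaflop (10^18 floating-point operations per second); whether that figure is peak or sustained is not specified in the ECP update, and the function name is hypothetical.

```python
# Assumption: "exascale" is taken as exactly 1e18 floating-point
# operations per second; the ECP update does not say peak vs. sustained.
EXAFLOP = 1e18


def gflops_per_watt(flops: float, megawatts: float) -> float:
    """Return the energy efficiency, in GFLOPS per watt, implied by a
    given compute rate (FLOPS) and power draw (MW)."""
    watts = megawatts * 1e6
    return flops / watts / 1e9


# The two ends of the power range quoted in the energy bullet above.
for mw in (20, 40):
    print(f"{mw} MW -> {gflops_per_watt(EXAFLOP, mw):.0f} GFLOPS/W")
```

Under these assumptions, the quoted range corresponds to roughly 25 to 50 GFLOPS per watt across the whole system.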
Below is a link to a video interview with Kothe, conducted earlier this month by ECP communications lead Mike Bernhardt, in which Kothe discusses ECP progress and plans, including, among other things, the new ECP co-design machine learning center (ExaLearn); the ExaSMR project (SMR stands for small modular reactor), which is aimed at high-fidelity modeling of coupled neutronics and fluid dynamics; and the role of performance measurement within ECP.
Link to update post on the ECP website: https://www.exascaleproject.org/newsletter/leadership-collaboration-and-a-focus-on-key-exascale-challenges/