EXAALT-ing Molecular Dynamics to the Power of Exascale

June 20, 2023

June 20, 2023 — Stronger. Lighter. More durable. These physical qualities and other properties, such as conductivity, heat resistance, and reactivity, are key to developing novel materials with exceptional performance for various applications, including national security. In nuclear energy production, for example, commercial fusion reactors are a promising technology for producing clean, affordable, limitless energy, but they will require newly engineered materials that can withstand the punishing conditions inside the reactor for sustained, long-term operation. To create these next-generation materials and accelerate their development, scientists need enhanced computational tools for understanding material behavior at the most fundamental level.

Figure 1. This simulation shows the growth of helium bubbles (shown in purple) in tungsten in conditions relevant to fusion reactors. Achieving realistic growth rates is essential to capture the proper morphological evolution of the material’s surface. EXAALT extends simulation timescales to those that are closer to realistic conditions.

First demonstrated in the late 1950s, molecular dynamics (MD) simulations have become a key capability for studying material behavior at the atomic scale. These virtual experiments allow scientists to computationally analyze the physical movement and trajectories of individual atoms and molecules in a system as a function of time and fully evaluate their response to external conditions as the system evolves. MD provides a dependable predictive capability for understanding material properties at the finest scales and in regimes that are often inaccessible in a laboratory setting. The raw computing power of exascale machines offers the ability to perform larger, more complex, and precise MD simulations—but there’s a catch.

Traditional MD algorithms numerically solve equations of motion for each atom in a physical system through a sequential series of time steps. To efficiently utilize computational resources and optimize time to solution, the larger problem is broken down into smaller subdomains that are distributed equally across individual processors, allowing many independent calculations to be performed simultaneously using a parallel processing algorithm called domain decomposition. When a time step is complete, neighboring subdomains exchange information about what they’ve “learned,” and the process repeats in a succession of loops until a terminating function—such as a preprogrammed number of time steps—stops the iterative process. However, on larger machines with tens of thousands or more processors, each subdomain contains fewer and fewer atoms and there is less work to be done locally. Eventually, progress hits a communication wall, where the overhead of synchronizing the work between subdomains exceeds the subdomain computations and scaling breaks down. This scaling crunch limits conventional MD algorithms to submicrosecond timescales—too short to assess the longer-term structural effects of stress, temperature, and pressure on a material.
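The loop structure described above can be made concrete with a minimal Python sketch. This toy 1D harmonic chain, with subdomains emulated serially, is purely illustrative—the decomposition scheme and all names are assumptions, not EXAALT or LAMMPS code—but it shows where the per-step exchange sits relative to the local force and integration work.

```python
# Toy illustration of domain-decomposed MD time stepping (serial emulation).
# The 1D harmonic-chain model and all names are illustrative only; real codes
# such as LAMMPS distribute subdomains across MPI ranks and GPUs.
import numpy as np

N_ATOMS, N_DOMAINS, N_STEPS, DT = 64, 4, 1000, 0.01
K = 1.0  # spring constant between neighboring atoms
x = np.arange(N_ATOMS) + 0.1 * np.random.default_rng(0).standard_normal(N_ATOMS)
v = np.zeros(N_ATOMS)
bounds = np.array_split(np.arange(N_ATOMS), N_DOMAINS)  # atom indices per "rank"

def forces(pos):
    # Nearest-neighbor harmonic chain with free ends (evaluated globally for
    # simplicity; each rank would see only its own atoms plus ghost neighbors).
    f = np.zeros_like(pos)
    d = pos[1:] - pos[:-1] - 1.0        # deviation from rest length 1.0
    f[:-1] += K * d
    f[1:]  -= K * d
    return f

for step in range(N_STEPS):
    f = forces(x)
    for rank in bounds:                 # "parallel" work: one loop body per rank
        v[rank] += 0.5 * DT * f[rank]   # velocity Verlet: half kick
        x[rank] += DT * v[rank]         # drift
    # Communication step: ranks exchange updated boundary ("halo") atoms.
    # Here shared memory makes the exchange implicit; in a real MPI code this
    # synchronization is the cost that dominates once subdomains hold too few atoms.
    f = forces(x)
    for rank in bounds:
        v[rank] += 0.5 * DT * f[rank]   # second half kick
print("mean displacement:", np.mean(np.abs(x - np.arange(N_ATOMS))))
```

As the atom count per rank shrinks, the two inner loops do less and less work while the per-step exchange stays fixed—the communication wall in miniature.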

As part of the Exascale Computing Project (ECP), a collaborative team of scientists, software developers, and hardware integration specialists from across the Department of Energy (DOE) has developed the Exascale Atomistics for Accuracy, Length, and Time (EXAALT) application to bring MD into the exascale era. Danny Perez, a physicist within the Theoretical Division at Los Alamos National Laboratory and the project’s principal investigator, says, “We’ve implemented new scalable methods that allow us to access as much of the accuracy, length, and time space as possible on exascale machines by rethinking our methods and developing algorithms that go around some of the bottlenecks that limited scaling previously.” Such a capability has potential to revolutionize MD.

‘Scaling’ the Communication Wall

The EXAALT application integrates three cutting-edge MD computer codes: the long-renowned LAMMPS (Large-Scale Atomic/Molecular Massively Parallel Simulator) classical MD code; the ParSplice (Parallel Trajectory Splicing) algorithm; and the LATTE (Los Alamos Transferable Tight-Binding for Energetics) code, which is used for simulating quantum aspects of a physical system. Early on, the project team had in hand an open-source version that worked well for huge systems on short timescales or for very small systems on long timescales. The goal was to extend this long-time capability (picoseconds to milliseconds) to intermediate-size systems (hundreds to millions of atoms) at two levels of accuracy: one where machine-learning (ML) interatomic potentials are used to approximate quantum physics and another using a simplified quantum representation of the system that is much more affordable than conventional first-principles quantum approaches.

Figure 2. The EXAALT framework provides a task (blue) and data (green) management infrastructure that can orchestrate the execution of large numbers of MD simulations (shown using the LAMMPS MD engine). This framework is used to implement the ParSplice method, which uses a parallel-in-time approach to lengthen the timescales that can be achieved.

ECP projects all have a “figure of merit,” which is a specified performance target. Progress toward this target is assessed by running the codes on challenge problems that address real-world scientific questions and serve as a proof of concept for future work. EXAALT is subject to two underlying challenge problems related to materials for nuclear energy. In the first, the team’s ML models simulate the effects of plasma, neutron flux, and extremely high temperatures on the walls of fusion reactors to further the development of more resilient, long-lasting materials. The second problem uses quantum-based models to better understand the evolution of nuclear fuels in fission power plants. This work helps to address structural material degradation that affects production efficiency. Perez says, “The physics and the chemistry of these materials is way more complex than just a simple metal, so capturing quantum effects is very important to understanding the material behavior and developing better performing materials for these applications.”

The EXAALT framework and its management infrastructure orchestrate all the different calculations and stitch them together seamlessly to generate high-quality trajectories of ensembles of atoms in materials. The framework implements the ParSplice method, which uses a parallel-in-time approach to lengthen the timescales that can be achieved. “We did a lot of work to figure out how to enable the basic codes that drive the whole framework so that we could speed up the time steps and synchronize the data into interesting trajectories,” says Perez. “We added an extra layer of domain decomposition that enables time acceleration by running different subregions of the system independently. We can accelerate one small patch at a time and run multiple copies of each patch, which allows us to leverage more parallelism.” With this approach, the simulation timescale becomes controlled by the morphological changes happening locally rather than globally, extending the timescales that can be achieved.
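The core idea behind trajectory splicing can be sketched on a toy discrete-state model. In this conceptual Python example—an assumption-laden stand-in, not the ParSplice implementation—many short segments are generated from each state (in EXAALT these would run concurrently on thousands of workers), and a splicer consumes any segment whose start matches the current end state, assembling one long trajectory from short, independently generated pieces.

```python
# Conceptual sketch of trajectory splicing on a toy 3-state Markov model.
# Real ParSplice speculates from states the system is likely to visit and
# splices segments generated in parallel; here segments are built up front.
import random
from collections import defaultdict

random.seed(1)
TRANSITIONS = {0: [(0, 0.9), (1, 0.1)],
               1: [(1, 0.8), (0, 0.1), (2, 0.1)],
               2: [(2, 0.95), (1, 0.05)]}

def short_segment(state, length=20):
    """One short 'MD' segment: a burst of Markov steps from a given state."""
    traj = [state]
    for _ in range(length):
        states, probs = zip(*TRANSITIONS[traj[-1]])
        traj.append(random.choices(states, probs)[0])
    return traj

# Producers: generate many segments starting from each known state.
bank = defaultdict(list)
for s in TRANSITIONS:
    for _ in range(50):
        bank[s].append(short_segment(s))

# Splicer: consume segments whose start matches the current end state,
# building one long trajectory out of short, independent pieces.
current, spliced = 0, [0]
while bank[current]:
    seg = bank[current].pop()
    spliced.extend(seg[1:])
    current = seg[-1]
print(f"spliced trajectory of {len(spliced)} steps, ending in state {current}")
```

Because each segment begins exactly where the previous one ended, the spliced trajectory is statistically indistinguishable from one long sequential run—which is what lets wall-clock-parallel workers buy simulated time.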

Another major focus of the project was learning how to port the models efficiently to GPU-based machines. Early in the project, the team observed that as the code ran on increasingly more advanced machines using GPUs, the application performance—as demonstrated by the challenge problems—was falling further behind the peak performance of the hardware. “These models have lots of nested loops that you can unroll and unfold in many different ways. Finding the right mapping of the physics onto the hardware is tricky,” says Perez. “We developed clever ways to better exploit symmetry and memory and to manipulate the loops to maximize performance.”
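The payoff from restructuring loops can be illustrated with a simple pairwise-distance kernel, a common building block in MD. This Python/numpy sketch is only an analogy for the kind of loop manipulation Perez describes—it is not EXAALT or SNAP code—but it shows the same arithmetic expressed two ways, with very different memory behavior and speed.

```python
# Illustration of how restructuring the same computation changes performance.
# A pairwise-distance kernel stands in for the nested loops of an ML
# interatomic potential; this is not EXAALT code.
import time
import numpy as np

rng = np.random.default_rng(2)
pos = rng.standard_normal((500, 3))  # 500 toy atom positions

def pair_distances_naive(p):
    n = len(p)
    d = np.zeros((n, n))
    for i in range(n):               # nested loops: high overhead, poor locality
        for j in range(n):
            d[i, j] = np.sqrt(((p[i] - p[j]) ** 2).sum())
    return d

def pair_distances_restructured(p):
    # Same arithmetic, reorganized so whole arrays stream through memory at
    # once (analogous to unrolling/fusing loops to match the hardware).
    diff = p[:, None, :] - p[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

t0 = time.perf_counter(); d1 = pair_distances_naive(pos)
t1 = time.perf_counter(); d2 = pair_distances_restructured(pos)
t2 = time.perf_counter()
assert np.allclose(d1, d2)           # identical physics, different mapping
print(f"naive: {t1 - t0:.3f}s  restructured: {t2 - t1:.3f}s")
```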

One of those tricks was to rewrite from scratch a fundamental physical model called SNAP (Spectral Neighbor Analysis Potential), which uses ML techniques to accurately predict the underlying physics of material behavior using high-accuracy quantum calculations. Performance experts at the National Energy Research Scientific Computing Center and NVIDIA were brought in to identify areas for improvement in the SNAP kernel and help the team implement optimization strategies that led to an initial 25x speedup in the code. Fast forward to early 2023, and the team’s enhanced EXAALT code is surpassing all expectations. A recent run using Oak Ridge National Laboratory’s Frontier supercomputer—DOE’s first exascale machine—returned exceptional results. Perez says, “We project more than a 500x speedup when we extrapolate our results to the machine’s full capability, almost a factor of 10 higher than our target.”

An Unexpected Bonus

With the EXAALT code showing increasingly strong scaling and performance, the team has turned its attention to building a model-generation capability based on the ML models used to run the simulations. “We want to show how this tool can drastically speed up the whole scientific workflow from picking a system, to obtaining a model, to performing the run,” explains Perez. “Developing the specific machine-learning models that drive the simulations is time consuming and tedious work. We discovered that the ML infrastructure we were using to generate the models could also automate the entire workflow.”

Figure 3. Danny Perez, a physicist in the Theoretical Division at Los Alamos National Laboratory, leads ECP’s EXAALT application development effort.

Rather than relying on an expert’s time and knowledge to define a small data set that contains the right amount of information to include in a model, the team’s approach allows the machine to figure out the right “ingredients.” Perez adds, “We decided that instead of guessing what will happen in a simulation, we would try to capture everything that’s physically sensible. We wanted the most diverse data set possible so that in a simulation we would never encounter a configuration that’s unlike anything used in our training data, but this requires generating a massive amount of data. So, we framed this as an optimization problem where we can quantify the diversity of the data set and create new configurations without generating redundant information.”
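One simple way to make “quantify the diversity of the data set” concrete is greedy farthest-point selection over per-configuration descriptor vectors: each new training configuration is the one farthest from everything already chosen, so redundant configurations are never added. This Python sketch is an illustrative stand-in under that assumption; the team’s actual optimization formulation may differ.

```python
# Greedy farthest-point selection: one way to quantify data-set diversity and
# avoid redundant configurations. Illustrative only; not the EXAALT method.
import numpy as np

rng = np.random.default_rng(3)
descriptors = rng.standard_normal((2000, 16))   # stand-in for per-config features

def select_diverse(X, k):
    """Pick k rows of X that greedily maximize the minimum pairwise distance."""
    chosen = [0]                                 # seed with an arbitrary point
    min_dist = np.linalg.norm(X - X[0], axis=1)  # distance to nearest chosen point
    for _ in range(k - 1):
        nxt = int(np.argmax(min_dist))           # farthest from the chosen set
        chosen.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[nxt], axis=1))
    return chosen

subset = select_diverse(descriptors, 50)
print("min pairwise distance in subset:",
      min(np.linalg.norm(descriptors[i] - descriptors[j])
          for i in subset for j in subset if i < j))
```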

The team has also been exploring ways to integrate quantum aspects, like charge transfer, into the ML models to create an intermediate model that would have features of both. “We have demonstrated an integrated capability where you can generate these configurations, do the quantum calculations, and train the potentials all inside one big run. Between now and the end of ECP, we want to transition this prototype into a robust, production-quality version of the machine-learning framework,” says Perez. In addition, the team is looking to make MD simulations more cost efficient. Since MD algorithms resolve each atom in a system, the computational cost of running larger systems with more sophisticated codes could become prohibitive. He adds, “We are working with a world expert to develop methodologies on the machine-learning side so that we can obtain the extra physics at a reasonable price.”

Taking the Long View

The notable progress the EXAALT team has made over the last seven years builds on decades of MD code development. Aidan Thompson, the leader of LAMMPS development at Sandia National Laboratories, says, “The creation of the EXAALT framework and the arrival of exascale computing resources are game-changing events for LAMMPS, allowing users for the first time the freedom to select the accuracy, length, timescale, and chemical composition appropriate to their particular science applications, rather than being restricted to certain niche areas.”

Figure 4. Aidan Thompson, a researcher in the Center for Computing Research at Sandia National Laboratories, leads the LAMMPS development effort.

The work is also a testament to the ECP research paradigm. “Typically, sponsors want the scientific potential up front, and the capability advancements needed to make that possible are considered afterward. With ECP, the capability was at the forefront, so it gave us the opportunity to focus on the long view of code development. We had time to decide what features we needed and to optimize the code as aggressively as we could to reach the necessary performance,” states Perez. He is also quick to point out that the ability to work across the DOE complex as part of a multidisciplinary team was paramount to EXAALT’s success. He notes, “This work would not have been possible without this integrated team that brings together the skills of physics, methods, machine-learning, and performance experts. This whole ecosystem of people all pushing in the same direction at the same time gave the project momentum and enabled us to meet and exceed our performance goals.”

The team will continue to test EXAALT’s performance and ensure scalability on Frontier and other future exascale machines. Through demonstration of the challenge problems, the team has shown that EXAALT can be applied to issues that directly impact the national security side of DOE’s mission space, but the code’s relevance extends beyond nuclear energy. Perez says, “Our goal is that users will be able to use EXAALT on exascale machines to run MD simulations directly in the conditions relevant to their applications by giving them access to the entire accuracy, length, and time space.” In addition, the team’s ML-based MD computational workflow could cut development time of new materials from decades to just a few years by using computer simulations almost exclusively. The ability to achieve atomistic materials predictions at the engineering scale would indeed be an MD revolution.


Source: Caryn Meissner, LLNL and ECP
