EXAALT-ing Molecular Dynamics to the Power of Exascale

June 20, 2023

June 20, 2023 — Stronger. Lighter. More durable. These physical qualities and other properties, such as conductivity, heat resistance, and reactivity, are key to developing novel materials with exceptional performance for various applications, including national security. In nuclear energy production, for example, commercial fusion reactors are a promising technology for producing clean, affordable, limitless energy, but they will require newly engineered materials that can withstand the punishing conditions inside the reactor for sustained, long-term operation. To create these next-generation materials and accelerate their development, scientists need enhanced computational tools for understanding material behavior at the most fundamental level.

Figure 1. This simulation shows the growth of helium bubbles (shown in purple) in tungsten in conditions relevant to fusion reactors. Achieving realistic growth rates is essential to capture the proper morphological evolution of the material’s surface. EXAALT extends simulation timescales to those that are closer to realistic conditions.

First demonstrated in the late 1950s, molecular dynamics (MD) simulations have become a key capability for studying material behavior at the atomic scale. These virtual experiments allow scientists to computationally analyze the physical movement and trajectories of individual atoms and molecules in a system as a function of time and fully evaluate their response to external conditions as the system evolves. MD provides a dependable predictive capability for understanding material properties at the finest scales and in regimes that are often inaccessible in a laboratory setting. The raw computing power of exascale machines offers the ability to perform larger, more complex, and more precise MD simulations—but there’s a catch.

Traditional MD algorithms numerically solve equations of motion for each atom in a physical system through a sequential series of time steps. To efficiently utilize computational resources and optimize time to solution, the larger problem is broken down into smaller subdomains that are distributed equally across individual processors, allowing many independent calculations to be performed simultaneously using a parallel processing algorithm called domain decomposition. When a time step is complete, neighboring subdomains exchange information about what they’ve “learned,” and the process repeats in a succession of loops until a terminating function—such as a preprogrammed number of time steps—stops the iterative process. However, on larger machines with tens of thousands of processors or more, each subdomain contains fewer and fewer atoms, and there is less work to be done locally. Eventually, progress hits a communication wall, where the overhead of synchronizing the work between subdomains exceeds the subdomain computations and scaling breaks down. This scaling crunch limits conventional MD algorithms to submicrosecond timescales—too short to assess the longer-term structural effects of stress, temperature, and pressure on a material.
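
To make the communication wall concrete, here is a minimal cost-model sketch in Python (not drawn from the EXAALT or LAMMPS code) that assumes the local force work shrinks with the number of atoms per subdomain while the per-step synchronization cost stays roughly fixed; all constants are illustrative placeholders, not measured values.

```python
# Toy cost model (not EXAALT or LAMMPS code) illustrating the "communication
# wall" in domain-decomposed MD: per-step compute shrinks as the atoms are
# spread over more processors, while the per-step synchronization cost stays
# roughly fixed. All constants below are illustrative placeholders.

N_ATOMS = 1_000_000        # total atoms in the simulated system
T_FORCE_PER_ATOM = 1e-7    # seconds of local force work per atom (assumed)
T_SYNC_PER_STEP = 5e-4     # seconds of halo exchange/synchronization (assumed)

def time_per_step(n_ranks: int) -> float:
    """Estimated wall-clock time for one MD time step on n_ranks processors."""
    local_atoms = N_ATOMS / n_ranks           # atoms per subdomain
    compute = local_atoms * T_FORCE_PER_ATOM  # local force evaluation + integration
    return compute + T_SYNC_PER_STEP          # communication term does not shrink

if __name__ == "__main__":
    for ranks in (16, 256, 4_096, 65_536):
        step = time_per_step(ranks)
        frac_comm = T_SYNC_PER_STEP / step
        print(f"{ranks:>6} ranks: {step * 1e3:7.3f} ms/step, "
              f"{frac_comm:5.1%} of the step spent synchronizing")
```

Under these assumed numbers, the fixed synchronization term dominates once the subdomains become small, which is exactly the scaling breakdown described above.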

As part of the Exascale Computing Project (ECP), a collaborative team of scientists, software developers, and hardware integration specialists from across the Department of Energy (DOE) has developed the Exascale Atomistics for Accuracy, Length, and Time (EXAALT) application to bring MD into the exascale era. Danny Perez, a physicist within the Theoretical Division at Los Alamos National Laboratory and the project’s principal investigator, says, “We’ve implemented new scalable methods that allow us to access as much of the accuracy, length, and time space as possible on exascale machines by rethinking our methods and developing algorithms that go around some of the bottlenecks that limited scaling previously.” Such a capability has the potential to revolutionize MD.

‘Scaling’ the Communication Wall

The EXAALT application integrates three cutting-edge MD computer codes: the long-renowned LAMMPS (Large-Scale Atomic/Molecular Massively Parallel Simulator) classical MD code; the ParSplice (Parallel Trajectory Splicing) algorithm; and the LATTE (Los Alamos Transferable Tight-Binding for Energetics) code, which is used for simulating quantum aspects of a physical system. Early on, the project team had in hand an open-source version that worked well for huge systems on short timescales or for very small systems on long timescales. The goal was to extend this long-time capability (picoseconds to milliseconds) to intermediate-size systems (hundreds to millions of atoms) at two levels of accuracy: one where machine-learning (ML) interatomic potentials are used to approximate quantum physics and another using a simplified quantum representation of the system that is much more affordable than conventional first-principles quantum approaches.

Figure 2. The EXAALT framework provides a task (blue) and data (green) management infrastructure that can orchestrate the execution of large numbers of MD simulations (shown using the LAMMPS MD engine). This framework is used to implement the ParSplice method, which uses a parallel-in-time approach to lengthen the timescales that can be achieved.

ECP projects all have a “figure of merit,” which is a specified performance target. Progress toward this target is assessed by running the codes on challenge problems that address real-world scientific questions and serve as a proof-of-concept for future work. EXAALT is subject to two underlying challenge problems related to materials for nuclear energy. In the first, the team’s ML models simulate the effects of plasma, neutron flux, and extremely high temperatures on the walls of fusion reactors to further the development of more resilient, long-lasting materials. The second problem uses quantum-based models to better understand the evolution of nuclear fuels in fission power plants. This work helps to address structural material degradation that affects production efficiency. Perez says, “The physics and the chemistry of these materials is way more complex than just a simple metal, so capturing quantum effects is very important to understanding the material behavior and developing better performing materials for these applications.”

The EXAALT framework and its management infrastructure orchestrate all the different calculations and stitch them together seamlessly to generate high-quality trajectories of ensembles of atoms in materials. The framework implements the ParSplice method, which uses a parallel-in-time approach to lengthen the timescales that can be achieved. “We did a lot of work to figure out how to enable the basic codes that drive the whole framework so that we could speed up the time steps and synchronize the data into interesting trajectories,” says Perez. “We added an extra layer of domain decomposition that enables time acceleration by running different subregions of the system independently. We can accelerate one small patch at a time and run multiple copies of each patch, which allows us to leverage more parallelism.” With this approach, the simulation timescale becomes controlled by the morphological changes happening locally rather than globally, extending the timescales that can be achieved.
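
The schematic Python sketch below illustrates the trajectory-splicing idea in highly simplified form; it is not the ParSplice implementation. Short segments are generated speculatively (here with an invented random-walk stand-in for a real MD run) from known end states, banked by their starting state, and spliced onto a single long trajectory whenever a segment begins where the trajectory currently ends.

```python
# Highly simplified, schematic sketch of trajectory splicing (not ParSplice
# itself): short "segments" are generated speculatively in parallel from known
# end states, banked by their starting state, and spliced end-to-end into one
# long trajectory. States are abstract integer labels and segment generation is
# faked with a random walk, purely for illustration.
import random
from collections import defaultdict, deque

def generate_segment(start_state: int, length: int = 10) -> list[int]:
    """Stand-in for a short MD run: returns the sequence of states it visits."""
    states = [start_state]
    for _ in range(length):
        states.append(max(0, states[-1] + random.choice((-1, 0, 1))))
    return states

def splice_trajectory(n_batches: int = 20, workers: int = 8) -> list[int]:
    trajectory = [0]               # the single long trajectory starts in state 0
    bank = defaultdict(deque)      # unspliced segments, keyed by their start state
    for _ in range(n_batches):
        # "Parallel" phase: each worker speculatively extends from the current
        # end state (a real run distributes this over many processors).
        for _ in range(workers):
            segment = generate_segment(trajectory[-1])
            bank[segment[0]].append(segment)
        # Splicing phase: consume any segment that begins where we left off.
        while bank[trajectory[-1]]:
            segment = bank[trajectory[-1]].popleft()
            trajectory.extend(segment[1:])
    return trajectory

if __name__ == "__main__":
    random.seed(0)
    print("spliced trajectory length:", len(splice_trajectory()))
```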

Another major focus of the project was learning how to port their models efficiently to GPU-based machines. Early in the project, the team observed that as the code ran on increasingly more advanced machines using GPUs, the application performance—as demonstrated by the challenge problems—was decreasing against the peak performance of the hardware. “These models have lots of nested loops that you can unroll and unfold in many different ways. Finding the right mapping of the physics onto the hardware is tricky,” says Perez. “We developed clever ways to better exploit symmetry and memory and to manipulate the loops to maximize performance.”
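
As a generic illustration of the kind of loop manipulation Perez describes (and not the actual SNAP or EXAALT kernels), the Python sketch below collapses a nested atom/neighbor loop into one flat list of (atom, neighbor) work items; on a GPU, that flattened index space can be mapped one item per thread so that uneven neighbor counts do not leave threads idle. All names and sizes are invented for illustration.

```python
# Generic illustration (not the actual SNAP/EXAALT kernels) of one common loop
# transformation: collapsing a nested atoms/neighbors loop into a single flat
# list of (atom, neighbor) work items. On a GPU, each flattened item can map to
# one thread, so uneven neighbor counts do not leave threads idle. All names
# and sizes here are invented for illustration.

def pair_energy(i: int, j: int) -> float:
    """Placeholder for an expensive per-pair physics evaluation."""
    return 1.0 / (1 + abs(i - j))

def nested(n_atoms: int, neighbors: list[list[int]]) -> float:
    total = 0.0
    for i in range(n_atoms):        # outer loop over atoms
        for j in neighbors[i]:      # inner loop over neighbors (uneven lengths)
            total += pair_energy(i, j)
    return total

def collapsed(n_atoms: int, neighbors: list[list[int]]) -> float:
    # Flatten the (atom, neighbor) pairs into one uniform work list.
    work = [(i, j) for i in range(n_atoms) for j in neighbors[i]]
    return sum(pair_energy(i, j) for i, j in work)

if __name__ == "__main__":
    nbrs = [[j for j in range(8) if j != i] for i in range(8)]
    assert abs(nested(8, nbrs) - collapsed(8, nbrs)) < 1e-12
    print("nested and collapsed loops give the same result")
```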

One of those tricks was to rewrite from scratch a fundamental physical model called SNAP (Spectral Neighbor Analysis Potential), which uses ML techniques trained on high-accuracy quantum calculations to accurately predict the underlying physics of material behavior. Performance experts at the National Energy Research Scientific Computing Center and NVIDIA were brought in to identify areas for improvement in the SNAP kernel and help the team implement optimization strategies that led to an initial 25x speedup in the code. Fast forward to early 2023, and the team’s enhanced EXAALT code is surpassing all expectations. A recent run using Oak Ridge National Laboratory’s Frontier supercomputer—DOE’s first exascale machine—returned exceptional results. Perez says, “We project more than a 500x speedup when we extrapolate our results to the machine’s full capability, almost a factor of 10 higher than our target.”

An Unexpected Bonus

With the EXAALT code showing increasingly strong scaling and performance, the team has turned its attention to building a model-generation capability based on the ML models used to run the simulations. “We want to show how this tool can drastically speed up the whole scientific workflow from picking a system, to obtaining a model, to performing the run,” explains Perez. “Developing the specific machine-learning models that drive the simulations is time consuming and tedious work. We discovered that the ML infrastructure we were using to generate the models could also automate the entire workflow.”

Figure 3. Danny Perez, a physicist in the Theoretical Division at Los Alamos National Laboratory, leads ECP’s EXAALT application development effort.

Rather than relying on an expert’s time and knowledge to define a small data set that contains the right amount of information to include in a model, the team’s approach allows the machine to figure out the right “ingredients.” Perez adds, “We decided that instead of guessing what will happen in a simulation, we would try to capture everything that’s physically sensible. We wanted the most diverse data set possible so that in a simulation we would never encounter a configuration that’s unlike anything used in our training data, but this requires generating a massive amount of data. So, we framed this as an optimization problem where we can quantify the diversity of the data set and create new configurations without generating redundant information.”
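
One simple way to formalize quantifying the diversity of a data set and avoiding redundant configurations is a greedy distance-based filter over descriptor vectors, sketched below in Python. This is an illustrative stand-in, not the team’s actual selection algorithm, and the descriptors here are random placeholders.

```python
# Illustrative stand-in (not the team's actual algorithm) for quantifying
# data-set diversity: each candidate configuration is represented by a
# descriptor vector, and a greedy distance filter keeps only candidates that
# are far enough from everything already selected, i.e., non-redundant.
import math
import random

def distance(a: list[float], b: list[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_diverse(candidates: list[list[float]], min_dist: float) -> list[list[float]]:
    """Keep a candidate only if it adds information to the selected set."""
    selected: list[list[float]] = []
    for c in candidates:
        if all(distance(c, s) >= min_dist for s in selected):
            selected.append(c)
    return selected

if __name__ == "__main__":
    random.seed(0)
    # Fake descriptors for 1,000 candidate configurations (3 features each).
    pool = [[random.random() for _ in range(3)] for _ in range(1000)]
    kept = select_diverse(pool, min_dist=0.25)
    print(f"kept {len(kept)} of {len(pool)} candidates as non-redundant")
```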

The team has also been exploring ways to integrate quantum aspects, like charge transfer, into the ML models to create an intermediate model that would have features of both. “We have demonstrated an integrated capability where you can generate these configurations, do the quantum calculations, and train the potentials all inside one big run. Between now and the end of ECP, we want to transition this prototype into a robust, production-quality version of the machine-learning framework,” says Perez. In addition, the team is looking to make MD simulations more cost efficient. Since MD algorithms resolve each atom in a system, the computational cost of running larger systems with more sophisticated codes could become prohibitive. He adds, “We are working with a world expert to develop methodologies on the machine-learning side so that we can obtain the extra physics at a reasonable price.”
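
The integrated capability Perez describes (generating configurations, running the quantum calculations, and training the potentials inside one run) has the shape of an active-learning loop. The Python sketch below shows only that loop structure; every component is a stub standing in for the real pieces, such as diversity-driven generation, a quantum or tight-binding labeler, and an ML potential fit.

```python
# Schematic active-learning loop showing only the structure of "generate
# configurations, do the quantum calculations, and train the potentials all
# inside one big run." Every function body is a stub standing in for the real
# components; nothing here reflects the production EXAALT workflow.
import random

def propose_configurations(model, n: int) -> list[list[float]]:
    """Stand-in for diversity-driven configuration generation."""
    return [[random.random() for _ in range(3)] for _ in range(n)]

def quantum_label(config: list[float]) -> float:
    """Stand-in for an expensive quantum (e.g., tight-binding) energy calculation."""
    return sum(x * x for x in config)

def train_potential(data: list[tuple[list[float], float]]) -> dict:
    """Stand-in for fitting an ML interatomic potential; returns a toy model."""
    mean_energy = sum(e for _, e in data) / len(data)
    return {"n_train": len(data), "mean_energy": mean_energy}

if __name__ == "__main__":
    random.seed(1)
    dataset: list[tuple[list[float], float]] = []
    model = None
    for generation in range(3):                      # a few refinement cycles
        configs = propose_configurations(model, n=50)
        dataset += [(c, quantum_label(c)) for c in configs]
        model = train_potential(dataset)
        print(f"generation {generation}: trained on {model['n_train']} configurations")
```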

Taking the Long View

The notable progress the EXAALT team has made over the last seven years builds off decades of MD code development. Aidan Thompson, the leader of LAMMPS development at Sandia National Laboratories, says, “The creation of the EXAALT framework and the arrival of exascale computing resources are game-changing events for LAMMPS, allowing users for the first time the freedom to select the accuracy, length, timescale, and chemical composition appropriate to their particular science applications, rather than being restricted to certain niche areas.”

Figure 4. Aidan Thompson, a researcher in the Center for Computing Research at Sandia National Laboratories, leads the LAMMPS development effort.

The work is also a testament to the ECP research paradigm. “Typically, sponsors want the scientific potential up front, and the capability advancements needed to make that possible are considered afterward. With ECP, the capability was at the forefront, so it gave us the opportunity to focus on the long view of code development. We had time to decide what features we needed and to optimize the code as aggressively as we could to reach the necessary performance,” states Perez. He is also quick to point out that the ability to work across the DOE complex as part of a multidisciplinary team was paramount to EXAALT’s success. He notes, “This work would not have been possible without this integrated team that brings together the skills of physics, methods, machine-learning, and performance experts. This whole ecosystem of people all pushing in the same direction at the same time gave the project momentum and enabled us to meet and exceed our performance goals.”

The team will continue to test EXAALT’s performance and ensure scalability on Frontier and other future exascale machines. Through demonstration of the challenge problems, the team has shown that EXAALT can be applied to issues that directly impact the national security side of DOE’s mission space, but the code’s relevance extends beyond nuclear energy. Perez says, “Our goal is that users will be able to use EXAALT on exascale machines to run MD simulations directly in the conditions relevant to their applications by giving them access to the entire accuracy, length, and time space.” In addition, the team’s ML-based MD computational workflow could cut the development time of new materials from decades to just a few years by using computer simulations almost exclusively. The ability to achieve atomistic materials predictions at the engineering scale would indeed be an MD revolution.


Source: Caryn Meissner, LLNL and ECP
