ECP Podcast Discusses HACC Cosmological Code for Exascale

April 2, 2021

April 2, 2021 — Episode 79 of the Let’s Talk Exascale podcast explores the efforts of the Department of Energy’s (DOE) Exascale Computing Project (ECP)—from the development challenges and achievements to the ultimate expected impact of exascale computing on society.

This is the second in a series of episodes based on an effort aimed at sharing best practices in preparing applications for the upcoming Aurora exascale supercomputer at the Argonne Leadership Computing Facility (ALCF).

The series highlights achievements in optimizing code to run on GPUs and provides developers with lessons learned to help them overcome initial hurdles.

The focus in the current episode is on a cosmological code called HACC, and joining us to discuss it are Nicholas Frontiere, Steve Rangel, Michael Buehlmann, and JD Emberson of DOE’s Argonne National Laboratory.

Our topics:

what HACC is and how it’s applied, how the team is preparing it for exascale and for the upcoming Aurora system, and more.

Interview Transcript

Gibson: Nicholas, will you start us off by telling us about HACC, including what it’s currently used for?

Frontiere: Yeah, so HACC stands for Hardware/Hybrid Accelerated Cosmology Code, and as its name implies, it’s a cosmology N-body and hydrodynamics solver that studies the formation of structure in the universe as it evolves over cosmic time.

Clockwise from top left: Nicholas Frontiere, Steve Rangel, Michael Buehlmann, and JD Emberson of Argonne National Laboratory

The code itself consists of a gravitational solver, which was later modified to resolve gas dynamics, and it includes astrophysical subgrid models that emulate galaxy formation. One important characteristic, I would say, in addition to the physics is that HACC was designed from the beginning to run at extreme scales on all high-performance systems operated by DOE labs and to continue to do so as new hardware emerges. In fact, HACC began development, I think, over a decade ago when the first petascale machine, Roadrunner, came online at Los Alamos. And since then, we’ve optimized the code to run at tens and hundreds of petaflops leading into the exascale computing effort that we’re currently preparing for.

Gibson: OK. This seems important: how will HACC benefit from exascale computing power?

Frontiere: That is a good question. One important element to consider in that respect is that cosmological measurements are being made now and will be made by future observational surveys. Upcoming instruments like the Vera Rubin Observatory, for example, will be making extremely precise measurements, which in turn require commensurate theoretical predictions from codes like HACC. And exascale computing, in particular, will play a large role in meeting the resolution requirements for those predictions.

I would say that since the machines are getting a significant boost in computational performance—roughly an order of magnitude, I believe, at this point—it allows us to include a lot more physics in our simulations at these immense volumes, which in turn drastically increases the fidelity of our predictions. And none of that would be possible without exascale computing. So it’s actually quite exciting for us that we’ll be able to run these simulations with the new computers that are going to come out.

Gibson: Nicholas, what is the team doing to prepare HACC for exascale?

Frontiere: HACC has two ALCF Early Science projects, and it is the primary code for the ECP project called ExaSky. All of those activities have the goal of making sure that the code is performant on the upcoming exascale machines, namely Aurora and Frontier. These projects allow us to access early hardware and software compiler stacks to ensure that the code is optimized for each new architecture and that we’re basically ready to run on day one of launch for each machine. Since HACC has a long history of optimization for a variety of supercomputers, we have a lot of experience in what preparations we need to make to get the code ready.

Gibson: HACC has been developed to run on diverse architectures across DOE computing facilities. How have the code’s earlier portability efforts helped in preparations for DOE’s upcoming exascale systems?

Frontiere: It’s a very good point that HACC essentially was designed to emphasize performance over, I’d say, strict portability, yet it achieved both with tractable effort, which is pretty interesting. We accomplish this by combining a long-range gravitational spectral solver, which encapsulates most of the global communication within the code, with a limited number of computationally intensive short-range force kernels, which are node local, in that they don’t require communication. Because these kernels account for the majority of the time spent within the solver, the decomposition we chose from the beginning of HACC has the advantage that the spectral solver, which is written in MPI and already scales efficiently to millions of ranks, is immediately portable to the new machines we get access to. As such, all of our performance efforts can be restricted to optimizing the limited number of short-range force kernels for whatever hardware we’re running on, since they contribute the majority of our time to solution.
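As an illustration of that decomposition, here is a minimal, hypothetical C++ sketch of a single time step with the long-range/short-range split. The type and function names are illustrative placeholders rather than HACC’s actual interfaces, and the two force routines are stubbed out:

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical sketch of a HACC-style force split; all names are illustrative.
using Vec3 = std::array<double, 3>;

struct Particles {
    std::vector<Vec3> pos, vel;
};

// Long-range gravity from an FFT-based spectral solver. In the scheme described
// above, this is the only phase that needs global (MPI) communication; stubbed here.
std::vector<Vec3> long_range_force(const Particles& p) {
    return std::vector<Vec3>(p.pos.size(), Vec3{0.0, 0.0, 0.0});  // placeholder
}

// Short-range force kernels: node-local and compute-bound, so only this part
// would be rewritten per architecture (CUDA, OpenCL, DPC++, ...); stubbed here.
std::vector<Vec3> short_range_force(const Particles& p) {
    return std::vector<Vec3>(p.pos.size(), Vec3{0.0, 0.0, 0.0});  // placeholder
}

// One kick-drift step combining both force contributions.
void step(Particles& p, double dt) {
    const auto fl = long_range_force(p);
    const auto fs = short_range_force(p);
    for (std::size_t i = 0; i < p.pos.size(); ++i) {
        for (int d = 0; d < 3; ++d) {
            p.vel[i][d] += dt * (fl[i][d] + fs[i][d]);  // kick
            p.pos[i][d] += dt * p.vel[i][d];            // drift
        }
    }
}
```

The point of the sketch is structural: everything requiring inter-node communication sits behind long_range_force, while short_range_force is the only piece that needs to be reworked for each new GPU architecture.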

This strategy we’ve taken, I would say, has a number of interesting advantages. The first is that we don’t have any external library dependencies from portability frameworks to worry about, so we don’t need to wait for any other software to support a specific machine. And, secondly, we have the flexibility to rewrite these kernels in whatever hardware-specific language is required to achieve the best performance. That could be CUDA, OpenCL, DPC++, whatever the machine requires. And because there are only a small number of kernels to rewrite, the actual performance and optimization effort is quite manageable, without having to rewrite the whole code.

One additional benefit is that because the kernels are node local, we can optimize on small early-hardware testbeds ahead of time, which essentially means that on day one, when a new supercomputer comes online, we can immediately weak scale with the spectral solver and run state-of-the-art simulations out of the gate.

This strategy has been really successful in the past on a number of machines with quite different hardware, and, subsequently, we’re pretty confident that we should be able to run optimally on Frontier and Aurora using the same approach.

Gibson: Thanks, Nicholas. Steve, what is the team currently doing to get ready for Argonne’s Aurora system?

Rangel: Well, with respect to code development, the focus of our effort for Aurora has been porting and optimizing kernels for the solvers, for both the gravity-only and the hydro versions of HACC. As Nick mentioned, because of HACC’s structure, this work is at the node level, which for Aurora means we’ll be targeting the Intel GPUs.

I should note Aurora isn’t the first GPU-accelerated system HACC will run on. The gravity-only version of HACC ran on Titan at Oak Ridge with an OpenCL implementation and later with a CUDA implementation. And the hydro version is being developed using CUDA, also running at Oak Ridge on Summit.

Very generally speaking, GPU architectures share a lot of computational characteristics. So, across manufacturers—AMD, NVIDIA, Intel—the underlying algorithms require only a little bit of change for us to begin optimizing. For Aurora, this lets us focus on evaluating the programming models, testing the SDK [software development kit], getting familiar with the Intel GPU hardware, and trying to develop a porting strategy that would ease maintainability across our different exascale implementations.

The work to evaluate the programming models really began in earnest in June of 2019 when we had our first Intel–Argonne hackathon. During this time we were able to get our existing gravity-only OpenCL code, originally written for Titan, running on the first Intel GPU hardware in our test bed with very little modification. We were also able to rewrite the gravity-only kernels in DPC++ and OpenMP, and we conducted some preliminary tests. Subsequently, as the SDK matured and more representative hardware became available, we were able to conduct more extensive evaluations, and, ultimately, we chose DPC++ as the programming model for HACC on Aurora.

In preparation for our second hackathon with Intel, we planned to focus more on the hydro code. We began exploring porting strategies because the hydro code involves significantly more effort: there are more kernels, and the kernels themselves are more complex.

Since we had a CUDA version of the gravity code, we began testing the DPC++ Compatibility Tool to migrate code from CUDA to DPC++. We finally converged on a semi-automatic strategy in which we used the compatibility tool to migrate individual kernels, and we manually wrote the kernel-calling code and the code for explicit data management between the host and the device using the newly available unified shared memory features in DPC++. In this way, at least from the perspective of the host code, the data movement and device kernel calls look remarkably similar to our CUDA implementation, hopefully getting us a little closer to a more manageable code base. Although this work is not completely finalized, we think it’s going in the right direction.
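As an illustration of that host-side pattern, here is a minimal DPC++ sketch, assuming the oneAPI SYCL runtime, of explicit data management with unified shared memory: a device allocation, a copy in, a kernel launch, and a copy back. The toy kernel is a placeholder, not one of HACC’s kernels:

```cpp
#include <sycl/sycl.hpp>
#include <cstddef>
#include <vector>

int main() {
    const std::size_t n = 1 << 20;
    std::vector<float> host(n, 1.0f);

    sycl::queue q;                                   // default device selection
    float* dev = sycl::malloc_device<float>(n, q);   // USM device allocation

    q.memcpy(dev, host.data(), n * sizeof(float)).wait();   // host -> device

    // Node-local compute; a stand-in for a migrated short-range kernel.
    q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
        const std::size_t k = i[0];
        dev[k] = 2.0f * dev[k] + 1.0f;
    }).wait();

    q.memcpy(host.data(), dev, n * sizeof(float)).wait();   // device -> host
    sycl::free(dev, q);
    return 0;
}
```

In this pattern the explicit malloc_device/memcpy calls play roughly the role that cudaMalloc and cudaMemcpy play on the CUDA side, which is what keeps the two host code paths looking similar.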

Lastly, we have an ongoing collaboration with Intel in which we’re optimizing the performance of the code on new hardware as it becomes available.

Gibson: Thank you, Steve. We’ll pivot over to Michael and JD for this question. Michael, please begin a discussion about the next steps.

Buehlmann: So maybe I can start answering this question by talking a little bit about the physics that we’re adding to HACC.

As Nick already mentioned earlier, the observational surveys that are being done now and will be done in the future are becoming both wider and deeper in scale and also much more precise. On the numerical modeling side, we have to keep up with that development to get the most out of the data these surveys provide. That means we have to simulate a very large volume of the universe as well as include all the physical processes that are important and that affect the properties of the galaxies we observe in those surveys.

To give some perspective, in the past it was often sufficient to treat the universe as if it consisted only of dark energy and dark matter, which of course greatly simplifies the numerical simulations, because we only have to consider gravitational forces in an expanding background as the universe evolves.

Now, with more and better observational data, the effects of the baryonic content of the universe become more and more important, and they also open an exciting new window to study these astrophysical processes.

However, with baryons, there is a lot more physics involved than just gravity. For example, baryonic gas can cool and heat due to radiation. Especially in high-density regions, cooling becomes very efficient, so stars and galaxies can form there, which means we also have to model star formation. Those stars, in turn, will heat up the surrounding gas and also change the composition of the baryonic gas and the environment as heavier elements are formed in the stars.

And at the centers of massive galaxies, we find supermassive black holes, which accrete the baryonic gas and eject enormous amounts of energy; this has to be accurately modeled to explain what the centers of galaxies look like.

Currently, we are working on incrementally adding these processes and validating each as we add them. And I think JD will talk a little bit more afterwards about how we do this validation.

The scale of most of these baryonic processes is well below the scales that we resolve in our simulations. For example, the mass resolution that we typically have is 10^6 to 10^9 solar masses. So we cannot follow the evolution of individual stars. Instead, we follow an entire population and model how it evolves.
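As a rough illustration of where numbers like that come from (this is the generic N-body relation with assumed example values of box side L = 1000 h^-1 Mpc and N = 10240 particles per dimension, not the parameters of any specific HACC run), the particle mass is roughly the mean matter density times the volume per particle:

```latex
m_p \;\simeq\; \Omega_m\,\rho_{\mathrm{crit}}\left(\frac{L}{N}\right)^{3}
\;\approx\; 0.31 \times 2.78\times10^{11}\,h^{2}M_\odot\,\mathrm{Mpc}^{-3}
\times \left(\frac{1000\,h^{-1}\mathrm{Mpc}}{10240}\right)^{3}
\;\approx\; 8\times10^{7}\,h^{-1}M_\odot
```

which falls within the quoted 10^6 to 10^9 solar-mass range; larger boxes or fewer particles push the particle mass toward the upper end.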

One problem with these kinds of models is they have some free parameters, and those parameters are often resolution dependent. So these models are not resolution converged and have to be re-tuned for every simulation run.

With HACC in the future, we’re planning to go a slightly different route. With HACC, unlike other cosmology codes, we’re really focused on large-scale simulations, and we are, therefore, more interested in treating galaxies as a whole instead of resolving galaxy substructures and individual star-forming regions within them.

We are working on a new subgrid model based on what we call super particles, or galaxy particles, that will be more suitable for the kind of simulations we want to do. This will both help us speed up the simulation and give us more stable and converged results.

This new technique, together with the performance leap in the exascale machines, will allow us to run the largest multi-species simulations that have been run so far.

Emberson: One thing I’ll add is that an important consideration in all of our development work is that we thoroughly understand the error properties as well as the numerical convergence properties of these developments. There are basically two main ways that we are addressing this topic.

The first is to consider how our code changes impact the gravity solver; that’s the part of the code that already existed beforehand. The second is to consider the newly added hydro solver itself. With regard to the first point, as Michael mentioned earlier, when we go from simulations that include only gravity to simulations that include both gravity and hydrodynamics, one of the fundamental differences is that the code now has to evolve two separate particle species: dark matter, which interacts only via gravity, and baryons, which have both gravity and hydro interactions.

Now, in traditional gravity-only simulations that have only the dark matter component—so only one particle species—there are various numerical artifacts that are known to exist due to the discreteness of the problem. That is, since we’re using a set of discrete macroscopic particles to represent the continuous matter field of the universe, it’s easy to see that the numerical solution we obtain will deviate from the continuum limit as we approach the particle scale in the simulation. Knowing how to handle these errors and restrict them to small scales is something that’s well understood in the single-species case. However, in the case where we have multiple particle species interacting with each other, it’s possible that (a) the numerical issues we see in the single-species case become amplified because the particles have different masses, and/or (b) having two species introduces completely new numerical issues that do not have an analog in the single-species case. It turns out that both of these statements are true. Therefore, a large part of our development work is to thoroughly investigate this topic, figure out the best ways to handle any numerical issues associated with having multiple particle species, and make sure that those errors are confined to small scales and that the code behaves in a numerically converged and robust way.

The second part of this work is to understand the systematics associated with the hydro solver itself. The hydro solver in HACC is based upon a novel Lagrangian particle-based method, which is actually the first of its kind to be used in cosmological simulations. As such, we can expect that our hydro simulations produce slightly different results compared to other cosmological codes that have their own different hydro methods with their own different internal free parameters.

For instance, the companion code, Nyx, which is also part of our ECP project, uses a fundamentally different approach to solving the hydrodynamic forces, namely an Eulerian mesh-based method. So as part of our ECP efforts, we have begun an extensive comparison campaign between HACC and Nyx that serves not only as an internal verification and validation pipeline for our development work but also enables a more rigorous investigation of the convergence properties of each code.

To do this comparison properly, we start by comparing the codes in the simplest case of gravity-only dynamics and then systematically work our way up in complexity to the full range of sub-grid models we want to implement. By performing careful comparisons in this manner between two codes with fundamentally different hydro schemes, we’re able to answer the question of over what range of length and time scales we can expect cosmological codes to agree at a given level of accuracy.

Gibson: Thanks, JD, Michael. To close, let’s go back to Nicholas. Finally, do you have any lessons learned or advice that may help researchers in preparing their codes for GPU-accelerated exascale systems?

Frontiere: Yeah, so, specifically for GPU-accelerated systems, I think there are a number of good lessons we learned that others might find valuable.

The first is the standard advice you normally hear, which is that you want to reduce the memory-management complexity and increase your data locality as much as possible. Essentially, you’re attempting to take advantage of the brute-force nature of GPUs. This is because many applications over the years have implemented a variety of algorithms that were selected to optimally reduce the order of computation, essentially to make the code as efficient as possible; tree-based algorithms are a classic example. But people typically have issues translating that work to GPUs because those algorithms depend on complicated pointer-chasing data structures and non-contiguous memory accesses.

So rethinking which algorithms can be used, and even potentially resorting to more computationally expensive but simpler algorithms, can actually greatly improve your performance on GPUs. That’s the standard advice people often give when thinking about GPUs. But to go beyond that, I would say we’ve also discovered that you generally want to write your algorithms to be what I call GPU resident as opposed to GPU accelerated. What I mean by that is, typically, people are taught that when you’re accelerating your application, you want to pipeline everything: you accelerate a bunch of tiny kernels, which are all reading, writing, and computing asynchronously at the same time, and you pipeline everything to hide latencies. We’ve actually found the converse: simple kernels with large, even serial, data transfers to the GPU, where the data just resides and is batch computed, can not only achieve peak performance but also circumvent a lot of potential hardware and software concerns that arise with new systems. Every time you get a new GPU or a machine comes online, you expect there to be hardware and software issues, and almost all of them immediately impact you when you take the traditional accelerated strategy as opposed to opting for simpler GPU-resident algorithms.

Essentially, making your data layout as simple as possible, doing these bulk data transfers to the device, computing away, and repeating if necessary is a simple strategy to implement, and it maximizes performance.
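Here is a minimal DPC++ sketch of that GPU-resident pattern, with a toy kernel standing in for a real short-range force computation; the data is copied to the device once, many kernel launches operate on it in place, and results come back in one bulk transfer at the end:

```cpp
#include <sycl/sycl.hpp>
#include <cstddef>
#include <vector>

int main() {
    const std::size_t n = 1 << 22;
    const int nsteps = 100;
    std::vector<float> host(n, 0.0f);

    sycl::queue q;
    float* resident = sycl::malloc_device<float>(n, q);

    q.memcpy(resident, host.data(), n * sizeof(float)).wait();  // one bulk copy in

    for (int s = 0; s < nsteps; ++s) {
        // The data never leaves the device between steps; no per-step
        // pipelining of small host<->device transfers.
        q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
            const std::size_t k = i[0];
            resident[k] += 1.0f;   // stand-in for a batched force kernel
        }).wait();
    }

    q.memcpy(host.data(), resident, n * sizeof(float)).wait();  // one bulk copy out
    sycl::free(resident, q);
    return 0;
}
```

The design choice the sketch reflects is the one described above: a simple data layout resident on the device and batch computation, rather than a tightly pipelined stream of tiny kernels and transfers.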

If you look at modern GPUs, memory capacity is growing significantly. So I think it should make intuitive sense to most scientists that your data, or a large fraction of it, should actually reside on the GPU as long as possible instead of being streamed and pipelined as fast as possible.

And so for those reasons, I would say GPU-resident algorithms should really be considered; I think they’re a good recommendation.

Gibson: Thank you all for being on the podcast. Again, our guests have been Nicholas Frontiere, Steve Rangel, Michael Buehlmann, and JD Emberson of Argonne National Laboratory.

Related Links

HACC: https://cpac.hep.anl.gov/projects/hacc/

Get Ready for Aurora (Aurora Documentation): https://www.alcf.anl.gov/support-center/aurora

oneAPI DPC++: https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/dpc-library.html

Click here to listen to the podcast.


Source: ECP
