Argonne Researchers Work to Prepare PETSc for Nation’s Exascale Supercomputers

October 5, 2022

Oct. 5, 2022 — Junchao Zhang, a software engineer at the U.S. Department of Energy’s (DOE) Argonne National Laboratory, is leading a team of researchers working to prepare PETSc (Portable, Extensible Toolkit for Scientific Computation) for the nation’s exascale supercomputers—including Aurora, the exascale system set for deployment at the Argonne Leadership Computing Facility (ALCF), a DOE Office of Science user facility located at Argonne.

PDE Library Used in Numerous Fields

PETSc is a math library for the scalable solution of models expressed as partial differential equations (PDEs). PDEs, fundamental for describing the natural world, are ubiquitous in science and engineering. As such, PETSc is used across numerous disciplines and industry sectors, including aerodynamics, neuroscience, computational fluid dynamics, seismology, fusion, materials science, ocean dynamics, and the oil industry.

As researchers from both science and industry seek to generate increasingly high-fidelity simulations and apply them to increasingly large-scale problems, PETSc stands to benefit directly from the advances of exascale computing power. In addition, technology developed for exascale can be applied to less powerful computing systems, making PETSc applications on those systems faster and cheaper and, in turn, encouraging broader adoption.

Furthermore, each of the exascale machines scheduled to come online at DOE facilities has adopted an accelerator-based architecture and derives the majority of its compute power from graphics processing units (GPUs). This has made porting PETSc for efficient use on GPUs an absolute necessity.

However, every vendor of exascale computing systems has adopted its own programming model and corresponding ecosystem. Moreover, portability between the different models, where it exists at all, remains in its infancy for all practical purposes.

To avoid being locked into any single vendor's programming model, and to take advantage of Kokkos's extensive user support and math library, Zhang's team opted to prepare PETSc for GPUs by using the vendor-independent Kokkos as its portability layer and primary backend wherever possible, falling back on CUDA, SYCL, and HIP otherwise.

Instead of writing multiple interfaces to the different vendor libraries, the researchers employ the Kokkos math library, known as Kokkos-Kernels, as a wrapper. Because Kokkos is itself a library rather than a language extension, it lets the team accommodate their users' choice of programming model while still providing seamless, natural GPU support.
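The wrapper idea described above can be illustrated with a minimal sketch: one "single source" kernel whose backend is selected at a single dispatch point, rather than duplicated per vendor. PETSc and Kokkos do this in C++ with templates and execution spaces; the names below are purely illustrative, not real PETSc or Kokkos APIs.

```python
def axpy_serial(a, x, y):
    """Reference CPU backend: returns y[i] + a * x[i] for each i."""
    return [yi + a * xi for xi, yi in zip(x, y)]

# One registry entry per backend. A real portability layer would
# register CUDA/HIP/SYCL implementations here; this sketch has only
# a serial stand-in.
BACKENDS = {"serial": axpy_serial}

def axpy(a, x, y, backend="serial"):
    """Single entry point: the calling code never names a vendor,
    so supporting a new GPU means adding one registry entry, not
    duplicating every kernel."""
    return BACKENDS[backend](a, x, y)

print(axpy(2.0, [1.0, 2.0, 3.0], [0.5, 0.5, 0.5]))  # [2.5, 4.5, 6.5]
```

The payoff is exactly the maintenance point Zhang makes below: each feature and each bug fix lives in one place, regardless of how many backends exist.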

Expanding GPU Support

Prior to the efforts of Zhang's team, which are sponsored by DOE's Exascale Computing Project (ECP), PETSc's GPU support was limited to NVIDIA processors and required many of its compute kernels to execute on host machines. This limited both the code's portability and its capability.

“So far, we think adopting Kokkos is successful, as we only need a single source code,” Zhang said. “We had direct support for NVIDIA GPUs with CUDA. We tried to duplicate the code to directly support AMD GPUs with HIP. We find it is painful to maintain duplicated code: the same feature needs to be implemented at multiple places, and the same bug needs to be fixed at multiple places. Once CUDA and HIP application programming interfaces (APIs) diverge, it becomes even more difficult to duplicate a code.”

However, while PETSc is written in C, enough GPU programming models use C++ that Zhang’s team has found it necessary to add an increasing number of C++ files.

“Within the ECP project, bearing in mind a formula in computing architecture known as Amdahl’s law, which suggests that any single unaccelerated portion of the code could become a bottleneck to overall speedup,” Zhang explained, “we tried to consider the GPU-porting job and the portability of the GPU code in holistic terms.”
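Amdahl's law, which Zhang cites, can be stated concretely: if a fraction p of the runtime is accelerated by a factor s while the rest runs unchanged, the overall speedup is 1 / ((1 - p) + p / s). A short calculation shows why even a small unaccelerated remainder becomes the bottleneck he describes.

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of runtime is accelerated
    by a factor s and the remaining (1 - p) is left unaccelerated."""
    return 1.0 / ((1.0 - p) + p / s)

# Accelerating 95% of the code tenfold yields under 7x overall...
print(round(amdahl_speedup(0.95, 10.0), 2))  # 6.9

# ...and even an arbitrarily fast GPU cannot push past 1 / 0.05 = 20x
# while 5% of the code stays on the host.
print(round(amdahl_speedup(0.95, 1e9), 2))   # 20.0
```

This is why the team treats the porting job holistically: leaving any significant kernel on the CPU caps the benefit of every other optimization.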

Optimizing Communication and Computation

The team is working to optimize GPU functionality on two fronts: communication and computation.

As the team discovered, CPU-GPU data synchronizations must be carefully isolated to avoid the tricky, elusive bugs they cause.

Therefore, to improve communication, the researchers have added support for GPU-aware Message Passing Interfaces (MPI), thereby enabling data to pass directly to GPUs instead of buffering on CPUs. Moreover, to remove GPU synchronizations that result from current MPI constraints on asynchronous computation, the team researched GPU-stream-aware communication that, bypassing MPI altogether, passes data using the NVIDIA NVSHMEM library. The team is also collaborating with Argonne’s MPICH group to test new extensions that address the MPI constraints, as well as a stream-aware MPI feature developed by the group.

For optimized GPU computation, Zhang's team ported a number of functions to the device to reduce back-and-forth copying of data between host and device. For example, matrix assembly, essential for most uses of PETSc, was previously carried out on the host: its CPU-friendly APIs could not feasibly be parallelized on GPUs. The team therefore added new matrix assembly APIs suitable for GPUs, improving performance.
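The article does not show the new assembly APIs themselves (in PETSc they take a coordinate-format, or COO, form, e.g. MatSetPreallocationCOO and MatSetValuesCOO), but the idea can be sketched: instead of inserting entries one call at a time, the caller supplies all (row, col, value) triplets up front in flat arrays, with duplicates summed. Flat arrays and a single summation pass map naturally onto data-parallel GPU execution. The function below is an illustrative stand-in using a dense matrix, not PETSc code.

```python
def assemble_coo(n, rows, cols, vals):
    """Assemble a dense n x n matrix from COO triplets, summing
    duplicate (row, col) entries. Because every triplet is known in
    advance, a GPU implementation can do this in one parallel pass
    instead of many incremental host-side insertions."""
    A = [[0.0] * n for _ in range(n)]
    for i, j, v in zip(rows, cols, vals):
        A[i][j] += v
    return A

# Two overlapping element contributions, as in finite-element
# assembly; the duplicate (1, 1) entries are summed.
rows = [0, 0, 1, 1, 1]
cols = [0, 1, 0, 1, 1]
vals = [2.0, -1.0, -1.0, 1.0, 1.0]
print(assemble_coo(2, rows, cols, vals))  # [[2.0, -1.0], [-1.0, 2.0]]
```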

Improving Code Development

Aside from recognizing the importance of avoiding code duplication and of encapsulating and isolating inter-processor data synchronizations, the team has learned to profile often, relying on NVIDIA's nvprof and Nsight Systems tools. Inspecting the timeline of GPU activities lets them identify hidden and unexpected activities and subsequently eliminate them.

One crucial difference between the Intel Xe GPUs that will power Aurora and the GPUs in other exascale machines is that the Xe GPUs comprise multiple subslices, meaning that optimal performance hinges on NUMA-aware programming. (NUMA, or non-uniform memory access, is a design in which a processor accesses its own local memory faster than memory local to another processor.)

Reliance on a single source code enables PETSc to run readily on Intel, AMD, and NVIDIA GPUs, albeit with certain tradeoffs. By making Kokkos an intermediary between PETSc and the vendors, PETSc becomes dependent on the quality of Kokkos: the Kokkos-Kernels APIs must be well optimized against the vendor libraries to avoid impaired performance. Having discovered that certain key Kokkos-Kernels functions are unoptimized for vendor libraries, the researchers contribute fixes as issues arise.

As part of the project’s next steps, the researchers will help the Kokkos-Kernels team add interfaces to the Intel oneMKL math kernel library before testing them with PETSc. This, in turn, will aid the Intel oneMKL team as they prepare the library for Aurora.

Zhang noted that to further expand PETSc’s GPU capabilities, his team will work to support more low-level data structures in PETSc along with more high-level user-facing GPU interfaces. The researchers also intend to work with users to help ensure efficient use of PETSc on Aurora.

The Best Practices for GPU Code Development series highlights researchers’ efforts to optimize codes to run efficiently on the ALCF’s Aurora exascale supercomputer.

About ALCF

The Argonne Leadership Computing Facility provides supercomputing capabilities to the scientific and engineering community to advance fundamental discovery and understanding in a broad range of disciplines. Supported by the U.S. Department of Energy’s (DOE’s) Office of Science, Advanced Scientific Computing Research (ASCR) program, the ALCF is one of two DOE Leadership Computing Facilities in the nation dedicated to open science.


Source: Nils Heinonen, ALCF
