April 30, 2021 — As part of a new series aimed at sharing best practices in preparing applications for Aurora, U.S. Department of Energy’s (DOE) Argonne National Laboratory is highlighting researchers’ efforts to optimize codes to run efficiently on graphics processing units (GPUs).
The fundamental interactions between the quarks and gluons that constitute protons and nuclei can be calculated systematically by the physics theory known as lattice quantum chromodynamics (LQCD). These interactions account for 99 percent of the mass in the visible universe, but they can only be simulated with powerful computer systems such as those housed at DOE’s Argonne Leadership Computing Facility (ALCF).
While a majority of the small army of codes necessary for the study of LQCD was originally written to run well on CPU-based computers—including the ALCF’s Theta machine—the next generation of high-performance computing will derive much of its power from GPUs, as exemplified by the ALCF’s forthcoming Polaris and Aurora systems.
Exascale capabilities promise to expand high energy and nuclear physics by providing the ability to simulate atomic nuclei more realistically than has ever been possible, enabling groundbreaking discoveries about the details of quark-boson coupling foundational to our present understanding of elementary particles.
Given the size of the LQCD suite, preparing the applications for exascale by making them GPU-ready is no small effort.
The project’s three major code bases, Chroma, CPS, and MILC, specialize in different quark discretizations (Wilson-clover, domain-wall, and staggered formulations, respectively) and take advantage of optimized routines available in the QUDA (“QCD in CUDA”) library and Grid code. The project additionally supports two minor code bases: HotQCD, which is optimized for QCD thermodynamics, and QEX, which is intended for high-level development of lattice field theory codes.
Porting lattice QCD applications
Abstraction is the primary thrust of the porting process; the developers are working to make all the performance-critical parts of the LQCD codes completely vendor-independent.
The changes made through the abstraction process are localized to a few backend files that provide functionality for mathematical operations. Once all of these backend and target-specific calls are grouped, they can be replaced or rewritten with higher-level functions that make the code more generic.
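The grouping described above can be pictured with a minimal sketch. It assumes a hypothetical `backend` namespace and a hypothetical `axpy` routine; neither name is taken from QUDA or Grid, and the host loop merely stands in for whatever a CUDA or SYCL backend would launch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical backend layer: the only place that would contain
// target-specific calls (CUDA, SYCL, ...). Here it is a plain host loop.
namespace backend {

template <typename Kernel>
void parallel_for(std::size_t n, Kernel kernel) {
    // A GPU backend would launch a device kernel here instead.
    for (std::size_t i = 0; i < n; ++i) {
        kernel(i);
    }
}

}  // namespace backend

// Generic application code: no vendor-specific calls appear here.
// Computes y[i] = a * x[i] + y[i].
void axpy(double a, const std::vector<double>& x, std::vector<double>& y) {
    assert(x.size() == y.size());
    const double* xp = x.data();
    double* yp = y.data();
    backend::parallel_for(x.size(), [=](std::size_t i) {
        yp[i] = a * xp[i] + yp[i];
    });
}
```

In a layout like this, retargeting the code would mean swapping the body of `backend::parallel_for` while `axpy` and every other caller stay unchanged.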
This refactoring is happening on a large scale to remove CUDA-specific code.
QUDA is the largest code base of any of the components comprising the lattice QCD project. Direct calls to CUDA pervaded its entirety.
GPU-optimized QUDA was developed independently and has its own code base. In contrast to OpenMP and SYCL, CUDA does not offer a unified programming model.
Relying on a conversion tool to prepare the code to run on GPU machines was not a viable option; CUDA-specific code would have to be manually excised and refactored.
As part of the effort to move operations to the backend and genericize the code, the developers are constructing a SYCL backend; Intel, likewise, is adding an extension that expands SYCL’s functions with APIs similar to those of CUDA to make porting as easy as possible for users.
As the other two applications, Grid and HotQCD, already had vendor-independent programming interfaces, the work being done to them is backend-intensive.
Grid was originally a CPU-only code to which GPU support was later added via a CUDA backend; it now has a DPC++ backend as well. Its porting can be seen as twofold: from CPU to GPU, and from CUDA to DPC++.
It is more than just a code; it is a framework. It began as a CUDA abstraction for NVIDIA that was expanded to incorporate SYCL compatibility. The expansion has helped guide the development of the SYCL backend, making its thread-indexing APIs exposable via global variables as in CUDA.
Early in the development cycle of Grid, a code benchmark called GridBench was constructed and functioned like a mini-app. GridBench incorporated the entire functionality of DPC++ to run the most important kernel, a stencil operator that, operating on multidimensional lattices, is responsible for key computations within the application.
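To give a feel for what a stencil operator on a lattice does, the sketch below applies a nearest-neighbour sum to a scalar field on a small periodic two-dimensional lattice. It is a deliberately simplified stand-in, not the GridBench kernel: the real operator acts on higher-dimensional lattices with far richer per-site data, and the names `Lattice2D` and `neighbour_sum` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar field on a periodic lx-by-ly lattice, stored row-major.
struct Lattice2D {
    std::size_t lx, ly;
    std::vector<double> site;

    Lattice2D(std::size_t lx_, std::size_t ly_)
        : lx(lx_), ly(ly_), site(lx_ * ly_, 0.0) {}

    std::size_t index(std::size_t x, std::size_t y) const { return y * lx + x; }
};

// out(x, y) = sum of the four nearest neighbours of in(x, y),
// with periodic wraparound at the lattice edges.
void neighbour_sum(const Lattice2D& in, Lattice2D& out) {
    assert(in.lx == out.lx && in.ly == out.ly);
    assert(&in != &out);  // the stencil reads neighbours, so no in-place update
    for (std::size_t y = 0; y < in.ly; ++y) {
        for (std::size_t x = 0; x < in.lx; ++x) {
            const std::size_t xp = (x + 1) % in.lx;
            const std::size_t xm = (x + in.lx - 1) % in.lx;
            const std::size_t yp = (y + 1) % in.ly;
            const std::size_t ym = (y + in.ly - 1) % in.ly;
            out.site[out.index(x, y)] =
                in.site[in.index(xp, y)] + in.site[in.index(xm, y)] +
                in.site[in.index(x, yp)] + in.site[in.index(x, ym)];
        }
    }
}
```

Every output site depends only on a fixed pattern of neighbouring input sites, which is why such kernels parallelize naturally over sites on both CPUs and GPUs.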
The porting of the stencil operator illustrates a subtlety to bear in mind when translating between GPU and CPU systems. A developer cannot write code for GPUs precisely as would be done for CPUs; in general there is no direct one-to-one correspondence between the two. Even so, code can be written for both types of architectures in a way that is not terribly different and is even reasonably natural, because both can be written using the same approaches to programmability and optimization.
This is true of the Grid library itself: the CPU and GPU versions of the code base share the same memory layout (Array of Structures of Arrays, or AoSoA). A C++ template mechanism decides at compile time whether a single instruction, multiple threads (SIMT) mode is used for the GPU or a single instruction, multiple data (SIMD) mode is used for the CPU.
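The following sketch illustrates the general idea of one AoSoA layout serving two execution modes selected at compile time. It is an assumption-laden illustration, not Grid's actual mechanism: `kLanes`, `ComplexBlock`, `SimdMode`, `SimtMode`, and `scale_field` are hypothetical names, and a serial loop over "logical threads" stands in for a real parallel launch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kLanes = 4;  // hypothetical vector width

// One AoSoA block: a structure of small arrays, one entry per lane.
// A field is an array (std::vector) of such blocks.
struct ComplexBlock {
    double re[kLanes];
    double im[kLanes];
};

// Execution-mode tags, chosen at compile time.
struct SimdMode {};  // CPU-like: one thread handles all lanes of a block
struct SimtMode {};  // GPU-like: one thread handles a single lane

// Multiply one lane of a block by the complex scalar (ar, ai).
inline void scale_lane(ComplexBlock& b, std::size_t lane, double ar, double ai) {
    assert(lane < kLanes);
    const double r = b.re[lane];
    const double i = b.im[lane];
    b.re[lane] = ar * r - ai * i;
    b.im[lane] = ar * i + ai * r;
}

// Number of logical threads each mode needs for a given block count.
inline std::size_t logical_threads(SimdMode, std::size_t blocks) { return blocks; }
inline std::size_t logical_threads(SimtMode, std::size_t blocks) { return blocks * kLanes; }

// Work done by logical thread t in each mode.
inline void scale_work(SimdMode, std::vector<ComplexBlock>& f, std::size_t t,
                       double ar, double ai) {
    for (std::size_t lane = 0; lane < kLanes; ++lane) {  // vectorizable lane loop
        scale_lane(f[t], lane, ar, ai);
    }
}
inline void scale_work(SimtMode, std::vector<ComplexBlock>& f, std::size_t t,
                       double ar, double ai) {
    scale_lane(f[t / kLanes], t % kLanes, ar, ai);
}

// Same data layout, same call site; the mode is a template argument.
template <typename Mode>
void scale_field(std::vector<ComplexBlock>& f, double ar, double ai) {
    const std::size_t n = logical_threads(Mode{}, f.size());
    for (std::size_t t = 0; t < n; ++t) {  // serial stand-in for a parallel launch
        scale_work(Mode{}, f, t, ar, ai);
    }
}
```

Calling `scale_field<SimdMode>(field, ar, ai)` or `scale_field<SimtMode>(field, ar, ai)` touches the same memory in the same layout; only the mapping from threads to lanes differs.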
That analogous bodies of code can be generated for a given application across the distinct architectures of course carries important ramifications for development time and code manageability. Moreover, it can help enable cross-pollination between various projects as similarities shared by different codes emerge.
CPU-only code on GPU
HotQCD, which is based on OpenMP, was, like Grid, originally built exclusively to run on CPU machines.
The question of how to get a CPU-only application to run on GPUs—the Aurora GPUs in particular—breaks down into smaller questions. First, how do you convey information from a CPU to a GPU? One way would be to include explicit data transfers between the processors. Including explicit data transfers, however, would require numerous changes to the underlying code—the GPU’s every action would necessitate a data transfer. An alternative would be to rely on unified shared memory. Unified shared memory does not require explicit data transfers—the information would be automatically transferred to and from the GPU if accessed.
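As an illustration of the explicit-transfer option, the sketch below uses OpenMP `map` clauses on an offloaded loop. The routine `axpy_offload` is hypothetical and is not drawn from HotQCD; whether HotQCD expresses its transfers this way is an assumption. Built without OpenMP offload support, the pragma is ignored or falls back to the host and the loop produces the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// y[i] += a * x[i], offloaded with explicit data movement.
void axpy_offload(double a, const std::vector<double>& x, std::vector<double>& y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    const double* xp = x.data();
    double* yp = y.data();

    // map(to: ...)     copies x to the device before the loop runs.
    // map(tofrom: ...) copies y to the device and back afterwards.
    // Under unified shared memory (for example, with the OpenMP 5.0
    // "requires unified_shared_memory" directive on a supporting
    // compiler and device), such clauses would not be needed because
    // the runtime migrates data on access.
    #pragma omp target teams distribute parallel for \
        map(to: xp[0:n]) map(tofrom: yp[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        yp[i] += a * xp[i];
    }
}
```

The cost the article describes is visible even here: every offloaded loop needs its own list of what moves where, which multiplies across a large code base.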
The developers must also determine how to make an OpenMP thread that maps to CPU cores compatible with GPUs. As with the majority of CPUs, all GPUs are SIMD machines. This means that on a CPU machine a CPU thread would execute a vector instruction and that on a GPU machine a GPU thread (or warp, to use NVIDIA’s terminology) would execute a vector instruction.
Parallelization and vectorization can be induced with OpenMP via pragmas: one pragma effects parallelization, another effects vectorization. With compiler support, the pragmas can run with full performance on GPU machines, and the developers need to make only minor changes to the code if a vectorized CPU version already exists and is parallelized via OpenMP.
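The two kinds of pragma can be seen side by side in a small sketch. The routine `norm2` and its blocked loop structure are hypothetical and chosen only for illustration; a compiler without OpenMP enabled simply ignores the pragmas and runs the loops serially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum of squares of all entries of v, processed in blocks of `lanes`.
// The outer loop is parallelized; the inner loop is vectorized.
double norm2(const std::vector<double>& v, std::size_t lanes) {
    assert(lanes > 0 && v.size() % lanes == 0);
    const std::size_t blocks = v.size() / lanes;
    const double* p = v.data();
    double total = 0.0;

    // Parallelization: blocks are distributed across threads.
    // For GPU offload, this directive would typically be replaced by a
    // "target teams distribute parallel for" form plus data mapping.
    #pragma omp parallel for reduction(+ : total)
    for (std::size_t b = 0; b < blocks; ++b) {
        double partial = 0.0;
        // Vectorization: lanes within a block map to vector lanes.
        #pragma omp simd reduction(+ : partial)
        for (std::size_t l = 0; l < lanes; ++l) {
            const double value = p[b * lanes + l];
            partial += value * value;
        }
        total += partial;
    }
    return total;
}
```

Because the loop body is unchanged between targets, a version that already vectorizes well on CPUs gives the compiler the same structure to exploit on GPUs.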
Ultimately, the success of OpenMP vectorization seems to occur in pairs—that is, successful OpenMP vectorization on GPU systems tends to suggest successful OpenMP vectorization on CPU systems (and vice versa), and unsuccessful OpenMP vectorization on GPU systems tends to suggest unsuccessful vectorization on CPU systems (and vice versa).
Source: NILS HEINONEN, ALCF