Porting a Lattice QCD Code-Suite to Exascale Architectures

April 30, 2021

As part of a new series aimed at sharing best practices in preparing applications for Aurora, the U.S. Department of Energy’s (DOE) Argonne National Laboratory is highlighting researchers’ efforts to optimize codes to run efficiently on graphics processing units (GPUs).

The fundamental interactions between the quarks and gluons that constitute protons and nuclei can be calculated systematically by the physics theory known as lattice quantum chromodynamics (LQCD). These interactions account for 99 percent of the mass in the visible universe, but they can only be simulated with powerful computer systems such as those housed at the U.S. Department of Energy’s (DOE) Argonne Leadership Computing Facility (ALCF).

While a majority of the small army of codes necessary for the study of LQCD was originally written to run well on CPU-based computers—including the ALCF’s Theta machine—the next generation of high-performance computing will derive much of its power from GPUs, as exemplified by the ALCF’s forthcoming Polaris and Aurora systems.

Exascale capabilities promise to expand high energy and nuclear physics by providing the ability to simulate atomic nuclei more realistically than has ever been possible, enabling groundbreaking discoveries about the details of quark-boson coupling foundational to our present understanding of elementary particles.

Given the size of the LQCD suite, preparing the applications for exascale by making them GPU-ready is no small effort.

The project’s three major code bases, Chroma, CPS, and MILC, specialize in different quark discretizations (Wilson-clover, domain-wall, and staggered formulations, respectively) and take advantage of optimized routines available in the QUDA (“QCD in CUDA”) library and Grid code. The project additionally supports two minor code bases: HotQCD, which is optimized for QCD thermodynamics, and QEX, which is intended for high-level development of lattice field theory codes.

Porting lattice QCD applications

Abstraction is the primary thrust of the porting process; the developers are working to make all the performance-critical parts of the LQCD codes completely vendor-independent.

The changes made through the abstraction process are localized to a few backend files that provide functionality for mathematical operations. Once all of these backend and target-specific calls are grouped, they can be replaced or rewritten with higher-level functions that make the code more generic.
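
As a rough illustration of this kind of abstraction, the sketch below gathers every target-specific operation into a small backend namespace and writes the application routine only in terms of it. All names here (backend::device_malloc, backend::copy, backend::parallel_for, axpy) are hypothetical and are not taken from QUDA, Grid, or any other LQCD code; the backend shown is a plain host fallback so that the example runs anywhere, whereas a CUDA or SYCL build would supply its own versions of the same functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <vector>

// Hypothetical backend layer: the only place where target-specific calls live.
// This host fallback stands in for a CUDA or SYCL implementation.
namespace backend {

inline void* device_malloc(std::size_t bytes) { return std::malloc(bytes); }
inline void device_free(void* p) { std::free(p); }

// On a real GPU backend this would be a host/device transfer.
inline void copy(void* dst, const void* src, std::size_t bytes) {
    std::memcpy(dst, src, bytes);
}

// On a real GPU backend this would launch a kernel over n work items.
template <typename Kernel>
void parallel_for(std::size_t n, Kernel kernel) {
    for (std::size_t i = 0; i < n; ++i) kernel(i);
}

}  // namespace backend

// Application-level routine: y = a * x + y, with no vendor-specific calls.
inline void axpy(double a, const std::vector<double>& x, std::vector<double>& y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    if (n == 0) return;
    const std::size_t bytes = n * sizeof(double);

    double* dx = static_cast<double*>(backend::device_malloc(bytes));
    double* dy = static_cast<double*>(backend::device_malloc(bytes));
    assert(dx != nullptr && dy != nullptr);

    backend::copy(dx, x.data(), bytes);
    backend::copy(dy, y.data(), bytes);
    backend::parallel_for(n, [=](std::size_t i) { dy[i] = a * dx[i] + dy[i]; });
    backend::copy(y.data(), dy, bytes);

    backend::device_free(dx);
    backend::device_free(dy);
}
```

With this structure, moving to a different vendor would in principle mean swapping the contents of the backend namespace while axpy and similar routines stay unchanged.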

This is happening on a large scale to remove CUDA-specific code.

QUDA is the largest code base of any of the components comprising the lattice QCD project. Direct calls to CUDA pervaded its entirety.

GPU-optimized QUDA was developed independently and has its own code base. In contrast to OpenMP and SYCL, CUDA does not offer a unified programming model.

Relying on a conversion tool to prepare the code to run on GPU machines was not a viable option; CUDA-specific code would have to be manually excised and refactored.

As part of the effort to move operations to the backend and genericize the code, the developers are constructing a SYCL backend; Intel, likewise, is adding an extension that expands SYCL’s functions with APIs similar to those of CUDA to make porting as easy as possible for users.

As the other two applications, Grid and HotQCD, already had vendor-independent programming interfaces, the work being done to them is backend-intensive.

Grid was originally a CPU-only code to which GPU support was later added via a CUDA backend; it now has a DPC++ backend as well. Its porting can be seen as twofold: from CPU to GPU, and from CUDA to DPC++.

It is more than just a code; it is a framework. It began as a CUDA abstraction for Nvidia that was expanded to incorporate SYCL compatibility. The expansion has helped guide the development of the SYCL backend, making its thread-indexing APIs exposable via global variables as in CUDA.

Early in the development cycle of Grid, a code benchmark called GridBench was constructed and functioned like a mini-app. GridBench incorporated the entire functionality of DPC++ to run the most important kernel, a stencil operator that, operating on multidimensional lattices, is responsible for key computations within the application.

The porting of the stencil operator illustrates a subtlety to bear in mind when translating between GPU and CPU systems. A developer cannot write code for GPUs exactly as for CPUs; in general there is no direct one-to-one correspondence between the two. Even so, code can be written for both types of architecture in ways that are not terribly different and are even reasonably natural, using the same approaches to programmability and optimization.

This is true of the Grid library itself: the CPU and GPU versions of the code base share the same memory layout (Array of Structures of Arrays, or AoSoA). A C++ template mechanism decides at compile time whether the code uses single instruction, multiple threads (SIMT) mode for the GPU or single instruction, multiple data (SIMD) mode for the CPU.
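
The idea can be sketched in a few lines of C++. This is not Grid’s actual implementation: the tag types, the Lanes trait, the ComplexBlock structure, and the lane width of 4 are all assumptions made for illustration, and the SIMT "threads" are emulated by an ordinary loop on the host. What the sketch does show is one AoSoA layout shared by both modes, with a template parameter selecting at compile time whether one worker sweeps all lanes of a block (SIMD style) or each worker handles a single lane (SIMT style).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tags selecting the execution mode at compile time.
struct SimdTag {};  // CPU: one thread processes every lane of a block
struct SimtTag {};  // GPU: one thread per lane (emulated by a loop here)

constexpr std::size_t kLanes = 4;  // assumed inner-array width

// AoSoA: a field is an array of blocks; each block is a structure of
// short arrays, so the same lane of re and im sits at the same index.
struct ComplexBlock {
    std::array<double, kLanes> re;
    std::array<double, kLanes> im;
};

// Compile-time trait giving the lane range each worker is responsible for.
template <typename Mode> struct Lanes;

template <> struct Lanes<SimdTag> {
    static std::size_t workers() { return 1; }
    static std::size_t begin(std::size_t) { return 0; }
    static std::size_t end(std::size_t) { return kLanes; }
};

template <> struct Lanes<SimtTag> {
    static std::size_t workers() { return kLanes; }
    static std::size_t begin(std::size_t worker) { return worker; }
    static std::size_t end(std::size_t worker) { return worker + 1; }
};

// The same kernel body serves both modes; only the lane range differs.
template <typename Mode>
void scale_block(ComplexBlock& block, double factor, std::size_t worker) {
    for (std::size_t l = Lanes<Mode>::begin(worker); l < Lanes<Mode>::end(worker); ++l) {
        block.re[l] *= factor;
        block.im[l] *= factor;
    }
}

template <typename Mode>
void scale_field(std::vector<ComplexBlock>& field, double factor) {
    assert(Lanes<Mode>::workers() >= 1);
    for (auto& block : field) {
        // On a GPU the workers would be hardware threads, not loop iterations.
        for (std::size_t w = 0; w < Lanes<Mode>::workers(); ++w) {
            scale_block<Mode>(block, factor, w);
        }
    }
}
```

Calling scale_field<SimdTag> or scale_field<SimtTag> on the same data produces the same result; only the mapping of lanes to workers changes.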

That analogous bodies of code can be generated for a given application across the distinct architectures of course carries important ramifications for development time and code manageability. Moreover, it can help enable cross-pollination between various projects as similarities shared by different codes emerge.

CPU-only code on GPU

HotQCD, which is based on OpenMP, was, like Grid, originally built exclusively to run on CPU machines.

The question of how to get a CPU-only application to run on GPUs—the Aurora GPUs in particular—breaks down into smaller questions. First, how do you convey information from a CPU to a GPU? One way would be to include explicit data transfers between the processors. Including explicit data transfers, however, would require numerous changes to the underlying code—the GPU’s every action would necessitate a data transfer. An alternative would be to rely on unified shared memory. Unified shared memory does not require explicit data transfers—the information would be automatically transferred to and from the GPU if accessed.
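
The two options can be contrasted in a short OpenMP sketch. It is not taken from HotQCD; the function names and the SKETCH_UNIFIED_SHARED_MEMORY macro are hypothetical. The first function states the transfer explicitly with a map clause. The second, compiled only when the macro is defined, omits the clause and relies on the requires unified_shared_memory directive, which in turn assumes a compiler and device that support it. Without an offloading compiler the pragmas are ignored and the loops simply run on the host.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical opt-in switch: define it only when the compiler and the
// device support unified shared memory. The directive has to appear
// before any target construct in the translation unit.
#ifdef SKETCH_UNIFIED_SHARED_MEMORY
#pragma omp requires unified_shared_memory
#endif

// Option 1: explicit data transfer. The map clause copies data[0..n) to
// the device before the loop and back to the host afterwards.
inline void scale_mapped(double* data, std::size_t n, double factor) {
    assert(data != nullptr || n == 0);
    #pragma omp target teams distribute parallel for map(tofrom: data[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        data[i] *= factor;
    }
}

#ifdef SKETCH_UNIFIED_SHARED_MEMORY
// Option 2: unified shared memory. No map clause; the host pointer is
// used directly and the system moves the data when it is accessed.
inline void scale_shared(double* data, std::size_t n, double factor) {
    assert(data != nullptr || n == 0);
    #pragma omp target teams distribute parallel for
    for (std::size_t i = 0; i < n; ++i) {
        data[i] *= factor;
    }
}
#endif
```

In the explicit variant every offloaded loop needs its own map clause (or an enclosing data region), which is the source of the numerous code changes described above; the shared-memory variant leaves the loop untouched apart from the offload pragma.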

The developers must also determine how to make an OpenMP thread that maps to CPU cores compatible with GPUs. As with the majority of CPUs, all GPUs are SIMD machines. This means that on a CPU machine a CPU thread would execute a vector instruction and that on a GPU machine a GPU thread (or warp, to use NVIDIA’s terminology) would execute a vector instruction.

Parallelization and vectorization can be induced with OpenMP via pragmas: one pragma effects parallelization, another effects vectorization. With compiler support, the pragmas run with full performance on GPU machines, and the developers need to make only minor changes to the code, provided that a vectorized CPU version already exists and is parallelized via OpenMP.
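
A minimal sketch of the two pragmas working together, assuming a simple update y = a * x + y rather than any actual HotQCD kernel; the function name and the block size of 8 are illustrative choices. The outer loop is distributed across threads with parallel for, and the inner loop over each block is marked for vectorization with simd. Compiled without OpenMP enabled, the pragmas are ignored and the code runs serially with the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kBlock = 8;  // assumed number of elements per inner loop

// y = a * x + y, parallelized over blocks and vectorized within each block.
inline void axpy_blocked(double a, const std::vector<double>& x, std::vector<double>& y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    const double* xp = x.data();
    double* yp = y.data();

    // Parallelization: blocks are divided among the OpenMP threads.
    #pragma omp parallel for
    for (std::size_t start = 0; start < n; start += kBlock) {
        const std::size_t stop = (start + kBlock < n) ? start + kBlock : n;

        // Vectorization: each thread executes its block with vector instructions.
        #pragma omp simd
        for (std::size_t i = start; i < stop; ++i) {
            yp[i] = a * xp[i] + yp[i];
        }
    }
}
```

For a GPU build, the outer pragma would plausibly become a target teams distribute parallel for construct while the inner simd loop stays as it is, which is the sense in which only minor changes are needed.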

Ultimately, OpenMP vectorization seems to succeed or fail in pairs: successful OpenMP vectorization on GPU systems tends to suggest successful OpenMP vectorization on CPU systems (and vice versa), and unsuccessful OpenMP vectorization on GPU systems tends to suggest unsuccessful vectorization on CPU systems (and vice versa).


Source: NILS HEINONEN, ALCF
