Jan. 30, 2023 — Numerical libraries have an enormous impact on scientific computing because they act as the gateway middleware that enables many applications to run on state-of-the-art hardware. Algorithmic and implementation advances in these libraries can increase the speed and accuracy of many modeling and simulation packages and can provide access to new computational systems.
Part of the magic of mathematics is that the same mathematical methods can be applied to wildly disparate physical phenomena. Examples include linear algebra and the fast Fourier transform (FFT). This generality is the foundation upon which numerical libraries are built. A well-designed numerical library also provides an abstraction layer via an API in such a way that users need only focus on their research and not on the computational platform. Such is the case with the Computational Libraries Optimized via Exascale Research (CLOVER) project.
Encompassing three foundational libraries, the Exascale Computing Project’s (ECP’s) CLOVER project has brought GPU acceleration to the Software for Linear Algebra Targeting Exascale (SLATE) library for dense linear algebra, to Ginkgo for sparse linear algebra, and to the highly efficient FFTs for exascale (heFFTe) library for CPU- and GPU-accelerated, multinode, multidimensional FFTs. These libraries provide scientists around the world with access to the latest in GPU-accelerated computing—whether the application is running on an exascale system, on a computational cluster, or locally on a GPU-accelerated laptop or workstation.
Preliminary results demonstrate that CLOVER users can run efficiently on AMD, NVIDIA, and Intel GPU-accelerated systems. These benchmarks also demonstrate performance parity with AMD, NVIDIA, and Intel vendor library implementations on some computational kernels. In the race for performance portability, Hartwig Anzt (Figure 1)—director of the Innovative Computing Laboratory, professor in the Min H. Kao Department of Electrical Engineering and Computer Science (EECS) at the University of Tennessee, and research group leader at the Karlsruhe Institute of Technology—observed, “Software lives longer than hardware. Our team is working to avoid library death through good design.”
Scalability and efficiency are essential to achieving exascale performance. To this end, benchmark results demonstrate that the heFFTe library will be able to scale to support production runs on the Frontier exascale system once Frontier passes its ready-for-production acceptance tests. The heFFTe team conducted these benchmarks on Crusher, which is an early access test-bed system built with hardware identical to that of the Frontier supercomputer and designed with similar software.
Technology Introduction
Linear algebra provides scientists with a language to describe space and the manipulation of space by using numbers that can be calculated on a computer. Specifically, linear algebra is about performing linear transformations—combinations of arithmetic operations on columns of numbers (i.e., vectors) and arrays of numbers (i.e., matrices)—to create new, more meaningful vectors and matrices. Based on such linear algebra calculations, scientists can glean useful information about speed, distance, or time in a physical space. Alternatively, scientists can use these calculations in an abstract space to perform a linear regression, which is a valuable tool used to predict data related to decision-making, medical diagnosis, and statistical inference. The Google PageRank algorithm used by search engines provides a common example of the power of linear algebra and eigenvalues. The 3Blue1Brown video “Essence of Linear Algebra Preview” briefly introduces linear algebra for a general technical audience.
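To make the PageRank example concrete, the sketch below (not from the article; a minimal pure-Python illustration with a made-up three-page link graph) uses power iteration—repeated matrix-vector products—to find the dominant eigenvector of a column-stochastic link matrix, which is the eigenvalue idea behind PageRank.

```python
# Power iteration: repeatedly apply the link matrix to a vector until it
# converges to the dominant eigenvector (the PageRank vector).

def mat_vec(A, x):
    """Multiply matrix A (list of rows) by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def power_iteration(A, steps=100):
    n = len(A)
    x = [1.0 / n] * n                      # start from a uniform vector
    for _ in range(steps):
        y = mat_vec(A, x)
        norm = sum(abs(v) for v in y)      # renormalize each step
        x = [v / norm for v in y]
    return x

# Hypothetical 3-page web: entry A[i][j] is the probability of moving
# from page j to page i (each column sums to 1).
A = [
    [0.0, 0.5, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0],
]
ranks = power_iteration(A)   # converges to [4/9, 2/9, 1/3]
```

Production libraries perform the same matrix-vector products, but on matrices with billions of entries, which is where GPU-accelerated linear algebra becomes essential.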
Normally, the choice of technique and library depends on the type of matrix being targeted. For example, the SLATE library is designed to operate on dense matrices, in which most or all of the elements are nonzero. Arithmetic operations on these types of matrices tend to be computationally intensive, so users can expect higher performance on a GPU. The Ginkgo library, by contrast, is designed to operate on sparse matrices, in which few elements are nonzero. Operations on sparse matrices tend to be bound by memory bandwidth when accessing nonzero matrix elements and limited by memory capacity when an operation creates many nonzero matrix entries. Through smart design techniques, the Ginkgo team maximized the use of the GPU memory subsystem to achieve performance competitive with vendor-optimized, sparse matrix kernels. Ginkgo has a huge advantage over vendor libraries because it is GPU agnostic, so scientists are not locked into a specific vendor implementation or hardware.
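The dense/sparse distinction comes down to storage. The following sketch (illustrative only; Ginkgo's actual formats and kernels are far more sophisticated) shows compressed sparse row (CSR) storage, a standard sparse format: only the nonzeros are kept, so a matrix-vector product reads far less memory than a dense loop would—which is why such kernels are memory-bandwidth bound rather than compute bound.

```python
# Compressed sparse row (CSR) storage: three flat arrays replace the
# dense matrix, and the mat-vec touches only stored nonzeros.

def to_csr(dense):
    """Convert a dense matrix (list of rows) to CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0.0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr

def csr_mat_vec(values, col_idx, row_ptr, x):
    """y = A @ x using only the stored nonzeros."""
    y = []
    for i in range(len(row_ptr) - 1):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += values[k] * x[col_idx[k]]
        y.append(s)
    return y

A = [
    [4.0, 0.0, 0.0, 1.0],
    [0.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 5.0, 2.0],
]
vals, cols, ptr = to_csr(A)       # 5 nonzeros stored instead of 12 entries
y = csr_mat_vec(vals, cols, ptr, [1.0, 1.0, 1.0, 1.0])
# y == [5.0, 3.0, 7.0]
```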
The FFT has been described as “the most important numerical algorithm of our lifetime,” and the Institute of Electrical and Electronics Engineers (IEEE) magazine, Computing in Science & Engineering, included it in the top 10 algorithms of the twentieth century. It is used in many domain applications, including molecular dynamics, spectrum estimation, fast convolution and correlation, signal modulation, and wireless multimedia. The heFFTe website notes that more than a dozen ECP applications use FFT in their codes. The 3Blue1Brown video, “But What Is the Fourier Transform? A Visual Introduction,” provides an overview of this important algorithm for a general audience.
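For readers who want to see the algorithm itself, here is a minimal radix-2 Cooley-Tukey FFT in pure Python (a textbook sketch, not heFFTe's implementation): by splitting the transform into even- and odd-indexed halves recursively, it reduces the naive O(n²) discrete Fourier transform to O(n log n).

```python
import cmath

def fft(x):
    """Radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])          # transform even-indexed samples
    odd = fft(x[1::2])           # transform odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        # Twiddle factor combines the two half-size transforms.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

# A constant signal transforms to a single spike at frequency zero.
X = fft([1.0, 1.0, 1.0, 1.0])
```

Libraries such as heFFTe apply this same divide-and-conquer structure, but distribute the data across many nodes and GPUs for 2D and 3D transforms.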
The heFFTe library revisited the design of existing FFT libraries to develop a distributed, 3D FFT library and robust 2D implementations that can support applications that run on large-scale, heterogeneous systems with multicore processors and hardware accelerators. For example, the heFFTe team has focused on implementing a scalable, GPU-enabled, and performance-portable 3D FFT. Distributed 3D FFTs are one of the most important kernels in molecular-dynamics computations, and the performance of these FFTs can drastically affect an application’s ability to run on larger machines. Additionally, the redesign effort was a codesign activity that involved other ECP application developers. For more information, see “heFFTe—A Widely Applicable, CPU/GPU, Scalable Multidimensional FFT That Can Even Support Exascale Supercomputers.”
Anzt recognized the importance of lessons learned from the community-based Extreme-scale Scientific Software Development Kit (xSDK) and Extreme-scale Scientific Software Stack (E4S) projects. He emphasized that API design and user feedback go hand in hand. Furthermore, continuous integration (CI) is essential to ensure library performance and correctness on all supported systems. CI also frees development from the constraints of legacy algorithms and code decisions, as highlighted in the ECP article, “High-Accuracy, Exascale-Capable, Ab Initio Electronic Structure Calculations with QMCPACK: A Use Case of Good Software Practices.”
Technical Discussion
Each of the three libraries in the CLOVER project provides essential functionality for a large scientific user base. GPU acceleration is critical to achieving high performance on exascale supercomputers. Performance benchmarks demonstrate that the CLOVER libraries are ready for production runs on the Frontier exascale system. Early results on Intel and AMD GPUs demonstrate that all three libraries will deliver high performance regardless of the GPU vendor.
SLATE
The SLATE library implements a GPU-accelerated, distributed, dense linear algebra library that will replace the Scalable Linear Algebra PACKage (ScaLAPACK). The Linear Algebra PACKage (LAPACK) and ScaLAPACK have been the standard linear algebra libraries for decades, and this success can largely be attributed to the layered software stack (Figure 2) that can call vendor-optimized Basic Linear Algebra Subprograms (BLAS).
SLATE follows this proven paradigm to implement similar ScaLAPACK functionality (e.g., parallel BLAS, norms, linear system solvers, least squares, eigenvalue problems, singular value decomposition) and to expand coverage to new algorithms.
The goal is to support current hardware (i.e., CPUs and GPUs) and to provide sufficient flexibility to support future hardware designs. The software layers are designed to prevent language lock-in, as successfully demonstrated by current support for Open Multi-Processing (OpenMP), CUDA, ROCm (Radeon Open Compute Platform), and oneAPI. This performance portability relies on the C++ standard library and C++ templates to avoid code duplication.
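The layered approach described above rests on blocking: a dense operation is decomposed into tile-sized pieces, and each piece is handed to an optimized kernel (vendor BLAS on CPUs, or a GPU kernel). The pure-Python sketch below illustrates the tiling pattern only—real libraries such as SLATE dispatch each inner tile update to hardware-specific kernels rather than Python loops.

```python
def tiled_matmul(A, B, tile=2):
    """C = A @ B computed tile by tile: the blocking pattern that lets
    dense linear algebra libraries delegate each tile-sized update to an
    optimized BLAS or GPU kernel."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, p, tile):
            for k0 in range(0, m, tile):
                # One tile-sized GEMM update; in a production library this
                # inner kernel is a vendor-optimized routine.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, p)):
                        s = C[i][j]
                        for k in range(k0, min(k0 + tile, m)):
                            s += A[i][k] * B[k][j]
                        C[i][j] = s
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = tiled_matmul(A, B)
# C == [[19.0, 22.0], [43.0, 50.0]]
```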
To continue reading ECP’s report, please click here.
Source: Rob Farber, Exascale Computing Project