The Exascale Computing Project (ECP) is working to combine two key technologies, LLVM and continuous integration (CI), to ensure that current and future compilers are stable and performant on high-performance computing (HPC) and exascale computer systems. The proliferation of new machine architectures has made the continuous testing and verification of software (hence the “continuous” in CI) an essential part of US Department of Energy (DOE) supercomputing.
Valentin Clement, a software engineer at Oak Ridge National Laboratory who is part of the team working to include LLVM in the ECP CI testing and verification framework, notes, “We are working to add CI for ECP-relevant architectures. This facilitates collaboration as each DOE lab currently has their own separate LLVM fork. Centralizing to one software fork for all the labs avoids wasted effort. It also means we can work to support GPUs from multiple vendors in one LLVM fork, which benefits all DOE sites and the world-wide LLVM community. Additionally, we can work to increase support for offloading to GPUs as GPU support is not really well tested in the current upstream LLVM release.”
The importance of the open-source LLVM collection of compiler and toolchain technologies cannot be overstated. Compilers that generate correct and performant binaries are a make-or-break technology, which is why the ECP CI framework is so important to the US supercomputing effort. Testing and verification are the only way to ensure that a compiler works as intended. Johannes Doerfert, a researcher at Argonne National Laboratory, observes, “People don’t realize that most vendor compilers are LLVM-based. Improvements in collaboration, as well as improvements to LLVM, benefit all vendor products as well as the broad HPC community.”
Clement observes that “LLVM is a huge project. We are one of the first to try a CI fork of LLVM in ECP as CI is a huge resource investment. There are many interactions with the different facilities, plus there are interactions with several other ECP projects, such as SOLLVE and Flang, which also contribute to LLVM.”
The magnitude and impact of the CI task can be seen in Figure 1, which illustrates the breadth of languages and compilers encompassed by the ECP LLVM effort, each of which supports important HPC applications on ECP-relevant architectures. All LLVM and CI work falls under the PROTEAS-TUNE effort led by Jeffrey Vetter, who manages the ECP LLVM efforts.
Key Benefits of the ECP CI Effort
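Better support for GPUs is a key benefit of the CI effort, in part because a shared, continuously tested fork lets the labs validate GPU-related changes without waiting for them to reach an upstream release. “For example,” Clement notes, “the integration effort can be really slow. It can take upward of 1 year for some code changes to be incorporated in the main LLVM release.”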
Another key benefit of the ECP CI effort is that it gives the DOE and HPC communities the opportunity to focus on HPC-specific needs. Many HPC and scientific codes are written in Fortran. LLVM's permissive license means that the HPC community can work to create an effective parallelizing and GPU-enabled Fortran compiler of its own. This eliminates a dependency on commercial companies that no longer see significant commercial demand for a Fortran compiler.
This is not to say that vendors are ignoring the ECP Fortran development efforts. Clement notes that although the Flang Fortran front end has received significant investment from DOE labs, vendors (e.g., NVIDIA, ARM, AMD) also participate in and contribute to it.
Without performant compilers that can generate correct binary code for both CPU and GPU architectures, the next generation of exascale supercomputers cannot become a reality.
Given the ubiquity of LLVM-based compilers, including LLVM in the ECP CI infrastructure is a necessary test and validation step to ensure that reliable and performant compilers exist for each DOE supercomputer system. Because the LLVM license is permissive, CI also allows the HPC community to identify and quickly fix bugs and performance regressions that occur at any ECP site, and to advance the state of the art in compiler technology in a tested and verified manner.
The ECP CI infrastructure is currently testing and verifying much of the Extreme-Scale Scientific Software Stack (E4S) software ecosystem. Users can easily download the E4S software ecosystem for evaluation and production runs. More information can be found at the E4S website.
Rob Farber is a global technology consultant and author with an extensive background in HPC and in developing machine learning technology that he applies at national laboratories and commercial organizations.