HPC news headlines often highlight the latest hardware speeds and feeds. While advances on the hardware front are important, improving the ability to write software for advanced systems is equally important. Indeed, HPC software has always been challenging to create and manage.
Traditionally, writing HPC software has meant using libraries like MPI, OpenMP, or CUDA in conjunction with C, C++, or Fortran. The choice of library is often dictated by the underlying hardware and frequently limits portability. While this low-level approach has enabled the HPC community to reach high levels of performance, it has also served to inhibit new non-HPC users from taking advantage of advanced hardware.
The relatively young Chapel language offers a high-level approach to HPC programming that includes both hardware independence and high performance. Version 2.1 of Chapel was recently released (June 2024) to the community. HPCwire reached out to the Chapel development team with nine questions about the current capabilities of the Chapel HPC language.
1. What is your “elevator speech” about why someone should be considering Chapel for HPC applications?
Chapel supports writing applications that can target the distributed multicore processors and GPUs of HPC systems using a single, consistent set of features that express parallelism and locality. This sharply contrasts with the status quo, in which each level of hardware parallelism tends to come with its own programming dialect, often involving changes/extensions to existing languages, libraries, or vendor-specific approaches. As a compiled language, Chapel benefits from HPC-aware optimizations, typically resulting in performance that matches or beats standard approaches like MPI, SHMEM, OpenMP, or CUDA. In practice, Chapel applications have scaled to thousands of compute nodes and over a million cores. Best of all, these performance and scalability benefits are all contained within a general-purpose language whose design supports writing clear, concise code.
2. Does Chapel support GPUs? Is it possible to easily create an application that can recognize GPUs and use them if available and otherwise use the available CPUs (cores)?
Yes, Chapel does support vendor-neutral GPU programming, and our recent releases have supported NVIDIA and AMD GPUs. Intel GPUs are also of interest but currently remain future work.
Thanks to Chapel’s locales, it is possible and reasonably easy to write applications that work with or without GPUs. The ‘locale’ type is how we represent system resources within a Chapel program. A Chapel application running on n compute nodes has an n-element array of locale values representing those nodes, which permits users to make use of them and reason about them. For example, locales support queries that return the amount of memory or parallelism available on a given node. Each top-level locale also contains an array of GPU sub-locales, which represent the node’s GPUs and can be used to target GPU processors and memories. When running on a compute node that has no GPUs, this is simply an array of size zero, permitting code to query and respond to that fact or to be written in a way that works whether the array is populated or empty.
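To make this concrete, here is a minimal sketch (the array ‘A’, size ‘n’, and the computation are purely illustrative) of code that runs a kernel on a GPU sub-locale when one exists and falls back to the node’s CPU cores otherwise:

```chapel
config const n = 1_000_000;
var A: [1..n] real;

if here.gpus.size > 0 {
  // 'here.gpus' is the current locale's array of GPU sub-locales;
  // a 'forall' inside an 'on' block targeting one is compiled to a GPU kernel
  on here.gpus[0] {
    var B: [1..n] real;
    forall i in 1..n do B[i] = sqrt(i:real);
    A = B;   // copy results back to node memory
  }
} else {
  // no GPUs present: the same loop runs across the node's CPU cores
  forall i in 1..n do A[i] = sqrt(i:real);
}
```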
For HPCWire readers interested in learning more about Chapel’s GPU support, I’d recommend watching this talk+demo by Engin Kayraklioglu and Jade Abraham or reading this ongoing blog series by Daniel Fedorin and Engin.
3. In a similar vein, depending on the underlying hardware, can Chapel programs be written to use IB and/or Ethernet? (i.e., How hard is it to be portable across the interconnect?)
Yes, Chapel programs port trivially across HPC interconnects, including HPE Slingshot, InfiniBand, AWS’s Elastic Fabric Adapter (EFA), and Ethernet. Chapel supports a global namespace in which variables that are visible through traditional lexical scoping rules can be accessed, whether they are stored in local memory or on a remote node. Data transfers between nodes are implemented and optimized by Chapel’s compiler and runtime, removing the need to explicitly implement communication using libraries like MPI, Libfabric, or a network-specific API.
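As a small sketch of what that global namespace looks like in code (the variable ‘x’ is illustrative):

```chapel
var x = 42;                      // allocated on locale 0

on Locales[numLocales-1] {
  // 'x' is visible here through normal lexical scoping, even though
  // this block executes on another node; the compiler and runtime
  // generate and optimize the communication needed to fetch it
  writeln(x + 1);
}
```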
Of course, performance can vary depending on the target network’s capabilities. For example, a Chapel program that relies on lots of remote atomic operations will run great on a system with HPE Slingshot, where such operations enjoy native support in the interconnect, but it may bog down on an Ethernet system, where remote computation would be needed to implement the atomicity. Note that this issue isn’t specific to Chapel, though—it’s the classic performance-portability question of whether to use system features that enhance performance yet are not universally available. That said, using consistent features—like atomic operations—without needing to worry about whether they’re supported in the hardware of a given platform is a great starting point for portability and performance, compared to manually mapping down to network-specific features.
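For instance, here is a sketch of the kind of remote atomic operation described above; the same code runs unchanged on any of these networks, with the implementation mapped to NIC-native atomics where the hardware supports them:

```chapel
var count: atomic int;           // stored on locale 0

coforall loc in Locales do on loc {
  // each locale increments locale 0's counter; on Slingshot this can
  // map to a network-native atomic, elsewhere to remote computation
  count.add(1);
}

writeln(count.read() == numLocales);  // true
```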
I should note that Chapel’s portability across networks benefits greatly from LBNL’s GASNet-EX middleware, thanks to its support for RMA (remote memory access), active messages, and atomic operations—the three types of communication that Chapel needs to run on a given network.
4. In terms of CPU acceleration, does Chapel support things like AVX-512 vector instructions?
Chapel does support vectorized computations, and our compiler benefits greatly from LLVM in this regard (LLVM also plays a key role in our GPU support). Generally speaking, Chapel programs are compiled down to C-level operations that are then translated into LLVM IR. We then have LLVM compile the code down to the ISAs of the target processors, optimizing along the way. When compiling Chapel’s data-parallel constructs—like ‘forall’ loops or whole-array operations—our compiler uses LLVM metadata to mark the operations’ serial inner loops as order-independent, making them candidates for vectorization. In practice, this results in well-tuned serial code for the target CPU without any additional effort from the user.
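As an example, both forms of this (illustrative) computation produce inner loops that are marked as order-independent and are thus candidates for LLVM's vectorizer:

```chapel
config const n = 1024;
var A, B, C: [1..n] real;

// explicit data-parallel loop
forall i in 1..n do C[i] = A[i] + 2.0 * B[i];

// equivalent whole-array form
C = A + 2.0 * B;
```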
We also make features of target CPUs available as standard library routines to handle cases where LLVM can’t be expected to automatically make use of a specific feature or where the user doesn’t want to rely on automated optimization and instruction selection. Examples include computing a fused multiply-add or the ‘popcount’ of an integer.
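A couple of small examples of such routines; I'm assuming here that ‘popCount’ comes from the standard ‘BitOps’ module and ‘fma’ from the ‘Math’ module:

```chapel
use BitOps, Math;

writeln(popCount(0b1011:uint));   // 3 bits set
writeln(fma(2.0, 3.0, 1.0));      // fused multiply-add: 2.0*3.0 + 1.0
```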
5. Chapel can be installed across a number of systems, from laptops to clusters. Is it possible to maintain a single version of an application across all these hardware environments? That is, does Chapel allow me to avoid the “two version” problem in HPC?
Definitely. Chapel programs begin by running ‘main()’ on a single core and then introduce parallelism dynamically as the program executes, whether locally, on GPUs, or across compute nodes. This design means the first prototype code you sketch out on your laptop can be incrementally evolved into a scalable, distributed-memory code, often with only modest changes. For example, a Chapel array’s declaration can easily be updated to specify that its elements should be distributed across some or all of the program’s locales. With this change, the parallel loops and operations over the array automatically switch from being local, multicore operations to distributed computations, making use of the cores of the target locales, with no other source code changes required. Basically, the declarations undergo modest adjustments, yet the science of the computation can remain unchanged.
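For example, here is a sketch of the kind of declaration change described above, using the standard ‘BlockDist’ distribution (the array and loop are illustrative):

```chapel
use BlockDist;

config const n = 1000;

// local version:
//   var A: [1..n] real;
// distributed version: only the domain's declaration changes
const D = blockDist.createDomain({1..n});
var A: [D] real;

// this loop now runs in parallel across all locales' cores,
// with each locale computing on its local chunk of A
forall i in D do A[i] = i;
```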
Contrast this with other HPC technologies where the user has to code in a Single Program, Multiple Data (SPMD) programming model. In such approaches, ‘main()’ changes from being called just once to once per node or core. While that change is conceptually simple and relatively easy to implement, it has a dramatic impact on how a local laptop program needs to be restructured: data structures have to be manually decomposed into per-image chunks, control flow has to be updated to ensure each image performs its local piece of the work, and (typically) explicit communication needs to be added to coordinate and transfer data between the program images.
We consider these SPMD-induced code changes the primary source of the “two version” problem of conventional HPC programming that you mention. They also represent a huge barrier to helping more laptop programmers and applications make the transition to HPC systems, because SPMD is such a different way of viewing code—one that we’ve simply become numb to in the HPC community. Our team believes that Chapel’s post-SPMD execution model, scope-based global namespace, and built-in support for parallelism are crucial remedies for these issues.
6. Is there a programming language that Chapel is “most similar to” (i.e., How hard is it to learn Chapel?)
Chapel isn’t an extension of an existing language, though in designing it, we certainly took inspiration from a number of languages—as well as lessons about what to avoid. It’s difficult for me to say that Chapel is most similar to any one language, but a rough characterization would be that it’s fairly Python-like in terms of code clarity and level of expression, yet with syntactic elements from C (curly brackets and semicolons rather than whitespace sensitivity) and Modula-3 (left-to-right keyword-based declarations). Chapel also has rich support for multidimensional arrays, as in Fortran 90. Something that pleases me is that we often hear programmers talk about positive resonances they see between Chapel and their favorite language, whether that’s Fortran, C++, or Python. In practice, my sense is that programmers from diverse backgrounds like these resonances and find Chapel easy to get started with.
For those interested in learning about Chapel, we have a number of resources available on our website as well as community support forums for getting answers to questions, either online or in live sessions.
7. How easy is it to use existing libraries with Chapel? (i.e., I have a sequential C++ code I don’t want to rewrite.)
Chapel supports interoperability with other languages, permitting existing libraries to be called from Chapel or for Chapel libraries to be created and invoked from other languages. Calling between Chapel and C is certainly the most exercised and mature path, and since C acts as a lingua franca, this tends to provide a path to any other language. That said, we also have support for more native/direct interoperability with Python and Fortran, such as the ability to pass multidimensional arrays between Fortran and Chapel in a copy-free manner.
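As a minimal sketch of the C path (using the POSIX ‘getpid’ routine purely as an example):

```chapel
use CTypes;              // provides C-compatible types like c_int

require "unistd.h";      // name the C header to include
extern proc getpid(): c_int;   // declare the C function's signature

writeln("running as process ", getpid());
```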
Since you asked specifically about C++: in practice, our team does a lot of C++ interoperability given its broad usage in libraries, but this is almost always done by creating C wrappers around the library, primarily due to challenges like name mangling and differences in OOP semantics. Most Chapel users and developers would like to see more native support for C++ interoperability, but it’s a sufficiently heavy lift that we haven’t had the opportunity to prioritize it yet.
8. What about performance? Can you point to some recent benchmarks that show parallel performance?
Here are three recent performance results that I’m particularly proud of for different reasons:
The first is a serial benchmark from the Computer Language Benchmarks Game that computes an n-body interaction between the five largest bodies in our solar system. In the current standings, a single Chapel implementation (Chapel #3) is the fastest entry that doesn’t use hand-written vector instructions or “unsafe” operations, while simultaneously being the most compact entry in terms of compressed code size and a very clear implementation. This is also a benchmark whose performance improved significantly over the past year—with no changes to its source code—due to improvements in our integration with LLVM, as mentioned above.
The second result is my favorite scalability run from last year, in which we sorted 256 TiB of data in 30 seconds using 8192 nodes of an HPE Cray EX running Slingshot-11. This was an exciting result for three major reasons. The first is simply the scale of the run, which exceeds anything I could’ve imagined doing when we first started the Chapel project. The second is that we got this result on our first run at this scale, despite it using an order of magnitude more nodes than our previous largest run—and as you surely know, this virtually never happens when running at new scales in HPC. The third is that this is not simply a benchmark but a crucial part of Arkouda—a flagship Chapel framework that provides Python programmers with interactive data science capabilities at HPC scales.
The third performance result I’d like to highlight is actually a pair of (unrelated) papers presented at SC23 (at the PAW-ATM workshop), each of which uses Chapel effectively in its respective field—satellite image analysis for coral reefs and exact diagonalization for quantum many-body physics. Though the scientific areas, approaches, and codes are completely different, each application achieved a significant performance improvement relative to prior art while also benefitting significantly from Chapel’s productivity features.
9. Anything else you want our readers to know about Chapel?
I think we’ve covered a lot of good ground here, thanks to your excellent questions! Two final things that occurred to me to mention are:
Last month, we held our annual flagship Chapel event, which we revamped and rebranded this year from CHIUW (the Chapel Implementers and Users Workshop) to ChapelCon. For those interested in a recap, Engin Kayraklioglu, ChapelCon’s general chair, wrote a great summary of the event for our blog. If you were to watch or browse just one talk from ChapelCon, it should be Paul Sathre’s keynote, A Case for Parallel-First Languages in a Post-Serial, Accelerated World, which was an excellent testimony to the value of languages like Chapel from an external perspective.
Finally, if you’re interested in keeping up with the latest highlights from the Chapel project, be sure to keep an eye on our blog and social media accounts.