Having just returned from SC13, one burning issue is the choice of a standard approach for programming the next generation HPC systems. While not guaranteed, these systems are likely to be large clusters of nodes with multicore CPUs and some sort of attached accelerators. A standard programming approach is necessary to convince developers, and particularly ISVs, to start adoption now in preparation for this coming generation of systems. John Barr raised the same question in a recent article at Scientific Computing World from a more philosophical point of view. Here I address this question from a deeper technical perspective.
HPC programming is currently dominated by either a flat model with MPI across nodes as well as cores within a node, or a hybrid model with MPI across the nodes and OpenMP shared memory parallelism across the cores in a node. The advantage of flat MPI is a simpler programming model, only one level of parallelism and only one API. The disadvantage is it doesn’t take advantage of the shared data across the ranks on the same node, requiring message and buffer management across all ranks. MPI+OpenMP roughly inverts those advantages and disadvantages.
The reason MPI and MPI+OpenMP have worked so well over the past 20 years now is that most HPC systems are roughly isomorphic, with some differences in instruction set, node topology, and performance profiles. The system is a network of nodes, the nodes have one or more processors, the processors have one or more cores. The cores on a node share virtual and physical memory, with hardware cache coherence to make shared memory programming relatively safe. There are some outliers, like the big SGI shared memory systems, which have some programming model and performance advantages for certain applications.
John Barr suggests that the options for programming next-generation HPC systems will be MPI+X, where X is one of: OpenMP (if accelerators don’t), OpenACC, OpenMP 4.0+ with device constructs, OpenCL, or CUDA. John is probably right, but I want to add a caveat. We don’t know what HPC systems will look like in a decade. We don’t even know what the HPC systems of the near future will look like. The familiar (I hesitate to say “simple”) MPI+OpenMP programming model that has served us so well is unlikely to apply to the newest designs five or ten years from now, because the network of homogeneous nodes of homogeneous cores with uniform shared memory is going to change. HPC is at another inflection point, where the drivers are both economic and technological.
Technology is driving designs towards slower cores with more parallelism to make up the performance slack. Parallelism will come in three forms: classical multicore or multiprocessor parallel processes, threads, or tasks; classical vector operations, sometimes recast as SIMD operations; and multithreading used to overlap long latency operations, mostly to tolerate long memory latencies. If your system has enough native bandwidth, multithreading can allow an appropriately organized application to achieve close to peak performance as long as there is another thread ready to execute when one thread stalls on a cache miss. Bandwidth increase is easier to achieve than latency reduction, though it’s still not free.
The economics of HPC leads us to use the best, lowest-cost commodity parts whenever possible. The economic drivers in computing will be mobile and embedded, such as smart phones and tablets. High end workstations will still be important, but vendors are learning that growth is in mobile, and we can expect the biggest investments for new computing parts there. Luckily for us, those parts need increasing performance and continue to require lower power and low cost, so even if painful, a shift from workstation parts to mobile parts may be beneficial.
But the important point is that we don’t know exactly what these systems will look like, and there’s likely to be quite a bit of variability. Some will look like today’s systems, though with higher core counts on each processor chip. Some will have low power chips designed for or inspired by the embedded space; think Blue Gene L/P/Q, Intel Atom/Quark, or ARM. The ratios of parallelism on a node vs. across nodes will differ.
Many will have accelerators of some sort: GPUs (NVIDIA, AMD), manycore (Intel Xeon Phi), or other (TI DSP). The amount and style of parallelism for each kind of accelerator differs.
Memory on today’s accelerators is physically and logically separate from the host memory; that will not always remain so. There is movement to share the same address space between accelerator and host, which allows a kind of shared memory. However, there will still be a performance penalty for accessing the more distant memory, either using remote access or software managed distributed shared memory, like the classical Treadmarks system. In particular, today’s accelerators are designed with a high bandwidth memory; if the host memory doesn’t provide the same bandwidth, the performance will suffer. The same can be true if the host and accelerator share physical memory. Today’s laptops have an integrated graphics unit, often on the processor chip itself; the IGU uses the system memory for the graphics buffer. This is sufficient for most laptop uses (email, presentations), but for interactive games and other challenging graphical applications, you want a discrete GPU with its own memory. Future memory subsystems, such as the Hybrid Memory Cube, may change the balance allowing high bandwidth shared memory, if we’re willing to pay the cost.
So the challenge proposed by John Barr isn’t just what language we should use to write programs, but how will we write programs that will give us satisfactory performance across the variety of systems available today, tomorrow, and over the next decade. Will any of the options he presented suffice? Let’s discuss each of them.
MPI + OpenMP 3.1
OpenMP 3.1 was a minor update to OpenMP 3.0, released in 2011 after three years of development. For nodes with small numbers of homogeneous cores and uniform memory access, OpenMP serves quite well. It’s hard to argue that it needs to be replaced, unless you are using a higher level language than C, C++ or Fortran. However, even those languages are starting to adopt native parallel constructs. Fortran has long had array assignments and a forall construct, expressing vector-style or SIMD parallelism, and now has a do concurrent construct, subsuming much of the behavior of the OpenMP parallel do There is ongoing discussion in the C and C++ language committees about adding OpenMP-style threading or Cilk-style tasking to those languages. Language standards committees move slowly, thankfully, but depending on how these discussions play out, the need for OpenMP directives may diminish in the future.
Advantages: OpenMP 3.1 has the advantage of being relatively high level, and is quite mature. It is supported across essentially all multicore and multiprocessor systems, giving good functional and performance portability, though not necessarily scalability.
Disadvantages: Anyone just starting to port an existing program to a multicore will either really like the simplicity of OpenMP parallel loops, or will really dislike the necessity of annotating each and every parallel loop in the program. The OpenMP 3.1 model of a system is somewhat dated; systems being built now and in the future are no longer so homogeneous. OpenMP is really dependent on long-lived parallel threads, which pervade the execution model and are exposed in the programming model. While OpenMP now has tasks, the tasking model was grafted (quite elegantly, in my opinion) onto the existing OpenMP threading model.
Portability: As mentioned, OpenMP is supported on essentially all multiprocessor and multicore systems used in HPC today, and provides pretty good performance portability across the homogeneous shared memory (low core-count) multicore environment for which it was designed. However, OpenMP 3.1 has no support for heterogeneous or accelerated systems. There is an impressive research project at Purdue University to implement OpenMP on NVIDIA GPUs with some significant compiler and runtime effort. They have produced a prototype compiler and have some impressive results on benchmark programs. However, it’s still in the research stage, and it suffers from trying to hide or virtualize some of the most performance sensitive issues of accelerator programming.
Future: OpenMP 4.0 adds many new features, including some support for heterogeneous systems and more control to take advantage of data and compute locality. The language is becoming more complex, to adapt to and allow control over a more complex world of parallel systems. Eventually, meaning a decade or more from now, the base languages may support enough native parallelism to obviate the need for OpenMP directives.
MPI + CUDA
CUDA is a lower-level programming model for GPU-style accelerators. It exposes many of the architectural features of a GPU, such as the 3D grid and 3D thread block, the software-managed cache (CUDA shared memory), the SIMD-style execution of adjacent threads in the same thread block, the separate data memory on the GPU, and so on. Used the right way, this is a strength of the model and language, since it gives a dedicated developer access to all the power of the GPU. However, it is a challenging programming model and does require significant investment when porting to the GPU. In addition, when moving from one generation of GPU to the next, highly tuned CUDA kernels will often have to be retuned for the newer hardware, increasing code maintenance cost.
Moreover, CUDA today is a single-vendor (NVIDIA) solution. PGI has developed a CUDA-X86 solution, but its performance is not as good as the corresponding native code. One can envision a CUDA implementation for other GPUs or other devices, but I’ve seen no hints of such a development.
Advantages: The CUDA toolkit is relatively mature, over five years old now. It gives control to the programmer over all the relevant features of the GPU. It comes with debugging and performance analysis tools, and is well-supported by NVIDIA. The language and programming model evolve as the capabilities of the GPU evolve. CUDA Fortran provides the same programming model for Fortran programmers, with the advantages of using the higher level language features available in Fortran.
Disadvantages: It’s a lower-level programming model. Converting an existing application in C, C++ or Fortran to CUDA is a significant rewrite, although such rewrites allow and encourage one to replace inappropriate algorithms and data structures with others more suited to the parallel world.
Portability: CUDA is currently limited to NVIDIA GPUs. There have been efforts to port CUDA programs to parallel execution on CPUs; the CUDA-X86 product uses some of the same techniques. These efforts have some successes, but highly parallel CUDA programs don’t always map all that well to the limited multicore and SIMD parallelism available on CPUs. Moreover, a kernel tuned for one generation of NVIDIA GPU (experimenting with thread block size and shape, unrolling loops) will often have to be retuned for the next generation.
Future: NVIDIA will continue to support and develop CUDA. CUDA will likely play an important role in GPU programming, especially where the investment pays off handsomely, such as highly tuned libraries or very performance sensitive application kernels.
MPI + OpenCL
OpenCL is essentially the same programming model for GPU-style accelerators as CUDA. It exposes the same features of the GPU, but is not NVIDIA-specific. It is also supported on many targets, including DSPs and even x86 and ARM processors. Like CUDA, it gives complete control to the programmer. Unlike CUDA, there is no OpenCL compiler for the host code; the platform API is completely library-based, making it look very verbose, especially for small programs. The difference becomes clear when looking at how a kernel is launched in CUDA and OpenCL. In CUDA C, a kernel launch looks like a procedure call to the kernel, with additional syntax to specify the launch configuration (grid and block sizes):
kernel<<< gridsize, blocksize >>>( arg0, arg1, ... );
The CUDA compiler translates this into the series of runtime calls to manage the arguments and the actual launch. Since OpenCL doesn’t have such a compiler, a kernel launch becomes a series of explicit runtime calls: one for each argument, and one for the launch:
clSetKernelArg( kernel_handle, 0, sizeof(arg0), &arg1 ); clSetKernelArg( kernel_handle, 1, sizeof(arg1), &arg1 ); ... clEnqueueNDRangeKernel( queue_handle, kernel_handle, 1, NULL, &globalsize, &localsize, 0, NULL, NULL );
Other than that, OpenCL programming is potentially just as powerful as CUDA programming.
Given the wider range of target devices implementing OpenCL, there has been quite a discussion about OpenCL and performance portability for the past several years. Use your favorite search engine for the phrase OpenCL performance portability for a sample. A recent article on HPCwire advocates quite well in favor of OpenCL for heterogeneous computing, in part because it accommodates all the available architectures, separately or together. I have to agree with almost everything the article says. As for performance portability, this article states: True, we may need to write a new version of our kernel to get the best performance on Architecture A, but isn’t this what we actually want? The most telling word in that sentence is kernel, singular. In the embedded world, one developer is quite likely to be focused on a single kernel, such as some video processing, or audio encoding, or the like. In that scenario, with only one kernel, then you can afford to tune and retune for each architecture. However, if you have tens of thousands or millions of lines of code to support over the multi-decade lifetime of your application, this is not a productive path.
Advantages: The kernels are written in the same explicit programming model as CUDA, and OpenCL is supported across a wide range of devices.
Disadvantages: It doesn’t have a standard set of tools across all devices, so debugging and profiling support is sometimes spotty. The language is evolving, but slowly, as any committee-designed language will.
Portability: Since the programming model is lower-level, your kernels have to explicitly take advantage of the features of the target architecture. That means your kernels have to be retuned or rewritten for different targets.
Future: The OpenCL model is more aimed at the embedded or mobile market, where the application writer doesn’t necessarily have details about the target device. Features like dynamic, just-in-time compiling are important in that arena. In HPC, having to recompile every kernel on every node of your cluster every time you run your program quickly sounds like overhead you’d rather avoid. However, I predict that like CUDA, OpenCL can play an important role in programming for GPUs and other accelerators, especially where the investment will pay off handsomely.
MPI + OpenACC
OpenACC is a directive-based programming model targeting a CPU+accelerator system, intentionally similar in many ways to OpenMP. It is designed to do for accelerator programming systems what OpenMP does for multicore systems. It hides or virtualizes those features of the system that can be managed automatically by the system without performance penalty, and exposes those features that must be managed by the programmer. For instance, the programming and execution models expose the presence of separate memories on a host+GPU system, and requires the user to manage data movement between the host and device memories for device-resident data. However, OpenACC virtualizes the variable names, using the same name for the CPU and device copies of the data, resolving them depending on where the name is used, and will even work with no overhead when the device shares memory with the host. The original intent was to eventually merge OpenACC into OpenMP, but the two groups decided on different approaches. In full disclosure, this author is a key member of the OpenACC language committee, and formerly participated in the OpenMP language committee.
OpenACC was designed to focus on parallel algorithms that perform well on the accelerators of interest. Accelerators today need a high degree of multicore parallelism as well as vector or SIMD parallelism. They don’t work very efficiently on scalar code, or on scalar tasks; scalar code will always be more efficient on a high-speed host CPU. This is the primary motivation for a programming model that supports closely coupled CPU+Accelerator systems, which will likely be most efficient on maximizing performance across complete applications.
Advantages: OpenACC has demonstrated support for multiple devices, and there is some initial evidence for performance portability across device types.
Disadvantages: OpenACC is still relatively young, and there are as yet no available open-source implementations of the full language. As yet, there are no implementations that target the host multicore as a device, so it currently must be combined with OpenMP. Some critical features are still under development, and there are some differences in how features are implemented by different compilers, making portability across vendors an issue.
Portability: OpenACC can virtualize many aspects of the architecture, so programs should work well across divergent targets. There is some initial evidence of performance portability across devices, but no convincing proof as yet.
Future: OpenACC version 2.0 was just released, and work proceeds on additional critical and important features. There are quite a few large applications being ported to GPUs with OpenACC, and this experience feeds back to the language design and implementations. Expect a great deal of exciting work on OpenACC in the next few years, including commercial and open-source implementations across all HPC target accelerators, as well as host CPUs.
MPI + OpenMP 4.0 Target Directives
One may ask why OpenACC looks so different from OpenMP? Why not just implement OpenMP on GPUs? This is a good question and an important one. I tried to answer this with an article several years ago. GPUs, in particular, have some limitations that we should avoid and some capabilities that we should exploit when programming for the best parallel performance.
The OpenMP ARB recently released version 4.0 of the specification this past summer, a major revision just two years after 3.1. Among many new features are the device constructs. Some of these constructs mirror those in OpenACC, such as data management with the OpenMP target data and target update constructs. The OpenMP committee chose to add a new level of parallelism to the classical OpenMP threads: teams. In OpenMP 3.1, a parallel region was executed by a team of threads, and parallel loop iterations or tasks were work-shared across the threads in a team. With the OpenMP 4.0 device constructs, the user can now create a league of teams of threads. The addition of teams was to accommodate devices like GPUs which do not support an efficient global barrier synchronization across all the threads in the system. This addition has two unfortunate side effects. First, porting an existing OpenMP program to a device like a GPU is not as easy as simply adding a target directive around the parallel loops. As with OpenACC, the programmer will have to manage the data traffic to the device, but now the programmer has to add new types of parallelism as well. Second, some devices, such as the manycore Intel Xeon Phi Coprocessor, don’t need and probably don’t want this extra level of parallelism; OpenMP can be implemented natively on that target. This means that tuning for one target may differ significantly from tuning for another target, making performance portability a real challenge. I fear that the OpenMP committee has made some unfortunate decisions that will be hard to fix.
Intel has been a major supporter of the OpenMP 4.0 device constructs, and presumably will have a conformant implementation for the Xeon Phi Coprocessor. Texas Instruments has also announced plans to support OpenMP 4.0 on some ARM+DSP heterogeneous SoCs. The new Convey MX computer has support for hybrid OpenMP that matches OpenMP 4.0 features. Cray has announced plans to deliver OpenMP 4.0 for their systems as well, including Xeon Phi and Kepler accelerators, though they don’t promise performance portability. The other OpenMP vendor members have yet to announce support for the device constructs.
Advantages: For those targets that can implement full OpenMP, the OpenMP 4.0 device constructs look like a simpler method to port an existing OpenMP program. There is still the requirement to manage data transfers, but that’s the case on any of these programming models.
Disadvantages: For targets that aren’t designed to implement OpenMP, it’s not clear whether good performance can be delivered. No prototype implementations of the teams construct were available. Performance portability is going to be a challenge, though, as with OpenCL, language portability is still a big step in the right direction. Parts of the current 4.0 device constructs are not well specified and need more careful definition.
Portability: It’s too early to say for certain, but I fear the OpenMP device constructs are too prescriptive to support performance portability across the wide variety of accelerator architectures being used now and designed for the future. If a programmer has to write programs differently for each target, then the language loses many of its advantages.
Future: OpenMP 4.0 is still new and has many new features beyond the device constructs. It will be some time before we have several mature complete implementations. The OpenMP group seems committed to more rapid exploration and adoption of new features across the spectrum of parallel programming. The device constructs will be expanded or contracted as necessary to address the targets of interest. Only if the OpenMP device constructs result in good performance on the host as well as DSPs, manycores, and GPUs, will it become the language of choice.
John Barr left off two other heterogeneous programming languages that have been bandied about, important if only because of the large company promoting each. Google Renderscript uses a kernels computation model and managed memory allocations (like OpenCL kernels and buffers). Unlike OpenCL, Renderscript is optimized for kernels that access graphics-like data structures: rectangular arrays of structs. It’s pretty hard to get the kind of random indexed addressing that is available in OpenCL or CUDA. The runtime can launch kernels on any appropriate available device, such as a CPU or GPU, moving data as required to make it accessible on that device. Microsoft C++AMP (Accelerated Massive Parallelism) is a pretty sophisticated C++-specific solution, aiming at the sophisticated C++ programmer (perhaps all C++ programmers are sophisticated), using C++ templates and lambda functions for data and compute. It introduces a very rich parallel_for_each function that takes a computation domain and kernel function as arguments, and runs that function across the parallel domain. Like Renderscript and OpenCL, the runtime manages the memory movement; it can also decide whether to run a kernel on the host or the GPU. Currently it is available only from Microsoft, though there is at least one experimental implementation from outside Redmond.
As John Barr argues, the tools and programming methodologies for upcoming HPC systems will change as the architectures evolve. Resolving to a high level portable heterogeneous programming strategy will attract more users early in this evolution. Before MPI, there were several message passing libraries, some of which were vendor-specific. Settling on MPI allowed applications to port across cluster architectures without rewrites, and allowed vendors to focus on innovation at the architecture and implementation level. It also limited innovation on the message passing libraries. Before OpenMP, there were several sets of high level directives and low-level threading libraries for programming multiprocessor workstations and cluster nodes. Settling on OpenMP similarly allowed applications to port across multiprocessor and eventually multicore architectures without rewrites, allowing vendors to focus on innovation at the architecture and implementation level. It also limited innovation at the programming language level.
For heterogeneous systems, we want to settle on a high level programming strategy. It should be as target agnostic as possible. In particular, it must not require a user to write a different program for different targets. We’re going to always require low-level programming as well, and CUDA and OpenCL will fit this bill nicely for appropriate devices. At the high level, the obvious possibilities are OpenACC or OpenMP; the eventual choice will be made by the users. If the OpenMP device constructs mature to better support performance portable programming across the wide variety of devices, it may become a viable option. If OpenACC demonstrates portability and more generality, it should be the language of choice. There’s still hope that the two specifications will converge. Users will eventually decide based on how well each language supports the variety of targets in terms of ease of use and features, but most importantly, performance and performance portability.