Heterogeneous Compilers Ready for Takeoff
The second wave of GPGPU software development tools is upon us. The first wave, exemplified by NVIDIA’s CUDA and AMD’s Brook+, allowed early adopters to get started with GPU computing via low-level, vendor-specific tools. Next-generation tools from The Portland Group Inc. (PGI) and France-based CAPS Enterprise enable everyday C and Fortran programmers to tap into GPU acceleration within an integrated heterogeneous computing environment.
Over the past five years, the HPC community has coalesced around the x86 architecture. That made the choice of targets easy for companies like PGI. Today 85 percent of the TOP500 is based on 64-bit x86 microprocessors, and the percentage is probably even higher in the sub-500 realm. While Intel and AMD continue to innovate with multicore architectures, they remain constrained to clock frequencies of around 2.5-3.5 GHz.
Meanwhile, GPUs have become general-purpose vector processors with hundreds of simple cores and are scaling at a faster rate than CPUs. The fact that compiler vendors like PGI are now targeting GPUs says a lot about where the industry is headed with general-purpose acceleration, especially in the HPC space.
“As a compiler vendor, we asked ourselves: ‘What comes next?’” said Doug Miles, director of Advanced Compilers and Tools at PGI. “Our best guess is that accelerated computing is what comes next.”
PGI is betting that 64-bit x86 with “some type of accelerator” will be the new platform of choice for many HPC applications. Right now, the GPU is the accelerator du jour of supercomputing. The first accelerator target for PGI is CUDA-enabled NVIDIA GPUs. To implement it, PGI will leverage the CUDA toolchain and associated SDK, while the host side compilation will rely on PGI’s x86 technology.
Since GPUs are attached to the host platform as external devices rather than as true coprocessors, the low-level software model is quite complex. From the host side, it involves data transfers between the CPU and the GPU (over PCIe), memory allocation/deallocation, and other low-level device management. On the GPU side, the code can also be fairly involved, since it has to deal with algorithm parallelization and the GPU’s own memory hierarchy.
To make GPU programming more productive, it’s worthwhile to hide most of these details from the application developer. What PGI has done is define a set of C pragmas and Fortran directives that can be embedded in the source code to direct the compiler to offload the specified code sequences to the GPU.
This approach is analogous to OpenMP, which defines pragmas and directives to apply multithreading on top of a sequential program. Unlike a library-based approach, this model enables developers to maintain a common source base for a variety of different targets. In the PGI case, non-accelerator-aware compilers can use the same source, but will just ignore the foreign pragmas or directives. Even within the PGI environment, the accelerator pragmas and directives can be switched off at compile time so that only x86 code is generated.
The general form of the C accelerator pragma is #pragma acc directive-name [clause [,clause]…] ; the equivalent for Fortran is !$acc directive-name [clause [,clause]…]. Applying an accelerator region to a matrix multiplication loop in Fortran would look like this:
subroutine mm1( a, b, c, m, n, p )
   real, dimension(:,:) :: a, b, c
   integer :: i, j, k, m, n, p
   !$acc region
   do j = 1,m
      do i = 1,n
         a(i,j) = 0.0
      enddo
      do k = 1,p
         do i = 1,n
            a(i,j) = a(i,j) + b(i,k) * c(k,j)
         enddo
      enddo
   enddo
   !$acc end region
end subroutine
The loop enclosed by the accelerator directive will be parallelized and offloaded to the GPU, assuming one is present. For the entire program, the compiler will generate both CPU and GPU code, which are subsequently linked together in a single executable file. Since the generated code will provide all the necessary data transfers, memory management and device bookkeeping, the programmer does not need any special knowledge of the accelerator architecture.
In fact, this is only partially true. The reality is that parallel programming on any target is likely to require some code restructuring for optimum performance. “Parallel programming is not easy,” admits PGI compiler engineer Michael Wolfe. According to him, the first step in CPU-GPU compiler technology is to make the problem more approachable. (A deeper discussion of Wolfe’s thoughts on PGI’s GPU programming model is available here.)
PGI currently has a beta version of the x64+NVIDIA compiler under development, which will be made available for technical preview in January. They hope to have a production version of the product in mid-2009. The company is also working with AMD on a version for the FireStream accelerators and will make use of the associated SDK for that target.
There may come a time when the pragmas and directives can be done away with entirely, and the compiler alone can determine when to generate GPU code. But right now, the difference between high and low performance on these accelerators is so great that it’s better to let the programmer point the compiler at the code that is most compute-intensive, and thus worth the overhead of transferring the data from host to GPU. It’s possible that as compiler technology matures and GPUs are more tightly integrated with CPUs, both performance optimization and auto-acceleration can be wrapped up into the compiler.
France-based CAPS Enterprise is hoping compilers never get that smart. Like PGI, CAPS is offering a heterogeneous software development environment, initially targeting x86 platforms with GPU accelerators. In this case, though, x86 code generation will be accomplished via third-party tools such as Intel’s C and Fortran compilers and GNU’s GCC. For the CAPS offering to make sense, accelerator compilation and CPU compilation have to remain separate.
The CAPS offering, called HMPP, preprocesses its own C pragmas and Fortran directives to generate native accelerator source code, either NVIDIA’s CUDA or AMD’s CAL. The accelerator source code is packaged into a “codelet,” which can subsequently be modified by the developer for further tuning. Then the codelet is passed to the GPU vendor toolchain to create the object binary, which gets loaded onto the GPU. The CPU code left behind is compiled with the third-party x86 compiler tools and the HMPP runtime glues the CPU and GPU code together.
The general syntax of the HMPP pragmas and directives is almost identical to the PGI versions: #pragma hmpp for C and !$hmpp for Fortran.
The CAPS pragmas/directives were designed for low-level control of the GPU in order to enable the developer to optimize operations such as data transfers, synchronous/asynchronous execution, data preloading and device management. PGI has defined some lower level directives as well. (See the PGI accelerator whitepaper at http://www.pgroup.com/lit/pgi_whitepaper_accpre.pdf for more detail.)
CAPS currently supports all CUDA-enabled NVIDIA GPUs and AMD FireStream hardware and is planning to release a Cell-friendly version in the first quarter of 2009. The company is also working on support for Java, due to be released in the same timeframe. CAPS already has a number of customers using the tool for GPU-equipped clusters. Total, a French energy multinational, is using HMPP to speed reverse time migration (RTM) calculations on a Xeon-based machine hooked up to a 4-GPU Tesla S1070. Initial results demonstrated a 3.3-fold speedup per GPU compared to 8 Xeon cores.