PathScale Looks to One-Up CUDA, OpenCL with New GPU Compiler
HPC compiler maker PathScale has unveiled ENZO, a new GPU software development suite aimed at the high performance computing space. The solution includes a home-grown compiler, runtime system, and device driver. ENZO is being built for performance from top to bottom and will initially target NVIDIA’s high-end GPUs.
Up until now, users looking to exploit graphics processor acceleration for technical computing had to rely on either NVIDIA’s CUDA software stack or OpenCL implementations (from AMD or NVIDIA). Although a number of high-level language implementations have been built on top of these lower level interfaces, PathScale will be the first vendor to offer a complete third-party development stack for GPU computing developers.
PathScale, you’ll remember, was resurrected following the August 2009 dissolution of SiCortex, which had purchased the compiler technology from QLogic two years earlier. Thanks to the support of Cray and some creative financing, the PathScale team was reassembled after SiCortex went belly up. PathScale’s main products today include C/C++ and Fortran compilers for AMD and Intel x86 CPUs.
According to PathScale CTO Christopher Bergström, interest in doing a GPU compiler began shortly after the company rebooted last year. Since NVIDIA was leading the GPGPU charge, they started with the idea of targeting the Tesla GPU line. Hoping to reuse some of NVIDIA’s CUDA stack, they quickly found that the code generator and driver were not optimized for performance computing. “Their drivers, which really dictate quite a bit of what you can do, are supporting everything from gaming to HPC,” says Bergström. “It’s not that they haven’t built a good solution. It’s just not focused enough for HPC.”
Moreover, they found that writing high-performance CUDA code was tedious and required a lot of hand-tuning by the programmer. In particular, the PathScale engineers found that the register usage pattern in the CUDA compiler was generalized for all types of GPU cards, so performance opportunities for Tesla were simply missed.
In any case, says Bergström, “we didn’t have permission to use CUDA, and we thought OpenCL sucked.” So PathScale set out to write their own compiler/runtime/driver stack. Unfortunately, NVIDIA’s GPU ISA is one of the company’s closely guarded secrets, and most programmers only get access to the hardware through software interface abstractions like CUDA, OpenCL, OpenGL, PTX, or DirectX. NVIDIA is happy to support implementations for all of these, but that eliminates the option of third-party compiler developers controlling the lowest level code generation.
So instead they tapped an open source NVIDIA graphics driver — Nouveau, which is included in the Linux kernel — and created a fork off the source code with high performance computing in mind. PathScale also managed to recruit most of the talent from the driver project. Bergström says the team was able to reverse engineer the NVIDIA ISA, register details, and device exception handling. With that knowledge, they set out to rewrite the code generator (compiler back-end), driver, and runtime, focusing on improved memory management, error handling, security and HPC-specific features, and performance.
The twist here is that a GPU ISA is volatile, at least more so than, say, a CPU ISA. Fortunately, the instruction and register enhancements tend to be incremental. Bergström says they will support all the latest GPU cards being used for HPC, that is, essentially all the cards supported in the three generations of Tesla products. PathScale has a working pre-“Fermi” driver now and is working on the compiler port. “We just got access to the hardware last month,” explains Bergström. “So we’ve basically had 30 days to start tackling the ISA and the registers.” He predicts they’ll have a fairly robust Fermi port within the next 60 to 90 days.
For the GPU compiler front-end, PathScale decided to use a directives-based approach, in which programmers can instrument source code to tell the compiler to parallelize specific code regions for the GPU. The directives approach offers vendor and device independence, while allowing developers to make incremental changes to their source code as they identify more regions for GPU acceleration. OpenMP uses the same directives model for shared-memory parallelization.
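To make the directives model concrete, here is a minimal sketch in C using OpenMP, the shared-memory analogue mentioned above. It is not ENZO or HMPP code, and the function name `saxpy` and its parameters are chosen purely for illustration. The point is only that the loop stays ordinary C and a single pragma marks it as a candidate for parallel execution.

```c
#include <stddef.h>

/* Scale-and-add over two arrays: y[i] = a * x[i] + y[i].
 * Illustrative only: this uses an OpenMP directive, not an HMPP one.
 * The pragma is a hint layered on top of plain C. A compiler built or
 * invoked without OpenMP support ignores it and emits the serial loop,
 * so the result is the same either way. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    /* Ask the compiler to split the loop iterations across threads. */
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

With GCC or Clang, passing `-fopenmp` enables the directive; leaving the flag off produces the serial version from the same source. That is the incremental property the directives approach relies on: regions can be annotated one at a time without restructuring the program.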
PathScale opted for HMPP directives, a set of directives invented by CAPS Enterprise for their C and Fortran GPU compilers. In the CAPS products, though, the compiler just converts the HMPP C or HMPP Fortran to CUDA, which is subsequently converted into GPU assembly by NVIDIA’s CUDA back-end. PathScale, on the other hand, has attached their own back-end directly to the HMPP front-end, so no information is lost in an intermediate source-to-source translation.
The other part of the story is that CAPS, along with PathScale (and some as yet unannounced players), has decided to make the HMPP directives an open standard. The idea here is to attract application developers and tool makers to a standardized GPU programming model that protects their investment but is still aimed at achieving the best performance.
Bergström is careful not to claim performance superiority over the CUDA technology just yet. He says ENZO is currently in the alpha or early beta stage. According to him, PathScale engineers have hand-tuned some code using GPU assembly, and have achieved a 15 to 30 percent (or better) performance boost. In other cases, they’re not quite there and need to find the right optimizations. Bergström is confident that those hand-coded optimizations can be incorporated into the compiler infrastructure. They have identified a number of areas where they can reduce register pressure, hide latency, reduce stalls and improve instruction scheduling. “We know the performance is there,” says Bergström.
The alpha/early beta version is now available for selected customers, with the production compiler suite slated for release later this summer. According to Bergström, over the next year, PathScale will be investing heavily in improving the GPGPU programming model. “People shouldn’t have to worry about thread synchronization or register memory bank conflicts,” he says. “The compiler will just handle that. Ultimately we want to have a fully automatic solution.”