NVIDIA Eyes Post-CUDA Era of GPU Computing
Lost in the flotilla of vendor news at the Supercomputing Conference (SC11) in Seattle last month was the announcement of a new directives-based parallel programming standard for accelerators. Called OpenACC, the open standard is intended to bring GPU computing into the realm of the average programmer, while making the resulting code portable across other accelerators and even multicore CPUs.
For obvious reasons, OpenACC is being heavily promoted and supported by NVIDIA, but it is The Portland Group (PGI) and Cray who are driving the early effort to commercialize the technology. PGI has already implemented a very similar set of accelerator directives, which became part of the foundation for the OpenACC standard. Cray is developing its own OpenACC compiler, and its XK6 customers, like Oak Ridge National Lab and the Swiss National Supercomputing Centre, are expected to be among the first supercomputer users of the technology.
In a nutshell, OpenACC directives work much the same as OpenMP directives, but are specifically applicable to highly data parallel codes. They can be inserted into standard C, C++ and Fortran programs to direct the compiler to parallelize certain code sections. The compiler takes care of the logistics of moving data back and forth between the CPU and the GPU (or whatever) and mapping the computation onto the appropriate processor.
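To make that concrete, here is a minimal sketch of what a directive-annotated loop might look like in C. It is an illustration, not code from any vendor: the `saxpy` function and its parameters are hypothetical names chosen for the example, and the pragma follows the syntax of the published OpenACC specification (`parallel loop` with `copyin` and `copy` data clauses). The data clauses tell the compiler which arrays to ship to the accelerator and which to bring back.

```c
#include <assert.h>

/* Hypothetical example: computes y = a*x + y over n elements.
 * With an OpenACC-aware compiler, the pragma asks for the loop to be
 * offloaded; a compiler without OpenACC support ignores the pragma
 * and the loop runs serially on the CPU with the same result. */
void saxpy(int n, float a, const float *x, float *y)
{
    assert(n >= 0);

    /* copyin: x is only read, so it is copied to the device.
     * copy:   y is read and written, so it is copied in and back out. */
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Note that the loop body itself is untouched, standard C. Remove the single pragma line and the function is ordinary serial code, which is the portability argument in miniature.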
The idea is to enable developers to make relatively small modifications to existing (or new) code in order to expose parallel regions for acceleration. Since the directives are designed to apply to a generic parallel processor, the same code can run on a multicore CPU, GPU, or any other type of parallel hardware that is supported by the compiler. This hardware independence is especially important to the HPC community, which is loath to adopt vendor-specific, non-portable programming environments.
From NVIDIA’s perspective, the overriding goal is to bring GPU computing into the post-CUDA age. CUDA C and Fortran are the most widely used programming languages for GPU programming today, but the underlying technology is proprietary to NVIDIA and offers a relatively low-level software model of GPU computing. As a result, the use of CUDA today tends to be restricted to computer science types, rather than the average programmer or researcher.
OpenCL, which is supported by NVIDIA, AMD and many others, also provides a parallel programming framework for GPUs and other accelerators, and unlike CUDA, is a bona fide open standard (under the direction of the Khronos Group — the same organization that brought us OpenGL). But like CUDA, OpenCL is relatively low-level, requiring a fairly intimate knowledge of the inner workings of the target processor. Therefore, like CUDA, use of OpenCL is mostly confined to computer scientists.
NVIDIA estimates there are over 100,000 CUDA programmers on the planet and a substantially smaller number of OpenCL developers, but they see a much larger potential audience if they can make GPU programming more open and developer-friendly. Essentially they believe OpenACC will be able to make GPU technology accessible to the millions of scientists and researchers who don’t care to dabble in the low-level intricacies of processor architectures and chip-to-chip communications.
Steve Scott, CTO of NVIDIA’s Tesla business unit, sums up the goal of OpenACC thusly: “What we’d like to do at this point is to substantially increase the breadth of applicability and the number of people using GPUs.”
According to Scott, the high-level nature of OpenACC is not going to impact execution performance significantly. While in his previous CTO role at Cray, he encountered accelerator directives-based codes that were getting within 5 or 10 percent of the performance of hand-coded CUDA. According to him, that was fairly typical. Some applications, Scott says, were even doing better than their CUDA counterparts, thanks to the ability of the compiler to optimize certain codes beyond what mere mortals could achieve. In any case, OpenACC is designed to be interoperable with CUDA, so hand-tuned kernels can work seamlessly with directives-based code if need be.
Besides PGI and Cray, CAPS enterprise, a French developer of multicore software tools, has also signed up to support the new directives. All three vendors are expected to have compilers with OpenACC support ready in the first half of 2012. Notably missing from the list of OpenACC supporters are Intel and AMD, although both have processors (multicore x86, AMD APUs and GPUs, and the Intel MIC) that would certainly be capable targets. That wouldn’t necessarily stop PGI, CAPS, or Cray from building OpenACC-enabled compilers for Intel and AMD hardware, however.
PGI and NVIDIA are in the process of running a free 30-day trial for developers interested in kicking the tires on PGI’s current accelerator directive compiler. The claim is that the technology will at least double application performance with less than 4 weeks of developer effort. Hundreds of researchers have already registered for the trial and this week NVIDIA has reported some initial results. At least one developer was able to get a 5X performance boost on his application after just a single day of tweaking the code.
But the real end game for OpenACC supporters is for the directives to be incorporated into the OpenMP standard. Since OpenACC was derived from work done within the OpenMP Working Group on Accelerators, it stands to reason that this will indeed happen. Although there is no timeline for when the technology will be folded into OpenMP, it’s most likely to occur in conjunction with the release of OpenMP 4.0, which is expected to be launched sometime in 2012.