June 24, 2010
HPC compiler maker PathScale has unveiled ENZO, a new GPU software development suite aimed at the high performance computing space. The solution includes a home-grown compiler, runtime system, and device driver. ENZO is being built for performance from top to bottom and will initially target NVIDIA's high-end GPUs.
Up until now, users looking to exploit graphics processor acceleration for technical computing had to rely on either NVIDIA's CUDA software stack or OpenCL implementations (from AMD or NVIDIA). Although a number of high-level language implementations have been built on top of these lower level interfaces, PathScale will be the first vendor to offer a complete third-party development stack for GPU computing developers.
PathScale, you'll remember, was resurrected following the August 2009 dissolution of SiCortex, which had purchased the compiler technology from QLogic two years earlier. Thanks to the support of Cray and some creative financing, the PathScale team was reassembled after SiCortex went belly up. PathScale's main products today include C/C++ and Fortran compilers for AMD and Intel x86 CPUs.
According to PathScale CTO Christopher Bergström, interest in doing a GPU compiler began shortly after the company rebooted last year. Since NVIDIA was leading the GPGPU charge, they started with the idea of targeting the Tesla GPU line. Hoping to reuse some of NVIDIA's CUDA stack, they quickly found that the code generator and driver were not optimized for performance computing. "Their drivers, which really dictate quite a bit of what you can do, are supporting everything from gaming to HPC," says Bergström. "It's not that they haven't built a good solution. It's just not focused enough for HPC."
Moreover, they found that writing high-performance CUDA code was tedious and required a lot of hand-tuning by the programmer. In particular, the PathScale engineers found that the register usage pattern in the CUDA compiler was generalized for all types of GPU cards, so performance opportunities for Tesla were simply missed.
In any case, says Bergström, "we didn't have permission to use CUDA, and we thought OpenCL sucked." So PathScale set out to write their own compiler/runtime/driver stack. Unfortunately, NVIDIA's GPU ISA is one of the company's closely guarded secrets, and most programmers only get access to the hardware through software interface abstractions like CUDA, OpenCL, OpenGL, PTX, or DirectX. NVIDIA is happy to support implementations for all of these, but that eliminates the option of third-party compiler developers controlling the lowest level code generation.
So instead they tapped an open source NVIDIA graphics driver -- Nouveau, which is included in the Linux kernel -- and created a fork off the source code with high performance computing in mind. PathScale also managed to recruit most of the talent from the driver project. Bergström says the team was able to reverse engineer the NVIDIA ISA, register details, and device exception handling. With that knowledge, they set out to rewrite the code generator (compiler back-end), driver, and runtime, focusing on improved memory management, error handling, security and HPC-specific features, and performance.
The twist here is that a GPU ISA is volatile -- at least more so than, say, a CPU ISA. Fortunately, the instruction and register enhancements tend to be incremental. Bergström says they will support all the latest GPU cards being used for HPC, that is, essentially all the cards supported in the three generations of Tesla products. PathScale has a working pre-"Fermi" driver now and is working on the compiler port. "We just got access to the hardware last month," explains Bergström. "So we've basically had 30 days to start tackling the ISA and the registers." He predicts they'll have a fairly robust Fermi port within the next 60 to 90 days.
For the GPU compiler front-end, PathScale decided to use a directives-based approach, in which programmers can instrument source code to tell the compiler to parallelize specific code regions for the GPU. The directives approach offers vendor and device independence, while allowing developers to make incremental changes to their source code as they identify more regions for GPU acceleration. OpenMP uses the same directives model for shared-memory parallelization.
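To illustrate the directives model in general terms, here is a minimal sketch in C that uses an OpenMP directive, since OpenMP follows the same approach. It is not ENZO or HMPP code; the function name saxpy and its parameters are chosen purely for illustration. The point is that the loop body stays ordinary C, and a single pragma marks the region the compiler may parallelize.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative only: computes y[i] = a * x[i] + y[i] for i in [0, n).
 * The name "saxpy" and this signature are assumptions made for the example. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    long i; /* signed index, accepted by all OpenMP versions */

    assert(x != NULL && y != NULL);

    /* The directive asks an OpenMP-aware compiler to split the loop
     * iterations across threads. A compiler without OpenMP support
     * ignores the pragma and runs the loop serially, with the same result. */
    #pragma omp parallel for
    for (i = 0; i < (long)n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

Because the directive is only an annotation, the same source builds with or without parallelization enabled (for example, with or without an option such as -fopenmp in GCC), which is what makes the incremental porting described above possible.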
PathScale opted for HMPP directives, a set of directives invented by CAPS Enterprise for their C and Fortran GPU compilers. In the CAPS products, though, the compiler just converts the HMPP C or HMPP Fortran to CUDA, which is subsequently converted into GPU assembly by NVIDIA's CUDA back-end. PathScale, on the other hand, has attached their own back-end directly onto the HMPP front-end, so no information is lost in an intermediate source-to-source translation.
The other part of the story is that CAPS, along with PathScale (and some as yet unannounced players), has decided to make the HMPP directives an open standard. The idea here is to attract application developers and tool makers to a standardized GPU programming model that protects their investment but is still targeted at getting the best performance.
Bergström is careful not to claim performance superiority over the CUDA technology just yet. He says ENZO is currently in the alpha or early beta stage. According to him, PathScale engineers have hand-tuned some code using GPU assembly, and have achieved a 15 to 30 percent (or better) performance boost. In other cases, they're not quite there and need to find the right optimizations. Bergström is confident that those hand-coded optimizations can be incorporated into the compiler infrastructure. They have identified a number of areas where they can reduce register pressure, hide latency, reduce stalls and improve instruction scheduling. "We know the performance is there," says Bergström.
The alpha/early beta version is now available for selected customers, with the production compiler suite slated for release later this summer. According to Bergström, over the next year, PathScale will be investing heavily in improving the GPGPU programming model. "People shouldn't have to worry about thread synchronization or register memory bank conflicts," he says. "The compiler will just handle that. Ultimately we want to have a fully automatic solution."