HPCwire

PathScale Looks to One-Up CUDA, OpenCL with New GPU Compiler


HPC compiler maker PathScale has unveiled ENZO, a new GPU software development suite aimed at the high performance computing space. The solution includes a home-grown compiler, runtime system, and device driver. ENZO is being built for performance from top to bottom and will initially target NVIDIA's high-end GPUs.

Until now, users looking to exploit graphics processor acceleration for technical computing had to rely on either NVIDIA's CUDA software stack or OpenCL implementations (from AMD or NVIDIA). Although a number of high-level language implementations have been built on top of these lower-level interfaces, PathScale will be the first vendor to offer a complete third-party development stack for GPU computing developers.

PathScale, you'll remember, was resurrected following the August 2009 dissolution of SiCortex, which had purchased the compiler technology from QLogic two years earlier. Thanks to the support of Cray and some creative financing, the PathScale team was reassembled after SiCortex went belly up. PathScale's main products today include C/C++ and Fortran compilers for AMD and Intel x86 CPUs.

According to PathScale CTO Christopher Bergström, interest in doing a GPU compiler began shortly after the company rebooted last year. Since NVIDIA was leading the GPGPU charge, they started with the idea of targeting the Tesla GPU line. Hoping to reuse some of NVIDIA's CUDA stack, they quickly found that the code generator and driver were not optimized for high performance computing. "Their drivers, which really dictate quite a bit of what you can do, are supporting everything from gaming to HPC," says Bergström. "It's not that they haven't built a good solution. It's just not focused enough for HPC."

Moreover, they found writing fast CUDA code tedious, requiring a lot of programmer hand-holding to get good results. In particular, the PathScale engineers found that the register usage pattern in the CUDA compiler was generalized for all types of GPU cards, so performance opportunities for Tesla were simply missed.

In any case, says Bergström, "we didn't have permission to use CUDA, and we thought OpenCL sucked." So PathScale set out to write their own compiler/runtime/driver stack. Unfortunately, NVIDIA's GPU ISA is one of the company's closely guarded secrets, and most programmers only get access to the hardware through software interface abstractions like CUDA, OpenCL, OpenGL, PTX, or DirectX. NVIDIA is happy to support implementations for all of these, but that eliminates the option of third-party compiler developers controlling the lowest level code generation.

So instead they tapped an open source NVIDIA graphics driver -- Nouveau, which is included in the Linux kernel -- and forked the source code with high performance computing in mind. PathScale also managed to recruit most of the talent from the driver project. Bergström says the team was able to reverse engineer the NVIDIA ISA, register details, and device exception handling. With that knowledge, they set out to rewrite the code generator (compiler back-end), driver, and runtime, focusing on improved memory management, error handling, security and HPC-specific features, and performance.

The twist here is that the GPU ISA is volatile -- at least more so than, say, a CPU ISA. Fortunately, the instruction and register enhancements tend to be incremental. Bergström says they will support all the latest GPU cards being used for HPC, that is, essentially all the cards supported in the three generations of Tesla products. PathScale has a working pre-"Fermi" driver now and is working on the compiler port. "We just got access to the hardware last month," explains Bergström. "So we've basically had 30 days to start tackling the ISA and the registers." He predicts they'll have a fairly robust Fermi port within the next 60 to 90 days.

For the GPU compiler front-end, PathScale decided to use a directives-based approach, in which programmers can instrument source code to tell the compiler to parallelize specific code regions for the GPU. The directives approach offers vendor and device independence, while allowing developers to make incremental changes to their source code as they identify more regions for GPU acceleration. OpenMP uses the same directives model for shared-memory parallelization.
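As a rough illustration of the directives model, here is a minimal sketch in C using OpenMP, the shared-memory case mentioned above. It is not HMPP or ENZO code; the function name saxpy and its parameters are made up for this example. A single directive marks the loop for parallelization, and the rest of the source is ordinary C.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative only: computes y[i] = a * x[i] + y[i] for i in [0, n). */
void saxpy(int n, float a, const float *x, float *y)
{
    assert(n <= 0 || (x != NULL && y != NULL));

    /* The directive asks an OpenMP-aware compiler to split the loop
       iterations across threads. A compiler without OpenMP enabled
       ignores the pragma and runs the loop serially. */
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

With GCC, building with -fopenmp enables the directive; without that flag the same source compiles and produces the same result serially. That is the incremental quality the directives approach is meant to offer: the annotation can be added or dropped without restructuring the code.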

PathScale opted for HMPP directives, a set of directives invented by CAPS Entreprise for their C and Fortran GPU compilers. In the CAPS products, though, the compiler just converts the HMPP C or HMPP Fortran to CUDA, which is subsequently converted into GPU assembly by NVIDIA's CUDA back-end. PathScale, on the other hand, has attached its own back-end directly to the HMPP front-end, so no information is lost in a source-to-source translation step.

The other part of the story is that CAPS, along with PathScale (and some as-yet-unannounced players), has decided to make the HMPP directives an open standard. The idea here is to attract application developers and tool makers to a standardized GPU programming model that protects their investment while still aiming for the best performance.

Bergström is careful not to claim performance superiority over the CUDA technology just yet. He says ENZO is currently in the alpha or early beta stage. According to him, PathScale engineers have hand-tuned some code using GPU assembly, and have achieved a 15 to 30 percent (or better) performance boost. In other cases, they're not quite there and need to find the right optimizations. Bergström is confident that those hand-coded optimizations can be incorporated into the compiler infrastructure. They have identified a number of areas where they can reduce register pressure, hide latency, reduce stalls and improve instruction scheduling. "We know the performance is there," says Bergström.

The alpha/early beta version is now available for selected customers, with the production compiler suite slated for release later this summer. According to Bergström, over the next year, PathScale will be investing heavily in improving the GPGPU programming model. "People shouldn't have to worry about thread synchronization or register memory bank conflicts," he says. "The compiler will just handle that. Ultimately we want to have a fully automatic solution."
