HPCwire

Since 1986 - Covering the Fastest Computers
in the World and the People Who Run Them

Heterogeneous Compilers Ready for Takeoff


The second wave of GPGPU software development tools is upon us. The first wave, exemplified by NVIDIA's CUDA and AMD's Brook+, allowed early adopters to get started with GPU computing via low-level, vendor-specific tools. Next-generation tools from The Portland Group Inc. (PGI) and France-based CAPS Enterprise enable everyday C and Fortran programmers to tap into GPU acceleration within an integrated heterogeneous computing environment.

Over the past five years, the HPC community coalesced around the x86 architecture. That made the choice of targets easy for companies like PGI. Today 85 percent of the TOP500 is based on 64-bit x86 microprocessors, and the percentage is probably even higher in the sub-500 realm. While Intel and AMD are continuing to innovate with multicore architectures, they are constrained to clock frequencies of around 2.5-3.5 GHz.

Meanwhile, GPUs have become general-purpose vector processors with hundreds of simple cores and are scaling at a faster rate than CPUs. The fact that compiler vendors like PGI are now targeting GPUs says a lot about where the industry is headed with general-purpose acceleration, especially in the HPC space.

"As a compiler vendor, we asked ourselves: 'What comes next?'" said Doug Miles, director of Advanced Compilers and Tools at PGI. "Our best guess is that accelerated computing is what comes next."

PGI is betting that 64-bit x86 with "some type of accelerator" will be the new platform of choice for many HPC applications. Right now, the GPU is the accelerator du jour of supercomputing. The first accelerator target for PGI is CUDA-enabled NVIDIA GPUs. To implement it, PGI will leverage the CUDA toolchain and associated SDK, while the host-side compilation will rely on PGI's x86 technology.

Since GPUs are attached to the host platform as external devices rather than as true coprocessors, the low-level software model is quite complex. From the host side, it involves data transfers between the CPU and the GPU (over PCIe), memory allocation/deallocation, and other low-level device management. On the GPU side, the code can also be fairly involved, since it has to deal with algorithm parallelization and the GPU's own memory hierarchy.

To make GPU programming more productive, it's worthwhile to hide most of these details from the application developer. What PGI has done is define a set of C pragmas and Fortran directives that can be embedded in the source code and direct the compiler to offload the specified code sequences to the GPU.

This approach is analogous to OpenMP, which defines pragmas and directives to apply multithreading on top of a sequential program. Unlike a libraries-based approach, this model enables developers to maintain a common source base for a variety of different targets. In the PGI case, non-accelerator aware compilers can use the same source, but will just ignore the foreign pragmas or directives. Even within the PGI environment, the accelerator pragmas and directives can be switched off at compile time so that only x86 code is generated.
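The OpenMP analogy can be made concrete with a minimal sketch. The function name sum_array is hypothetical; the pragma is standard OpenMP. A compiler with OpenMP enabled splits the loop across threads, while a compiler without it ignores the pragma and runs the same source sequentially.

```c
#include <assert.h>
#include <stddef.h>

/* Sum the first n elements of x.
 * With OpenMP enabled the loop iterations are shared among threads;
 * without it the pragma is ignored and the loop runs sequentially.
 * The source code is identical in both cases. */
double sum_array(const double *x, size_t n)
{
    assert(x != NULL || n == 0);

    double total = 0.0;
    long i; /* signed loop index, accepted by every OpenMP version */

    #pragma omp parallel for reduction(+:total)
    for (i = 0; i < (long)n; i++) {
        total += x[i];
    }
    return total;
}
```

The accelerator pragmas described here follow the same pattern: the annotation carries the parallelization hint, and the loop itself stays ordinary C.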

The general form of the C accelerator pragma is #pragma acc directive-name [clause [,clause]…] ; the equivalent for Fortran is !$acc directive-name [clause [,clause]…]. Applying an accelerator region to a matrix multiplication loop in Fortran would look like this:

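 module mymm
 contains
 ! Matrix multiply: a = b * c, computed for the first m columns of a.
 subroutine mm1( a, b, c, m )
    real, dimension(:,:) :: a,b,c
    integer i,j,k,m,n,p
    ! n and p are taken from the array shapes
    ! (assumed: a is n x m, b is n x p, c is p x m).
    n = size(a,1)
    p = size(b,2)
    ! Everything between the region directives is a candidate for offload.
    !$acc region
       do j = 1,m
          do i = 1,n
             a(i,j) = 0.0
          enddo
          do k = 1,p
             do i = 1,n
                a(i,j) = a(i,j) + b(i,k) * c(k,j)
             enddo
          enddo
       enddo
    !$acc end region
 end subroutine
 end module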

The loop enclosed by the accelerator directive will be parallelized and offloaded to the GPU, assuming one is present. For the entire program, the compiler will generate both CPU and GPU code, which are subsequently linked together in a single executable file. Since the generated code will provide all the necessary data transfers, memory management and device bookkeeping, the programmer does not need any special knowledge of the accelerator architecture.
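For C programmers, the same idea might look like the sketch below. It assumes the pragma form given above with region as the directive name, mirroring the Fortran listing; the flat row-major array layout and the extra size arguments are illustrative choices, not taken from PGI's documentation. A compiler that does not recognize the pragma ignores it and runs the loops on the CPU.

```c
#include <assert.h>

/* a (n x m) = b (n x p) * c (p x m), all stored row-major in flat arrays.
 * The layout and the n, m, p arguments are illustrative assumptions. */
void mm1(float *a, const float *b, const float *c, int n, int m, int p)
{
    assert(n > 0 && m > 0 && p > 0);

    /* Ask an accelerator-aware compiler to offload this block.
     * Other compilers ignore the pragma and keep the loops on the host. */
    #pragma acc region
    {
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < m; j++) {
                float sum = 0.0f;
                for (int k = 0; k < p; k++) {
                    sum += b[i * p + k] * c[k * m + j];
                }
                a[i * m + j] = sum;
            }
        }
    }
}
```

Because the pragma is the only accelerator-specific line, the function can be compiled and tested on a plain x86 machine before any GPU is involved.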

In fact, this is only partially true. The reality is that parallel programming on any target is likely to require some restructuring for optimum performance. "Parallel programming is not easy," admits PGI compiler engineer Michael Wolfe. According to him, this first step in CPU-GPU compiler technology is to make the problem more approachable. (A deeper discussion of Wolfe's thoughts on PGI's GPU programming model is available here.)

PGI currently has a beta version of the x64+NVIDIA compiler under development, which will be made available for technical preview in January. The company hopes to have a production version of the product in mid-2009. PGI is also working with AMD on a version for the FireStream accelerators and will make use of the associated SDK for that target.

There may be a time when the pragmas and directives can be done away with entirely, and the compiler alone can determine when to generate GPU code. But right now, the difference between high performance and low performance on these accelerators is so great that it's better to let the programmer direct the compiler to the code that is most compute intensive, and is thus worth the overhead of transferring the data from host to GPU. It's possible that as compiler technology matures and GPUs are more tightly integrated with CPUs, both performance optimization and auto-acceleration can be wrapped up into the compiler.
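The trade-off between compute intensity and transfer overhead can be sketched as a back-of-the-envelope cost model. Everything here is an assumption for illustration: the function names, the idea of a single bus bandwidth figure, and the notion that the kernel alone runs a fixed factor faster on the GPU. It is not a description of how any compiler decides.

```c
#include <assert.h>

/* Estimated wall time in seconds if a region is offloaded:
 * time to move the data over the bus plus the accelerated compute time. */
double offload_time(double bytes_moved, double bus_bytes_per_sec,
                    double cpu_seconds, double kernel_speedup)
{
    assert(bus_bytes_per_sec > 0.0 && kernel_speedup > 0.0);
    return bytes_moved / bus_bytes_per_sec + cpu_seconds / kernel_speedup;
}

/* Returns 1 when offloading is estimated to beat staying on the CPU,
 * 0 otherwise. */
int offload_pays_off(double bytes_moved, double bus_bytes_per_sec,
                     double cpu_seconds, double kernel_speedup)
{
    double t = offload_time(bytes_moved, bus_bytes_per_sec,
                            cpu_seconds, kernel_speedup);
    return t < cpu_seconds;
}
```

With hypothetical numbers, moving 4 GB over a 4 GB/s bus costs one second. A region that takes 10 seconds on the CPU and runs 10 times faster on the GPU finishes in about 2 seconds overall, so offloading wins. A region that takes only half a second on the CPU loses, because the transfer alone costs more than the original computation.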

France-based CAPS Enterprise is hoping compilers never get that smart. Like PGI, CAPS is offering a heterogeneous software development environment, initially targeting x86 platforms with GPU accelerators. In this case, though, x86 code generation will be accomplished via third-party tools such as Intel's C and Fortran compilers and GNU's GCC. For the CAPS offering to make sense, accelerator compilation and CPU compilation have to remain separate.

The CAPS offering, called HMPP, preprocesses its own C pragmas and Fortran directives to generate native accelerator source code, either NVIDIA's CUDA or AMD's CAL. The accelerator source code is packaged into a "codelet," which can subsequently be modified by the developer for further tuning. Then the codelet is passed to the GPU vendor toolchain to create the object binary, which gets loaded onto the GPU. The CPU code left behind is compiled with the third-party x86 compiler tools and the HMPP runtime glues the CPU and GPU code together.

The general syntax of the HMPP pragmas and directives is almost identical to the PGI versions: #pragma hmpp <label> <directive type> [, <directive parameter>]* [&] for C and !$hmpp <label> <directive type> [, <directive parameter>]* [&] for Fortran.
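A minimal sketch of how the HMPP form might be applied in C follows. The label scale1, the function names, and the specific directive parameters (codelet, callsite, target=CUDA, args[y].io=out) are assumptions for illustration and should be checked against the CAPS documentation. The shape is what matters: one pragma marks the function to be turned into a codelet, and a second pragma with the same label marks the call that the runtime may redirect to the accelerator. A standard C compiler ignores both pragmas and calls the function on the CPU.

```c
#include <assert.h>

/* Declare the function below as the codelet labeled "scale1".
 * The label and directive parameters are illustrative assumptions. */
#pragma hmpp scale1 codelet, target=CUDA, args[y].io=out
void vscale(int n, float alpha, const float *x, float *y)
{
    for (int i = 0; i < n; i++) {
        y[i] = alpha * x[i];
    }
}

/* Host-side wrapper; the name is hypothetical. */
void scale_on_accelerator(int n, float alpha, const float *x, float *y)
{
    assert(n > 0);

    /* Mark the call that the HMPP runtime may execute on the GPU. */
    #pragma hmpp scale1 callsite
    vscale(n, alpha, x, y);
}
```

The label ties the two pragmas together, which is the main visible difference from the PGI form, where a single region directive wraps the code in place.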

The CAPS pragmas/directives were designed for low-level control of the GPU in order to enable the developer to optimize operations such as data transfers, synchronous/asynchronous execution, data preloading and device management. PGI has defined some lower level directives as well. (See the PGI accelerator whitepaper at http://www.pgroup.com/lit/pgi_whitepaper_accpre.pdf for more detail.)

CAPS currently supports all CUDA-enabled NVIDIA GPUs and AMD FireStream hardware and is planning to release a Cell-friendly version in the first quarter of 2009. The company is also working on support for Java, due to be released in the same timeframe. CAPS already has a number of customers using the tool for GPU-equipped clusters. Total, a French-based energy multinational, is using HMPP to speed RTM calculations on a Xeon-based machine hooked up to a 4-GPU Tesla S1070. Initial results demonstrated a 3.3-fold speed-up per GPU compared to 8 Xeon cores.
