Heterogeneous Compilers Ready for Takeoff

By Michael Feldman

December 9, 2008

The second wave of GPGPU software development tools is upon us. The first wave, exemplified by NVIDIA’s CUDA and AMD’s Brook+, allowed early adopters to get started with GPU computing via low-level, vendor-specific tools. Next-generation tools from The Portland Group Inc. (PGI) and France-based CAPS Enterprise enable everyday C and Fortran programmers to tap into GPU acceleration within an integrated heterogeneous computing environment.

Over the past five years, the HPC community has coalesced around the x86 architecture. That made the choice of targets easy for companies like PGI. Today 85 percent of the TOP500 is based on 64-bit x86 microprocessors, and the percentage is probably even higher in the sub-500 realm. While Intel and AMD continue to innovate with multicore architectures, their processors remain constrained to clock frequencies of around 2.5-3.5 GHz.

Meanwhile, GPUs have become general-purpose vector processors with hundreds of simple cores and are scaling at a faster rate than CPUs. The fact that compiler vendors like PGI are now targeting GPUs says a lot about where the industry is headed with general-purpose acceleration, especially in the HPC space.

“As a compiler vendor, we asked ourselves: ‘What comes next?'” said Doug Miles, director of Advanced Compilers and Tools at PGI. “Our best guess is that accelerated computing is what comes next.”

PGI is betting that 64-bit x86 with “some type of accelerator” will be the new platform of choice for many HPC applications. Right now, the GPU is the accelerator du jour of supercomputing. The first accelerator target for PGI is CUDA-enabled NVIDIA GPUs. To implement it, PGI will leverage the CUDA toolchain and associated SDK, while host-side compilation will rely on PGI’s x86 technology.

Since GPUs are attached to the host platform as external devices rather than as true coprocessors, the low-level software model is quite complex. From the host side, it involves data transfers between the CPU and the GPU (over PCIe), memory allocation/deallocation, and other low-level device management. On the GPU side, the code can also be fairly involved, since it has to deal with algorithm parallelization and the GPU’s own memory hierarchy.
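
To appreciate what these compilers are hiding, consider the host-side bookkeeping a hand-coded CUDA program performs just to run one simple operation on the device. The sketch below is plain C against the CUDA runtime API; the kernel itself, here called saxpy_kernel, is a hypothetical placeholder shown only in a comment:

 /* Minimal sketch of hand-coded CUDA host management, assuming the CUDA
    runtime API. saxpy_kernel is a hypothetical device kernel, shown only
    in a comment, since writing it requires nvcc and CUDA C. */
 #include <cuda_runtime.h>
 #include <stddef.h>

 void saxpy_on_gpu(int n, float a, const float *x, float *y)
 {
     size_t bytes = (size_t)n * sizeof(float);
     float *d_x, *d_y;
     (void)a;  /* used only by the (omitted) kernel launch */

     /* explicit device memory allocation */
     cudaMalloc((void **)&d_x, bytes);
     cudaMalloc((void **)&d_y, bytes);

     /* explicit host-to-device transfers over PCIe */
     cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice);
     cudaMemcpy(d_y, y, bytes, cudaMemcpyHostToDevice);

     /* the parallelized kernel would be launched here, e.g.
        saxpy_kernel<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y); */

     /* explicit device-to-host transfer and cleanup */
     cudaMemcpy(y, d_y, bytes, cudaMemcpyDeviceToHost);
     cudaFree(d_x);
     cudaFree(d_y);
 }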

To make GPU programming more productive, it’s worthwhile to hide most of these details from the application developer. What PGI has done is define a set of C pragmas and Fortran directives that can be embedded in the source code to direct the compiler to offload the specified code sequences to the GPU.

This approach is analogous to OpenMP, which defines pragmas and directives to apply multithreading on top of a sequential program. Unlike a library-based approach, this model enables developers to maintain a common source base for a variety of different targets. In the PGI case, non-accelerator-aware compilers can use the same source, but will just ignore the foreign pragmas or directives. Even within the PGI environment, the accelerator pragmas and directives can be switched off at compile time so that only x86 code is generated.
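
The analogy is easy to see in a trivial loop. In the OpenMP sketch below (standard OpenMP, shown purely for comparison), a compiler that does not recognize the pragma simply compiles the sequential loop from the very same source file:

 /* Standard OpenMP, for comparison only: an OpenMP-unaware compiler
    ignores the unrecognized pragma and builds the sequential loop,
    so a single source file serves both targets. */
 void scale(int n, float a, float *x)
 {
     int i;
 #pragma omp parallel for
     for (i = 0; i < n; i++)
         x[i] = a * x[i];
 }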

The general form of the C accelerator pragma is #pragma acc directive-name [clause [,clause]…]; the equivalent for Fortran is !$acc directive-name [clause [,clause]…]. Applying an accelerator region to a matrix multiplication loop in Fortran looks like this:

 module mymm
 contains
 subroutine mm1( a, b, c, m )
    real, dimension(:,:) :: a,b,c
    integer i,j,k,m,n,p
    n = size(a,1)   ! row extent of a and b
    p = size(b,2)   ! inner (contraction) dimension
    !$acc region
       do j = 1,m
          do i = 1,n
             a(i,j) = 0.0
          enddo
          do k = 1,p
             do i = 1,n
                a(i,j) = a(i,j) + b(i,k) * c(k,j)
             enddo
          enddo
       enddo
    !$acc end region
 end subroutine
 end module

The loop nest enclosed by the accelerator region will be parallelized and offloaded to the GPU, assuming one is present. For the entire program, the compiler will generate both CPU and GPU code, which are subsequently linked together into a single executable file. Since the generated code handles all the necessary data transfers, memory management and device bookkeeping, the programmer does not need any special knowledge of the accelerator architecture.
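
For C programmers, the counterpart of the Fortran example would look roughly like the sketch below, with #pragma acc region applied to a structured block. Since the compiler was still in beta at the time of writing, the exact spelling should be treated as illustrative:

 /* Illustrative C counterpart of the Fortran matrix multiply, using the
    #pragma acc region form on a structured block. Treat the exact
    syntax as a sketch; the beta compiler defines the final grammar. */
 void mm1(int m, int n, int p, float a[n][m], float b[n][p], float c[p][m])
 {
     int i, j, k;
 #pragma acc region
     {
         for (j = 0; j < m; j++) {
             for (i = 0; i < n; i++)
                 a[i][j] = 0.0f;
             for (k = 0; k < p; k++)
                 for (i = 0; i < n; i++)
                     a[i][j] += b[i][k] * c[k][j];
         }
     }
 }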

In fact, that is only partially true: parallel programming on any target is likely to require some restructuring of the code for optimum performance. “Parallel programming is not easy,” admits PGI compiler engineer Michael Wolfe. According to him, this first step in CPU-GPU compiler technology is meant to make the problem more approachable. (A deeper discussion of Wolfe’s thoughts on PGI’s GPU programming model is available here.)

PGI currently has a beta version of the x64+NVIDIA compiler under development, which will be made available for technical preview in January. The company hopes to have a production version in mid-2009. It is also working with AMD on a version for the FireStream accelerators and will make use of the associated SDK for that target.

There may come a time when the pragmas and directives can be done away with entirely, and the compiler alone can determine when to generate GPU code. But right now, the difference between high and low performance on these accelerators is so great that it’s better to let the programmer direct the compiler to the code that is most compute-intensive, and thus worth the overhead of transferring the data from host to GPU. It’s possible that as compiler technology matures and GPUs become more tightly integrated with CPUs, both performance optimization and auto-acceleration can be wrapped up into the compiler.

France-based CAPS Enterprise is hoping compilers never get that smart. Like PGI, CAPS is offering a heterogeneous software development environment, initially targeting x86 platforms with GPU accelerators. In this case, though, x86 code generation will be accomplished via third-party tools such as Intel’s C and Fortran compilers and GNU’s GCC. For the CAPS offering to make sense, accelerator compilation and CPU compilation have to remain separate.

The CAPS offering, called HMPP, preprocesses its own C pragmas and Fortran directives to generate native accelerator source code, either NVIDIA’s CUDA or AMD’s CAL. The accelerator source code is packaged into a “codelet,” which can subsequently be modified by the developer for further tuning. Then the codelet is passed to the GPU vendor toolchain to create the object binary, which gets loaded onto the GPU. The CPU code left behind is compiled with the third-party x86 compiler tools and the HMPP runtime glues the CPU and GPU code together.

The general syntax of the HMPP pragmas and directives is almost identical to the PGI versions: #pragma hmpp for C and !$hmpp for Fortran.
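
As an illustration of the codelet/callsite workflow described above, an HMPP-annotated C function might look like the sketch below. The directive grammar is CAPS’ own, so the clause names here should be read as approximate rather than definitive:

 /* Hypothetical sketch of HMPP-style annotations; the exact directive
    grammar and clause names are defined by CAPS and may differ. The
    function is declared as a codelet targeting CUDA, and the marked
    call becomes a GPU invocation handled by the HMPP runtime. */
 #pragma hmpp scale codelet, target=CUDA, args[x].io=inout
 void scale_codelet(int n, float a, float x[n])
 {
     int i;
     for (i = 0; i < n; i++)
         x[i] = a * x[i];
 }

 void caller(int n, float a, float *x)
 {
 #pragma hmpp scale callsite
     scale_codelet(n, a, x);
 }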

The CAPS pragmas/directives were designed for low-level control of the GPU in order to enable the developer to optimize operations such as data transfers, synchronous/asynchronous execution, data preloading and device management. PGI has defined some lower level directives as well. (See the PGI accelerator whitepaper at http://www.pgroup.com/lit/pgi_whitepaper_accpre.pdf for more detail.)

CAPS currently supports all CUDA-enabled NVIDIA GPUs and AMD FireStream hardware, and is planning to release a Cell-friendly version in the first quarter of 2009. The company is also working on support for Java, due to be released in the same timeframe. CAPS already has a number of customers using the tool on GPU-equipped clusters. Total, the France-based energy multinational, is using HMPP to speed reverse time migration (RTM) calculations on a Xeon-based machine hooked up to a four-GPU Tesla S1070. Initial results demonstrated a 3.3-fold speed-up per GPU compared to eight Xeon cores.
