Heterogeneous Compilers Ready for Takeoff

By Michael Feldman

December 9, 2008

The second wave of GPGPU software development tools is upon us. The first wave, exemplified by NVIDIA’s CUDA and AMD’s Brook+, allowed early adopters to get started with GPU computing via low-level, vendor-specific tools. Next-generation tools from The Portland Group Inc. (PGI) and France-based CAPS Enterprise enable everyday C and Fortran programmers to tap into GPU acceleration within an integrated heterogeneous computing environment.

Over the past five years, the HPC community has coalesced around the x86 architecture. That made the choice of targets easy for companies like PGI. Today 85 percent of the TOP500 is based on 64-bit x86 microprocessors, and the percentage is probably even higher in the sub-500 realm. While Intel and AMD are continuing to innovate with multicore architectures, they are constrained to clock frequencies of around 2.5-3.5 GHz.

Meanwhile, GPUs have become general-purpose vector processors with hundreds of simple cores and are scaling at a faster rate than CPUs. The fact that compiler vendors like PGI are now targeting GPUs says a lot about where the industry is headed with general-purpose acceleration, especially in the HPC space.

“As a compiler vendor, we asked ourselves: ‘What comes next?'” said Doug Miles, director of Advanced Compilers and Tools at PGI. “Our best guess is that accelerated computing is what comes next.”

PGI is betting that 64-bit x86 with “some type of accelerator” will be the new platform of choice for many HPC applications. Right now, the GPU is the accelerator du jour of supercomputing. The first accelerator target for PGI is CUDA-enabled NVIDIA GPUs. To implement it, PGI will leverage the CUDA toolchain and associated SDK, while host-side compilation will rely on PGI’s x86 technology.

Since GPUs are attached to the host platform as external devices rather than as true coprocessors, the low-level software model is quite complex. From the host side, it involves data transfers between the CPU and the GPU (over PCIe), memory allocation/deallocation, and other low-level device management. On the GPU side, the code can also be fairly involved, since it has to deal with algorithm parallelization and the GPU’s own memory hierarchy.
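To appreciate what is being hidden, consider the host-side bookkeeping a hand-coded CUDA offload entails. The sketch below uses standard CUDA runtime calls, but the wrapper function itself is hypothetical and the kernel launch is elided:

 #include <cuda_runtime.h>

 /* Hypothetical host-side wrapper: allocate device buffers, move data
    across PCIe, invoke a kernel, copy results back, and clean up. */
 void offload( float *a, const float *b, const float *c, size_t bytes )
 {
    float *d_a, *d_b, *d_c;
    cudaMalloc( (void **)&d_a, bytes );                   /* device allocation */
    cudaMalloc( (void **)&d_b, bytes );
    cudaMalloc( (void **)&d_c, bytes );
    cudaMemcpy( d_b, b, bytes, cudaMemcpyHostToDevice );  /* CPU-to-GPU transfer */
    cudaMemcpy( d_c, c, bytes, cudaMemcpyHostToDevice );
    /* ... kernel launch goes here ... */
    cudaMemcpy( a, d_a, bytes, cudaMemcpyDeviceToHost );  /* GPU-to-CPU transfer */
    cudaFree( d_a ); cudaFree( d_b ); cudaFree( d_c );    /* deallocation */
 }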

To make GPU programming more productive, it’s worthwhile to hide most of these details from the application developer. What PGI has done is define a set of C pragmas and Fortran directives that can be embedded in the source code and direct the compiler to offload the specified code sequences to the GPU.

This approach is analogous to OpenMP, which defines pragmas and directives to apply multithreading on top of a sequential program. Unlike a libraries-based approach, this model enables developers to maintain a common source base for a variety of different targets. In the PGI case, non-accelerator-aware compilers can use the same source, but will just ignore the foreign pragmas or directives. Even within the PGI environment, the accelerator pragmas and directives can be switched off at compile time so that only x86 code is generated.
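For illustration, here is how that compile-time switch might look with PGI’s command-line tools; the option spelling (-ta=nvidia) follows PGI’s early accelerator releases and should be treated as illustrative:

 pgf90 -ta=nvidia mm.f90    # generate combined x64 + NVIDIA GPU code
 pgf90 mm.f90               # same source, x64 only; the !$acc lines are ignored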

The general form of the C accelerator pragma is #pragma acc directive-name [clause [,clause]…] ; the equivalent for Fortran is !$acc directive-name [clause [,clause]…]. Applying an accelerator region to a matrix multiplication loop in Fortran would look like this:

 module mymm
 contains
 ! Multiply b(n,p) by c(p,m), storing the result in a(n,m)
 subroutine mm1( a, b, c, m )
    real, dimension(:,:) :: a,b,c
    integer i,j,k,m,n,p
    n = size(a,1)
    p = size(b,2)
    !$acc region
       do j = 1,m
          do i = 1,n
             a(i,j) = 0.0
          enddo
          do k = 1,p
             do i = 1,n
                a(i,j) = a(i,j) + b(i,k) * c(k,j)
             enddo
          enddo
       enddo
    !$acc end region
 end subroutine
 end module
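The C counterpart uses the pragma form shown above. The following minimal sketch applies it to the same computation; the C99 signature and row-major indexing are illustrative rather than drawn from PGI’s documentation:

 void mm1( int n, int m, int p, float a[n][m], float b[n][p], float c[p][m] )
 {
    #pragma acc region
    {
       for( int j = 0; j < m; ++j ){
          for( int i = 0; i < n; ++i )
             a[i][j] = 0.0f;
          for( int k = 0; k < p; ++k )
             for( int i = 0; i < n; ++i )
                a[i][j] += b[i][k] * c[k][j];
       }
    }
 }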

The loop enclosed by the accelerator directive will be parallelized and offloaded to the GPU, assuming one is present. For the entire program, the compiler will generate both CPU and GPU code, which are subsequently linked together in a single executable file. Since the generated code handles all the necessary data transfers, memory management and device bookkeeping, the programmer does not need any special knowledge of the accelerator architecture.

In fact, this is only partially true: parallel programming on any target is likely to require some code restructuring for optimum performance. “Parallel programming is not easy,” admits PGI compiler engineer Michael Wolfe. According to him, the aim of this first step in CPU-GPU compiler technology is to make the problem more approachable. (A deeper discussion of Wolfe’s thoughts on PGI’s GPU programming model is available here.)

PGI currently has a beta version of the x64+NVIDIA compiler under development, which will be made available for technical preview in January. The company hopes to ship a production version in mid-2009. It is also working with AMD on a version for the FireStream accelerators and will make use of the associated SDK for that target.

There may come a time when the pragmas and directives can be done away with entirely, and the compiler alone can determine when to generate GPU code. But right now, the difference between high performance and low performance on these accelerators is so great that it’s better to let the programmer direct the compiler to the code that is most compute intensive, and is thus worth the overhead of transferring the data from host to GPU. It’s possible that as compiler technology matures and GPUs are more tightly integrated with CPUs, both performance optimization and auto-acceleration will be wrapped up into the compiler.

France-based CAPS Enterprise is hoping compilers never get that smart. Like PGI, CAPS is offering a heterogeneous software development environment, initially targeting x86 platforms with GPU accelerators. In this case, though, x86 code generation will be accomplished via third-party tools such as Intel’s C and Fortran compilers and GNU’s GCC. For the CAPS offering to make sense, accelerator compilation and CPU compilation have to remain separate.

The CAPS offering, called HMPP, preprocesses its own C pragmas and Fortran directives to generate native accelerator source code, either NVIDIA’s CUDA or AMD’s CAL. The accelerator source code is packaged into a “codelet,” which can subsequently be modified by the developer for further tuning. Then the codelet is passed to the GPU vendor toolchain to create the object binary, which gets loaded onto the GPU. The CPU code left behind is compiled with the third-party x86 compiler tools and the HMPP runtime glues the CPU and GPU code together.
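To give a flavor of the model, here is a minimal sketch of HMPP-annotated C, assuming the codelet/callsite directive pair that CAPS documented for HMPP; the label, function and clause details are illustrative:

 /* The labeled function body becomes the GPU "codelet," generated as
    CUDA or CAL source; the shared label ties the callsite to it. */
 #pragma hmpp vmul codelet, target=CUDA, args[a].io=out
 void vmul( int n, float a[n], const float b[n], const float c[n] )
 {
    for( int i = 0; i < n; i++ )
       a[i] = b[i] * c[i];
 }

 int main( void )
 {
    float a[1024], b[1024], c[1024];
    /* ... initialize b and c ... */
    #pragma hmpp vmul callsite
    vmul( 1024, a, b, c );   /* the HMPP runtime offloads this call to the GPU */
    return 0;
 }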

The general syntax of the HMPP pragmas and directives is almost identical to the PGI versions: #pragma hmpp for C and !$hmpp for Fortran.

The CAPS pragmas/directives were designed for low-level control of the GPU in order to enable the developer to optimize operations such as data transfers, synchronous/asynchronous execution, data preloading and device management. PGI has defined some lower level directives as well. (See the PGI accelerator whitepaper at http://www.pgroup.com/lit/pgi_whitepaper_accpre.pdf for more detail.)

CAPS currently supports all CUDA-enabled NVIDIA GPUs and AMD FireStream hardware and is planning to release a Cell-friendly version in the first quarter of 2009. The company is also working on support for Java, due to be released in the same timeframe. CAPS already has a number of customers using the tool for GPU-equipped clusters. Total, a France-based energy multinational, is using HMPP to speed reverse time migration (RTM) calculations on a Xeon-based machine hooked up to a 4-GPU Tesla S1070. Initial results demonstrated a 3.3-fold speed-up per GPU compared to 8 Xeon cores.
