Optimizing Codes for Heterogeneous HPC Clusters Using OpenACC

By Enrico Calore et al.

July 3, 2017

Looking at the Top500 and Green500 ranks, one clearly realizes that most HPC systems are heterogeneous architectures using COTS (Commercial Off-The-Shelf) hardware, combining traditional multi-core CPUs with massively parallel accelerators, such as GPUs and MICs.

With processor frequencies now hitting a solid wall, the only truly open avenue for riding Moore’s law today is increasing hardware parallelism in several different ways: more computing nodes, more processors in each node, more cores within each processor, and longer vector instructions in each core. This trend means that applications must learn to use all these levels of hardware parallelism efficiently if we want performance measured at the application level to grow consistently with hardware performance. Adding to this complexity, single computing nodes adopt different architectures, with multi-core CPUs supporting different instruction sets, vector lengths and cache organizations. GPUs provided by different vendors also have different architectures in terms of number of cores, cache organization, etc. For code developers the current goal is to map all the parallelism available at the application level onto all hardware resources using architecture-oblivious approaches, targeting portability of both code and performance across different architectures.

Several programming languages and frameworks try to tackle the different levels of parallelism available in hardware systems, but most of them are not portable across different architectures. As an example, GPUs are largely used for scientific HPC applications because a reasonable compromise of easy programmability and performance has been made possible by ad-hoc proprietary languages (e.g., CUDA for Nvidia GPUs), but these languages are by definition not portable to different accelerators. Several open-standard languages have tried to address this problem (e.g., OpenCL), targeting in principle multiple architectures, but the lack of support from various vendors has limited their usefulness.

The need to exploit the computing power of these systems in conjunction with the lack of standardization in their hardware and/or programming frameworks raised new issues for software development strongly impacting software maintainability, portability and performance. The use of proprietary languages targeting specific architectures, or open-standard languages not embraced by all vendors, often led to multiple implementations of the same code to target different architectures. For this reason there are several implementations for various scientific codes, e.g., MPI plus OpenMP and C/C++ to target CPU based clusters; MPI plus CUDA to target Nvidia GPU based clusters; or MPI plus OpenCL for AMD GPU based clusters.

The developers who pursued this strategy soon realized that maintaining multiple versions of the same code is very expensive. This is even worse for scientific software development, since it is often characterized by frequent code modifications, by the need for strong performance optimization, and also by a long software lifetime, which may span tens of years. Ideally, a programming language for scientific HPC applications should be portable across most of the current architectures, allow applications to run efficiently, and also enable them to run on future architectures without requiring a complete code rewrite.

Directive-based programming models try to address exactly this problem, abstracting parallel programming to a descriptive level, where programmers help the compiler to identify parallelism in the code, as opposed to a prescriptive level, where programmers must specify how the code should be mapped onto the hardware of the target machine.

OpenMP (Open Multi-Processing) is probably the most common of such programming models, already used by a wide scientific community, but initially it was not designed to support accelerators. To fill this gap, in November 2011, a new standard named OpenACC (Open Accelerators) was proposed by Cray, PGI, Nvidia, and CAPS. OpenACC is a programming standard for parallel computing that allows programmers to annotate C, C++ or Fortran codes, pointing out to the compiler the parallelizable regions that can be offloaded to a generic accelerator.

Both OpenMP and OpenACC are based on directives: OpenMP was introduced to manage parallelism on traditional multi-core CPUs, while OpenACC was initially developed to provide the accelerator support that OpenMP lacked. Today these two frameworks are converging and extending their scope to cover a large subset of HPC architectures: OpenMP version 4.0 has been designed to also support code offloading to accelerators, while compilers supporting OpenACC (such as PGI or GCC) are starting to use the same directives to target multi-core CPUs as well.

“First as a member of the Cray technical staff and now as a member of the Nvidia technical staff, I am working to ensure that OpenMP and OpenACC move towards parity whenever possible,” said James Beyer, co-chair of the OpenMP accelerator sub-committee and the OpenACC technical committee.

Back in 2014 our research group at the University of Ferrara, in collaboration with the Theoretical Physics group of the University of Pisa, started the development of a Lattice QCD Monte Carlo application, aiming to make it portable across different heterogeneous HPC systems. From the computational point of view, this kind of simulation mainly executes stencil operations performing complex vector-matrix multiplications on a 4-dimensional lattice.

At the time we were using two different versions developed within the Pisa group: a C++ implementation targeting CPU based clusters and a C++/CUDA implementation targeting Nvidia GPU based clusters. Maintaining the two different versions was particularly expensive, so the availability of a language such as OpenACC offered the interesting possibility to move towards a single portable implementation. The main interest was towards GPU based clusters, but we also aimed to target other architectures like the Intel Knights Landing (KNL, not available yet at the time).

We started this project coming from an earlier experience of porting a similar application to OpenCL, which, despite being an open standard, later ceased to be supported on Nvidia GPUs, forcing us to completely rewrite the application. From this point of view a directive-based OpenACC code provides an additional safeguard: when the directives are ignored, it is still a perfectly working plain C, C++ or Fortran code, which can be “easily” re-annotated using other directives and run on other architectures.

Although decorating a code with directives seems a straightforward operation requiring minimal programming effort, this is often not enough if performance portability is required in addition to code portability.

Just to mention one issue, memory data layout has a strong impact on performance on different architectures, and this design step is critical when implementing new codes, as changing the data layout at a later stage is seldom a viable option. The C++ and CUDA versions we were starting from diverged exactly in the data layout used to store the lattice: we had an AoS (Array of Structures) layout for the CPU-optimized version and an SoA (Structure of Arrays) layout for GPUs.

We started by porting the most computationally intensive kernel of the full code, the so-called Dirac Operator, to plain C, annotating it with OpenACC directives, and developed a first benchmark. This benchmark was used to evaluate possible performance drawbacks associated with an architecture-agnostic implementation. It provided very useful information on the performance impact of different data layouts; we were happy to learn that the Structure of Arrays (SoA) memory data layout is preferred not only when using GPUs, but also when using modern CPUs, provided that vectorization is enforced. This stems from the fact that the SoA format allows vector units to process many sites of the application domain (the lattice, in our case) in parallel, favoring architectures with long vector units (e.g., with wide SIMD instructions). Modern CPUs tend to have longer and longer vector units and we expect this trend to continue in the future. For this reason, data structures related to the lattice in our code were designed to follow the SoA paradigm.

Since at that time no OpenACC compiler for CPU was able to use vector instructions, we replaced OpenACC directives with OpenMP ones and compiled the code using the Intel Compiler. Table 1 shows the results of this benchmark.

After this initial benchmark, further development iterations led to a full implementation of the complete Monte Carlo code annotated with OpenACC directives and portable across several architectures. To give an idea of the level of performance portability, we report in Table 2 the execution times of the Dirac operator, compiled by the PGI 16.10 compiler (which now also targets multi-core CPUs) on a variety of architectures: Haswell and Broadwell Intel CPUs, the W9100 AMD GPU and Kepler and Pascal Nvidia GPUs.

Concerning code portability, we have shown that the same user-grade code implementation runs on an interesting variety of state-of-the-art architectures. When we focus on performance portability, some issues are still present. The Dirac operator is strongly memory-bound, so both Intel CPUs should be roughly three times slower than Kepler GPUs, corresponding to their respective memory bandwidths (about 70 GB/s vs. 240 GB/s); what we measure is that performance is approximately 10 times worse on the Haswell CPU than on one K80 GPU. The Broadwell CPU runs approximately two times faster than the Haswell CPU, at least for some lattice sizes, but still does not reach the memory-bandwidth limit. We have identified two main reasons for this non-optimal behavior, and both of them point to some still immature features of the PGI compiler when targeting x86 architectures:

  • Parallelization: when encountering nested loops, the compiler splits the outer loop across different threads, while inner loops are executed serially or vectorized within each thread. Thus, in this implementation, the four nested loops over the four lattice dimensions cannot be efficiently divided into a sufficiently large number of threads to exploit all the available cores of modern CPUs.
  • Vectorization: as reported by the compilation logs, the compiler fails to vectorize the Dirac operator. To verify whether this is related to how we have coded these functions, we have translated the OpenACC directives into the corresponding OpenMP ones, without changing the C code, and compiled using the Intel compiler (version 17.0.1). In this case the compiler succeeds in vectorizing the function, which runs a factor of 2 faster.

Concerning the AMD GPUs as well, performance is worse than expected and the compiler is not yet sufficiently stable (we had erratic compiler crashes). To make things even worse, we found that support for this architecture has been dropped by the PGI compiler (16.10 is the last version supporting AMD devices), and thus, if no other compilers appear on the market, running OpenACC applications on AMD GPUs will not be easy in the future.

On the other hand, for Nvidia GPUs, performance results are similar to the ones obtainable by our previous CUDA implementation, showing a maximum performance drop of 25 percent for the full simulation code, only in some particular simulation conditions.

In conclusion, a portable implementation of a full Monte Carlo LQCD simulation is now in production on CPU and GPU clusters. The code runs efficiently on Nvidia GPUs, while performance on Intel CPUs could still be improved. We are confident that future releases of the PGI compiler will be able to fill the gap. Finally, we are also able to run on AMD GPUs, but for this architecture compiler support is an open issue with little hope for the future. In the near future we look forward to testing our code on the Intel KNL, as soon as reasonably stable official PGI support for that processor becomes available. As a final remark, we have shown that translating OpenACC codes to OpenMP and vice versa is a reasonably easy task, so, whichever standard prevails, we see a nice future for our application.

Authors:

Claudio Bonati, INFN and University of Pisa
Simone Coscetti, INFN Pisa
Massimo D’Elia, INFN and University of Pisa
Michele Mesiti, INFN and University of Pisa
Francesco Negro, INFN Pisa
Enrico Calore, INFN and University of Ferrara
Sebastiano Fabio Schifano, INFN and University of Ferrara
Giorgio Silvi, INFN and University of Ferrara
Raffaele Tripiccione, INFN and University of Ferrara
