Optimizing Codes for Heterogeneous HPC Clusters Using OpenACC

By Enrico Calore et al.

July 3, 2017

Looking at the Top500 and Green500 ranks, one clearly realizes that most HPC systems are heterogeneous architectures using COTS (Commercial Off-The-Shelf) hardware, combining traditional multi-core CPUs with massively parallel accelerators, such as GPUs and MICs.

With processor frequencies now hitting a solid wall, the only truly open avenue for riding Moore’s law today is increasing hardware parallelism in several different ways: more computing nodes, more processors in each node, more cores within each processor, and longer vector instructions in each core. This trend means that applications must exploit all these levels of hardware parallelism efficiently if we want performance measured at the application level to grow consistently with hardware performance. Adding to this complexity, individual computing nodes adopt different architectures, with multi-core CPUs supporting different instruction sets, vector lengths and cache organizations. GPUs from different vendors likewise differ in number of cores, cache organization, and so on. For code developers, the current goal is to map all the parallelism available at the application level onto all hardware resources using architecture-oblivious approaches that target portability of both code and performance across different architectures.

Several programming languages and frameworks try to tackle the different levels of parallelism available in hardware systems, but most of them are not portable across different architectures. As an example, GPUs are largely used for scientific HPC applications because a reasonable compromise of easy programmability and performance has been made possible by ad-hoc proprietary languages (e.g., CUDA for Nvidia GPUs), but these languages are by definition not portable to different accelerators. Several open-standard languages have tried to address this problem (e.g., OpenCL), targeting in principle multiple architectures, but the lack of support from various vendors has limited their usefulness.

The need to exploit the computing power of these systems, combined with the lack of standardization in their hardware and programming frameworks, has raised new issues for software development, strongly impacting maintainability, portability and performance. The use of proprietary languages targeting specific architectures, or of open-standard languages not embraced by all vendors, has often led to multiple implementations of the same code for different architectures. As a result, several scientific codes exist in multiple versions, e.g., MPI plus OpenMP and C/C++ to target CPU-based clusters; MPI plus CUDA to target Nvidia GPU-based clusters; or MPI plus OpenCL for AMD GPU-based clusters.

The developers who pursued this strategy soon realized that maintaining multiple versions of the same code is very expensive. This is even worse for scientific software, which is often characterized by frequent code modifications, by the need for aggressive performance optimization, and by a long lifetime that may span tens of years. Ideally, a programming language for scientific HPC applications should be portable across most current architectures, allow applications to run efficiently, and moreover should make it possible to run on future architectures without requiring a complete code rewrite.

Directive-based programming models try to address exactly this problem, raising parallel programming to a descriptive level, where programmers help the compiler to identify parallelism in the code, as opposed to a prescriptive level, where programmers must specify how the code should be mapped onto the hardware of the target machine.

OpenMP (Open Multi-Processing) is probably the most common of these programming models, already used by a wide scientific community, but it was not initially designed to support accelerators. To fill this gap, in November 2011 a new standard named OpenACC (Open Accelerators) was proposed by Cray, PGI, Nvidia, and CAPS. OpenACC is a programming standard for parallel computing that allows programmers to annotate C, C++ or Fortran codes to suggest to the compiler parallelizable regions to be offloaded to a generic accelerator.
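
As a minimal sketch of what such an annotation looks like (an illustrative loop, not taken from the application described in this article; the function and variable names are hypothetical), a single OpenACC directive is enough to suggest that a C loop be offloaded:

    /* Minimal OpenACC sketch (illustrative only): the directive suggests
       offloading the loop to an accelerator; a compiler without OpenACC
       support simply ignores it and the function remains plain C. */
    void saxpy(int n, float a, const float * restrict x, float * restrict y)
    {
        #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }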

Both OpenMP and OpenACC are based on directives: OpenMP was introduced to manage parallelism on traditional multi-core CPUs, while OpenACC was initially developed to fill the gap in accelerator support in OpenMP. Today the two frameworks are converging and extending their scope to cover a large subset of HPC architectures: OpenMP version 4.0 was designed to also support code offloading to accelerators, while compilers supporting OpenACC (such as PGI or GCC) are starting to use the same directives to target multi-core CPUs as well.

“First as a member of the Cray technical staff and now as a member of the Nvidia technical staff, I am working to ensure that OpenMP and OpenACC move towards parity whenever possible,” said James Beyer, co-chair of the OpenMP accelerator sub-committee and the OpenACC technical committee.

Back in 2014, our research group at the University of Ferrara, in collaboration with the Theoretical Physics group of the University of Pisa, started the development of a Lattice QCD Monte Carlo application, aiming to make it portable across different heterogeneous HPC systems. From the computational point of view, this kind of simulation mainly executes stencil operations performing complex vector-matrix multiplications on a 4-dimensional lattice.
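
To give non-practitioners a feel for the site-local arithmetic involved, the sketch below shows a small complex matrix acting on a complex vector, the basic building block of such stencils; the 3x3 size and all names are illustrative assumptions, not our actual Dirac operator code.

    #include <complex.h>

    /* Illustrative building block of a Lattice QCD stencil: a small complex
       matrix multiplying a complex vector at one lattice site.
       Size (3x3) and names are illustrative only. */
    static void site_mat_vec(const double complex m[3][3],
                             const double complex v[3],
                             double complex out[3])
    {
        for (int r = 0; r < 3; r++) {
            double complex acc = 0.0;
            for (int c = 0; c < 3; c++)
                acc += m[r][c] * v[c];
            out[r] = acc;
        }
    }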

At the time we were using two different versions developed within the Pisa group: a C++ implementation targeting CPU-based clusters and a C++/CUDA implementation targeting Nvidia GPU-based clusters. Maintaining the two versions was particularly expensive, so the availability of a language such as OpenACC offered the interesting possibility of moving towards a single portable implementation. The main interest was in GPU-based clusters, but we also aimed to target other architectures, such as the Intel Knights Landing (KNL), which was not yet available at the time.

We started this project coming from an earlier experience of porting a similar application to OpenCL, which, although an open standard, later ceased to be supported on Nvidia GPUs, forcing us to completely rewrite the application. From this point of view, a directive-based OpenACC code provides an additional safeguard: when the directives are ignored, it is still a perfectly working plain C, C++ or Fortran code, which can be “easily” re-annotated using other directives and run on other architectures.

Although decorating a code with directives seems a straightforward operation requiring minimal programming effort, this is often not enough if performance portability is required in addition to mere code portability.

To mention just one issue, memory data layout has a strong impact on performance across different architectures, and this design step is critical when implementing new codes, since changing the data layout at a later stage is seldom a viable option. The C++ and CUDA versions we were starting from diverged exactly in the data layout used to store the lattice: the CPU-optimized version used an AoS (Array of Structures) layout, while the GPU version used an SoA (Structure of Arrays) layout.
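
To make the distinction concrete, the sketch below shows the two layouts for a hypothetical three-component complex field on the lattice; the names, sizes and fixed number of components are illustrative, not our actual data structures.

    #include <complex.h>

    #define NSITES 4096   /* illustrative lattice volume */

    /* AoS: all components of one site are contiguous in memory.
       Convenient for site-local access, but consecutive sites are far
       apart within each component, which hinders vectorization. */
    typedef struct {
        double complex c[3];
    } site_t;
    site_t field_aos[NSITES];

    /* SoA: each component is stored in its own array indexed by site, so
       consecutive lattice sites are contiguous and vector units can
       process many sites in parallel. */
    typedef struct {
        double complex c0[NSITES];
        double complex c1[NSITES];
        double complex c2[NSITES];
    } field_soa_t;
    field_soa_t field_soa;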

We started by porting the most computationally intensive kernel of the full code, the so-called Dirac operator, to plain C, annotating it with OpenACC directives, and developed a first benchmark. This benchmark was used to evaluate possible performance drawbacks associated with an architecture-agnostic implementation. It provided very useful information on the performance impact of different data layouts; we were happy to learn that the Structure of Arrays (SoA) memory layout is preferable not only on GPUs but also on modern CPUs, provided vectorization is enforced. This stems from the fact that the SoA format allows vector units to process many sites of the application domain (the lattice, in our case) in parallel, favoring architectures with long vector units (e.g., with wide SIMD instructions). Modern CPUs tend to have longer and longer vector units, and we expect this trend to continue in the future. For this reason, the lattice-related data structures in our code were designed to follow the SoA paradigm.

Since at that time no OpenACC compiler targeting CPUs was able to exploit vector instructions, we replaced the OpenACC directives with OpenMP ones and compiled the code with the Intel compiler. Table 1 shows the results of this benchmark.
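
The directive translation itself is largely mechanical. As a hedged sketch (the loop and all names are illustrative, not the Dirac operator), the same plain-C loop can carry either annotation:

    /* OpenACC annotation: suggest offloading the loop to an accelerator. */
    void scale_acc(int n, double a, const double * restrict in, double * restrict out)
    {
        #pragma acc parallel loop copyin(in[0:n]) copyout(out[0:n])
        for (int i = 0; i < n; i++)
            out[i] = a * in[i];
    }

    /* Roughly equivalent OpenMP annotation for a multi-core CPU:
       thread parallelism plus explicit vectorization of the loop. */
    void scale_omp(int n, double a, const double * restrict in, double * restrict out)
    {
        #pragma omp parallel for simd
        for (int i = 0; i < n; i++)
            out[i] = a * in[i];
    }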

After this initial benchmark, further development iterations led to a full implementation of the complete Monte Carlo code, annotated with OpenACC directives and portable across several architectures. To give an idea of the level of performance portability, Table 2 reports the execution times of the Dirac operator, compiled with the PGI 16.10 compiler (which now also targets multi-core CPUs), on a variety of architectures: Intel Haswell and Broadwell CPUs, the AMD W9100 GPU, and Nvidia Kepler and Pascal GPUs.

Concerning code portability, we have shown that the same user-grade code implementation runs on an interesting variety of state-of-the-art architectures. Turning to performance portability, some issues remain. The Dirac operator is strongly memory-bound, so both Intel CPUs should be roughly three times slower than the Kepler GPU, in line with their respective memory bandwidths (about 70 GB/s vs. 240 GB/s); what we measure instead is that performance is approximately 10 times worse on the Haswell CPU than on one K80 GPU. The Broadwell CPU runs approximately two times faster than the Haswell CPU, at least for some lattice sizes, but still does not reach the memory-bandwidth limit. We have identified two main reasons for this sub-optimal behavior, both of which point to still-immature features of the PGI compiler when targeting x86 architectures:

  • Parallelization: when encountering nested loops, the compiler splits the outer loop across different threads, while inner loops are executed serially or vectorized within each thread. Thus, in this implementation, the four nested loops over the four lattice dimensions cannot be efficiently divided into a number of threads large enough to exploit all the available cores of modern CPUs (see the sketch after this list).
  • Vectorization: as reported by the compilation logs, the compiler fails to vectorize the Dirac operator. To verify whether this is related to how we coded these functions, we translated the OpenACC directives into the corresponding OpenMP ones, without changing the C code, and compiled with the Intel compiler (version 17.0.1). In this case the compiler succeeds in vectorizing the function, which runs a factor of 2 faster.
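
As a hedged sketch of the loop structure in question (extents, names and the trivial site update are illustrative, not the production kernel), an OpenACC collapse clause asks the compiler to merge the four lattice loops into a single iteration space; how effectively a given compiler honors such requests is exactly the issue observed above.

    /* Illustrative 4-nested loop over the lattice dimensions.  The
       collapse(4) clause asks the compiler to fuse the loops into one
       iteration space, so the work can be split across many threads or
       GPU workers instead of parallelizing only the outer loop. */
    void sweep(int nt, int nz, int ny, int nx, double * restrict field)
    {
        #pragma acc parallel loop collapse(4)
        for (int t = 0; t < nt; t++)
            for (int z = 0; z < nz; z++)
                for (int y = 0; y < ny; y++)
                    for (int x = 0; x < nx; x++) {
                        long i = ((long)(t * nz + z) * ny + y) * nx + x;
                        field[i] *= 0.5;   /* placeholder site update */
                    }
    }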

Concerning AMD GPUs as well, performance is worse than expected, and the compiler is not yet sufficiently stable (we experienced erratic compiler crashes). To make things worse, support for this architecture has been dropped by the PGI compiler (16.10 is the last version supporting AMD devices), so unless other compilers appear on the market, running OpenACC applications on AMD GPUs will not be easy in the future.

On the other hand, for Nvidia GPUs, performance results are similar to those obtained with our previous CUDA implementation, with a maximum performance drop of 25 percent for the full simulation code, and only under some particular simulation conditions.

In conclusion, a portable implementation of a full Monte Carlo LQCD simulation is now in production on CPU and GPU clusters. The code runs efficiently on Nvidia GPUs, while performance on Intel CPUs could still be improved; we are confident that future releases of the PGI compiler will close the gap. We are also able to run on AMD GPUs, but for this architecture compiler support is an open issue with little hope for the future. In the near future we look forward to testing our code on the Intel KNL, as soon as reasonably stable official PGI support for that processor becomes available. As a final remark, we have shown that translating OpenACC codes to OpenMP and vice versa is a reasonably easy task, so, whichever standard wins, we see a bright future for our application.

Authors:

Claudio Bonati, INFN and University of Pisa
Simone Coscetti, INFN Pisa
Massimo D’Elia, INFN and University of Pisa
Michele Mesiti, INFN and University of Pisa
Francesco Negro, INFN Pisa
Enrico Calore, INFN and University of Ferrara
Sebastiano Fabio Schifano, INFN and University of Ferrara
Giorgio Silvi, INFN and University of Ferrara
Raffaele Tripiccione, INFN and University of Ferrara
