Leveraging Standards-Based Parallel Programming in HPC Applications

By Jeff Larkin, Principal HPC Application Architect at NVIDIA

October 3, 2022

Last month I discussed why standards-based parallel programming should be in your HPC toolbox. Now, I am highlighting the successes of some of the developers who have already made standards-based parallelism an integral part of their strategy. As you will see, success with standards-based programming isn’t limited to just mini-apps.

Fluid Simulation with Palabos

Vorticity plot of airflow around a car.

 

Palabos is an open-source library developed at the University of Geneva for performing computational fluid dynamics simulations using Lattice Boltzmann methods. The core library is written in C++, and the developers wanted a way to maintain a single source code for both CPUs and GPU accelerators. ISO C++ parallel algorithms provide an attractive means of achieving portable on-node parallelism that composes well with their existing MPI code.

Dr. Jonas Latt and his team started converting their code to use C++ parallel algorithms by first developing the STLBM mini-app. This enabled them to quickly determine the best practices they would later apply to Palabos. The first thing they learned was that their existing data structures were not ideal for parallelization on either a GPU or a modern CPU, so they restructured STLBM to be data-oriented rather than object-oriented.
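The blog posts mentioned below walk through the actual restructuring; as a rough illustration with hypothetical names (not actual Palabos code), the shift from an object-oriented, array-of-structures layout to a data-oriented, structure-of-arrays layout looks something like this:

```cpp
#include <vector>

// Object-oriented, array-of-structures layout: each cell object owns its own
// populations. Threads working on neighboring cells touch scattered memory,
// which is a poor fit for GPUs and for CPU vector units.
struct CellAoS {
    double populations[19];  // e.g., a D3Q19 lattice
    int    flag;
};
using LatticeAoS = std::vector<CellAoS>;

// Data-oriented, structure-of-arrays layout: each quantity is stored
// contiguously across all cells, so parallel loops access memory with
// unit stride and coalesce well on a GPU.
struct LatticeSoA {
    std::vector<double> populations;  // numCells * 19 values, grouped by population index
    std::vector<int>    flags;        // numCells values
};
```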

After restructuring their data to be ready for parallelization, the team began replacing their existing for loops with C++ parallel algorithms. In many cases this is as simple as calling std::for_each or std::transform_reduce, although choosing the right algorithm for each job yields the best performance.
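As a minimal sketch (the loop bodies below are placeholders, not actual Palabos kernels), a serial loop and a serial reduction map onto the C++17 parallel algorithms like this; compiled with nvc++ -stdpar=gpu, the same code can be offloaded to a GPU:

```cpp
#include <algorithm>
#include <execution>
#include <functional>
#include <numeric>
#include <vector>

void relax_all_cells(std::vector<double>& cells)
{
    // Serial version:
    //   for (std::size_t i = 0; i < cells.size(); ++i) cells[i] *= 0.5;

    // Parallel version: the std::execution::par_unseq policy lets the
    // implementation run iterations in parallel (and offload them with
    // nvc++ -stdpar=gpu). The lambda stands in for a real collision kernel.
    std::for_each(std::execution::par_unseq, cells.begin(), cells.end(),
                  [](double& c) { c *= 0.5; });
}

double sum_of_squares(const std::vector<double>& cells)
{
    // A reduction expressed with std::transform_reduce instead of a loop:
    // square each element, then sum the results in parallel.
    return std::transform_reduce(std::execution::par_unseq,
                                 cells.begin(), cells.end(),
                                 0.0, std::plus<>{},
                                 [](double c) { return c * c; });
}
```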

Once they had addressed the on-node parallelism, it was time to optimize the scalability of their application. They found that they achieved the best scalability by mixing in the open-source Thrust library from NVIDIA to keep their MPI communication buffers in GPU memory. This optimization lets a CUDA-aware MPI library transfer data directly between GPU buffers, cutting the CPU out of the communication path altogether. The interoperability between ISO C++ and other C++-based libraries made this optimization possible.
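The exact implementation is described in the blog posts referenced below; a minimal sketch of the pattern, assuming a CUDA-aware MPI library and using illustrative function and buffer names, might look like this:

```cpp
#include <mpi.h>
#include <thrust/device_vector.h>

// Hypothetical sketch: a halo exchange using buffers that live in GPU memory.
// With a CUDA-aware MPI library, passing device pointers to MPI_Sendrecv
// lets the transfer go GPU-to-GPU without staging through host memory.
void exchange_halo(thrust::device_vector<double>& send_buf,
                   thrust::device_vector<double>& recv_buf,
                   int neighbor, MPI_Comm comm)
{
    MPI_Sendrecv(thrust::raw_pointer_cast(send_buf.data()),
                 static_cast<int>(send_buf.size()), MPI_DOUBLE, neighbor, 0,
                 thrust::raw_pointer_cast(recv_buf.data()),
                 static_cast<int>(recv_buf.size()), MPI_DOUBLE, neighbor, 0,
                 comm, MPI_STATUS_IGNORE);
}
```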

Palabos achieves 82% strong scaling efficiency mixing MPI and ISO C++ parallel algorithms

 

Even using ISO C++ parallelism rather than a lower-level approach such as CUDA C++, the team achieves a 55x speed-up running on their four GPUs compared to using all cores of their Xeon Gold CPU. They also recorded an 82% strong-scaling efficiency going from one GPU to four GPUs and a 93% weak-scaling efficiency running a 4x larger problem.

Dr. Latt has written a two-part post on the NVIDIA developer blog about his experience rewriting STLBM and Palabos to use MPI and ISO C++ parallel algorithms.

Magnetic field lines and volumetric density of the Solar corona produced by PSI’s models

Simulating Complex Solar Magnetic Fields

Predictive Science Incorporated is a scientific research company that studies the magnetohydrodynamic properties of the Sun’s corona and heliosphere. Their applications support several NASA missions to better understand the Sun. They have a number of scientific applications that use MPI and OpenACC to take advantage of GPU-accelerated HPC systems.

Dr. Ronald Caplan and Miko Stulajter asked whether compiler support for the Fortran language has evolved to the point that it is possible to refactor their applications to use Fortran’s do concurrent loops in place of OpenACC directives. They first attempted this with diffuse, a mini-app derived from their HipFT application. They found that they could replace OpenACC with do concurrent throughout diffuse, and they submitted their results to the “Workshop for Accelerator Programming using Directives” at Supercomputing 2021, winning the best paper award at that workshop.

Following the success with diffuse, they moved on to a more complex code, POT3D, which solves a potential field model of the Sun’s coronal magnetic field and is part of the SPEChpc benchmark suite. Unlike diffuse, POT3D uses MPI in addition to OpenACC, which they expected would make the refactoring more difficult. In practice, they could remove all but three OpenACC directives from the application: one to select the GPU device and two to perform atomic array updates. After removing some 77 directives, the performance using the NVIDIA nvfortran compiler on an NVIDIA A100 GPU was just 10% slower than their hand-written OpenACC code.
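Their paper and blog post show the real code; as a rough, hypothetical illustration of the pattern (not actual POT3D code), an OpenACC-decorated loop nest becomes a standard do concurrent loop that nvfortran can offload with the -stdpar=gpu flag:

```fortran
program do_concurrent_sketch
  implicit none
  integer, parameter :: nx = 512, ny = 512
  integer :: i, j
  real(8) :: a(nx, ny), b(nx, ny)
  real(8) :: alpha

  a = 1.0d0; b = 2.0d0; alpha = 0.5d0

  ! OpenACC version (illustrative only):
  !   !$acc parallel loop collapse(2) present(a, b)
  !   do j = 1, ny
  !     do i = 1, nx
  !       a(i, j) = a(i, j) + alpha * b(i, j)
  !     end do
  !   end do

  ! Standard Fortran version: no directive needed; nvfortran -stdpar=gpu
  ! maps the do concurrent iteration space onto the GPU.
  do concurrent (j = 1:ny, i = 1:nx)
    a(i, j) = a(i, j) + alpha * b(i, j)
  end do

  print *, 'checksum = ', sum(a)
end program do_concurrent_sketch
```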

POT3D performance with Fortran standard parallelism vs. OpenACC baseline

While a 10% loss in performance is a small price for removing 147 lines of code, they wanted to understand the cause of the slowdown and whether they could close the gap. After some experimentation, they determined that the culprit was data migration triggered by nvfortran’s use of CUDA Unified Memory. By adding back just enough directives to optimize that data movement, they brought the application’s performance back to that of the original baseline code.

Caplan and Stulajter now have a production application with 39 fewer directives and the same performance, on both the CPU and the GPU, as their original MPI+OpenACC code. You can read more about their experience using Fortran do concurrent in POT3D, including example code, on the NVIDIA developer blog.

In this article I’ve shown just two of the growing number of applications that have migrated their parallelism from specialized APIs to standards-based, language-level solutions. These teams observed little to no performance downside from the change and significant improvements in productivity and portability.

How to Get Started with Standards-based Parallel Programming

Interested in beginning to use standards-based parallel programming in your application? You can download the NVIDIA HPC SDK free today and experiment with our various compilers and tools.

NVIDIA GTC Fall 2022 just wrapped and has some great on-demand resources you can watch. I recommend checking out “A Deep Dive into the Latest HPC Software” and “Developing HPC Applications with Standard C++, Fortran, and Python”.

Jeff Larkin, Principal HPC Application Architect at NVIDIA

About Jeff Larkin

Jeff is a Principal HPC Application Architect in NVIDIA’s HPC Software team. He is passionate about the advancement and adoption of parallel programming models for High Performance Computing. He was previously a member of NVIDIA’s Developer Technology group, specializing in performance analysis and optimization of high performance computing applications. Jeff is also the chair of the OpenACC technical committee and has worked in both the OpenACC and OpenMP standards bodies. Before joining NVIDIA, Jeff worked in the Cray Supercomputing Center of Excellence, located at Oak Ridge National Laboratory.

 
