New C++ Sender Library Enables Portable Asynchrony

By Eric Niebler, Georgy Evtushenko, and Jeff Larkin at NVIDIA

December 5, 2022

NVIDIA is excited to announce the release of the stdexec library on GitHub and in the 22.11 release of the NVIDIA HPC Software Development Kit. The stdexec library is a proof-of-concept implementation of the Sender model (or Senders) for asynchronous computing—expected to become part of the C++ Standard in C++26. Senders help you separate the logic of your algorithm – the work you actually care about – from the details of where and how it gets executed. Unlike the C++17 parallel algorithms, with Senders you can chain multiple asynchronous computations (like GPU kernels) without unnecessary synchronization.

The upshot is that you can write your program logic once and then decide how to execute it, whether on a single thread, a thread pool, or a GPU, without changing that logic. It is also possible to target multi-GPU systems or heterogeneous multi-node compute clusters.

The design of stdexec is the product of an industry collaboration between NVIDIA and other members of the C++ Standardization Committee. It gives C++ an extensible, asynchronous programming model, a suite of generic algorithms that capture common async patterns, and hooks that let you say precisely where and how you want your work to execute. The Sender model is slated for standardization with C++26, but stdexec allows you to experiment with it today.

Asynchronous vs. Synchronous Programming

In the previous two posts (Leveraging Standards-Based Parallel Programming in HPC Applications and Why Standards-Based Parallel Programming Should be in Your HPC Toolbox), we discussed the benefits of standards-based parallel programming in the form of the C++17 standard parallel algorithms. The standard parallel algorithms provide an easy way to accelerate your application because they are a drop-in replacement for the classic standard algorithms.

Like the classic standard algorithms, the standard parallel algorithms are synchronous — they block until their work is done. They don’t let you leverage the inherent asynchrony of today’s hardware to hide latencies or to overlap communication and computation. We would need a standard asynchronous programming model and asynchronous parallel algorithms for that, things C++ lacks today.

The importance of asynchrony in HPC is illustrated in the example below, which implements a simulation of Maxwell’s Equations using the C++17 standard parallel algorithms. Maxwell’s equations model the propagation of electromagnetic waves. We use the finite-difference time-domain method (FDTD), which requires the computational domain to be represented as a grid of cells. The simulation requires two passes over the grid per iteration to update the magnetic and electric fields. Since the computation in each cell is independent within one iteration, we can parallelize cell updates using the C++17 standard parallel algorithms.

for (int step = 0; step < n_steps; step++) {
   std::for_each(std::execution::par_unseq, cells_begin, cells_end, update_h);
   std::for_each(std::execution::par_unseq, cells_begin, cells_end, update_e);
}

The straightforward CUDA C++ implementation is similar:

for (int step = 0; step < n_steps; step++) {
   kernel<<<grid_blocks, block_threads, 0, stream>>>(n_cells, update_h);
   kernel<<<grid_blocks, block_threads, 0, stream>>>(n_cells, update_e);
}
cudaStreamSynchronize(stream);

Although both of these implementations run on the GPU, the CUDA implementation performs better than the one based on the standard parallel algorithms. That’s because each call to std::for_each() is synchronous, so the latency of kernel launches is not overlapped. We profiled both implementations, and Figure 1 shows the execution timeline. The CUDA implementation is faster because the inherent asynchrony of kernel launches allows the launches for the next iteration to overlap with execution of the current iteration, effectively hiding the kernel launch latency. In contrast, the std::for_each implementation must synchronize after every invocation, which removes any opportunity for overlap.

Figure 1. Execution timeline for the std::for_each and CUDA implementations. The CUDA implementation is faster because std::for_each has to synchronize, which prevents kernel launches from being overlapped.

To illustrate the importance of exploiting asynchrony, Figure 2 compares the performance of the std::for_each implementation relative to the CUDA implementation across a variety of problem sizes.

Figure 2. Performance of the std::for_each implementation relative to CUDA across six problem sizes ranging from 16,384 to 16,777,216 cells. The CUDA implementation performs better because it can hide kernel launch latency by overlapping it with kernel execution. The impact is more significant at smaller problem sizes, where launch latency is large relative to overall execution time.

As we can see from Figure 2, the std::for_each implementation suffers at smaller problem sizes, where kernel launch latency is large relative to kernel execution time. As discussed above, the CUDA implementation hides this latency by overlapping it with useful work. In contrast, the synchronous std::for_each implementation cannot hide this latency, so it contributes directly to the overall execution time. As the problem size increases, kernel launch latency becomes negligible compared to kernel execution time and the performance difference eventually disappears.

While the CUDA C++ implementation is faster, it is platform specific. To achieve the same results in Standard C++, we need a new programming model that allows us to exploit asynchrony. To this end, the Sender model was developed as a new way to describe and execute asynchronous work in Standard C++. Senders are expected to become part of the C++ Standard in C++26, but NVIDIA already has a proof-of-concept implementation in the stdexec library.

To show the benefits of Senders, Figure 3 shows the performance of the Maxwell’s Equations example using stdexec compared to the CUDA and std::for_each implementations from earlier. Like the parallel algorithms, it can match raw CUDA performance for larger problem sizes, but it also performs well for smaller problem sizes, when latency dominates. This is because the asynchronous Sender model effectively hides kernel launch latency by overlapping it with execution of another kernel. All of the Maxwell’s Equation example programs can be found in the stdexec repository on GitHub. Keep reading to find out more about Senders and stdexec.

Figure 3. The stdexec implementation is on par with the CUDA implementation for all problem sizes because the Sender model allows chaining asynchronous operations without unnecessary synchronization. This enables the stdexec implementation to achieve the same overlap and latency hiding as the CUDA implementation.

A Standard C++ Model for Asynchrony

The results above show why it is important for Standard C++ to have a way to efficiently chain together asynchronous work. The std::execution, or Senders, proposal solves this problem by providing a programming model, baked into Standard C++, for describing and executing asynchronous operations. The stdexec library introduced above is the NVIDIA proof-of-concept implementation of the Sender model.

The Sender model has two key concepts: a Scheduler, which describes where and how to perform a computation; and a Sender, which is a description of an asynchronous computation. Senders can be chained together to build a pipeline where a Sender produces a value and then sends its value to the next Sender in the chain. The Sender model provides a set of algorithms to create and compose Senders to build completely asynchronous pipelines.

For example, the following code shows a Sender pipeline that implements the Maxwell’s Equations demo from above.

auto compute = stdexec::just()                               // (1)
             | exec::on( scheduler,                          // (2)
                 nvexec::repeat_n( n_steps,                  // (3)
                   stdexec::bulk( n_cells, update_h )        // (4)
                 | stdexec::bulk( n_cells, update_e ) ) );
stdexec::sync_wait( std::move(compute) );                    // (5)

First, you may notice we use several different namespaces. This is to help differentiate what is part of the formal std::execution proposal from the other things stdexec provides. The stdexec:: namespace is for anything that is already part of the std::execution proposal. The exec:: namespace is for generic utilities that are not yet part of the proposal but will be considered for future proposals. Finally, nvexec:: is for NVIDIA-specific schedulers and algorithms.
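
For reference, a typical set of includes might look like the following. The header paths mirror the layout of the stdexec repository at the time of writing and may change as the library evolves:

#include <stdexec/execution.hpp>        // stdexec:: algorithms from the std::execution proposal
#include <exec/static_thread_pool.hpp>  // exec:: utilities beyond the proposal (for example, a thread pool)
#include <nvexec/stream_context.cuh>    // nvexec:: NVIDIA-specific CUDA schedulers and algorithms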

The pipeline starts with the stdexec::just() Sender (1), which represents an empty computation that gives us an entry point upon which we can chain additional work using the pipe operator (operator|), like the Unix shell.

Next in our chain, we use exec::on() (2) to transition to a new scheduler and continue executing the pipeline there. The on() algorithm takes a scheduler as its first argument to say where the work should happen. The second argument is the work chain to execute. After the pipeline finishes, on() will transition automatically back to the starting execution context – in this case, the thread that called sync_wait() and is waiting for the pipeline to finish.

The nvexec::repeat_n() algorithm (3) repeats the execution of the work chain passed to it a fixed number of times.

Finally, the stdexec::bulk() algorithm (4) is similar to std::for_each() – it invokes the provided function with every index in [0, n_cells). If the scheduler supports parallelism, each invocation of the function may execute in parallel.

No work has started at this point; the variable compute above is just a description. To submit the entire pipeline and wait for its completion we use the sync_wait() algorithm (5).
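
To make these building blocks concrete outside the Maxwell example, here is a minimal sketch (the vector v, the value factor, and the lambdas are illustrative names of ours, not part of the proposal). It chains just(), bulk(), and then(), and executes only when sync_wait() is called:

std::vector<float> v(100, 1.0f);

auto snd = stdexec::just(2.0f)                                     // start the chain with a value
         | stdexec::bulk(v.size(),                                 // invoke the lambda for each i in [0, 100)
             [&](std::size_t i, float factor) { v[i] *= factor; })
         | stdexec::then([](float factor) { return factor; });     // bulk forwards its input value onward

// Nothing has run yet; sync_wait() submits the work and blocks until it completes.
auto [factor] = stdexec::sync_wait(std::move(snd)).value();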

Sender expressions like compute above are high-level descriptions of work that capture a computation’s structure and semantics; they are declarative. The scheduler used to run them controls how that description gets translated into the actual instructions that execute that work. Pass a CUDA scheduler, get CUDA execution. Pass a CPU-based thread-pool scheduler and the work is executed on the thread-pool. The translation happens at compile-time so you get native performance.

Application Performance Portability

In the example above, you’ll notice the scheduler is specified in just one place. In the performance results we showed above in Figure 3, we provided a scheduler that executes on the GPU using CUDA. However, we’re free to use other schedulers to execute elsewhere by changing a single line of code. Running this simulation on the CPU is as simple as passing a different scheduler.
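
For instance, retargeting the compute pipeline shown earlier only changes how the scheduler variable is obtained. The sketch below shows both options; the construction details are based on the stdexec repository at the time of writing and may differ in other versions:

// GPU: a CUDA stream scheduler from nvexec
nvexec::stream_context stream_ctx{};
auto scheduler = stream_ctx.get_scheduler();

// CPU: a thread pool scheduler from exec (here assuming eight worker threads)
// exec::static_thread_pool pool{8};
// auto scheduler = pool.get_scheduler();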

For example, Figure 4 below shows the performance of the stdexec pipeline described above using a CPU thread pool scheduler.  We compare its performance relative to two other CPU-based solutions: a parallel std::for_each() with OpenMP, and a highly-tuned implementation using raw std::threads. Here we see the solution with raw threads is fastest, closely followed by stdexec with a thread pool, and std::for_each with OpenMP not far behind.

Figure 4. When executed on a CPU-based thread pool, the stdexec implementation of the Maxwell’s Equations simulation is slightly faster than the CPU-based std::for_each implementation and slightly slower than a hand-tuned C++ implementation that uses std::threads directly.

The flexibility of the Sender model also enables us to trivially scale to multiple GPUs. stdexec also provides a multi-GPU scheduler that takes advantage of a system with multiple GPUs. Our preliminary results show the multi-GPU scheduler achieves 90% strong-scaling efficiency with four GPUs, as shown in Figure 5.
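
Switching to the multi-GPU scheduler is again just a change to how the scheduler is obtained. The type name below follows the stdexec repository at the time of writing and should be treated as illustrative:

nvexec::multi_gpu_stream_context multi_gpu_ctx{};
auto scheduler = multi_gpu_ctx.get_scheduler();  // distributes bulk work across all visible GPUs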

Figure 5. When used with a scheduler that launches work on all available GPUs, the performance of the Maxwell’s Equations solution scales almost linearly with the number of GPUs (one to four), achieving 90% strong-scaling efficiency with four GPUs.

Fine-Grained Execution Context Control

It’s common in HPC applications for your work to cascade from one execution context to another to match the workload with the computing resource best suited to it. Senders make it simple to define pipelines that span multiple execution contexts.

For example, consider the code below that uses stdexec for a distributed, multi-node implementation of the Maxwell’s Equations example using MPI for communication. We use a straightforward 2D partitioning of the grid where each rank operates on a 2D tile. Each rank first updates the magnetic field within its tile, then uses MPI to send the new values to the other ranks before doing the same for the electric field.

nvexec::stream_scheduler gpu = /*...*/;

auto work = stdexec::just()
          | exec::on( gpu, stdexec::bulk(accessor.own_cells(), update_h) )
          | stdexec::then( mpi_exchange_hx )
          | exec::on( gpu, stdexec::bulk(accessor.own_cells(), update_e) )
          | stdexec::then( mpi_exchange_ez );

stdexec::sync_wait( std::move( work ) );

Here, we use the exec::on() algorithm to transition to the GPU to perform the computation and back to the CPU to initiate communication in stdexec::then(). The then() algorithm calls the specified function using the values produced by the previous operation as arguments. Since we did not specify a scheduler for the MPI communication performed in stdexec::then(), it is implicitly performed on the thread that invoked sync_wait().
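
As a small, standalone illustration of how then() receives the predecessor’s values as arguments (the values here are arbitrary):

auto snd = stdexec::just(2, 3)                                 // produces the values 2 and 3
         | stdexec::then([](int a, int b) { return a + b; });  // invoked as f(2, 3)
auto [sum] = stdexec::sync_wait(std::move(snd)).value();       // sum == 5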

The distributed algorithm above works, but it needlessly serializes operations that can be done in parallel; namely, computation and communication. Each tile has neighbors that are processed on other ranks. Those other ranks don’t need the full results of this rank, they only need to know the values of the elements at the edges of the tile. We can hide latency by having each rank compute the updated values at the edges of its tile and then sending those results while the inner elements of the tile are updated. We would do this iteratively, first for the magnetic field and then for the electric.

The code below shows how we can modify the example above to overlap the communication of the boundary elements with the computation of the interior elements. It uses stdexec::when_all(), which takes an arbitrary number of Senders and executes them concurrently. In this example, there are two concurrent operations:

  1. Processing interior elements
  2. Processing and then exchanging the boundary elements

Both of these operations start at the same time, but the boundary cells are processed on a scheduler with a higher CUDA priority to ensure overlap. As soon as the boundary elements have been processed, their updated values are sent to the neighboring ranks, while processing of the interior cells may still be ongoing.

auto compute_h = stdexec::when_all(
   stdexec::just() | exec::on(gpu, stdexec::bulk(interior_cells,      // (1)
                                                 interior_h_update)),
   stdexec::just() | exec::on(gpu_with_priority,                      // (2)
                              stdexec::bulk(border_cells,
                                            border_h_update))
                   | stdexec::then(exchange_hx));

Using when_all() to execute these two operations concurrently allows us to overlap the MPI communication inside exchange_hx with the computation of interior cells. This delivers up to 50% speedup compared to the example above. As shown in Figure 6 below, communication overlapping also provides better scaling.

Figure 6. Strong scaling efficiency of the baseline multi-node implementation versus the implementation that uses stdexec::when_all to overlap communication and computation, for eight to 64 GPUs. The Sender model makes it easy to achieve overlap that improves performance, especially as the number of nodes increases and communication becomes more expensive.

To illustrate the difference in strong scaling efficiency, we selected a small problem size (3 GB per GPU) and started with a single node (eight GPUs per node). For larger problem sizes (40 GB per GPU when launched on a single node), the efficiency of the overlapping version is 93%. It is important to note that no low-level synchronization primitives, stream synchronization, or explicit thread management were used in the example.

How to Get Started with stdexec

If you want to try stdexec in your application, you can download the NVIDIA HPC SDK 22.11 for free today and experiment with our various compilers and tools. Alternatively, if you want to stay up-to-date with the latest developments, stdexec is actively maintained on GitHub. The NVIDIA HPC SDK nvc++ compiler and stdexec are also available on Compiler Explorer to enable you to easily try it out.
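
As a quick way to experiment, the following self-contained program compiles against the stdexec headers (the include path follows the GitHub repository layout at the time of writing) and can be pasted directly into Compiler Explorer:

#include <stdexec/execution.hpp>
#include <cstdio>

int main() {
  auto work = stdexec::just(21)
            | stdexec::then([](int x) { return 2 * x; });  // no work starts until sync_wait()
  auto [answer] = stdexec::sync_wait(std::move(work)).value();
  std::printf("%d\n", answer);  // prints 42
}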

Happy computing.


About Eric Niebler

Eric Niebler is a Distinguished Engineer and Developer Lead for the CUDA C++ Core Libraries Team at NVIDIA. He’s passionate about improving C++ programmer productivity and software quality with the use of great libraries. He specializes in generic library design and contributed std::ranges to the C++20 Standard Library. For the past few years, he has been working to give C++ a standard async programming model that accommodates massive parallelism and exotic hardware.

About Georgy Evtushenko

Georgy is a member of the CUDA C++ Core Libraries Team at NVIDIA. His core interest has been high-performance computing ever since the beginning of his career. After developing various production HPC applications, his interest turned into delivering Speed-Of-Light performance through high-level C++ abstractions.

About Jeff Larkin

Jeff is a Principal HPC Application Architect in the NVIDIA HPC Software team. He is passionate about the advancement and adoption of parallel programming models for high-performance computing. He was previously a member of the NVIDIA Developer Technology group, specializing in performance analysis and optimization of high performance computing applications. Jeff is also the chair of the OpenACC technical committee and has worked in both the OpenACC and OpenMP standards bodies. Before joining NVIDIA, Jeff worked in the Cray Supercomputing Center of Excellence, located at Oak Ridge National Laboratory.
