New C++ Sender Library Enables Portable Asynchrony

By Eric Niebler, Georgy Evtushenko, and Jeff Larkin at NVIDIA

December 5, 2022

NVIDIA is excited to announce the release of the stdexec library on GitHub and in the 22.11 release of the NVIDIA HPC Software Development Kit. The stdexec library is a proof-of-concept implementation of the Sender model (or Senders) for asynchronous computing—expected to become part of the C++ Standard in C++26. Senders help you separate the logic of your algorithm – the work you actually care about – from the details of where and how it gets executed. Unlike the C++17 parallel algorithms, with Senders you can chain multiple asynchronous computations (like GPU kernels) without unnecessary synchronization.

The upshot is that you can write your program logic once and then decide how to execute it, whether on a single thread, a thread pool, or a GPU, without changing the program logic. It is also possible to target multi-GPU systems or heterogeneous multi-node compute clusters.

The design of stdexec is the product of an industry collaboration between NVIDIA and other members of the C++ Standardization Committee. It gives C++ an extensible, asynchronous programming model, a suite of generic algorithms that capture common async patterns, and hooks that let you say precisely where and how you want your work to execute. The Sender model is slated for standardization with C++26, but stdexec allows you to experiment with it today.

Asynchronous vs. Synchronous Programming

In the previous two posts (Leveraging Standards-Based Parallel Programming in HPC Applications and Why Standards-Based Parallel Programming Should be in Your HPC Toolbox), we discussed the benefits of standards-based parallel programming in the form of the C++17 standard parallel algorithms. The standard parallel algorithms provide an easy way to accelerate your application because they are a drop-in replacement for the classic standard algorithms.

Like the classic standard algorithms, the standard parallel algorithms are synchronous — they block until their work is done. They don’t let you leverage the inherent asynchrony of today’s hardware to hide latencies or to overlap communication and computation. We would need a standard asynchronous programming model and asynchronous parallel algorithms for that, things C++ lacks today.

The importance of asynchrony in HPC is illustrated in the example below, which implements a simulation of Maxwell’s Equations using the C++17 standard parallel algorithms. Maxwell’s equations model propagation of electromagnetic waves. We use the finite-difference time-domain method (FDTD), which requires the computational domain to be represented as a grid of cells. The simulation requires two passes over the grid per iteration to update the magnetic and electric fields. Since the computation in each cell is independent within one iteration, we can parallelize cell updates using the C++17 standard parallel algorithms.

for (int step = 0; step < n_steps; step++) {
   std::for_each(std::execution::par_unseq, cells_begin, cells_end, update_h);
   std::for_each(std::execution::par_unseq, cells_begin, cells_end, update_e);
}

The straightforward CUDA C++ implementation is similar:

for (int step = 0; step < n_steps; step++) {
   kernel<<<grid_blocks, block_threads, 0, stream>>>(n_cells, update_h);
   kernel<<<grid_blocks, block_threads, 0, stream>>>(n_cells, update_e);
}
cudaStreamSynchronize(stream);

Although both of these implementations run on the GPU, the CUDA implementation performs better than the one based on the standard parallel algorithms. That’s because each call to std::for_each() is synchronous, so the latency of kernel launches is not overlapped. We profiled both implementations, and Figure 1 shows the execution timeline. The CUDA implementation is faster because the inherent asynchrony of kernel launches allows the launches for the next iteration to overlap with execution of the current iteration—effectively hiding the kernel launch latency. In contrast, the std::for_each implementation must synchronize after every invocation, which prevents any opportunity for overlap.

Figure 1. Execution timelines for the std::for_each and CUDA implementations. The CUDA implementation is faster because std::for_each has to synchronize, which prevents kernel launches from being overlapped.

To illustrate the importance of exploiting asynchrony, Figure 2 compares the performance of the std::for_each implementation relative to the CUDA implementation across a variety of problem sizes.

Figure 2. Performance of the std::for_each implementation relative to the CUDA implementation across six problem sizes ranging from 16,384 to 16,777,216 cells. The CUDA implementation performs better because it can hide kernel launch latency by overlapping it with kernel execution. The impact is most significant at smaller problem sizes, where launch latency is large relative to overall execution time.

As we can see from Figure 2, the std::for_each implementation suffers at smaller problem sizes where kernel launch latency is large relative to kernel execution time. As discussed above, the CUDA implementation is effective at hiding this latency by overlapping it with useful work. In contrast, the synchronous std::for_each implementation cannot hide this latency and therefore it contributes to the overall execution time. As the problem size increases, kernel launch latency becomes trivial compared to kernel execution time and the performance difference eventually disappears.

While the CUDA C++ implementation is faster, it is platform specific. To achieve the same results in Standard C++, we need a new programming model that allows us to exploit asynchrony. To this end, the Sender model was developed as a new way to describe and execute asynchronous work in Standard C++. Senders are expected to become part of the C++ Standard in C++26, but NVIDIA already provides a proof-of-concept implementation in the stdexec library.

To show the benefits of Senders, Figure 3 compares the performance of the Maxwell’s Equations example using stdexec against the CUDA and std::for_each implementations from earlier. Like the parallel algorithms, the stdexec implementation matches raw CUDA performance for larger problem sizes, but it also performs well at smaller problem sizes, where launch latency dominates. This is because the asynchronous Sender model effectively hides kernel launch latency by overlapping it with the execution of another kernel. All of the Maxwell’s Equations example programs can be found in the stdexec repository on GitHub. Keep reading to find out more about Senders and stdexec.

Figure 3. Speedup relative to the CUDA implementation for both the std::for_each and stdexec implementations across six problem sizes ranging from 16,384 to 16,777,216 cells. The stdexec implementation is on par with the CUDA implementation for all problem sizes because the Sender model allows chaining asynchronous operations without unnecessary synchronization. This enables the stdexec implementation to achieve the same overlap and latency hiding as the CUDA implementation.

A Standard C++ Model for Asynchrony

The results above show why it is important for Standard C++ to have a way to efficiently chain together asynchronous work. The std::execution, or Senders, proposal solves this problem by providing a programming model baked into the language for describing and executing asynchronous operations. The stdexec library introduced above is the NVIDIA proof-of-concept implementation of the Sender model.

The Sender model has two key concepts: a Scheduler, which describes where and how to perform a computation; and a Sender, which is a description of an asynchronous computation. Senders can be chained together to build a pipeline where a Sender produces a value and then sends its value to the next Sender in the chain. The Sender model provides a set of algorithms to create and compose Senders to build completely asynchronous pipelines.

For example, the following code shows a Sender pipeline that implements the Maxwell’s Equations demo from above.

auto compute = stdexec::just()                               // (1)
             | exec::on( scheduler,                          // (2)
                 nvexec::repeat_n( n_steps,                  // (3)
                   stdexec::bulk( n_cells, update_h )        // (4)
                 | stdexec::bulk( n_cells, update_e ) ) );
stdexec::sync_wait( std::move(compute) );                    // (5)

First, you may notice we use several different namespaces. This is to help differentiate what is part of the formal std::execution proposal from the other things stdexec provides. The stdexec:: namespace is for anything that is already part of the std::execution proposal. The exec:: namespace is for generic utilities that are not yet part of the proposal but will be considered for future proposals. Finally, nvexec:: is for NVIDIA-specific schedulers and algorithms.

The pipeline starts with the stdexec::just() Sender (1), which represents an empty computation that gives us an entry point upon which we can chain additional work using the pipe operator (operator|), like the Unix shell.
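As a minimal illustration of this piping style (a sketch, not taken from the example above), just() can also inject starting values that later steps consume:

auto pipeline = stdexec::just(21)                             // Sender that produces 21
              | stdexec::then([](int i) { return i * 2; });   // Sender that transforms it to 42
// Nothing has executed yet; `pipeline` is only a description of the work.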

Next in our chain, we use exec::on() (2) to transition to a new scheduler and continue executing the pipeline there. The on() algorithm takes a scheduler as its first argument to say where the work should happen. The second argument is the work chain to execute. After the pipeline finishes, on() will transition automatically back to the starting execution context – in this case, the thread that called sync_wait() and is waiting for the pipeline to finish.
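Concretely, a sketch of exec::on() with a CPU thread-pool scheduler might look like the following (assuming stdexec’s exec::static_thread_pool; the pool size is arbitrary):

exec::static_thread_pool pool{8};                 // 8 worker threads (arbitrary choice)
auto cpu = pool.get_scheduler();

auto work = stdexec::just()
          | exec::on(cpu, stdexec::then([] { /* runs on a pool thread */ }));
// When the inner chain completes, exec::on() transitions back to the caller's context.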

The nvexec::repeat_n() algorithm (3) repeats the execution of the work chain passed to it a fixed number of times.

Finally, the stdexec::bulk() algorithm (4) is similar to std::for_each() – it invokes the provided function with every index in [0, n_cells). If the scheduler supports parallelism, each invocation of the function may execute in parallel.
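As a rough sketch (reusing the cpu scheduler from the sketch above, with a hypothetical output buffer), bulk() can be used much like a parallel for loop:

std::vector<double> squares(n_cells);             // hypothetical output buffer
auto squares_work = stdexec::just()
                  | exec::on(cpu, stdexec::bulk(n_cells,
                        [&squares](std::size_t i) { squares[i] = double(i) * double(i); }));
// Each index in [0, n_cells) is handed to the lambda, potentially in parallel.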

No work has started at this point; the variable compute above is just a description. To submit the entire pipeline and wait for its completion we use the sync_wait() algorithm (5).
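sync_wait() also returns the values the pipeline sends, wrapped in a std::optional<std::tuple<...>>. A small sketch of retrieving a result:

auto snd = stdexec::just(20, 22)
         | stdexec::then([](int a, int b) { return a + b; });
auto [sum] = stdexec::sync_wait(std::move(snd)).value();      // sum == 42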

Sender expressions like compute above are high-level descriptions of work that capture a computation’s structure and semantics; they are declarative. The scheduler used to run them controls how that description gets translated into the actual instructions that execute that work. Pass a CUDA scheduler, get CUDA execution. Pass a CPU-based thread-pool scheduler and the work is executed on the thread-pool. The translation happens at compile-time so you get native performance.

Application Performance Portability

In the example above, you’ll notice the scheduler is specified in just one place. In the performance results we showed above in Figure 3, we provided a scheduler that executes on the GPU using CUDA. However, we’re free to use other schedulers to execute elsewhere by changing a single line of code. Running this simulation on the CPU is as simple as passing a different scheduler.
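Concretely, a sketch of what that one-line change might look like, assuming stdexec’s nvexec::stream_context and exec::static_thread_pool (the pool size is arbitrary); the compute pipeline itself stays exactly as written above:

// GPU version: a CUDA stream scheduler.
nvexec::stream_context stream_ctx{};
auto scheduler = stream_ctx.get_scheduler();

// CPU version: swap in a thread-pool scheduler instead.
// exec::static_thread_pool pool{std::thread::hardware_concurrency()};
// auto scheduler = pool.get_scheduler();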

For example, Figure 4 below shows the performance of the stdexec pipeline described above using a CPU thread pool scheduler. We compare its performance against two other CPU-based solutions: a parallel std::for_each() with OpenMP, and a highly tuned implementation using raw std::threads. Here we see the solution with raw threads is fastest, closely followed by stdexec with a thread pool, and std::for_each with OpenMP not far behind.

Figure 4. When executed on a CPU-based thread pool, the stdexec implementation of the Maxwell’s Equation simulation is slightly faster than the CPU-based std::for_each implementation and slightly slower than a hand-tuned C++ implementation that uses std::threads directly.

The flexibility of the Sender model also enables us to scale to multiple GPUs trivially. stdexec provides a multi-GPU scheduler that takes advantage of all the GPUs in a system. Our preliminary results show the multi-GPU scheduler achieves 90% strong scaling efficiency on four GPUs, as shown in Figure 5.
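A minimal sketch of how that multi-GPU scheduler might be obtained, assuming the nvexec::multi_gpu_stream_context provided by stdexec; nothing else in the pipeline needs to change:

nvexec::multi_gpu_stream_context multi_gpu_ctx{};
auto scheduler = multi_gpu_ctx.get_scheduler();   // bulk work is spread across all visible GPUs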

Figure 5. When used with a scheduler that launches work on all available GPUs, the performance of the Maxwell’s Equation solution scales almost linearly with the number of GPUs, achieving 90% strong scaling with four GPUs.

Fine-Grained Execution Context Control

It’s common in HPC applications for your work to cascade from one execution context to another to match the workload with the computing resource best suited to it. Senders make it simple to define pipelines that span multiple execution contexts.

For example, consider the code below that uses stdexec for a distributed, multi-node implementation of the Maxwell’s Equation example using MPI for communication. We use a straightforward 2D partitioning of the matrix where each rank operates on a 2D tile of the matrix. It first updates the magnetic field within that tile, then uses MPI to send the new values to the other ranks before doing the same thing for the electric field.

nvexec::stream_scheduler gpu = /*...*/;

auto work = stdexec::just()
          | exec::on( gpu, stdexec::bulk(accessor.own_cells(), update_h) )
          | stdexec::then( mpi_exchange_hx )
          | exec::on( gpu, stdexec::bulk(accessor.own_cells(), update_e) )
          | stdexec::then( mpi_exchange_ez );

stdexec::sync_wait( std::move( work ) );

Here, we use the exec::on() algorithm to transition to the GPU for the computation, and transition back to the CPU to initiate communication in stdexec::then(). The then() algorithm calls the specified function with the values produced by the previous operation as arguments. Since we did not specify a scheduler for the MPI communication performed in stdexec::then(), it runs implicitly on the thread that invoked sync_wait().
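As a small sketch of how then() passes values along (the helper functions here are hypothetical, not part of the example above):

auto step = stdexec::just()
          | stdexec::then([] { return compute_local_residual(); })      // hypothetical helper producing a value
          | stdexec::then([](double r) { mpi_reduce_residual(r); });    // receives the value produced above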

The distributed algorithm above works, but it needlessly serializes operations that can be done in parallel; namely, computation and communication. Each tile has neighbors that are processed on other ranks. Those other ranks don’t need the full results of this rank; they only need the values of the elements at the edges of the tile. We can hide latency by having each rank compute the updated values at the edges of its tile and then send those results while the inner elements of the tile are updated. We would do this iteratively, first for the magnetic field and then for the electric field.

The code below shows how we can modify the example above to overlap the communication of the boundary elements with the computation of interior elements. It uses stdexec::when_all() which takes an arbitrary number of Senders and executes them concurrently. In this example, there are two concurrent operations:

  1. Processing interior elements
  2. Processing and then exchanging the boundary elements

Both of these operations start at the same time, but the boundary cells are processed on a scheduler with a higher CUDA priority to ensure overlap. As soon as the boundary elements have been processed, the updated values are immediately sent to the neighboring ranks while processing of the interior cells may still be ongoing.

auto compute_h = stdexec::when_all(
   stdexec::just() | exec::on(gpu, stdexec::bulk(interior_cells,       // (1)
                                                 interior_h_update)),
   stdexec::just() | exec::on(gpu_with_priority,                       // (2)
                              stdexec::bulk(border_cells,
                                            border_h_update))
                   | stdexec::then(exchange_hx));

Using when_all() to execute these two operations concurrently allows us to overlap the MPI communication inside exchange_hx with the computation of interior cells. This delivers up to 50% speedup compared to the example above. As shown in Figure 6 below, communication overlapping also provides better scaling.

Figure 6. Strong scaling efficiency of the baseline multi-node implementation versus the implementation that uses stdexec::when_all to overlap communication and computation, for GPU counts ranging from eight to 64. The Sender model makes it easy to achieve the overlap that improves performance, especially as the number of nodes increases and communication becomes more expensive.

To illustrate the difference in the strong scaling efficiency we selected a small problem size (three GB/GPU) and started with a single node (eight GPUs per node). For larger problem sizes (40 GB/GPU when launched on a single node), the efficiency of the overlapping version is 93%. It’s important to note that no low-level synchronization primitives such as stream synchronization or thread management were used in the example.

How to Get Started with stdexec

If you want to try stdexec in your application, you can download the NVIDIA HPC SDK 22.11 for free today and experiment with our various compilers and tools. Alternatively, if you want to stay up-to-date with the latest developments, stdexec is actively maintained on GitHub. The NVIDIA HPC SDK nvc++ compiler and stdexec are also available on Compiler Explorer to enable you to easily try it out.
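A minimal, self-contained sketch to verify your setup (assuming the stdexec headers are on your include path; a CPU thread pool avoids any GPU requirements):

#include <stdexec/execution.hpp>
#include <exec/static_thread_pool.hpp>
#include <cstdio>

int main() {
  exec::static_thread_pool pool{4};                          // 4 worker threads (arbitrary)
  auto sched = pool.get_scheduler();

  auto work = stdexec::schedule(sched)                       // start on the pool
            | stdexec::then([] { return 6 * 7; })
            | stdexec::then([](int v) { std::printf("the answer is %d\n", v); });

  stdexec::sync_wait(std::move(work));                       // submit and wait for completion
}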

Happy computing.


About Eric Niebler

Eric Niebler is a Distinguished Engineer and Developer Lead for the CUDA C++ Core Libraries Team at NVIDIA. He’s passionate about improving C++ programmer productivity and software quality with the use of great libraries. He specializes in generic library design and contributed std::ranges to the C++20 Standard Library. For the past few years, he has been working to give C++ a standard async programming model that accommodates massive parallelism and exotic hardware.

About Georgy Evtushenko

Georgy Evtushenko is a member of the CUDA C++ Core Libraries Team at NVIDIA. His core interest has been high-performance computing ever since the beginning of his career. After developing various production HPC applications, his interest turned to delivering Speed-Of-Light performance through high-level C++ abstractions.

About Jeff Larkin

Jeff Larkin is a Principal HPC Application Architect in the NVIDIA HPC Software team. He is passionate about the advancement and adoption of parallel programming models for high-performance computing. He was previously a member of the NVIDIA Developer Technology group, specializing in performance analysis and optimization of high performance computing applications. Jeff is also the chair of the OpenACC technical committee and has worked in both the OpenACC and OpenMP standards bodies. Before joining NVIDIA, Jeff worked in the Cray Supercomputing Center of Excellence, located at Oak Ridge National Laboratory.
