Solving Heterogeneous Programming Challenges with Python, Today

By James Reinders

June 30, 2022

In the fourth of a series of guest posts on heterogeneous computing, James Reinders considers how full heterogeneous programming can be realized with Python today.

You may be surprised how ready Python is for heterogeneous programming, and how easy it is to use today. Our first three articles about heterogeneous programming focused primarily on C++ as we pondered how to enable programming in the face of the coming explosion of hardware diversity. For a refresher on what motivates this question, check out the first installment.

Key Questions

When considering “how do we program a truly heterogeneous machine?,” broadly we need two things: (1) a way to learn at runtime about all the devices that are available to our application, and (2) a way to utilize the devices to help perform work for our application.

Since utilizing devices involves both data and computation, we are left with three key questions:

  1. How can our Python program recognize available computational devices? (find, query capabilities, and select, all regardless of vendor and architecture)
  2. How does our application manage data sharing in a way that is reasonably optimal for the device(s) involved? Expectations here include that we can avoid excessive data movement and take advantage of device memories and memory-movement functionality.
  3. How does our workload specify computations to be offloaded to selected device(s)? Expectations here include the ability to intercept exceptions, including asynchronous errors raised by code running on an accelerator.

When a program can do all three well, regardless of vendor and architecture, we have made possible open heterogeneous programming. By seeking such open approaches, we aim to increase application portability and reduce unnecessary barriers to using and supporting new hardware innovations.

First, we need to understand how to get parallelism and compiled code because we won’t want to offload serial interpreted code to our accelerator.

Performance starts with Parallelism and Compiling

Numba is an open-source, NumPy-aware optimizing (just-in-time) compiler for Python developed by Anaconda. It uses the LLVM compiler to generate machine code from Python bytecode. Numba can compile a large subset of numerically focused Python, including many NumPy functions. Additionally, Numba supports automatic parallelization of loops, generation of GPU-accelerated code, and creation of universal functions (ufuncs) and C callbacks. Numba includes an auto-parallelizer, contributed by Intel, that can be enabled by passing the parallel=True option to the @numba.jit decorator. The auto-parallelizer analyzes data-parallel code regions in the compiled function and schedules them for parallel execution.

There are two types of operations that Numba can automatically parallelize: implicitly data-parallel regions, such as NumPy array expressions, NumPy ufuncs, and NumPy reduction functions; and explicitly data-parallel loops specified with the numba.prange expression.

For example, consider a simple Python loop such as:

  def f1(a, b, c, N):
      for i in range(N):
          c[i] = a[i] + b[i]

Next, I found an easy 12X improvement (24.3 seconds down to 1.9 seconds) even without offloading, and even more with offloading.

We can make it explicitly parallel by changing the serial range (range) to a parallel range (prange) and adding an @njit decorator (njit = Numba JIT; compile a parallel version):

  from numba import njit, prange

  @njit(parallel=True)
  def add(a,b,c,N):
      for i in prange(N):
          c[i] = a[i] + b[i]

This dropped from 24.3 seconds to 1.9 seconds when I ran it. Results can easily be more or less depending on the parallelism available on a system. To try it yourself, start with ‘git clone https://github.com/oneapi-src/oneAPI-samples’ and then open the notebook AI-and-Analytics/Jupyter/Numba_DPPY_Essentials_training/Welcome.ipynb. An easy way to do that quickly is to get a free account on DevCloud.
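To reproduce this kind of comparison yourself, here is a minimal self-contained sketch that times the serial loop against the Numba-parallel one. It includes a pure-Python fallback (an assumption for portability, not part of the original example) so it still runs where Numba is not installed; in that case no speedup should be expected, and actual speedups will vary with the parallelism available on your system:

```python
import time
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # Fallback so the sketch runs without Numba; it provides none of the
    # compilation or parallelism benefits, so expect no speedup from it.
    def njit(**kwargs):
        return lambda func: func
    prange = range

def f1(a, b, c, N):                 # serial, interpreted loop
    for i in range(N):
        c[i] = a[i] + b[i]

@njit(parallel=True)
def add(a, b, c, N):                # compiled, explicitly parallel loop
    for i in prange(N):
        c[i] = a[i] + b[i]

N = 100_000
a = np.arange(N, dtype=np.float64)
b = np.ones(N, dtype=np.float64)
c1 = np.empty(N, dtype=np.float64)
c2 = np.empty(N, dtype=np.float64)

add(a, b, c2, N)                    # warm-up call triggers JIT compilation

t0 = time.perf_counter(); f1(a, b, c1, N); t1 = time.perf_counter()
add(a, b, c2, N);                   t2 = time.perf_counter()
print(f"serial {t1 - t0:.4f}s, parallel {t2 - t1:.4f}s")
```

The warm-up call matters: Numba compiles on the first invocation, so timing that first call would measure compilation rather than execution.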

Do we benefit from accelerators?

The key to understanding how accelerators will help is overcoming the overhead of offloading. For many, the above ability to JIT and parallelize our code is more than enough. Over time, the above could evolve to automatically use tightly connected GPUs (on die, on package, coherently shared memory, etc.).

The techniques we cover in this article can be highly effective if and when our application has sufficient computation to overcome the costs of offloading and be worth the programming time.

Compile into a Kernel for Offload

Extending the prior example, we can use Numba Data Parallel Extensions (numba-dpex) to designate a kernel to be compiled and ready for offload. An essentially equivalent computation can be expressed as a kernel as follows (for more details, refer to the Jupyter notebook training):

  import numba_dppy as dppy

  @dppy.kernel
  def add(a, b, c):
      i = dppy.get_global_id(0)
      c[i] = a[i] + b[i]

The kernel code is compiled and parallelized, just as it was previously with @njit for the CPU, but this time it is ready for offload to a device. It is compiled into SPIR-V, which the runtime finishes mapping to a device when the kernel is submitted for execution. This gives us a vendor-agnostic solution for offload. It turns out that the first code snippet (using only @njit) can also be offloaded as-is, without explicitly writing a kernel.

The array arguments to the kernel can be NumPy arrays or USM arrays (an array type explicitly placed in Unified Shared Memory) depending on what we feel fits our programming needs best. Our choice will affect how we set up the data and invoke the kernels.
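As a sketch of the USM option: dpctl's tensor API can place data in Unified Shared Memory via an explicit usm_type. This example is hedged, assuming dpctl and a working SYCL runtime are present; it falls back to the host when they are not:

```python
import numpy as np

host_data = np.arange(10, dtype=np.float64)

try:
    import dpctl.tensor as dpt

    # Place the data in Unified Shared Memory. usm_type may be "device"
    # (device-only access), "shared" (host- and device-visible), or "host".
    a_usm = dpt.asarray(host_data, usm_type="shared")

    # Copy back to a NumPy array when host code needs the values.
    result = dpt.asnumpy(a_usm)
except Exception:
    # dpctl is absent, or no SYCL runtime/device is available: stay on the host.
    result = host_data.copy()
```

With a USM array, we take on explicit placement decisions; with plain NumPy arrays, the runtime copies data for us at kernel invocation, as shown in the earlier snippets.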

SYCL bindings solve the three keys

Since SYCL can answer all three key questions we posed, the most consistent and versatile approach is to provide SYCL bindings for Python and use them directly. This is exactly what the open-source Data-Parallel Control library (dpctl: C and Python bindings for SYCL) has done. You can learn more from its GitHub docs and the paper "Interfacing SYCL and Python for XPU Programming." These bindings enable Python programs to access SYCL devices, queues, and memory, and to execute Python array/tensor operations using SYCL resources. This avoids reinventing solutions, reduces how much we have to learn, and allows a high level of compatibility as well.

Connecting to a device is as simple as:

  import dpctl

  device = dpctl.select_default_device()
  print("Using device ...")
  device.print_device_info()

Select any device – regardless of vendor.

The default device can be influenced with an environment variable SYCL_DEVICE_FILTER if we want to control device selection without changing this simple program. The dpctl library also supports programmatic controls to review and select an available device based on hardware properties.
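A sketch of such programmatic selection, using dpctl's documented get_devices enumeration and device_type filter (pick_device is a hypothetical helper for illustration; it returns None when no SYCL runtime is available):

```python
def pick_device():
    """Return a SYCL device, preferring GPUs; None if no runtime is available."""
    try:
        import dpctl
    except ImportError:
        return None
    try:
        # Enumerate devices regardless of vendor; filter to GPUs if any exist.
        gpus = dpctl.get_devices(device_type=dpctl.device_type.gpu)
        return gpus[0] if gpus else dpctl.select_default_device()
    except Exception:
        # dpctl is installed but no usable SYCL device/runtime was found.
        return None

device = pick_device()
print("selected:", device)
```

The same get_devices call also accepts a backend filter, so an application can express preferences (say, Level Zero over OpenCL) without hard-coding any vendor.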

The kernel can be invoked (offloaded and run) to the device with a couple lines of Python code:

  with dpctl.device_context(device):
      add[global_size, dppy.DEFAULT_LOCAL_SIZE](a, b, c)

Our use of device_context has the runtime perform all the necessary data copies (our data was still in standard NumPy arrays) to make it all work. The dpctl library also supports the ability for us to allocate and manage USM memory for devices explicitly. That can be valuable when we get deep into optimization, but the simplicity of letting the runtime handle standard NumPy arrays is hard to beat.

Asynchronous vs. Synchronous

Python coding style is easily supported by the synchronous mechanisms shown above. Asynchronous capabilities, and their advantages (reducing or hiding latencies in data movement and kernel invocations), are also available if we are willing to change our Python code a bit more. That gives us the ability to run kernels with less latency and move data asynchronously to help maximize performance. To learn more about the asynchronous capabilities, see the dpctl gemv example.

What about CuPy?

CuPy is a reimplementation of a large subset of NumPy. The CuPy array library acts as a drop-in replacement to run existing NumPy/SciPy code on NVIDIA CUDA or AMD ROCm platforms. The massive programming effort needed serves as a considerable barrier to reimplementing it for any new platform.

Such an approach also raises questions about supporting multiple vendors from the same Python program, because it does not address our three questions. For device selection, CuPy requires that a CUDA-enabled GPU device be available. For memory, it offers little direct control, though it automatically performs memory pooling to reduce the number of calls to cudaMalloc. When offloading operations, it offers no control over which device is utilized and will fail if no CUDA-enabled GPU is present.

While this is indeed effective for CUDA GPU devices, we gain more portability for our application when we are open to addressing all three of the key questions. There is a strong need for portable, architecture-agnostic ways to write extensions.

What about SciKit-Learn?

Python programming in general is well suited for compute-follows-data, and using enabled routines is beautifully simple. The dpctl library supports a tensor array type that we connect with a specific device. In our program, if we cast our data to a device tensor (e.g., dpctl.tensor.asarray(data, device="gpu:0")) it will be associated with, and placed on, the device. Using a patched version of SciKit-Learn that recognizes these device tensors, the patched sklearn methods that involve such a tensor are automatically computed on the device. It is a great use of dynamic typing in Python to sense where the data is and direct the computation to be done there. Our Python code changes very little: only the lines where we recast our tensors as device tensors. Based on experience thus far, we expect compute-follows-data methods to be the most popular models for Python users.

compute-follows-data: Python dynamic typing allows computation to be directed automatically to where the data is – it is quite beautiful.
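A sketch of the pattern, using patch_sklearn from the Intel Extension for Scikit-Learn (sklearnex). The KMeans workload, the "gpu:0" device string, and the fallback path are illustrative assumptions; the example degrades gracefully when sklearnex, dpctl, or a GPU is absent:

```python
import numpy as np

X = np.random.rand(1000, 3)  # illustrative workload: 1000 points, 3 features

try:
    # Intel Extension for Scikit-Learn: patches sklearn estimators in place.
    from sklearnex import patch_sklearn
    patch_sklearn()

    import dpctl.tensor as dpt
    from sklearn.cluster import KMeans

    # Compute follows data: placing the tensor on a device directs the
    # patched estimator to run there. "gpu:0" assumes such a device exists.
    X_device = dpt.asarray(X, device="gpu:0")
    labels = KMeans(n_clusters=4, random_state=0).fit_predict(X_device)
except Exception:
    labels = None  # sklearnex/dpctl or a GPU device is not available here

print("computed on device" if labels is not None else "device path unavailable")
```

Note that only the asarray line mentions a device; the estimator call itself is unchanged sklearn code, which is the point of compute-follows-data.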

Open, Multivendor, Multiarchitecture – Learning Together

Python can be an instrument to embrace the power of hardware diversity and harness the impending Cambrian Explosion. Numba Data Parallel Python combined with dpctl to connect us with SYCL, and compute-follows-data patched SciKit-Learn, are worth considering because they are vendor and architecture agnostic.

Open is driving vendor- and architecture-agnostic solutions that help us all.

While Numba offers great support for NumPy, we can consider what more can be done for SciPy and other Python needs in the future.

The fragmentation of array APIs in Python has generated interest in array-API standardization for Python (read a nice summary) because of the desire to share workloads with devices other than the CPU. A standard array API goes a long way toward helping efforts like Numba and dpctl increase their scope and impact. NumPy and CuPy have embraced the array API, and adoption work is underway for both dpctl and PyTorch. As more libraries head this way, the task of supporting heterogeneous computing (accelerators of all types) becomes more tractable.
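The practical payoff of the standard is that library code can be written once against whatever namespace a given array type exposes. A minimal sketch of the __array_namespace__ protocol the standard defines (the fallback to plain NumPy is an assumption for arrays that predate the protocol):

```python
import numpy as np

def scaled_norm(x):
    # Array-API style: ask the array for its own namespace, so the same
    # function can work on NumPy, CuPy, dpctl.tensor, etc. without
    # importing any of them directly.
    xp = x.__array_namespace__() if hasattr(x, "__array_namespace__") else np
    return xp.sqrt(xp.sum(x * x)) / x.shape[0]

v = np.asarray([3.0, 4.0])
print(scaled_norm(v))  # sqrt(9 + 16) / 2 = 2.5
```

A function written this way computes on whichever device holds its input array, which is exactly the compute-follows-data behavior discussed above.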

The simple use of dpctl.device_context is not sufficient in more sophisticated Python codes with multiple threads or asynchronous tasks (see github issue). It is likely better to pursue a compute-follows-data policy, at least in more complex threaded Python code. It may become the preferred option over the device_context style of programming.

Python Ready for the Cambrian Explosion

Python is ready today for open, multivendor, multiarchitecture heterogeneous programming. This enables nearly limitless control, which can in turn be buried in easy-to-use Python code everyone can use. I have no doubt that we will continue to see exciting developments in support for heterogeneous programming in Python, driven by feedback as we gain experience through usage.

Learn More

For learning, there is nothing better than jumping in and trying it out yourself. I have some suggestions for online resources to help.

A firm understanding of best practices for using NumPy is highly recommended: the video Losing Your Loops: Fast Numerical Computing with NumPy by Jake VanderPlas is a delightfully useful talk on how to use NumPy effectively, by the author of the book Python Data Science Handbook.

For Numba and dpctl, there is a 90 minute video talk covering these concepts in more detail titled “Data Parallel Essentials for Python.” Also, the step-by-step Jupyter notebook based training within the oneAPI samples was mentioned earlier (refer back for the git and file information).

These heterogeneous Python capabilities are all open source, and also come prebuilt with the Intel oneAPI Base and AI Toolkits because they bundle the prebuilt Intel Distribution for Python. A SYCL-enabled NumPy is hosted on GitHub. Numba compiler extensions for kernel programming and automatic offload capabilities are hosted on GitHub. The open-source Data-Parallel Control library (dpctl: C and Python bindings for SYCL) has GitHub docs and a paper, Interfacing SYCL and Python for XPU Programming. These enable Python programs to access SYCL devices, queues, and memory, and to execute Python array/tensor operations using SYCL resources.

Exceptions are indeed supported, including asynchronous errors from device code. Async errors are intercepted once they are rethrown as synchronous exceptions by an async error handler function. This behavior is courtesy of Python extension generators; community documentation explains it well for Cython and pybind11.

Prior Installments in this Series

  1. Solving Heterogeneous Programming Challenges with SYCL
  2. Why SYCL: Elephants in the SYCL Room
  3. Reflecting on the 25th Anniversary of ASCI Red and Continuing Themes for Our Heterogenous Future

About the Author

James Reinders believes the full benefits of the evolution to full heterogeneous computing will be best realized with an open, multivendor, multiarchitecture approach. Reinders rejoined Intel a year ago, specifically because he believes Intel can meaningfully help realize this open future. Reinders is an author (or co-author and/or editor) of ten technical books related to parallel programming; his latest book is about SYCL (it can be freely downloaded here). 
