Solving Heterogeneous Programming Challenges with Python, Today

By James Reinders

June 30, 2022

In the fourth of a series of guest posts on heterogeneous computing, James Reinders considers how full heterogeneous programming can be realized with Python today.

You may be surprised how ready Python is for heterogeneous programming, and how easy it is to use today. Our first three articles about heterogeneous programming focused primarily on C++ as we pondered how to enable programming in the face of the coming explosion of hardware diversity. For a refresher on what motivates this question, check out the first installment.

Key Questions

When considering “how do we program a truly heterogeneous machine?”, broadly we need two things: (1) a way to learn at runtime about all the devices that are available to our application, and (2) a way to utilize those devices to perform work for our application.

Since utilizing devices involves both data and computation, we are left with three key questions:

  1. How can our Python program recognize available computational devices? (find, query capabilities, and select, all regardless of vendor and architecture)
  2. How does our application manage data sharing in a way that is reasonably optimal for the device(s) involved? Expectations here include that we can avoid excessive data movement and take advantage of device memories and memory-movement functionality.
  3. How does our workload specify computations to be offloaded to selected device(s)? Expectations here include the ability to intercept exceptions, including asynchronous errors raised by code running on an accelerator.

When a program can do all three well, regardless of vendor and architecture, we have made possible open heterogeneous programming. By seeking such open approaches, we aim to increase application portability and reduce unnecessary barriers to using and supporting new hardware innovations.

First, we need to understand how to get parallelism and compiled code because we won’t want to offload serial interpreted code to our accelerator.

Performance starts with Parallelism and Compiling

Numba is an open-source, NumPy-aware optimizing just-in-time (JIT) compiler for Python developed by Anaconda. It uses the LLVM compiler infrastructure to generate machine code from Python bytecode. Numba can compile a large subset of numerically focused Python, including many NumPy functions. Additionally, Numba supports automatic parallelization of loops, generation of GPU-accelerated code, and creation of universal functions (ufuncs) and C callbacks. Numba includes an auto-parallelizer, contributed by Intel, which is enabled by passing the parallel=True option to the @numba.jit decorator. The auto-parallelizer analyzes data-parallel code regions in the compiled function and schedules them for parallel execution.

There are two types of operations that Numba can automatically parallelize:

  1. Implicitly data-parallel regions, such as NumPy array expressions, NumPy ufuncs, and NumPy reduction functions.
  2. Explicitly data-parallel loops, specified using the numba.prange expression.

For example, consider a simple Python loop such as:

  def f1(a, b, c, N):
      for i in range(N):
          c[i] = a[i] + b[i]

Even without offloading, I found an easy 12X improvement (24.3 seconds down to 1.9 seconds), with even more possible with offloading.

We can make it explicitly parallel by changing the serial range (range) to a parallel range (prange) and adding an @njit decorator (njit = Numba JIT = compile a parallel version):

  @njit(parallel=True)
  def add(a, b, c, N):
      for i in prange(N):
          c[i] = a[i] + b[i]

This dropped from 24.3 seconds to 1.9 seconds when I ran it. Results can easily vary, depending on the parallelism available on a system. To try it yourself, start with ‘git clone’ and then open the notebook AI-and-Analytics/Jupyter/Numba_DPPY_Essentials_training/Welcome.ipynb. An easy way to do that quickly is to get a free account on DevCloud.
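The two snippets above can be combined into a runnable sketch. This assumes Numba is installed; the fallback decorator (an addition of mine, not part of the article's notebook) keeps the code runnable, serially, even without Numba:

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # Fallback so the sketch still runs (serially) without Numba.
    prange = range
    def njit(*args, **kwargs):
        if args and callable(args[0]):
            return args[0]
        return lambda f: f

@njit(parallel=True)
def add(a, b, c, N):
    # Numba compiles this loop and runs the iterations in parallel.
    for i in prange(N):
        c[i] = a[i] + b[i]

N = 1_000_000
a = np.ones(N)
b = np.full(N, 2.0)
c = np.empty(N)
add(a, b, c, N)
print(c[:3])  # [3. 3. 3.]
```

The first call triggers JIT compilation; subsequent calls run the compiled, parallelized machine code directly.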

Do we benefit from accelerators?

The key to understanding how accelerators will help is overcoming the overhead of offloading. For many, the above ability to JIT and parallelize our code is more than enough. Over time, the above could evolve to automatically use tightly connected GPUs (on die, on package, coherently shared memory, etc.).

The techniques we cover in this article can be highly effective if and when our application has sufficient computation in it to overcome the costs of offloading and be worth the programming time.

Compile into a Kernel for Offload

Extending the prior example, we can use Numba Data Parallel Extensions (numba-dpex) to designate a kernel to be compiled and ready for offload. An essentially equivalent computation can be expressed as a kernel as follows (for more details, refer to the Jupyter notebook training):

  @dppy.kernel
  def add(a, b, c):
      i = dppy.get_global_id(0)
      c[i] = a[i] + b[i]

The kernel code is compiled and parallelized, just as it was previously with @njit for CPU execution, but this time it is readied for offload to a device. It is compiled into SPIR-V, which the runtime maps to a device when the kernel is submitted for execution. This gives us a vendor-agnostic solution for offload. It turns out that the first code snippet (using only @njit) can also be offloaded as-is, without writing a kernel explicitly.

The array arguments to the kernel can be NumPy arrays or USM arrays (an array type explicitly placed in Unified Shared Memory) depending on what we feel fits our programming needs best. Our choice will affect how we set up the data and invoke the kernels.

SYCL bindings solve the three keys

Since SYCL can answer all three key questions we posed, the most consistent and versatile approach is to provide SYCL bindings for Python and use them directly. This is exactly what the open source Data-Parallel Control (dpctl: C and Python bindings for SYCL) project has done. You can learn more from its GitHub docs and the paper “Interfacing SYCL and Python for XPU Programming.” These bindings enable Python programs to access SYCL devices, queues, and memory, and to execute Python array/tensor operations using SYCL resources. This avoids reinventing solutions, reduces how much we have to learn, and allows a high level of compatibility as well.

Connecting to a device is as simple as:

  device = dpctl.select_default_device()
  print("Using device ...")

Select any device – regardless of vendor.

The default device can be influenced with an environment variable SYCL_DEVICE_FILTER if we want to control device selection without changing this simple program. The dpctl library also supports programmatic controls to review and select an available device based on hardware properties.
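As a sketch of those programmatic controls (assuming the dpctl API for device enumeration; the guard is mine, so the snippet stays runnable on machines without a SYCL runtime):

```python
selected = None
try:
    import dpctl

    # Enumerate every SYCL device the runtime can see, any vendor.
    for d in dpctl.get_devices():
        print(d.name, "-", d.device_type)

    # Pick one by property, e.g. the first GPU if present,
    # otherwise fall back to the runtime's default choice.
    gpus = dpctl.get_devices(device_type="gpu")
    device = gpus[0] if gpus else dpctl.select_default_device()
    selected = device.name
    print("Selected:", selected)
except Exception as exc:
    print("dpctl / SYCL runtime not available:", exc)
```

Because the selection is by queried properties rather than a hard-coded vendor API, the same program adapts to whatever devices are present.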

The kernel can be invoked (offloaded and run) to the device with a couple lines of Python code:

  with dpctl.device_context(device):
      # launch the kernel over N work-items; DEFAULT_LOCAL_SIZE lets the
      # runtime choose the work-group size
      add[N, dppy.DEFAULT_LOCAL_SIZE](a, b, c)

Our use of device_context has the runtime do all the necessary data copies (our data was still in standard NumPy arrays) to make it all work. The dpctl library also supports an ability for us to allocate and manage USM memory for devices explicitly. That could be valuable when we get deep into optimization, but the simplicity of letting the runtime handle it for standard NumPy arrays is hard to beat.
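For the explicit route, a hedged sketch using the dpctl.tensor array API (the try/except guard is my addition so the code runs even where no SYCL runtime is installed):

```python
result = None
try:
    import dpctl
    import dpctl.tensor as dpt

    device = dpctl.select_default_device()

    # Allocate directly in Unified Shared Memory; "shared" USM is
    # visible to both the host and the device, so no manual copies
    # are needed before compute.
    a = dpt.ones(1024, dtype="float32", device=device, usm_type="shared")
    b = dpt.full(1024, 2.0, dtype="float32", device=device, usm_type="shared")

    c = a + b                  # computed where the data lives
    result = dpt.asnumpy(c)    # explicit copy back to the host
    print(result[:3])
except Exception as exc:
    print("dpctl / SYCL runtime not available:", exc)
```

Explicit USM placement trades the convenience of automatic copies for precise control over where data lives, which matters once kernels are invoked repeatedly on the same arrays.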

Asynchronous vs. Synchronous

Python coding style is easily supported by the synchronous mechanisms shown above. Asynchronous capabilities, and their advantages (reducing or hiding the latencies of data movement and kernel invocation), are also available if we are willing to change our Python code a bit more. That gives us the ability to launch kernels with less latency and to move data asynchronously to help maximize performance. To learn more about the asynchronous capabilities, see the dpctl gemv example.

What about CuPy?

CuPy is a reimplementation of a large subset of NumPy. The CuPy array library acts as a drop-in replacement to run existing NumPy/SciPy code on NVIDIA CUDA or AMD ROCm platforms. The massive programming effort needed serves as a considerable barrier to reimplementing for any new platforms.

Such an approach also raises questions about supporting multiple vendors from the same Python program, because it does not address our three questions. For device selection, CuPy requires that a CUDA-enabled GPU device be available. For memory, it offers little direct control, but it does automatically perform memory pooling to reduce the number of calls to cudaMalloc. When offloading operations, it offers no control over which device is utilized and will fail if no CUDA-enabled GPU is present.

While this is indeed effective for CUDA GPU devices, our application gains more portability when we address all three of the key questions openly. There is a strong need for portable, architecture-agnostic ways to write extensions.

What about SciKit-Learn?

Python programming in general is well suited for compute-follows-data, and using enabled routines is beautifully simple. The dpctl library supports a tensor array type that we connect with a specific device. In our program, if we cast our data to a device tensor (e.g., dpctl.tensor.asarray(data, device="gpu:0")), it will be associated with and placed on that device. Using a patched version of scikit-learn that recognizes these device tensors, the patched sklearn methods that involve such a tensor are automatically computed on the device. It is a great use of dynamic typing in Python to sense where the data is and to direct the computation to be done where the data is. Our Python code changes very little, only the lines where we recast our tensors as device tensors. Based on experience thus far, we expect compute-follows-data methods to be the most popular models for Python users.
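The dispatch idea is easy to see in miniature. This toy sketch (hypothetical types of mine, not the dpctl or sklearn API) dispatches on the array's type the same way a patched sklearn method dispatches on a device tensor:

```python
class HostArray(list):
    """Toy stand-in for a NumPy array living in host memory."""

class DeviceArray(list):
    """Toy stand-in for a device tensor placed on an accelerator."""
    device = "gpu:0"

def mean(x):
    # Compute-follows-data: inspect where the data lives, then
    # direct the computation to run there.
    if isinstance(x, DeviceArray):
        where = f"on {x.device}"
    else:
        where = "on the host CPU"
    return sum(x) / len(x), where

print(mean(HostArray([1, 2, 3])))    # (2.0, 'on the host CPU')
print(mean(DeviceArray([1, 2, 3])))  # (2.0, 'on gpu:0')
```

The caller never names a device; the data's type carries that information, which is why so little application code changes.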

compute-follows-data: Python dynamic typing allows computation to be directed automatically to where the data is. It is quite beautiful.

Open, Multivendor, Multiarchitecture – Learning Together

Python can be an instrument to embrace the power of hardware diversity and harness the impending Cambrian Explosion. Numba Data Parallel Python, combined with dpctl to connect us with SYCL, and a compute-follows-data patched scikit-learn are worth considering because they are vendor and architecture agnostic.

Open is driving vendor- and architecture-agnostic solutions that help us all.

While Numba offers great support for NumPy, we can consider what more can be done for SciPy and other Python needs in the future.

The fragmentation of array APIs in Python has generated interest in array-API standardization for Python (read a nice summary), driven by the desire to share workloads with devices other than the CPU. A standard array API goes a long way toward helping efforts like Numba and dpctl increase their scope and impact. NumPy and CuPy have embraced the array API, and adoption work is underway in both dpctl and PyTorch. As more libraries head this way, the task of supporting heterogeneous computing (accelerators of all types) becomes more tractable.

The simple use of dpctl.device_context is not sufficient in more sophisticated Python codes with multiple threads or asynchronous tasks (see github issue). It is likely better to pursue a compute-follows-data policy, at least in more complex threaded Python code. It may become the preferred option over the device_context style of programming.

Python Ready for the Cambrian Explosion

Python is ready today for open, multivendor, multiarchitecture heterogeneous programming. This enables nearly limitless control, which can in turn be buried in easy-to-use Python code everyone can use. I have no doubt that we will continue to see exciting developments in support for heterogeneous programming in Python, driven by feedback as we gain experience through usage.

Learn More

For learning, there is nothing better than jumping in and trying it out yourself. I have some suggestions for online resources to help.

A firm understanding of best practices for using NumPy is highly recommended: the video Losing Your Loops: Fast Numerical Computing with NumPy by Jake VanderPlas is a delightfully useful talk on how to use NumPy effectively, by the author of the book Python Data Science Handbook.

For Numba and dpctl, there is a 90 minute video talk covering these concepts in more detail titled “Data Parallel Essentials for Python.” Also, the step-by-step Jupyter notebook based training within the oneAPI samples was mentioned earlier (refer back for the git and file information).

These heterogeneous Python capabilities are all open source, and they also come prebuilt with the Intel oneAPI Base and AI Toolkits, which bundle the prebuilt Intel Distribution for Python. A SYCL-enabled NumPy is hosted on GitHub. Numba compiler extensions for kernel programming and automatic offload capabilities are hosted on GitHub. The open source Data-Parallel Control (dpctl: C and Python bindings for SYCL) has GitHub docs and a paper, Interfacing SYCL and Python for XPU Programming. These enable Python programs to access SYCL devices, queues, and memory, and to execute Python array/tensor operations using SYCL resources.

Exceptions are indeed supported, including asynchronous errors from device code. Async errors are intercepted once they are rethrown as synchronous exceptions by an async error handler function. This behavior comes courtesy of Python extension generators; community documentation explains it well for Cython and pybind11.

Prior Installments in this Series

  1. Solving Heterogeneous Programming Challenges with SYCL
  2. Why SYCL: Elephants in the SYCL Room
  3. Reflecting on the 25th Anniversary of ASCI Red and Continuing Themes for Our Heterogeneous Future

About the Author

James Reinders believes the full benefits of the evolution to full heterogeneous computing will be best realized with an open, multivendor, multiarchitecture approach. Reinders rejoined Intel a year ago, specifically because he believes Intel can meaningfully help realize this open future. Reinders is an author (or co-author and/or editor) of ten technical books related to parallel programming; his latest book is about SYCL (it can be freely downloaded here). 
