In the fourth of a series of guest posts on heterogeneous computing, James Reinders considers how full heterogeneous programming can be realized with Python today.
You may be surprised how ready Python is for heterogeneous programming, and how easy it is to use today. Our first three articles about heterogeneous programming focused primarily on C++ as we ponder “how to enable programming in the face of an explosion of hardware diversity that is coming?” For a refresher on what motivates this question, check out the first installment.
When considering “how do we program a truly heterogeneous machine?,” broadly we need two things: (1) a way to learn at runtime about all the devices that are available to our application, and (2) a way to utilize the devices to help perform work for our application.
Since utilizing devices involves both data and computation, we are left with three key questions:
- How can our Python program recognize available computational devices? (find, query capabilities, and select, all regardless of vendor and architecture)
- How does our application manage data sharing in a way that is reasonably optimal for the device(s) involved? Expectations here include that we can avoid excessive data movement and take advantage of device memories and memory movement functionality.
- How does our workload specify computations to be offloaded to selected device(s)? Expectations here include the ability to intercept exceptions, including asynchronous errors from code running on an accelerator.
When a program can do all three well, regardless of vendor and architecture, we have made possible open heterogeneous programming. By seeking such open approaches, we aim to increase application portability and reduce unnecessary barriers to using and supporting new hardware innovations.
First, we need to understand how to get parallelism and compiled code because we won’t want to offload serial interpreted code to our accelerator.
Performance starts with Parallelism and Compiling
Numba is an open-source, NumPy-aware optimizing (just-in-time) compiler for Python developed by Anaconda. It uses the LLVM compiler to generate machine code from Python bytecode. Numba can compile a large subset of numerically focused Python, including many NumPy functions. Additionally, Numba has support for automatic parallelization of loops, generation of GPU-accelerated code, and creation of Universal Functions (ufuncs) and C callbacks. Numba includes an auto-parallelizer that was contributed by Intel. The auto-parallelizer can be enabled by setting the parallel=True option in the @numba.jit decorator. It analyzes data-parallel code regions in the compiled function and schedules them for parallel execution.
Numba can automatically parallelize two types of operations: implicitly data-parallel regions, such as NumPy array expressions, NumPy ufuncs, and NumPy reduction functions; and explicitly data-parallel loops that are specified using the numba.prange expression.
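Both categories fit in a few lines. The following is a minimal sketch, assuming NumPy is installed; if Numba is missing, it falls back to plain Python so the code still runs (only slower). The function names `scaled_sum` and `row_norms` are illustrative and do not come from the training notebook.

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:  # assumption: fall back to plain Python so the sketch still runs
    prange = range

    def njit(*args, **kwargs):
        # Support both @njit and @njit(parallel=True) usage.
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]
        return lambda func: func


@njit(parallel=True)
def scaled_sum(a, b):
    # Implicitly data-parallel: an array expression followed by a reduction.
    return np.sum(2.0 * a + b)


@njit(parallel=True)
def row_norms(m):
    # Explicitly data-parallel: each prange iteration writes a distinct element.
    out = np.empty(m.shape[0])
    for i in prange(m.shape[0]):
        out[i] = np.sqrt(np.sum(m[i] * m[i]))
    return out


a = np.arange(4, dtype=np.float64)
b = np.ones(4)
m = np.array([[3.0, 4.0], [6.0, 8.0]])
total = scaled_sum(a, b)
norms = row_norms(m)
```

In `row_norms`, the iterations are independent because each one writes only to `out[i]`; that independence is what makes it safe to swap `range` for `prange`.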
For example, consider a simple Python loop such as:
def f1(a, b, c, N):
    for i in range(N):
        c[i] = a[i] + b[i]
I found an easy 12X improvement (24.3 seconds to 1.9 seconds) even without offloading, and even more with offloading.
We can make it explicitly parallel by changing the serial range (range) to a parallel range (prange) and adding an njit decorator (njit = Numba JIT; with parallel=True it compiles a parallel version):
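from numba import njit, prange

@njit(parallel=True)
def add(a, b, c, N):
    # prange marks the iterations as independent, so Numba may run them in parallel
    for i in prange(N):
        c[i] = a[i] + b[i]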
This dropped from 24.3 seconds to 1.9 seconds when I ran it. Results can easily be more or less depending on the parallelism available on a system. To try it yourself, start with ‘git clone https://github.com/oneapi-src/oneAPI-samples’ and then open the notebook AI-and-Analytics/Jupyter/Numba_DPPY_Essentials_training/Welcome.ipynb. An easy way to do that quickly is to get a free account on DevCloud.
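For a quick comparison outside the notebook, a small timing harness is enough. This is a sketch under stated assumptions: the array size is a hypothetical value chosen to run quickly, `time_call` is an illustrative helper, and the code falls back to plain Python if Numba is not installed (in which case no speedup should be expected). The warm-up call matters because the first invocation of a JIT-compiled function includes compilation time.

```python
import time

import numpy as np

try:
    from numba import njit, prange
except ImportError:  # assumption: fall back to plain Python so the sketch still runs
    prange = range

    def njit(*args, **kwargs):
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]
        return lambda func: func


def add_serial(a, b, c, n):
    for i in range(n):
        c[i] = a[i] + b[i]


@njit(parallel=True)
def add_parallel(a, b, c, n):
    for i in prange(n):
        c[i] = a[i] + b[i]


def time_call(func, *args):
    """Return the wall-clock seconds taken by one call to func."""
    start = time.perf_counter()
    func(*args)
    return time.perf_counter() - start


n = 200_000  # hypothetical problem size, small enough to run quickly
rng = np.random.default_rng(0)
a, b = rng.random(n), rng.random(n)
c_serial, c_parallel = np.empty(n), np.empty(n)

add_parallel(a, b, c_parallel, n)  # warm-up: the first call includes JIT compilation
t_serial = time_call(add_serial, a, b, c_serial, n)
t_parallel = time_call(add_parallel, a, b, c_parallel, n)
print(f"serial: {t_serial:.4f} s, parallel: {t_parallel:.4f} s")
```

The measured ratio depends on core count, memory bandwidth, and problem size, so it will differ from the figures quoted above.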
Do we benefit from accelerators?
The key to understanding how accelerators will help is overcoming the overhead of offloading. For many, the above ability to JIT and parallelize our code is more than enough. Over time, the above could evolve to automatically use tightly connected GPUs (on die, on package, coherently shared memory, etc.).
The techniques we cover in this article can be highly effective if and when our application has sufficient computation in it to overcome the costs of offloading and to be worth the programming time.
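A back-of-the-envelope model makes the trade-off concrete. The sketch below is a deliberate simplification: it assumes offload time is launch overhead plus data transfer plus device compute, with no overlap between them. The function name `offload_pays_off` and every number in the example are hypothetical, not measurements.

```python
def offload_pays_off(host_seconds, device_seconds, bytes_moved,
                     bandwidth_bytes_per_s, launch_overhead_s=0.0):
    """Return (worthwhile, offload_total_seconds) under a simple additive cost model."""
    transfer_seconds = bytes_moved / bandwidth_bytes_per_s
    offload_total = launch_overhead_s + transfer_seconds + device_seconds
    return offload_total < host_seconds, offload_total


# Hypothetical vector add: three arrays of 1e8 float64 values cross the bus.
bytes_moved = 3 * 8 * 100_000_000
bandwidth = 12e9  # hypothetical host-to-device bandwidth in bytes per second

# Little arithmetic per byte: the transfer alone costs more than the host run.
light_ok, light_total = offload_pays_off(0.10, 0.01, bytes_moved, bandwidth)

# Same data, far more arithmetic per byte: offloading wins comfortably.
heavy_ok, heavy_total = offload_pays_off(5.0, 0.5, bytes_moved, bandwidth)
```

With these made-up numbers, the simple addition is not worth offloading while the compute-heavy case is, which is the pattern to look for when deciding where to spend programming effort.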
Compile into a Kernel for Offload
Extending the prior example, we can use Numba Data Parallel Extensions (numba-dpex) to designate a kernel to be compiled and ready for offload. An essentially equivalent computation can be expressed as a kernel as follows (for more details, refer to the Jupyter notebook training):
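import numba_dppy as dppy  # package name used by the numba-dppy releases that provide the dppy alias

@dppy.kernel
def add(a, b, c):
    # each work item handles the element at its global index
    i = dppy.get_global_id(0)
    c[i] = a[i] + b[i]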
The kernel code is compiled and parallelized, as it was previously with @njit for running on the CPU, but this time it is ready for offload to a device. It is compiled into SPIR-V, which the runtime finishes mapping to a device when the kernel is submitted for execution. This gives us a vendor-agnostic solution for offload. It turns out that the earlier code snippet (using only @njit) can also be offloaded as-is without writing a kernel explicitly.
The array arguments to the kernel can be NumPy arrays or USM arrays (an array type explicitly placed in Unified Shared Memory) depending on what we feel fits our programming needs best. Our choice will affect how we set up the data and invoke the kernels.
SYCL bindings solve the three keys
Since SYCL can answer all three key questions we posed, the most consistent and versatile approach is to provide SYCL bindings for Python and use them directly. This is exactly what the open source Data Parallel Control (dpctl: C and Python bindings for SYCL) has done. You can learn more from their GitHub docs and “Interfacing SYCL and Python for XPU Programming.” These bindings enable Python programs to access SYCL devices, queues, and memory, and to execute Python array/tensor operations using SYCL resources. This avoids reinventing solutions, reduces how much we have to learn, and allows a high level of compatibility as well.
Connecting to a device is as simple as:
import dpctl

device = dpctl.select_default_device()
print("Using device ...")
device.print_device_info()
Select any device – regardless of vendor.
The default device can be influenced with an environment variable SYCL_DEVICE_FILTER if we want to control device selection without changing this simple program. The dpctl library also supports programmatic controls to review and select an available device based on hardware properties.
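The programmatic route can be sketched as follows, assuming dpctl may or may not be installed where the code runs: the helpers return empty results instead of failing when it is absent. The helper names `describe_devices` and `pick_device` are illustrative; `dpctl.get_devices()`, `dpctl.SyclDevice(...)`, and the device properties used here are part of dpctl.

```python
try:
    import dpctl
except ImportError:  # assumption: dpctl may not be installed where this runs
    dpctl = None


def describe_devices():
    """Return one dict per SYCL device visible to dpctl (empty list if none)."""
    if dpctl is None:
        return []
    return [
        {
            "name": dev.name,
            "vendor": dev.vendor,
            "type": str(dev.device_type),
            "compute_units": dev.max_compute_units,
        }
        for dev in dpctl.get_devices()
    ]


def pick_device(preferred=("gpu", "cpu")):
    """Return the first device matching a filter string in order, else None."""
    if dpctl is None:
        return None
    for selector in preferred:
        try:
            return dpctl.SyclDevice(selector)
        except Exception:  # no device matches this filter string
            continue
    return None


for info in describe_devices():
    print(info)
```

Trying "gpu" first and falling back to "cpu" keeps the program vendor-neutral: whichever SYCL device is present gets used, and the calling code does not change.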
The kernel can be invoked (offloaded and run) on the device with a couple of lines of Python code:
# global_size is the number of work items, for example the length of the arrays
with dpctl.device_context(device):
    add[global_size, dppy.DEFAULT_LOCAL_SIZE](a, b, c)
Our use of device_context has the runtime do all the necessary data copies (our data was still in standard NumPy arrays) to make it all work. The dpctl library also supports the ability for us to allocate and manage USM memory for devices explicitly. That could be valuable when we get deep into optimization, but the simplicity of letting the runtime handle it for standard NumPy arrays is hard to beat.
Asynchronous vs. Synchronous
Python coding style is easily supported by the synchronous mechanisms shown above. Asynchronous capabilities, and their advantages (reducing or hiding latencies in data movement and kernel invocations), are also available if we want to change our Python code even more. That gives us the ability to run kernels with less latency and move data asynchronously to help maximize performance. To learn more about the asynchronous capabilities, see the dpctl gemv example.
What about CuPy?
CuPy is a reimplementation of a large subset of NumPy. The CuPy array library acts as a drop-in replacement to run existing NumPy/SciPy code on NVIDIA CUDA or AMD ROCm platforms. The massive programming effort needed serves as a considerable barrier to reimplementing for any new platforms.
Such an approach also raises questions about supporting multiple vendors from the same Python program, because it does not address our three questions. For device selection, CuPy requires that a CUDA-enabled GPU device be available. For memory, it offers little direct control but does automatically perform memory pooling to reduce the number of calls to cudaMalloc. When offloading operations, it offers no control over which device is utilized and will fail if no CUDA-enabled GPU is present.
While this is indeed effective for CUDA GPU devices, our application can be more portable when we address all three of the key questions. There is a strong need for portable and architecture-agnostic abilities to write extensions.
What about SciKit-Learn?
Python programming in general is well suited for compute-follows-data, and using enabled routines is beautifully simple. The dpctl library supports a tensor array type that we connect with a specific device. In our program, if we cast our data to a device tensor (e.g., dpctl.tensor.asarray(data, device="gpu:0")) it will be associated with and placed on the device. Using a patched version of SciKit-Learn that recognizes these device tensors, the patched sklearn methods that involve such a tensor are automatically computed on the device. It is a great use of dynamic typing in Python to sense where the data is and direct the computation to be done where the data is. Our Python code changes very little: only the lines where we recast our data to a device tensor. Based on experience thus far, we expect compute-follows-data methods to be the most popular models for Python users.
compute-follows-data: Python dynamic typing allows computation to be directed automatically to where the data is, and it is quite beautiful.
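The dispatch idea behind compute-follows-data can be shown with a toy in plain Python. Everything here is hypothetical and for illustration only: `DeviceArray`, `to_device`, and `mean` are made-up names, no real device is involved, and this is not how dpctl or the patched SciKit-Learn are implemented. The sketch only shows a routine inspecting the type of its argument to decide where the work belongs.

```python
from dataclasses import dataclass


@dataclass
class DeviceArray:
    """Toy stand-in for a device tensor: values plus the device that owns them."""
    values: list
    device: str


def to_device(values, device):
    # Analogous in spirit to casting data to a device tensor.
    return DeviceArray(list(values), device)


def mean(data):
    """Compute the mean where the data lives and report where it ran."""
    if isinstance(data, DeviceArray):
        # A real library would launch a device kernel at this point.
        return sum(data.values) / len(data.values), data.device
    return sum(data) / len(data), "host"


host_result = mean([1.0, 2.0, 3.0])
device_result = mean(to_device([1.0, 2.0, 3.0], "gpu:0"))
```

The caller's code differs only in the single line that places the data; the call to `mean` is identical in both cases.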
Open, Multivendor, Multiarchitecture – Learning Together
Python can be an instrument to embrace the power of diversity of hardware, and harness the impending Cambrian Explosion. Numba Data Parallel Python combined with dpctl to connect us with SYCL, and compute-follows-data patched SciKit-Learn, are worth considering because they are vendor and architecture agnostic.
Open is driving vendor and architecture agnostic solutions that help us all.
While Numba offers great support for NumPy, we can consider what more can be done for SciPy and other Python needs in the future.
The fragmentation of array APIs in Python has generated interest in array-API standardization for Python (read a nice summary) because of the desire to share workloads with devices other than the CPU. A standard array API goes a long way in helping efforts like Numba and dpctl increase their scope and impact. NumPy and CuPy have embraced the array API, and work to adopt it is underway in both dpctl and PyTorch. As more libraries head this way, the task of supporting heterogeneous computing (accelerators of all types) becomes more tractable.
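To show what a standard array API buys us, here is a minimal sketch of a routine written against the standard rather than against one library. It assumes NumPy is installed and uses it for the demonstration; the `__array_namespace__` method is how an array that follows the standard exposes its library, and the fallback to NumPy for arrays without that method is an assumption of this sketch. The names `get_namespace` and `standardize` are illustrative.

```python
import numpy as np


def get_namespace(x):
    # Arrays that follow the array API standard expose their library this way.
    # Assumption: fall back to NumPy for arrays that predate the standard.
    if hasattr(x, "__array_namespace__"):
        return x.__array_namespace__()
    return np


def standardize(x):
    """Scale x to zero mean and unit variance using only standard array-API calls."""
    xp = get_namespace(x)
    centered = x - xp.mean(x)
    return centered / xp.sqrt(xp.mean(centered * centered))


z = standardize(np.asarray([1.0, 2.0, 3.0, 4.0]))
```

Because `standardize` never names NumPy directly, the same function could accept an array from any library that implements the standard, including one whose data lives on an accelerator.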
The simple use of dpctl.device_context is not sufficient in more sophisticated Python code with multiple threads or asynchronous tasks (see GitHub issue). It is likely better to pursue a compute-follows-data policy, at least in more complex threaded Python code. It may become the preferred option over the device_context style of programming.
Python Ready for the Cambrian Explosion
Python is ready today to support open, multivendor, multiarchitecture heterogeneous programming. This enables nearly limitless controls, and they can in turn be buried in easy-to-use Python code everyone can use. I have no doubt that we will continue to see exciting developments with support for heterogeneous programming in Python that will come from feedback as we gain experience through usage.
For learning, there is nothing better than jumping in and trying it out yourself. I have some suggestions for online resources to help.
A firm understanding of best practices for using NumPy is highly recommended: the video “Losing your Loops: Fast Numerical Computing with NumPy” by Jake VanderPlas is a delightfully useful talk on how to use NumPy effectively, by the author of the book Python Data Science Handbook.
For Numba and dpctl, there is a 90-minute video talk covering these concepts in more detail titled “Data Parallel Essentials for Python.” Also, the step-by-step Jupyter notebook based training within the oneAPI samples was mentioned earlier (refer back for the git and file information).
These heterogeneous Python capabilities are all open source, and also come prebuilt with the Intel oneAPI Base and AI Toolkits because they bundle the prebuilt Intel Distribution for Python. A SYCL-enabled NumPy is hosted on GitHub. Numba compiler extensions for kernel programming and automatic offload capabilities are hosted on GitHub. The open source Data Parallel Control (dpctl: C and Python bindings for SYCL) has GitHub docs and a paper, “Interfacing SYCL and Python for XPU Programming.” These bindings enable Python programs to access SYCL devices, queues, and memory, and to execute Python array/tensor operations using SYCL resources.
Exceptions are indeed supported, including asynchronous errors from device code. Asynchronous errors are intercepted once they are rethrown as synchronous exceptions by the async error handler function. This behavior is courtesy of the Python extension generators, and the community documentation for Cython and pybind11 explains it well.
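The pattern of an asynchronous error surfacing as an ordinary exception at a synchronization point can be illustrated with the standard library. This is a plain-Python analogy using concurrent.futures, not dpctl or SYCL code; `kernel` and `run_and_wait` are hypothetical names, and the worker thread merely stands in for device code running asynchronously.

```python
from concurrent.futures import ThreadPoolExecutor


def kernel(values):
    # Stand-in for device code that fails while running asynchronously.
    if any(v < 0 for v in values):
        raise ValueError("negative input reached the kernel")
    return [v ** 0.5 for v in values]


def run_and_wait(values):
    """Submit work asynchronously, then handle any error at the wait."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(kernel, values)  # returns immediately
        try:
            return future.result()  # synchronization point: the error is rethrown here
        except ValueError as err:
            return f"caught: {err}"


ok = run_and_wait([4.0, 9.0])
failed = run_and_wait([4.0, -1.0])
```

The error is raised inside the worker but is caught by a normal try/except in the caller, which is the experience the paragraph above describes for device code.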
Prior Installments in this Series
- Solving Heterogeneous Programming Challenges with SYCL
- Why SYCL: Elephants in the SYCL Room
- Reflecting on the 25th Anniversary of ASCI Red and Continuing Themes for Our Heterogenous Future
About the Author
James Reinders believes the full benefits of the evolution to full heterogeneous computing will be best realized with an open, multivendor, multiarchitecture approach. Reinders rejoined Intel a year ago, specifically because he believes Intel can meaningfully help realize this open future. Reinders is an author (or co-author and/or editor) of ten technical books related to parallel programming; his latest book is about SYCL (it can be freely downloaded here).