Spelunking the HPC and AI GPU Software Stacks

By Kevin Jackson and Doug Eadline

June 21, 2024

As AI continues to reach into every domain of life, the question remains as to what kind of software these tools will run on. The choice in software stacks – or collections of software components that work together to enable specific functionality on a computing system – is becoming even more relevant in the GPU-centric computing needs of AI tasks.

With AI and HPC applications pushing the limits of computational power, the choice of software stack can significantly impact performance, efficiency, and developer productivity.

Currently, there are three major players in the software stack competition: Nvidia’s Compute Unified Device Architecture (CUDA), Intel’s oneAPI, and AMD’s Radeon Open Compute (ROCm). While each has pros and cons, Nvidia’s CUDA continues to dominate largely because its hardware has led the way in HPC and now AI.

Here, we will delve into the intricacies of each of these software stacks – exploring their capabilities, hardware support, and integration with the popular AI framework PyTorch. In addition, we will conclude with a quick look at two higher-level HPC languages: Chapel and Julia.

Nvidia’s CUDA

Nvidia’s CUDA is the company’s proprietary parallel computing platform and software stack meant for general-purpose computing on their GPUs. CUDA provides an application programming interface (API) that enables software to leverage the parallel processing capabilities of Nvidia GPUs for accelerated computation.

CUDA must be mentioned first because it dominates the software stack space for AI and GPU-heavy HPC tasks – and for good reason. CUDA has been around since 2006, which gives it a long history of third-party support and a mature ecosystem. Many libraries, frameworks, and other tools have been optimized specifically for CUDA and Nvidia GPUs. This long-held support for the CUDA stack is one of its key advantages over other stacks.

Nvidia provides a comprehensive toolset as part of the CUDA platform, including CUDA compilers like Nvidia CUDA Compiler (NVCC). There are also many debuggers and profilers for debugging and optimizing CUDA applications and development tools for distributing CUDA applications. Additionally, CUDA’s long history has given rise to extensive documentation, tutorials, and community resources.

CUDA’s support for the PyTorch framework is also essential when discussing AI tasks. This package is an open-source machine learning library based on the Torch library, and it is primarily used for applications in computer vision and natural language processing. PyTorch has extensive and well-established support for CUDA. CUDA integration in PyTorch is highly optimized, which enables efficient training and inference on Nvidia GPUs. Again, CUDA’s maturity means access to numerous libraries and tools that PyTorch can use.
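
In practice, using CUDA from PyTorch usually comes down to a device selection. Below is a minimal sketch (the helper name is hypothetical; the `torch.cuda` calls are PyTorch's public API) that falls back to the CPU when PyTorch or an Nvidia GPU is absent:

```python
def select_cuda_device():
    """Return "cuda" when PyTorch reports a usable Nvidia GPU, else "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch not installed: stay on the CPU
    return "cuda" if torch.cuda.is_available() else "cpu"

# Typical use in a training script (assuming `model` and `batch` exist):
#   device = torch.device(select_cuda_device())
#   model = model.to(device)   # move weights into GPU memory
#   batch = batch.to(device)   # inputs must live on the same device
```

Because the rest of the training loop is device-agnostic, this one decision is often the only CUDA-specific line in a PyTorch script.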

In addition to a raft of accelerated libraries, Nvidia also offers a complete deep-learning software stack for AI researchers and software developers. This stack includes the popular CUDA Deep Neural Network library (cuDNN), a GPU-accelerated library of primitives for deep neural networks. cuDNN accelerates widely used deep learning frameworks, including Caffe2, Chainer, Keras, MATLAB, MXNet, PaddlePaddle, PyTorch, and TensorFlow.

What’s more, CUDA is designed to work with all Nvidia GPUs, from consumer-grade GeForce video cards to high-end data center GPUs – giving users a wide range of versatility within the hardware they can use.

That said, CUDA is not without drawbacks, and users must weigh several downsides of Nvidia’s software stack. To begin, though freely available, CUDA is a proprietary technology owned by Nvidia and is, therefore, not open source. This situation locks developers into Nvidia’s ecosystem and hardware, as applications developed on CUDA cannot run on non-Nvidia GPUs without significant code changes or compatibility layers. In a similar vein, the proprietary nature of CUDA means that the software stack’s development roadmap is controlled solely by Nvidia, and developers have limited ability to contribute to or modify the CUDA codebase.

Developers must also consider CUDA’s licensing costs. CUDA itself is free for non-commercial use, but commercial applications may require purchasing expensive Nvidia hardware and software licenses.

AMD’s ROCm

AMD’s ROCm is another software stack that many developers choose. While CUDA may dominate the space, ROCm is distinct because it is an open-source software stack for GPU computing. This feature allows developers to customize and contribute to the codebase, fostering collaboration and innovation within the community. One of the critical advantages of ROCm is its support for both AMD and Nvidia GPUs, which allows for cross-platform development.

This unique feature is enabled by the Heterogeneous-computing Interface for Portability (HIP), which lets developers create portable applications that can run on different GPU platforms. While ROCm supports both consumer and professional AMD GPUs, its major focus is on AMD’s high-end Instinct and Radeon Pro GPUs designed for professional workloads.

Like CUDA, ROCm provides a range of tools for GPU programming. These include C/C++ compilers like the ROCm Compiler Collection, AOMP, and AMD Optimizing C/C++ Compiler, as well as Fortran Compilers like Flang. There are also libraries for a variety of domains, such as linear algebra, FFT, and deep learning.

That said, ROCm’s ecosystem is relatively young compared to CUDA’s and still trails it in third-party support, libraries, and tools. Being late to the game also means more limited documentation and community resources than the extensive tutorials and support available for CUDA. This gap is especially visible with PyTorch: the framework supports the ROCm platform, but performance, optimization, and third-party support lag behind CUDA’s, and documentation for PyTorch on ROCm remains sparser. However, AMD is making progress on this front.

Like Nvidia, AMD also provides a hefty load of ROCm libraries. AMD offers an equivalent to cuDNN called MIOpen for deep learning, which is used in the ROCm version of PyTorch (and other popular tools).
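
One practical consequence for PyTorch users: ROCm builds of PyTorch reuse the familiar `torch.cuda` device API (HIP translates the calls underneath), so a build-level check is needed to tell the two stacks apart. A hedged sketch, with a hypothetical helper name:

```python
def gpu_stack():
    """Identify which GPU stack, if any, this PyTorch build targets.

    ROCm builds of PyTorch deliberately reuse the torch.cuda namespace,
    so torch.version.hip (set only in ROCm builds), not the device API,
    is what distinguishes ROCm from CUDA.
    """
    try:
        import torch
    except ImportError:
        return "none"
    if getattr(torch.version, "hip", None):  # present only in ROCm builds
        return "rocm"
    if torch.cuda.is_available():            # a genuine CUDA device
        return "cuda"
    return "none"
```

The upshot is that most PyTorch model code written for CUDA runs unmodified on a ROCm build, which is exactly the portability HIP is designed to provide.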

Additionally, while ROCm supports both AMD and Nvidia GPUs, its performance may not match CUDA’s when running on Nvidia hardware due to driver overhead and optimization challenges.

Intel’s oneAPI

Intel’s oneAPI is a unified, cross-platform programming model that enables development for a wide range of hardware architectures and accelerators. It supports multiple architectures, including CPUs, GPUs, FPGAs, and AI accelerators from various vendors. It aims to provide a vendor-agnostic solution for heterogeneous computing and leverages industry standards like SYCL. This feature means that it can run on architectures from outside vendors like AMD and Nvidia as well as on Intel’s hardware.

Like ROCm, oneAPI is an open-source platform. As such, there is more community involvement and contribution to the codebase compared to CUDA. Embracing open-source development, oneAPI supports a range of programming languages and frameworks, including C/C++ with SYCL, Fortran, Python, and TensorFlow. Additionally, oneAPI provides a unified programming model for heterogeneous computing, simplifying development across diverse hardware.

Again, like ROCm, oneAPI has some disadvantages related to the stack’s maturity. As a younger platform, oneAPI needs to catch up to CUDA regarding third-party software support and optimization for specific hardware architectures.

When looking at PyTorch specifically, oneAPI support is still in its early stages compared to the well-established CUDA integration. PyTorch can leverage oneAPI’s Data Parallel Python (DPPy) library for distributed training on Intel CPUs and GPUs, but native PyTorch support for oneAPI GPUs is still in development and not yet production-ready.
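
For readers who want to experiment, here is a rough sketch of probing for Intel GPU support from PyTorch. The helper name is hypothetical; recent PyTorch builds expose a `torch.xpu` module for oneAPI devices, while older versions require the `intel_extension_for_pytorch` package to add it:

```python
def has_intel_xpu():
    """Report whether this PyTorch installation exposes a usable "xpu"
    (Intel GPU) device via the oneAPI-backed torch.xpu module."""
    try:
        import torch
    except ImportError:
        return False
    xpu = getattr(torch, "xpu", None)  # absent on non-oneAPI builds
    return xpu is not None and xpu.is_available()
```

When the probe succeeds, tensors and models can be moved with `.to("xpu")`, mirroring the `.to("cuda")` idiom.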

That said, it’s important to note that oneAPI’s strength lies in its open standards-based approach and potential for cross-platform portability. oneAPI could be a viable option if vendor lock-in is a concern and the ability to run PyTorch models on different hardware architectures is a priority.

For now, if maximum performance on Nvidia GPUs is the primary goal for developers with PyTorch workloads, CUDA remains the preferred choice due to its well-established ecosystem. That said, developers seeking vendor-agnostic solutions or those primarily using AMD or Intel hardware may wish to rely on ROCm or oneAPI, respectively.

While CUDA has a head start regarding ecosystem development, its proprietary nature and hardware specificity may make ROCm and oneAPI more advantageous solutions for certain developers. Also, as time passes, community support and documentation for these stacks will continue to grow. CUDA may be dominating the landscape now, but that could change in the years to come.

Abstracting Away the Stack

In general, many developers prefer to create hardware-independent applications. Within HPC, hardware optimizations can be justified for performance reasons, but many modern-day coders prefer to focus more on their application than on the nuances of the underlying hardware.  

PyTorch is a good example of this trend. Python is not known as a particularly fast language, yet 92% of models on Hugging Face are PyTorch exclusive. As long as the hardware vendor has a PyTorch version built on their libraries, users can focus on the model, not the underlying hardware differences. While this portability is nice, it does not guarantee performance, which is where the underlying hardware architecture may enter the conversation.
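
That vendor-neutral workflow can be sketched in a few lines of PyTorch-style Python. The helper name is hypothetical, and the probes shown cover only a subset of backends:

```python
def best_device():
    """Pick the best available PyTorch backend without naming a vendor.

    The "cuda" device covers both Nvidia CUDA and AMD ROCm builds (ROCm
    reuses the namespace via HIP); "mps" covers Apple GPUs. A fuller
    version would also probe "xpu" for Intel oneAPI devices.
    """
    try:
        import torch
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"

# The same model code then runs anywhere:
#   model = model.to(best_device())
```

Everything downstream of the device choice is identical across vendors, which is why model authors can largely ignore the stack wars described above.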

Of course, PyTorch is based on Python, the beloved first language of many programmers. Python trades performance for ease of use, particularly in high-performance areas like parallel programming. When HPC projects start in Python, they therefore tend to migrate to scalable, high-performance codes based on distributed C/C++ and MPI or threaded applications that use OpenMP. These choices often result in the “two-language” problem, where developers must maintain two versions of their code.

Currently, two newer languages, Chapel and Julia, offer a single easy-to-use language that provides a high-performance coding environment. Among other things, these languages attempt to “abstract away” many of the details required to write applications for parallel HPC clusters, multi-core processors, and GPU/accelerator environments. At their base, they still rely on the vendor GPU libraries mentioned above, but they often make it easy to build applications that can recognize and adapt to the underlying hardware environment at run time.

Chapel

Initially developed by Cray, Chapel (the Cascade High Productivity Language) is a parallel programming language designed for a higher level of expression than current programming languages (read as “Fortran/C/C++ plus MPI”). Hewlett Packard Enterprise, which acquired Cray, currently supports the development as an open-source project under version 2 of the Apache license. The current release is version 2.0, and the Chapel website posts some impressive parallel performance numbers.

Chapel compiles to binary executables by default, but it can also compile to C code, and the user can select the compiler. Chapel code can be compiled into libraries that can be called from C, Fortran, or Python (and others). Chapel supports GPU programming through code generation for Nvidia and AMD graphics processing units.

There is a growing collection of libraries available for Chapel. A recent neural network library called Chainn, tailored to building deep-learning models with parallel programming, lets users leverage the language’s parallel features to train models at scale, from laptops to supercomputers.

Julia

Developed at MIT, Julia is intended to be a fast, flexible, and scalable solution to the two-language problem mentioned above. Work on Julia began in 2009, when Jeff Bezanson, Stefan Karpinski, Viral B. Shah, and Alan Edelman set out to create an open technical computing language that was both high-level and fast.

Like Python, Julia provides a responsive interactive programming environment (a REPL, or read–eval–print loop) backed by a fast just-in-time compiler. The language syntax is similar to MATLAB and provides many advanced features, including:

  • Multiple dispatch: a function can have several implementations (methods) selected by the types of its inputs, making it easy to create portable and adaptive code
  • Dynamic type system: types for documentation, optimization, and dispatch
  • Performance approaching that of statically typed languages like C
  • A built-in package manager
  • Designed for parallel and distributed computing
  • The ability to compile to binary executables
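
For readers coming from Python, the standard library’s `functools.singledispatch` gives a rough taste of Julia’s type-driven method selection, though Julia dispatches on the types of all arguments rather than just the first:

```python
from functools import singledispatch

# Generic fallback implementation; registered methods below are chosen
# by the type of the first argument at call time.
@singledispatch
def describe(x):
    return "something else"

@describe.register
def _(x: int):
    return "an integer"

@describe.register
def _(x: list):
    return "a list"
```

Calling `describe(3)` selects the `int` method and `describe([1, 2])` the `list` method, while any other type falls through to the generic implementation; Julia generalizes this mechanism to every argument position.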

Julia also has GPU libraries for CUDA, ROCm, oneAPI, and Apple’s Metal that can be used with the machine learning library Flux.jl (among others). Flux is written in Julia and provides a lightweight abstraction over Julia’s native GPU support.

Both Chapel and Julia offer a high-level and portable approach to GPU programming. As with many languages that hide the underlying hardware details, there can be some performance penalties. However, developers are often fine with trading a few percentage points of performance for ease of portability.
