When discussing GenAI, the term “GPU” almost always enters the conversation, and the topic often moves toward performance and access. Interestingly, the word “GPU” is assumed to mean “Nvidia” products. (As an aside, the popular Nvidia hardware used in GenAI is not technically a Graphics Processing Unit. I prefer SIMD units.)
The association of GenAI and GPUs with Nvidia is no accident. Nvidia has always recognized the need for tools and applications to help grow its market. They have created a very low barrier to getting software tools (e.g., CUDA) and optimized libraries (e.g., cuDNN) for Nvidia hardware. Indeed, Nvidia is known as a hardware company, but as Bryan Catanzaro, VP of Applied Deep Learning Research at Nvidia, has stated, “Many people don’t know this, but Nvidia has more software engineers than hardware engineers.”
As a result, Nvidia has built a powerful software “moat” around their hardware. While CUDA is not open source, it is freely available and under the firm control of Nvidia. While this situation has benefited Nvidia (as it should; they invested time and money into CUDA), it has created difficulties for those companies and users that want to grab some of the HPC and GenAI market with alternative hardware.
Building on the Castle Foundation
The number of foundational models developed for GenAI continues to grow. Many of these are “open source” because they can be used and shared freely (for example, the Llama foundational model from Meta). In addition, they require a large number of resources (both people and machines) to create and are limited mainly to the hyperscalers (AWS, Microsoft Azure, Google Cloud, Meta Platforms, and Apple) that have huge numbers of GPUs available. Beyond the hyperscalers, other companies have invested in hardware (i.e., purchased a massive number of GPUs) to create their own foundational models.
From a research perspective, the models are interesting and can be used for a variety of tasks; however, the expected use and need for even more GenAI computing resources is twofold:
- Fine-tuning: Adding domain-specific data to a foundational model to make it work for your use case.
- Inference: Once the model is fine-tuned, it will require resources when used (i.e., asked questions).
These tasks are not restricted to hyperscalers and will need accelerated computing, that is, GPUs. The obvious solution is to buy more Nvidia GPUs, but these are largely “unavailable” now that demand has far outstripped supply, and AMD is ready and waiting. To be fair, Intel and some other companies are also ready and waiting to sell into this market. The point is that GenAI will continue to squeeze GPU availability as fine-tuning and inference become more pervasive, and any GPU (or accelerator) is better than no GPU.
Moving away from Nvidia hardware suggests that other vendors' GPUs and accelerators must support CUDA to run many of the models and tools. AMD has made this possible with the HIP CUDA conversion tool; however, the best results often seem to come from the native tools surrounding the Nvidia castle.
The PyTorch Drawbridge
In the HPC sector, CUDA-enabled applications rule the GPU-accelerated world. Porting codes can often realize a speed-up of 5-6x when using a GPU and CUDA. (Note: Not all codes can achieve this speed-up, and some may not be able to use the GPU hardware.) However, in GenAI, the story is quite different.
Initially, TensorFlow was the tool of choice for creating AI applications using GPUs. It works with CPUs and is accelerated with CUDA for GPUs. This situation is changing rapidly.
An alternative to TensorFlow is PyTorch, an open-source machine learning library for developing and training neural network-based deep learning models. Facebook’s AI research group primarily develops it.
In a recent blog post, Ryan O’Connor, a Developer Educator at AssemblyAI, notes that on the popular site HuggingFace (which allows users to download and incorporate trained and tuned state-of-the-art models into application pipelines with just a few lines of code), 92% of the available models are PyTorch exclusive.
In addition, as shown in Figure One, a comparison of Machine Learning papers shows a significant trend toward PyTorch and away from TensorFlow.
Of course, underneath PyTorch are calls to CUDA, but users do not need to write those calls themselves because PyTorch insulates them from the underlying GPU architecture. There is also a version of PyTorch that uses AMD ROCm, an open-source software stack for AMD GPU programming. Crossing the CUDA moat for AMD GPUs may be as easy as using PyTorch.
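To make the insulation point concrete, here is a minimal sketch of device-agnostic PyTorch code. It relies on the fact that ROCm builds of PyTorch reuse the torch.cuda interface, so the same script can run unchanged on Nvidia or AMD hardware (or fall back to the CPU). The function names describe_backend and run_matmul are hypothetical names chosen for this illustration, and the guarded import is only there so the sketch also runs on a machine without PyTorch installed.

```python
# Illustrative sketch only: the same source runs on CUDA, ROCm, or CPU builds.
try:
    import torch
except ImportError:  # assumption: PyTorch may be absent on the reader's machine
    torch = None


def describe_backend():
    """Report which accelerator stack, if any, this PyTorch build can use."""
    if torch is None:
        return "PyTorch not installed"
    if not torch.cuda.is_available():
        return "CPU only"
    # ROCm builds expose AMD GPUs through the torch.cuda API;
    # torch.version.hip is set on those builds and is None on CUDA builds.
    if getattr(torch.version, "hip", None):
        return "AMD GPU via ROCm/HIP"
    return "Nvidia GPU via CUDA"


def run_matmul(n=256):
    """Multiply two random n x n matrices on whatever device is available."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    return (a @ b).shape


print(describe_backend())
if torch is not None:
    print(run_matmul())
```

Note that nothing in run_matmul names a vendor: the "cuda" device string is simply PyTorch's label for the available GPU backend, which is why model code written this way tends to move between vendors with little or no change.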
Instinct for Inference
In both HPC and GenAI, the Nvidia 72-core ARM-based Grace-Hopper superchip with a shared memory H100 GPU (and also the 144-core Grace-Grace version) is highly anticipated. All Nvidia-released benchmarks thus far indicate much better performance than the traditional server where the GPU is attached and accessed over the PCIe bus. Grace-Hopper represents optimized hardware for both HPC and GenAI. It is also expected to find wide use in both fine-tuning and inference. Demand is expected to be high.
AMD has had shared memory CPU-GPU designs since 2006 (AMD acquired graphics card company ATI in 2006). Beginning with the “Fusion” brand, many AMD x86_64 processors are now implemented as a combined CPU/GPU called an Accelerated Processing Unit (APU).
The upcoming Instinct MI300A processor (APU) from AMD will offer competition for the Grace-Hopper superchip. It will also power the forthcoming El Capitan at Lawrence Livermore National Laboratory. The integrated MI300A will provide up to 24 Zen4 cores in combination with a CDNA 3 GPU architecture and up to 192 GB of HBM3 memory, providing uniformly accessible memory for all the CPU and GPU cores. The chip-wide cache-coherent memory reduces data movement between the CPU and GPU, eliminating the PCIe bus bottleneck and improving performance and power efficiency.
AMD is readying the Instinct MI300A for the upcoming inference market. As AMD CEO Lisa Su stated in a recent article on Yahoo! Finance, “We actually think we will be the industry leader for inference solutions because of some of the choices that we’ve made in our architecture.”
For AMD and many other hardware vendors, PyTorch has dropped the drawbridge on the CUDA moat around the foundational models. AMD has the Instinct MI300A battle wagon ready to go. The hardware battles for the GenAI market will be won by performance, portability, and availability. The AI day is young.