How GPUs Are Embedded in the HPC Landscape

By Manasi Rashinkar

September 23, 2024

Grasping the basics of Graphics Processing Unit (GPU) architecture is crucial for understanding how these powerful processors function, particularly in high-performance computing (HPC) and other demanding computational tasks. In a previous article, Understanding the GPU — The Catalyst of the Current AI Revolution, the basic concepts of a GPU were presented. This article continues with a discussion of the following topics:

  • The basic structure of a GPU
  • An explanation of the fused multiply-add (FMA) operation and its importance
  • Why GPUs are well suited to HPC and AI computing

Fundamentals of GPU Architecture

Simplified view of the GPU architecture (Source: Author)

A single GPU is made up of multiple Processor Clusters (PCs), each of which houses several Streaming Multiprocessors (SMs). Each SM contains a level-1 (L1) instruction cache that closely interacts with its cores. Typically, an SM works out of its L1 cache and a shared level-2 (L2) cache before reaching out to high-bandwidth dynamic random-access memory (DRAM). The architecture is built to tolerate memory latency by emphasizing computation: as long as the GPU has enough work to keep it busy, the time spent waiting on memory is effectively hidden.

SMs are the workhorses of a GPU, responsible for executing parallel tasks, managing memory access, and performing a wide array of computations. These range from basic arithmetic and logical operations to complex matrix manipulations and specialized graphics or scientific calculations, all optimized for parallel execution to maximize the GPU’s efficiency and performance.
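To make the thread-to-SM mapping concrete, here is a minimal CUDA sketch; the kernel name, array names, and sizes are illustrative assumptions rather than anything from the article. Each thread handles one element, thread blocks are scheduled onto whichever SMs have capacity, and launching far more threads than there are cores gives the hardware enough work to hide memory latency behind computation.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Minimal sketch: each thread processes one element. Blocks of threads are
// scheduled onto SMs independently, so while some threads wait on memory,
// others can compute, hiding the latency of DRAM accesses.
__global__ void scale_add(const float *x, const float *y, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        out[i] = 2.0f * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;  // one million elements (illustrative)
    float *x, *y, *out;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    int threads = 256;                         // threads per block
    int blocks = (n + threads - 1) / threads;  // enough blocks to cover n
    scale_add<<<blocks, threads>>>(x, y, out, n);
    cudaDeviceSynchronize();

    printf("out[0] = %f\n", out[0]);  // expected 4.0
    cudaFree(x); cudaFree(y); cudaFree(out);
    return 0;
}
```

Because each block is independent, the runtime is free to spread blocks across however many SMs a particular GPU provides, which is why the same code scales from small to large devices.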

FMA (Fused Multiply-Add)

FMA is the most frequent operation in modern neural networks, acting as a building block for fully connected and convolutional layers, both of which can be viewed as a collection of vector dot-products. This operation combines multiplication and addition into a single step, providing computational efficiency and numerical accuracy.

Here, a and b are multiplied, and the product is added to d, resulting in c; that is, the FMA computes c = (a × b) + d in a single step. Multiply-add operations are heavily used in matrix multiplication, where each element of the result matrix is the sum of multiple multiply-add operations.
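For illustration, CUDA exposes this fused operation directly through the standard fmaf math function. The short kernel below is a sketch of applying it element-wise across three arrays; the kernel name and array names are hypothetical.

```cuda
#include <cuda_runtime.h>

// Minimal sketch of the fused multiply-add: fmaf(a, b, d) computes
// a * b + d in one instruction with a single rounding step, which is both
// faster and more accurate than a separate multiply followed by an add.
__global__ void fma_elementwise(const float *a, const float *b, const float *d,
                                float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = fmaf(a[i], b[i], d[i]);  // c = a * b + d
}
```

The kernel would be launched with the same kind of grid and block configuration as the earlier sketch.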

Consider two matrices, A and B, where A is of size m×n and B is of size n×p. The result C will be a matrix of size m×p, where each element cij is calculated as:

cij = ai1·b1j + ai2·b2j + … + ain·bnj

Each element of the resulting matrix C is the sum of products of corresponding elements from a row of A and a column of B. Because each of these calculations is independent, they can all be performed in parallel (for more information, see Matrix multiplication).
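As a sketch of that independence, the naive CUDA kernel below assigns one thread to each output element cij and accumulates the row-by-column products with fused multiply-adds. The kernel name, the row-major memory layout, and the launch shape are illustrative assumptions, not a tuned implementation.

```cuda
#include <cuda_runtime.h>

// Naive GEMM sketch: one thread computes one element of C = A * B,
// where A is m x n, B is n x p, and C is m x p (all row-major).
__global__ void matmul_naive(const float *A, const float *B, float *C,
                             int m, int n, int p) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // i in [0, m)
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // j in [0, p)
    if (row < m && col < p) {
        float acc = 0.0f;
        for (int k = 0; k < n; ++k)
            acc = fmaf(A[row * n + k], B[k * p + col], acc);  // cij += aik * bkj
        C[row * p + col] = acc;
    }
}
```

A launch such as dim3 threads(16, 16); dim3 blocks((p + 15) / 16, (m + 15) / 16); covers the whole output matrix. Production libraries such as cuBLAS add shared-memory tiling and other optimizations on top of this basic pattern.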

Achieving efficient concurrent matrix multiplication is still challenging, and the best approach depends heavily on the specific hardware in use and the size of the problem being solved. Because matrix multiplication consists of a large number of independent, element-wise operations, GPUs can handle the workload efficiently, with thousands of cores performing these operations simultaneously.

GPUs are often described as SIMD (Single Instruction, Multiple Data) parallel processors: they can execute the same instruction simultaneously across large amounts of data.

The parallel SIMD nature of GPUs can speed up matrix multiplication significantly, and this acceleration is crucial for applications that require real-time or near-real-time processing. Modern GPUs, particularly those from Nvidia, also include specialized hardware units called Tensor Cores. These are designed to accelerate tensor operations, which are generalized forms of matrix multiplication, especially the mixed-precision calculations common in AI.

GPUs are not only faster but also more energy-efficient than CPUs for matrix multiplication, performing more computations per watt of power consumed. This efficiency is critical in data centers and cloud environments where energy consumption is a significant concern. By combining multiplication and addition into a single, optimized operation, GPUs deliver significant performance and precision advantages.
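For readers curious what programming Tensor Cores looks like directly, the sketch below uses CUDA's warp-level matrix (WMMA) API to multiply a single 16×16 half-precision tile and accumulate the result in single precision. It assumes a Volta-class or newer GPU, a launch of at least one warp, and illustrative pointer names; it is a minimal illustration rather than a complete matrix-multiplication routine.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp multiplies one 16x16 tile: C = A * B, with A and B in half
// precision and C accumulated in float. Requires compute capability 7.0+
// and a launch of at least 32 threads (one warp).
__global__ void wmma_tile(const half *a, const half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);      // start the accumulator at zero
    wmma::load_matrix_sync(a_frag, a, 16);  // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // fused multiply-accumulate on Tensor Cores
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```

In practice, libraries such as cuBLAS and cuDNN route eligible operations to Tensor Cores automatically, so application code rarely needs to drop to this level.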

How GPUs Enable HPC

We have now identified the following key qualities of GPUs:

  • Massive Parallelism
  • High Throughput
  • Specialized Hardware
  • High Memory Bandwidth
  • Energy Efficiency
  • Real-Time Processing
  • Acceleration

By leveraging these features, particularly with matrix mathematics, GPUs deliver unmatched performance and efficiency for HPC and AI tasks, making them the top choice for researchers, developers, and organizations engaged in advanced technologies and intricate computational challenges. A general survey of these HPC applications follows.

Molecular Dynamics Simulations

Molecular dynamics simulations are used to study the physical movements of atoms and molecules. They are crucial for understanding biological processes such as protein folding, drug interactions, and material properties at the atomic level. A simulation involves calculating the interactions between millions of particles, and these calculations can be parallelized: GPUs perform them simultaneously across many cores, drastically reducing the time required to simulate long timescales or large systems. This allows scientists to explore more complex systems more accurately and in less time.

Weather and Climate Modeling

Weather and climate modeling involves simulating the Earth’s climate system to predict future weather patterns and long-term climate changes. These calculations are essential for understanding global warming, planning for natural disasters, and making policy decisions. The models require solving large-scale differential equations across a global grid covering the atmosphere, oceans, and land surfaces. GPUs can handle the massive number of parallel computations needed to process data for each grid point, allowing for more detailed models and faster simulations, which in turn improves the ability to make accurate forecasts and understand climate dynamics.

Seismic Data Processing

Seismic data processing is used to explore underground oil and gas reservoirs. It involves analyzing seismic waves that travel through the Earth’s subsurface to create detailed images of geological formations. Interpreting seismic data requires applying complex mathematical algorithms to large datasets, and GPUs accelerate the process by executing these algorithms in parallel, enabling faster and more precise imaging. The results help identify potential drilling sites and assess their viability, which is critical for resource exploration.

In addition to HPC, AI applications require large amounts of matrix operations and are excellent candidates for GPU acceleration. Representative examples are as follows.

Training Deep Neural Networks

Training deep learning models for tasks like image classification, speech recognition, and natural language processing (NLP) requires processing vast amounts of data through multiple layers of neural networks. This process, known as forward and backward propagation, involves numerous matrix multiplications and other computations. GPUs are optimized for the parallel processing of large matrices, which is essential for deep learning. By distributing the workload across thousands of cores, GPUs can train models much faster than CPUs, reducing training times from weeks to days or even hours. This performance allows researchers and developers to iterate quickly, experiment with different models, and achieve better performance.

Real-Time Object Detection

Autonomous vehicles rely on real-time object detection to identify pedestrians, other vehicles, and obstacles in their path. This capability is crucial for making immediate driving decisions to ensure safety. It requires processing high-resolution video feeds and running complex AI models on each frame. GPUs can handle these tasks efficiently by processing multiple frames simultaneously and applying deep learning models quickly enough to provide real-time analysis. This capability is essential for the development and deployment of autonomous systems.

Natural Language Processing (NLP)

Transformer-based models like BERT and GPT have revolutionized NLP tasks, including language translation, text summarization, and sentiment analysis. These models have billions of parameters and require extensive computation to train and deploy. GPUs enable the training of these large models by handling the vast number of matrix operations and data manipulations required. Moreover, during inference (when the model is applied to new data), GPUs accelerate the processing of large text datasets, making it possible to deploy these models in real-time applications such as chatbots, voice assistants, and automated content generation. 

Summary

GPUs stand out in high-performance computing and AI due to their capacity for parallel processing, managing large datasets, and speeding up complex calculations. They contribute to quicker and more effective simulations, model training, and data analysis across various domains, including scientific research, climate modeling, autonomous vehicles, and financial analytics.
