Nvidia is putting its GH200 chips in European supercomputers, and researchers are getting their hands on those systems and releasing research papers with performance benchmarks. In the first paper, Understanding Data Movement in Tightly Coupled Heterogeneous Systems: A Case Study with the Grace Hopper Superchip, researchers benchmarked a range of applications on the GH200, which integrates a CPU and a GPU on one module. The numbers highlighted the chip's blazing speed and how AI and scientific application performance can benefit from its locally attached HBM3 and LPDDR5 memory.
One benchmark from the Alps system — which is still being upgraded — measures the GH200 performance when running AI applications.
Another paper, Boosting Earth System Model Outputs And Saving PetaBytes in their Storage Using Exascale Climate Emulators, compares the performance of large GH200 clusters against AMD's MI250X in Frontier, the Nvidia A100 in Leonardo, and the Nvidia V100 in Summit. Those systems are former Top500 chart-toppers or currently sit in the top 10.
The GH200 links the 72 Arm Neoverse V2 cores of Nvidia's Grace CPU directly with the Hopper GPU's 132 streaming multiprocessors. The CPU and GPU communicate via the NVLink-C2C interconnect, which provides 900GB/sec of bidirectional bandwidth. The Superchip also carries 96GB of HBM3 and pools the different types of CPU and GPU memory into a single address space.
The Informal GH200 Analysis
Researchers got their hands on a partition of the GH200 chips in the Alps supercomputer, hosted at the Swiss National Supercomputing Centre, and ran AI benchmarks on the CUDA 12.3 software stack.
Alps is one of the first supercomputers to get the GH200, and an optimized subsystem called "preAlps" ranks number five on the Green500 list. The system is built on HPE's Slingshot interconnect, not Nvidia's own networking.
The researchers tested quad GH200 nodes. A unified memory pool “opens up new possibilities for scaling applications with large memory footprints that go beyond what is directly available to a single GPU or CPU,” the researchers said.
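What that looks like in practice: a single allocation can be touched by both the Grace cores and the Hopper GPU, and can grow past the GPU's 96GB of HBM3. The sketch below is a minimal illustration of that idea using CUDA managed memory; it is not code from the paper, and the buffer size is an arbitrary stand-in.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel that updates a buffer which is also visible to the CPU.
__global__ void scale(double *x, size_t n, double a) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    // On GH200 this allocation could be made far larger than the GPU's 96GB
    // of HBM3; the runtime backs it with LPDDR5 and migrates pages on demand.
    // (Kept small here purely for illustration.)
    const size_t n = 1ull << 28;                  // ~2 GiB of doubles
    double *x = nullptr;
    cudaMallocManaged(&x, n * sizeof(double));    // one pointer, CPU and GPU

    for (size_t i = 0; i < n; ++i) x[i] = 1.0;    // first touch from the CPU

    unsigned blocks = (unsigned)((n + 255) / 256);
    scale<<<blocks, 256>>>(x, n, 2.0);            // GPU uses the same pointer
    cudaDeviceSynchronize();

    printf("x[0] = %f\n", x[0]);                  // CPU reads the GPU's result
    cudaFree(x);
    return 0;
}
```

Because the data can be served or migrated between LPDDR5 and HBM3 over NVLink-C2C, this is the mechanism behind the scaling beyond a single GPU's or CPU's memory that the researchers describe.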
Each node packed four GH200 Superchips, for a total of 288 CPU cores, four Hopper GPUs, and 896GB of memory, with each Superchip contributing 96GB of HBM3 and 128GB of LPDDR5. Nodes were connected via HPE Slingshot 11 at 800 Gb/s per node.
The researchers measured read and write bandwidth, latency, and application performance with data placed in either the HBM3 or the LPDDR5 portion of the unified memory pool. Where a workload's data sits matters, because HBM3 is significantly faster than LPDDR5.
The researchers measured AI performance with GEMM (general matrix multiplication) benchmarks, which take advantage of the AI-centric Tensor Cores in the GPU.
With data in HBM3, the Superchip delivered 612 teraflops, versus 59.2 teraflops from DDR. For FP32, the gap was narrower: 51.9 teraflops on HBM3 against 22.9 teraflops on DDR5. On FP64, performance was 58.4 teraflops with HBM3 and 13.2 teraflops with DDR memory.
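Benchmarks like these boil down to timing a large matrix multiplication with the operands pinned to one memory type or the other. Below is a minimal sketch of such a measurement using cuBLAS with FP16 inputs and FP32 accumulation to engage the Tensor Cores; the matrix size, the single-run timing, and the placement approach are assumptions for illustration, not the authors' actual harness.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <cublas_v2.h>

// Times one FP16-input / FP32-accumulate GEMM with cuBLAS, which dispatches to
// the Hopper Tensor Cores. cudaMalloc keeps the operands in HBM3; swapping in
// cudaMallocManaged plus a cudaMemAdviseSetPreferredLocation hint toward the
// CPU is one assumed way to keep them in LPDDR5 for the "DDR" comparison.
int main() {
    const int n = 8192;                      // assumed square problem size
    const size_t elems = (size_t)n * n;

    __half *A, *B; float *C;
    cudaMalloc(&A, elems * sizeof(__half));
    cudaMalloc(&B, elems * sizeof(__half));
    cudaMalloc(&C, elems * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);

    // A real benchmark would warm up and average many runs; a single timed
    // call is shown for brevity.
    cudaEventRecord(start);
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                 &alpha, A, CUDA_R_16F, n, B, CUDA_R_16F, n,
                 &beta,  C, CUDA_R_32F, n,
                 CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double tflops = 2.0 * n * (double)n * n / (ms * 1e-3) / 1e12;
    printf("%dx%dx%d GEMM: %.1f TFLOPS\n", n, n, n, tflops);

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```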
LLM inference was also much quicker out of HBM3: generating 100 tokens was up to four times faster than from DDR memory with the 13-billion-parameter Llama-2 model, and about two times faster with the 7-billion-parameter Llama-2 model.
Other notable findings: the Hopper GPU reached read speeds of 420.2 GB/s and write speeds of 380.1 GB/s against DDR, and read speeds of 3795.9 GB/s and write speeds of 3712.1 GB/s against HBM3.
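Bandwidth figures of this kind are typically collected with a STREAM-style kernel that does nothing but move data, timed with CUDA events; read-only and write-only variants isolate the two directions. The copy-kernel sketch below is illustrative only and measures combined read-plus-write traffic against whichever memory the buffers land in.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// STREAM-style copy: each thread streams elements in a grid-stride loop, so
// the kernel is limited by memory bandwidth rather than compute.
__global__ void copy(const double *__restrict__ in, double *__restrict__ out,
                     size_t n) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x; i < n; i += stride)
        out[i] = in[i];
}

int main() {
    const size_t n = 1ull << 28;              // 2 GiB per buffer (assumed size)
    double *in, *out;
    // cudaMalloc lands the buffers in HBM3; a managed allocation with a CPU
    // preferred-location hint would make the kernel stream from LPDDR5 over
    // NVLink-C2C instead (an assumed setup for the comparison).
    cudaMalloc(&in,  n * sizeof(double));
    cudaMalloc(&out, n * sizeof(double));

    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);

    copy<<<2048, 256>>>(in, out, n);          // warm-up run
    cudaEventRecord(start);
    copy<<<2048, 256>>>(in, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double gbs = 2.0 * n * sizeof(double) / (ms * 1e-3) / 1e9;  // bytes read + written
    printf("copy bandwidth: %.1f GB/s\n", gbs);

    cudaFree(in); cudaFree(out);
    return 0;
}
```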
The Hopper GPU had an HBM memory latency of 344.2 nanoseconds and a DDR memory latency of 817.8 nanoseconds.
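Latency, by contrast, is usually measured with a pointer chase: one thread follows a chain of dependent loads so that every access has to wait on the one before it, and the elapsed time divided by the number of hops approximates per-access latency. A rough sketch of that technique follows; it is not the paper's code, and the array and hop counts are assumed.

```cuda
#include <cstdio>
#include <vector>
#include <random>
#include <algorithm>
#include <cuda_runtime.h>

// A single thread walks next[] as a linked list; every load depends on the
// previous one, so throughput tricks cannot hide the memory latency.
__global__ void chase(const unsigned *next, unsigned steps, unsigned *sink) {
    unsigned idx = 0;
    for (unsigned i = 0; i < steps; ++i)
        idx = next[idx];
    *sink = idx;                              // keeps the chain from being optimized out
}

int main() {
    const unsigned n = 1u << 26;              // 256 MiB of indices, well past the L2 cache
    const unsigned steps = 1u << 20;

    // Build one random cycle through all n slots so prefetchers cannot predict it.
    std::vector<unsigned> order(n), h_next(n);
    for (unsigned i = 0; i < n; ++i) order[i] = i;
    std::shuffle(order.begin(), order.end(), std::mt19937(42));
    for (unsigned i = 0; i < n; ++i) h_next[order[i]] = order[(i + 1) % n];

    unsigned *next, *sink;
    cudaMalloc(&next, n * sizeof(unsigned));  // HBM3; a managed allocation with a
    cudaMalloc(&sink, sizeof(unsigned));      // CPU hint would target LPDDR5 instead
    cudaMemcpy(next, h_next.data(), n * sizeof(unsigned), cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);
    cudaEventRecord(start);
    chase<<<1, 1>>>(next, steps, sink);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("average load latency: %.1f ns\n", ms * 1e6 / steps);

    cudaFree(next); cudaFree(sink);
    return 0;
}
```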
The researchers also ran the read/write bandwidth and memory latency tests on the Grace CPUs, and published copy performance between the CPUs and GPUs.
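Copy performance between the two sides of the Superchip can be probed in the same spirit by timing explicit transfers across NVLink-C2C. A brief, illustrative sketch with an assumed 1GB transfer size:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1ull << 30;          // 1 GiB transfer (assumed size)

    // malloc gives a Grace-side LPDDR5 buffer; cudaMalloc gives Hopper-side HBM3.
    double *host = (double *)malloc(bytes);
    double *dev  = nullptr;
    cudaMalloc(&dev, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);

    // Time a CPU-to-GPU copy over NVLink-C2C; flipping the arguments and using
    // cudaMemcpyDeviceToHost measures the opposite direction.
    cudaEventRecord(start);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("host-to-device copy: %.1f GB/s\n", bytes / (ms * 1e-3) / 1e9);

    cudaFree(dev); free(host);
    return 0;
}
```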
The researchers were from ETH Zurich and Nvidia.
“We argue that despite the sophisticated memory system of the Quad GH200 node, looking at the system in terms of individual interconnected Superchips is crucial to achieving good performance,” the researchers concluded.
The NVLink-C2C interconnect “opens up possibilities for the development of heterogeneous applications mixing CPU and GPU computations,” according to the researchers.
The Comparison to MI250X and A100
Researchers also ran a climate emulator application on clusters of the GH200 in Alps, the MI250X in Frontier, the Nvidia A100 in Leonardo, and the Nvidia V100 in Summit. The chips power systems that are former Top500 chart-toppers or currently sit in the top 10.
The comparisons aren’t exactly apples-to-apples, particularly those pitting the GH200 against the Nvidia A100 and V100, which do not include an integrated CPU.
However, the mixed-precision performance numbers, which include double-precision and half-precision measurements, provide the snapshot HPC enthusiasts care about: a big-picture view of how much overall performance these systems deliver when mixing scientific and AI simulations.
The numbers show that the GH200 provides a significant boost for climate simulation applications and their data. Earth system models are demanding on supercomputing systems, which makes them a useful yardstick for GPU performance.
An Alps cluster of 4,096 GH200 GPUs, running a problem size of 10.4 million, topped the field at 384.2 petaflops, or 93.8 teraflops per GPU.
The MI250X in Frontier — with 4,096 GPUs and a problem size of 8.39 million — benchmarked at 223.7 petaflops, with 54.6 teraflops per GPU.
The Nvidia A100 in the Leonardo supercomputer — with 4,096 GPUs and a problem size of 8.39 million — benchmarked at 243.1 petaflops, with 57.2 teraflops per GPU.
Leonardo has 3,456 nodes, each with four Nvidia A100 64GB GPUs, with a theoretical double-precision peak performance of 306.31 petaflops.
The V100 in Summit, with 6,144 GPUs and a problem size of 6.29 million, delivered 153.6 petaflops of overall performance, or 25 teraflops per GPU. Summit will soon be retired.
The climate emulator used in the benchmark was trained on 318 billion hourly temperature data points covering a 35-year period and 31 billion daily data points from global simulations stretching back 83 years.
The researchers claimed their climate emulator, which complements other systems, can eke out more performance from high-performance machines, reaching up to 0.976 exaflops on 9,025 nodes of Frontier (which has 9,472 nodes in total).
The emulator can also bring cost and performance efficiencies to data-intensive climate simulations. Other simulators generate so many petabytes of data that storage becomes expensive and computational capability is constrained.
For example, the National Center for Atmospheric Research's CMIP6 runs simulated 37,000 years of climate data, generated from various scenarios, which consumed 190 million CPU hours and produced 2 petabytes of post-processed time series data.
“Managing data at NCAR incurs costs of approximately $45 per TB annually. This results in substantial financial burdens for projects with petabyte-scale storage needs and can limit science objectives,” the researchers said.
The paper was authored by researchers from NCAR, King Abdullah University of Science and Technology (KAUST), Saint Louis University, and the University of Notre Dame. The researchers are also associated with Nvidia and the University of Tennessee.