As HPCwire reported recently, the latest MLPerf benchmarks are out. Not surprisingly, Nvidia was the leader across many categories.
The HGX H100 GPU systems, which contain eight H100 GPUs, delivered the highest throughput on every MLPerf inference test in this round. Grace Hopper Superchips and H100 GPUs led across all of MLPerf's data center tests, including inference for computer vision, speech recognition, and medical imaging, in addition to the more demanding use cases of recommendation systems and the large language models (LLMs) used in generative AI.
The latest MLPerf round also included an updated test of recommendation systems and the first inference benchmark on GPT-J, an LLM with six billion parameters (parameter count being a rough measure of an AI model's size).
Meet Grace Hopper
One of the most anticipated results was for the GH200 Grace Hopper Superchip. Nvidia offers two versions of the Grace Superchip. The first is the dual Grace-Grace version that we recently covered: two 72-core Grace chips connected over a 900 GB/s bidirectional NVLink-C2C link, delivering 144 high-performance Arm Neoverse V2 cores and up to 1 TB/s of ECC memory bandwidth.
The second version is the GH200, which combines a Hopper GPU with a Grace CPU in one superchip. The combination provides more memory, bandwidth, and the ability to automatically shift power between the CPU and GPU to optimize performance. A logical schematic of the Grace-Hopper superchip is shown in Figure 1.

The GH200 brings the advantage of a single shared CPU-GPU memory domain: there is no need to move data across the PCIe bus between CPU and GPU, and both processors have a consistent view of all memory. As shown in Figure 2, the GH200 bested the Nvidia H100 SXM (a GH100 Hopper GPU) in every one of the popular MLPerf inference workloads.
The results in Figure 2 cover two MLPerf test scenarios:
- Offline—One query with all samples is sent to the System Under Test (SUT). The SUT can send the results back once or multiple times in any order. The performance metric is samples per second.
- Server—The queries are sent to the SUT following a Poisson distribution (to model real-world random arrivals). One query has one sample. The performance metric is queries per second (QPS) within a latency bound (see the sketch below).
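
The Server scenario is essentially a queueing test. The short Python sketch below is a rough illustration, not the official MLPerf LoadGen; the target rate, per-query service time, and latency bound are made-up placeholders, and a single fixed-speed accelerator is assumed.

```python
# Illustrative sketch of the MLPerf "Server" scenario: queries arrive as a
# Poisson process and the metric of interest is how many queries per second
# the system can sustain while staying within a latency bound.
# NOTE: all numbers below are hypothetical placeholders, not MLPerf settings.
import random
import statistics

def simulate_server_scenario(target_qps=100.0, latency_bound_s=0.1,
                             n_queries=10_000, service_time_s=0.008, seed=0):
    """Return (fraction of queries within the bound, mean latency in seconds)."""
    rng = random.Random(seed)
    t = 0.0                 # simulated arrival clock
    server_free_at = 0.0    # time the (single) accelerator becomes idle
    latencies = []
    for _ in range(n_queries):
        # Poisson arrivals: exponential inter-arrival times at the target rate.
        t += rng.expovariate(target_qps)
        start = max(t, server_free_at)      # queue if the device is busy
        finish = start + service_time_s     # fixed per-query service time
        server_free_at = finish
        latencies.append(finish - t)
    within_bound = sum(l <= latency_bound_s for l in latencies) / n_queries
    return within_bound, statistics.mean(latencies)

if __name__ == "__main__":
    frac_ok, mean_lat = simulate_server_scenario()
    print(f"{frac_ok:.1%} of queries met the latency bound "
          f"(mean latency {mean_lat*1000:.1f} ms)")
```

Pushing target_qps up while holding the service time fixed quickly drives queries past the bound, which mirrors how MLPerf searches for the highest QPS a system can sustain within the latency limit.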

The memory advantage of the GH200 Grace Hopper Superchip comes from 96 GB of HBM3, which provides up to 4 TB/s of HBM3 memory bandwidth, compared to 80 GB and 3.35 TB/s for the H100 SXM. This larger memory capacity and greater memory bandwidth enabled larger batch sizes on the GH200 than on the H100 SXM. For example, both RetinaNet and DLRMv2 ran with up to double the batch size in the Server scenario and 50% larger batch sizes in the Offline scenario.
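As a back-of-the-envelope illustration of why more on-package memory means more batch headroom, consider the quick calculation below. The model footprint and per-sample memory cost are hypothetical placeholders, not MLPerf numbers; only the 96 GB and 80 GB capacities come from the spec sheets.

```python
# Back-of-the-envelope sketch: once the fixed model footprint (weights, cache,
# workspace) is subtracted, the remaining HBM3 capacity sets the batch size.
# The footprint and per-sample cost below are hypothetical placeholders.
GH200_HBM3_GB = 96      # GH200 Grace Hopper Superchip
H100_SXM_HBM3_GB = 80   # H100 SXM

model_and_runtime_gb = 40.0   # hypothetical fixed footprint
per_sample_gb = 0.25          # hypothetical per-sample activation footprint

def max_batch(total_gb):
    return int((total_gb - model_and_runtime_gb) / per_sample_gb)

print("H100 SXM max batch :", max_batch(H100_SXM_HBM3_GB))   # 160
print("GH200 max batch    :", max_batch(GH200_HBM3_GB))      # 224
```

Under these assumed numbers the GH200 gets roughly 40% more batch headroom, in the same ballpark as the batch-size gains Nvidia reported for RetinaNet and DLRMv2.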
Although the MLPerf tests do not relate directly to standard HPC benchmarks, the GH200 should show similar improvements over H100 PCIe-based systems and usher in a new era of shared-memory CPU-GPU processors.
Not so fast, but not so rare
There was a new Nvidia entrant in the latest MLPerf benchmarks. The 72-watt L4 GPUs ran the full range of workloads and delivered great performance across the board. For example, the L4 GPUs running on a compact adapter card delivered up to 6x more performance than CPUs rated for nearly 5x higher power consumption.
Based on the Ada Lovelace architecture, the L4 is a low-power (72 W) GPU with 7,424 CUDA cores and 24 GB of memory. It is rated at 31.33 TFLOPS (FP32) and 489.6 GFLOPS (FP64). While not as hefty as the H100-based systems, the L4 is aimed at edge inference and general-purpose GPU computing. Due to its single-slot size and low power requirements, it can be readily incorporated into almost any server or workstation.

Compared to its “big sibling,” the Nvidia L40 (18,176 CUDA cores, 300 W TDP, 90.52 FP32 TFLOPS, 1,414 FP64 GFLOPS), the L4 offers roughly one-third the performance, about one-quarter the power draw, currently around one-third the cost, and one significant advantage: availability.
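Those rough ratios follow directly from the specs quoted above. The quick sketch below makes them explicit, using board power as a stand-in for heat, which is a simplification.

```python
# Ratio check using the specs quoted in the text:
# L4:  31.33 FP32 TFLOPS at 72 W; L40: 90.52 FP32 TFLOPS at 300 W.
l4_tflops, l4_watts = 31.33, 72
l40_tflops, l40_watts = 90.52, 300

print(f"L40/L4 FP32 throughput ratio : {l40_tflops / l4_tflops:.1f}x")   # ~2.9x
print(f"L40/L4 power ratio           : {l40_watts / l4_watts:.1f}x")     # ~4.2x
print(f"L4 FP32 GFLOPS per watt      : {1000 * l4_tflops / l4_watts:.0f}")
print(f"L40 FP32 GFLOPS per watt     : {1000 * l40_tflops / l40_watts:.0f}")
```

On these numbers the L4 delivers roughly 435 FP32 GFLOPS per watt versus about 300 for the L40, which is the quieter argument for the smaller card.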
Nvidia L4s are available from well-known system builders and Google Cloud. According to Wyatt Gorman, Solutions Manager, HPC and AI Infrastructure, Google Cloud, “There is plenty of availability for L4 instances, and performance is quite good.” (Comment from the HPC&AI Resources in the Great GPU Squeeze panel at 2023 HPC&AI on Wall Street.)
The Great GPU Squeeze has forced the market to find GPU cycles anywhere it can, and the L4 represents a solution for accelerated processing in these times of shortage. Nvidia reports the following performance gains for the L4 over a standard CPU node (your mileage may vary; always request benchmark details or run your own benchmarks):
- Molecular Dynamics – AMBER, software to simulate and analyze biomolecular interactions; one of AMBER's features is the ability to use GPUs to massively accelerate these simulations: the L4 is up to 46x faster than a typical CPU node.
- Molecular Dynamics – NAMD (Nanoscale Molecular Dynamics), for high-performance simulation of large biomolecular systems: the L4 is up to 13x faster than a typical CPU.
- Fusion Physics – GTC (Gyrokinetic Toroidal Code): the L4 is up to 14x faster than a typical CPU.
Overall, Nvidia continues its dominance in the AI sector. Its performance in the MLPerf benchmarks ensures continued demand for high-end products, particularly Grace Hopper and H100 systems, but it seems there are plenty of GPU cycles still to be found at the low end.