Nvidia today announced general availability of its BlueField-3 data processing unit (DPU), along with notable early deployments, including at Oracle Cloud Infrastructure. First described in 2021 and now shipping, BlueField-3 is Nvidia’s third-generation DPU and has roughly 22 billion transistors. The new DPU supports Ethernet and InfiniBand connectivity at up to “400 gigabits per second and provides 4x more compute power, up to 4x faster crypto acceleration, 2x faster storage processing, and 4x more memory bandwidth compared to the previous generation of BlueField,” according to Nvidia.
In his GTC 23 keynote, Nvidia CEO Jensen Huang said, “In a modern, software-defined datacenter, the operating system doing virtualization, networking, storage and security can consume nearly half of the datacenter’s CPU cores and associated power. Datacenters must accelerate every workload to reclaim power and free CPUs for revenue-generating workloads. Nvidia BlueField offloads and accelerates the datacenter operating system and infrastructure software.”
Back in 2020, Nvidia laid out its strategy for DPUs, arguing that CPUs were being bogged down by housekeeping chores such as those cited by Huang. DPUs, argued Nvidia, would absorb these tasks, freeing CPUs for applications. Other chip suppliers – notably Intel and AMD – seem to agree and have jumped into the DPU market.
Sometimes described as smartNICs on steroids, DPUs have drawn market interest that has not yet translated into broad sales. That may now be changing. Huang cited “over two dozen ecosystem partners” – including such names as Cisco, DDN, Dell EMC, and Juniper – that use BlueField technology today.
In a media/analyst pre-briefing, Kevin Deierling, VP, networking, said “BlueField-3 is in full production and available. [It] has twice as many Arm processor cores [as BlueField-2], more accelerators, and runs workloads up to eight times faster than our previous generation DPU. BlueField-3 offloads, accelerates and isolates workloads across cloud HPC, enterprise and accelerated AI use cases.”
Nvidia is targeting supercomputers, datacenters, and cloud providers for its DPUs. At GTC, Nvidia touted the Oracle Cloud Infrastructure deployment, in which BlueField-3 is part of a larger DGX-in-the-cloud win for Nvidia.
“As you heard, we are announcing that Oracle Cloud Infrastructure is the first to run DGX Cloud, an AI supercomputing service that gives enterprises immediate access to the infrastructure and software needed to train advanced models for generative AI. OCI has [also] chosen BlueField-3 for greater performance, efficiency and security together. BlueField-3 delivers massive performance and efficiency gains by offloading data center infrastructure tasks from CPUs, increasing virtualized instances by eight times in comparison to BlueField-2,” said Deierling.
In the official announcement, Clay Magouyrk, executive vice president of OCI, is quoted as saying, “Oracle Cloud Infrastructure offers enterprise customers nearly unparalleled accessibility to AI and scientific computing infrastructure with the power to transform industries. Nvidia BlueField-3 DPUs are a key component of our strategy to provide state-of-the-art, sustainable cloud infrastructure with extreme performance.”
Other BlueField-3 wins among CSPs include Baidu, CoreWeave, JD.com, Microsoft Azure, and Tencent.
Nvidia also reported BlueField-3 features full backward compatibility “through the DOCA software framework.”
DOCA is the programming framework for BlueField, and DOCA 2.0 is the latest release. Nvidia has been steadily adding features to its DPU line. Recently, for example, it has beefed up inline GPU packet processing “to implement high data rate solutions: data filtering, data placement, network analysis, sensors’ signal processing, and more.” The new DOCA GPUNetIO library can overcome some of the limitations found in the previous DPDK-based solution.
Here’s an excerpt from an Nvidia blog on the topic: “Real-time GPU processing of network packets is a technique useful to several different application domains, including signal processing, network security, information gathering, and input reconstruction. The goal of these applications is to realize an inline packet processing pipeline to receive packets in GPU memory (without staging copies through CPU memory); process them in parallel with one or more CUDA kernels; and then run inference, evaluate, or send over the network the result of the calculation.”
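To make the pipeline idea concrete, here is a minimal sketch in plain Python/NumPy. It is emphatically not the DOCA GPUNetIO API (whose calls are not detailed in this article); it only models the batch-oriented pattern the blog describes, with a vectorized pass over a packet batch standing in for a CUDA kernel over packets already resident in GPU memory. The UDP filter criterion and field names here are illustrative assumptions.

```python
# Illustrative sketch only -- NOT the DOCA GPUNetIO API.
# Models the inline pattern: packets arrive as a batch in one buffer,
# get filtered/analyzed in a single bulk operation (standing in for a
# CUDA kernel over GPU memory), and results are produced without a
# per-packet round-trip through a host-side loop.
import numpy as np

UDP = 17  # IP protocol number for UDP (assumed filter criterion)

def process_batch(protocols: np.ndarray, payload_lens: np.ndarray):
    """Filter a batch of packet headers and aggregate payload bytes.

    protocols    -- per-packet IP protocol numbers
    payload_lens -- per-packet payload lengths in bytes
    Returns (boolean mask of packets kept, total bytes kept).
    """
    keep = protocols == UDP                # data filtering, one vectorized pass
    total = int(payload_lens[keep].sum())  # network analysis over kept packets
    return keep, total

# Tiny synthetic batch: three UDP packets and two TCP packets (protocol 6).
protos = np.array([17, 6, 17, 6, 17])
lens = np.array([100, 400, 200, 50, 300])
keep, total = process_batch(protos, lens)
print(keep.tolist(), total)  # -> [True, False, True, False, True] 600
```

The point of the batch/bulk structure is the same as in the real pipeline: amortizing per-packet overhead by operating on many packets at once, which is what makes GPU-resident processing worthwhile at high data rates.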