Nvidia Rolls Out Certified Server Program Targeting AI Applications

By John Russell

January 26, 2021

Nvidia today launched a certified systems program in which participating vendors can offer Nvidia-certified servers with up to eight A100 GPUs. Separate support contracts, sold by the OEM but delivered directly by Nvidia, are also available for certified systems. Besides the obvious marketing motives, Nvidia says the pre-tested systems and contract support should boost confidence and ease deployment for those taking the AI plunge. Nvidia-certified systems are validated to run Nvidia’s NGC catalog of AI workflows and tools.

Adel El-Hallak, director of product management, NGC, announced the program in a blog today and Nvidia held a pre-launch media/analyst briefing yesterday. “Today, we have 13 or 14 systems from at least five OEMs that are Nvidia-certified. We expect to certify up to 70 systems from nearly a dozen OEMs that are already engaged in this program,” said El-Hallak.

The first systems, cited in El-Hallak’s blog, include:

  • Dell EMC PowerEdge R7525 and R740 rack servers
  • GIGABYTE R281-G30, R282-Z96, G242-Z11, G482-Z54, G492-Z51 systems
  • HPE Apollo 6500 Gen10 System and HPE ProLiant DL380 Gen10 Server
  • Inspur NF5488A5
  • Supermicro A+ Server AS-4124GS-TNR and AS-2124GQ-NART

 

Large, technically-sophisticated users, such as hyperscalers and large enterprises, are not expected to be big buyers of Nvidia-certified systems but smaller companies and newcomers to AI may be attracted to them, say analysts.

“I don’t think it will speed up the adoption of AI per se but it will take some of the variables out of the equation. Especially small-scale deployments will benefit,” said Peter Rutten, research director, infrastructure systems, platforms and technologies group, IDC. “There is a certain appeal for end users to be assured that the hardware and software are optimized and have that package be officially ‘certified.’ It relieves them from having to optimize the system themselves or having to research the various offerings in the market for optimal performance based on hard-to-interpret benchmarks.”

Karl Freund, senior analyst for HPC and deep learning at Moor Insights & Strategy, said, “I think uncertified systems will be fine for large cloud service and e-commerce datacenters; those customers build their own from ODMs, [and] enterprise customers already buy from the OEMs who will certify. [That said] making it simple and easy to stand up hardware AND software from NGC should speed time to value for IT shops.”

Nvidia didn’t present a detailed test list for certification but El-Hallak offered the following description:

“It starts with different workloads. We test for AI training and inference, machine learning algorithms, AI inferencing at the edge – so streaming video, streaming voice – and HPC types of workloads. We essentially establish a baseline, a threshold, if you will, internally. We provide our OEM partners with training tips; they then run the workloads. So we do things like test with different batch sizes, with different precisions, and test across single and multiple GPUs.

“We [also] test many different use cases. We look at computer vision types of use cases. We look at machine translation models. We test the line rate as two nodes are connected together to ensure the networking and the bandwidth are optimal. [F]rom a scalability perspective, we test for a MIG instance (a Multi-Instance GPU slice), so a fraction of a GPU, a single GPU, across multiple GPUs, [and] across multiple nodes. We also test GPUDirect RDMA to ensure there’s a direct path for data exchange between the GPU and third-party devices. Finally, for security, we test for data encryption with built-in security such as TLS and IPsec. We also look into TPM to ensure there’s hardware security on the device,” he said.
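For readers unfamiliar with MIG, the partitioning El-Hallak describes can be sketched with Nvidia's `nvidia-smi` tool. This is a hedged illustration, not part of the certification suite itself: the profile ID used below is specific to the A100-40GB, and the commands require root on a MIG-capable driver and GPU.

```shell
# Enable MIG mode on GPU 0 (takes effect after a GPU reset)
sudo nvidia-smi -i 0 -mig 1

# List the GPU instance profiles this GPU supports (IDs vary by model/driver)
sudo nvidia-smi mig -lgip

# Create two 3g.20gb GPU instances with default compute instances
# (profile ID 9 is 3g.20gb on an A100-40GB; confirm against the -lgip output)
sudo nvidia-smi mig -cgi 9,9 -C

# List the resulting MIG devices visible to applications
nvidia-smi -L
```

Each MIG instance then appears as an independent GPU to containers and frameworks, which is what makes "a fraction of the GPU" a testable deployment target.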

Proven ability to run the NGC catalog is a key element. NGC is Nvidia’s hub for GPU-accelerated software, containerized applications, AI frameworks, domain-specific SDKs, pre-trained models and other resources.
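In practice, "running the NGC catalog" means pulling containerized software from Nvidia's registry and launching it on the system's GPUs. A minimal sketch, assuming Docker plus the NVIDIA Container Toolkit are installed on the host; the image tag (a PyTorch release contemporary with the announcement) is illustrative:

```shell
# Pull a GPU-accelerated framework container from the NGC registry
docker pull nvcr.io/nvidia/pytorch:21.01-py3

# Run it with all GPUs exposed and confirm they are visible to the framework
docker run --gpus all --rm nvcr.io/nvidia/pytorch:21.01-py3 \
    python -c "import torch; print(torch.cuda.device_count())"
```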

El-Hallak argued the growth of datasets, model sizes, and the dynamic nature of AI software and tools were challenging for all AI adopters, and that certified systems would mitigate some of the challenges. He cited use cases in finance, retail, and HPC where datasets and models have grown extremely large. “Walmart generates 2.5 petabytes every hour,” he said.

Nvidia said there is no cost to OEMs or other partners to participate in the Nvidia-certification program. Once certified, systems are eligible for contract software support directly from Nvidia. “This is where the OEM sells to the end user a support contract and that end user gets access directly to Nvidia. There’s a defined SLA (service level agreement) and escalation path. We support the entire software stack. So whether it’s the CUDA toolkits, the drivers, all the workloads that are certified to run on these systems, users have access directly into Nvidia [for] support,” said El-Hallak.

In the briefing, El-Hallak emphasized use of Mellanox interconnect products (Ethernet and InfiniBand), but in response to an email about whether Mellanox interconnect products were required, Nvidia said, “Partners [can] ship whatever networking their customer wants in Nvidia-certified systems and those systems will be eligible for Nvidia’s enterprise support services. During the certification process we need partners [to] use a standardized hardware and software environment to do a fair apples-to-apples comparison. That standardized environment includes specific releases of the OS, Docker, the Nvidia GPU Driver, and Nvidia network hardware and network drivers. If a partner doesn’t have the required Nvidia networking gear in their certification lab, Nvidia can loan it to them.”

Customers who buy the service contract have two paths to obtain support, according to Nvidia:

  • Customer contacts OEM server vendor first. “If OEM server vendor determines that the problem is an Nvidia SW issue, we will request that the customer open a case with Nvidia and provide the OEM server vendor case number as well in case we need to collaborate and reference the case.”
  • Customer contacts Nvidia first. “Customers can contact Nvidia through the Nvidia enterprise support portal, email or phone: https://www.nvidia.com/en-us/support/enterprise/; If Nvidia determines that the problem is an issue that the OEM server vendor is responsible for, we will request that the customer open a case with their OEM server vendor.”

Pricing for the Nvidia-certified systems software support is on a per-system basis and varies with system configuration. As an example, Nvidia says the support cost for ‘volume’ servers featuring two A100 GPUs is about “$4,299 per system with a 3-year support term that customers can renew.”

Both Freund and Rutten think it’s unlikely there will be a significant pricing differential between Nvidia-certified and uncertified systems.

Rutten said, “I think the server OEMs will increase their ASP somewhat for certified systems. But not a lot, because by now the market has learned fairly well how to deploy AI infrastructure and if ASPs go up too much, end users will decide they’d rather do it themselves than pay a premium, especially if they’re looking to deploy a large cluster where a premium is going to add up in absolute dollars.”

It will be interesting to watch how server vendors distinguish Nvidia-certified products from uncertified systems. To some extent, said Freund, there’s not much left to differentiate among GPU-based AI servers anyway beyond price. “I think all hardware vendors are already stuck in that mode, with exceptions such as Cray, and must differentiate on customer service, both before and after the sale,” he said.

Rutten suggested a little more wiggle room: “There are still various differentiators – host processors, storage, interconnects, the infrastructure stack which often has several proprietary layers, RAS features, and the purchasing model (think: HPE GreenLake). And the certified offers will have different performance results, based on these factors.”

Rutten did wonder whether two markets could arise: one of uncertified systems that don’t support Nvidia’s NGC stack, and another of certified systems that do.

“I think that is a distinct possibility. It all depends on the price difference we’re going to see between certified and non-certified systems. If large, then we’ll see a whole secondary market evolve for non-certified systems, which cannot be what Nvidia intended to achieve. I haven’t spoken to OEMs yet, so don’t know what their pricing strategies will be, but that’s the crux of the matter.”

Maribel Lopez, founder of Lopez Research, said, “I think end buyers will feel comfortable with certified hardware but that doesn’t mean they won’t continue to purchase non-certified solutions. The big win for certified offerings comes in creating a set of systems where certain fundamental features and functions are a given. It helps organizations scale faster using the hardware. Certification is only one element of differentiation. Manageability and security are the other areas where HW vendors have to focus their differentiation efforts.”

It’s been a busy few years for Nvidia, whose early focus on GPUs has widened to encompass interconnect technology (Mellanox), AI software (the NGC catalog), CPUs (the pending Arm acquisition), and a foray into the large-system business (DGX A100). The Nvidia-certified system program knits many of those elements together and potentially provides marketing ammunition for Nvidia partner OEMs.

Link to Nvidia blog: https://blogs.nvidia.com/blog/2021/01/26/oem-servers-certified-systems/

HPCwire