Nvidia’s Mammoth Volta GPU Aims High for AI, HPC

By Tiffany Trader

May 10, 2017

At Nvidia’s GPU Technology Conference (GTC17) in San Jose, Calif., this morning, CEO Jensen Huang announced the company’s much-anticipated Volta architecture and flagship high-end GPU, the Tesla V100, noting that it took several thousand engineers several years to create, at an approximate development cost of $3 billion.

One thing is undeniable about the Volta V100: it is a giant chip, 33 percent larger than the Pascal P100 and once again “the biggest GPU ever made.” Fabricated by TSMC on a custom 12-nm FFN high-performance manufacturing process, the V100 GPU squeezes 21.1 billion transistors and almost 100 billion via connectors onto an 815 mm² die, about the size of an Apple Watch, said Huang.

“It is at the limits of photolithography,” Huang told the crowd. “You can’t make a chip any bigger than this because transistors would fall on the ground. Every single transistor that is possible to make by today’s physics was crammed into this processor.”

“To make one chip work per 12-inch wafer, I would characterize as unlikely,” added the CEO. “And so the fact that this was manufactured was a great feat.”

This is a domain-specific chip, said Jonah Alben, senior vice president of GPU engineering at Nvidia. “This chip can run games very well if we want it to, but the focus [of the V100] is to be a great chip for AI and for HPC, so we dedicated all the resources we could until it was illegal to do more.”

“The first thing to know about Volta is it[’s] a giant leap for machine learning,” followed Luke Durant, principal engineer for CUDA software at Nvidia. “[However,] we still are completely focused on high-performance computing. Across the board we’re seeing about a 1.5x speedup as compared to Pascal, just one year ago.”

Volta is a major launch for Nvidia, but not exactly a surprise. Back in 2014, the architecture was tapped to power the next-generation CORAL supercomputers, Summit and Sierra, in partnership with IBM, Mellanox and the Department of Energy. Those computers, expected to reach at least 200 petaflops of performance, are now due to be installed later this year into early 2018.

The new V100 touts spec’d peak performance of 7.5 teraflops double-precision, 15 teraflops single-precision, and 30 teraflops half-precision. That is an increase of nearly 42 percent in peak flops over the Pascal-based P100 announced one year earlier.
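The 42 percent figure can be checked with a little arithmetic. The sketch below assumes the Tesla P100's published peak ratings of 5.3, 10.6 and 21.2 teraflops for double, single and half precision; those numbers are not stated in this article, so treat them as an assumption. The V100 values are the ones quoted above.

```python
# Peak teraflops by precision.
# V100 values come from the article; P100 values are an ASSUMPTION
# (Nvidia's published Pascal ratings), not figures given in the article.
V100_TFLOPS = {"fp64": 7.5, "fp32": 15.0, "fp16": 30.0}
P100_TFLOPS = {"fp64": 5.3, "fp32": 10.6, "fp16": 21.2}


def percent_increase(new: float, old: float) -> float:
    """Return the relative increase of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


# Generation-over-generation gain for each precision.
gains = {
    precision: percent_increase(V100_TFLOPS[precision], P100_TFLOPS[precision])
    for precision in V100_TFLOPS
}

for precision, gain in gains.items():
    print(f"{precision}: +{gain:.1f}%")
```

Under those assumed P100 ratings, each precision lands between 41 and 42 percent, which is consistent with the "nearly 42 percent" claim.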

The Volta architecture introduces a brand new type of processor, Tensor Core, designed to accelerate AI workloads. With 640 Tensor Cores (8 per SM), V100 delivers 120 teraflops of deep learning performance, providing 6-12 times higher peak teraflops for Tensor operations compared with previous-generation silicon.
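One way to see where a figure near 120 teraflops could come from is to multiply it out. The sketch below assumes each Tensor Core performs 64 fused multiply-adds per clock (one 4x4x4 matrix operation) and that the GPU boosts to roughly 1,455 MHz. Neither number appears in this article, so both are labeled as assumptions; only the Tensor Core count is taken from the text.

```python
TENSOR_CORES = 640             # from the article
FMA_PER_CORE_PER_CLOCK = 64    # ASSUMPTION: one 4x4x4 matrix multiply-accumulate per clock
FLOPS_PER_FMA = 2              # a fused multiply-add counts as two floating-point ops
BOOST_CLOCK_HZ = 1.455e9       # ASSUMPTION: approximate boost clock, not in the article


def peak_tensor_tflops(cores: int, fma_per_clock: int, clock_hz: float) -> float:
    """Theoretical peak Tensor throughput in teraflops."""
    return cores * fma_per_clock * FLOPS_PER_FMA * clock_hz / 1e12


tensor_tflops = peak_tensor_tflops(TENSOR_CORES, FMA_PER_CORE_PER_CLOCK, BOOST_CLOCK_HZ)
print(f"Peak Tensor throughput: {tensor_tflops:.1f} teraflops")
```

With these inputs the result comes out just under 120 teraflops. This is a theoretical peak; sustained throughput on real workloads depends on memory bandwidth and kernel design.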

Volta is also slated to provide up to 60 tera-ops of INT8 performance. Nvidia kept the INT8 instructions to maintain compatibility with existing code bases, and said that having a dedicated integer unit on Volta would help developers write machine learning kernels.

Tesla comparison over the last five years. Source: Nvidia.

“With the V100, the most important statement isn’t the raw performance, although Nvidia managed to raise eyebrows with that,” commented Intersect360 Research CEO Addison Snell. “It’s that they are designing chips for double-precision 64-bit performance, single-precision 32-bit performance, or tensor performance, in the same package, so a single processor targets a range of applications in AI and HPC.”

Volta comes with 6MB of L2 cache and 16GB of HBM2 memory, providing 900 GB/s of bandwidth. The SXM2 form factor V100 features NVLink2 connectivity with nearly twice the throughput of the prior generation NVLink, going from 160 GB/s to 300 GB/s. Designers accomplished this by adding 50 percent more links and running them 28 percent faster.
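The NVLink numbers can be decomposed into a link-count gain and a per-link gain. The sketch below uses only figures quoted in the paragraph above; the function name is illustrative. Note that the aggregate numbers imply a per-link gain of 25 percent rather than 28 percent. One possible explanation, offered here as an inference and not something the article states, is that the 28 percent refers to the raw signaling rate while the GB/s figures are rounded bandwidth ratings.

```python
NVLINK1_TOTAL_GBS = 160.0   # aggregate GB/s, from the article
NVLINK2_TOTAL_GBS = 300.0   # aggregate GB/s, from the article
LINK_COUNT_GAIN = 1.5       # "50 percent more links", from the article


def implied_per_link_gain(old_total: float, new_total: float, link_gain: float) -> float:
    """Per-link speedup needed to reach new_total given link_gain times as many links."""
    return (new_total / old_total) / link_gain


total_gain = NVLINK2_TOTAL_GBS / NVLINK1_TOTAL_GBS
per_link_gain = implied_per_link_gain(NVLINK1_TOTAL_GBS, NVLINK2_TOTAL_GBS, LINK_COUNT_GAIN)

print(f"Aggregate gain: {total_gain:.3f}x")
print(f"Implied per-link gain: {per_link_gain:.2f}x")
```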

Like the Pascal GP100, the Volta GV100 incorporates 64 FP32 cores and 32 FP64 cores per SM; however, the new GPU has 80 SMs compared with 56 on the GP100. It thus has many more registers and supports more threads, warps, and thread blocks compared with previous Tesla generation GPUs, according to Nvidia.
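The per-SM figures above translate directly into total core counts, and from there into theoretical peak flops. The sketch below takes the SM and per-SM core counts from the article; the 1.455 GHz boost clock is an assumption that the article does not state, and the `GpuConfig` class is purely illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuConfig:
    """Illustrative container for the SM layout described in the article."""
    name: str
    sm_count: int
    fp32_per_sm: int
    fp64_per_sm: int

    @property
    def fp32_cores(self) -> int:
        return self.sm_count * self.fp32_per_sm

    @property
    def fp64_cores(self) -> int:
        return self.sm_count * self.fp64_per_sm


def peak_tflops(cores: int, clock_ghz: float) -> float:
    """Theoretical peak: each core retires one FMA (2 flops) per clock."""
    return cores * 2 * clock_ghz / 1000.0


GV100 = GpuConfig("GV100", sm_count=80, fp32_per_sm=64, fp64_per_sm=32)
GP100 = GpuConfig("GP100", sm_count=56, fp32_per_sm=64, fp64_per_sm=32)

ASSUMED_V100_CLOCK_GHZ = 1.455  # ASSUMPTION: boost clock, not given in the article

for gpu in (GV100, GP100):
    print(f"{gpu.name}: {gpu.fp32_cores} FP32 cores, {gpu.fp64_cores} FP64 cores")

v100_fp32_tflops = peak_tflops(GV100.fp32_cores, ASSUMED_V100_CLOCK_GHZ)
v100_fp64_tflops = peak_tflops(GV100.fp64_cores, ASSUMED_V100_CLOCK_GHZ)
print(f"V100 peak (assumed clock): {v100_fp32_tflops:.1f} FP32, {v100_fp64_tflops:.1f} FP64 teraflops")
```

With the assumed clock, the totals round to roughly the 15 and 7.5 teraflops quoted for the V100.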

Major features of the Volta SM include:

+ New mixed-precision FP16/FP32 Tensor Cores purpose-built for deep learning matrix arithmetic.

+ Enhanced L1 data cache for higher performance and lower latency.

+ Streamlined instruction set for simpler decoding and reduced instruction latencies.

+ Higher clocks and higher power efficiency.

“It has a completely different instruction set than Pascal,” remarked Bryan Catanzaro, vice president, Applied Deep Learning Research at Nvidia. “It’s fundamentally extremely different. Volta is not Pascal with Tensor Core thrown onto it – it’s a completely different processor.”

Catanzaro, who returned to Nvidia from Baidu six months ago, emphasized how the architectural changes wrought greater flexibility and power efficiency.

“It’s worth noting that Volta has the biggest change to the GPU threading model basically since I can remember and I’ve been programming GPUs for a while,” he said. “With Volta we can actually have forward progress guarantees for threads inside the same warp even if they need to synchronize, which we have never been able to do before. This is going to enable a lot more interesting algorithms to be written using the GPU, so a lot of code that you just couldn’t write before because it potentially would hang the GPU based on that thread scheduling model is now possible. I’m pretty excited about that, especially for some sparser kinds of data analytics workloads; there’s a lot of use cases where we want to be collaborating between threads in more complicated ways and Volta has a thread scheduler [that] can accommodate that.

“It’s actually pretty remarkable to me that we were able to get more flexibility and better performance-per-watt. Because I was really concerned when I heard that they were going to change the Volta thread scheduler that it was going to give up performance-per-watt, because the reason that the old one wasn’t as flexible is you get a lot of energy efficiency by ganging up threads together, and having the capability to let the threads be more independent then makes me worried that performance-per-watt is going to be worse. But actually it got better, so that’s pretty exciting.”

Added Alben: “This was done through a combination of process and architectural changes but primarily architecture. This was a very significant rewrite of the processor architecture. The Tensor Core part is obviously very [significant] but even if you look at FP32 and FP64, we’re talking about 50 percent more performance in the same power budget as where we’re at with Pascal. Every few years, we say, hey we discovered something really cool. We basically discovered a new architectural approach we could pursue that unlocks even more power efficiency than we had previously. The Volta SM is a really ambitious design; there’s a lot of different elements in there, obviously Tensor Core is one part, but the architectural power efficiency is a big part of this design.”

Nvidia showed off three different V100 form factors at GTC: the 300 watt SXM2 (mezzanine) module; an inferencing accelerator for hyperscale that is a 150 watt full height, half length (FHHL) PCIe card about the size of a CD case; and the standard PCIe two-slot, full-length card.

DGX-1 with eight V100s

V100 GPUs will be available starting next quarter, according to Nvidia. Customers can pre-order the Volta-series DGX-1 box now for $149,000, $20,000 more than the list price for the Pascal-equipped version.

In addition to the coming DGX-1 Volta refresh, Nvidia also released the new DGX Station. Billed as a “personal supercomputer for AI development,” DGX Station provides four NVLink-connected Tesla V100s to deliver 480 (peak) Tensor teraflops in a 1,500 watt water-cooled chassis for $69,000.

Riding the wave of AI and HPC announcements made this week and on the heels of a stronger-than-expected first quarter (recording revenue of $1.94 billion with record datacenter sales of $409 million), Nvidia shares were up 18 percent as of close of market Wednesday, reaching $121.29, an all-time high.
