Baidu Researcher Pushes GPU Scalability for Deep Learning

By Tiffany Trader

June 20, 2016

Editor’s Note: While Andrew Ng, chief scientist at Baidu, was delivering his ISC keynote this morning on how HPC is supercharging AI, his colleague Greg Diamos, research scientist at Baidu’s Silicon Valley AI Lab, was preparing to present a paper on GPU-based deep learning at the 33rd International Conference on Machine Learning in New York.

Greg Diamos, senior researcher at Baidu’s Silicon Valley AI Lab, is on the front lines of the resurgent field of machine learning. Before joining Baidu, Diamos worked at NVIDIA, first as a research scientist and then as an architect for the GPU streaming multiprocessor and the CUDA software stack. Given that background, it is natural that his research focuses on advancing GPU-based deep learning. Ahead of the paper he is presenting, Diamos answered questions about his research and his vision for the future of machine learning.

HPCwire: How would you characterize the current era of machine learning?

[Photo: Greg Diamos, Baidu]

Diamos: There are two strong forces in machine learning. One is big data, or the availability of massive data sets enabled by the growth of the internet. The other is deep learning, or the discovery of how to train very deep artificial neural networks effectively. The combination of these two forces is driving fast progress on many hard problems.

HPCwire: There’s a lot of excitement around deep learning – is it warranted, and what would you say to those who say they aren’t on board yet?

Diamos: It is warranted. Deep learning has already tremendously advanced the state of the art on real-world problems in computer vision and speech recognition. Many problems in these domains and others that were previously considered too difficult are now within reach.

HPCwire: What’s the relationship between machine learning and high-performance computing and how is it evolving?

Diamos: The ability to train deep artificial neural networks effectively and the abundance of training data have pushed machine learning into a compute-bound regime, even on the fastest machines in the world. We find ourselves in a situation where faster computers directly enable better application-level performance, for example, better speech recognition accuracy.

HPCwire: So you’re presenting a paper at the 33rd International Conference on Machine Learning in New York today. The title is “Persistent RNNs: Stashing Recurrent Weights On-Chip.” First, can you explain what recurrent neural networks are and what problems they solve?

Diamos: Recurrent neural networks are functions that transform sequences of data – for example, they can transform an audio signal into a transcript, or a sentence in English into a sentence in Chinese. They are similar to other deep artificial neural networks, with the key difference being that they operate on sequences (e.g. an audio signal of arbitrary length) instead of fixed-size data (e.g. an image of fixed dimensions).
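
To make the sequence-in, sequence-out idea concrete, here is a minimal sketch of a “vanilla” recurrent layer in NumPy – a generic illustration, not code from Baidu’s systems. The same recurrent weights are applied at every timestep while a hidden state carries context along a sequence of arbitrary length:

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b):
    """Apply one recurrent layer to a sequence of any length.

    inputs : list of input vectors, one per timestep
    W_x    : input-to-hidden weights
    W_h    : hidden-to-hidden (recurrent) weights, reused at every step
    b      : bias vector
    """
    h = np.zeros(W_h.shape[0])            # initial hidden state
    outputs = []
    for x_t in inputs:                    # same weights applied at each step
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        outputs.append(h)
    return outputs

# Example: a 10-step sequence of 16-dimensional inputs, 32 hidden units.
rng = np.random.default_rng(0)
seq = [rng.standard_normal(16) for _ in range(10)]
W_x = rng.standard_normal((32, 16)) * 0.1
W_h = rng.standard_normal((32, 32)) * 0.1
b = np.zeros(32)
states = rnn_forward(seq, W_x, W_h, b)
print(len(states), states[-1].shape)      # 10 (32,)
```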

[Figure 5, Diamos et al., ICML 2016]

HPCwire: Can you provide an overview of your paper? What problem(s) did you set out to solve and what was achieved?

Diamos: It turns out that although deep learning algorithms are typically compute-bound, we have not figured out how to train them at the theoretical limits of performance of large clusters, and there is a big opportunity remaining. The difference between the sustained performance of the fastest RNN training system that we know about at Baidu and the theoretical peak performance of the fastest computer in the world is approximately 2,500x.

The goal of this work is to improve the strong scalability of training deep recurrent neural networks in an attempt to close this gap. We do this by making GPUs 30x more efficient on smaller units of work, enabling better strong scaling. We achieve a 16x increase in strong scaling, going from 8 GPUs without our technique to 128 GPUs with it. Our implementation sustains 28 percent of peak floating point throughput at 128 GPUs over the entire training run, compared to 31 percent on a single GPU.
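
The key to the approach, as the paper’s title suggests, is keeping the recurrent weight matrix stashed in on-chip GPU storage across timesteps rather than re-reading it from off-chip memory at every step. As a rough illustration (a hypothetical back-of-the-envelope sketch with made-up sizes, not numbers from the paper), the Python snippet below compares the arithmetic intensity of a recurrent step when weights are reloaded each timestep versus loaded once and reused across the whole sequence:

```python
# Hypothetical back-of-the-envelope sketch: why re-reading recurrent weights
# from off-chip memory every timestep starves a GPU at small batch sizes,
# and why keeping them on-chip helps. All sizes below are illustrative.

hidden = 1152          # hidden-layer width (illustrative)
batch = 4              # per-GPU mini-batch after strong scaling (illustrative)
timesteps = 700        # sequence length (illustrative)
bytes_per_float = 4

# One recurrent step is roughly a (hidden x hidden) by (hidden x batch) GEMM.
flops_per_step = 2 * hidden * hidden * batch
weight_bytes = hidden * hidden * bytes_per_float
activation_bytes_per_step = 2 * hidden * batch * bytes_per_float  # read + write

# Conventional kernel: weights re-read from off-chip memory every timestep.
intensity_reload = flops_per_step / (weight_bytes + activation_bytes_per_step)

# Persistent kernel: weights loaded on-chip once, reused for all timesteps.
total_flops = flops_per_step * timesteps
total_bytes = weight_bytes + activation_bytes_per_step * timesteps
intensity_persistent = total_flops / total_bytes

print(f"FLOPs per byte, reloading weights each step: {intensity_reload:.1f}")
print(f"FLOPs per byte, weights kept on-chip:        {intensity_persistent:.1f}")
```

In this toy example, reusing the on-chip weights raises the floating point operations performed per byte moved from off-chip memory by roughly two orders of magnitude, which is the kind of effect that makes small per-GPU units of work efficient enough to strong-scale across many GPUs.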

HPCwire: GPUs are closely associated with machine learning, especially deep neural networks. How important have GPUs been to your research and development at Baidu?

Diamos: GPUs are important for machine learning because they have high computational throughput, and much of machine learning, deep learning in particular, is compute limited.

HPCwire: And a related question – what does scalability, from dense servers all the way up to large clusters, enable for deep learning and other machine learning workloads?

Diamos: Scaling training to large clusters enables training bigger neural networks on bigger datasets than are possible with any other technology.

HPCwire: What are you watching in terms of other processing architectures (Xeon Phi Knights Landing, FPGAs, ASICs, DSPs, ARM and so forth)?

Diamos: In the five-year timeframe I am watching two things: peak floating point throughput and software support for deep learning. So far GPUs are leading in both categories, but there is certainly room for competition. If other processors want to compete in this space, they need to be serious about software, in particular, releasing deep learning primitive libraries with simple C interfaces that achieve close to peak performance. Looking farther ahead to the limits of technology scaling, I hope that a processor is developed in the next two decades that enables deep learning model training at 10 PFLOP/s in 300 watts, and 150 EFLOP/s in 25 megawatts.

HPCwire: Baidu is using machine learning for image recognition, speech recognition, the development of autonomous vehicles and more. What does the research you’ve done here help enable?

Diamos: This research allows us to train our models faster, which so far has translated into better application-level performance, e.g. speech recognition accuracy. I think that this is an important message for people who work on high performance computing systems. It provides a clear link between the work that they do to build faster systems and our ability to apply machine learning to important problems.

Relevant links:

ICML paper, “Persistent RNNs: Stashing Recurrent Weights On-Chip”: http://jmlr.org/proceedings/papers/v48/diamos16.pdf

Video about Greg’s work at Baidu: https://www.youtube.com/watch?v=JkXbTOt_JxE
