Continuing the machine learning push that set the tone for this year’s GPU Technology Conference, NVIDIA is refreshing its GPU-accelerated deep learning software in tandem with the 2015 International Conference on Machine Learning (ICML), one of the major international conferences focused on the burgeoning domain. The announcement involves updates to CUDA, cuDNN, and DIGITS.
Altogether the new features deliver significant performance enhancements to help data scientists and researchers create more accurate neural networks through faster model training and more sophisticated model design, says NVIDIA. Without GPUs, training neural nets would be prohibitively slow, Ian Buck, vice president of NVIDIA’s accelerated computing business unit, tells HPCwire.
He adds that deep learning is a big hammer that started in image classification but has moved on to object detection and other higher-order capabilities, for example, determining when a picture depicts anger, an impressionist painting, or a war scene. Other use cases include speech recognition, speech translation, and natural language processing. The technology is also carving out a niche in the automotive space, where deep learning is being explored for car safety systems, and in medicine, where it is used for understanding and detecting brain cancers.
NVIDIA’s Deep Learning GPU Training System (DIGITS) is an interactive deep learning software package with a Web-based interface, built on top of CUDA and cuDNN. The new version, DIGITS 2, adds support for multiple GPUs (up to four) and provides up to two times faster training on a four-GPU node. Where previously DIGITS could train a neural network on only a single GPU, NVIDIA has now mapped the algorithms, math, and computation to scale automatically across multiple GPUs, reducing training time even further.
Networks can typically take up to a week to train, says Buck, but moving to multiple GPUs can reduce that to days or even overnight. He notes that the performance scaling is not linear, owing to the additional communication steps required among the added GPUs, but doubling performance is going to be an important productivity booster for many users.
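To make the idea concrete, here is a minimal conceptual sketch of data-parallel scaling across four GPUs; it is not DIGITS’ actual implementation, and the kernel, sizes, and names are purely illustrative. Each GPU works on its own slice of the mini-batch, and the final synchronization stands in for the gradient exchange that makes the scaling sub-linear.

```cuda
/* Conceptual sketch of data-parallel multi-GPU scaling; not DIGITS'
   actual code. Each GPU runs the same (stand-in) kernel on its own
   slice of the work, then the host synchronizes all devices. */
#include <cuda_runtime.h>

#define N_GPUS 4
#define SLICE  (1 << 20)       /* elements per GPU slice (illustrative) */

__global__ void compute_partial_gradient(const float *x, float *grad, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        grad[i] = 2.0f * x[i]; /* stand-in for a real backward pass */
}

int main(void)
{
    float *x[N_GPUS], *grad[N_GPUS];

    /* Launch the same work on every GPU; the kernels run concurrently. */
    for (int g = 0; g < N_GPUS; ++g) {
        cudaSetDevice(g);
        cudaMalloc(&x[g],    SLICE * sizeof(float));
        cudaMalloc(&grad[g], SLICE * sizeof(float));
        cudaMemset(x[g], 0, SLICE * sizeof(float));
        compute_partial_gradient<<<(SLICE + 255) / 256, 256>>>(x[g], grad[g], SLICE);
    }

    /* Wait for all devices. A real trainer would now exchange and
       average the partial gradients, which is the communication step
       that keeps the scaling sub-linear. */
    for (int g = 0; g < N_GPUS; ++g) {
        cudaSetDevice(g);
        cudaDeviceSynchronize();
    }
    return 0;
}
```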
“Training one of our deep nets for auto-tagging on a single NVIDIA GeForce GTX Titan X takes about sixteen days, but using the new automatic multi-GPU scaling on four Titan X GPUs, training completes in just five days. This is a major advantage and allows us to see results faster, as well as letting us more extensively explore the space of models to achieve higher accuracy,” said Simon Osindero, A.I. architect at Yahoo’s Flickr.
While current support tops out at four GPUs, Buck notes that moving to Pascal and the NVLink interconnect will enable scaling beyond that point with “great scaling performance,” since communication will become much less of a bottleneck, an opportunity NVIDIA is clearly excited about. It’s not hard to imagine applications such as fraud detection and healthcare benefiting from even faster speeds.
cuDNN 3
NVIDIA’s CUDA Deep Neural Network library (cuDNN) provides developers with the core mathematical operations used in deep neural networks, hyper-optimized by NVIDIA engineers. In recent tests, cuDNN 3 yielded up to two times the performance of the previous release on a single TITAN X GPU.
Buck explains that the speedup comes roughly half from tuning for the Maxwell architecture, including optimized functions such as 2D convolutions and FFT-based convolutions, and half from new algorithms. “You should be able to achieve 40 percent additional performance just by replacing the old cuDNN with the new version and then up to a 2X boost using some of the new algorithms, and we are working with the developer community to get that deployed,” he states.
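As a point of reference, the sketch below shows what a drop-in call to cuDNN’s forward convolution looks like. It follows the cuDNN 3-era descriptor signatures (later releases added a tensor-format argument to the filter descriptor); error checking is omitted and the layer sizes are illustrative.

```cuda
/* Minimal sketch of a cuDNN forward convolution (cuDNN 3-era API).
   Error checking omitted; layer sizes are illustrative. */
#include <cudnn.h>
#include <cuda_runtime.h>

int main(void)
{
    cudnnHandle_t handle;
    cudnnCreate(&handle);

    /* Input: one 3-channel 224x224 image; filters: 64 of size 3x3x3. */
    cudnnTensorDescriptor_t xDesc, yDesc;
    cudnnFilterDescriptor_t wDesc;
    cudnnConvolutionDescriptor_t convDesc;
    cudnnCreateTensorDescriptor(&xDesc);
    cudnnCreateTensorDescriptor(&yDesc);
    cudnnCreateFilterDescriptor(&wDesc);
    cudnnCreateConvolutionDescriptor(&convDesc);

    cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                               1, 3, 224, 224);
    cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_FLOAT, 64, 3, 3, 3);
    cudnnSetConvolution2dDescriptor(convDesc, 1, 1, 1, 1, 1, 1,
                                    CUDNN_CROSS_CORRELATION);

    int n, c, h, w;                    /* derive the output shape */
    cudnnGetConvolution2dForwardOutputDim(convDesc, xDesc, wDesc,
                                          &n, &c, &h, &w);
    cudnnSetTensor4dDescriptor(yDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                               n, c, h, w);

    /* Ask cuDNN for an algorithm that needs no extra workspace. */
    cudnnConvolutionFwdAlgo_t algo;
    cudnnGetConvolutionForwardAlgorithm(handle, xDesc, wDesc, convDesc, yDesc,
                                        CUDNN_CONVOLUTION_FWD_NO_WORKSPACE,
                                        0, &algo);

    float *x, *wts, *y;
    cudaMalloc(&x,   1 * 3 * 224 * 224 * sizeof(float));
    cudaMalloc(&wts, 64 * 3 * 3 * 3 * sizeof(float));
    cudaMalloc(&y,   n * c * h * w * sizeof(float));

    const float alpha = 1.0f, beta = 0.0f;
    cudnnConvolutionForward(handle, &alpha, xDesc, x, wDesc, wts, convDesc,
                            algo, NULL, 0, &beta, yDesc, y);
    cudnnDestroy(handle);
    return 0;
}
```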
cuDNN 3 also adds support for 16-bit floating point data storage in GPU memory. “Using the smaller floating point number (FP16 instead of FP32), it’s less accurate but lets you store twice as much data, and in some use cases that tradeoff is very beneficial,” Buck explains. The effect is to enable researchers to train larger and more sophisticated neural networks.
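In practice, opting into FP16 storage is a matter of declaring tensors with the half-precision data type, as the short sketch below illustrates; the batch and image sizes are illustrative, and arithmetic inside the library still proceeds at higher precision.

```cuda
/* Sketch: declaring FP16 storage for a cuDNN tensor. The data sits in
   GPU memory as 16-bit floats, half the footprint of FP32. */
#include <cudnn.h>

int main(void)
{
    cudnnTensorDescriptor_t desc;
    cudnnCreateTensorDescriptor(&desc);

    /* A 256-image batch of 3x224x224 inputs: ~154 MB in FP32,
       ~77 MB when stored as CUDNN_DATA_HALF. */
    cudnnSetTensor4dDescriptor(desc, CUDNN_TENSOR_NCHW, CUDNN_DATA_HALF,
                               256, 3, 224, 224);
    cudnnDestroyTensorDescriptor(desc);
    return 0;
}
```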
“We believe FP16 storage support in NVIDIA’s libraries will enable us to scale our models even further, since it will increase effective memory capacity of our hardware, as well as improve efficiency as we scale training of a single model to many GPUs. This will lead to further improvements in the accuracy of our model,” said Bryan Catanzaro, senior researcher at Baidu Research.
CUDA 7.5
The CUDA toolkit provides a comprehensive development environment for GPU-accelerated applications. The latest update, the CUDA 7.5 toolkit, adds the 16-bit floating point (FP16) data format, doubling the amount of data that can fit in a given amount of memory and easing pressure on memory bandwidth. Notably, at this time only the Tegra X1 offers native FP16 arithmetic, but the next-generation Pascal chip, scheduled to debut in 2016, is on track to operate in a mixed-precision mode. Along with the purported scaling benefits offered by the NVLink GPU interconnect, NVIDIA has said that Pascal would provide five to ten times the performance of Maxwell GPUs for deep learning tasks.
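A minimal sketch of the pattern with CUDA 7.5’s cuda_fp16.h header follows: values live in memory as 16-bit halves and are widened to FP32 for arithmetic. The kernel and sizes are illustrative.

```cuda
/* Sketch of FP16 storage with CUDA 7.5's cuda_fp16.h: data is kept in
   memory as 16-bit halves and converted to FP32 for the math. */
#include <cuda_fp16.h>
#include <cuda_runtime.h>

__global__ void scale_fp16(__half *data, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = __half2float(data[i]);   /* widen for the arithmetic */
        data[i] = __float2half(v * s);     /* store back at half width */
    }
}

int main(void)
{
    const int n = 1 << 20;
    __half *d;
    cudaMalloc(&d, n * sizeof(__half));    /* half the bytes of float */
    scale_fp16<<<(n + 255) / 256, 256>>>(d, 0.5f, n);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```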
Now available as a release candidate, CUDA 7.5 also introduces new cuSPARSE GEMVI routines, which multiply a dense matrix by a sparse vector, and instruction-level profiling to help developers pinpoint performance bottlenecks in their CUDA code.
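For the curious, a call to the single-precision GEMVI variant might look like the following sketch; error checking is omitted, the buffers are left uninitialized, and the sizes are illustrative.

```cuda
/* Sketch of the new cuSPARSE GEMVI routine: a dense-matrix times
   sparse-vector product, y = alpha * A * x + beta * y, with the sparse
   vector x given as nnz values plus their indices. */
#include <cusparse.h>
#include <cuda_runtime.h>

int main(void)
{
    cusparseHandle_t handle;
    cusparseCreate(&handle);

    const int m = 4, n = 8, nnz = 2;           /* illustrative sizes */
    float *A, *xVal, *y;
    int   *xInd;
    cudaMalloc(&A,    m * n * sizeof(float));  /* dense, column-major */
    cudaMalloc(&xVal, nnz   * sizeof(float));  /* nonzero values of x */
    cudaMalloc(&xInd, nnz   * sizeof(int));    /* their positions in x */
    cudaMalloc(&y,    m     * sizeof(float));

    /* GEMVI needs a small scratch buffer; query its size first. */
    int bufSize = 0;
    void *buf;
    cusparseSgemvi_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                              m, n, nnz, &bufSize);
    cudaMalloc(&buf, bufSize);

    const float alpha = 1.0f, beta = 0.0f;
    cusparseSgemvi(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, m, n,
                   &alpha, A, m, nnz, xVal, xInd, &beta, y,
                   CUSPARSE_INDEX_BASE_ZERO, buf);

    cusparseDestroy(handle);
    return 0;
}
```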