Scaling deep neural networks for a fixed problem onto large systems with thousands of nodes is challenging. Indeed, it is one of several hurdles confronting efforts to converge artificial intelligence (AI) and HPC. Pradeep Dubey, Intel Fellow and director of Intel’s Parallel Computing Lab (PCL), has written a blog describing Intel’s efforts to better understand and solve that problem, among others, and promises more details to come at SC2017.
Dubey’s blog, posted last week – Ushering in the convergence of AI and HPC: What will it take? – acknowledges that the path forward is uneven. Beyond the scaling problem mentioned above, Dubey writes, “Adding to the dilemma, unlike a traditional HPC programmer who is well-versed in low-level APIs for parallel and distributed programming, such as OpenMP or MPI, a typical data scientist who trains deep neural networks on a supercomputer is likely only familiar with high-level scripting-language based frameworks like Caffe or TensorFlow.”
No surprise, Intel is vigorously attacking the scaling problem. “Working in collaboration with researchers at the National Energy Research Scientific Computing Center (NERSC), Stanford University, and the University of Montreal, we have achieved a scaling breakthrough for deep learning training. We have scaled to over 9,000 Intel Xeon Phi processor based nodes on the Cori supercomputer, while staying under the accuracy and small batch-size constraints of today’s popular stochastic gradient descent variants method using a hybrid parameter update scheme. We will share this work at the upcoming Supercomputing Conference in Denver, November 12 – 17, 2017.”
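The blog does not spell out the hybrid parameter update scheme, but the general pattern it builds on – synchronous data-parallel training, where each node computes a gradient on its own shard of the batch and the gradients are averaged (an all-reduce in MPI terms) before every node applies the same update – can be sketched in a few lines. This is an illustrative toy on a least-squares problem, not Intel’s actual scheme; all names and the 4-worker split are hypothetical.

```python
import numpy as np

def local_gradient(w, X, y):
    # Gradient of the mean squared error on this worker's shard:
    # d/dw (1/n)||Xw - y||^2
    n = len(y)
    return 2.0 / n * X.T @ (X @ w - y)

def sync_sgd_step(w, shards, lr=0.1):
    # One synchronous data-parallel step: each "worker" computes a
    # gradient on its shard, the gradients are averaged (the all-reduce),
    # and every worker applies the identical update.
    grads = [local_gradient(w, X, y) for X, y in shards]
    g = np.mean(grads, axis=0)
    return w - lr * g

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

# Split one global batch across 4 hypothetical workers.
shards = list(zip(np.split(X, 4), np.split(y, 4)))

w = np.zeros(3)
for _ in range(200):
    w = sync_sgd_step(w, shards)
```

With equal-sized shards, the averaged gradient is mathematically identical to the full-batch gradient, which is why synchronous schemes preserve the single-node convergence behavior; the scaling difficulty Dubey describes arises because keeping the *global* batch small while adding nodes shrinks each worker's share of the work.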
The blog links to an interesting paper, On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima, written by Intel and Northwestern University researchers, that targets the scaling problem.
Here’s an excerpt from the abstract: “We investigate the cause for this generalization drop in the large-batch regime and present numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions—and as is well known, sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support a commonly held view that this is due to the inherent noise in the gradient estimation. We discuss several strategies to attempt to help large-batch methods eliminate this generalization gap.”
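The “inherent noise in the gradient estimation” the authors invoke is easy to see numerically: a mini-batch gradient is an estimate of the full-batch gradient, and its variance shrinks as the batch grows. The sketch below measures that on a toy least-squares problem (all variable names and sizes are hypothetical, chosen only to make the effect visible).

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 10_000, 5
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)
w = rng.normal(size=d)  # an arbitrary point in parameter space

def minibatch_grad(batch_size):
    # Mean-squared-error gradient on a randomly sampled mini-batch.
    idx = rng.integers(0, N, size=batch_size)
    Xb, yb = X[idx], y[idx]
    return 2.0 / batch_size * Xb.T @ (Xb @ w - yb)

def grad_noise(batch_size, trials=500):
    # Average squared distance of mini-batch gradients from the
    # full-batch gradient: a direct measure of estimation noise.
    g_full = 2.0 / N * X.T @ (X @ w - y)
    return np.mean([np.sum((minibatch_grad(batch_size) - g_full) ** 2)
                    for _ in range(trials)])

small_batch_noise = grad_noise(32)
large_batch_noise = grad_noise(2048)
```

The large-batch estimate tracks the full gradient far more tightly. In the paper’s framing, that lost noise is what lets small-batch SGD escape the basins of sharp minimizers, while large-batch runs settle into them and generalize worse.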
The blog is a good read and provides a glimpse into Intel’s efforts and thinking. According to Dubey’s posted bio, his research focus is computer architectures that efficiently handle new compute- and data-intensive application paradigms for the future computing environment. He holds over 36 patents, has published over 100 technical papers, won the Intel Achievement Award in 2012 for Breakthrough Parallel Computing Research, and received the Outstanding Electrical and Computer Engineer Award from Purdue University in 2014.