Data Management at NERSC in the Era of Petascale Deep Learning

By Rob Farber

May 9, 2018

Now that computer scientists at Lawrence Berkeley National Laboratory’s National Energy Research Scientific Computing Center (NERSC) have demonstrated 15 petaflops of deep-learning training performance on the Cray Cori supercomputer, the NERSC staff is working to address the data management issues that arise when running production deep-learning codes at such scale. Existing deep-learning tools were not designed to efficiently ingest or manage the terabyte- to petabyte-sized training sets that scientists can now use on this leadership-class supercomputer. “Enabling the NERSC user community to perform deep learning at scale on Cori,” Quincey Koziol (Staff, Berkeley Lab) observes, “means scientists can use deep learning as part of their leading-edge scientific efforts.”

Thus NERSC staff are working to break new ground in adapting existing deep-learning frameworks to run efficiently at scale on thousands of nodes while giving researchers the ability to create and manage training sets containing tens to hundreds of terabytes of data in a portable fashion. These datasets must be formatted so that Cori can ingest them efficiently at runtime.

Appreciating the magnitude of the petascale data management problem

To appreciate the magnitude of the petascale data management problem, consider that the 9,600 Intel Xeon Phi nodes used in the 15-petaflops deep-learning training run contained over a petabyte of main memory: 921.6 terabytes of DDR4 RAM and 153.6 terabytes of high-bandwidth 3D stacked memory.

The first petascale training runs on the Cray XC40 Cori supercomputer focused on scalability, which left considerable room for groundbreaking research in training on much larger datasets. Kurth et al. noted in their paper “Deep Learning at 15PF: Supervised and Semi-Supervised Classification for Scientific Data” that the climate dataset contained 15 TB of data and the HEP (High Energy Physics) dataset contained 10 million images. With more than a petabyte of RAM spread across 9,600 nodes, Cori can clearly accommodate much larger datasets.
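
As a rough illustration of the headroom involved, the memory figures quoted above can be worked through in a few lines of Python. This is only back-of-the-envelope arithmetic: the variable and function names are hypothetical, and the use of decimal units (1 TB = 1,000 GB) is an assumption.

```python
# Figures quoted in the article. Decimal units (1 TB = 1,000 GB) are an assumption.
NODES = 9_600
DDR4_TOTAL_TB = 921.6
HBM_TOTAL_TB = 153.6
CLIMATE_DATASET_TB = 15.0


def per_node_gb(total_tb, nodes):
    """Convert an aggregate capacity in TB to a per-node capacity in GB."""
    return total_tb * 1000.0 / nodes


total_tb = DDR4_TOTAL_TB + HBM_TOTAL_TB
ddr4_per_node = per_node_gb(DDR4_TOTAL_TB, NODES)
hbm_per_node = per_node_gb(HBM_TOTAL_TB, NODES)
dataset_share = CLIMATE_DATASET_TB / total_tb

print(f"aggregate memory: {total_tb / 1000:.3f} PB")
print(f"DDR4 per node:    {ddr4_per_node:.1f} GB")
print(f"HBM per node:     {hbm_per_node:.1f} GB")
print(f"15 TB dataset as a share of aggregate memory: {dataset_share:.1%}")
```

Under these assumptions the totals work out to 96 GB of DDR4 and 16 GB of high-bandwidth memory per node, and the 15 TB climate dataset occupies only about 1.4 percent of the aggregate memory.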

Less obvious are the asynchronous data management issues that crop up after the data has been ingested and the training run has started. Asynchronous training methods rely on prefetching and heavy communication, so per-node memory usage and network performance are critical to running at the petascale.

Without getting too technical, the 15 petaflops deep learning performance was achieved using a hybrid, asynchronous implementation of the SGD (stochastic gradient descent) numerical optimization method. SGD is a common numerical method used by popular packages such as Caffe (used in the 15 petaflops Cori runs) and TensorFlow.
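
For readers unfamiliar with SGD, the core update can be sketched in a few lines of plain Python. This is a minimal, single-process, synchronous minibatch version fitting a straight line; it is not the hybrid asynchronous implementation used on Cori, and the function name `sgd_linear_fit` and its parameters are hypothetical.

```python
import random


def sgd_linear_fit(xs, ys, lr=0.05, epochs=200, batch_size=4, seed=0):
    """Fit y = w*x + b by minibatch SGD on the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a new random order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = grad_b = 0.0
            for i in batch:
                err = (w * xs[i] + b) - ys[i]
                grad_w += 2.0 * err * xs[i]
                grad_b += 2.0 * err
            # Step against the gradient averaged over the minibatch.
            w -= lr * grad_w / len(batch)
            b -= lr * grad_b / len(batch)
    return w, b


# Noise-free synthetic data drawn from y = 3x + 1.
xs = [i / 10 for i in range(20)]
ys = [3.0 * x + 1.0 for x in xs]
w, b = sgd_linear_fit(xs, ys)
print(f"w = {w:.3f}, b = {b:.3f}")
```

Each step uses only a small batch of samples, which is why the method scales to training sets far larger than a single node's memory.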

Thorsten Kurth (Application Performance Specialist, NERSC) observes that “TensorFlow is the most widely used framework and is therefore a primary optimization target at the moment, but the deep learning software world changes rapidly so that sustainable implementations are necessary. Thus it makes sense to create libraries of optimized kernels that can be used by many deep learning frameworks. This same idea can be used to create methods for the data feeding/IO operations.” These optimized libraries can then be rapidly adapted to newer frameworks such as PyTorch and MXNet, Kurth observes.

Addressing the challenges

Given the popularity of TensorFlow, the NERSC team is working to adapt TensorFlow to run at scale on Cori. The main challenges, Koziol observes, are threefold:

  • TensorFlow uses text or binary images for input rather than HDF5 or another data format typically used by HPC scientists. Koziol and NERSC are currently integrating HDF5 with TensorFlow.
  • TensorFlow uses a client-server model rather than MPI, which is the typical communications package for scientific applications that run on HPC systems. This means that there are no collective operations inside TensorFlow, which can cause performance issues.
  • TensorFlow uses asynchronous training that is very loosely coupled, which means data prefetching is critical to prevent performance from suffering due to data starvation. Conversely, prefetching increases per-node memory consumption, so an appropriate balance must be struck to prefetch “just enough and no more.” Finding that balance without overburdening any node or set of nodes with data in a large (think hundred- or thousand-node) training run is a fertile research area as NERSC brings TensorFlow into a new scaling realm.
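
The prefetching trade-off in the last point can be illustrated with a bounded buffer. The sketch below uses only the Python standard library and is not TensorFlow's input pipeline; the helper name `prefetching_batches` and its `depth` parameter are hypothetical. A background thread reads ahead, and the queue size caps how many batches can wait in memory.

```python
import queue
import threading


def prefetching_batches(batch_source, depth=2):
    """Yield batches from batch_source while a background thread reads ahead.

    At most `depth` batches wait in the queue (plus the one the producer
    is holding while blocked), which bounds the extra memory per node.
    """
    buf = queue.Queue(maxsize=depth)
    done = object()  # sentinel marking the end of the input

    def producer():
        for batch in batch_source:
            buf.put(batch)  # blocks while the queue is full
        buf.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()  # blocks only if the producer has fallen behind
        if item is done:
            return
        yield item


# Usage: wrap any iterable of batches. Here the "batches" are small lists.
batches = ([i, i + 1] for i in range(0, 10, 2))
consumed = list(prefetching_batches(batches, depth=2))
print(consumed)
```

A larger `depth` hides more I/O latency but holds more data in memory; a smaller one saves memory but risks starving the training loop. For brevity, the sketch does not propagate errors raised while reading, which a production version would need to do.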

HDF5 integration

The data management aspects of deep learning are often overlooked as researchers work to speed training and find the right ANN (Artificial Neural Network) architecture(s) to solve complex problems.

In reality, much of a data scientist’s time is spent creating a clean, representative dataset for training. The data challenge becomes that much larger and more unwieldy when creating data for a petascale, deep-learning-capable, leadership-class supercomputer like Cori. Data management is sometimes referred to as the Victorian Era Child of the 21st Century: to be seen and not heard. Unfortunately, the challenges associated with Cori-sized datasets simply cannot be ignored.

Prior to joining NERSC, Koziol was director of core software and high-performance computing at the HDF Group, where he spent 11 years developing the HDF5 I/O middleware package and overseeing the group’s HPC development efforts. This makes Koziol a natural choice to incorporate the versatile HDF5 data model into TensorFlow. HDF5 is a Hierarchical Data Format that can represent very large, complex numerical datasets along with their metadata in a portable format that can be moved between machines. HDF5 1.10.2 is the latest version at the time of writing. The specification is open, and the tools are open source. HDF5 is developed by the HDF Group, a nonprofit corporation.

The benefit of integrating HDF5 into TensorFlow is that scientists can use tools and a data format that have been developed over decades to portably manage even the largest scientific datasets. Portability means the data preprocessing and data cleaning can happen on remote systems using familiar open-source tools and frameworks. Once ready, the data can be moved onto Cori and ingested into TensorFlow. According to Koziol, this helps address the challenge of “How do we get data into the system fast enough?”

Those who are interested can find the scripts and one example of HDF5 integration in the NERSC cori-tf-distributed-examples repository on GitHub at https://github.com/NERSC/cori-tf-distributed-examples.
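
To give a flavor of the HDF5 data model itself, here is a minimal sketch that is independent of the NERSC repository. It assumes the h5py and NumPy packages are installed; the file layout (`train/images`, `train/labels`), the helper names, and the synthetic data are all hypothetical. It stores a small training set together with its metadata and then reads back a single batch without loading the whole file.

```python
import os
import tempfile

import h5py
import numpy as np


def write_training_set(path, images, labels):
    """Store images, labels, and descriptive metadata in one HDF5 file."""
    n = images.shape[0]
    with h5py.File(path, "w") as f:
        grp = f.create_group("train")
        # Chunk along the sample axis so a batch read touches few chunks.
        chunk_shape = (min(16, n),) + images.shape[1:]
        grp.create_dataset("images", data=images, chunks=chunk_shape)
        grp.create_dataset("labels", data=labels)
        grp.attrs["description"] = "synthetic example data"


def read_batch(path, start, size):
    """Read one contiguous batch; only the requested slice is loaded."""
    with h5py.File(path, "r") as f:
        images = f["train/images"][start:start + size]
        labels = f["train/labels"][start:start + size]
    return images, labels


# Usage with a small synthetic dataset of 64 images of 8x8 pixels.
rng = np.random.default_rng(0)
all_images = rng.random((64, 8, 8), dtype=np.float32)
all_labels = np.arange(64, dtype=np.int64) % 2

workdir = tempfile.mkdtemp()
h5_path = os.path.join(workdir, "train.h5")
write_training_set(h5_path, all_images, all_labels)
batch_images, batch_labels = read_batch(h5_path, start=16, size=16)
print(batch_images.shape, batch_labels.shape)
```

Because the data and its metadata travel in one self-describing file, the same file can be prepared on a workstation and later read on a different machine.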

Other work in progress

NERSC is also working to address TensorFlow’s memory consumption issue and speed the collective operations. However, these are non-trivial problems that will take time. As Koziol observes, “The MPI community has been thinking about collectives for about 20 years. TensorFlow is currently only about two years old.”
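
To illustrate what a collective operation buys, the sketch below simulates a sum allreduce with the recursive-doubling pattern in plain Python. It is a toy, single-process simulation rather than MPI or TensorFlow code; the function name is hypothetical, and the power-of-two rank count is a simplifying assumption. Every simulated rank ends up with the global sum after log2(N) exchange rounds, whereas a central server receiving one message per worker needs N - 1 receives.

```python
def allreduce_recursive_doubling(values):
    """Simulate a sum allreduce across len(values) ranks.

    Returns (per_rank_results, rounds). The rank count must be a power of two.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("rank count must be a power of two")
    vals = list(values)
    rounds = 0
    step = 1
    while step < n:
        # In each round, rank r exchanges its partial sum with rank r XOR step,
        # and both partners keep the combined value.
        vals = [vals[r] + vals[r ^ step] for r in range(n)]
        step *= 2
        rounds += 1
    return vals, rounds


# Usage: eight simulated ranks, each contributing one gradient value.
results, rounds = allreduce_recursive_doubling([0, 1, 2, 3, 4, 5, 6, 7])
print(results, rounds)
```

With eight ranks the simulation finishes in three rounds, and the gap between log2(N) rounds and N - 1 sequential receives widens quickly as the node count grows into the thousands.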

Along with the per-node memory consumption challenges that must be addressed when using asynchronous training methods, researchers are also rapidly increasing the complexity of the ANNs they use to solve complex problems. Deeper and more complex ANNs use more parameters, which further exacerbates the per-node memory consumption problem. For example, calculating the gradient for SGD in TensorFlow is becoming an issue even when running on small systems.

The NERSC team has to contend with those issues as well as prefetching and buffering of data used to support the asynchronous operations during training, so the CPU is used as effectively as possible. The large memory of the Intel Xeon Phi nodes helps, as does the fact that the data extraction and training both occur on the CPU, but finding the right configuration can be challenging, Koziol notes. “Sometimes it helps to have a small number of fat nodes,” he observes.

Steps to the future

Koziol emphasizes that deep learning workloads stress the data ingest capabilities of current supercomputers. He hopes future supercomputer designs will incorporate more features to speed data ingest for data-intensive workloads like deep learning.

Current supercomputer designs have focused on burst buffers for checkpoint/restart, a common write-optimized I/O operation used in modeling and simulation software in which the state of the simulation is quickly saved (the checkpoint operation) so that thousands of hours of compute time won’t be lost in the event of a failure. In the unlikely event that something bad does happen, the supercomputer simply reloads the last checkpoint from storage (a restart operation) and continues with the calculation once the problem is fixed. The frequency of the checkpoint operation dictates how much supercomputer runtime will be lost in the event of a failure.
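
The checkpoint/restart pattern described above can be sketched as a toy loop. This is an illustration only: the state layout, the function names, and the JSON file format are hypothetical, and a real simulation would write far larger state to a burst buffer or parallel file system.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temporary file, then atomically rename it."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # a crash mid-write never corrupts the old checkpoint


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if none exists."""
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def run(path, n_steps, checkpoint_every, fail_at=None):
    """Run a toy computation, resuming from the last checkpoint if present."""
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += state["step"]  # stand-in for one unit of real work
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state


# Usage: fail at step 7 with a checkpoint every 5 steps, then restart.
ckpt = os.path.join(tempfile.mkdtemp(), "state.json")
try:
    run(ckpt, n_steps=10, checkpoint_every=5, fail_at=7)
except RuntimeError:
    pass
after_failure = load_checkpoint(ckpt)
final = run(ckpt, n_steps=10, checkpoint_every=5)
print(after_failure, final)
```

In this run the failure at step 7 discards the two steps completed since the checkpoint at step 5, which is the cost that the checkpoint frequency controls.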

As deep learning becomes an ever more common workload on supercomputers, Koziol envisions a future where supercomputers are specifically designed to support faster data ingest for deep learning and other data-intensive workloads.

Summary

The NERSC Cori supercomputer has made the training of deep-learning ANNs a member of the petascale application club. Now the NERSC data management team is working to make this petascale capability available to its users so they can perform leading-edge science. Incorporating HDF5 into TensorFlow is an excellent first step toward making TensorFlow a petascale-capable platform for deep learning.

Rob Farber is a global technology consultant and author with an extensive background in HPC and advanced computational technology that he applies at national labs and commercial organizations. He can be reached at [email protected]
