NERSC, Intel, Cray Harness the Power of Deep Learning to Better Understand the Universe

September 7, 2018

Sept. 7, 2018 — A Big Data Center collaboration between computational scientists at Lawrence Berkeley National Laboratory’s (Berkeley Lab) National Energy Research Scientific Computing Center (NERSC) and engineers at Intel and Cray has yielded another first in the quest to apply deep learning to data-intensive science: CosmoFlow, the first large-scale science application to use the TensorFlow framework on a CPU-based high performance computing platform with synchronous training. It is also the first to process three-dimensional (3D) spatial data volumes at this scale, giving scientists an entirely new platform for gaining a deeper understanding of the universe.

Cosmological “big data” problems go beyond the simple volume of data stored on disk. Observations of the universe are necessarily finite, and the challenge that researchers face is how to extract the most information from the observations and simulations available. Compounding the issue is that cosmologists typically characterize the distribution of matter in the universe using statistical measures of the structure of matter in the form of two- or three-point functions or other reduced statistics. Methods such as deep learning that can capture all features in the distribution of matter would provide greater insight into the nature of dark energy. Siamak Ravanbakhsh and his colleagues were the first to realize that deep learning could be applied to this problem, as described in the proceedings of the 33rd International Conference on Machine Learning (http://proceedings.mlr.press/v48/ravanbakhshb16.pdf). However, computational bottlenecks when scaling up the network and dataset limited the scope of the problem that could be tackled.
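To make the idea of a reduced statistic concrete: a two-point function measures how strongly density fluctuations at two positions a given distance apart are correlated. The sketch below is a toy illustration only, not the method used by cosmologists or by the CosmoFlow team. It assumes a small periodic grid filled with random values as a stand-in for a simulated density cube, and it measures separations along a single axis for simplicity. The function name `two_point_correlation` is hypothetical.

```python
import random

def two_point_correlation(delta, max_sep):
    """xi(r) = <delta(x) * delta(x + r)> along the first axis of a periodic cubic grid."""
    n = len(delta)
    xi = []
    for r in range(max_sep + 1):
        total = 0.0
        for i in range(n):
            for j in range(n):
                for k in range(n):
                    # Pair each voxel with the voxel r steps away (wrapping around).
                    total += delta[i][j][k] * delta[(i + r) % n][j][k]
        xi.append(total / n**3)
    return xi

random.seed(0)
n = 8  # toy grid size, chosen only to keep the example fast
density = [[[random.lognormvariate(0.0, 1.0) for _ in range(n)]
            for _ in range(n)] for _ in range(n)]

# Overdensity: fractional deviation of each voxel from the mean density.
mean = sum(v for plane in density for row in plane for v in row) / n**3
delta = [[[v / mean - 1.0 for v in row] for row in plane] for plane in density]

xi = two_point_correlation(delta, max_sep=3)
print(f"{n**3} voxels summarized by {len(xi)} numbers")
```

The whole cube collapses to a handful of values, which is why such statistics can discard features that a network trained on the full volume might still use.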

Motivated to address these challenges, CosmoFlow was designed to be highly scalable; to process large, 3D cosmology datasets; and to improve deep learning training performance on modern HPC supercomputers such as the Intel® processor-based Cray® XC40™ Cori supercomputer at NERSC. CosmoFlow is built on top of the popular TensorFlow machine learning framework and uses Python as the front end. The application leverages the Cray PE Machine Learning Plugin to achieve unprecedented scaling of the TensorFlow Deep Learning framework to more than 8,000 nodes. It also benefits from Cray’s DataWarp™ I/O accelerator technology, which provides the I/O throughput required to reach this level of scalability.
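Synchronous data-parallel training, the approach the article attributes to CosmoFlow, can be sketched in general terms as follows. Each worker computes a gradient on its own shard of the data, the gradients are averaged across all workers, and every replica applies the same update. The example is a plain-Python toy that fits a one-parameter model. It does not use TensorFlow or the Cray PE Machine Learning Plugin, and the function names, data, and learning rate are illustrative assumptions.

```python
def local_gradient(w, shard):
    """Gradient of mean squared error for the model y = w * x on one worker's shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)

def synchronous_step(w, shards, lr):
    """One synchronous step: every worker contributes before any replica updates."""
    grads = [local_gradient(w, shard) for shard in shards]  # parallel on a real system
    avg_grad = sum(grads) / len(grads)                      # stands in for an allreduce average
    return w - lr * avg_grad                                # identical update on every replica

# Toy data following y = 3x, split evenly across four hypothetical workers.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[i::4] for i in range(4)]

w = 0.0
for _ in range(200):
    w = synchronous_step(w, shards, lr=0.01)
print(round(w, 4))
```

Because the shards are equal in size, the averaged gradient matches the gradient over the full dataset, so adding workers changes how the work is divided but not the result of each step. The cost is that every step waits for the slowest worker and for the communication that averages the gradients.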

In a technical paper to be presented at SC18 in November, the CosmoFlow team describes the application and initial experiments using dark matter N-body simulations produced with the MUSIC and pycola packages on the Cori supercomputer at NERSC. In a series of single-node and multi-node scaling experiments, the team demonstrated fully synchronous data-parallel training on 8,192 nodes of Cori with 77% parallel efficiency and 3.5 Pflop/s sustained performance.
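A rough reading of the reported scaling figures can be worked out with simple arithmetic. The sketch below assumes that parallel efficiency means achieved speedup divided by ideal linear speedup over a single node. The article does not state the baseline, so the derived numbers are illustrative rather than results from the paper.

```python
nodes = 8192                 # node count reported in the article
parallel_efficiency = 0.77   # reported parallel efficiency
sustained_pflops = 3.5       # reported sustained performance, in Pflop/s

# Assumption: efficiency = achieved speedup / ideal linear speedup over one node.
effective_speedup = parallel_efficiency * nodes

# Average sustained rate per node, converted from Pflop/s to Tflop/s.
per_node_tflops = sustained_pflops * 1000.0 / nodes

print(f"effective speedup: about {effective_speedup:.0f}x")
print(f"per-node sustained rate: about {per_node_tflops:.2f} Tflop/s")
```

Under that assumption, the 8,192 nodes deliver roughly the throughput of 6,300 perfectly scaled nodes, with each node sustaining a little over 0.4 Tflop/s on average.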

“Our goal was to demonstrate that TensorFlow can run at scale on multiple nodes efficiently,” said Deborah Bard, a big data architect at NERSC and a co-author of the technical paper. “As far as we are aware, this is the largest ever deployment of TensorFlow on CPUs, and we think it is the largest attempt to run TensorFlow on the largest number of CPU nodes.”

Early on, the CosmoFlow team laid out three primary goals for this project: science, single-node optimization and scaling. The science goal was to demonstrate that deep learning can be used on 3D volumes to learn the physics of the universe. The team also wanted to ensure that TensorFlow ran efficiently and effectively on a single Intel® Xeon Phi™ processor node with 3D volumes, which are common in science but not so much in industry, where most deep learning applications deal with 2D image data sets. Finally, the team wanted to ensure high efficiency and performance when scaled across thousands of nodes on the Cori supercomputer system.
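One reason 3D volumes are harder on a single node than 2D images is that the work in a convolution grows with both the number of grid points and the number of kernel weights. The following back-of-the-envelope sketch counts multiply-adds for one single-channel convolution layer. The grid side of 128 and the kernel width of 3 are hypothetical values chosen for illustration, not figures from the article.

```python
def conv_multiply_adds(side, kernel, dims):
    """Multiply-adds for one single-channel, same-padded convolution over a side**dims grid."""
    return (side ** dims) * (kernel ** dims)

side, kernel = 128, 3  # hypothetical sizes chosen for illustration
cost_2d = conv_multiply_adds(side, kernel, dims=2)
cost_3d = conv_multiply_adds(side, kernel, dims=3)

print(f"2D image:  {cost_2d:,} multiply-adds")
print(f"3D volume: {cost_3d:,} multiply-adds ({cost_3d // cost_2d}x more)")
```

With these sizes, the 3D case needs 384 times the arithmetic of the 2D case for a single layer, before accounting for the additional memory traffic.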

As Joe Curley, Sr. Director of the Code Modernization Organization in Intel’s Data Center Group, noted, “The Big Data Center collaboration has produced amazing results in computer science through the combination of Intel technology and dedicated software optimization efforts. During the CosmoFlow project, we identified framework, kernel and communication optimization that led to more than 750x performance increase for a single node. Equally as impressive, the team solved problems that limited scaling of deep learning techniques to 128 to 256 nodes – to now allow the CosmoFlow application to scale efficiently to the 8,192 nodes of the Cori supercomputer at NERSC.”

“We’re excited by the results and the breakthroughs in artificial intelligence applications from this collaborative project with NERSC and Intel,” said Per Nyberg, vice president of market development, artificial intelligence and cloud at Cray. “It is exciting to see the CosmoFlow team take advantage of unique Cray technology and leverage the power of a Cray supercomputer to effectively scale deep learning models. It is a great example of what many of our customers are striving for in converging traditional modeling and simulation with new deep learning and analytics algorithms, all on a single, scalable platform.”

Prabhat, Group Leader of Data & Analytics Services at NERSC, added, “From my perspective, CosmoFlow is an exemplar project for the Big Data Center collaboration. We’ve truly leveraged competencies from various institutions to solve a hard scientific problem and enhance our production stack, which can benefit the broader NERSC user community.”

In addition to Bard and Prabhat, co-authors on the SC18 paper include Amrita Mathuriya, Lawrence Meadows, Lei Shao, Tuomas Karna, John Pennycook, Jason Sewall, Nalini Kumar and Victor Lee from Intel; Peter Mendygral, Diana Moise, Kristyn Maschhoff and Michael Ringenburg from Cray; Siyu He and Shirley Ho from the Flatiron Institute; and James Arnemann from UC Berkeley.

About NERSC and Berkeley Lab

The National Energy Research Scientific Computing Center (NERSC) is a U.S. Department of Energy Office of Science User Facility that serves as the primary high-performance computing center for scientific research sponsored by the Office of Science. Located at Lawrence Berkeley National Laboratory, the NERSC Center serves more than 6,000 scientists at national laboratories and universities researching a wide range of problems in combustion, climate modeling, fusion energy, materials science, physics, chemistry, computational biology, and other disciplines. Berkeley Lab is a DOE national laboratory located in Berkeley, California. It conducts unclassified scientific research and is managed by the University of California for the U.S. DOE Office of Science.


Source: Kathy Kincade, NERSC
