PEARC21 Panel: Wafer-Scale-Engine Technology Accelerates Machine Learning, HPC

By Ken Chiacchia, Pittsburgh Supercomputing Center/XSEDE

July 21, 2021

Early use of Cerebras’ CS-1 server and wafer-scale engine (WSE) has demonstrated promising acceleration of machine-learning algorithms, according to participants in the Scientific Research Enabled by CS-1 Systems panel, presented at the PEARC21 conference. The panel, which for the first time brought together leading teams employing the CS-1 at the Pittsburgh Supercomputing Center (PSC), Argonne National Laboratory (ANL) and Lawrence Livermore National Laboratory (LLNL), charted out the promise of the technology as well as the next steps in applying it to artificial intelligence and HPC projects that do not utilize AI.

The PEARC conference series provides a forum for discussing challenges, opportunities and solutions among the broad range of participants in the research computing community. This community-driven effort builds on successes of the past, and aims to grow and be more inclusive by involving additional local, regional, national and international cyberinfrastructure and research computing partners spanning academia, government and industry. PEARC21, Evolution Across All Dimensions, was offered this year as a virtual event (July 19-22).

Cerebras Systems Technology Summary and Outlook

Moderated by co-organizer Sergiu Sanielevici, director of user support at PSC, the panel kicked off with a presentation by co-organizer Natalia Vassilieva, director of product at Cerebras.

While machine learning has progressed phenomenally over the past decade, these advances have come at a computational cost, Vassilieva explained. Ballooning memory requirements for training have driven proportional increases in required petaflop/s-days, with OpenAI’s GPT-3 language model requiring about 116 days to train on 1,024 Nvidia V100 GPUs.

“Modern models need much more compute than can be [fit] on a single processor,” she said, and scaling for distributed training is far from ideal. “As you scale out to multiple devices, at some point you start to observe communication bottlenecks” and other limitations. “We need more compute per device, and the ability to rely less on data parallel training.”

“I think everybody understands that the current approach…is not sustainable,” she said.

The CS-1, and the newly introduced second-generation WSE CS-2, contain 400,000 and 850,000 cores per WSE, providing 18 or 40 GB of memory, respectively, one cycle away from the compute element – a memory bandwidth of 9 or 20 PByte/s. The extreme bandwidth within the chip (100 or 220 Pbit/s) also avoids a bottleneck in conventional systems. By harnessing unprecedented local memory and compute cores, the WSEs offer computational scaling with vastly reduced bottlenecks. The system represents a flexible and dynamic solution to the challenges of fine-grained sparsity and conditional and dynamic machine-learning techniques, Vassilieva said.
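To put those totals in per-core terms, the quoted memory and bandwidth figures can be divided by the core counts. The short Python sketch below does only that arithmetic. It assumes decimal units (1 GB = 10^9 bytes, 1 PByte/s = 10^15 bytes/s) and an even split across cores, neither of which was stated in the panel.

```python
# Per-core arithmetic from the figures quoted in the article.
# Assumptions (not from the panel): decimal units and an even split across cores.

SYSTEMS = {
    "CS-1": {"cores": 400_000, "memory_gb": 18, "bandwidth_pbyte_s": 9},
    "CS-2": {"cores": 850_000, "memory_gb": 40, "bandwidth_pbyte_s": 20},
}

def per_core(spec):
    """Return (memory in kB per core, memory bandwidth in GB/s per core)."""
    mem_kb = spec["memory_gb"] * 1e9 / spec["cores"] / 1e3
    bw_gb_s = spec["bandwidth_pbyte_s"] * 1e15 / spec["cores"] / 1e9
    return mem_kb, bw_gb_s

for name, spec in SYSTEMS.items():
    mem_kb, bw_gb_s = per_core(spec)
    print(f"{name}: {mem_kb:.1f} kB and {bw_gb_s:.1f} GB/s per core")
```

Under those assumptions, both generations work out to roughly 45 to 47 kB of memory and about 22 to 24 GB/s of memory bandwidth per core.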

The Promise of CS-1/WSE for Research in Science and Engineering

Paola Buitrago, co-organizer and director of AI & Big Data at PSC and principal investigator of the center’s CS-1 based Neocortex, surveyed the state of the art in machine learning – and the challenges the field currently faces. Neocortex explores a slightly different approach to leveraging the CS-1 than other deployments, using an HPE SuperDome Flex server as a single, high-memory CPU intermediary for both user access and federation with PSC’s larger Bridges-2 system. The unique high-memory configuration is intended to offer advantages in combined Big Data/AI applications.

Improvements in neural language models’ performance have required more compute and more memory, with the number of parameters of recent transformer-type networks surpassing hundreds of billions or trillions. Generative adversarial networks, domain adaptation and reinforcement learning approaches have also added complexity, Buitrago said.

Nor is the expense of improved ML models limited to computation, she explained. An analysis based on information released by Google estimated the monetary cost of training the 175-billion parameter GPT-3 at about $10 million. At that scale, reducing ImageNet error rate from 11.5 percent to 1 percent would represent $100 billion billion – $10^20 – she added.

“As models increase in size and the compute requirements increase, [we] also find that to further improve the models’ performance … becomes prohibitive with existing approaches,” Buitrago said. “The field is calling for a change in paradigm…CS-1, certainly as it was conceived, proposes a different approach to machine-learning training” by offering ways around the compute and memory limitations of current systems when used for AI training.

Scientific ML on Disaggregated Cognitive Simulation HPC Platforms

On behalf of Brian Van Essen, informatics group leader at LLNL, Vassilieva presented the Livermore group’s federation of the Lassen massively parallel compute cluster with a CS-1 WSE. The goal, she said, is to introduce machine learning steps into traditional simulations in an intimate and iterative way that speeds attainment of accuracy. The team is using inertial confinement fusion at LLNL’s National Ignition Facility as a testbed for the approach, called “cognitive simulation.”

Simulations must simplify natural phenomena to bring the computational burden down to a manageable level. Often, this means that their predictions don’t match experimental findings. Cognitive simulation improves a simulation’s accuracy using machine learning at different levels of a simulation job.

At the “in the loop” level, ML inferences are made at every time step of the computation. The “on the loop” level consists of ML training or inference at every ~1,000 time steps. “Around the loop” training or inference happens with each simulation. Finally, with the addition of experimental data, “outside the loop” transfer learning occurs every ~10,000 simulations. The combination allows frequent training and potentially very-high-frequency inference to accelerate the simulation. The approach leverages vast quantities of data generated by the simulations, couples simulations with experimental results and provides more accurate predictions for the complex multi-physics nature of ICF than possible with traditional simulation-only modeling.
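The four cadences can be pictured as a simple schedule. The Python sketch below counts how often each level would fire over a campaign of simulations. The cadences come from the description above, while the function name, the counting logic and the example campaign size are hypothetical and are not taken from the LLNL workflow.

```python
# Toy schedule for the four "cognitive simulation" levels.
# Cadences (every step, ~1,000 steps, every simulation, ~10,000 simulations)
# come from the article; names and counting logic are hypothetical.

ON_LOOP_EVERY_STEPS = 1_000        # "on the loop" cadence
OUTSIDE_LOOP_EVERY_SIMS = 10_000   # "outside the loop" cadence

def count_ml_events(n_simulations, steps_per_simulation):
    """Count how often each ML level would fire over a simulation campaign."""
    counts = {"in": 0, "on": 0, "around": 0, "outside": 0}
    for sim in range(1, n_simulations + 1):
        # In the loop: one inference per time step.
        counts["in"] += steps_per_simulation
        # On the loop: training or inference every ~1,000 time steps.
        counts["on"] += steps_per_simulation // ON_LOOP_EVERY_STEPS
        # Around the loop: once per completed simulation.
        counts["around"] += 1
        # Outside the loop: transfer learning every ~10,000 simulations.
        if sim % OUTSIDE_LOOP_EVERY_SIMS == 0:
            counts["outside"] += 1
    return counts

print(count_ml_events(n_simulations=20_000, steps_per_simulation=5_000))
```

Even in this toy count, the innermost level dominates by several orders of magnitude, which is why inference speed matters most at the “in the loop” level.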

Stream-AI-MD: Streaming AI-Driven Adaptive Molecular Simulations for Heterogeneous Computing Platforms

Arvind Ramanathan of ANL and the University of Chicago presented an application of machine learning to another traditional HPC domain, that of molecular simulations.

“The general idea is we want to implement…machine learning training on the fly, as simulations are running,” he said.

Pursued by traditional HPC means, multiscale simulations, for example those of spike-protein dynamics in the SARS-CoV-2 viral particle, can generate hundreds of terabytes of data. The visualization task is huge, he said: “It’s humanly impossible to peek into biologically interesting events.”

Inserting an iterative, ML-driven loop between successive simulations and analytics has proved a promising means of refining model results, predicting folded, unfolded and misfolded states without human supervision. The method has to date improved resolution and accuracy of atomic contacts within the protein structure with a 50X speedup in sampling folded states. Using the approach, the team acquired a 10,000-fold acceleration of sampling effectiveness compared with traditional molecular dynamics simulations running on specialized hardware.

Atomistic Machine Learning Potentials on Neocortex

“We want to model what we call a potential energy surface, and we want to do it at quantum accuracy,” said Keith Phuthi, a PhD student working with Matthew Guttenberg in Venkat Viswanathan’s group at Carnegie Mellon University. Traditional “density functional theory cannot scale to [the] hundreds to thousands of atoms” needed in many problems in physics, chemistry and materials science. PSC’s Neocortex offered a route beyond that limitation, he added.

The empirical potentials method provides a much simpler analytic form that reduces the cost of the computation in terms of steps per atom. But it is much more approximate and often doesn’t capture details in a molecule’s photoelectronic properties that are required for certain applications. Machine learning potentials offer a bridge between the two methods, offering a better balance of accuracy with computational cost. But current GPU systems limit the data that can be used to train a model, with poor scaling to boot.

By computing invariant atomic features on its SuperDome Flex Server and feeding the data into a CS-1 for prediction, Neocortex enabled Phuthi to model the energy potential of each atom in a given compound as a neural network, summing them to obtain the potential for the molecule.
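The structure described here, one network evaluated per atom with the outputs summed, can be sketched in a few lines of Python. The network size, random weights and feature values below are placeholders chosen for illustration. They are not the CMU group's model, and a real potential would be trained against reference quantum calculations.

```python
import math
import random

# Sketch: molecular energy as a sum of per-atom neural-network energies.
# Network size, random weights and feature values are placeholders, not the
# CMU group's model; a real model would be trained against reference data.

random.seed(0)
N_FEATURES, N_HIDDEN = 4, 8
W1 = [[random.gauss(0, 1) for _ in range(N_HIDDEN)] for _ in range(N_FEATURES)]
B1 = [random.gauss(0, 1) for _ in range(N_HIDDEN)]
W2 = [random.gauss(0, 1) for _ in range(N_HIDDEN)]

def atomic_energy(features):
    """One-hidden-layer network mapping an atom's invariant features to an energy."""
    hidden = [
        math.tanh(sum(f * W1[i][j] for i, f in enumerate(features)) + B1[j])
        for j in range(N_HIDDEN)
    ]
    return sum(h * w for h, w in zip(hidden, W2))

def total_energy(atoms):
    """Sum the per-atom contributions to get the energy of the whole molecule."""
    return math.fsum(atomic_energy(features) for features in atoms)

# Five atoms with made-up feature vectors; reordering them leaves the sum unchanged.
atoms = [[random.gauss(0, 1) for _ in range(N_FEATURES)] for _ in range(5)]
print(total_energy(atoms), total_energy(list(reversed(atoms))))
```

Because the same network is applied to every atom and the contributions are summed, the total in this sketch does not depend on the order of the atoms, and the same weights can be applied to molecules of different sizes.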

“Our goal with the early programming [was] to get to this target where we work with much bigger datasets and much bigger molecules than are typically trained on,” and to determine how that scaleup affects accuracy, Phuthi said. The group has to date run batch sizes larger than 32,000 samples, as compared to a memory-driven limit of about 200 on a GPU system. While the team hasn’t yet optimized parameters, the prediction accuracies are similar.

Physics-Informed Neural Networks (PINNs) for Navier-Stokes Equation

Khemraj Shukla, assistant professor in the CRUNCH group led by George Em Karniadakis of Brown University, described his use of the WSE technology in solving the Navier-Stokes equations for the motion of viscous fluids using neural networks.

“Most of these typical systems have very few high-dimensional data points, but very sparse selection,” he said. “In a conventional approach, it requires forward modeling to run many times to do the system identification (described by partial differential equations), whereas by using PINNs we can solve the forward and inverse problem with few data-points in one shot.”
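The core idea of a PINN is a loss function that penalizes violations of the governing equation alongside the mismatch with known data. The Python sketch below illustrates that idea on a deliberately tiny stand-in problem, the ordinary differential equation u'(x) + u(x) = 0 with u(0) = 1, rather than on Navier-Stokes. The network shape and all names are illustrative assumptions. Production PINN codes obtain derivatives by automatic differentiation, whereas here the derivative of a small tanh network is written out by hand.

```python
import math
import random

# Toy physics-informed loss for the ODE u'(x) + u(x) = 0 with u(0) = 1.
# This 1D problem stands in for the Navier-Stokes case; the network shape and
# all names are illustrative. Real PINN codes get derivatives from automatic
# differentiation; here the derivative of a tiny tanh network is written by hand.

def u(params, x):
    """Network output: sum_j a_j * tanh(w_j * x + b_j) + c."""
    a, w, b, c = params
    return sum(aj * math.tanh(wj * x + bj) for aj, wj, bj in zip(a, w, b)) + c

def du_dx(params, x):
    """Exact derivative of the network output with respect to x."""
    a, w, b, _ = params
    return sum(aj * wj * (1.0 - math.tanh(wj * x + bj) ** 2)
               for aj, wj, bj in zip(a, w, b))

def pinn_loss(params, collocation_points):
    """Mean squared ODE residual plus the squared boundary-condition error."""
    residuals = [du_dx(params, x) + u(params, x) for x in collocation_points]
    physics = sum(r * r for r in residuals) / len(residuals)
    boundary = (u(params, 0.0) - 1.0) ** 2
    return physics + boundary

random.seed(0)
n_hidden = 4
params = ([random.gauss(0, 1) for _ in range(n_hidden)],   # a
          [random.gauss(0, 1) for _ in range(n_hidden)],   # w
          [random.gauss(0, 1) for _ in range(n_hidden)],   # b
          0.0)                                             # c
points = [i / 10 for i in range(11)]                       # collocation points in [0, 1]
print(pinn_loss(params, points))
```

Training would adjust the parameters to drive this loss toward zero. In an inverse problem, unknown physical coefficients would be included among the trainable parameters, which is one reading of how forward and inverse problems can be handled in a single optimization.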

To date Shukla has executed his code for lid-driven cavity flow at a Reynolds number (Re) of 100 on Neocortex, taking a total of 150 seconds. This low Re, signifying a fluid with high relative viscosity and a tendency to exhibit laminar flow, represented a modest starting point for the computations. A similar computation on a V100 GPU took about 10 minutes. Future efforts will include larger Re, which represent lower-viscosity fluids with more complex, turbulent flow, and creating an application programming interface for integrating automatic differentiation into the computations.
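For readers less familiar with the quantity, the Reynolds number is the standard dimensionless ratio Re = U L / ν of a characteristic speed U times a characteristic length L to the kinematic viscosity ν. The Python sketch below evaluates it for a lid-driven cavity. The lid speed, cavity size and viscosities are illustrative values chosen to give round numbers and were not reported in the talk.

```python
# Reynolds number Re = U * L / nu for a lid-driven cavity.
# U (lid speed), L (cavity side length) and nu (kinematic viscosity) below are
# illustrative values; the article does not state them.

def reynolds_number(velocity, length, kinematic_viscosity):
    """Dimensionless ratio of inertial to viscous effects."""
    return velocity * length / kinematic_viscosity

# Same lid speed and cavity size, with a tenfold drop in viscosity.
print(reynolds_number(velocity=1.0, length=1.0, kinematic_viscosity=0.01))
print(reynolds_number(velocity=1.0, length=1.0, kinematic_viscosity=0.001))
```

With speed and length fixed, lowering the viscosity by a factor of ten raises Re by the same factor, which is the direction the planned larger-Re runs would take.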

Wafer-Scale Engines for More than AI

A very different application of WSE technology formed the basis of the final presentation in the panel. Instead of AI, Dirk Van Essendelft, PI of the AI/ML Enhanced CFD group at the National Energy Technology Laboratory (NETL), described using a CS-1 in cooperation with Cerebras for direct physical simulations for phenomena such as astrophysical events.

“The principle of locality is important” for describing the physics of interacting objects, Van Essendelft said. “This principle holds true for almost all physical systems, outside of quantum entanglement.”

The lack of “spooky action at a distance” outside the quantum realm enables simulators to approximate answers by dividing a problem into a grid of discrete cells. Since each cell in the grid only interacts with its immediate neighbors, at one level the computation is simple. The problem arises when a scientist wants to achieve fine resolution, and the cells become numerous.

Ideally, Van Essendelft said, computing hardware would reproduce the 3D grid as a 3D matrix of processors, with each processor holding the description of the cell it represents and interacting with its immediate neighbors. Conventional distributed computing reproduces this ideal very poorly, with a limited number of processors, slow internal communication, non-localized memory and access to neighboring memory taking thousands of cycles.

The 2D grid of the CS-1’s processors offers a better mirror of the model’s physical grid. With each processor directly interacting with its four nearest neighbors, the local memory can hold field values for a column of cells in the 3D grid, offering access to local and neighbor memory in only a single cycle. Van Essendelft’s BiCGSTAB solver achieved near-linear scaling in calculating flow parameters in both a 370-cubed-cell and a 600-cubed-cell mesh.
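The mapping can be illustrated with a small Python sketch in which each entry of a 2D array stands for one processor and holds a full column of cell values. The 7-point Laplacian stencil, the unit grid spacing and all names are assumptions made for illustration. This is not the NETL/Cerebras solver, only a picture of which values are local and which come from the four neighboring processors.

```python
# Sketch of the mapping described above: a 3D mesh laid out on a 2D grid of
# processors, with processor (i, j) holding the whole z-column of cell values.
# The 7-point Laplacian stencil and all names here are illustrative assumptions,
# not the NETL/Cerebras solver.

def make_columns(nx, ny, nz, value):
    """columns[i][j] is the local memory of processor (i, j): a list of nz values."""
    return [[[value(i, j, k) for k in range(nz)] for j in range(ny)] for i in range(nx)]

def laplacian(columns, i, j, k):
    """7-point stencil at an interior cell (unit grid spacing)."""
    own = columns[i][j]                      # local column: z-neighbors live here
    west, east = columns[i - 1][j], columns[i + 1][j]
    south, north = columns[i][j - 1], columns[i][j + 1]
    return (own[k - 1] + own[k + 1]          # neighbors in z, same processor
            + west[k] + east[k]              # neighbors in x, adjacent processors
            + south[k] + north[k]            # neighbors in y, adjacent processors
            - 6.0 * own[k])

# Test field x^2 + y^2 + z^2, whose discrete Laplacian is 6 at interior cells.
columns = make_columns(4, 4, 4, lambda i, j, k: float(i * i + j * j + k * k))
print(laplacian(columns, 1, 2, 1))
```

In this layout, two of the six neighbor values are read from the processor's own column and the other four from adjacent processors, so no stencil update reaches beyond the nearest neighbors.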

Footnote: The panelists plan to organize a Cerebras technologies user group to foster information exchange on this promising technology. If you are interested, send an email to neocortex@psc.edu.
