Wafer Scale to ‘Brain-Scale’ – Cerebras Touts Linear Scaling up to 192 CS-2 Systems

By Tiffany Trader

August 24, 2021

At the Hot Chips conference today, held as a virtual event, wafer-scale computing company Cerebras Systems unveiled its “brain-scale” approach for running the largest models in the world across up to 192 CS-2 systems. To enable this, Cerebras is debuting its weight streaming technology, which flips the way that models are usually run, and launching two new products: MemoryX and SwarmX.

Cerebras introduced the CS-2 system earlier this year, doubling the performance of the original CS-1, which debuted at SC19. The CS-2 system, now shipping, houses the second-generation Cerebras Wafer Scale Engine (WSE-2), which contains 850,000 cores and 40 GB of memory.

Inspired by the human brain’s harnessing of 100 trillion synapses, the brain-scale approach is Cerebras’ answer to running the very largest AI models, which are seeing exponential hikes in the number of parameters. In 2018, Google’s BERT debuted with 340 million parameters, taking about nine petaflop-days to train. In 2019, T5 upped the scale to 11 billion parameters and took 900 petaflop-days to train. In 2020, Microsoft announced MSFT-1T – a one-trillion-parameter model – that took roughly 25,000-30,000 petaflop-days to train, according to Cerebras founder and CEO Andrew Feldman.

Exponential growth of neural networks (source: Cerebras)

Referencing the chart (above), Feldman told HPCwire, “It is rare you have an exponential log graph on both the x and y axis and you can see it increasing an extraordinary amount on both. Over a two-and-a-half year period, model sizes grew 1,000 times and the amount of compute necessary to work on them increased by over 1,000X as well.”

Cerebras is announcing that it can now support 120 trillion parameter models on a single CS-2, using a new custom memory extension technology called Cerebras MemoryX, which provides the second-generation WSE-2 with up to 2.4 petabytes of memory, allowing parameters to be stored off-system. And further, with the integration of new interconnect technology Cerebras SwarmX, the company can build clusters with up to 192 CS-2 systems, comprising an aggregate 163 million cores.
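As a rough sanity check on those figures (not from Cerebras), dividing the quoted MemoryX capacities by the quoted parameter counts suggests how many bytes the appliance budgets per weight; the assumption that this covers the weight plus optimizer state is ours:

```python
# Back-of-envelope check of the MemoryX figures quoted above (illustrative only).
CONFIGS = [
    ("minimum", 4e12, 200e9),     # 4 TB  holds ~200 billion weights
    ("maximum", 2.4e15, 120e12),  # 2.4 PB holds ~120 trillion weights
]

for name, capacity_bytes, n_weights in CONFIGS:
    bytes_per_weight = capacity_bytes / n_weights
    print(f"{name}: ~{bytes_per_weight:.0f} bytes per weight")
    # Both ends work out to ~20 bytes/weight -- plausibly an FP32 weight (4 B)
    # plus Adam-style optimizer state (two FP32 moments, 8 B) with headroom,
    # though Cerebras has not published the exact breakdown.
```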

It is possible to configure these clusters with “a push of a button,” said Feldman, because all of the systems (nodes) use an identical initial configuration. The company has also implemented algorithmic techniques said to further extend the system’s capabilities while using fewer flops and less power.

The technologies being introduced today enable the separation of model memory, compute and training data, such that each dimension can scale independently. “So the user can right-size the solution to their problem,” said Sean Lie, chief architect and co-founder, in his Hot Chips talk.

The heart of the innovation is a new execution mode called weight streaming. In the traditional execution mode, weights are held on the wafer and activations are streamed in. For models of extraordinary size, Cerebras reversed this, such that the weights are streamed in and the activations are held on chip. Cerebras’ MemoryX appliance uses a mix of DRAM and flash storage and scales from four terabytes to 2.4 petabytes in capacity, equivalent to between 200 billion and 120 trillion weights. Associated internal compute handles weight updates and provides other optimizer functionality.
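A minimal sketch of the contrast, assuming nothing about Cerebras’ actual implementation: in the traditional mode the weights are resident and activation batches flow through them, while in weight streaming the activations stay resident and each layer’s weights are fetched (here from a hypothetical MemoryX-style store), used, and discarded.

```python
import numpy as np

def traditional_mode(layers_on_chip, activation_batches):
    """Weights held on the wafer; activation batches stream through."""
    outputs = []
    for activations in activation_batches:
        for weights in layers_on_chip:                 # weights already resident
            activations = np.tanh(activations @ weights)
        outputs.append(activations)
    return outputs

def weight_streaming_mode(activations, fetch_weights, num_layers):
    """Activations held on the wafer; weights stream in layer by layer."""
    for layer in range(num_layers):
        weights = fetch_weights(layer)                 # hypothetical external fetch
        activations = np.tanh(activations @ weights)   # weights used, then dropped
    return activations

# Toy usage: three 8x8 layers, one resident activation batch.
rng = np.random.default_rng(0)
out = weight_streaming_mode(rng.normal(size=(4, 8)),
                            lambda i: rng.normal(size=(8, 8)), 3)
print(out.shape)  # (4, 8)
```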

The weight streaming approach only makes sense if you have a model big enough to fill up an entire wafer-scale chip (and beyond) and that era has definitely arrived, said Feldman.

The largest layers of the largest models fit comfortably within a single CS-2, according to Cerebras’ benchmarking. This is the reverse of the GPU approach, which breaks model elements into pieces and distributes them across many compute units.


For training models across a cluster of CS-2s, each CS-2 starts with an identical configuration and holds identical information. The systems differ only in the incoming data, which modifies the gradients that are then broadcast across the SwarmX fabric. The data are reduced on the way out and fed into the MemoryX unit. The process repeats until all the layers are updated and the model is trained.
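A hedged sketch of that cluster-level loop (illustrative, not Cerebras’ code): every system holds the same model state and sees a different shard of the batch; the SwarmX role here is assumed to be gradient reduction, and the MemoryX role is assumed to be holding the master weights and applying the optimizer update.

```python
import numpy as np

def compute_grad(weights, shard):
    # Placeholder gradient for a linear least-squares toy problem.
    x, y = shard
    return 2 * x.T @ (x @ weights - y) / len(y)

def train_step(weights, data_shards, lr=1e-3):
    # Each "CS-2" computes gradients on its own shard of the data.
    per_system_grads = [compute_grad(weights, shard) for shard in data_shards]
    # Assumed SwarmX role: reduce gradients on the way back out.
    reduced_grad = np.mean(per_system_grads, axis=0)
    # Assumed MemoryX role: hold the master weights and run the optimizer.
    return weights - lr * reduced_grad

# Toy usage: four "systems", each with a different shard of the same batch.
rng = np.random.default_rng(1)
w = np.zeros((5, 1))
shards = [(rng.normal(size=(8, 5)), rng.normal(size=(8, 1))) for _ in range(4)]
for _ in range(10):
    w = train_step(w, shards)
```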

Because the systems are all identical, they can be configured in a single keystroke, Feldman emphasized.

Cerebras CS-2 machine

Linking units together is what supercomputing expertise is based on, he continued. “Our strategy at Cerebras was first: take as many of those little nodes as we can and put them on one wafer, and that’s our CS-2. Our new weight streaming technology allows us to map work to multiple CS-2s the exact same way we do it for one. You don’t have to partition, you don’t have to run model parallel. You basically compile to one CS-2 and you copy that config to n number of CS-2s, that’s it. The only thing you do is shard the data.”

Even the decision to run with the traditional pipeline mode or weight streaming mode is taken care of internally. Models larger than BERT and beyond GPT-2 or GPT-3 will trigger the switch to weight streaming mode, Feldman said, adding, “you write your TensorFlow or PyTorch code, and we take care of everything else.”
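One way to picture that internal decision, purely as an assumption on our part: compare the model’s weight footprint against the WSE-2’s 40 GB of on-wafer memory quoted earlier. The byte-per-weight figure and the exact crossover Cerebras uses are not public.

```python
WSE2_ON_WAFER_BYTES = 40 * 1024**3   # 40 GB of on-wafer memory (from the article)
BYTES_PER_WEIGHT = 2                 # assumed FP16 storage for this sketch

def pick_execution_mode(num_parameters: int) -> str:
    if num_parameters * BYTES_PER_WEIGHT <= WSE2_ON_WAFER_BYTES:
        return "pipelined"         # weights fit on the wafer, stream activations
    return "weight_streaming"      # weights live off-wafer, stream them in

print(pick_execution_mode(340_000_000))       # BERT-scale  -> pipelined
print(pick_execution_mode(175_000_000_000))   # GPT-3-scale -> weight_streaming
```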

Cerebras has also enabled a feature that leverages sparsity to save time and energy. “What we can do because of our fine-grained dataflow architecture is that we never multiply by zero,” said Feldman. “It is enabled by technologies in the chip, and foremost massive memory bandwidth.”

“This means if your model has 50 percent sparsity, the Cerebras system can do it twice as fast,” he added.
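A toy illustration of that claim (not the hardware mechanism): if half the operands are zero and every multiply-by-zero is skipped, the multiply count, and hence roughly the work, halves.

```python
import numpy as np

def dense_macs(a, b):
    return a.size * b.shape[1]                      # every multiply is performed

def zero_skipping_macs(a, b):
    return int(np.count_nonzero(a)) * b.shape[1]    # multiplies by zero skipped

rng = np.random.default_rng(2)
a = rng.normal(size=(64, 64))
a[rng.random(a.shape) < 0.5] = 0.0                  # ~50 percent sparsity
b = rng.normal(size=(64, 64))

print(dense_macs(a, b) / zero_skipping_macs(a, b))  # ~2x fewer multiplies
```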

While envisioning a future where 192 CS-2 machines work together as one system, Feldman believes a realistic near-term goal is to stand up 16- and 32-node clusters. The company has GPT-3 layers running on the CS-2 and expects to show additional performance “on some of the largest networks” within months.

Linear performance scaling to 192 CS-2s. Projections based on Scaling Laws for Neural Language Models [OpenAI].

The news drew positive comments from market-watchers in the space.

“The wafer-scale approach is unique and clearly better for big models than much smaller GPUs,” said Linley Gwennap, president and principal analyst of The Linley Group. “By coupling the WSE with the new MemoryX and SwarmX technology, Cerebras has created what should be the industry’s best solution for training very large neural networks. Streaming the weights is a unique idea that hasn’t been tried, so it’s unclear exactly how much better this approach will be, but the ability to store even the largest model layers in the WSE gives Cerebras a big leg up on these enormous models.”

The market for models with billions of parameters is small today but should grow quickly, said Gwennap. “For the most part, these models remain experimental and aren’t yet used in production,” he added. “But even a handful of customers buying 16-32 system clusters would be a big revenue boost for Cerebras. The main obstacle to deploying these models is the long time that it takes to train them on GPUs, so the faster Cerebras can train the models, the sooner customers can move these models into production, which will require purchases of many more systems.”

Karl Freund, founder and principal analyst of Cambrian AI Research, said he’s been wondering how Cerebras was going to be able to run huge models and provide the scalability needed. “Models are doubling every 3.5 months,” Freund told HPCwire. “Nvidia has a solution coming: Grace (Arm CPU). Cerebras’ solution provides scads of memory AND data prep processing. I think the company has benefited greatly by the close relationships they have developed with researchers.”

Cerebras’ customer base for its CS systems includes multiple DOE labs in the U.S. and EPCC in the UK, as well as commercial sites GlaxoSmithKline and AstraZeneca.

MemoryX and SwarmX products will begin shipping in Q4 of this year, Cerebras said. Pricing was not disclosed.
