Cerebras Systems Unveils the Industry’s First Trillion Transistor Chip

August 19, 2019

LOS ALTOS, Calif., August 19, 2019 – Cerebras Systems, a startup dedicated to accelerating artificial intelligence (AI) compute, today unveiled the largest chip ever built. Optimized for AI work, the Cerebras Wafer Scale Engine (WSE) is a single chip that contains more than 1.2 trillion transistors and measures 46,225 square millimeters. The WSE is 56.7 times larger than the largest graphics processing unit, which measures 815 square millimeters and contains 21.1 billion transistors. The WSE also contains 3,000 times more high-speed, on-chip memory and has 10,000 times more memory bandwidth.
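
As a sanity check, the headline ratios follow directly from the published figures. The snippet below is illustrative arithmetic only, not Cerebras data:

```python
# Back-of-the-envelope check of the published WSE vs. GPU comparisons.
wse_area_mm2 = 46_225        # Cerebras WSE die area
gpu_area_mm2 = 815           # largest GPU die area cited in the release
wse_transistors = 1.2e12
gpu_transistors = 21.1e9

print(f"Area ratio:       {wse_area_mm2 / gpu_area_mm2:.1f}x")        # ~56.7x
print(f"Transistor ratio: {wse_transistors / gpu_transistors:.1f}x")  # ~56.9x
```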

In AI, chip size is profoundly important. Big chips process information more quickly, producing answers in less time. Reducing the time-to-insight, or “training time,” allows researchers to test more ideas, use more data, and solve new problems. Google, Facebook, OpenAI, Tencent, Baidu, and many others argue that the fundamental limitation to today’s AI is that it takes too long to train models. Reducing training time removes a major bottleneck to industry-wide progress.

“Designed from the ground up for AI work, the Cerebras WSE contains fundamental innovations that advance the state-of-the-art by solving decades-old technical challenges that limited chip size—such as cross-reticle connectivity, yield, power delivery, and packaging,” said Andrew Feldman, founder and CEO of Cerebras Systems. “Every architectural decision was made to optimize performance for AI work. The result is that the Cerebras WSE delivers, depending on workload, hundreds or thousands of times the performance of existing solutions at a tiny fraction of the power draw and space.”

These performance gains are accomplished by accelerating all the elements of neural network training. A neural network is a multistage computational feedback loop. The faster inputs move through the loop, the faster the loop learns or “trains.” The way to move inputs through the loop faster is to accelerate the calculation and communication within the loop.
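
As a rough illustration of that loop, the sketch below trains a toy single-layer model with plain NumPy. The model, data, and learning rate are placeholders for illustration and have nothing to do with Cerebras software; the point is that each trip around the loop mixes calculation (forward and backward passes) with data movement, so accelerating both shortens training time:

```python
import numpy as np

# Minimal sketch of the training feedback loop: forward pass (calculation),
# error computation (feedback), and weight update. Faster calculation and
# communication mean more trips around this loop per second.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 1))                        # illustrative model weights
X = rng.normal(size=(100, 4))                      # placeholder inputs
y = X @ np.array([[1.0], [-2.0], [0.5], [3.0]])    # placeholder targets

for step in range(1000):
    pred = X @ W                # forward: calculation
    err = pred - y              # compare output to target
    grad = X.T @ err / len(X)   # backward: feedback signal
    W -= 0.1 * grad             # update: close the loop

print("final mean squared error:", float(np.mean(err ** 2)))
```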

With an exclusive focus on AI, the Cerebras Wafer Scale Engine accelerates calculation and communication and thereby reduces training time. The approach is straightforward and is a function of the size of the WSE: with 56.7 times more silicon area than the largest graphics processing unit, the WSE provides more cores to do calculations and more memory closer to the cores so the cores can operate efficiently. Because this vast array of cores and memory is on a single chip, all communication is kept on-silicon. This means the WSE’s low-latency communication bandwidth is immense, so groups of cores can collaborate with maximum efficiency, and memory bandwidth is no longer a bottleneck.

The 46,225 square millimeters of silicon in the Cerebras WSE house 400,000 AI-optimized, no-cache, no-overhead compute cores and 18 gigabytes of local, distributed, superfast SRAM as the one and only level of the memory hierarchy. Memory bandwidth is 9 petabytes per second. The cores are linked together with a fine-grained, all-hardware, on-chip mesh-connected communication network that delivers an aggregate bandwidth of 100 petabits per second. More cores, more local memory, and a low-latency, high-bandwidth fabric together create the optimal architecture for accelerating AI work.

“While AI is used in a general sense, no two data sets or AI tasks are the same. New AI workloads continue to emerge and the data sets continue to grow larger,” said Jim McGregor, principal analyst and founder at TIRIAS Research. “As AI has evolved, so too have the silicon and platform solutions. The Cerebras WSE is an amazing engineering achievement in semiconductor and platform design that offers the compute, high-performance memory, and bandwidth of a supercomputer in a single wafer-scale solution.”

The Cerebras WSE’s record-breaking achievements would not have been possible without years of close collaboration with TSMC, the world’s largest semiconductor foundry and a leader in advanced process technologies. The WSE is manufactured by TSMC on its advanced 16nm process technology.

“We are very pleased with the result of our collaboration with Cerebras Systems in manufacturing the Cerebras Wafer Scale Engine, an industry milestone for wafer scale development,” said JK Wang, TSMC Senior Vice President of Operations. “TSMC’s manufacturing excellence and rigorous attention to quality enables us to meet the stringent defect density requirements to support the unprecedented die size of Cerebras’s innovative design.”

More Cores, More Memory Close to Cores, More Low Latency Communication Bandwidth

The WSE contains 400,000 AI-optimized compute cores. Called SLAC, for Sparse Linear Algebra Cores, the compute cores are flexible, programmable, and optimized for the sparse linear algebra that underpins all neural network computation. SLAC’s programmability ensures the cores can run all neural network algorithms in the constantly changing machine learning field.

Because the Sparse Linear Algebra Cores are optimized for neural network compute primitives, they achieve industry-best utilization, often triple or quadruple that of a graphics processing unit. In addition, the WSE cores include Cerebras-invented sparsity harvesting technology to accelerate computational performance on sparse workloads (workloads that contain zeros) like deep learning.

Zeros are prevalent in deep learning calculations: often, the majority of the elements in the vectors and matrices that are to be multiplied together are zero. Yet multiplying by zero is a waste of silicon, power, and time; no new information is produced.

Because graphics processing units and tensor processing units are dense execution engines, designed never to encounter a zero, they multiply every element even when it is zero. When 50 to 98 percent of the data are zeros, as is often the case in deep learning, most of the multiplications are wasted. Imagine trying to run forward quickly when most of your steps don’t move you toward the finish line. The Cerebras Sparse Linear Algebra Cores never multiply by zero. All zero data is filtered out and skipped in the hardware, and useful work is done in its place.
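
The payoff from skipping zeros can be sketched in a few lines of Python. This illustrates the general idea of sparsity harvesting, not Cerebras’s actual hardware mechanism:

```python
# Illustrative sparse dot product: a dense engine performs every multiply,
# while a sparsity-aware engine skips pairs where an operand is zero.
def dense_dot(a, b):
    return sum(x * y for x, y in zip(a, b))    # multiplies every element pair

def sparse_dot(a, b):
    # Skipping zero operands loses no information: those products are zero.
    return sum(x * y for x, y in zip(a, b) if x != 0 and y != 0)

a = [0.0, 2.0, 0.0, 0.0, 1.5, 0.0, 0.0, 0.0]   # 75% zeros
b = [1.0, 3.0, 0.0, 4.0, 2.0, 0.0, 1.0, 0.0]
assert dense_dot(a, b) == sparse_dot(a, b)      # same answer, far fewer multiplies
print(sparse_dot(a, b))                         # 9.0
```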

Memory

Memory is a key component of every computer architecture. Memory closer to compute translates to faster calculation, lower latency, and better power efficiency for data movement. High-performance deep learning requires massive compute with frequent access to data, which demands close proximity between the compute cores and memory. This is not the case in graphics processing units, where the vast majority of the memory is slow and far away (off-chip).

The Cerebras Wafer Scale Engine includes more cores, with more local memory, than any chip in history. This enables fast, flexible computation at lower latency and with less energy. The WSE has 18 gigabytes of on-chip memory, accessible by its cores in a single clock cycle. The collection of core-local memory aboard the WSE delivers an aggregate of 9 petabytes per second of memory bandwidth: 3,000 times more on-chip memory and 10,000 times more memory bandwidth than the leading graphics processing unit.
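
Dividing those published totals evenly across the 400,000 cores gives a feel for the per-core figures. This is illustrative arithmetic; the even split is an assumption:

```python
cores = 400_000
total_sram_bytes = 18e9     # 18 GB of on-chip SRAM (decimal gigabytes assumed)
total_bw_bytes_s = 9e15     # 9 PB/s aggregate memory bandwidth

print(f"SRAM per core:      {total_sram_bytes / cores / 1e3:.0f} KB")    # ~45 KB
print(f"Bandwidth per core: {total_bw_bytes_s / cores / 1e9:.1f} GB/s")  # ~22.5 GB/s
```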

Communication Fabric

Swarm, the interprocessor communication fabric used on the WSE, achieves breakthrough bandwidth and low latency at a fraction of the power draw of traditional communication techniques. Swarm provides a low-latency, high-bandwidth 2D mesh that links all 400,000 cores on the WSE with an aggregate 100 petabits per second of bandwidth. Swarm supports single-word active messages that can be handled by receiving cores without any software overhead.

Routing, reliable message delivery, and synchronization are handled in hardware, and arriving messages automatically activate application handlers. Swarm provides a unique, optimized communication path for each neural network: software configures the optimal path through the 400,000 cores, connecting processors according to the structure of the particular user-defined neural network being run.

Swarm’s results are industry-defining. Typical messages traverse one hardware link with nanosecond latency. The aggregate bandwidth across a Cerebras WSE is measured at 100 petabits per second. Communication software such as TCP/IP and MPI is not needed, so its performance penalties are avoided. The energy cost of communication in this architecture is well under one picojoule per bit, nearly two orders of magnitude lower than in graphics processing units. With the rare combination of massive bandwidth and exceptionally low latency, the Swarm communication fabric enables the Cerebras WSE to learn faster than any available alternative.
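
The per-core share of the fabric, and the energy bound for a single-word message, follow from the published aggregates. Again, this is illustrative arithmetic assuming an even split across cores:

```python
cores = 400_000
fabric_bits_s = 100e15      # 100 petabits/s aggregate fabric bandwidth
energy_per_bit_j = 1e-12    # upper bound: well under one picojoule per bit

print(f"Per-core share of fabric bandwidth: "
      f"{fabric_bits_s / cores / 1e9:.0f} Gbit/s")          # 250 Gbit/s
# A single-word (32-bit) active message would therefore cost under:
print(f"Energy per 32-bit message: < {32 * energy_per_bit_j * 1e12:.0f} pJ")
```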

About Cerebras Systems

Cerebras Systems is a team of pioneering computer architects, computer scientists, deep learning researchers, and engineers of all types. We have come together to build a new class of computer to accelerate artificial intelligence work by three orders of magnitude beyond the current state of the art. The first announced element of the Cerebras solution is the Wafer Scale Engine (WSE). The WSE is the largest chip ever built. It contains 1.2 trillion transistors and covers more than 46,225 square millimeters of silicon. The largest graphics processor on the market has 21.1 billion transistors and covers 815 square millimeters. In artificial intelligence work, large chips process information more quickly, producing answers in less time. As a result, neural networks that in the past took months to train can train in minutes on the Cerebras WSE.


Source: Cerebras Systems 
