Cerebras Builds ‘Exascale’ AI Supercomputer

By Agam Shah

November 14, 2022

Cerebras is putting down stakes to be a player in AI cloud computing with a supercomputer called Andromeda, which achieves more than an exaflops of “AI performance.”

The company called Andromeda one of the fastest AI systems in the U.S. The system strings together 16 CS-2 systems in a cluster, with a total of 13.5 million compute cores focused on AI.

Each CS-2 system has a wafer-sized chip, considered the largest piece of silicon ever made, with 850,000 cores. The Andromeda system has 96.8 terabits of internal bandwidth.

For preprocessing, Andromeda is attached to 284 single-socket servers, each with an AMD Epyc 7713 “Milan” CPU, 128GB of RAM, three 1.92TB NVMe drives and two 100Gb Ethernet network cards.

Linear scaling, training models from scratch. Source: Cerebras Systems.

The exaflop benchmark is based on 16-bit half-precision performance with linear scaling, said Andrew Feldman, CEO of Cerebras.

“Linear scaling means when you go from one to two systems, it takes half as long for your work to be completed. That is a very unusual property in computing,” Feldman said, adding that Andromeda can scale beyond the 16 connected systems.
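Feldman's description of linear scaling comes down to simple arithmetic. The sketch below is illustrative only: the function name and the 16-hour single-system baseline are hypothetical, not Cerebras figures, and it models ideal scaling with no communication overhead.

```python
def ideal_training_hours(single_system_hours: float, num_systems: int) -> float:
    """Training time under ideal linear scaling: time shrinks in proportion
    to the number of systems (no communication overhead is modeled)."""
    if num_systems < 1:
        raise ValueError("num_systems must be at least 1")
    return single_system_hours / num_systems


# Hypothetical baseline: a job that takes 16 hours on a single CS-2.
baseline_hours = 16.0
for n in (1, 2, 4, 8, 16):
    print(f"{n:2d} systems: {ideal_training_hours(baseline_hours, n):.1f} hours")
```

Under this idealized model, going from one system to two halves the time, and 16 systems would finish the hypothetical job in one hour. Real clusters typically fall short of this curve, which is why Feldman calls the property unusual.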

A single chip in the CS-2 can train language models with billions of parameters. Andromeda can potentially train larger language models with trillions of parameters, or train smaller models in less time.

Andromeda cost about $30 million to build and was set up in just three days, Feldman said.

Feldman said the system offers performance comparable to that of the Polaris supercomputer at the Argonne Leadership Computing Facility, which has 2,240 Nvidia A100 GPUs and 560 AMD Milan CPUs. The Top500 implementation of Polaris delivers 25.8 Linpack petaflops (out of a theoretical 34.2 petaflops) in 64-bit double-precision, or roughly 700 petaflops of FP16 tensor core performance.
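The roughly 700-petaflops figure for Polaris can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes Nvidia's published peak of 312 teraflops of dense FP16 tensor core throughput per A100, a number not stated in the article; the function and variable names are illustrative.

```python
# GPU count for Polaris, as given in the article.
POLARIS_A100_COUNT = 2_240

# Assumption: Nvidia's published peak dense FP16 tensor core rate per A100,
# in teraflops (without structured sparsity). Not stated in the article.
A100_FP16_TENSOR_TFLOPS = 312


def aggregate_petaflops(gpu_count: int, tflops_per_gpu: float) -> float:
    """Sum of per-GPU peak rates, converted from teraflops to petaflops."""
    return gpu_count * tflops_per_gpu / 1_000


polaris_fp16_pflops = aggregate_petaflops(POLARIS_A100_COUNT, A100_FP16_TENSOR_TFLOPS)
print(f"Polaris peak FP16: {polaris_fp16_pflops:.1f} petaflops")
```

This is a theoretical peak obtained by multiplying per-GPU ratings, not a measured result, and it lands just under 700 petaflops, consistent with the rounded figure above.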

The Andromeda system occupies 16 racks, making it physically smaller than Polaris, which uses 40 racks. Polaris is the 14th fastest supercomputer in the world, according to the Top500 list released in June this year.

The supercomputer, which is deployed at a Colovore datacenter in Santa Clara, California, is being used in multiple ways. The system is accessible via the cloud to companies that want to try the hardware before buying. It is also available to companies looking to rent computing resources.

“We’re using it for companies who have a big problem that they want to solve and don’t want all the equipment necessary to solve it,” Feldman said.

The Andromeda system is also available for free to students and academics. It takes just a few lines of code to deploy AI models, Feldman said.

Cerebras’ shift to becoming an AI cloud provider follows in the footsteps of chip makers like Intel and Nvidia, which run their own cloud services for customers to test chips or code. The company’s customer list includes TotalEnergies and GlaxoSmithKline.

Meta is developing an AI supercomputer with 6,080 Nvidia A100 GPUs to augment its AI- and metaverse-driven computing, while Tesla has an AI supercomputer with 7,360 A100 GPUs.

The systems in the Andromeda cluster are connected by a fabric called SwarmX, which allows memory, computing and networking to be disaggregated into separate clusters. The compute and memory elements operate as a single system and scale independently, which helps speed up the training of AI models. Model parameters are stored in an internal system called MemoryX.

The disaggregation of resources in CS-2 is different from GPU environments, in which computing is broken up over AI cores distributed over a wide area. Calculations need to be orchestrated over this network of cores, which Feldman has said can be time-consuming and inefficient. AI calculations also need GPUs to operate identically across thousands of cores to get a coordinated response time.
