2023 finds every major tech player working furiously to build out tools and infrastructure amid the massive surge in large language models (LLMs). Long-time AI players are no exception – and at Meta’s AI Infra @Scale event today, the tech giant announced two major updates to its AI-targeted infrastructure: first, the completion of its huge Research SuperCluster (RSC); second, its next-gen datacenter design.
Meta announced the Research SuperCluster in January 2022. At the time, “only” the first phase of the system was operational: 760 Nvidia DGX A100 nodes, totaling 6,080 Nvidia A100 GPUs and networked with Nvidia’s 200Gb/s InfiniBand. Meta also promised a second phase, targeting that build-out for July 2022.
Now, Meta – working with Nvidia, Penguin Computing and Pure Storage – has completed the second phase of the RSC. The full system includes 2,000 DGX A100 systems, totaling a staggering 16,000 A100 GPUs. Each node has dual AMD Epyc “Rome” CPUs and 2TB of memory. The RSC has up to half an exabyte of storage and, according to Meta, one of the largest known flat InfiniBand fabrics in the world, with 48,000 links and 2,000 switches. (“AI training at scale is nothing if we cannot supply the data fast enough to the GPUs, right?” said Kalyan Saladi – a software engineer at Meta – in a presentation at the event.)
In HPCwire’s initial coverage, we estimated that this full configuration of the system would deliver around 227 Linpack petaflops. For reference, that would place it fourth on the most recent Top500 list of the world’s most powerful publicly benchmarked supercomputers. Of course, we’ll have a new Top500 list in just a few days – no word on whether or not Meta’s RSC will appear on there, but the company is touting five exaflops of mixed-precision AI computing.
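As a rough sanity check on these figures, the short Python sketch below recomputes the GPU counts from the node counts and estimates a peak mixed-precision number. It assumes eight GPUs per DGX A100 node and Nvidia's published dense FP16/BF16 Tensor Core peak of 312 teraflops per A100; whether Meta arrived at its five-exaflop figure this way is an assumption, and the function names are purely illustrative.

```python
# Back-of-the-envelope check of the RSC figures quoted in the article.
GPUS_PER_DGX_A100 = 8     # each DGX A100 node holds eight A100 GPUs
A100_PEAK_TFLOPS = 312    # assumption: Nvidia's dense FP16/BF16 Tensor Core peak per GPU


def total_gpus(nodes: int) -> int:
    """Number of GPUs in a cluster built from DGX A100 nodes."""
    return nodes * GPUS_PER_DGX_A100


def peak_exaflops(gpus: int, tflops_per_gpu: float = A100_PEAK_TFLOPS) -> float:
    """Theoretical aggregate peak in exaflops (1 exaflop = 1,000,000 teraflops)."""
    return gpus * tflops_per_gpu / 1_000_000


phase_one_gpus = total_gpus(760)     # first phase, January 2022
full_gpus = total_gpus(2_000)        # completed system

print(phase_one_gpus)                        # 6080
print(full_gpus)                             # 16000
print(round(peak_exaflops(full_gpus), 2))    # 4.99
```

Under those assumptions, the theoretical peak lands just under five exaflops, in line with the number Meta is touting. This is a peak, not a measured result, and it is not comparable to the Linpack estimate, which reflects double-precision performance.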
Scott Jeschonek (technical program manager at Meta) explained that the system’s development had begun in 2021 and that the second phase had been finalized around October or November of last year following a series of challenges (e.g. supply chain issues).
Saladi explained the changes Meta needed to make in its approach to infrastructure for the RSC. “It really comes down to understanding and realizing the unique demands large-scale AI training places on the infrastructure,” Saladi said. This included changing from air cooling to liquid cooling, along with increasing the power envelope by a sizable margin due to the many thousands of GPUs. Both changes, Saladi said, were significant deviations from Meta’s production datacenters.
Meta framed the development of the RSC in terms of LLMs, with Jeschonek showing a graph of the number of parameters in top LLMs over the last five years – but Jeschonek also highlighted use cases for flagging harmful or biased content, performing translation tasks (a “significant area of investment” for Meta) and AR/VR work.
Meta has been using the RSC to train its “Large Language Model Meta AI,” or “LLaMA.” LLaMA is a foundational 65 billion-parameter LLM aimed at the research community. Meta says that its aim with LLaMA was to provide a “smaller, more performant model” that could be fine-tuned without major hardware requirements. It also comes in three smaller variants (33 billion, 13 billion and 7 billion parameters). The 65 billion-parameter variant was trained on 2,048 of the RSC’s A100 GPUs across three weeks.
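To put that training run in perspective, the sketch below converts "2,048 GPUs for three weeks" into GPU-hours and into a share of the full 16,000-GPU system. It treats "three weeks" as exactly 21 days of continuous use, which is a simplifying assumption rather than a figure from Meta; the helper names are illustrative.

```python
# Rough scale of the LLaMA 65B training run described above.
HOURS_PER_DAY = 24


def gpu_hours(gpus: int, days: float) -> float:
    """GPU-hours consumed if every GPU runs continuously for the given days."""
    return gpus * days * HOURS_PER_DAY


def cluster_share(gpus_used: int, gpus_total: int) -> float:
    """Fraction of the cluster's GPUs occupied by the job."""
    return gpus_used / gpus_total


run_gpus = 2_048
run_days = 21          # assumption: "three weeks" taken as exactly 21 days
rsc_gpus = 16_000

print(gpu_hours(run_gpus, run_days))                    # 1032192
print(f"{cluster_share(run_gpus, rsc_gpus):.1%}")       # 12.8%
```

On these assumptions, the run works out to roughly one million GPU-hours while occupying about an eighth of the completed RSC.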
On the datacenter front, Meta announced its next-generation design. Details were relatively slim, but Alan Duong, Meta’s global lead for strategic engineering and design, provided some information. The new design is focused on next-gen technology, and Duong said that Meta “[needed] to plan for roughly 4× scale,” also mentioning possible 1.5-2× increases in power consumption by accelerators (and 1.5× for high-bandwidth memory). Flexibility and fungibility were key to the design, he said, with colocated server and networking hardware, along with flexibility in server type and cooling type. To that end, Meta is working on cooling systems capable of delivering both air cooling and various levels of liquid cooling.
In a blog post, Meta confirmed that the new datacenter design will be targeted at AI optimization, “supporting liquid-cooled AI hardware and a high-performance AI network connecting thousands of AI chips for datacenter-scale AI training clusters.” Meta also touted speed and cost savings, which Duong confirmed: “We anticipate that our next-gen datacenter will be 31% more cost-effective, and we’re going to be able to build it 2x faster for a complete full region.”