Elon Musk’s xAI cluster, named Colossus (possibly after the 1970 movie about a massive computer that does not end well), has been brought online. Musk recently posted the following on X/Twitter:
“This weekend, the @xAI team brought our Colossus 100k H100 training cluster online. From start to finish, it was done in 122 days. Colossus is the most powerful AI training system in the world. Moreover, it will double in size to 200k (50k H200s) in a few months. Excellent work”
Bringing up a cluster of this magnitude in 122 days is no small feat, although it was reported that Dell and Supermicro had a hand in the effort. Still, sourcing huge amounts of power, cooling, and hardware for such a massive undertaking in four months is remarkable.
As Musk mentioned, the first phase will have 100,000 H100 GPUs, and the second phase will hit 200,000 GPUs, 50,000 of which will be H200s. The entire array of systems is connected over a single RDMA fabric and is possibly the largest Gen-AI cluster to date. The number of “active” Nvidia GPUs is difficult to ascertain because many companies (e.g., Meta) have been reported to hoard GPUs.
The stated use of the massive system is to train the Grok-3 AI. Training Grok-2 reportedly required up to 24,000 Nvidia H100 GPUs from Oracle Cloud, and Grok-3 will need an eight-fold increase in capacity.
The short schedule also raised questions about how xAI would obtain power for the cluster. Using a conservative estimate of 700 Watts for each H100 GPU, the first phase of Colossus requires at least 150 megawatts (~1,400 Watts per server) once all the servers and auxiliary equipment are factored into the power equation.
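The demand side of that estimate can be reproduced with some back-of-the-envelope arithmetic. The ~1,400 W figure is assumed here to be the per-GPU draw once server components, networking, and cooling overhead are included; the exact overhead breakdown is not given in the reporting:

```python
# Back-of-the-envelope power demand for Colossus Phase 1.
GPUS = 100_000
GPU_WATTS = 700        # conservative draw per H100 GPU
WATTS_WITH_OVERHEAD = 1_400  # assumed per-GPU draw incl. server/cooling overhead

gpu_only_mw = GPUS * GPU_WATTS / 1e6
total_mw = GPUS * WATTS_WITH_OVERHEAD / 1e6

print(f"GPUs alone: {gpu_only_mw:.0f} MW")      # 70 MW
print(f"With overhead: {total_mw:.0f} MW")      # 140 MW
```

Doubling the bare-GPU figure to cover overhead lands at roughly 140 MW, in the same ballpark as the "at least 150 MW" estimate once the remaining auxiliary equipment is added.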
It was also reported that the Tennessee Valley Authority could initially deliver 8 megawatts (MW) and then 50 MW in August of this year, providing enough power for about 41,000 servers. With power for less than half the servers, there was concern that most of the systems would sit idle while waiting for resources.
However, after examining satellite imagery, Dylan Patel of SemiAnalysis determined that Musk brought in 14 VoltaGrid natural gas mobile generators connected to what look like four mobile substations. Each of these semi-trailer-sized generators can provide an additional 2.5 MW of power (35 MW total), boosting the supply to 93 MW, enough power for about 66,000 servers. There is no information on how the cluster will get the additional power it needs to light up all the GPUs (and the additional 100,000 GPUs planned for Phase 2).
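The supply-side tally works out as follows (the ~1,400 W per-server draw is an assumption derived from the reported server counts, not a published spec):

```python
# Tallying the reported power sources against an assumed per-server draw.
TVA_INITIAL_MW = 8
TVA_AUGUST_MW = 50
GENERATORS = 14
MW_PER_GENERATOR = 2.5
WATTS_PER_SERVER = 1_400  # assumed draw per server, including overhead

generator_mw = GENERATORS * MW_PER_GENERATOR              # 35 MW
total_mw = TVA_INITIAL_MW + TVA_AUGUST_MW + generator_mw  # 93 MW
servers_powered = int(total_mw * 1e6 / WATTS_PER_SERVER)

print(f"Generators: {generator_mw:.0f} MW, total supply: {total_mw:.0f} MW")
print(f"Enough for about {servers_powered:,} servers")    # ~66,000
```

At that draw, 93 MW covers roughly 66,000 servers, which matches the reported shortfall against a full 100,000-GPU build-out.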
The cost of Colossus Phase 1 is estimated to be between $3 billion and $4 billion. An Nvidia H100 costs approximately $25,000, so buying 100,000 of them will set you back $2.5 billion (unless Nvidia is offering volume discounts).
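The GPU portion of that estimate is simple multiplication, which leaves roughly $0.5-1.5 billion of the estimated total for servers, networking, power, and facilities:

```python
# Rough GPU hardware cost for Phase 1, at an approximate list price
# and with no volume discount assumed.
H100_PRICE = 25_000
GPUS = 100_000

gpu_cost = H100_PRICE * GPUS
print(f"GPU bill alone: ${gpu_cost / 1e9:.1f} billion")  # $2.5 billion
```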
When complete, Colossus will place xAI in the Supercomputing big leagues and hold the record for the largest Gen-AI training cluster in the world. Curiously, there was no mention of any storage capability for Colossus. Building a bigger, better Grok-3 LLM will certainly require colossal amounts of data as well.