Ahead of ‘Dojo,’ Tesla Reveals Its Massive Precursor Supercomputer

By Oliver Peckham

June 22, 2021

In spring 2019, Tesla made cryptic reference to a project called Dojo, a “super-powerful training computer” for video data processing. Then, in summer 2020, Tesla CEO Elon Musk tweeted: “Tesla is developing a [neural network] training computer called Dojo to process truly vast amounts of video data. It’s a beast! … A truly useful exaflop at de facto FP32.” Welcome to summer 2021 – it’s time for your annual Dojo update.

Well, sort of: rather than revealing the ins and outs of Dojo, Tesla opted to reveal a precursor cluster that the company estimates may be the fifth-most-powerful supercomputer in the world.

The newly revealed Tesla cluster. Image courtesy of Karpathy/Tesla.

The casual reveal happened during a talk by Andrej Karpathy, the senior director of AI at Tesla, at the Workshop on Autonomous Driving at the Conference on Computer Vision and Pattern Recognition (CVPR 2021). “I wanted to briefly give a plug to this insane supercomputer that we are building and using now,” Karpathy said. As he explained, the cluster (if it has a name, Karpathy didn’t share it with the audience) sports 720 nodes, each powered by eight of Nvidia’s A100 GPUs (the 80GB model), for a whopping 5,760 A100s throughout the system. This accelerator firepower is complemented by ten petabytes of “hot tier” NVMe storage, which has a transfer rate of 1.6 terabytes per second. Karpathy said that this “incredibly fast storage” constitutes “one of the world’s fastest filesystems.”
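For readers who want to check the scale of those figures, here is a minimal Python sketch of the arithmetic. It uses only the numbers quoted above; the decimal unit conversions (1 PB = 1,000 TB) and the idea of reading the whole hot tier at the full quoted rate are simplifying assumptions, not something Karpathy described.

```python
# Figures quoted in the talk. Decimal units (1 PB = 1000 TB) are an assumption.
NODES = 720
GPUS_PER_NODE = 8
GPU_MEMORY_GB = 80        # the 80GB A100 model
HOT_TIER_PB = 10          # "hot tier" NVMe capacity
STORAGE_TB_PER_S = 1.6    # quoted transfer rate


def total_gpus(nodes: int = NODES, per_node: int = GPUS_PER_NODE) -> int:
    """Number of GPUs across the whole cluster."""
    return nodes * per_node


def aggregate_gpu_memory_tb() -> float:
    """Combined GPU memory in terabytes."""
    return total_gpus() * GPU_MEMORY_GB / 1000


def full_read_seconds(capacity_pb: float = HOT_TIER_PB,
                      rate_tb_per_s: float = STORAGE_TB_PER_S) -> float:
    """Idealized time to read the entire hot tier once at the quoted rate."""
    return capacity_pb * 1000 / rate_tb_per_s


if __name__ == "__main__":
    print(f"GPUs: {total_gpus()}")
    print(f"Aggregate GPU memory: {aggregate_gpu_memory_tb():.1f} TB")
    print(f"Full hot-tier read: {full_read_seconds() / 3600:.2f} hours")
```

Under these idealized assumptions, the cluster holds roughly 460 TB of GPU memory, and a single end-to-end read of the hot tier would take a little under two hours.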

“So this is a massive supercomputer,” Karpathy said. “I actually believe that in terms of flops this is roughly the number five supercomputer in the world, so it’s actually a fairly significant computer here.”

Some back-of-the-envelope flops math seems to bear out Karpathy’s remarkable claim. According to Nvidia’s marketing materials, each A100 is capable of 9.7 peak double-precision (FP64) teraflops, but in benchmarking for systems like the Selene supercomputer, eight-A100 nodes each deliver around 113.3 Linpack teraflops (~14.2 Linpack teraflops per GPU, inclusive of accompanying processors). Multiply that by 720 eight-A100 nodes and you get around 81.6 Linpack petaflops – enough to place the Tesla cluster well above the aforementioned Selene system, operated by Nvidia, which delivers 63.5 Linpack petaflops and placed fifth on the most recent Top500 list. (The Top500 often does not include corporate systems like Tesla’s due to trade secrecy, and the list is due to be refreshed at ISC21 this coming week.)
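The estimate above can be reproduced in a few lines of Python. This is only a sketch of the same back-of-the-envelope reasoning: it assumes Tesla’s nodes would score the same per-node Linpack figure as Selene’s, which is an extrapolation rather than a measured result.

```python
# Per-node Linpack figure taken from Selene benchmarking, as quoted above.
# Applying it to Tesla's cluster is an assumption, not a measurement.
LINPACK_TFLOPS_PER_NODE = 113.3
GPUS_PER_NODE = 8
TESLA_NODES = 720
SELENE_LINPACK_PFLOPS = 63.5


def linpack_tflops_per_gpu() -> float:
    """Per-GPU share of a node's Linpack score (includes host processors)."""
    return LINPACK_TFLOPS_PER_NODE / GPUS_PER_NODE


def estimated_linpack_pflops(nodes: int = TESLA_NODES) -> float:
    """Extrapolated Linpack petaflops for a cluster of identical nodes."""
    return nodes * LINPACK_TFLOPS_PER_NODE / 1000


if __name__ == "__main__":
    estimate = estimated_linpack_pflops()
    print(f"Per GPU: {linpack_tflops_per_gpu():.1f} Linpack teraflops")
    print(f"Cluster estimate: {estimate:.1f} Linpack petaflops")
    print(f"Ratio to Selene: {estimate / SELENE_LINPACK_PFLOPS:.2f}x")
```

Real Linpack runs rarely scale perfectly with node count, so the extrapolated figure is best read as an upper-end estimate.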

This cluster – and, eventually, Dojo – is being deployed in service of Tesla’s feverish push for the next generation of vehicle automation: full self-driving (FSD) vehicles. In the talk, Karpathy discussed why the electric vehicle juggernaut is moving toward FSD and how its clusters – including the new one – serve that ambition.

One of Karpathy’s first slides was particularly telling: a poorly Photoshopped brain in the driver’s seat of a zooming car, captioned with statistics characterizing humans as meat computers with a “250 ms reaction latency” in a “tight control loop with one-ton objects at 80 miles per hour.” For Tesla, FSD is about replacing that sluggish computer (which Karpathy noted could write poetry, but often had trouble staying within the lines on the road) with a faster, safer one.

But training computers to understand roads – even with cameras and lidar on-board – is difficult, involving innumerable contingencies and bizarre scenarios that impede the vehicle’s ability to process its surroundings in a traditional manner. In one example, Karpathy showed a truck kicking up dust and debris that obscured the cameras, effectively blinding the vehicle for several seconds.

A network switch on the cluster. Image courtesy of Karpathy/Tesla.

In order to train systems that can cope with these obstacles, Tesla first collects mountains of data. “For us, computer vision is the bread and butter of what we do and what enables the autopilot,” Karpathy said. “And for that to work really well, you need a massive dataset – we get that from the fleet.” And, indeed, the dataset is massive: one million ten-second videos from each of the eight cameras on the sampled Teslas, each running at 36 frames per second and capturing “highly diverse scenarios.” These videos contain six billion object labels (including accurate depth and velocity data) and total 1.5 petabytes.
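To put those dataset figures in perspective, the Python sketch below works out some rough averages. It assumes that each of the one million clips includes footage from all eight cameras, that units are decimal (1 PB = 10^15 bytes), and that storage can be read at the full 1.6 terabytes per second quoted for the cluster; none of these details were spelled out in the talk.

```python
# Dataset figures quoted in the talk. The per-frame and per-clip averages
# below rest on the assumptions stated in the accompanying text.
CLIPS = 1_000_000
CLIP_SECONDS = 10
CAMERAS = 8
FRAMES_PER_SECOND = 36
OBJECT_LABELS = 6_000_000_000
DATASET_BYTES = 1.5e15          # 1.5 petabytes, decimal units
STORAGE_BYTES_PER_S = 1.6e12    # 1.6 terabytes per second


def total_frames() -> int:
    """Frames across all clips and all cameras."""
    return CLIPS * CAMERAS * CLIP_SECONDS * FRAMES_PER_SECOND


def gigabytes_per_clip() -> float:
    """Average size of one eight-camera clip, in gigabytes."""
    return DATASET_BYTES / CLIPS / 1e9


def bytes_per_frame() -> float:
    """Average stored size of a single camera frame."""
    return DATASET_BYTES / total_frames()


def labels_per_frame() -> float:
    """Average number of object labels per camera frame."""
    return OBJECT_LABELS / total_frames()


def single_pass_seconds() -> float:
    """Idealized time to stream the whole dataset once from storage."""
    return DATASET_BYTES / STORAGE_BYTES_PER_S


if __name__ == "__main__":
    print(f"Total frames: {total_frames():,}")
    print(f"Average clip size: {gigabytes_per_clip():.2f} GB")
    print(f"Average frame size: {bytes_per_frame() / 1e6:.2f} MB")
    print(f"Labels per frame: {labels_per_frame():.2f}")
    print(f"One pass over the data: {single_pass_seconds() / 60:.1f} minutes")
```

On these assumptions, the dataset works out to about 2.9 billion frames at roughly half a megabyte each, and one idealized pass over all of it would take around a quarter of an hour.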

“You … need to train massive neural nets and experiment a lot,” Karpathy said. “Training this neural network – like I mentioned, this is a 1.5 petabyte dataset – requires a huge amount of compute.” Accordingly, he said, Tesla “invested a lot” into this capability. In particular, Karpathy explained, the newly unveiled cluster is optimized for rapid video transfer and processing, thanks to that aforementioned “incredibly fast storage” and “a very efficient fabric” that enables distributed training across the nodes.

Dojo, for its part, is still being teased. “We’re currently working on Project Dojo, which will take this to the next level,” Karpathy said. “But I’m not ready to reveal any more details about that at this point.” Little is known about the mysterious forthcoming system beyond a handful of tweets by Musk referencing the exaflop target, claiming that “Dojo uses our own chips [and] a computer architecture optimized for neural net training, not a GPU cluster” and sharing that Dojo will be available as a web service for model training “once we work out the bugs.”

“Could be wrong,” Musk tweeted, “but I think it will be best in the world.”

For now, though, Tesla is content to let the world know that it’s betting big on HPC – and that the bets are only getting bigger. Karpathy said that the HPC team is “growing a lot,” and encouraged audience members who were excited by HPC applications in self-driving cars to reach out to the company.
