A US national lab has started training a massive AI brain that could ultimately become the must-have computing resource for scientific researchers.
Argonne National Laboratory (ANL) is creating a generative AI model called AuroraGPT, feeding a vast corpus of scientific information into the model.
The model is being trained on ANL's Aurora supercomputer, which delivers more than half an exaflop of performance. Intel's Ponte Vecchio GPUs provide the system's main computing power.
Intel and ANL are partnering with other labs in the US and worldwide to make scientific AI a reality.
“It combines all the text, codes, specific scientific results, papers, into the model that science can use to speed up research,” said Ogi Brkic, Intel's vice president and general manager for data center and HPC solutions, in a press briefing.
Brkic called the model “ScienceGPT,” indicating it will have a chatbot interface through which researchers can submit questions and receive responses.
Chatbots could help in a wide range of scientific research, including biology, cancer research, and climate change.
Training a model with complex data can take time and massive computing resources. ANL and Intel are in the early stages of testing the hardware before putting the model into full training mode.
While it will operate like ChatGPT, it is unclear whether the generative model will be multimodal, meaning able to generate images and video in addition to text. Inference will also be a big part of the system as scientists seek answers from the chatbot and feed more information into the model.
Training AuroraGPT has just started and could take months to complete. The training is currently limited to 256 nodes, and will then be scaled to all of the Aurora supercomputer's roughly 10,000 nodes.
OpenAI has not shared details on how long it took to train GPT-4, a process that ran on Nvidia GPUs. In May, Google said it was training its upcoming large language model, Gemini, which is likely happening on its TPUs.
The biggest challenge in training large language models is memory: in most cases, the model must be sliced into smaller pieces spread across a wide range of GPUs. AuroraGPT relies on Microsoft's Megatron-DeepSpeed, which builds on Nvidia's Megatron-LM and does exactly that, ensuring the training happens in parallel.
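The core idea behind this slicing can be illustrated with a minimal sketch. This is not Megatron-DeepSpeed's actual code, just a toy NumPy example of tensor parallelism: a layer's weight matrix is split column-wise across hypothetical devices so that no single device has to hold the full matrix in memory.

```python
import numpy as np

# Toy illustration of tensor parallelism (not actual Megatron-DeepSpeed code).
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))    # a batch of activations
W = rng.standard_normal((8, 16))   # the full weight matrix of one layer

n_devices = 4
# Split the weight matrix column-wise; each "device" holds one 8x4 shard.
shards = np.split(W, n_devices, axis=1)

# Each device computes its partial output independently...
partials = [x @ shard for shard in shards]
# ...and the results are stitched together (an all-gather in a real system).
y_parallel = np.concatenate(partials, axis=1)

# The sharded computation matches the single-device result.
print(np.allclose(y_parallel, x @ W))  # True
```

Real frameworks add communication collectives, pipeline stages, and optimizer-state sharding on top of this basic decomposition, but the memory win is the same: each GPU stores only its shard.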
Intel and ANL are testing training of the 1-trillion-parameter model on a set of 64 Aurora nodes.
“The number of nodes is lower than we typically see on these large language models… because [of the] unique Aurora design,” Brkic said.
Intel has worked with Microsoft on fine-tuning the software and hardware so the training can scale across nodes. The goal is to extend this to the entire system of 10,000-plus nodes.
Intel also hopes to achieve linear scaling, in which performance grows in proportion to the number of nodes.
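Linear scaling is easy to quantify. The sketch below uses hypothetical throughput figures (not Aurora measurements) to show how scaling efficiency is typically computed: achieved throughput divided by the throughput perfect linear scaling would predict.

```python
def scaling_efficiency(base_nodes, base_throughput, nodes, throughput):
    """Fraction of the ideal linear speedup actually achieved."""
    ideal = base_throughput * (nodes / base_nodes)  # perfect linear scaling
    return throughput / ideal

# Hypothetical numbers: 64 nodes at 1.0 (arbitrary units),
# scaled to 256 nodes measuring 3.8 instead of the ideal 4.0.
eff = scaling_efficiency(64, 1.0, 256, 3.8)
print(round(eff, 3))  # 0.95, i.e. 95% scaling efficiency
```

An efficiency near 1.0 across the jump from 256 to roughly 10,000 nodes is what "linear scaling" would mean in practice.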
Brkic said Intel's Ponte Vecchio GPUs outperformed Nvidia's A100 GPUs in another Argonne supercomputer, Theta, which has a peak performance of 11.7 petaflops.