Famously, a team of researchers from the University of Massachusetts, Amherst, concluded in 2019 that training a single large AI model could emit five times as much carbon as an average car emits over its entire lifetime, manufacturing included. With the boom in large language models (LLMs) and similar generative AI tools over the past several years, that ominous climate metric casts an ever-larger shadow. Now, researchers at the University of Michigan have developed an energy optimization framework that they say can cut the energy used in AI training by up to 75%.
“At extreme scales, training the GPT-3 model just once consumes 1,287 MWh, which is enough to supply an average U.S. household for 120 years,” said Mosharaf Chowdhury, an associate professor of electrical engineering and computer science at the University of Michigan, in an interview with Zachary Champion. For reference, the Amherst study had estimated energy needs for training the largest model they studied at ~656 MWh. Per the EPA, the average impact of using 1,287 MWh is about 557 metric tons of carbon dioxide equivalent, which would be like driving a standard combustion-engine car for over 1.4 million miles.
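A back-of-the-envelope check reproduces both comparisons. The figures below are approximations not taken from the study: roughly 10.7 MWh of electricity per average U.S. household per year and roughly 0.4 kg of CO2 per passenger-vehicle mile.

```python
# Rough sanity check of the quoted figures (illustrative only; assumes
# ~10.7 MWh/year per average U.S. household and ~0.4 kg CO2 per
# passenger-vehicle mile, which are approximate reference values).
training_energy_mwh = 1_287

household_mwh_per_year = 10.7
years_of_household_use = training_energy_mwh / household_mwh_per_year
print(f"Household-years of electricity: {years_of_household_use:.0f}")  # ~120

emissions_tonnes = 557   # metric tons CO2e, per the EPA-based estimate
kg_co2_per_mile = 0.4    # typical passenger vehicle
equivalent_miles = emissions_tonnes * 1_000 / kg_co2_per_mile
print(f"Equivalent passenger-vehicle miles: {equivalent_miles:,.0f}")    # ~1.4 million
```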
Suffice it to say: it’s a problem. The key, the researchers explained, was optimizing with more than just time-to-completion in mind. The researchers built Zeus to throttle power use by GPUs, combining adjustments to the GPUs’ power limits with reductions to the batch size of the deep learning model being trained. Typically, batch sizes and power limits are pushed up to reduce training time; Zeus instead optimizes for energy, dialing those settings back until the cost in training time would be too great.
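To make the power-limit half of that tradeoff concrete, the sketch below sweeps a GPU's supported power limits, profiles a short slice of training at each, and picks the limit that minimizes a weighted combination of energy and time. This is an illustrative sketch only, not the Zeus implementation: it assumes NVIDIA's NVML bindings (pynvml), a GPU recent enough to expose a total-energy counter, a hypothetical train_for_profiling() callable supplied by the user, and the administrator privileges normally required to change power limits. It also omits the batch-size knob that Zeus tunes alongside the power limit.

```python
import time
import pynvml

# Illustrative sketch only -- not the Zeus implementation.
ETA = 0.5  # weight between energy savings (1.0) and training-time slowdown (0.0)

def profile_power_limit(handle, limit_mw, train_for_profiling):
    """Run a short training slice at one GPU power limit; return (energy_J, seconds)."""
    pynvml.nvmlDeviceSetPowerManagementLimit(handle, limit_mw)   # needs admin rights
    start_energy_mj = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)
    start_time = time.monotonic()
    train_for_profiling()                                        # user-supplied callable
    seconds = time.monotonic() - start_time
    energy_j = (pynvml.nvmlDeviceGetTotalEnergyConsumption(handle) - start_energy_mj) / 1e3
    return energy_j, seconds

def pick_power_limit(train_for_profiling):
    """Sweep supported power limits and return the one with the lowest weighted cost."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    min_mw, max_mw = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)
    best_limit, best_cost = max_mw, float("inf")
    # Sweep from the default (max) power limit downward in 25 W steps.
    for limit_mw in range(max_mw, min_mw - 1, -25_000):
        energy_j, seconds = profile_power_limit(handle, limit_mw, train_for_profiling)
        # Weighted cost: pure energy at ETA=1; time scaled into joules at ETA=0.
        cost = ETA * energy_j + (1 - ETA) * (max_mw / 1e3) * seconds
        if cost < best_cost:
            best_limit, best_cost = limit_mw, cost
    pynvml.nvmlShutdown()
    return best_limit
```

The single weighting knob (ETA here) captures the point in the quote below: past a certain power limit, extra watts buy very little speed, so a small slowdown can be exchanged for a large energy saving.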
“Existing work primarily focuses on optimizing deep learning training for faster completion, often without considering the impact on energy efficiency,” said Jae-Won Chung, a doctoral student at the University of Michigan and co-first author of the study. “We discovered that the energy we’re pouring into GPUs is giving diminishing returns, which allows us to reduce energy consumption significantly, with relatively little slowdown.”
The researchers say that Zeus requires no hardware changes and is designed to integrate easily with existing workflows. Zeus is complemented by Chase, a software tool that ramps up training speed when lower-carbon energy is available, a load-shaping strategy that appears to be gaining traction in the industry as the energy costs of HPC and AI become harder to manage.
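The load-shaping idea behind Chase can be illustrated with a simple control loop that polls a grid carbon-intensity signal and loosens or tightens the GPU power cap in response. This is a sketch under assumed names, not the Chase implementation: carbon_intensity_g_per_kwh() stands in for a real data source (such as a grid operator's or carbon-intensity provider's API), and set_gpu_power_limit_watts() stands in for the NVML call used in the earlier sketch.

```python
import time

def carbon_intensity_g_per_kwh() -> float:
    # Placeholder: a real deployment would query a carbon-intensity data source.
    return 250.0

def set_gpu_power_limit_watts(watts: int) -> None:
    # Placeholder: would wrap pynvml.nvmlDeviceSetPowerManagementLimit, as above.
    pass

LOW_CARBON_THRESHOLD = 200     # gCO2/kWh below which training is allowed to run fast
HIGH_POWER_W, LOW_POWER_W = 300, 150

def carbon_aware_loop(poll_seconds: int = 300) -> None:
    """Ramp training speed up when the grid is cleaner, down when it is dirtier."""
    while True:
        if carbon_intensity_g_per_kwh() < LOW_CARBON_THRESHOLD:
            set_gpu_power_limit_watts(HIGH_POWER_W)   # cleaner grid: train fast
        else:
            set_gpu_power_limit_watts(LOW_POWER_W)    # dirtier grid: save energy
        time.sleep(poll_seconds)
```

Because the job keeps running at a reduced rate rather than pausing or migrating, this style of throttling sidesteps the data-size, data-regulation, and freshness constraints described in the quote below.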
“It is not always possible to readily migrate DNN training jobs to other locations due to large dataset sizes or data regulations,” said Zhenning Yang, a master’s student in computer science and engineering. “Deferring training jobs to greener time frames may not be an option either, since DNNs must be trained with the most up-to-date data and quickly deployed to production to achieve the highest accuracy. Our aim is to design and implement solutions that do not conflict with these realistic constraints, while still reducing the carbon footprint of DNN training.”
To learn more about this research, read the paper here and the reporting from the University of Michigan’s Zachary Champion here.