Recently, Dan Stanzione, Executive Director of TACC and Associate Vice President for Research at UT-Austin, gave a presentation on HPC sustainability at the Fall 2023 HPC Users Forum. The complete set of slides is available on the forum site.
The Texas Advanced Computing Center (TACC) runs a significant number of machines, including the Frontera, Stampede-2, Jetstream, and Chameleon systems for the National Science Foundation. In addition, they run Longhorn and Lonestar-6 for Texas academic and industry users.
In total, they manage about 20,000 servers with more than one million cores and 1,000 GPUs. The typical power budget is about 6 MW, with a peak usage of 9.5 MW. An additional 30 MW of datacenter capacity will be added for the NSF Leadership-Class Computing Facility (LCCF) in 2025.
TACC’s cooling strategies include in-row chillers, direct processor liquid cooling, immersion cooling, and chilled-water storage to offset power-grid peak demand periods. In addition, TACC employs roughly 200 kW of direct solar and buys wind credits for about 20% of the remainder.
In the slides, Dan states that sustainability is a priority, while at the same time TACC’s mission is to provide the best computational resources to users. Some of TACC’s sustainability plans are outlined, and Dan notes that well-run datacenters are already quite efficient; at most, he estimates another 10-15% can be squeezed out through datacenter efficiency measures.
The real opportunity for change lies in tackling software efficiency.
Software and Sustainability
Dan points to the well-known 5-6x peak-FLOP boost from moving to GPUs. (The assumption is that the faster a program completes, the less energy it uses, and GPUs have a good performance-per-watt ratio.) But he notes that, outside of AI, a significant fraction of codes don’t run on GPUs, and a 5-6x boost is far larger than the estimated 15% remaining from datacenter efficiency measures. Another known issue is that most applications achieve only a single-digit percentage of potential peak performance, which means code efficiency offers a possible order-of-magnitude improvement. Optimized code often draws more power while running, but the shorter run time reduces total energy consumed and increases actual job throughput.
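To make the power-versus-runtime tradeoff concrete, here is a back-of-envelope sketch. The node power draws, run times, and 5x speedup are illustrative assumptions, not TACC measurements:

```python
# Back-of-envelope energy comparison (illustrative numbers, not TACC data).
# Total energy = average node power (W) x run time (h); a faster run can use
# less total energy even when the node draws more power while it runs.

def job_energy_kwh(power_watts: float, runtime_hours: float) -> float:
    """Energy consumed by a job, in kilowatt-hours."""
    return power_watts * runtime_hours / 1000.0

# Hypothetical CPU-only run: a 400 W node for 10 hours.
cpu_energy = job_energy_kwh(power_watts=400, runtime_hours=10)

# Hypothetical GPU run of the same job: the node draws twice the power
# (800 W), but a 5x speedup cuts the run to 2 hours.
gpu_energy = job_energy_kwh(power_watts=800, runtime_hours=10 / 5)

print(f"CPU job: {cpu_energy:.1f} kWh")  # 4.0 kWh
print(f"GPU job: {gpu_energy:.1f} kWh")  # 1.6 kWh, a ~60% energy reduction
```

The same arithmetic applies to the single-digit-percent-of-peak problem: a code that runs 10x faster after optimization occupies the machine, and its share of the power budget, for one-tenth the time.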
The software issue is hard, and according to Dan, “a crappy job on software, with 1,000% potential, is probably better than a great job on datacenter, with 10% potential.”
To help users with these efforts, TACC samples performance data every few minutes on every job to maintain an efficiency profile. However, Dan notes that users just want the fastest answer; there is no incentive to accept a slower answer that uses less power. He suggested that one possibility might be changing charging units from wall-clock hours to total joules consumed, or incentivizing users to move workloads to times when power costs are optimal.
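A joule-based accounting scheme could build on exactly the kind of periodic sampling TACC already performs. The sketch below is a minimal illustration that assumes a hypothetical sample format of (timestamp in seconds, node power in watts); it is not TACC’s actual telemetry format or accounting pipeline:

```python
# Minimal sketch of joule-based job accounting, assuming power samples of
# (timestamp_seconds, node_power_watts) taken every few minutes.
# Illustrative only -- not TACC's actual telemetry format or pipeline.

def job_energy_joules(samples: list[tuple[float, float]]) -> float:
    """Integrate sampled power over time (trapezoidal rule) to get joules."""
    energy = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        energy += 0.5 * (p0 + p1) * (t1 - t0)
    return energy

# Hypothetical job: four samples taken 300 s (5 minutes) apart on one node.
samples = [(0.0, 410.0), (300.0, 455.0), (600.0, 470.0), (900.0, 430.0)]
joules = job_energy_joules(samples)
print(f"Job energy: {joules:,.0f} J ({joules / 3.6e6:.3f} kWh)")
```

Charging in joules rather than wall-clock hours would reward the user whose optimized code finishes the same work with less energy, aligning user incentives with sustainability.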
There is an often-heard phrase in HPC: “at some point, it all comes down to software.” Based on TACC’s experience, one path to sustainability means taking a hard look at software efficiency. Besides, what is the worst that can happen? Your application will finish faster.
(Image source: Dan Stanzione’s slides)