At SC22, Carbon Emissions and Energy Costs Eclipsed Hardware Efficiency

By Oliver Peckham

December 2, 2022

The race to ever-better flops-per-watt and power usage effectiveness (PUE) has, historically, dominated the conversation over sustainability in HPC – but at SC22, held last month in Dallas, something felt different. Across a bevy of panels and birds-of-a-feather sessions – both sustainability-focused and more general – the message became clear: the conference’s eyes had shifted to carbon emissions and energy costs.

Perhaps the most boisterous session, “Addressing HPC’s Carbon Footprint,” featured seven participants: Jay Boisseau, HPC and AI technology strategist at Dell; Andrew Chien, professor at the University of Chicago and senior computer scientist at Argonne National Laboratory; Andrew Grimshaw, president of Lancium Compute (and moderator of the session); Dieter Kranzlmüller, director of the Leibniz Supercomputing Center (LRZ) in Germany; Vincent Lim, executive director of NSCC in Singapore; Satoshi Matsuoka, director of RIKEN in Japan; and Alan Sill, managing director of the High-Performance Computing Center (HPCC) at Texas Tech University.

Moving past PUE and flops-per-watt

Grimshaw’s company, Lancium, had first come to our attention during this panel’s predecessor at SC21. The company’s pitch, essentially, is to position cheap, hot datacenters in west Texas, where an overabundance of variable, congested renewable energy leads to frequent negative energy pricing (where users are paid to accept energy load) and near-constant low prices for much of the rest of the time. For the remaining 5% of the time – when demand increases, fossil plants kick in and prices go up – Lancium stops running workloads in its datacenters. The result: fully renewable datacenters with bargain-bin energy prices – as long as you can stomach putting your workloads on pause every now and then.

“If we want to do low-carbon … computing and be really inexpensive, we need to move computing to the load,” Grimshaw (pictured in the header) said. “We’re well-suited to do that in HPC because, believe it or not … we have loads that we can pause.” HPC workloads, he continued, tend to operate in batches and typically don’t have a human in the loop. “If you stop it for 20 minutes, at the end of the day, nobody will really know.”


“If you flex with the grid, it’ll allow you to access low-cost, low-carbon power.”


Grimshaw was joined in this refrain by Chien, who hosted the panel last year. “What we’ve learned over the last five years of studying this problem is that actually, there’s an opportunity for greater-capacity, lower-cost HPC if you think about these things in the right way,” he said. “If you flex with the grid, it’ll allow you to access low-cost, low-carbon power. … This doesn’t have to be a downer, doesn’t have to be a restriction and doesn’t have to be a more expensive [option].” The macro-level electric grid trends, Chien said, were also headed in this direction: the grids need it just as much as HPC does. And: “This notion of stranded power or excess renewable power is much more widespread than west Texas,” he said, “so don’t think this doesn’t exist in your geo.”

Andrew Chien, professor at the University of Chicago and senior computer scientist at Argonne National Laboratory.

This premise is about as total a rejection of the flops-per-watt and PUE metrics as you can get: cheap, inefficient hardware and infrastructure that instead derives sustainability from its location and operation. Grimshaw wasn’t shy about that. “We’ve been using these metrics in the community [for] at least the last 10 or 15 years – power efficiency and flops-per-watt,” he said. “Flops-per-watt isn’t the right metric, because if watts are free – including carbon-free – why focus on that? What we really care about is flops per dollar (and power is our problem), or flops per kilogram of CO2 if we’re concerned about that. … Flops-per-watt was really a proxy for what we really cared about, which was CO2 or energy cost.”
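Grimshaw's point can be made concrete with a back-of-the-envelope comparison. The sketch below is illustrative only; the system size, energy price, and grid carbon intensity are assumed figures, not numbers cited in the session:

```python
# Back-of-the-envelope comparison of the metrics Grimshaw contrasts.
# All input figures below are illustrative assumptions, not panel data.

PFLOPS = 10.0                 # assumed sustained performance, petaflops
POWER_MW = 5.0                # assumed average draw, megawatts

flops_per_sec = PFLOPS * 1e15
watts = POWER_MW * 1e6

# Traditional hardware-efficiency metric: flops per watt.
flops_per_watt = flops_per_sec / watts

# Grid-dependent metrics over one hour of operation.
energy_kwh = POWER_MW * 1000  # 5 MW for one hour -> 5,000 kWh
total_flops = flops_per_sec * 3600

price_per_kwh = 0.03          # assumed cheap wind-heavy market, $/kWh
carbon_kg_per_kwh = 0.05      # assumed low-carbon grid mix, kg CO2/kWh

flops_per_dollar = total_flops / (energy_kwh * price_per_kwh)
flops_per_kg_co2 = total_flops / (energy_kwh * carbon_kg_per_kwh)

print(f"flops/W:      {flops_per_watt:.3g}")
print(f"flops/$:      {flops_per_dollar:.3g}")
print(f"flops/kg CO2: {flops_per_kg_co2:.3g}")
```

The contrast the sketch illustrates: flops-per-watt is fixed by the hardware, while flops-per-dollar and flops-per-kilogram-of-CO2 change with where and when the machine runs, which is exactly the lever Lancium's model pulls.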

Boisseau similarly cast doubt on the energy efficiency regime. “I spent years and years,” he said, “trying to figure out how to build a power-efficient system and then a datacenter where I could get the PUE from 1.3 to 1.2, or [from] 1.2 to 1.1 – and yet, it was in Texas, it was all fossil-based fuels anyway, so I really needed to address the 1.0 with sustainable energy at some point.” Boisseau said he liked what Lancium was doing, adding that Dell – while a leader in hardware – had “not been an innovator in delivery models” and would be “taking a more proactive approach on that going forward.”


“I spent years and years trying to figure out how to build a power-efficient system … and yet, it was in Texas, it was all fossil-based fuels anyway.”


“I’d really much rather see a metric that is carbon efficiency than PUE,” Boisseau later added. This was a sentiment he would reiterate the following week at one of Dell’s webinars: “I’ve actually proposed that it should be measured in carbon per flop instead of flops-per-watt. Flops-per-watt you want to be an increasing number, but what you really want to zero out is carbon, so trying to reduce carbon per flop to zero would be great. And it really changes the way you think about energy if you’re using 100% green energy.”

Chien also suggested that the kinds of high-efficiency hardware measures that reduce PUE might not mesh well with efforts to shape workloads to the grid. “Several people were very proud of the fact that they were able to drive down their PUE by using hot-water cooling,” he said. “I think that was a good idea; I think it’s the wrong idea going into this world.” If you want to flex your capacity up and down, he said, you need to be able to increase your heat-carrying capacity out of the datacenter. “The way you increase the heat-carrying capacity out of the building is by dropping the temperature of the water and increasing the flow rate, both of which increase PUE.”
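Chien's argument follows from the basic heat-transport relation Q = ṁ·c_p·ΔT: the heat a water loop carries out of the building scales with flow rate and with the temperature difference between supply and return. The figures below are illustrative assumptions, not numbers from the panel:

```python
# Heat carried out by a water loop: Q = m_dot * c_p * delta_T.
# Temperatures and flow rates below are illustrative assumptions.

C_P = 4186.0  # specific heat of water, J/(kg*K)

def heat_removed_mw(flow_kg_per_s: float, supply_c: float, return_c: float) -> float:
    """Thermal power carried out of the building, in megawatts."""
    delta_t = return_c - supply_c
    return flow_kg_per_s * C_P * delta_t / 1e6

# Hot-water cooling baseline: 40 C supply, 50 C return, 50 kg/s.
base = heat_removed_mw(50, 40, 50)      # ~2.1 MW removed

# To flex compute capacity up, more heat must leave the building:
# raise the flow rate and/or widen delta_T by dropping the supply
# temperature. Both cost extra pump/chiller energy, which is what
# pushes PUE up -- Chien's point.
flexed = heat_removed_mw(100, 30, 50)   # ~8.4 MW removed
```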

Lim shared the difficulties of managing energy efficiency in Singapore's high-humidity, high-heat climate, but added that relocating workloads across international borders came with its own issues. NSCC was open to hosting in other countries, he said, but "the most challenging thing about that is to take care of the data sovereignty issue."

Rising energy costs

Elsewhere in the world, rising energy prices caused paradigm shifts for two of the other panelists. Matsuoka, speaking for the #2-ranked Fugaku system, shared that the prices had forced a dramatic move for RIKEN, showing a graph of Fugaku's operation over the course of the year with a steep drop over the final months. "That's when we had to turn off 30% of the nodes because we were facing financial crisis due to this sudden surge in electricity prices," he said. Matsuoka was, however, reluctant to fully endorse variable capacity as a solution: amortizing the billion-dollar cost of Fugaku over five years, each year represented $200 million in capital expenditures; even in the face of $40 million a year in energy costs, shutting the system down represented a net loss on investment.

Satoshi Matsuoka presents at the BOF. Behind him, a graph showing the time that Fugaku spun down 30% of its nodes due to rising energy costs. Image courtesy of Lancium Compute.
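Matsuoka's arithmetic is worth sketching out. Using the figures cited in the session ($1 billion amortized over five years against roughly $40 million a year in energy), the effective cost of each unit of delivered compute rises when the machine sits idle, because capex accrues regardless of use:

```python
# Matsuoka's amortization argument, using the figures cited in the session.

capex = 1_000_000_000            # system cost, $
lifetime_years = 5
energy_cost_per_year = 40_000_000

capex_per_year = capex / lifetime_years   # $200M/year whether it runs or not

def cost_per_delivered_year(utilization: float) -> float:
    """Effective annual cost per unit of delivered compute.

    Energy scales with use, but capex is sunk: run the machine less
    and each delivered compute-year gets more expensive.
    """
    energy = energy_cost_per_year * utilization
    return (capex_per_year + energy) / utilization

full = cost_per_delivered_year(1.0)       # $240M per delivered compute-year
throttled = cost_per_delivered_year(0.7)  # ~30% idle -> ~$326M per delivered year
```

This is why a capex-heavy flagship like Fugaku resists the Lancium-style pause-and-flex model: saving a fraction of the $40 million energy bill costs far more in stranded capital.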

Instead, he said, RIKEN would be pushing its user community to pursue energy efficiency. “Starting next year, we plan on allocating people energy instead of runtime hours,” Matsuoka said.


“We plan on allocating people energy instead of runtime hours.”
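What energy-based allocation could look like in practice can be sketched in a few lines. This is a hypothetical illustration, not RIKEN's actual scheduler: jobs are charged for the kilowatt-hours they consume rather than the node-hours they occupy, so a power-tuned run stretches an allocation further.

```python
# Hypothetical sketch of energy-based allocation accounting
# (not RIKEN's actual scheduler): jobs are charged for kWh
# consumed, not node-hours occupied.

from dataclasses import dataclass

@dataclass
class Job:
    name: str
    node_hours: float
    avg_node_power_kw: float   # measured average draw per node

    @property
    def energy_kwh(self) -> float:
        return self.node_hours * self.avg_node_power_kw

def charge(budget_kwh: float, jobs: list[Job]) -> float:
    """Deduct each job's measured energy from the user's kWh budget."""
    for job in jobs:
        budget_kwh -= job.energy_kwh
    return budget_kwh

# Same node-hours, but the tuned run draws less power per node,
# so it consumes less of the allocation: 900 kWh vs. 600 kWh.
jobs = [Job("cfd_run", 1000, 0.9), Job("tuned_cfd_run", 1000, 0.6)]
remaining = charge(10_000, jobs)
print(remaining)   # 8500.0
```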


Kranzlmüller, meanwhile, shared that the rising energy prices had opened doors for managing the heat from LRZ’s SuperMUC-NG system. While SuperMUC-NG operated at a very low PUE – 1.06, thanks to hot-water cooling – they had been unable, until recently, to engage in the kind of waste heat reuse seen with the LUMI system. “We wanted to use the heat from the system just for heating – a very simple, straightforward thing. We wanted to do this for ten years, and nobody wanted to take the heat because it’s too much effort, you need to connect it to the loops and so on,” he said. “Today, with the strange global situation, suddenly they want our heat. … You see how stupid that is? We could have done that years ago.”

In response to an audience question about moving workloads to more sustainable colocation sites that used these kinds of technologies, Sill mentioned Quebec-based startup QScale, which is intending to use heat from its ultra-renewable HPC datacenters to help warm greenhouses in the Canadian winters. HPCwire had a chance to visit QScale’s first datacenter in October, and the company – like Lancium – made its booth debut at SC22.

A different conversation

This session was just one of many to discuss these themes at SC22: another carbon footprint session was hosted the following day (sadly, we were unable to attend), along with several other sustainability-oriented sessions and meetings. Discussions of carbon emissions and power savings pervaded vendor announcements and general panels, and the ACM announced that climate change research will take over from Covid-19 research as the subject of the Gordon Bell Special Prize over the coming years.

Curiously quiet among the sustainability news at SC22 was the Green500 and its associated birds-of-a-feather session. There was news, of course: as we covered during the conference, Nvidia’s H100 GPU debuted in a small system named Henri (operated by the Flatiron Institute), achieving unparalleled flops per watt in its Linpack run and dethroning the Frontier-style HPE/AMD systems that now dominate even more of the top ten than they did when they debuted on the May list.

Experientially, though, discussion of these efficiency achievements was relatively muted at the conference (despite AMD's ubiquitous, tree-lined banners advertising that its hardware powered the world's most energy-efficient supercomputers). Instead, the conversations around flops per watt and PUE seemed to fade against an ever-louder, increasingly urgent awareness that the climate and financial costs of powering supercomputers have finally become too hefty to ignore – and that hardware efficiency alone isn't enough to address them.
