In the wake of SC22 last year, HPCwire wrote that “the conference’s eyes had shifted to carbon emissions and energy intensity” rather than the historical emphasis on flops-per-watt and power usage effectiveness (PUE). At ISC 2023 in Hamburg, Germany, this week, that trend continued: nearly every mention of flops-per-watt or the Green500 list was followed by a nod to the broader aspects of sustainability in HPC.
As part of the official ISC program, HPCwire had the opportunity to host a special event – “HPC’s Energy Crossroads: The Roles of Hardware, Software and Location in Low-Carbon HPC” – that brought together three HPC leaders coming at the question of low-carbon HPC from different angles.
On the panel:
- Andrew Grimshaw, president of Lancium Compute, which colocates datacenters near plentiful, congested renewable energy in west Texas in order to provide clean, ultra-low-cost computing services. (See previous HPCwire coverage here.)
- Jen Huffstetler, chief product sustainability officer and VP & GM for future platform strategy and sustainability at Intel, which has been placing an increasing emphasis on the energy and carbon benefits of its hardware and software offerings. (See previous HPCwire coverage here.)
- Vincent Thibault, co-founder of QScale, which is building a massive, renewably-powered campus in Quebec that will leverage large-scale heat reuse to warm industrial greenhouses. (See previous HPCwire coverage here.)
Over the course of the hour, we posed three questions to the participants, all centered around how popular ideas of sustainable HPC have shifted – and how they may need to shift even more. We’ll cover some of the discussion below, but the full session is available to stream exclusively through ISC’s digital platform.
What sustainable HPC needs (and what it doesn’t)
“Over the last few decades, our community has focused on flops per watt and PUE … under the assumption that reducing watts is the way to reduce CO2 emissions,” Grimshaw said. At the same time, he explained, there was an inflection point in the energy world, with incredibly cheap wind and solar energy in some places. But, of course, those renewables have issues: namely, variability and – less visibly – congestion. Grimshaw’s company, Lancium, exploits those issues by building “clean campuses” where renewables are plentiful and congested, allowing HPC customers to run their workloads on a fully renewable, ultra-low-cost grid. Customers can even opt into allowing their workloads to be paused when the resources are less available, further saving on costs and carbon.
“If you think about it, there are many applications in compute – particularly in HPC and HTC – where there are really no humans in the loop, so if the application is paused for an hour or maybe half a day, it’s not the end of the world,” Grimshaw said.
Huffstetler agreed, pointing out an adjacent issue where a high-power server might not be fully utilized by its available workloads – which might be pausable. “If you’re powering a server up and it’s not being utilized,” Huffstetler said, “you can turn the server off and wake it up when necessary. So [we’re] really starting to think about what workloads can handle that, versus this ‘always on, all the time’ mentality that we’ve had.”
“I think we can all agree that not all HPC workloads are the same,” Grimshaw said.
Thibault said that QScale, too, had noticed an increase in workloads that “are not sensitive to latency,” such as large-scale HPC models and AI training.
“If the model runs for 24 hours, if the latency is 60ms, that’s going to be 24 hours and 60 milliseconds instead of 24 hours and 2 milliseconds,” Thibault said. “That means you should locate those workloads where energy is 100% renewable and the climate is as cold as possible.”
“The things that hardware vendors can do for us most effectively are: give us the ability to rapidly boot and unboot the machines,” Grimshaw added, “because we want to transition them essentially between a running state to a non-running state.” Grimshaw pointed out that an idling server can draw 65W – “which doesn’t sound like a lot of power until you multiply it by 10,000” (650kW) – and that GPUs in particular boot very slowly.
Thibault took a different perspective. “If you have a $40,000 GPU or a $10,000 CPU, our clients want to run them pedal-to-the-metal, 24/7, 365 days a year,” he said, pointing out that for a chip like that, the cost of powering it for its entire lifetime might be just a small fraction of its capital cost.
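Thibault’s point is easy to sanity-check with back-of-the-envelope numbers. The figures below are illustrative assumptions – only the $40,000 price tag comes from the panel; the 700W draw and $0.05/kWh rate are placeholders:

```python
# Back-of-the-envelope check of Thibault's claim that lifetime energy cost
# is a small fraction of capital cost. All inputs except CAPEX are assumed.
POWER_W = 700          # assumed sustained draw of a flagship GPU (not quoted)
HOURS = 24 * 365 * 4   # four years of 24/7, pedal-to-the-metal operation
PRICE_PER_KWH = 0.05   # assumed low-cost power price in USD (not quoted)
CAPEX = 40_000         # the "$40,000 GPU" from the quote

energy_kwh = POWER_W / 1000 * HOURS
energy_cost = energy_kwh * PRICE_PER_KWH
print(f"Lifetime energy: {energy_kwh:,.0f} kWh, cost ${energy_cost:,.0f}")
print(f"Energy cost as share of capex: {energy_cost / CAPEX:.1%}")
```

Under these assumptions the four-year energy bill lands around 3% of the purchase price – which is why QScale’s clients would rather keep the silicon busy than save the watts.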
For his part, Thibault said that the thing QScale needed most from hardware vendors was advances in liquid cooling – an item Huffstetler had mentioned earlier in the talk. Thibault explained that as transistors shrank, leakage current increased, “so the power consumption of the chips is increasing kind of exponentially.” And with GPUs moving past the 1kW envelope, the problem was increasingly urgent. “What I believe will be a big, big change is if we can move to warm-water cooling,” Thibault said.
Huffstetler agreed that it was crucial to “look at the holistic energy consumption of the datacenter overall” – something she said Intel did in partnership with QScale and others. “This isn’t only looking at the energy efficiency or the performance-per-watt for the processor or the GPU, it’s also looking at the system-level power – the power required for cooling in the datacenter, which can at times represent up to 40% of the datacenter energy consumption.”
Huffstetler also said that software could be an enormous help to “leave no transistor behind,” adding that Intel had seen energy efficiency improvements of up to 100× through co-optimization.
“Furthermore, hardware players can think about how they provide more granularity into what is actually happening on the platform itself,” Huffstetler said. “So: management software that enables monitoring, analysis and even emissions control by forecasting carbon emissions and future power and space needs, monitoring the device and the datacenter energy consumption.” Huffstetler also mentioned advanced telemetry being added to Intel chips that would allow monitoring and management of system-level processes, enabling things like carbon-aware workloads.
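The carbon-aware workload idea can be sketched in a few lines. This is a hypothetical illustration, not Intel’s telemetry API: `get_carbon_intensity()` stands in for whatever grid or platform signal is actually available.

```python
# Hedged sketch of a carbon-aware scheduler: defer a pausable batch job
# until grid carbon intensity falls below a policy threshold.
import time

CARBON_THRESHOLD = 200  # gCO2/kWh, an assumed policy threshold


def get_carbon_intensity():
    """Hypothetical stand-in for a grid/telemetry feed (gCO2/kWh)."""
    return 150  # e.g., a windy afternoon on a renewable-heavy grid


def run_when_clean(job, poll_seconds=600):
    """Run `job` only when the grid is clean enough; otherwise wait.

    As Grimshaw noted, many HPC/HTC jobs have no human in the loop,
    so pausing for an hour (or half a day) is acceptable.
    """
    while get_carbon_intensity() > CARBON_THRESHOLD:
        time.sleep(poll_seconds)  # job is deferred, saving carbon and cost
    return job()


result = run_when_clean(lambda: "job done")
```

Real deployments would layer in checkpointing and forecasting, but the core decision – run now or wait for cleaner power – is this simple.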
Increasing the appeal of renewable colocation
We typically hear trepidation when HPC users talk about moving computing offsite and colocating with renewable energy through providers like Lancium and QScale: many HPC users are accustomed to direct access to their systems, and for sensitive research (e.g. medicine, security, corporate secrets) there can be serious worries about data sovereignty and security.
Thibault said addressing these concerns was a question they were “facing daily at QScale,” but he brought the question back to its fundamentals by recounting how New York banks had their HPC operations slowly pushed out of Manhattan, first by ballooning technology footprints and then by ballooning energy needs. “When you move from a system that consumes 1MW of energy to something that’s consuming 15MW of energy, let me tell you: upgrading the headquarters in Manhattan’s going to be impossible,” Thibault said.
Some organizations, Thibault said, like the Department of Defense, genuinely couldn’t move workloads – but most organizations weren’t the DOD, and he said that the full cost of hosting at QScale was often less than the cost of the energy for those workloads in places like Germany.
“How many of us have friends in Europe who run supercomputing centers who, even if they wanted to build, can’t get the power to do so?” Grimshaw agreed, saying that remote management was increasingly the norm rather than the exception.
Huffstetler added that “the state of security and trust has been evolving” and that colocation was “not only becoming more cost-effective,” but that companies like Intel were also providing many new tools (like Project Amber) to help build user trust.
Heat reuse was also a hot topic (get it?) during the panel. As mentioned above, QScale is planning massive heat reuse in partnership with industrial-scale greenhouses – which, in Quebec, need to be substantially heated during the long, harsh winters. “We believe that the cloud is turning into smog, and our objective is to turn the smog into tomatoes,” Thibault quipped.
Huffstetler said she was “fully aligned with QScale” on heat reuse, adding that it was “how we’re going to be giving back to local communities wherever these datacenters are built”; Thibault replied that heat reuse was a “key point to get community buy-in.”
“My hope is that we’ll switch from using PUE as a factor of efficiency to ‘ERE,’ which is energy reuse effectiveness – so the amount of the heat that is produced by the computer that we can effectively reuse,” Thibault added.
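For reference – this definition comes from the Green Grid, not from the panel – Energy Reuse Effectiveness is defined as:

```latex
\mathrm{ERE} = \frac{E_{\mathrm{total\ facility}} - E_{\mathrm{reused}}}{E_{\mathrm{IT}}}
```

Unlike PUE, which can never fall below 1.0, ERE drops toward 0 as more of the facility’s waste heat is put to use – so a campus that pipes most of its heat into greenhouses could report an ERE well below 1.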
A cohesive path forward?
Near the end of the session, Thibault took the opportunity to present his vision for a more sustainable HPC pipeline.
“The way I hope the world will move forward is that we have the latest and greatest equipment that’s running in facilities like QScale for a period of two, three, four years maybe,” he said. “And after that, the hardware is getting replaced and instead of moving to a dump site, it could be repurposed in a site like Lancium’s where the power cost is going to be basically nil to run it.”
“We know that there are users, principally in the research community, where they have more time, [but] they have less resources,” he continued. “How do we actually give those people those resources?”
The quotes in this feature are excerpts from the ISC special event “HPC’s Energy Crossroads: The Roles of Hardware, Software and Location in Low-Carbon HPC.” The full session is available exclusively through ISC’s digital platform.