Liquid Cooling: Delivering on the Promise

By Larry Vertal

November 10, 2014

The demand for more efficient liquid cooling in HPC data centers is being driven by a number of factors. The push to exascale is the most extreme case, with power and cooling looming large as barriers, but there is a general need to cool server racks of ever higher power and heat density. This is especially true in HPC, where racks of 30kW or more are now appearing and where individual nodes must handle multiple 150-watt processors augmented with GPUs and coprocessors drawing 200 watts or more.
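To make those rack figures concrete, here is a minimal power-budget sketch. The per-CPU and per-accelerator wattages come from the article; the per-node overhead for memory, board and fans is an assumed illustrative figure, not a measured one.

```python
# Illustrative HPC rack power budget. CPU/GPU wattages are from the
# article; OVERHEAD_W (memory, drives, board, fans) is an assumption.

CPU_W = 150        # watts per CPU
GPU_W = 200        # watts per GPU or coprocessor
OVERHEAD_W = 300   # assumed per-node overhead

def node_power(cpus=2, gpus=2, overhead_w=OVERHEAD_W):
    """Total draw of one node in watts."""
    return cpus * CPU_W + gpus * GPU_W + overhead_w

def rack_power_kw(nodes):
    """Total rack draw in kilowatts for a given node count."""
    return nodes * node_power() / 1000.0

if __name__ == "__main__":
    print(node_power())       # 1000 W per node
    print(rack_power_kw(30))  # 30.0 kW: a 30-node rack of this config
```

Under these assumptions, only 30 such nodes already reach the 30kW-per-rack density the article describes.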

Of course, liquid is already used in “air cooled” data centers to remove the bulk heat that air-cooled racks expel into the room. Traditionally, most data centers and HPC sites bring liquid into the computer room via Computer Room Air Conditioner (CRAC) or Computer Room Air Handler (CRAH) units and use it to cool the air in the data center. CRAC units are fed liquid refrigerant, while CRAH units use chilled water as the coolant. The inefficiency is rooted in the long path heat must travel from the server into this liquid.

Beyond addressing the power and heat density barriers, liquid cooling done correctly avoids CapEx: increased rack density postpones physical expansion of the data center, and infrastructure investments such as chiller, HVAC and cooling plant build-outs are reduced. With cooling often consuming one third of data center energy, compelling OpEx benefits come from reducing both overall cooling and server power consumption, and even from enabling energy recovery.

Approaches to Liquid Cooling

Liquid cooling approaches can be divided into two groups: general-purpose and close-coupled.

General-purpose solutions move the air-cooling unit closer to standard air-cooled servers in their racks. These approaches include such things as heat transferring rear doors, sealed racks and in-row coolers.

The use of rear-door, in-row and over-row liquid coolers focuses on reducing the cost of moving air by placing the air-cooling unit closer to the servers. For example, rear-door coolers replace the rear doors of server racks with liquid-cooled heat exchangers that transfer server heat into liquid as hot air exits the servers. The servers themselves are still air-cooled, and facilities liquid must be supplied at the same temperatures CRAH units require (below 65°F). That liquid then exits the data center at below 80°F, too cool for useful energy recovery.
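The inlet and outlet temperatures above also set the coolant flow a rear-door exchanger needs, via the standard relation Q = ṁ·c·ΔT. A rough sketch, using the article's 65°F/80°F figures and assuming a 30kW rack load as the example:

```python
# Back-of-the-envelope coolant flow for a rear-door heat exchanger,
# using Q = m_dot * c_p * dT. Inlet/outlet temperatures are from the
# article; the 30 kW rack load is an assumed example.

C_P_WATER = 4186.0  # J/(kg*K), specific heat of water

def f_delta_to_kelvin(df):
    """Convert a temperature *difference* in Fahrenheit to kelvin."""
    return df * 5.0 / 9.0

def required_flow_kg_s(load_w, inlet_f, outlet_f):
    """Water mass flow needed to absorb load_w across the given rise."""
    dt_k = f_delta_to_kelvin(outlet_f - inlet_f)
    return load_w / (C_P_WATER * dt_k)

if __name__ == "__main__":
    flow = required_flow_kg_s(30_000, 65, 80)
    print(f"{flow:.2f} kg/s (~{flow * 60:.0f} L/min)")
```

The result is under 1 kg/s (roughly 50 L/min) per rack, which is why a modest chilled-water loop can absorb a 30kW rack's entire heat load.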

An issue with these general-purpose solutions is that expensive chillers are still required and server fans still consume the same amount of energy. They address only part of the problem, providing bulk-heat room neutrality while still needing cold water and continued investment in chiller infrastructure.

Closely-coupled solutions, on the other hand, bring cooling liquid directly to the high heat flux components within servers, such as CPUs, GPUs and memory, taking advantage of the superior heat transfer and capacity of liquids and reducing or even eliminating the need for expensive chiller infrastructure. There are additional power savings from reduced fan power.

This approach began in the mid-1980s, when liquid cooling was brought inside the supercomputer with systems like the Cray-2, and began a resurgence over a decade ago with systems like IBM’s Power 575.

Closely-coupled approaches of note today include such things as Direct Touch, Immersion and Direct to Chip.

“Direct Touch” replaces the air heat sinks in servers with ‘heat risers’ that transfer heat to the skin of the server chassis, where cold plates between servers pass the heat into a refrigerant so it can be removed from the building. While this eliminates server fans and the need to move air around the data center for server cooling, it still requires infrastructure to cool the refrigerant to below 61°F, and the cold plates reduce the capacity of a typical 42U rack to around 35 RU, which runs counter to HPC density trends. Partly for these reasons, Direct Touch has not gained significant traction.

[Image: Asetek’s RackCDU™ D2C™ (Direct-to-Chip). RackCDU extension installed on a 96-node rack, and a server node with CPU coolers]

Immersion cooling solutions remove server heat by placing servers in tanks of dielectric fluid or by filling custom servers with dielectric fluid. Challenges with this approach include server maintenance, modification of servers with non-standard parts, large quantities of oil-based coolant in the data center and poor space utilization, since the server “racks” lie horizontally.

Direct-to-chip liquid cooling systems such as Asetek’s RackCDU™ D2C™ (Direct-to-Chip) hot water cooling bring cooling liquid directly to the high heat flux components within servers, such as CPUs, GPUs and memory. CPUs run quite hot (153°F to 185°F), and memory and GPUs hotter still. Because water carries roughly 4,000 times more heat than air, D2C can cool with hot water. Hot water cooling allows the use of dry coolers rather than expensive chillers to cool the water returning from the servers, and removing CPU, GPU and memory heat with liquid also reduces the power required for server fans.
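The "thousands of times" figure can be sanity-checked from first principles by comparing the volumetric heat capacity (density times specific heat) of water and air. The constants below are standard approximate textbook values; the exact ratio depends on conditions, and the article's 4,000x is the same order of magnitude.

```python
# Why hot water works: volumetric heat capacity of water vs. air.
# Densities and specific heats are standard approximate values.

RHO_WATER, CP_WATER = 998.0, 4186.0  # kg/m^3, J/(kg*K)
RHO_AIR, CP_AIR = 1.2, 1005.0        # kg/m^3, J/(kg*K) at ~20 C

water_vol_cap = RHO_WATER * CP_WATER  # J per m^3 per kelvin
air_vol_cap = RHO_AIR * CP_AIR

ratio = water_vol_cap / air_vol_cap
print(f"Water carries ~{ratio:.0f}x more heat per unit volume than air")
```

A given volume of water can therefore absorb a server's heat load with a far smaller temperature rise than the same volume of air, which is what makes warm-water loops practical.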

The RackCDU D2C solution is an extension to a standard rack (RackCDU) combined with direct-to-chip server coolers (D2C) in each server. Because RackCDU has quick connects for each server, facilities teams can remove/replace servers as they do today.

Much more efficient pumps replace fan energy in the data center and server, and hot water eliminates the need to chill the coolant. D2C liquid cooling dramatically reduces chiller use, CRAH fan energy and server fan energy, delivering IT equipment energy savings of up to 10%, cooling energy savings greater than 50% and rack density increases of 2.5x to 5x versus air-cooled data centers.
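Those component figures can be rolled up into a rough facility-level estimate. Assuming cooling is one third of total data center energy (per the article) and treating the remainder as IT load, applying the claimed 10% IT and 50% cooling reductions gives:

```python
# Rough aggregate savings estimate. Assumes cooling = 1/3 of total
# data center energy (from the article) and the rest is IT load; the
# 10% IT and 50% cooling reductions are the article's claimed figures.

def total_savings_pct(cooling_frac=1/3, it_save=0.10, cool_save=0.50):
    """Percent reduction in total facility energy."""
    it_frac = 1.0 - cooling_frac
    new_total = it_frac * (1 - it_save) + cooling_frac * (1 - cool_save)
    return (1.0 - new_total) * 100.0

if __name__ == "__main__":
    print(f"{total_savings_pct():.1f}% total energy savings")
```

Under these simplifying assumptions the estimate lands in the low twenties, roughly in line with the 21% total-energy savings LBNL reported below.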

RackCDU D2C uses a distributed pumping model: a combined cold plate and pump replaces the air heat sink on each CPU or GPU in the server. Each pump/cold plate has sufficient pumping power to cool the whole server, providing redundancy. Unlike centralized pumping systems that require high pressures, the pressure needed to operate the system is very low, making it inherently more reliable.
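The redundancy claim follows from simple probability: if any single pump can cool the whole server, cooling is lost only when every pump in that server fails at once. The failure probabilities below are invented illustrative values, not vendor reliability data.

```python
# Illustrative redundancy math for distributed pumping. Per-pump
# failure probabilities here are assumed values for illustration only.

def cooling_loss_prob(p_pump_fail, n_pumps):
    """Probability that all n independent pumps fail together."""
    return p_pump_fail ** n_pumps

if __name__ == "__main__":
    # A dual-socket node with one pump per CPU cold plate
    print(cooling_loss_prob(0.01, 1))  # single pump: 0.01
    print(cooling_loss_prob(0.01, 2))  # two redundant pumps: ~1e-4
```

Even with these made-up numbers, a second pump per server cuts the chance of total cooling loss by roughly two orders of magnitude, assuming independent failures.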

In addition, RackCDU includes a software suite providing monitoring and alerts for temperatures, flow, pressure and leak detection that can report into data center management software.
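To illustrate the kind of checks such a suite performs, here is a hypothetical sketch; the field names, units and thresholds are invented for illustration and are not Asetek's actual API or limits.

```python
# Hypothetical sketch of threshold checks a RackCDU-style monitoring
# suite might run. All field names and limits are invented examples.

def check_rack(reading, max_out_temp_c=55.0, min_flow_lpm=20.0):
    """Return a list of alert strings for one telemetry reading."""
    alerts = []
    if reading["outlet_temp_c"] > max_out_temp_c:
        alerts.append("outlet temperature high")
    if reading["flow_lpm"] < min_flow_lpm:
        alerts.append("coolant flow low")
    if reading["leak_detected"]:
        alerts.append("leak detected")
    return alerts

if __name__ == "__main__":
    sample = {"outlet_temp_c": 48.2, "flow_lpm": 12.5,
              "leak_detected": False}
    print(check_rack(sample))  # ['coolant flow low']
```

In practice such alerts would be forwarded to the site's data center infrastructure management (DCIM) tooling rather than printed.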

Direct to Chip Liquid Cooling Adoption Increasing

Direct-to-chip hot water liquid cooling is showing significant momentum in usage models important to both HPC and commercial data centers:

  • Mississippi State University (MSU) installed a Cray 300LC supercomputing cluster that incorporates Asetek’s D2C. Key in the purchase decision was the ability to increase computing capacity without buying new chillers and related equipment, and to install more compute within a fixed CapEx budget. [Image: Cray’s 300LC liquid-cooled cluster supercomputer]

  • Lawrence Berkeley National Laboratory (LBNL) found that Asetek’s direct cooling technology not only delivered cooling energy savings of over 50%, but also saved 21% of total data center energy, benefiting OpEx.
  • At the University of Tromsø (UIT) in Norway, the Stallo HPC cluster is targeting 70% IT energy re-use for district heating on its campus north of the Arctic Circle.
  • Beyond HPC, highly virtualized applications are being implemented with Asetek’s D2C at the U.S. Army’s Sparkman Center Data Center. The goals of this installation include 60% cooling energy savings and 2.5x consolidation within existing infrastructure, and 40% waste-heat recovery.

Delivering on the Promise

For HPC, liquid cooling done right must address the power, cooling and density demands of the drive to greater densities and looming exascale systems. At the same time, much like in the commercial segment, it must address serviceability, monitoring and redundancy.

Asetek has paid careful attention to these factors in designing RackCDU D2C liquid cooling, and the increasing momentum of direct-to-chip adoption reflects the applicability of Asetek’s approach to the needs of HPC and data centers generally.

 

For more information on Asetek and RackCDU D2C:  http://asetek.com/data-center/data-center-coolers.aspx
