Extending the Viability of Air-cooling in High-Performance Data Centers With HPE Cray XD2000 Systems

July 3, 2023

Executive summary

For organizations operating high-performance data centers, cooling is a significant challenge. As demand for compute grows, so does the performance required to meet it. Consequently, high-end processor TDPs are climbing with each generation, and we are rapidly approaching the point where cooling these processors with traditional air cooling will become infeasible.[1] Liquid cooling is an obvious answer but brings new challenges, including cost, maintenance issues, and concerns about leakage and safety.

In this paper, we introduce the latest server solutions from Hewlett Packard Enterprise and Intel® and explain how innovative new designs can help organizations deploy dense, high-performance servers while extending the viability of air cooling. We also present a series of benchmarks conducted by HPE that explore the trade-offs between air and liquid cooling related to performance and power efficiency. With this information, data center managers can make informed decisions about how best to evolve their infrastructure based on their unique requirements.

Driven by data, compute requirements are on the rise

According to the latest IDC Global DataSphere forecast, the volume of new data created, captured, replicated, and consumed each year is expected to double between 2022 and 2026.[2] Fueled by this growth in data, new predictive and analytic techniques, and competitive imperatives such as machine learning and AI, compute requirements are growing at a similar pace.

Today, high-performance systems are critical in applications ranging from manufacturing to financial services, life sciences, and data analytics. Manufacturers rely on HPC for structural design, computational fluid dynamics (CFD), and machine condition monitoring. Life sciences firms require large amounts of computing power for genomic analysis, surveillance, computational chemistry, and image analysis. Organizations are continually looking for ways to deliver additional computing capacity to keep pace with growing demands and stay a step ahead of the competition.

Cooling — a looming data center challenge

Meeting the high levels of performance required by today’s data- and compute-intensive applications is driving increased power and cooling requirements. Power, density, cooling, and concerns about sustainability are issues for almost all data center operators. Per-socket TDPs for top-bin CPUs range from 270 watts to 350 watts or more, and next-gen CPUs are likely to be even more power-hungry, reaching 400 to 500 watts per socket.[3]

The challenges are even more significant for GPUs critical to AI and machine learning workloads, where next-gen TDPs are expected to reach 700 watts.

A key issue is that silicon designs in modern processors are increasingly going 3D, with components layered on top of one another. This presents new thermal challenges and requires that case temperatures (TCASE) be held at ever-lower levels to avoid component overheating and damage.[4] These conflicting trends are illustrated in Figure 1.

Figure 1. Air cooling is becoming unsustainable as component TDPs increase.

Compounding this challenge, customers increasingly demand high-density racks that pack more computing power into a smaller data center footprint. Today, two-thirds of US data centers already have peak power demands of up to 16 to 20 kW per rack.[5] Per-rack power consumption is rising quickly, with dense HPC racks already consuming 40 to 60 kW or more.[6]

Organizations will need to make trade-offs, either investing in new cooling technologies to accommodate next-generation processors and GPUs or settling for less powerful processors and more sparsely populated data center racks. It is no secret that liquid cooling provides better thermal transfer efficiencies than air cooling, so for many, liquid cooling is a logical path forward.

Liquid cooling

Liquid cooling spans a range of technologies, from rear-door chillers and heat exchangers to directly attached liquid cooling plates to immersion cooling. Liquid cooling can bring clear benefits:

  • Improved efficiency — In an analysis conducted by HPE, liquid cooling has been shown to reduce data center power usage effectiveness (PUE) and cooling-related power costs by up to 87%.[7]
  • Reduced environmental impact — Reducing power consumption with more efficient cooling can help organizations meet environmental, social, and governance (ESG) goals and reduce their data center’s CO2 equivalent (CO2e) footprint.
  • Defer expensive data center upgrades — In space-constrained data centers, liquid cooling can enable denser rack configurations, helping maximize available space.
  • Improve reliability and predictability — Liquid cooling can prolong component life by providing stable operating temperatures, avoiding overheating conditions, and improving overall availability.
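To make the PUE benefit concrete, the short sketch below computes PUE as total facility power divided by IT equipment power and shows how the ratio moves when cooling overhead shrinks. All load figures and the function name are hypothetical values chosen for illustration; they are not taken from the HPE analysis cited above.

```python
def pue(it_kw: float, cooling_kw: float, other_overhead_kw: float) -> float:
    """Power usage effectiveness: total facility power / IT equipment power."""
    if it_kw <= 0:
        raise ValueError("IT load must be positive")
    return (it_kw + cooling_kw + other_overhead_kw) / it_kw


# Hypothetical facility: 1,000 kW of IT load, 400 kW of cooling,
# 100 kW of other overhead (lighting, power distribution losses).
baseline = pue(1000.0, 400.0, 100.0)

# Assumed scenario: more efficient cooling halves the cooling load.
improved = pue(1000.0, 200.0, 100.0)

print(f"Baseline PUE: {baseline:.2f}")
print(f"Improved PUE: {improved:.2f}")
```

A PUE of 1.0 would mean every watt entering the facility reaches IT equipment, so the closer the ratio gets to 1.0, the less power is spent on cooling and other overhead.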

Despite these benefits, transitioning to liquid cooling is often easier said than done.

A challenging transition

Presently, air cooling is the predominant way of cooling high-performance servers. In a survey of 268 HPC sites from 252 organizations, Intersect360 Research found that 58% of respondents use air cooling exclusively.[8] The remaining 42% use liquid cooling in some systems, with the largest share using rear-door chillers. Only 23% of commercial organizations operate fully plumbed racks with facility heat exchangers. For most, plumbing is extended to only a subset of their data center racks. In other words, there is still a long way to go before most facilities can fully embrace liquid cooling.

When deciding on a liquid cooling solution, customers must consider several factors: cost, sustainability, maintenance, and ease of management.[9] Among commercial and industrial HPC users, the majority operate multiple clusters. According to the same Intersect360 Research study, 37% of organizations operate ten or more clusters, ranging from entry-level HPC systems with 16 nodes or fewer to supercomputers that consist of more than 512 nodes. Upgrading these systems to liquid cooling poses technical, logistical, and financial challenges. These include:

  • The added expense of operating two cooling systems instead of one
  • A lack of standardization in cooling systems complicating adoption in multivendor environments
  • Concerns about corrosion and safety hazards, such as risks of electrocution and arcing
  • Increased operational complexity and the risk of cooling system failures

Organizations must consider multiple factors when introducing liquid cooling, including existing data center space, rack composition, power constraints, cooling capacity, utility costs, and projected growth requirements.

Extending the life of air cooling

Fortunately, new technologies from HPE and Intel provide data center managers with the flexibility to deploy the latest server hardware in air-cooled environments. Organizations can significantly extend the viability of air cooling by taking advantage of the latest HPE Cray XD2000 Systems powered by 4th Gen Intel® Xeon® Scalable processors. By deploying these Intel-based systems, organizations can:

  • Help maximize performance while minimizing data center impact
  • Avoid costly capital upgrades to data center facilities
  • Protect existing investments in software and hardware

As illustrated in Figure 2, organizations can extend the life of air cooling without sacrificing performance, enabling them to gradually manage the transition to liquid cooling based on their own schedule.

 

Figure 2. Extending the life of air cooling with HPE Cray XD2000 Systems

HPE Cray XD2000 Systems

With the HPE Cray family, HPE and Intel bring innovation from the world’s most powerful supercomputers, making them available in commercial data center settings.[10] The HPE Cray XD2000 System is a dense, multiserver platform that packs exceptional performance and workload flexibility into a small data center space while delivering the efficiencies of a shared infrastructure.

Each HPE Cray XD2000 2U Chassis supports up to four HPE Cray XD220v 1U Servers powered by the latest 4th Gen Intel Xeon CPUs. Each server can be serviced without impacting the operation of other servers in the same chassis for maximum server availability. The HPE Cray XD2000 delivers up to 4 times the density of a traditional rackmount 2U server in standard racks and provides rear-aisle serviceability access.[11] Up to 20 HPE Cray XD2000 Chassis can be installed in either 42U or 48U HPE standard racks delivering up to 80 2P servers and 160 x 4th Gen Intel Xeon Scalable processors per data rack, subject to power and cooling considerations.
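The density figures above follow from simple multiplication, which the sketch below spells out. The chassis, server, and socket counts come from the text; the class name and the choice of 350 W per socket (the top-bin TDP mentioned in this paper) are illustrative assumptions. The resulting power figure covers CPU TDP only and excludes memory, fans, storage, networking, and power-supply losses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RackConfig:
    """Hypothetical helper describing a fully populated rack."""
    chassis_per_rack: int = 20     # from the text: up to 20 chassis per rack
    servers_per_chassis: int = 4   # four 1U-class servers per 2U chassis
    sockets_per_server: int = 2    # 2P servers
    chassis_height_u: int = 2
    cpu_tdp_watts: float = 350.0   # assumption: top-bin 350 W parts throughout

    @property
    def servers(self) -> int:
        return self.chassis_per_rack * self.servers_per_chassis

    @property
    def sockets(self) -> int:
        return self.servers * self.sockets_per_server

    @property
    def occupied_u(self) -> int:
        return self.chassis_per_rack * self.chassis_height_u

    @property
    def cpu_tdp_kw(self) -> float:
        # CPU TDP only; real rack power draw will be higher.
        return self.sockets * self.cpu_tdp_watts / 1000.0


rack = RackConfig()
print(rack.servers, "servers")             # 80 servers
print(rack.sockets, "processors")          # 160 processors
print(rack.occupied_u, "U occupied")       # 40 U occupied
print(rack.cpu_tdp_kw, "kW of CPU TDP")    # 56.0 kW of CPU TDP
```

Under these assumptions the CPUs alone account for 56 kW in a full rack, which is why the text qualifies the maximum configuration as subject to power and cooling considerations.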

These systems offer a complete, scalable solution for customers requiring high-performance solutions. They feature flexible power and cooling options, including air cooling and direct liquid cooling (DLC), delivering superior performance while reducing TCO.

Figure 3. Density-optimized HPE Cray XD2000 Chassis supporting up to 4x HPE Cray XD220v 1U servers

 

Figure 4. HPE data center rack with optional direct liquid cooling (DLC)

Innovative engineering enables air cooling up and down the stack

Thanks to the design of the 1U HPE Cray XD220v Server residing in the HPE Cray XD2000 Chassis, customers can benefit from the latest high-performance Intel Xeon Scalable processors in air-cooled environments. Customers can deploy fully populated HPE Cray XD2000 Racks using the latest processor technology without worrying about liquid cooling.

What makes this possible is the unique design of the Intel-powered HPE Cray XD220v Server illustrated in Figure 5. The HPE Cray XD220v is 20% wider than the previous generation HPE ProLiant XD200n designed for the HPE Apollo 2000 Chassis. However, this updated design still fits in industry-standard racks.

Figure 5. The HPE Cray XD220v supports efficient air-cooling of the latest Intel Xeon processors.

With larger heatsinks and additional cooling fans, this redesigned server allows the full range of 4th Gen Intel Xeon Scalable processors, from 12 to 56 cores per socket, to be efficiently air-cooled, including the most powerful 350 watt 56-core Intel® Xeon® Platinum 8480+ and Intel Xeon 9480 Max Series processors.

The redesigned system features special baffling to optimize airflow and 16 fans (40 mm each) per HPE Cray XD2000 Chassis for reliable cooling of dense server configurations in the most demanding HPC environments. Better still, these servers are designed to support air-cooling of future Intel Xeon processors. This translates into exceptional investment protection and flexibility. Organizations can deploy HPE Cray XD2000 Systems with air-cooling today and easily add liquid cooling in the future.

This advantage is unique to the Intel-powered HPE Cray XD220v Server. Servers powered by competing processor technologies with similar TDPs require liquid cooling, adding cost and complexity to server deployments.

Direct liquid cooling (DLC)

While Intel-powered HPE Cray XD2000 Systems are at home in air-cooled environments, they also offer plug-and-play support for DLC for customers with suitably equipped data centers. Engineered and supported by HPE, DLC offers clear advantages compared to third-party or immersion-based cooling solutions. These include:

  • Efficient thermal transfer for more efficient cooling
  • No need for expensive hazardous chemicals or specialized fluids
  • Server equipment remains easily accessible for serviceability
  • HPE server racks connect directly to facility water supplies without secondary plumbing

By taking advantage of DLC, organizations can substantially reduce data center PUEs and cooling-related costs and improve energy efficiency. Options are available for CPU only or CPU plus memory cooling.

Measuring the impact of air vs. liquid cooling

In March 2023, HPE ran a series of six internal benchmarks to evaluate the latest Intel Xeon processors’ performance and power efficiency in air- and liquid-cooled environments. The tests involved an HPE Cray XD2000 Chassis fully populated with 4 x HPE Cray XD220v compute nodes, each with two Intel Xeon 8480+ processors. The benchmarks included SPEC CPU 2017 (SPECrate 2017_int_base and SPECrate 2017_fp_base), three separate SPEChpc™ 2021 benchmarks, and a High-Performance Linpack (HPL) benchmark.

For each benchmark, results were obtained in both air- and liquid-cooled configurations. Details were tabulated, including performance, power consumed by the HPE Cray XD2000 Chassis while the benchmarks ran, and performance per kW. The results of these tests are summarized in Table 2.

As shown, the latest top-bin Xeon processors in air-cooled configurations delivered performance on par with that achievable with liquid cooling. The difference between the air- and liquid-cooled results was less than 3% for all six benchmarks.[12]

Liquid-cooled configurations delivered slightly better performance because the higher temperatures in the air-cooled configurations led to higher leakage current in silicon. This resulted in a higher power draw, leaving less power available for boosting clock frequencies within the processor’s fixed TDP budget.[13]

Table 2. Comparing performance and power requirements in air- and liquid-cooled HPE Cray XD2000 Systems

The average impact of air vs. liquid cooling across all six benchmarks in Table 2 is shown in Figure 6. On average, the liquid-cooled configurations delivered 1.8% better performance and consumed 14.6% less power. The liquid-cooled HPE Cray XD2000 delivered a 19.2% boost in power efficiency measured in terms of throughput per kW.
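The three averages quoted above are related: performance per kW improves both because throughput rises and because power falls. The sketch below, using a hypothetical helper name, checks that the reported performance and power deltas are consistent with the reported efficiency gain. It works from the averages only; the per-benchmark values in Table 2 are not reproduced here, and an average of per-benchmark ratios need not match a ratio of averages exactly.

```python
def efficiency_gain(perf_gain: float, power_reduction: float) -> float:
    """Relative change in throughput per kW.

    perf_gain:        fractional performance improvement (0.018 = 1.8%)
    power_reduction:  fractional power reduction (0.146 = 14.6%)
    """
    if not 0 <= power_reduction < 1:
        raise ValueError("power_reduction must be in [0, 1)")
    return (1 + perf_gain) / (1 - power_reduction) - 1


# Averages reported in the text for liquid vs. air cooling.
gain = efficiency_gain(perf_gain=0.018, power_reduction=0.146)
print(f"Throughput per kW improvement: {gain:.1%}")  # 19.2%
```

Most of the efficiency benefit comes from the lower power draw rather than from the small performance difference: setting perf_gain to zero in the same formula still yields an improvement of roughly 17%.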

Figure 6. HPE Cray XD2000 System powered by Intel® Xeon® 8480+ processors — the average impact of air vs. liquid cooling on performance, power consumption, and power efficiency across six standard benchmarks

The results in Figure 6 show that while liquid cooling delivers superior power efficiency, air-cooled servers running the latest Intel Xeon 8480+ processors deliver excellent performance. Air-cooled HPE Cray XD2000 Systems are an excellent solution for customers that are either unable or not yet ready to make the transition to liquid cooling.

Help maximize performance, flexibility, and value

With power requirements for high-end CPUs rising, many organizations are considering liquid cooling to increase density and improve cooling efficiency. However, this transition can be expensive and disruptive, and not all organizations are ready to take this step.

Fortunately, the latest HPE Cray XD2000 Systems powered by 4th Gen Intel Xeon Scalable processors give customers the flexibility to navigate this transition at their own pace. With Intel-powered HPE Cray XD2000 Systems, customers can:

  • Avoid or delay expensive data center upgrades and refits by extending the viability of air cooling
  • Experience up to twice the throughput of previous-generation servers[14]
  • Deploy dense, energy-efficient servers to help maximize data center space
  • Protect existing investments in software and tools
  • Gradually adopt energy-efficient direct liquid cooling at their own pace

Learn more at

HPE.com/servers/CrayXD2000


[1] Thermal Design Power (TDP) is defined as the theoretical maximum amount of heat generated by a CPU or GPU, usually expressed in watts, that a computer’s cooling system must be designed to dissipate.
[2] “Worldwide IDC Global DataSphere Forecast, 2022–2026: Enterprise Organizations Driving Most of the Data Growth,” IDC, 2022
[3] HPE internal estimates.
[4] TCASE refers to the temperature at the interface between a CPU package and its heatsink.
[5] How Power Density is Changing in Data Centers and What It Means for Liquid Cooling, JETCOOL Technologies Inc., March 2022
[6] HPE internal estimates.
[7] See the HPE whitepaper Addressing sustainability in the financial services industry. PUE refers to power usage effectiveness, the ratio of total data center facility power to the power delivered to IT equipment. It indicates how much power goes to servers vs. ancillary requirements such as lighting and air conditioning.
[8] “HPC Technology Survey 2021: Server Technologies and Configurations,” Intersect360 Research, August 2021
[9] 2022 Data Center Trends: Liquid Cooling Adoption Survey
[10] Aurora Supercomputer and Argonne National Laboratory | HPE
[11] HPE Cray XD2000 QuickSpecs
[12] On average, the liquid-cooled configurations ran 1.8% faster across the six benchmarks. The largest impact was seen with the estimated SPECrate 2017_int_base result, where the liquid-cooled configuration ran 2.9% faster.
[13] Intel Turbo boost algorithm takes operating temperature into consideration, explaining the slight difference in performance.
[14] 2P Intel Xeon Platinum 8480+ (112C) scored 932 SPECrate 2017_fp_base – http://spec.org/cpu2017/results/res2023q1/cpu2017-20221204-32903.html. 2P Intel Xeon Platinum 8380 (80C) scored 467 SPECrate 2017_fp_base – http://spec.org/cpu2017/results/res2021q2/cpu2017-20210524-26430.html. 932/467 represents an ~2x performance improvement. SPEC, SPEC CPU, SPECfp, and SPECrate are trademarks of the Standard Performance Evaluation Corporation. All rights reserved. All stated results are as of April 15, 2023. See spec.org for more information.
