That’s Hot: Cooling Supercomputing Power

By Nicole Hemsoth

September 9, 2005

The good news is that HPC users have more compute power than ever. The bad news: they are stuffing all of that processing power into data centers that were not designed to handle asymmetrical thermal loads and the extreme density of today's systems. The thermal problem in the data center is analogous to trying to keep people cool and comfortable in a hall with 150 people per seat.

Recognizing that data center environmental conditions limit server performance, HP Labs has partnered with HP product divisions to assemble the “HP Cool Team” — a group of engineers with expertise in heat transfer, fluid mechanics, thermo-mechanical physical design and system product design. HPCwire recently talked with Scott McClellan, CTO of HP's High Performance Computing Division, and Chandrakant D. Patel, an HP Distinguished Technologist and leader of the thermal mechanical team at HP Labs, to find out how the company is helping customers manage high performance heat.

HPCwire: According to a recent AFCOM (an association for data center professionals) survey, approximately 60 percent of data center professionals believe new equipment is being acquired without adequate concern for power or cooling requirements. What's HP doing to solve this?

McClellan: The industry, as a whole, has been focused on increasing density. At the same time, little attention is paid to the other side of the problem: “How are we going to deploy reliable systems in a room stuffed full of equipment?” For example, a data center with 1,000 racks, over 30,000 square feet, requires over 10 MW of power for the computing infrastructure. That's a lot of heat.

HP complements our portfolio of HPC solutions with services to help customers deal with deployment challenges. HP has a ton of experience in the thermal area. From this experience, we have learned that asymmetrical loading frequently creates “hot spots” that can easily lead to reliability problems.

By applying what we call “Smart Cooling” technology to HP Labs' data centers, we've been able to save more than 25 percent in cooling costs. This technology has been successfully used by a number of customers, such as DreamWorks, to achieve maximum power density while maintaining system reliability. For example, in working with DreamWorks, HP Labs provided scalable, off-site rendering capacity for DreamWorks' animated production, Shrek 2. The HP Utility Rendering Service – operating in a 1,000-processor data center that researchers built at HP Labs' Palo Alto, Calif., headquarters – uses HP's Smart Cooling solutions to provide the maximum compute capability in the smallest, most cost-efficient footprint possible.

HPCwire: What exactly are “Smart Cooling” products or services? What do you deliver to the customer?

Patel: HP takes a holistic view when developing cooling solutions — from the chip core inside a server to the cooling tower on the rooftop. The thermo-mechanical research team at HP Labs has pioneered CFD modeling techniques with the Smart Cooling service. The intent of this service is to provide proper and efficient thermal management and to reduce the power required by cooling resources by up to 25 percent. An example of such a service delivery to a customer is an analysis to “statically” optimize the layout of a data center prior to the installation of a high-power, high-density cluster. The analysis suggests vent layout, rack layout, etc., to assure achievement of thermal management goals — a given temperature at the inlet of all the systems in the data center. The asymmetrical loading from the high-density cluster could otherwise lead to reliability issues. Additional Smart Cooling services include site preparation and verification, assessment, refresh, design and relocation; equipment layout and installation; site selection; and ongoing maintenance.

Going forward, our goal is to help customers combine efficient thermal management with efficient cost management. The emergence of the compute utility and the growth in data center-based compute services are forcing an examination of the costs associated with housing and powering the compute, networking, and storage equipment. The cost model must take into account the complexity of power delivery, cooling, and the levels of redundancy required for a given service level agreement. The cost of maintenance and amortization of power delivery and cooling equipment must be included. Furthermore, the advent of slim servers, such as blades, has led to immense physical compaction. While compaction enables consolidation of multiple data centers into one, the resulting high thermal power density must be addressed from a power delivery and cooling point of view.

We also have developed a simple cost model that examines the total cost of ownership by highlighting the factors that drive costs and examining the “burdened” cost of delivering power to the hardware in the data center. In terms of delivering value-added services to customers, the CFD-based modeling service helps the customer optimally provision resources and assists in maximizing data center utilization – one of the key factors in the cost model. In addition, we are developing future products and services based on real-time sensing and control that will enable savings of up to 50 percent in the recurring cost of power.
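As a rough illustration of what a “burdened” power cost looks like, the sketch below rolls electricity, a cooling overhead, and the amortization and maintenance of the power and cooling plant into one monthly figure per kW of IT load. The function, rates and overheads are assumptions made for the example, not figures from HP's cost model.

```python
# Illustrative sketch only -- a simplified "burdened power" view of data
# center cost, not HP's actual cost model. All names and numbers below are
# assumptions chosen for the example.

def burdened_power_cost(it_load_kw,
                        electricity_cost_per_kwh=0.10,    # assumed utility rate
                        cooling_overhead=0.8,             # assumed kW of cooling per kW of IT load
                        amortized_capex_per_kw_month=20,  # assumed power/cooling plant amortization
                        maintenance_per_kw_month=5,       # assumed upkeep of that plant
                        hours_per_month=730):
    """Rough monthly cost of delivering power to IT hardware, burdened with
    the cooling, amortization, and maintenance that ride along with it."""
    energy_kwh = it_load_kw * (1 + cooling_overhead) * hours_per_month
    energy_cost = energy_kwh * electricity_cost_per_kwh
    infrastructure_cost = it_load_kw * (amortized_capex_per_kw_month
                                        + maintenance_per_kw_month)
    return energy_cost + infrastructure_cost

# A 150 kW high-density cluster under these assumptions:
print(round(burdened_power_cost(150)))   # roughly $23,000 per month
```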

HPCwire: How do you see power issues impacting the deployment of HPC solutions?

McClellan: The last few years have brought us tremendous increases in the physical density of compute solutions. As a result, we now have much more processing capability per square foot of data center floor space. Most data centers are designed to last 10 to 15 years but, in practice, are used much longer than that. As a figure of merit, these same data centers are frequently provisioned with sufficient air conditioning to cool a maximum of 100 watts to 150 watts per square foot.

Is that sufficient cooling for modern HPC clusters? Consider this rule of thumb: If your data center is 50 percent filled with 10 KW racks (typical for today's servers), and the rest is filled with 3 KW racks (typical for storage), you would be at approximately 150 watts per square foot. Theoretically, you have enough air conditioning capacity, but do you really? The answer is probably not — unless the flow of air within your data center is well optimized.
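The arithmetic behind that rule of thumb is easy to check. Assuming roughly 43 square feet of floor space per rack once aisles and clearances are counted (an assumed figure for illustration, not one McClellan cites), a 50/50 mix of 10 KW and 3 KW racks lands right around 150 watts per square foot:

```python
# Rule-of-thumb check: mixed rack loads expressed as watts per square foot.
# The ~43 sq ft of floor per rack (rack footprint plus aisle and clearance
# space) is an assumed figure for illustration.

server_rack_kw = 10       # typical server rack load from the interview
storage_rack_kw = 3       # typical storage rack load from the interview
floor_sqft_per_rack = 43  # assumed

avg_rack_watts = 1000 * (0.5 * server_rack_kw + 0.5 * storage_rack_kw)
watts_per_sqft = avg_rack_watts / floor_sqft_per_rack
print(f"{watts_per_sqft:.0f} W/sq ft")   # ~151 W/sq ft
```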

Our blades offer reduced power consumption and lower power distribution costs, saving customers more than $6,000 per rack of 32 servers. New technologies, like the 68-watt AMD Opteron processors, Intel Xeon processors, and HP power management tools that monitor and dynamically adjust power consumption, can significantly reduce the power required and the heat generated. These technologies let customers manage power more efficiently, based on processor utilization and application performance demands.

Patel: I would add that the emergence of the global compute utility and the ensuing high power needs of the utility will require energy-aware, smart design approaches at all levels — chips, systems and data centers. Imagine a hundred high performance computing clusters worldwide made up of 15 racks each — 150 KW per instantiation. Such a worldwide set of HPC installations requires 15 MW of power for the compute hardware and an additional 15 MW for the cooling resources. The total power draw of 30 MW, without any dynamic control of power and cooling, is roughly the power required for 30,000 homes in the United States. Therefore, specifically with regard to the power dissipated by electronic equipment and the power used to remove that heat, we must develop techniques that dynamically scale power use based on need, with complementary scaling of the power used by the cooling resources.
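The aggregate numbers follow directly if the cooling plant is assumed to draw roughly one watt for every watt of compute and an average U.S. home to draw about 1 kW; both are simplifying assumptions for the comparison.

```python
# Aggregate power of a worldwide set of HPC installations, as described above.
# The 1:1 cooling-to-compute ratio and ~1 kW average per U.S. home are
# simplifying assumptions used only for the comparison.

clusters = 100
racks_per_cluster = 15
kw_per_rack = 10                     # 15 racks x 10 kW = 150 kW per site

compute_kw = clusters * racks_per_cluster * kw_per_rack  # 15,000 kW = 15 MW
cooling_kw = compute_kw                                  # assumed 1:1 overhead
total_mw = (compute_kw + cooling_kw) / 1000
homes_equivalent = (compute_kw + cooling_kw) / 1.0       # ~1 kW per home
print(total_mw, int(homes_equivalent))                   # 30.0 MW, 30000 homes
```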

The dynamic techniques we are developing will cut power consumption in half. This will also help customers meet the guidelines and regulations that are emerging. For example, Japanese government regulations call for a 10 percent reduction in energy consumption by 2008. That regulation obviously impacts data centers.

HPCwire: Give me an example of an HPC data center that HP has redesigned to be energy efficient. What did you do? What were the results?

McClellan: In one detailed case study, the customer's data center was designed to cool 600 KW max, and the total loading was only 100 KW. All of the racks were logically laid out in hot-cold aisles. We found that deployment of an additional high-density instance of 15 racks totaling 150 KW resulted in air flow problems so severe that there would have been reliability issues, even though the total heat load of 250 KW was well below the air conditioning capacity.

The customer required the installation to be available immediately. The modeling team worked with the front-end services team to collect room data and enable proper provisioning of all the air conditioning resources, using rack layout and vent tile placement as the only degrees of freedom. Even with this extreme design constraint, HP was able to recommend a solution for the customer.

In another instance, the customer wanted to see the impact of a high-density cluster installation in a given location as a “what if” scenario. In most of these instances, the primary issue is high local power density that leaves the air conditioning resources poorly provisioned. For example, ACs are typically distributed evenly in the data center. However, an HPC cluster creates an asymmetrical heat load to the point where a single AC gets overloaded and cannot extract the heat. In these instances, our modeling team uses the degrees of freedom available to balance out the heat load extraction throughout the data center.

This enables all the ACs to be well utilized, and utilized in an efficient range of operation. Furthermore, when one AC is overburdened, the other ACs usually overcompensate, which unnecessarily wastes power. So, in some cases, we are able to advise shutting down an AC that we find is not playing any role in heat extraction. Analyzing these “what if” scenarios enables the customer to properly provision cooling resources and results in energy savings and better reliability.

HPCwire: How is HP unique in the market?

Patel: Early on, we recognized that the “data center is the computer” and set about building a portfolio of products and services that cuts power consumption in half. We developed an architecture for dynamic control of power and cooling resources and built a production “smart” data center at HP Laboratories in Palo Alto with distributed sensing and policy-based control. We have demonstrated a 50 percent reduction in the energy consumed by the cooling resources in this data center. The dynamic “smart” cooling technology complements the static thermal modeling work mentioned earlier.

Indeed, this design of an intelligent, energy-efficient data center that dynamically allocates cooling resources where and when they are needed is unique in the industry. Conventional data center cooling systems lack fine-grained sensing and control and are provisioned for peak utilization; that is, they operate as if each rack were generating the maximum amount of heat even when that is not the case. Consequently, conventional cooling systems often incur higher operating expenses than are necessary to cool the heat-generating components in the racks.
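By way of contrast with peak-provisioned cooling, here is a minimal sketch of what sensor-driven, policy-based control of a computer room air conditioner (CRAC) might look like in principle. It illustrates the idea of trimming cooling output toward the actual rack inlet temperatures rather than the worst case; it is not HP's control algorithm, and the target, gain and limits are assumptions.

```python
# Minimal illustration of policy-based cooling control: each CRAC unit is
# nudged toward a rack-inlet temperature target instead of running flat out
# for the worst case. Conceptual sketch only, not HP's algorithm.

TARGET_INLET_C = 25.0   # assumed policy: keep rack inlets at or below 25 C
GAIN = 0.1              # proportional gain (assumed)

def adjust_crac(crac_output_fraction, inlet_temps_c):
    """Return a new CRAC output fraction (0..1) from the hottest inlet it serves."""
    hottest = max(inlet_temps_c)
    error = hottest - TARGET_INLET_C           # positive -> too warm, add cooling
    new_output = crac_output_fraction + GAIN * error
    return min(1.0, max(0.2, new_output))      # clamp; keep a minimum airflow

# One control step for a CRAC running at 60% with mostly cool racks:
print(adjust_crac(0.6, [22.1, 23.4, 24.0, 22.8]))   # backs off to 0.5
```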

HPCwire: What type of research are you doing?

Patel: As you can tell, our mantra is holistic “smart” design and control from the chip core to the cooling tower. And you have heard about the “smart” data center products and services. Going forward, we will drive our dynamic provisioning of power and cooling all the way down to the chip level. In a recent paper at Semitherm 2005, and in a keynote at the Temperature Aware Computing Workshop held as part of the ISCA (International Symposium on Computer Architecture) conference, we outlined the concept of a global control system that provisions power and cooling resources based on need. Of course, such a global control system needs an evaluation and control engine that can work from chips to data centers.

Together with our research partners at U.C. Berkeley, we are proposing an evaluation and control engine based on the second law of thermodynamics and the use of a metric — such as MIPS per unit of exergy destroyed — as a performance criterion. The use of exergy destruction, or destruction of available energy, rests on the second law of thermodynamics, which effectively states that while energy cannot be destroyed, the quality of energy — its potential to do work — can be destroyed by “thermodynamic irreversibilities.” Thus, an exergy destruction modeling technique based on the second law pinpoints power inefficiencies in data centers.

This second-law, or exergy, modeling, as reported by the U.C. Berkeley and HP Labs team at various public forums, is a solid metric for optimizing a data center. Indeed, it is a metric that can be used to optimally provision all power and cooling resources from chip to cooling tower: running the cooling tower, the ACs and their compressors, the computer system fans and the inkjet spray mechanism, and setting power levels on computers accordingly. We like the quantification of exergy destruction because it scales from microns to meters and can be applied to our vision of “energy aware chips, systems and data centers.” And it promotes sustainability.
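The second-law bookkeeping can be sketched with the standard expression for the exergy carried by a heat flow, Q(1 - T0/T): heat rejected near ambient temperature carries little work potential, while the same heat at a hot chip surface represents far more destroyed available energy. The sketch below, including a MIPS-per-exergy figure of merit, only illustrates the form of such a metric; it is not the U.C. Berkeley/HP Labs model, and all numbers are assumed.

```python
# Exergy (available energy) of a heat flow Q at absolute temperature T,
# relative to an ambient (dead-state) temperature T0: X = Q * (1 - T0 / T).
# Illustrative values only; not the published U.C. Berkeley / HP Labs model.

T0 = 298.0   # ambient temperature, K (25 C)

def exergy_of_heat(q_watts, t_kelvin):
    """Work potential (W) carried by heat q_watts rejected at t_kelvin."""
    return q_watts * (1.0 - T0 / t_kelvin)

chip_heat_w = 100.0
chip_temp_k = 358.0   # ~85 C chip temperature (assumed)
destroyed_w = exergy_of_heat(chip_heat_w, chip_temp_k)  # ~16.8 W of lost work potential

mips = 4000.0         # assumed delivered performance
print(f"{mips / destroyed_w:.0f} MIPS per watt of exergy destroyed")
```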

HPCwire: We'd heard HP was experimenting with inkjet-assisted spray to cool data center equipment. What's the status?

Patel: Our research in this promising area continues. We are using HP's classic inkjet technology to attack the problem of heat generation in powerful microprocessors. The inkjet head's ability to target spray cooling allows it to cool chips even when temperature and heat flux levels vary across surfaces. The spray cooling mechanism shoots a measured amount of liquid coolant onto specific areas of a chip, according to its heat level. The device controls the distribution, flow-rate and velocity of the liquid in much the same way that inkjet printers control the placement of ink on a printed page.

The liquid vaporizes on impact, cooling the chip, and the vapor is then passed through a heat exchanger and pumped back into a reservoir that feeds the spray device. HP's spray cooling technology avoids the “pooling” effect of other phase-change liquid cooling methods, in which residual liquid left on the chip forms an insulating vapor bubble, causing chips to overheat and malfunction. Above all, the precise “on demand” dispensation of cooling resources fits our chip to cooling tower vision noted earlier.
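A rough feel for the “measured amount” of coolant comes from an energy balance: if every droplet evaporates, the required flow for a zone is that zone's heat load divided by the coolant's latent heat of vaporization. The sketch below uses an assumed, water-like latent heat purely for illustration; a real dielectric coolant's latent heat is far lower, so actual flows would be higher.

```python
# Energy-balance sketch for spray cooling: assuming complete evaporation,
# coolant mass flow per zone = zone heat load / latent heat of vaporization.
# Fluid properties and zone loads are assumed values for illustration.

LATENT_HEAT_J_PER_G = 2257.0   # water-like coolant (assumed for illustration)

def spray_rate_g_per_s(zone_heat_w):
    """Minimum coolant flow (g/s) to carry away zone_heat_w by evaporation."""
    return zone_heat_w / LATENT_HEAT_J_PER_G

# Hotter zones of the die get proportionally more droplets:
for zone, heat_w in {"core": 60.0, "cache": 25.0, "io": 10.0}.items():
    print(zone, round(spray_rate_g_per_s(heat_w), 4), "g/s")
```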

HPCwire: What is the status of your robot that “rolls around a data center floor” looking for hot spots? Is this real?

Patel: The robot is real at the HP Labs Smart Data Center. It is also a means to an end: it was motivated by the need to develop a distributed sensing platform, the foundation of our dynamic control system. The sensing system is akin to having a home with thousands of sensors to control the air conditioning. When we found that the cost of distributed temperature sensing in data centers was very high — for the hundreds or even thousands of points needed to create a 3D temperature map — we came up with the robot. We have two of these robots at two data centers. We have added features to the robot, and we use it today in the HP Labs Smart Data Center in conjunction with a low-cost static sensor network we have developed. The robot can be cost-effectively used in existing data centers for temperature audits.

HPCwire: Any final thoughts on energy management for HPC customers?

McClellan: Think holistic. The data center is a complex beast. In terms of energy and cost management, customers need to be concerned about the efficiency of their high performance computing equipment as well as the efficiency of the three fundamental data center subsystems. Think about the three “P”s — Power, Ping and Pipe — aka the power delivery system, which includes conditioning and backup equipment; the networking equipment, which includes all connectivity except the rack switches; and the cooling infrastructure, which includes both the central chillers and the computer room air conditioners.

Projecting power use only in terms of “watts per square foot” can be misleading. For today's high-density data centers, current design approaches need to be augmented with CFD simulations that analyze the heat load distribution and the provisioning of the cooling infrastructure. CFD analyses provide HPC customers with the visual proof they need to create policies and procedures designed to ensure reliable, cost-efficient data center operations.
