As seen at ISC17, and as will be seen at SC17, the application of HPC in finance, logistics, manufacturing, big science and oil & gas is continuing to expand into areas of traditional enterprise computing, often tied to the exploitation of Big Data. It is clear that all of these segments are using (or planning to use) machine learning and AI, resulting in architectures that are very HPC-like.
The physical implementation of these systems requires a greater focus on heat capture and rejection due to the wattage trends in CPUs, GPUs and emerging neural chips required to meet accelerating computational demands in HPC-style clusters. The resulting heat and its impact on node, rack and cluster heat density can be seen with Intel’s Knights Landing and Knights Mill, Nvidia’s P100 and the Platinum versions of the latest Intel Skylake processors.
Wattages are now high enough that cooling nodes containing these highest-performance chips leaves little choice other than liquid cooling if reasonable rack densities are to be maintained. In sustained compute sessions, there must be no throttling or down-clocking of the compute resources. If the heat is not addressed at the node level with liquid cooling, floor space build-outs or data center expansions become necessary. Even more importantly, reducing node and rack densities can drive an increase in interconnect distances between all types of cluster nodes.
These developments are a direct result of a wattage inflection and not simply an extension of trends seen previously. Depending on the approach taken, machine learning and AI exacerbate this trend. Heat and wattage issues seen with GPUs during the training or learning phase of an AI application (especially if used in a deep learning/neural network approach) are now well known. In some cases, these issues continue into application rollout if GPUs are applied to that as well.
Even if the architecture uses quasi-GPUs like Knights Mill in the training phase (via “basic” machine learning or deep learning, followed by a handoff to scale-out CPUs like Skylake for actual usage), the issues of wattage, density and cooling remain. And it isn’t getting any better.
With distributed cooling’s ability to address site needs in a variety of heat rejection scenarios, it can be argued that the compute wattage inflection point is a major driver in the accelerating global adoption of Asetek liquid cooling at HPC sites and by the OEMs that serve them. And as will be shown at SC17, quite a few of the nodes that OEMs are showing with liquid cooling are targeted at machine learning.
Given the variety of clusters (especially with the entrance of AI), the adaptability of the cooling approach becomes quite important. Asetek’s distributed pumping architecture is based on low-pressure, redundant pumps and closed-loop liquid cooling within each server node. This allows for a high level of flexibility in heat capture and heat rejection.
Asetek ServerLSL™ is a server-level liquid assisted air cooling (LAAC) solution. It can be used as a transitional stage in the introduction of liquid cooling or as a tool to enable the immediate incorporation of the highest performance computing nodes into the data center. ServerLSL allows the site to leverage existing HVAC, CRAC and CRAH units with no changes to data center cooling. ServerLSL replaces less efficient air coolers in the servers with redundant coolers (cold plates/pumps) and exhausts 100% of the captured heat as hot air into the data center via heat exchangers (HEXs) in each server. This enables high wattage server nodes to have 1U form factors and maintain high cluster rack densities. At a site level, the heat is handled by existing CRACs and chillers with no changes to the infrastructure. With ServerLSL, liquid cooled nodes can be mixed in racks with traditional air-cooled nodes.
While ServerLSL isolates the system within each server, Asetek RackCDU systems are rack-level focused, enabling a much greater impact on the cooling costs of the data center overall. RackCDU systems leverage the same pumps and coolers used with ServerLSL nodes. RackCDU is in use by all of the current sites in the TOP500 using Asetek liquid cooling.
Asetek RackCDU provides the answer both at the node level and for the facility overall. As with ServerLSL, RackCDU D2C (Direct-to-Chip) utilizes redundant pumps/cold plates atop server CPUs & GPUs (and optionally other high wattage components like memory). But the collected heat is moved via a sealed liquid path to heat exchangers in the RackCDU for transfer into facilities water. RackCDU D2C captures between 60% and 80% of server heat into liquid, reducing data center cooling costs by over 50% and allowing 2.5x-5x increases in data center server density.
The remaining heat in the data center air is removed by existing HVAC systems in this hybrid liquid/air approach. When there is unused cooling capacity available, data centers may choose to cool facilities water coming from the RackCDU with existing CRAC and cooling towers.
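As a rough illustration of the hybrid liquid/air split described above, the following Python sketch divides a rack's heat load between the liquid loop and the room air, using the 60% and 80% capture figures quoted for RackCDU D2C. The function name, the 500 W node power and the 40 nodes per rack are hypothetical values chosen for illustration, not Asetek specifications.

```python
def split_rack_heat(node_watts, nodes_per_rack, liquid_capture_fraction):
    """Return (liquid_kw, air_kw): heat carried off by liquid vs. left in room air."""
    if not 0.0 <= liquid_capture_fraction <= 1.0:
        raise ValueError("capture fraction must be between 0 and 1")
    # Assumes all electrical power drawn by the nodes ends up as heat.
    total_kw = node_watts * nodes_per_rack / 1000.0
    liquid_kw = total_kw * liquid_capture_fraction
    return liquid_kw, total_kw - liquid_kw


# Hypothetical rack (illustrative numbers only): 40 nodes at 500 W each.
NODE_WATTS = 500
NODES_PER_RACK = 40

# 0.60 and 0.80 are the lower and upper capture figures quoted in the text.
for fraction in (0.60, 0.80):
    liquid_kw, air_kw = split_rack_heat(NODE_WATTS, NODES_PER_RACK, fraction)
    print(f"capture {fraction:.0%}: {liquid_kw:.1f} kW to liquid, "
          f"{air_kw:.1f} kW left for existing HVAC")
```

Under these assumptions, the air-side figure is the load that existing CRAC or CRAH units would still need to handle for that rack; a real sizing exercise would use measured node power rather than nameplate values.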
The high level of flexibility in addressing cooling at the server, rack, cluster and site levels provided by Asetek distributed pumping is lacking in approaches that utilize centralized pumping. Asetek’s approach continues to deliver flexibility in the areas of heat capture, coolant distribution and heat rejection.
At SC17, Asetek will also have on display a new cooling technology in which servers share a rack-mounted HEX. Servers that use this shared HEX approach can continue to be used if the site later moves to RackCDU.
To learn more about Asetek liquid cooling, stop by booth 1625 at SC17 in Denver.
Appointments for in-depth discussions about Asetek’s data center liquid cooling solutions at SC17 may be scheduled by sending an email to [email protected].