Heat and high processor memory bandwidth are intertwined factors that must be considered when procuring a cluster that can realize the full potential of the latest generation of high memory bandwidth processors. System designers must address this dilemma as CPU vendors now compete on memory bandwidth to achieve leadership application performance. High memory bandwidth is an extraordinary boon for users because it means higher application performance, so long as the more efficient use of the vector floating-point units it enables does not cause the processor to overheat and reduce performance.
Thermal issues affect everyone
Everyone is affected by the heat versus memory bandwidth dilemma, as even small-scale workloads (by current standards) can experience downclocking. This means that every HPC and enterprise deployment is caught in this dilemma, regardless of whether the system is a small organizational cluster, a large commercial enterprise datacenter, a dedicated AI workhorse cluster, or an academic group or campus-wide datacenter.
HPE notes, “A system’s ‘high performance’ claims may look impressive on paper. However, their real-world performance results can lag very far behind. For instance, as HPC clusters tune groups of cores to their own unique frequencies, temperature, and power regulation; competing groups can overthrow the system’s actual performance.” [i]
AI and tightly coupled HPC applications running at scale are particularly susceptible to performance degradation from heat-related issues such as thermal downclocking. Basically, tightly coupled applications, including those that use reduction-type operations (essential to AI training and common in most HPC applications), become rate limited by the slowest node(s) when processors run at different rates.
Good system design and system management are key to eliminating heat-related issues, which will otherwise affect the performance of every application on the system. Eliminating them lets applications run faster by fully exploiting the greater parallelism, floating-point capability, and memory bandwidth of the latest modern processors.
Preserving high-bandwidth CPU performance
To understand the impact of heat on high memory-bandwidth system performance, we look at the Magma installation at bellwether LLNL (Lawrence Livermore National Laboratory). Magma employs liquid-cooled Intel Xeon Platinum 9200 SKUs. We focus on these SKUs because they currently provide the highest number of memory channels per CPU and deliver very high floating-point performance through dual per-core AVX-512 vector units, which generate significant heat when fully utilized. Thus, they provide a glimpse into our high-bandwidth CPU future.
High bandwidth processors are the future
Higher memory bandwidth is critical to performance for many HPC and AI workloads; processor cores that are starved for data simply don’t deliver performance.
Figure 1 (below) shows that the new 12-channel 9200 processors provide better performance when compared against Intel processors with six memory channels per socket.
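To see why the extra channels matter, it helps to compare a socket’s peak floating-point rate with its peak memory bandwidth. The short Python sketch below does this back-of-the-envelope calculation; the clock speed and DDR4-2933 transfer rate are illustrative assumptions, while the core, vector-unit, and channel counts follow the 9242 configuration discussed later in this article.

```python
# Back-of-the-envelope "machine balance" sketch. Any kernel whose arithmetic
# intensity (flops per byte moved) falls below this balance is limited by
# memory bandwidth, not by the floating-point units.
# NOTE: the clock speed and DDR4-2933 rate are illustrative assumptions.

cores         = 48        # cores per socket
ghz           = 2.3       # assumed sustained clock, GHz
fma_units     = 2         # dual AVX-512 FMA units per core
flops_per_fma = 16        # 8 doubles x 2 flops (multiply + add)

peak_gflops = cores * ghz * fma_units * flops_per_fma     # GFLOP/s per socket

channels       = 12                    # memory channels per socket
gb_per_channel = 2933e6 * 8 / 1e9      # assumed DDR4-2933, 8 bytes/transfer

peak_gbs = channels * gb_per_channel                       # GB/s per socket

print(f"peak ~{peak_gflops:,.0f} GFLOP/s, ~{peak_gbs:,.0f} GB/s, "
      f"machine balance ~{peak_gflops / peak_gbs:.1f} flops/byte")
```

Bandwidth-bound kernels such as a STREAM triad perform well under one flop per byte, far below that balance, which is why doubling the channel count from six to twelve pays off almost directly in application performance.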
Guideline: Air-cooling vs. Liquid-cooling
While increased memory bandwidth translates into faster application performance, it also creates a dilemma for system designers: the heat generated when running all the cores and dual floating-point units of a high core-count processor at full speed can cause the chip to slow down (downclock) to stay within its thermal design limits.
Look closely at the TDP (Thermal Design Power) ratings to understand when it becomes necessary to consider liquid cooling. As a guideline: the more cores, the higher the TDP and the greater the importance of the cooling solution. Also, consider that most compute nodes are dual-socket, so these TDP numbers must be doubled for all 2S computational nodes.
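As a hypothetical worked example (the wattage and rack density below are placeholder assumptions, not vendor specifications), the doubling compounds quickly at the node and rack level:

```python
# Hypothetical thermal budget sketch; the TDP and rack density are placeholder
# assumptions, not vendor specifications.

tdp_per_socket_w = 350     # assumed TDP of a high core-count SKU, watts
sockets_per_node = 2       # dual-socket (2S) compute node
nodes_per_rack   = 40      # assumed rack density

cpu_w_per_node  = tdp_per_socket_w * sockets_per_node
cpu_kw_per_rack = cpu_w_per_node * nodes_per_rack / 1000

print(f"{cpu_w_per_node} W of CPU TDP per 2S node, "
      f"~{cpu_kw_per_rack:.0f} kW per rack before memory, NICs, or fans")
```

Numbers in this range are typically what move the design discussion from air toward direct liquid cooling.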
Air cooling is fine for many HPC and data-intensive HPDA workloads that perform many floating-point operations, so long as there is sufficient airflow to keep the processor(s) cool.
Liquid-cooling solves many thermal issues
In contrast, look to liquid cooling when running highly parallel, floating-point-intensive vector codes that are cache intensive. DGEMM (double-precision general matrix multiplication) operations are the textbook example because such dense matrix operations can scale to all the processor cores on a chip and keep all the floating-point units active.
As always, look to your workloads. If they reflect LINPACK benchmark behavior, then liquid cooling is the best way to keep all parts of the chip within thermal limits to achieve full performance. Otherwise, the processor may have to downclock to stay within its thermal envelope, thus decreasing performance.
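A rough way to observe this behavior on your own hardware is sketched below: NumPy’s matrix multiply typically dispatches to a multi-threaded BLAS (MKL or OpenBLAS on most installations), which keeps every core’s vector floating-point units busy in DGEMM. Watch the reported clock frequency or package temperature while it runs; this is an illustrative stress loop, not a benchmark.

```python
# Illustrative DGEMM stress loop (not a benchmark). NumPy's matmul usually
# calls a multi-threaded BLAS, keeping all cores' vector units busy.
# Needs roughly 1.5 GB of memory at n = 8192.
import time
import numpy as np

n = 8192
a = np.random.rand(n, n)
b = np.random.rand(n, n)

for i in range(10):
    t0 = time.perf_counter()
    c = a @ b                               # double-precision GEMM
    dt = time.perf_counter() - t0
    gflops = 2.0 * n**3 / dt / 1e9          # ~2*n^3 flops per multiply
    print(f"iteration {i}: ~{gflops:,.0f} GFLOP/s")
```

If the sustained GFLOP/s falls from one iteration to the next, the processor is likely downclocking to stay within its thermal envelope.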
Don’t forget to consider the impact of thermal issues when running at scale!
In particular, look at the proximity of vector-intensive dense matrix operations relative to tightly coupled distributed operations such as a reduction operator. Hot nodes in air-cooled systems are known to slow tightly coupled computations significantly, by a factor of two or more. [ii] Essentially, the distributed computation becomes rate limited by the slowest node(s). The impact of hot nodes can be observed at scale, even when running small-scale jobs that use only a few hundred nodes. [iii] Liquid cooling eliminates the problem of hot nodes.
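A minimal timing model makes the rate-limiting effect concrete. In the sketch below (all timings hypothetical), each allreduce-style step completes only when the slowest participant arrives, so a single hot node running at half speed roughly doubles the runtime of the whole job.

```python
# Toy timing model of a tightly coupled (allreduce-style) computation:
# every step completes only when the slowest node finishes, so one
# downclocked node gates the entire job. All numbers are hypothetical.

nodes = 256
steps = 1000
step_time_s = 0.010                    # per-step time on a healthy node

def job_time(slowdowns):
    """slowdowns[i] = 1.0 for a healthy node, 2.0 for one at half speed."""
    return steps * max(step_time_s * s for s in slowdowns)

healthy = [1.0] * nodes
one_hot = [1.0] * (nodes - 1) + [2.0]  # a single thermally throttled node

print(f"all nodes healthy: {job_time(healthy):5.1f} s")
print(f"one hot node     : {job_time(one_hot):5.1f} s")
```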
LLNL Magma system
Funded through NNSA’s Advanced Simulation & Computing (ASC) program, the Magma supercomputer is a liquid-cooled supercomputer designed to support mission simulations critical to ensuring the safety, security and reliability of the nation’s nuclear weapons in the absence of underground testing. As of November 2019, Magma is ranked as the 69th fastest system in the world according to the Top500 list.
Magma consists of 760 compute nodes, each configured with dual 48-core Xeon Platinum 9242 processors (12 memory channels per socket) for a total of 72,960 cores. Its total memory capacity is 293 terabytes, with a total memory bandwidth of 430 terabytes per second. The cluster utilizes Penguin’s Relion XE2142eAP compute servers connected by an Intel Omni-Path interconnect. The system is supported by CoolIT Systems’ complete direct liquid cooling solution. [iv]
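A quick sanity check shows how those published figures fit together. The sketch below uses only the numbers quoted above plus an assumed DDR4-2933 transfer rate (the memory speed is not stated in the system description).

```python
# Sanity check of the Magma figures quoted above. The DDR4-2933 transfer rate
# is an assumption; the node, socket, core, and channel counts come from the
# published system description.

nodes               = 760
sockets_per_node    = 2
cores_per_socket    = 48
channels_per_socket = 12
bytes_per_transfer  = 8
transfers_per_sec   = 2933e6          # assumed DDR4-2933

total_cores = nodes * sockets_per_node * cores_per_socket
node_bw_gbs = (sockets_per_node * channels_per_socket
               * bytes_per_transfer * transfers_per_sec / 1e9)
system_bw_tbs = node_bw_gbs * nodes / 1000

print(f"{total_cores:,} cores, ~{node_bw_gbs:.0f} GB/s per node, "
      f"~{system_bw_tbs:.0f} TB/s system-wide")
```

The result, roughly 563 GB/s per node and 428 TB/s system-wide, lines up with the quoted 430 TB/s total.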
The physical reality of floating-point arithmetic
It’s unavoidable: floating-point arithmetic operations generate heat. This is exacerbated by wider vector units (meaning more operations can be performed per second) and by the multiple vector units that now co-exist within each modern CPU core.
Much software analysis has been performed to reduce the impact of downclocking when running floating-point-intensive codes, [v] but the easiest solution is to exploit the greater thermal conductivity of liquid to remove heat.
Of course, cost, complexity, and the practicality of installing plumbing in the datacenter may become issues when considering liquid cooling. However, full-service liquid cooling providers, along with broad support from a multitude of OEM and hyperscaler partners, make it easier to implement standardized liquid cooling solutions.
You can see the dilemma: more data lets more of the vector units in a processor stay busy, which in turn generates more heat. Liquid has better thermal conductivity than air, so if your system workload tends to be dominated by floating-point calculations (easily determined by application profiling), then liquid cooling might be required.
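One way to do that profiling on a Linux system is to count retired floating-point instructions with perf. The sketch below simply wraps `perf stat`; the event names are assumptions that apply to recent Intel server parts and vary by CPU generation, so check `perf list` on your own machine first.

```python
# Sketch: run an application under `perf stat` and count retired floating-point
# instructions to gauge how FP-dominated the workload is. The event names are
# assumptions for recent Intel server CPUs; verify them with `perf list`.
import subprocess
import sys

FP_EVENTS = ",".join([
    "fp_arith_inst_retired.scalar_double",
    "fp_arith_inst_retired.256b_packed_double",
    "fp_arith_inst_retired.512b_packed_double",
    "instructions",
])

# Usage: python fp_profile.py ./my_app arg1 arg2 ...
subprocess.run(["perf", "stat", "-e", FP_EVENTS, "--", *sys.argv[1:]], check=True)
```

A large share of 512-bit packed events relative to total instructions is a strong hint that the workload exercises the vector units (and the thermal envelope) the way DGEMM does.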
The key takeaway is that higher memory bandwidth processors are a very good thing. Don’t starve your computing hardware for data. However, higher work efficiency in the processor does create a cooling dilemma.
Check your workloads to see if air cooling is still an option, or if your users would be better served by a liquid-cooled solution. The liquid-cooled system might prove to take less room, operate more efficiently, and deliver higher performance, both on each node and when running tightly coupled applications at scale.
Rob Farber is a global technology consultant and author with an extensive background in HPC, AI, and teaching. Rob can be reached at [email protected]