Datacenter power management is a ubiquitous challenge, and in few places more so than at Kyoto University's Academic Center for Computing and Media Studies (ACCMS), where power consumption limits were imposed following the devastating 2011 Tohoku earthquake and tsunami. HPC resource expansions in 2012 and 2016, part of a regular four-year upgrade cycle, prompted Kyoto to introduce more granular controls, in this instance Intel’s Data Center Manager (DCM), which the university now says has enabled it to achieve significant power savings.
“[T]here was a movement to cap HPC power consumption across Japan in the wake of the 2011 Tohoku earthquake,” says professor Hiroshi Nakashima, a researcher at ACCMS who, among other things, studies power consumption management strategies. Power consumption is currently managed at the facility level, which means the maximum amount of power that can be supplied is determined by the facility as a whole, not just by its HPC resources.
After nuclear power plants throughout Japan stopped operating as a result of the earthquake, power prices in the Kansai region rose roughly 50 percent during peak times. Because usage charges for the HPC system also include electricity fees, there was demand for improved power efficiency to reduce costs to users. DCM was deployed as part of the 2016 system refresh.
“Although we had been measuring power performance at the level of individual racks and nodes in previous environments, we decided to introduce Intel DCM to monitor individual servers in more detail,” says supercomputing section leader Junichi Hikita.
As of 2018, the ACCMS HPC environment consists of three systems: two cluster systems built from HPC servers with Xeon processors, and one MPP system built from HPC servers with Xeon Phi processors. Together they provide an overall computational performance of 6.5524 petaflops.
“At present, 40 percent of the use of the ACCMS HPC system occurs within Kyoto University, with the remaining 60 percent occurring externally. Although the number of users is growing year upon year, our policy is to maintain an operating ratio of approximately 70 percent to allow us room to cope,” says Nakashima.
Says Keiichiro Fukazawa, an associate professor at ACCMS, “We were able to confirm there were variations in power consumption caused by individual differences in CPUs with the same specifications, and that there were some with high power efficiency and some with low power efficiency. With the low-performance CPUs, the hotter they become, the more power they consume. Improving power efficiency requires correct monitoring, and we expected that allocating jobs from nodes with high power performance would reduce power consumption.”
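To make Fukazawa's point concrete, the sketch below shows one way per-node monitoring data could be turned into a power-efficiency ranking that a scheduler can prefer. It is illustrative only: the node names, sample values, and the work-per-watt metric are assumptions for this sketch, not details of the ACCMS deployment or of the DCM API.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class NodeSample:
    """One monitoring sample: work completed (e.g. GFLOP) and average power (W)."""
    work_gflop: float
    avg_power_w: float

def efficiency_gflop_per_watt(samples: list[NodeSample]) -> float:
    """Mean energy efficiency across samples (higher is better)."""
    return mean(s.work_gflop / s.avg_power_w for s in samples)

# Illustrative per-node telemetry; the values are made up, not ACCMS measurements.
telemetry = {
    "node01": [NodeSample(900.0, 310.0), NodeSample(905.0, 308.0)],
    "node02": [NodeSample(900.0, 295.0), NodeSample(902.0, 297.0)],
    "node03": [NodeSample(898.0, 325.0), NodeSample(901.0, 327.0)],
}

# Rank nodes so that jobs can be allocated from the most power-efficient ones first.
ranked = sorted(telemetry, key=lambda n: efficiency_gflop_per_watt(telemetry[n]), reverse=True)
print(ranked)  # e.g. ['node02', 'node01', 'node03']
```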
In practice, job schedules were set based on the monitored values. Comparing cases where jobs were allocated starting from the nodes with the highest power efficiency against cases where jobs were allocated randomly confirmed a 2-4 percent reduction in power consumption, even at a 70 percent node usage rate. Compared with cases in which jobs were allocated starting from the nodes with the worst power efficiency, the reduction was 5-8 percent at the same usage rate.
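The comparison described above can be sketched as follows: fill nodes up to a 70 percent usage rate starting from the most efficient nodes, from randomly chosen nodes, or from the least efficient nodes, and compare the estimated draw. Again, the per-node power figures and the simple additive power model are hypothetical stand-ins, not the actual scheduler or measurements used at ACCMS.

```python
import random

def allocate(nodes_by_preference: list[str], usage_rate: float) -> list[str]:
    """Pick enough nodes to reach the target usage rate, in preference order."""
    n_needed = round(len(nodes_by_preference) * usage_rate)
    return nodes_by_preference[:n_needed]

def estimated_power(chosen: list[str], node_power_w: dict[str, float]) -> float:
    """Sum of measured per-node power for the chosen nodes."""
    return sum(node_power_w[n] for n in chosen)

# Illustrative per-node power under load (W); not ACCMS data.
node_power_w = {f"node{i:02d}": 290.0 + 3.0 * i for i in range(1, 21)}

efficient_first = sorted(node_power_w, key=node_power_w.get)            # best efficiency first
worst_first = sorted(node_power_w, key=node_power_w.get, reverse=True)  # worst efficiency first
random_order = list(node_power_w)
random.shuffle(random_order)

usage = 0.70  # the 70 percent node usage rate mentioned above
for label, order in [("efficient-first", efficient_first),
                     ("random", random_order),
                     ("worst-first", worst_first)]:
    chosen = allocate(order, usage)
    print(label, round(estimated_power(chosen, node_power_w), 1), "W")
```

In this toy model the gap between the strategies depends entirely on how much the nodes differ in power draw; the 2-4 and 5-8 percent figures above are ACCMS's own measured results, not outputs of this sketch.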
“A 2-4 percent decrease in power consumption may seem small, but it is a huge result for ACCMS, where the yearly electricity fee reaches about 150 million yen,” says Nakashima.
Link to Intel case history: https://www.intel.com/content/www/us/en/software/reducing-power-consumption-hpc-environments.html