Dec. 9, 2021 — Two different studies produced by the Analytics and AI Methods at Scale (AAIMS) group, which resides within the National Center for Computational Sciences at the US Department of Energy’s (DOE’s) Oak Ridge National Laboratory, won best paper awards at both the Bench’21 and SC21 conferences—all in the same week.
On November 16, “Comparative Evaluation of Deep Learning Workloads for Leadership-class Systems” won the BenchCouncil Best Paper Award at the BenchCouncil International Symposium on Benchmarking, Measuring and Optimizing. Coauthored by Junqi Yin, Aristeidis Tsaris, Sajal Dash, Ross Miller, Feiyi Wang, and Mallikarjun (Arjun) Shankar, the paper was based on an AAIMS project that compared machine learning/deep learning software stacks on two different GPU accelerator architectures. The team studied the Tensor Core–equipped NVIDIA V100 GPUs used in the Oak Ridge Leadership Computing Facility’s (OLCF’s) Summit supercomputer and the AMD Instinct MI100 accelerator, a precursor to the GPU used in the OLCF’s upcoming Frontier exascale system.
“We take a layered perspective on deep learning benchmarking and point to opportunities for future optimizations in the technologies that we consider,” said Feiyi Wang, AAIMS group leader and one of the authors. “It is essential to gain a holistic understanding from compute kernels, models, and frameworks of popular deep learning stacks and to assess their impact on science-driven, mission-critical applications.”
On November 18, “Revealing Power, Energy, and Thermal Dynamics of a 200PF Pre-Exascale Supercomputer” won the prestigious SC21 Best Paper Award at the International Conference for High Performance Computing, Networking, Storage, and Analysis. Coauthored by Woong Shin, Vladyslav Oles, Ahmad Maroof Karimi, J. Austin Ellis, and Wang, the paper is based on a study of high-performance computing (HPC) power use on an unprecedented scale.
The AAIMS team examined a full year’s worth (2020) of operational data from Summit, which is currently the nation’s most powerful supercomputer. The team monitored Summit’s 4,626 nodes across more than 100 metrics at a sampling rate of 1 Hz, producing a high-resolution dataset of 8.5 terabytes (compressed) for study. Insights from these efforts eventually led to this award-winning paper and to ideas for improving the efficiency and reliability of HPC data centers.
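To give a rough sense of that scale, the figures quoted above can be multiplied out. The sketch below is a back-of-the-envelope estimate only, not the team's own accounting. It assumes exactly 100 metrics per node (the article says "over 100"), uninterrupted collection for all 366 days of 2020, and a decimal terabyte (10^12 bytes). All names in it are illustrative.

```python
# Rough lower-bound estimate of how many samples a year of Summit telemetry holds.
# Assumptions (not from the paper): exactly 100 metrics per node, no gaps in
# collection, and 1 TB = 10**12 bytes.

NODES = 4_626
METRICS_PER_NODE = 100      # article says "over 100", so this is a lower bound
SAMPLE_RATE_HZ = 1
SECONDS_PER_DAY = 86_400
DAYS_IN_2020 = 366          # 2020 was a leap year
COMPRESSED_BYTES = 8.5e12   # 8.5 TB, assuming decimal terabytes


def samples_per_year(nodes=NODES, metrics=METRICS_PER_NODE,
                     rate_hz=SAMPLE_RATE_HZ, days=DAYS_IN_2020):
    """Return the number of individual metric readings collected in a year."""
    return nodes * metrics * rate_hz * SECONDS_PER_DAY * days


total_samples = samples_per_year()
bytes_per_sample = COMPRESSED_BYTES / total_samples

print(f"samples per year: {total_samples:,}")
print(f"compressed bytes per sample: {bytes_per_sample:.2f}")
```

Under these assumptions the dataset holds on the order of ten trillion readings, which is why the compressed size is quoted in terabytes.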
“Data-driven operational intelligence is one of the main focuses of the AAIMS group. Instead of hand-waving or back-of-the-envelope calculation, we leverage data to help decision makers choose informed options,” Wang said. “We think the recognition by SC21 and Bench’21 highlights the importance of this area. It is also a testament to the quality of our work—not just the AAIMS power team per se, but many unsung heroes in the OLCF program.”
AAIMS’s power analytics project is unique in the amount and resolution of the operational data it collected and analyzed as well as in the overall scope of its effort. The team examined Summit’s entire system from end to end, rather than just the machine itself, by including its central energy plant in the analysis. To put Summit’s dataset into context, the team gathered the system’s job allocation history for the same year and constructed per-job, fine-grained power consumption profiles for over 840,000 jobs.
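The idea of a per-job power profile can be illustrated with a minimal sketch: take each job's node list and time window from the allocation history, then sum the per-node power readings that fall inside that window. This is only an illustration of the general approach, not the AAIMS team's pipeline. The `Job` record, the `node_power` layout, and both function names are hypothetical, and the readings are made-up values.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """Hypothetical job allocation record."""
    job_id: str
    nodes: tuple   # names of the nodes allocated to the job
    start: int     # first second of the job (inclusive)
    end: int       # second at which the job ended (exclusive)


def job_power_profile(job, node_power):
    """Sum power across a job's nodes for every second in [start, end).

    node_power maps node name -> {timestamp_in_seconds: watts}.
    Returns a list of (timestamp, total_watts) sorted by timestamp.
    """
    totals = defaultdict(float)
    for node in job.nodes:
        for t, watts in node_power.get(node, {}).items():
            if job.start <= t < job.end:
                totals[t] += watts
    return sorted(totals.items())


def job_energy_joules(profile, dt_seconds=1.0):
    """Approximate energy as the sum of power readings times the sample interval."""
    return sum(watts for _, watts in profile) * dt_seconds


# Made-up 1 Hz readings for two nodes.
node_power = {
    "n1": {0: 300.0, 1: 310.0, 2: 320.0},
    "n2": {0: 500.0, 1: 505.0, 2: 510.0},
}
job = Job(job_id="job-a", nodes=("n1", "n2"), start=1, end=3)

profile = job_power_profile(job, node_power)
energy = job_energy_joules(profile)
print(profile, energy)
```

At 1 Hz sampling, summing the readings gives energy in joules directly, since each reading covers one second.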
With the imminent arrival of the OLCF’s Frontier exascale system, the AAIMS group has begun a new power analytics project that will take their Summit dashboard one step further. The Frontier project will include research on how machine learning can be used to understand Frontier’s usage and power requirements and on how to apply that understanding to the system in practice. Meanwhile, the team is also working on a Smart Facility for Science project, with the ultimate goal of providing ongoing production insight into HPC systems and AI-driven suggestions to inform system operators.
“We are moving forward with R&D plans for a Smart Facility for Science—power and energy profiling is one of the aspects, and we are taking compute efficiency, reliability, and I/O under consideration to establish an end-to-end holistic understanding of system,” Wang said. “We want to realize continuous integration, continuous delivery, and continuous learning to best support our science mission.”
The OLCF is a DOE Office of Science User Facility located at ORNL.
UT-Battelle LLC manages Oak Ridge National Laboratory for DOE’s Office of Science, the single largest supporter of basic research in the physical sciences in the United States. DOE’s Office of Science is working to address some of the most pressing challenges of our time. For more information, visit https://energy.gov/science.
Source: Coury Turczyn, OLCF