As high-performance computing systems grow ever more complex, concerns over their massive energy consumption and multiplying points of failure grow with them. Now, the AI Ops collaboration between Hewlett Packard Enterprise (HPE) and the U.S. Department of Energy’s National Renewable Energy Laboratory (NREL) aims to leverage artificial intelligence (AI) and machine learning (ML) to address these pain points for the exascale era.
The collaboration brings together NREL’s ongoing mission to advance sustainable solutions and HPE’s research and development efforts under PathForward, a Department of Energy program intended to prepare the U.S. for exascale computing.
“We are passionate about architecting new technologies that are impactful to powering the next era of innovation with exascale computing and its extent of operational needs,” said Mike Vildibill, vice president of the Advanced Technologies Group at HPE. “We believe our journey to develop and test AI Ops with NREL, one of our longstanding and innovative partners, will allow the industry to build and maintain smarter and more efficient supercomputing data centers as they continue to scale power and performance.”
The three-year collaboration, which will bring monitoring and predictive analytics to power and cooling systems in NREL’s Energy Systems Integration Facility (ESIF) data center, leverages more than five years of historical operational data – over 16 terabytes – from real-world systems. The data was collected from NREL’s Peregrine and Eagle supercomputers, which were fitted with sensors so that the information they gathered could eventually be used to improve system efficiency – specifically, power, water and carbon efficiency.
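For context on what “power, water and carbon efficiency” means in data center terms, the short sketch below computes two standard facility metrics – power usage effectiveness (PUE) and water usage effectiveness (WUE) – from a hypothetical telemetry sample; the field names and numbers are illustrative, not NREL’s actual schema:

```python
# Illustrative only: field names and readings are hypothetical, not NREL's schema.
from dataclasses import dataclass

@dataclass
class FacilitySample:
    total_facility_kwh: float  # total energy drawn by the facility over an interval
    it_equipment_kwh: float    # energy consumed by IT equipment alone
    water_liters: float        # cooling water consumed over the same interval

def pue(sample: FacilitySample) -> float:
    """Power usage effectiveness: total facility energy / IT energy (ideal = 1.0)."""
    return sample.total_facility_kwh / sample.it_equipment_kwh

def wue(sample: FacilitySample) -> float:
    """Water usage effectiveness: liters of water per kWh of IT energy."""
    return sample.water_liters / sample.it_equipment_kwh

sample = FacilitySample(total_facility_kwh=1060.0, it_equipment_kwh=1000.0, water_liters=900.0)
print(f"PUE: {pue(sample):.2f}")        # 1.06
print(f"WUE: {wue(sample):.2f} L/kWh")  # 0.90
```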
To achieve these goals, HPE wrote, it is leveraging open source tools such as TensorFlow, NumPy and scikit-learn for ML development and focusing on the following key areas (a brief illustrative sketch follows the list):
Monitoring: Vast volumes of IT and facility telemetry from disparate sources will be collected, processed and analyzed before algorithms are applied to the data in real time.
Analytics: Big data analytics and machine learning will be used to analyze data from various tools and devices spanning the data center facility.
Control: Algorithms will be applied to enable machines to solve issues autonomously, as well as to intelligently automate repetitive tasks and perform predictive maintenance on both the IT systems and the data center facility.
Data center operations: AI Ops will evolve to become a validation tool for continuous integration (CI) and continuous deployment (CD) for core IT functions that span the modern data center facility.
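HPE has not published implementation details, but as a rough, non-authoritative sketch of how the monitoring, analytics and control stages could chain together on a tabular telemetry feed, the snippet below uses NumPy and scikit-learn (two of the tools HPE names); all sensor names, values and thresholds are hypothetical:

```python
# Minimal sketch of telemetry anomaly detection on a tabular sensor feed.
# Feature names and values are hypothetical; this is not HPE's or NREL's pipeline.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(seed=0)

# Monitoring: simulate historical telemetry rows of
# [inlet_temp_C, power_kw, coolant_flow_lpm] under normal operation.
normal = rng.normal(loc=[24.0, 450.0, 120.0], scale=[1.0, 20.0, 5.0], size=(5000, 3))

# Analytics: fit an unsupervised anomaly detector on the historical data.
detector = IsolationForest(contamination=0.01, random_state=0).fit(normal)

# Control: score incoming readings and flag candidates for intervention.
incoming = np.array([
    [24.2, 455.0, 121.0],   # looks normal
    [31.5, 560.0, 60.0],    # overheating plus low coolant flow -> anomalous
])
flags = detector.predict(incoming)  # +1 = normal, -1 = anomaly
for row, flag in zip(incoming, flags):
    status = "ANOMALY - alert operators" if flag == -1 else "ok"
    print(f"reading {row} -> {status}")
```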
The researchers have already trained AI models on historical sensor data, and HPE says initial results showed that the models were able to predict or identify events that had occurred in the ESIF data center. Now, HPE has set its sights on enhancing the HPE High Performance Cluster Management (HPCM) system (allowing it to quickly provision, manage and monitor clusters of up to 100,000 nodes) and possibly integrating HPE InfoSight, a cloud- and AI-based management tool that can predict and prevent disruptions.
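HPE has not disclosed its model architectures, so the following is only a plausible sketch of the event-prediction idea: a small recurrent classifier (built with TensorFlow, one of the tools cited above) trained on synthetic windows of sensor readings to flag whether a fault-like drift is underway. The window length, features and labels are illustrative assumptions, not details from the collaboration:

```python
# Hedged sketch: training a model on historical sensor windows to predict
# whether an "event" (e.g., a cooling fault) follows. Data here is synthetic;
# the window length, features, and labels are illustrative assumptions.
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(seed=1)

WINDOW, FEATURES = 60, 3  # 60 timesteps of [temp, power, flow] per sample

# Synthetic training set: half normal windows, half windows drifting hot.
normal = rng.normal(0.0, 1.0, size=(500, WINDOW, FEATURES))
drift = rng.normal(0.0, 1.0, size=(500, WINDOW, FEATURES))
drift[:, :, 0] += np.linspace(0.0, 3.0, WINDOW)  # temperature ramps upward
x = np.concatenate([normal, drift])
y = np.concatenate([np.zeros(500), np.ones(500)])  # 1 = event ahead

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(WINDOW, FEATURES)),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x, y, epochs=3, batch_size=32, verbose=0)

# Score a fresh window; a high probability would trigger predictive maintenance.
prob = float(model.predict(normal[:1], verbose=0)[0, 0])
print(f"event probability: {prob:.2f}")
```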
“Our research collaboration will span the areas of data management, data analytics, and AI/ML optimization for both manual and autonomous intervention in data center operations,” said Kristin Munch, manager for the Data, Analysis and Visualization Group at NREL. “We’re excited to join HPE in this multi-year, multistaged effort – and we hope to eventually build capabilities for an advanced smart facility after demonstrating these techniques in our existing data center.”