RENCI Collaboration to Leverage AI and ML for DOE Workflows

Sept. 10, 2021 — The Department of Energy (DOE) advanced Computational and Data Infrastructures (CDIs) – such as supercomputers, edge systems at experimental facilities, massive data storage, and high-speed networks – are brought to bear to solve the nation’s most pressing scientific problems, including assisting in astrophysics research, delivering new materials, designing new drugs, creating more efficient engines and turbines, and making more accurate and timely weather forecasts and climate change predictions.

Increasingly, computational science campaigns are leveraging distributed, heterogeneous scientific infrastructures that span multiple locations connected by high-performance networks, resulting in scientific data being pulled from instruments to computing, storage, and visualization facilities.

However, since these federated services infrastructures tend to be complex and managed by different organizations, domains, and communities, both the operators of the infrastructures and the scientists that use them have limited global visibility, which results in an incomplete understanding of the behavior of the entire set of resources that science workflows span.

“Although scientific workflow systems like Pegasus increase scientists’ productivity to a great extent by managing and orchestrating computational campaigns, the intricate nature of the CDIs, including resource heterogeneity and the deployment of complex system software stacks, pose several challenges in predicting the behavior of the science workflows and in steering them past system and application anomalies,” said Ewa Deelman, research professor of computer science and research director at the University of Southern California’s Information Sciences Institute and lead principal investigator (PI). “Our new project, Poseidon, will provide an integrated platform consisting of algorithms, methods, tools, and services that will help DOE facility operators and scientists to address these challenges and improve the overall end-to-end science workflow.”

Under a new DOE grant, Poseidon aims to advance the knowledge of how simulation and machine learning (ML) methodologies can be harnessed and amplified to improve the DOE’s computational and data science.

Research institutions collaborating on Poseidon include the University of Southern California, the Argonne National Laboratory, the Lawrence Berkeley National Laboratory, and the Renaissance Computing Institute (RENCI) at the University of North Carolina at Chapel Hill.

Poseidon will add three important capabilities to current scientific workflow systems — (1) predicting the performance of complex workflows; (2) detecting and classifying infrastructure and workflow anomalies and “explaining” the sources of these anomalies; and (3) suggesting performance optimizations. To accomplish these tasks, Poseidon will explore the use of novel simulation, ML, and hybrid methods to predict, understand, and optimize the behavior of complex DOE science workflows on DOE CDIs.

Poseidon will explore hybrid solutions where data collected from DOE and NSF testbeds, as well as from an ML simulator, will be strategically inputted into an ML training system.

“In addition to creating a more efficient timeline for researchers, we would like to provide CDI operators with the tools to detect, pinpoint, and efficiently address anomalies as they occur in the complex DOE facilities landscape,” said Anirban Mandal, Poseidon co-PI, assistant director for network research and infrastructure at RENCI, University of North Carolina at Chapel Hill. “To detect anomalies, Poseidon will explore real-time ML models that sense and classify anomalies by leveraging underlying spatial and temporal correlations and expert knowledge, combine heterogeneous information sources, and generate real-time predictions.”

RENCI will play a pivotal role in the Poseidon project. RENCI researchers Cong Wang and Komal Thareja will lead project efforts in data acquisition from the DOE CDI and NSF testbeds (FABRIC and Chameleon Cloud) and emulation of distributed facility models, enabling ML model training and validation on the testbeds and DOE CDI. Additionally, Poseidon co-PI Anirban Mandal will lead the project portion on performance guidance for optimizing workflows.

Successful Poseidon solutions will be incorporated into a prototype system with a dashboard that will be used for evaluation by DOE scientists and CDI operators. Poseidon will enable scientists working on the frontier of DOE science to efficiently and reliably run complex workflows on a broad spectrum of DOE resources and accelerate time to discovery.

Furthermore, Poseidon will develop ML methods that can self-learn corrective behaviors and optimize workflow performance, with a focus on explainability in its optimization methods.

Working together, the researchers behind Poseidon will break down the barriers between complex CDIs, accelerate the scientific discovery timeline, and transform the way that computational and data science are done.

Please visit the project website for more information.

Source: RENCI