A National Science Foundation-funded project headed by Purdue University Professor Saurabh Bagchi is using Purdue’s new Rice cluster, Blue Waters at the National Center for Supercomputing Applications (NCSA), and other supercomputers in research to find ways to make such high-end systems more reliable.
An article detailing the project is posted at Purdue, and a brief excerpt appears below. Given the growing use of supercomputing in both research and industry, it’s hoped the project will improve ease of use, identify common code issues, and develop approaches for solving them. Excerpt:
“…The project is building a repository of usage and failure data from supercomputers, analysis of which can be used to help researchers run their code more efficiently and reliably and get results faster. Purdue research computing staff already has tapped some of the findings to assist Purdue’s cluster users.
“Data is king in this,” says Bagchi, a professor in Purdue’s School of Electrical and Computer Engineering. “I like to build solutions with some idea of what the real problem is and it turns out that finding failure data on real computer systems is very, very difficult. This project steps toward remedying that situation.”
“…Bagchi’s research focuses on software systems to make heterogeneous distributed computing systems like high-performance computing clusters more reliable and secure. He started the usage and failure data project with a pilot collecting and analyzing data from Purdue’s Conte cluster, deployed in 2014. The pilot’s success prompted the NSF to expand the project. It now includes Conte, Purdue’s Rice cluster and other clusters at Purdue, along with Blue Waters.”
Here is a link to the full article: http://www.itap.purdue.edu/newsroom/news/150813_communityclusters_usefailresearch.html