July 30, 2014

DOE to Fund Exascale Resilience Research

Tiffany Trader
As the calendar counts down to the first exascale supercomputer, efforts to resolve the steep technological challenges are increasing in number and urgency. Among the many obstacles facing extreme-scale computing platforms, resilience is one of the most significant. As systems approach billion-way parallelism, the proliferation of errors at current rates just won’t do. In recognition of the severity of this challenge, the federal government is seeking proposals for basic research that addresses the resilience challenges of extreme-scale computing platforms.

On July 28, 2014, the Office of Advanced Scientific Computing Research (ASCR) in the Department of Energy’s (DOE) Office of Science announced a funding opportunity under the banner of “Resilience for Extreme Scale Supercomputing Systems.” The program aims to spur research into fault and error mitigation so that exascale applications can run efficiently to completion, generating correct results in a timely manner.

“The next-generation of scientific discovery will be enabled by research developments that can effectively harness significant or disruptive advances in computing technology,” states the official summary. “Applications running on extreme scale computing systems will generate results with orders of magnitude higher resolution and fidelity, achieving a time-to-solution significantly shorter than possible with today’s high performance computing platforms. However, indications are that these new systems will experience hard and soft errors with increasing frequency, necessitating research to develop new approaches to resilience that enable applications to run efficiently to completion in a timely manner and achieve correct results.”

The authors of the request estimate that at least twenty percent of the computing capacity in large-scale computing systems is wasted due to failures and recoveries. As systems increase in size and complexity, even more capacity will be lost unless new targeted approaches are developed.

The DOE is specifically looking for proposals in three areas of focus:

1. Fault Detection and Categorization – faults on current supercomputing systems must be better understood in order to prevent similar behavior on future machines, according to DOE computing experts.
2. Fault Mitigation – this category breaks into two parts: the need for more efficient and effective checkpoint/restart (C/R) and the need for effective alternatives to C/R.
3. Anomaly Detection and Fault Avoidance – using machine learning strategies to anticipate faults far enough in advance to take preemptive measures, such as migrating the running application to another node.
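
To give a rough sense of why checkpoint/restart efficiency matters, the sketch below uses Young's classic first-order approximation for the checkpoint interval. It is only an illustration: the checkpoint cost and mean time between failures are hypothetical values chosen for the example, not figures from the DOE announcement, and the model ignores restart time and assumes the checkpoint cost is small relative to the time between failures.

```python
import math


def young_interval(checkpoint_cost, mtbf):
    """Young's first-order approximation of the compute time between
    checkpoints that minimizes lost time: sqrt(2 * C * M)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def wasted_fraction(interval, checkpoint_cost, mtbf):
    """First-order estimate of the fraction of machine time lost:
    time spent writing checkpoints (C / interval) plus the expected
    rework after a failure (interval / (2 * M))."""
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


# Hypothetical inputs, in minutes (assumptions for illustration only).
checkpoint_cost = 5.0   # time to write one checkpoint
mtbf = 240.0            # system-wide mean time between failures

tau = young_interval(checkpoint_cost, mtbf)
waste = wasted_fraction(tau, checkpoint_cost, mtbf)
print(f"checkpoint every {tau:.1f} min, estimated waste {waste:.1%}")
```

Under this simplified model, the waste at the best interval scales as the square root of the checkpoint cost divided by the mean time between failures, so more frequent failures push the loss up unless checkpoints become cheaper or alternatives to C/R are found.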

Approximately four to six research awards will be made over a period of three years, with award sizes ranging from $100,000 per year to $1,250,000 per year. Total funding of up to $4,000,000 annually is expected to be available, subject to congressional approval. The pre-application due date is set for August 27, 2014.