Controlling Soft Errors at Scale
Advancing HPC toward the exascale era raises many important issues, but about five sticking points really stand out. Controlling soft errors is one of them.
As the number of cores per machine increases, intermittent incorrect hardware behaviors, known as soft errors, begin to threaten the validity of simulations. Given that exascale machines will employ billion-way parallelism, the need to address this problem is clear.
A team of scientists from PNNL performed experiments revealing the high risk of soft errors on large-scale computers. The research team found that, without intervention, soft errors invalidate simulations in a large fraction of cases, but they also developed techniques that can correct more than 95 percent of them.
According to their paper in the Journal of Chemical Theory and Computation, the next generation of systems will combine millions of cores, increasing the odds of soft errors and, with them, of unexpected results.
“Even if every core is highly reliable the sheer number of them will mean that the mean time between failures will become so short that most application runs will suffer at least one fault. In particular soft errors caused by intermittent incorrect behavior of the hardware are a concern as they lead to silent data corruption,” note the authors.
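The arithmetic behind this argument can be sketched in a few lines. The sketch below assumes that faults on each core are independent and arrive at a constant rate, which is a simplification and not a model taken from the paper. The function names and the per-core mean time between failures used in the demonstration are hypothetical values chosen only for illustration.

```python
import math


def system_mtbf_hours(core_mtbf_hours, n_cores):
    """Mean time between failures of the whole machine.

    Assumes independent cores with a constant failure rate, so the
    per-core failure rates simply add up.
    """
    return core_mtbf_hours / n_cores


def prob_at_least_one_fault(core_mtbf_hours, n_cores, run_hours):
    """Probability that a run of the given length sees one or more faults."""
    expected_faults = run_hours * n_cores / core_mtbf_hours
    return 1.0 - math.exp(-expected_faults)


if __name__ == "__main__":
    core_mtbf = 1.0e7  # hypothetical per-core MTBF in hours (over 1,000 years)
    run_hours = 24.0   # hypothetical length of one application run
    for n_cores in (1_000, 1_000_000, 100_000_000):
        mtbf = system_mtbf_hours(core_mtbf, n_cores)
        p = prob_at_least_one_fault(core_mtbf, n_cores, run_hours)
        print(f"{n_cores:>11,d} cores: system MTBF {mtbf:10.2f} h, "
              f"P(fault in run) = {p:.4f}")
```

Under these assumptions the machine-level mean time between failures shrinks in direct proportion to the core count, so a core that is individually very reliable still yields a machine on which a day-long run is more likely than not to be hit.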
The only way to deal with these errors is to identify and remedy them. The paper explores the impact of soft errors on optimization algorithms, which start with an initial guess and iteratively refine it until the solution converges. For a concrete example, the team used the Hartree–Fock method from quantum chemistry.
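To make the idea of an iterative optimizer hit by a soft error concrete, here is a toy sketch. It is not the Hartree-Fock method and not the team's experiment: it minimizes the one-dimensional function f(x) = (x^2 - 1)^2 by gradient descent and adds a single perturbation to the iterate partway through. The function name, step size, and perturbation sizes are all assumptions made for illustration.

```python
def minimize_with_fault(fault_size, fault_iteration=20, n_iterations=200,
                        start=0.5, step=0.05):
    """Gradient descent on f(x) = (x**2 - 1)**2 with one injected error.

    f has two minima, at x = +1 and x = -1. Starting from 0.5 the
    fault-free run converges to +1. The injected fault simply adds
    fault_size to the iterate once, standing in for silent corruption.
    """
    x = start
    for iteration in range(n_iterations):
        if iteration == fault_iteration:
            x += fault_size  # simulated soft error in the working data
        gradient = 4.0 * x * (x * x - 1.0)
        x -= step * gradient
    return x


if __name__ == "__main__":
    for fault in (0.0, 0.1, -2.5):
        print(f"fault {fault:+.1f} -> converged to {minimize_with_fault(fault):+.6f}")
```

In this toy setting a small perturbation is absorbed, because the following iterations pull the iterate back to the same minimum. A large perturbation pushes the iterate into the other basin, and the run converges cleanly to a different answer without any sign that something went wrong, which is the silent failure mode that makes detection necessary.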
The results indicate that the optimization algorithms absorbed soft errors of small magnitude but not large ones, so calculations still failed in a significant fraction of cases. The team suggests that detection and correction mechanisms tailored to different classes of data structures will allow large errors to be caught and repaired. They conclude that it is possible to correct more than 95 percent of the soft errors using these techniques with only a modest increase in computational cost.
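The article does not spell out the team's mechanisms, so the following is only a hedged sketch of what a check tied to one class of data structure might look like. It assumes a matrix that is known to be symmetric and whose entries are known to lie within a fixed bound. Both properties, the function name repair_symmetric, and the repair policy are assumptions for illustration, not the paper's method.

```python
def repair_symmetric(matrix, bound, tol=1e-12):
    """Detect and repair corrupted off-diagonal entries.

    Assumes the matrix should satisfy a[i][j] == a[j][i] and
    abs(a[i][j]) <= bound. Returns a repaired copy and the number of
    entries that were overwritten. Diagonal entries are not checked.
    """
    n = len(matrix)
    repaired = [row[:] for row in matrix]
    fixes = 0
    for i in range(n):
        for j in range(i + 1, n):
            upper, lower = repaired[i][j], repaired[j][i]
            if abs(upper - lower) <= tol:
                continue  # the mirrored pair agrees, nothing detected
            # NaN compares False here, so it counts as out of bounds.
            upper_ok = abs(upper) <= bound
            lower_ok = abs(lower) <= bound
            if upper_ok and not lower_ok:
                repaired[j][i] = upper  # large error: copy the sane twin
                fixes += 1
            elif lower_ok and not upper_ok:
                repaired[i][j] = lower
                fixes += 1
            elif upper_ok and lower_ok:
                # Small disagreement: average, and leave the residual
                # for the iterative solver to absorb.
                repaired[i][j] = repaired[j][i] = 0.5 * (upper + lower)
                fixes += 1
            else:
                raise ValueError(f"entries ({i},{j}) and ({j},{i}) both corrupted")
    return repaired, fixes


if __name__ == "__main__":
    clean = [[1.0, 0.5], [0.5, 2.0]]
    corrupted = [row[:] for row in clean]
    corrupted[0][1] = 1.0e30  # simulated large soft error
    print(repair_symmetric(corrupted, bound=10.0))
```

The design choice here is that redundancy already present in the data (the mirrored entry) does the correcting, while a cheap range check decides which copy to trust. The extra cost is one pass over the matrix, which is in the spirit of the modest overhead the article mentions, although the real overhead depends on the actual mechanisms used.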
The work was supported by the eXtreme Scale Computing Initiative using resources from the Environmental Molecular Sciences Laboratory, located at PNNL, as well as the PNNL Institutional Computing Facility. The paper was authored by PNNL researchers Hubertus J. J. van Dam, Abhinav Vishnu, and Wibe A. de Jong.