Think your day is going badly? Let this IEEE Spectrum article by Al Geist on the many ways supercomputers can crash lift your spirits. Geist, chief technologist for the computer science and mathematics division at Oak Ridge National Laboratory, has written a lively account of the gremlins that have a nasty tendency to gum up the supercomputer works.
His article, How To Kill A Supercomputer: Dirty Power, Cosmic Rays, and Bad Solder, isn’t just a walk down miserable memory lane (yes, memory corruption is in there); it also looks at the challenges that will probably bedevil exascale machines. For example, the smaller, lower-voltage transistors likely needed to hit DOE’s 20 MW power target for exascale machines will also be more susceptible to spontaneously flipping on and off. Resiliency mechanisms will need to reach new highs.
Geist writes, “While I’ve talked a lot about faults causing machines to crash, these are not, in fact, the most dangerous. More menacing are the errors that allow the application to run to the end and give an answer that looks correct but is actually wrong. You wouldn’t want to fly in an airliner designed using such a calculation. Nor would you want to certify a new nuclear reactor based on one. These undetected errors—their types, rates, and impact—are the scariest aspect of supercomputing’s monster in the closet.”
Weird as it sounds, Geist’s piece is a fun, fast read. Here’s a link to the full article on IEEE Spectrum: http://spectrum.ieee.org/computing/hardware/how-to-kill-a-supercomputer-dirty-power-cosmic-rays-and-bad-solder