As more powerful systems encompass ever-increasing numbers of components, even a small fault rate on individual processors will generate multiple faults across the components, stopping long-running applications in their tracks. At a workshop in March, U.S. experts met to discuss issues relating to the fault-tolerance of today’s and tomorrow’s petascale and exascale computing systems. The group explored past practices and common pitfalls, and discussed strategies to ensure that these systems and the applications they run can tolerate the inevitable faults.
Embrace Failure!
April 22, 2009