Supercomputing systems will soon contain millions of CPUs, petabytes of memory, exabytes of disk, and a multi-level communication network that ties it all together. Since all of these components fail on a regular basis, HPC users will have to deal with unreliable hardware on a scale never imagined before. That means applications will have to figure out a way to run on systems that are constantly breaking.
The application resiliency problem is the one that seems to bubble to the top whenever an HPC’er talks about the road to exascale. It’s obviously not the only problem for exascale computing, but it’s the one that’s still mostly in the research stage. Given that the petascale-to-exascale transition is already underway, that’s somewhat worrisome.
In my recent conversation with Cray CTO Steve Scott, he expressed a great deal of confidence in fielding an exaflop supercomputer before the end of this decade. From his perspective, the hardware required to build such a system is pretty much in the pipeline today. But even he conceded that the application resiliency problem is one with no clear solution yet.
“We’re confident that we can make the system resilient, that is, keep it running, in the face of hardware failures — processors dying, interconnects failing, etc.,” Scott told me. “The problem is how to deal with that at the application that’s running over the entire machine.”
The current state of the art is the checkpoint-restart model, in which the running application periodically saves its state to disk. If a component fails, the application is resumed from the last checkpoint rather than restarted from scratch. For applications that run for weeks or months, checkpointing has been the only practical way to get through an entire run, and that’s true even for many terascale codes.
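To make the model concrete, here is a minimal sketch of application-level checkpoint-restart in Python. It is purely illustrative: the state layout, file name, and hourly checkpoint interval are assumptions, and a real HPC code would write its state through a parallel I/O library rather than pickling it to a single file.

```python
import os
import pickle
import time

CHECKPOINT_FILE = "state.ckpt"   # hypothetical checkpoint path
CHECKPOINT_INTERVAL = 3600       # assumed: checkpoint once an hour

def initial_state():
    return 0.0                   # stand-in for the real problem state

def compute_step(state):
    return state + 1.0           # stand-in for one step of the real computation

def save_checkpoint(step, state):
    """Write the full application state to disk, then atomically rename."""
    tmp = CHECKPOINT_FILE + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump({"step": step, "state": state}, f)
    os.replace(tmp, CHECKPOINT_FILE)  # avoids a torn checkpoint if we die mid-write

def load_checkpoint():
    """Resume from the last checkpoint if one exists, otherwise start fresh."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE, "rb") as f:
            ckpt = pickle.load(f)
        return ckpt["step"], ckpt["state"]
    return 0, initial_state()

def run(total_steps):
    step, state = load_checkpoint()
    last_ckpt = time.time()
    while step < total_steps:
        state = compute_step(state)
        step += 1
        if time.time() - last_ckpt >= CHECKPOINT_INTERVAL:
            save_checkpoint(step, state)
            last_ckpt = time.time()
    return state
```

If the job is killed partway through, rerunning `run()` picks up at the last saved step instead of at step zero; the price is the time spent writing the state out, and that is exactly the part that stops scaling.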
But as applications scale up, the checkpoint model becomes increasingly impractical. The problem becomes clear once you realize that the time needed to write a checkpoint is approaching the mean time between failures. For exascale-sized codes, transferring a full application snapshot from memory to disk is simply not an option.
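A back-of-the-envelope calculation shows why. The sketch below uses Young's classic first-order approximation for the optimal checkpoint interval, sqrt(2 * checkpoint time * MTBF); the six-hour MTBF and the dump times are illustrative numbers I've assumed, not measurements from any real machine.

```python
import math

def optimal_interval(ckpt_time, mtbf):
    """Young's approximation: checkpoint roughly every sqrt(2 * C * MTBF) seconds."""
    return math.sqrt(2.0 * ckpt_time * mtbf)

def checkpoint_overhead(ckpt_time, mtbf):
    """Rough fraction of wall-clock time spent just writing checkpoints."""
    return ckpt_time / optimal_interval(ckpt_time, mtbf)

mtbf = 6 * 3600  # assume some component takes the job down every 6 hours
for ckpt_time in (60, 600, 3600, 6 * 3600):  # 1 minute up to 6 hours per dump
    print(f"dump takes {ckpt_time / 60:5.0f} min -> "
          f"checkpoint every {optimal_interval(ckpt_time, mtbf) / 3600:4.1f} h, "
          f"~{100 * checkpoint_overhead(ckpt_time, mtbf):3.0f}% of time checkpointing")
```

With a one-minute dump the overhead is a few percent; once the dump time is on the same order as the MTBF, most of the machine's hours go to saving state, and that's before counting the work lost and redone after each failure.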
There’s nothing magical about exascale-sized programs here, though. According to a 2009 white paper (PDF) by the Illinois-INRIA Joint Laboratory on PetaScale Computing: “Some projections estimate that, with the current technique, the time to checkpoint and restart may exceed the mean time to interrupt of top supercomputers before 2015.”
The INRIA study does a pretty good job of outlining the problem in more detail and discussing some possible solutions. Approaches include diskless checkpointing (using RAM or SSD devices), minimizing checkpoint size, maintaining redundant hardware (memory, CPUs, etc.), and proactively predicting hardware and software faults. In general, though, the authors imply that the full system software stack, and the applications themselves, will ultimately have to be made fault-tolerant. That means extra logic will have to be added to detect failures and then re-execute the appropriate code.
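What that extra logic might look like is sketched below, strictly as an illustration of the idea rather than anyone's actual design: a unit of work is retried when a fault is detected, instead of the whole job dying. The `FaultDetected` exception, the retry limit, and the task granularity are all hypothetical; in practice this would sit on top of a fault-tolerant runtime or MPI layer.

```python
class FaultDetected(Exception):
    """Raised when a node, link, or memory fault is detected during a task."""

def run_with_reexecution(task, inputs, max_retries=3):
    """Detect a failure and re-execute just the affected task, not the whole run."""
    for attempt in range(1, max_retries + 1):
        try:
            return task(inputs)
        except FaultDetected as fault:
            # Report the fault and re-run this piece of work on surviving resources.
            print(f"fault on attempt {attempt}: {fault}; re-executing task")
    # If the same task keeps failing, fall back to a coarser recovery mechanism.
    raise RuntimeError("task failed repeatedly; falling back to last checkpoint")
```

The application, or its runtime, still has to decide what a retryable unit of work is and how to detect that it failed, which is precisely the burden described next.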
Dumping the responsibility for resiliency into the laps of application and systems programmers is going to be particularly burdensome, given all the other work that has to be done to parallelize code for exascale machines. Eventually, the hardware may catch up, and supercomputers with transparent resiliency will become economical to build. Until then, software developers following the path to exascale may find the road even rougher than imagined.