Tag: fault tolerance
Among the chief challenges of deploying useful exascale machines, resilience looms large. Today’s error rates combined with tomorrow’s node counts cannot sustain a productive workflow without intervention. The significance of this issue has not gone unnoticed. A comprehensive collection of fault tolerance techniques are presented in one volume, called “Fault Tolerance Techniques for High-Performance Computing,” by editors Thomas Herault and Yves Read more…
With the proliferation of public cloud infrastructures, our dependability on them has increased. Many of our vital services pertaining to the research, industry or even lifestyle domain have been massively moved onto the cloud. Then, what happens when the cloud services we are depending on go down? Dr. Jose Luis Vazquez-Poletti shares some key aspects on how the scientific community can provide answers to this problem.
Achieving workable software-based fault tolerance will require a fresh approach for developers.
Supercomputing apps may have to ditch the checkpoint-restart model.
Can smart checkpoints and fault-resilient applications avert a Malthusian Catastrophe?