Tag: fault tolerance

Reading List: Fault Tolerance Techniques for HPC

Aug 6, 2015

Among the chief challenges of deploying useful exascale machines, resilience looms large. Today’s error rates combined with tomorrow’s node counts cannot sustain a productive workflow without intervention. The significance of this issue has not gone unnoticed. A comprehensive collection of fault tolerance techniques is presented in one volume, “Fault Tolerance Techniques for High-Performance Computing,” by editors Thomas Herault and Yves…

Toward a Fault-Tolerant Cloud

Jun 23, 2011

With the proliferation of public cloud infrastructures, our dependence on them has increased. Many vital services in research, industry, and even everyday life have moved to the cloud en masse. So what happens when the cloud services we depend on go down? Dr. Jose Luis Vazquez-Poletti shares some key points on how the scientific community can help answer this problem.

Looking to Fault-Tolerant Software

Nov 9, 2010

Achieving workable software-based fault tolerance will require a fresh approach for developers.

The Other Exascale Challenge

Jun 10, 2010

Supercomputing apps may have to ditch the checkpoint-restart model.

Embrace Failure!

Apr 22, 2009

Can smart checkpoints and fault-resilient applications avert a Malthusian catastrophe?