In some of my past columns, I discussed the reliability of clusters. The statistics tell us that the bigger a cluster gets, the more failures we can expect. Indeed, it is not uncommon for a very large cluster to see a failure per day, and this is totally expected! In my past musings on this topic, I suggested dynamic parallelization and disposable nodes as ways to address the failure issue. Here, I’ll expand on the disposable idea based on some new research projects I have been following.
Disposable HPC
May 26, 2010