Researchers from Google recently addressed this issue of availability in globally distributed storage systems, noting that while there is plenty of information about how components of storage systems fail, few have looked at the more positive side of the storage coin—overall availability for megacloud-based storage services.
The work is based on a one-year study of Google’s main storage infrastructure. The authors note that “highly-available cloud storage is often implemented with complex, multi-tiered distributed systems built on top of clusters of commodity servers and disk drives” and that, accordingly, “sophisticated management, load balancing and recovery techniques are needed to achieve high performance and availability amidst an abundance of failure sources that include hardware, software, network connectivity and power issues.”
To arrive at their conclusions, the authors built a series of statistical models that examine different design choices, including variable replication and data placement. Using these models, the researchers are able to examine availability against a number of system parameters that have been tested and encountered in Google’s fleet.
Among the key findings is a strong correlation “among node failures that dwarfs all other contributions to unavailability in our [Google’s] production environment.” The authors also conclude that “though disk failures can result in permanent data loss, the multitude of transitory node failures account for most unavailability.”
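To see why correlated failures can dominate, consider a rough back-of-the-envelope sketch. This is not the model from the Google study. It is a toy calculation that assumes each node is independently unavailable with probability `p_node`, and that a rare correlated event (say, a rack or power outage) takes every replica of a piece of data offline at once with probability `p_burst`. The function names and all the numbers are hypothetical and chosen only for illustration.

```python
def unavailability_independent(p_node: float, replicas: int) -> float:
    """Probability that all replicas are down when node failures are independent."""
    return p_node ** replicas


def unavailability_with_bursts(p_node: float, replicas: int, p_burst: float) -> float:
    """Probability that all replicas are down when, in addition to independent
    failures, a correlated event can take out every replica at once.

    Toy assumption: all replicas sit in the same failure domain, so a burst
    makes the data unavailable no matter how many replicas exist.
    """
    return p_burst + (1.0 - p_burst) * unavailability_independent(p_node, replicas)


if __name__ == "__main__":
    p_node = 0.01    # hypothetical: a node is unavailable 1% of the time
    p_burst = 0.001  # hypothetical: a correlated outage 0.1% of the time
    for replicas in (1, 2, 3):
        independent = unavailability_independent(p_node, replicas)
        correlated = unavailability_with_bursts(p_node, replicas, p_burst)
        print(f"replicas={replicas} independent={independent:.2e} "
              f"with_bursts={correlated:.2e}")
```

Under these assumed numbers, adding replicas drives the independent term down quickly, but the correlated term stays put, so it ends up setting the floor on unavailability. Placing replicas in separate failure domains is one way to shrink that term, which is where data placement comes into the picture.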