Since 1986 - Covering the Fastest Computers in the World and the People Who Run Them

July 2, 2012

Amazon Bit the Dust… Again

Robert Gelber

Last Friday, in what has become a not-so-unusual occurrence for the cloud services provider, Amazon’s Elastic Compute Cloud went dark. The event, lasting two hours, happened just weeks after a power outage knocked the company’s US-EAST-1 region offline for roughly six hours.

Both failures originated at Amazon's operations in Northern Virginia. Ars Technica detailed the earlier event, in which primary power, primary backup power, and secondary backup power all failed. Amazon explained that the issue began with a cable fault that disconnected its primary power source. Shortly thereafter, the primary backup generator failed due to a faulty cooling fan. The secondary backup was also inoperable due to an incorrectly configured circuit breaker.

Amazon has since promised that circuit breaker configuration will become part of its auditing process, but the message was met with some valid skepticism by Ars.

“So, the breakers are fixed, but it’s hard to imagine there won’t be other problems in the future.”

Sure enough, another power-related event knocked out the US-EAST-1 region. This time, operations were affected by a major storm that left roughly 400,000 people without electricity. The outage took down Instagram, Pinterest, Heroku and Netflix for 2-3 hours.

Netflix is a prominent user of Amazon Web Services and is fully aware that the cloud provider is not infallible. Last April, its website famously stayed online during a major EC2 outage that took down Reddit, Quora, Hootsuite and Foursquare, among others. Following that event, Netflix explained how its service had stayed up.

“Why were some websites impacted while others were not? For Netflix, the short answer is that our systems are designed explicitly for these sorts of failures. When we re-designed for the cloud this Amazon failure was exactly the sort of issue that we wanted to be resilient to.”

Unfortunately, Friday’s event took down the video streaming site as well. As of now, the Netflix tech blog has not posted a breakdown of the event. PC Mag received a vague explanation from a Netflix representative, who said the downtime was the result of a “rare technical issue that our engineers fixed.”

The recent events demonstrate how fragile some portions of the Internet can be. They also act as a wake-up call to services relying on Amazon. HP, Microsoft and, most recently, Google have entered the public cloud game, offering alternatives to EC2. If reliability problems continue to dog the cloud giant, these competitors may be more than willing to tempt some of its current customers away.
