The Microsoft Windows Azure cloud experienced a serious outage early Wednesday morning (Greenwich Mean Time), preventing some customers from accessing data and applications. The problem, which has affected the system’s service management component, is thought to be related to a leap day timing glitch. While the company has been actively working to restore the service, it is still experiencing intermittent issues at the time of writing.
This is the company’s official response:
On February 28th, 2012 at 5:45 PM PST Microsoft became aware of an issue impacting Windows Azure service management in a number of regions. This prevented customers from creating, updating or deleting their hosted services. The root cause of this incident has been traced back to a certificate issue that affected all Windows Azure sub-regions. In order to contain the incident we disabled service management for all Windows Azure sub-regions. During this incident there was no impact on Windows Azure Storage and all storage accounts remained fully accessible in all sub-regions.
The Windows Azure engineering team developed, validated and deployed a fix for this issue. The fix has been successfully deployed to most Windows Azure sub regions. Windows Azure service management has been restored in these regions. We were unable to deploy the fix successfully to some customers in 3 sub regions – North Central US, South Central US, and North Europe. Currently impacted customers may experience issues with Access Control 2.0, Marketplace, Service Bus and the Access Control & Caching Portal and as a result may experience a loss of application functionality. Engineering teams are actively working to resolve this issue as soon as possible.
We apologize for any inconvenience this has caused you and will update the Service Dashboard on an hourly basis until this incident is resolved. We are continuously working on improving the Windows Azure Platform to help ensure similar incidents do not occur in the future.
The Azure Service Dashboard first listed the problem on Feb. 29 at 1:45 a.m. GMT (Feb. 28, 5:45 p.m. PST):
We are experiencing an issue with Windows Azure service management. Customers will not be able to carry out service management operations. We are actively investigating this issue and working to resolve it as soon as possible. Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers.
At 4 a.m. GMT, Feb. 29 (8 p.m. PST, Feb. 28), Microsoft said that it had identified the root cause of this incident, tracing it “back to a cert issue triggered on 2/29/2012 GMT.”
At 9 a.m. GMT, Feb. 29 (1 a.m. PST, Feb. 29), the company announced that it would begin fixing the problem. They began with a gradual rollout of the hotfix in North Central US sub-region. “As we proceed through the rollout,” Microsoft stated, “we will progressively enable service management back for customers.”
At 7:30 p.m. UTC Feb. 29 (11:30 a.m. PST, Feb. 29), the software giant posted the following statement:
We are actively recovering Windows Azure hosted services in the North Central US, South Central US and North Europe sub-regions. More and more customers applications should be back up-and-running even if service management functionality is not yet restored. Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers.
As of approximately 4:30 p.m. PST, the service is still experiencing multiple issues. While Microsoft has been proactive in issuing status updates, they have not been forthcoming in providing an estimated time to completion. Software development times are notoriously unpredictable and the company doesn’t want to over-promise, but that doesn’t make it any less frustrating for the customer.
This is not Azure’s first face plant. Back in 2009, while still in beta, the service went down for 22 hours. Amazon and Google have also experienced outages. What does this say about the reliability of the cloud? As with any technology there is a risk-benefit profile, but reliability is supposed to be one of cloud’s calling cards. When it comes to lessons learned, this is a painful one, to the customers and to the cloud provider. There’s no question that people will be talking about these very public failures for a long time.
Update (information added to the article at 7:25 p.m. PST.)
Microsoft has posted the following updates on the Azure Service Dashboard.
1-Mar-12
1:00 AM UTC We have restored full service management functionality for all Windows Azure hosted services in the North Central US sub-region. We have restored full service management functionality for most customers in the South Central US and North Europe sub-regions. We have published a recap of the incident since it started on the Windows Azure team blog (http://blogs.msdn.com/b/windowsazure/archive/2012/03/01/windows-azure-service-disruption-update.aspx). Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers.
1:25 AM UTC We have restored full service management functionality for all Windows Azure hosted services in the North Europe sub-region. We have restored full service management functionality for most customers in the South Central US sub-region. Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers.