June 28, 2012

Outage at Amazon EC2 Virginia Illustrates the Value of the Expendable Datacenter℠

The Amazon EC2 cloud had a relatively minor outage a couple of weeks ago, on June 14, 2012. As it turns out, it happened in the same Virginia datacenter that spawned the April 2011 and August 2011 outages. I've been on the road, but now that I look into it I see that it's a classic outage scenario and a classic example of cascading failure resulting from a failover. It also illustrates just why you need to plan for Expendable Datacenters℠.

Background: Amazon and Their Cloud Service

I blogged about Amazon's big outage last August (source), and described how large a role Amazon plays in the cloud world. I won't recap all that here, but I will say that among its clients are Netflix, Instagram, Reddit, Foursquare, Quora, Pinterest, parts of Salesforce.com, and ZeroNines client ZenVault.

The Problem: A Power Outage

According to the Amazon status page (source), "a cable fault in the high voltage Utility power distribution system" led to a power outage at the datacenter. Primary backup generators successfully kicked in, but after about nine minutes "one of the generators overheated and powered off because of a defective cooling fan." Secondary backup power successfully kicked in, but after about four minutes a circuit breaker opened because it had been "incorrectly configured." At this point, with no power at all, some customers went completely offline. Others that were using Amazon's multi-Availability Zone configurations stayed online but seem to have suffered from impaired API calls, described below. Power was restored about half an hour after it was first lost.

Sites started recovering as soon as power was restored and most customers were back online about an hour after the whole episode began. But it is clear that many weren't really ready for business again because of the cascading effects of the initial interruption.

Subsequent Problems: Loss of In-Flight Transactions

The Amazon report says that when power came back on, some instances were "in an inconsistent state" and that they may have lost "in-flight writes." I interpret this to mean that when the system failed over to the backups, the backup servers were not synchronized with the primaries, resulting in lost transactions. This is typical of a failover disaster recovery system.
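To make the failure mode concrete, here is a minimal, hypothetical sketch of asynchronous primary/backup replication; it is not Amazon's actual implementation, and the class and method names are mine. The point it illustrates is that any write acknowledged to the client but not yet shipped to the replica simply disappears when the primary dies and the replica is promoted.

```python
# Hypothetical sketch of async primary/backup replication (assumption: not Amazon's
# real system). Writes are acknowledged before they reach the replica, so a crash
# of the primary loses whatever was still "in flight."

import collections


class AsyncReplicatedStore:
    def __init__(self):
        self.primary = {}                      # live data on the primary
        self.replica = {}                      # lagging copy on the backup
        self.pending = collections.deque()     # writes not yet replicated

    def write(self, key, value):
        """Acknowledge immediately; replicate later."""
        self.primary[key] = value
        self.pending.append((key, value))
        return "ACK"                           # client believes the write is durable

    def replicate_one(self):
        """Ship one queued write to the replica (normally a background task)."""
        if self.pending:
            key, value = self.pending.popleft()
            self.replica[key] = value

    def fail_over(self):
        """Primary loses power: promote the replica, dropping anything in flight."""
        lost = list(self.pending)
        self.primary, self.pending = dict(self.replica), collections.deque()
        return lost                            # these are the lost "in-flight writes"


store = AsyncReplicatedStore()
store.write("order-1", "paid")
store.replicate_one()                          # order-1 reaches the replica
store.write("order-2", "paid")                 # acknowledged, but never replicated
print(store.fail_over())                       # [('order-2', 'paid')] -- a lost transaction
```

The gap between "acknowledged" and "replicated" is the window where a failover-based design can silently drop work.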

Another Subsequent Problem: Impaired API Calls

Additionally, during the power outage, API calls related to Amazon Elastic Block Store (EBS) volumes failed. Amazon sums up the effect beautifully: "The datastore that lost power did not fail cleanly, leaving the system unable to flip [failover] the datastore to its replicas in another Availability Zone." Here's a second failed failover within the same disaster.

My Compliments to Amazon EC2

In all seriousness, I truly commend Amazon for publicly posting such a detailed description of the disaster. It looks to me like they handled the disaster quickly and efficiently within the limitations of their system. Unfortunately that system is clearly not suited to the job at hand.

Amazon does a pretty good job at uptime. We (ZeroNines) use the Amazon EC2 cloud ourselves. But we hedge our bets by adding our own commercially available Always Available™ architecture to harden the whole thing against power outages and such. If this outage had affected our particular instances, we would not have experienced any downtime, inconsistency, failed transactions, or other ill effects.

One Solution for Three Problems

Always Available runs multiple instances of the same apps and data in multiple clouds, virtual servers, or other hosting environments. All are hot and all are active.

When the power failed in the first phase of this disaster, two or more identical Always Available nodes would have continued processing as normal. The initial power outage would not have caused service downtime because customers would have been served by the other nodes.

Second, those in-flight transactions would not have been lost, because the other nodes would have continued processing them. With Always Available there is no failover and consequently no "dead air" during which transactions can be lost.

Third, those EBS API calls would not have failed because, again, they would have been routed to the remaining fully functional nodes.

A big issue in this disaster was the "inconsistent state," or lack of synchronization between the primary and the failover servers. Within an Always Available architecture, there is no failover. Each server is continually updating and being updated by all other servers. Synchronization takes place constantly so when one node is taken out of the configuration the others simply proceed as normal, processing all transactions in order. When the failed server is brought back online, the others update it and bring it to the same logical state so it can begin processing again.
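For readers who like to see the idea in code, here is a minimal, hypothetical sketch of the active-active pattern described above. It is emphatically not ZeroNines' actual Always Available implementation; the cluster class, node names, and replay logic are illustrative assumptions only. It shows the essential behavior: every transaction is applied in order to all reachable nodes, a node that goes dark is simply skipped, and when it returns it is replayed back to the same logical state as its peers.

```python
# Hypothetical active-active sketch (assumption: not the Always Available product).
# All nodes are hot; there is no failover step, only a resync when a node returns.

class ActiveActiveCluster:
    def __init__(self, node_names):
        self.nodes = {name: [] for name in node_names}   # each node's ordered transaction log
        self.down = set()                                # nodes currently offline
        self.history = []                                # global ordered history, used for resync

    def apply(self, txn):
        """Apply a transaction, in order, to every node that is currently up."""
        self.history.append(txn)
        for name, log in self.nodes.items():
            if name not in self.down:
                log.append(txn)

    def fail(self, name):
        """A node loses power; the survivors keep serving with no failover step."""
        self.down.add(name)

    def recover(self, name):
        """Bring the node back and replay what it missed, restoring the same logical state."""
        missed = self.history[len(self.nodes[name]):]
        self.nodes[name].extend(missed)
        self.down.discard(name)


cluster = ActiveActiveCluster(["virginia", "oregon", "ireland"])
cluster.apply("txn-1")
cluster.fail("virginia")          # the Virginia node goes dark
cluster.apply("txn-2")            # processing continues on the surviving nodes
cluster.apply("txn-3")
cluster.recover("virginia")       # resynchronized to the same logical state
print(cluster.nodes["virginia"])  # ['txn-1', 'txn-2', 'txn-3']
```

Because no single node is ever the sole owner of the data, losing one of them is a non-event for the customers being served by the others.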

The Expendable Datacenter

Another thing I can't help but point out is the string of events that caused the outage in the first place. First a cable failure combines with a fan failure, and that combines with a circuit breaker failure. It's simple stuff that adds up into a disaster. Then software that can't synchronize. Given the complexities of the modern datacenter, how many possible combinations of points of failure are there? Thousands? Millions? I'll go on the record and say that there is no way to map all the possible failures, and no way to guard against them all individually. It's far better to accept the fact that servers, nodes, or entire facilities will go down someday, and that you need to make the whole datacenter expendable without affecting performance. That's what ZeroNines does.

So if you're a cloud customer, take a look at ZeroNines. We can offer virtually 100% uptime whether you host in the cloud, on virtual servers, or in a typical hosted environment. And if you're a cloud provider, you can apply Always Available architecture to your service offering, avoiding disasters like this in the first place.

Check back in a few days and I'll write another post that looks at this outage from a business planning perspective.

Visit the ZeroNines website to find out more about how our disaster-proof architecture protects businesses of any description from downtime.

Alan Gin – Founder & CEO, ZeroNines