July 5, 2012

Multi-Region Disasters and Expendable Databases℠

As I write this, the United States is suffering from a frightening heat wave and the lingering effects of storms that threatened everything east of the Rockies. Over two dozen lives have been lost and the heat might last for several more days. The transportation, emergency response, and utility infrastructures are badly strained; about a million customers are still without power. Major fires are burning in the west. It is distressing to think about this kind of multi-region disaster, but it is happening now and continues to unfold.

And as trivial as it feels to write this, those natural disasters led to another outage of the Amazon EC2 cloud on June 29. This one is very similar to the EC2 outage of June 15, which I blogged about last week (source). Friday's outage happened at the same Virginia datacenter and was caused by the same kind of event.

The Problem: Another Power Outage and Generator Failure

As happened a couple of weeks ago, a storm-induced power outage at the utility company on Friday forced a switchover to Amazon's backup generator, and that generator failed (source). Netflix, Instagram, Pinterest, and others began to experience difficulties and outages. The problems lasted about an hour.

But rather than dissect this one outage, let's take a look at the larger issues surrounding downtime in the cloud.

Cloud Users Must Plan their Own Disaster Recovery

Wired.com had this to say about Friday's event:

In theory, big outages like this aren’t supposed to happen. Amazon is supposed to keep the data centers up and running – something it has become very good at… In reality, though, Amazon data centers have outages all the time. In fact, Amazon tells its customers to plan for this to happen, and to be ready to roll over to a new data center whenever there’s an outage. (source)

The long and short of it: cloud customers think they have shed their responsibility for business continuity and handed it to the cloud provider. They're wrong, and Amazon has effectively admitted as much by telling its customers to make their own disaster recovery preparations.

"Stupid IT Mistakes"

Those are the lead words in the title of an article about the June 15 outage (source). In it, the author cites statistics from cloud optimization firm Newvem showing that 40% of cloud users are not properly prepared for an outage. They have no redundancy of any kind: they don't back up their data, and they deploy to only one region. Frighteningly, this includes large companies as well as small ones.

Promoting a Failure-Prone Recovery Plan

Another problem is that Amazon has apparently told its customers to "be ready to roll over to a new data center" (source). This is tacit approval of failover-based disaster recovery systems. But as we saw with the June 15 outage, failovers fail all the time and cannot be relied upon to maintain continuity. In fact, they often contribute to outages.

As for regular backups, they are always a good idea. But a backup location can fail too, particularly if it is hit by the same disaster. What happens to transactions that occurred after the last backup? Will a recovery based on those backups even succeed? And although a backup may eventually get you running again, it can't prevent the costly downtime in the meantime.

You Can't Prevent All Causes of Failure

I argue again and again that there is no way to prevent all the thousands of small and large errors that, singly or in combination, can knock out a datacenter, cloud node, or server. Generators, power supplies, bad cables, human error, and any number of other small failures can easily combine into a big disaster. It isn't practical to continually monitor all of them and circumvent every possible failure. IT is chaos theory personified: do all you can, but something is going to break.

Geographic Issues

As we are seeing this week, one disaster or group of disasters can span vast geographic areas. You need to plan your business continuity system so the same disaster can't affect everything. Companies that have located all their IT on the Eastern Seaboard should be sweating it this week, because it's conceivable that the heat wave and storms could cause simultaneous power outages from New York to Virginia to Florida. A primary site, failover site, and backup location could all go down at the same time.

The Real Solution: Geographically Separated Expendable Datacenters℠

Here at ZeroNines we've constructed our business continuity solution around a number of tenets, including:
  • Service must continue despite the complete failure of any one datacenter.
  • Geographical separation is key, to prevent one disaster from wiping out everything.
  • Failover is not an option, because it is extremely unreliable.
  • The solution must be easy and affordable so that the "40%" mentioned above actually use it.
Based on all this, we've developed the Always Available™ architecture, which enables multiple instances of the same applications, data, and transactions to run equally and simultaneously in multiple locations hundreds or thousands of miles apart. The system does not rely upon or include either failover or restoration from backup. Best of all, an entire datacenter can go offline at any moment and all transactions will simply continue to be processed at the other datacenters, with no interruption to service. It is affordable and OS-agnostic, and it works with existing apps, databases, and infrastructure.
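
To make the active-active idea concrete, here is a minimal sketch in Python of how an application might apply every transaction to several geographically separated replicas at once, with no failover step at all. The Replica class, the region names, and the at-least-one-acknowledgement rule are my own illustrative assumptions, not a description of how Always Available is actually implemented.

```python
# Illustrative sketch only: a toy "active-active" client that applies every
# transaction to several geographically separated replicas at once, instead
# of failing over to a standby. Names and behavior are hypothetical.
from concurrent.futures import ThreadPoolExecutor


class Replica:
    """A stand-in for one datacenter's copy of the application state."""

    def __init__(self, name):
        self.name = name
        self.online = True
        self.store = {}

    def apply(self, txn_id, payload):
        if not self.online:
            raise ConnectionError(f"{self.name} is unreachable")
        self.store[txn_id] = payload  # idempotent write keyed by transaction id
        return self.name


def apply_everywhere(replicas, txn_id, payload):
    """Send the same transaction to every replica concurrently.

    The transaction succeeds as long as at least one replica accepts it;
    there is no failover step, because every replica is already live.
    """
    with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        futures = {pool.submit(r.apply, txn_id, payload): r for r in replicas}
        acks, failures = [], []
        for fut, replica in futures.items():
            try:
                acks.append(fut.result(timeout=5))
            except Exception as exc:
                failures.append((replica.name, exc))
    if not acks:
        raise RuntimeError("all replicas rejected the transaction")
    return acks, failures


if __name__ == "__main__":
    regions = [Replica("us-east-1"), Replica("us-west-1"), Replica("santa-clara")]
    regions[0].online = False  # simulate the EC2 East outage
    acks, failures = apply_everywhere(regions, "txn-42", {"amount": 10})
    print("acknowledged by:", acks)   # service continues on the surviving replicas
    print("unreachable:", failures)
```

The point of the sketch is simply that when every location is already processing live traffic, losing one of them changes nothing for the user; there is no standby to promote and nothing to "fail over" to.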

ZeroNines client ZenVault uses Always Available and hosts on Amazon EC2. During these outages, no extraordinary measures were needed: if the EC2 East node goes offline, the two other nodes (in Santa Clara and EC2 West) continue running the service and restore the EC2 East node once it comes back online. ZenVault has had true 100% uptime since the day it launched in 2010.
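
Continuing the toy sketch above, restoring a node that has come back online could look something like the function below: the replica that was offline simply replays the transactions it missed from any healthy peer. Again, this is a hypothetical illustration built on the Replica class from the earlier sketch, not the actual ZenVault or ZeroNines mechanism.

```python
# Hypothetical continuation of the Replica sketch above: bring a recovered
# replica back up to date by replaying the transactions it missed while it
# was offline. Not the actual ZeroNines restore mechanism.
def resynchronize(recovered, healthy):
    """Copy any transactions the recovered replica missed while offline."""
    recovered.online = True
    missed = {k: v for k, v in healthy.store.items() if k not in recovered.store}
    for txn_id, payload in missed.items():
        recovered.apply(txn_id, payload)
    return len(missed)


# Example: after the EC2 East outage ends, that replica catches up from a peer.
# replayed = resynchronize(regions[0], regions[1])
```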

Visit the ZeroNines website to find out more about how our disaster-proof architecture protects businesses of any description from downtime.

Alan Gin – Founder & CEO, ZeroNines

1 comment:

  1. Alan, it is very clear to me that ZeroNines has figured this out. I am reading about more and more outages, and it seems very disruptive. I like your solution and see significant value.
