
July 5, 2012

Multi-Region Disasters and Expendable Datacenters℠

As I write this, the United States is suffering from a frightening heat wave and the lingering effects of storms that threatened everything east of the Rockies. Over two dozen lives have been lost and the heat might last for several more days. The transportation, emergency response, and utility infrastructures are badly strained; about a million customers are still without power. Major fires are burning in the west. It is distressing to think about this kind of multi-region disaster, but it is happening now and continues to unfold.

And as trivial as it feels to write this, these natural disasters led to another outage of the Amazon EC2 cloud on June 29. It is very similar to the EC2 outage of June 14, which I blogged about last week (source). Friday's outage happened at the same Virginia datacenter and was caused by the same kind of event.

The Problem: Another Power Outage and Generator Failure

As happened a couple of weeks ago, a storm-induced power outage at the utility company on Friday forced a switchover to Amazon's backup generator, and that generator failed (source). Netflix, Instagram, Pinterest, and others began to experience difficulties and outages. The problems lasted about an hour.

But rather than dissect this one outage, let's take a look at the larger issues surrounding downtime in the cloud.

Cloud Users Must Plan their Own Disaster Recovery

Wired.com had this to say about Friday's event:

In theory, big outages like this aren’t supposed to happen. Amazon is supposed to keep the data centers up and running – something it has become very good at… In reality, though, Amazon data centers have outages all the time. In fact, Amazon tells its customers to plan for this to happen, and to be ready to roll over to a new data center whenever there’s an outage. (source)

The long and short of it: cloud customers think they have shed their responsibility for business continuity and handed it to the cloud provider. They're wrong, and Amazon has apparently admitted as much by telling its customers to make their own disaster recovery preparations.

"Stupid IT Mistakes"

Those are the lead words in the title of an article about the June 14 outage (source). In it, the author cites statistics from cloud optimization firm Newvem showing that 40% of cloud users are not properly prepared for an outage. They have no redundancy of any kind: they don't back up their data, and they deploy to only one region. Frighteningly, this includes large companies as well as small ones.

Promoting a Failure-Prone Recovery Plan

Another problem is that Amazon has apparently told its customers to "be ready to roll over to a new data center" (source). This is tacit approval of failover-based disaster recovery. But as we saw with the June 14 outage, failovers fail all the time and cannot be relied upon to maintain continuity. In fact, they often contribute to outages.

As for regular backups, that's always a good idea. But a backup location can fail too, particularly if it is hit by the same disaster. And what happens with transactions that occurred after the last backup? Will a recovery based on these backups even succeed? And although backup may eventually get you running again, it can't prevent the costly downtime.

You Can't Prevent All Causes of Failure

I argue again and again that there is no way to prevent all the thousands of small and large errors that can conspire (singly or in combination) to knock out a datacenter, cloud node, or server. Generators, power supplies, bad cables, human error and any number of other small disasters can easily combine to make a big disaster. It's not practical to continually monitor all of these to circumvent every possible failure. IT is chaos theory personified; do all you can, but something's going to break.

Geographic Issues

As we are seeing this week, one disaster or group of disasters can span vast geographic areas. You need to plan your business continuity system so the same disaster can't affect everything. Companies that have located all their IT on the Eastern Seaboard should be sweating it this week, because it's conceivable that the heat wave and storms could cause simultaneous power outages from New York to Virginia to Florida. A primary site, failover site, and backup location could all go down at the same time.

The Real Solution: Geographically Separated Expendable Datacenters℠

Here at ZeroNines we've constructed our business continuity solution around a number of tenets, including:
  • Service must continue despite the complete failure of any one datacenter.
  • Geographical separation is key, to prevent one disaster from wiping out everything.
  • Failover is not an option, because it is extremely unreliable.
  • The solution must be easy and affordable so that the "40%" mentioned above actually use it.
Based on all this, we've developed the Always Available™ architecture, which enables multiple instances of the same applications, data, and transactions to run equally and simultaneously in multiple locations hundreds or thousands of miles apart. The system neither relies on nor includes failover or restoration from backup. Best of all, an entire datacenter can go offline at any moment and all transactions simply continue to be processed at the other datacenters, with no interruption to service. It is affordable, OS agnostic, and works with existing apps, databases, and infrastructures.
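
To make that concrete, here is a minimal sketch in Python of the active-active idea: every transaction is sent to every node, so no single node is a point of failure and there is no failover step at all. This is my own illustration, not ZeroNines code; the node URLs and helper names are hypothetical.

```python
import concurrent.futures
import requests  # assumed HTTP transport; any reliable messaging layer would do

# Hypothetical endpoints. In an Always Available-style deployment these would be
# geographically separated datacenters, all hot and all processing every transaction.
NODES = [
    "https://east.example.com/tx",
    "https://west.example.com/tx",
    "https://santa-clara.example.com/tx",
]

def submit_transaction(payload: dict) -> bool:
    """Send the same transaction to every node in parallel.

    The transaction counts as processed as long as at least one node accepts it;
    a node that is down is simply skipped, so there is no failover step and no
    window in which work sits on a single machine waiting to be lost.
    """
    def post(url):
        try:
            return requests.post(url, json=payload, timeout=5).ok
        except requests.RequestException:
            return False  # that node is offline; the others carry the load

    with concurrent.futures.ThreadPoolExecutor(max_workers=len(NODES)) as pool:
        results = list(pool.map(post, NODES))
    return any(results)

if __name__ == "__main__":
    ok = submit_transaction({"account": "demo", "op": "credit", "amount": 100})
    print("processed" if ok else "all nodes unreachable")
```

A production system obviously needs ordering and consistency guarantees layered on top of this fan-out; that synchronization is the hard part, and it is what an architecture like Always Available is built to handle.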

ZeroNines client ZenVault uses Always Available and hosts on Amazon EC2. During these outages, no extraordinary measures were needed: if the EC2 East node goes offline, the two other nodes (in Santa Clara and EC2 West) continue running the service and restore the EC2 East node once it comes back online. ZenVault has had true 100% uptime since the day it launched in 2010.

Visit the ZeroNines website to find out more about how our disaster-proof architecture protects businesses of any description from downtime.

Alan Gin – Founder & CEO, ZeroNines

June 28, 2012

Outage at Amazon EC2 Virginia Illustrates the Value of the Expendable Datacenter℠

The Amazon EC2 cloud had a relatively minor outage a couple of weeks ago, on June 14, 2012. As it turns out, it happened in the same Virginia datacenter that spawned the April 2011 and August 2011 outages. I've been on the road, but now that I've looked into it I see that it's actually a classic outage scenario and a classic example of cascading failure resulting from a failover. It also illustrates just why you need to plan for Expendable Datacenters℠.

Background: Amazon and Their Cloud Service

I blogged about Amazon's big outage last August (source), and described how large a role Amazon plays in the cloud world. I won't recap all that here, but I will say that among its clients are Netflix, Instagram, Reddit, Foursquare, Quora, Pinterest, parts of Salesforce.com, and ZeroNines client ZenVault.

The Problem: A Power Outage

According to the Amazon status page (source), "a cable fault in the high voltage Utility power distribution system" led to a power outage at the datacenter. Primary backup generators successfully kicked in, but after about nine minutes "one of the generators overheated and powered off because of a defective cooling fan." Secondary backup power successfully kicked in, but after about four minutes a circuit breaker opened because it had been "incorrectly configured." At this point, with no power at all, some customers went completely offline. Others that were using Amazon's multi-Availability Zone configurations stayed online but seem to have suffered from impaired API calls, described below. Power was restored about half an hour after it was first lost.

Sites started recovering as soon as power was restored and most customers were back online about an hour after the whole episode began. But it is clear that many weren't really ready for business again because of the cascading effects of the initial interruption.

Subsequent Problems: Loss of In-Flight Transactions

The Amazon report says that when power came back on, some instances were "in an inconsistent state" and that they may have lost "in-flight writes." I interpret this to mean that when the system failed over to the backups, the backup servers were not synchronized with the primaries, resulting in lost transactions. This is typical of a failover disaster recovery system.
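
To see concretely how a failover can drop acknowledged work, here is a toy illustration, in Python, of a primary that acknowledges writes before its replica has copied them. This is not Amazon's implementation; the replication lag is simulated purely to show the failure mode.

```python
# Toy model of asynchronous primary/replica replication, for illustration only.
primary_log = []
replica_log = []

def write(tx, replicate_now=True):
    """The primary acknowledges the write immediately; replication is asynchronous."""
    primary_log.append(tx)
    if replicate_now:
        replica_log.append(tx)  # replication happened to keep up this time

# Normal operation: both copies agree.
write("tx-1")
write("tx-2")

# Two writes are acknowledged but not yet copied to the replica
# when the primary loses power -- these are the "in-flight" writes.
write("tx-3", replicate_now=False)
write("tx-4", replicate_now=False)

# Failover promotes the replica; the acknowledged-but-unreplicated writes vanish.
promoted = list(replica_log)
lost = [tx for tx in primary_log if tx not in promoted]
print("surviving:", promoted)   # ['tx-1', 'tx-2']
print("lost in flight:", lost)  # ['tx-3', 'tx-4']
```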

Another Subsequent Problem: Impaired API Calls

Additionally, during the power outage, API calls related to Amazon Elastic Block Store (EBS) volumes failed. Amazon sums up the effect beautifully: "The datastore that lost power did not fail cleanly, leaving the system unable to flip [failover] the datastore to its replicas in another Availability Zone." Here's a second failed failover within the same disaster.

My Compliments to Amazon EC2

In all seriousness, I truly commend Amazon for publicly posting such a detailed description of the disaster. It looks to me like they handled it quickly and efficiently within the limitations of their system. Unfortunately, that system is clearly not suited to the job at hand.

Amazon does a pretty good job at uptime. We (ZeroNines) use the Amazon EC2 cloud ourselves. But we hedge our bets by adding our own commercially available Always Available™ architecture to harden the whole thing against power outages and such. If this outage had affected our particular instances, we would not have experienced any downtime, inconsistency, failed transactions, or other ill effects.

One Solution for Three Problems

Always Available runs multiple instances of the same apps and data in multiple clouds, virtual servers, or other hosting environments. All are hot and all are active.

When the power failed in the first phase of this disaster, two or more identical Always Available nodes would have continued processing as normal. The initial power outage would not have caused service downtime because customers would have been served by the other nodes.

Secondly, those in-flight transactions would not have been lost because the other nodes would have continued processing them. With Always Available there is no failover and consequently no "dead air" when transactions can be lost.

Third, those failed EBS API calls would not have failed because again, they would have gone to the remaining fully functional nodes.

A big issue in this disaster was the "inconsistent state," or lack of synchronization between the primary and the failover servers. Within an Always Available architecture, there is no failover. Each server is continually updating and being updated by all other servers. Synchronization takes place constantly so when one node is taken out of the configuration the others simply proceed as normal, processing all transactions in order. When the failed server is brought back online, the others update it and bring it to the same logical state so it can begin processing again.
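
As a rough sketch of that behavior (my own simplification, not the Always Available implementation), picture each node keeping an ordered transaction log, with a returning node replaying whatever it missed before resuming:

```python
class Node:
    """Toy active-active node: every node applies every transaction in order,
    and a node that was offline is brought back to the same logical state by
    replaying the transactions it missed. A deliberate simplification."""

    def __init__(self, name):
        self.name = name
        self.online = True
        self.log = []  # ordered list of applied transactions

    def apply(self, tx):
        if self.online:
            self.log.append(tx)

    def rejoin(self, peer):
        """Catch up from any healthy peer, then resume normal processing."""
        missed = peer.log[len(self.log):]
        self.log.extend(missed)
        self.online = True


nodes = [Node("ec2-east"), Node("ec2-west"), Node("santa-clara")]

def broadcast(tx):
    # Every online node applies every transaction; there is no primary.
    for n in nodes:
        n.apply(tx)

broadcast("tx-1")
nodes[0].online = False      # ec2-east loses power
broadcast("tx-2")            # service continues on the other nodes
broadcast("tx-3")
nodes[0].rejoin(nodes[1])    # ec2-east comes back and is resynchronized
broadcast("tx-4")

print([len(n.log) for n in nodes])  # [4, 4, 4] -- every node at the same state
```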

The Expendable Datacenter

Another thing I can't help but point out is the string of events that caused the outage in the first place. First a cable failure combines with a fan failure, and that combines with a circuit breaker failure. It's simple stuff that adds up into a disaster. Then there's software that can't synchronize. Given the complexities of the modern datacenter, how many possible combinations of points of failure are there? Thousands? Millions? I'll go on the record and say that there is no way to map all the possible failures, and no way to guard against them all individually. It's far better to accept the fact that servers, nodes, or entire facilities will go down someday, and that you need to make the whole datacenter expendable without affecting performance. That's what ZeroNines does.
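
A back-of-the-envelope count suggests the answer is far worse than millions: if n independent components can each fail, there are 2^n - 1 distinct failure combinations. This is a simplistic model I'm using for illustration only, not a formal reliability analysis.

```python
# Toy count: if n independent components can each either work or fail,
# the number of distinct "something failed" combinations is 2**n - 1.
for n in (10, 50, 100):
    print(f"{n} components -> {2**n - 1} failure combinations")
# 10 components -> 1023 failure combinations
# 50 components -> 1125899906842623 failure combinations
# 100 components -> 1267650600228229401496703205375 failure combinations
```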

So if you're a cloud customer, take a look at ZeroNines. We can offer virtually 100% uptime whether you host in the cloud, on virtual servers, or in a typical hosted environment. And if you're a cloud provider, you can apply Always Available architecture to your service offering, avoiding disasters like this in the first place.

Check back in a few days and I'll write another post that looks at this outage from a business planning perspective.

Visit the ZeroNines website to find out more about how our disaster-proof architecture protects businesses of any description from downtime.

Alan Gin – Founder & CEO, ZeroNines

August 10, 2011

Amazon EC2 Outage: Déjà Vu All Over Again

It seems we can always rely on cloud outages to spice up the news feeds. Today, it's another Amazon EC2 Cloud outage, which is a nice departure from the wildly gyrating stock market and the U.S. debt downgrade.

I didn't write about Amazon's big April 2011 EC2 outage simply because I was overwhelmed with other work (along with texts, tweets and emails about the outage). That outage affected big-name customers like Netflix, Foursquare, HootSuite, and Reddit (source). Some EC2 customers' websites were down for as much as two days.

Then just this past weekend, an electrical storm over Dublin, Ireland led to a lightning strike on a transformer and a subsequent explosion, fire, and loss of power at an Amazon data center. Backup generators could not be started. Amazon's European EC2 service was affected for as long as twelve hours. Some Microsoft cloud services were knocked out as well (source).

I am a huge proponent of the cloud; however, I believe reliability can and should improve. As a frequent speaker and panelist at cloud-related events, I find that many in the audience are not convinced that the cloud is reliable enough to meet the needs of mission-critical applications. Outages like this don’t help. However, I am aware of several successful implementations of robust, outage-resistant cloud deployments that simply have not gotten any attention because the clients are not motivated to share how they did it with their competitors. Some of these early adopters took risks and made large investments when the mainstream would not, and they feel they deserve some advantage while they can get it. Naturally enough I think ZeroNines has the right solution, but read on for now.

Background: Amazon as a major cloud provider

Amazon EC2 is the Amazon Elastic Compute Cloud (source). It gives thousands of online service providers and software developers easy access to cloud computing capacity that scales up and down with demand, and customers pay only for what they use. Its customers include Netflix (streaming movies and TV shows), Instagram (photo sharing), Reddit (social networking for sharing news), and Foursquare (location-based social networking).

The Problem: Something's rotten in the state of Virginia

I have not found a clear statement yet that describes the exact cause of the August 8 outage, but PCMag.com says that it "closely mirrors a similar cloud outage Amazon suffered in April" (source). It also happened in the same Virginia data center. The April 2011 outage "happened after Amazon network traffic was 'executed incorrectly.' Instead of shifting to another router, traffic went to a lower-capacity network, taking down servers in Northern Virginia." (source). So Amazon loses points for allowing the same problem to happen twice in the same place, but wins a few back for apparently being ready this time and containing the August 8 outage to minutes rather than days.

The Cost: Revenue and reputation

As always with these outages, there is talk of the provider compensating its customers through waived fees and such. Mark that against Amazon's balance sheet. Customers no doubt lost business, and you can mark that against their balance sheets. Reliability issues will chase away customers who don't want to risk their own revenue on a service notorious for crashing. But if the cloud nonetheless offers the best business model, what can these customers do? Press for lower fees and more favorable service level agreements, for one.

The Solution: Prevention, not recovery

If you're an actual or potential cloud user (with any provider), Always Available™ from ZeroNines can protect your existing systems without changing providers, hardware, operating systems, or applications. If there's a disaster in any part of your system, all your networked transactions and applications continue functioning as normal on the other network nodes. Our CloudNines™ application can protect your cloud-based infrastructure, VirtualNines™ can protect virtualized environments on your own machines, and EnterpriseNines™ can add Always Available protection to any other network infrastructure. You can mix and match so all these can interoperate seamlessly. For businesses of any size, the result is uptime of virtually 100% regardless of the disasters that may strike any individual node in the Always Available array.

The cloud providers themselves could use the same CloudNines product to protect their systems, virtually eliminating downtime and avoiding headlines like Amazon's. We are currently developing and monitoring on Amazon and other cloud platforms. Our technology is certified for Windows Server® 2008, compatible with Windows Server® 2008 Hyper-V™ and Hyper-V™ Server, and certified as VMWare® ready.

Visit the ZeroNines.com website to find out more about how our disaster-proof architecture can protect businesses of any description from downtime.

Alan Gin – Founder & CEO, ZeroNines