Amazon EC2 Outage: Déjà Vu All Over Again

It seems we can always rely on cloud outages to spice up the news feeds. Today, it's another Amazon EC2 Cloud outage, which is a nice departure from the wildly gyrating stock market and the U.S. debt downgrade.

I didn't write about Amazon's big April 2011 EC2 outage simply because I was overwhelmed with other work (along with texts, tweets and emails about the outage). That outage affected big-name customers like Netflix, Foursquare, HootSuite, and Reddit (source). Some EC2 customers' websites were down for as much as two days.

Then just this past weekend an electrical storm over Dublin Ireland led to a lightning strike on a transformer and a subsequent explosion, fire, and loss of power at an Amazon data center. Backup generators could not be started. Amazon's European EC2 Service was affected for as long as twelve hours. Some Microsoft cloud services were knocked out as well (source).

I am a huge proponent of the cloud; however, I believe reliability can and should improve. As a frequent speaker and panelist at cloud-related events, I find that many in the audience are not convinced that the cloud is reliable enough to meet the needs of mission-critical applications. Outages like this don’t help. However, I am aware of several successful implementations of robust, outage-resistant cloud deployments that simply have not gotten any attention because the clients are not motivated to share how they did it with their competitors. Some of these early adopters took risks and made large investments when the mainstream would not, and they feel they deserve some advantage while they can get it. Naturally enough I think ZeroNines has the right solution, but read on for now.

Background: Amazon as a major cloud provider

Amazon EC2 is the Amazon Elastic Compute Cloud (source). It provides thousands of online service providers and software developers easy access to cloud computing capacity that is variable in size. Customers pay only for what they use. Their customers include Netflix (streaming movies and TV shows), Instagram (photo sharing), Reddit (social networking for sharing news), and Foursquare (location-based social networking).

The Problem: Something's rotten in the state of Virginia

I have not found a clear statement yet that describes the exact cause of the August 8 outage, but PCMag.com says that it "closely mirrors a similar cloud outage Amazon suffered in April" (source). It also happened in the same Virginia data center. The April 2011 outage "happened after Amazon network traffic was 'executed incorrectly.' Instead of shifting to another router, traffic went to a lower-capacity network, taking down servers in Northern Virginia." (source). So Amazon loses points for allowing the same problem to happen twice in the same place, but wins a few back for apparently being ready this time and containing the August 8 outage to minutes rather than days.

The Cost: Revenue and reputation

As always with these outages there is talk of the provider compensating its customers through waived fees and such. Mark that against Amazon's balance sheet. Customers no doubt lost business, and you can mark that against their balance sheets. Reliability issues will chase away customers who don't want to risk their own revenue with a service notorious for crashing. But if the cloud nonetheless offers the best business model, what do these customers do? Press for lower fees and more favorable service level agreements for one.

The Solution: Prevention, not recovery

If you're an actual or potential cloud user (with any provider), Always Available™ from ZeroNines can protect your existing systems without changing providers, hardware, operating systems, or applications. If there's a disaster in any part of your system, all your networked transactions and applications continue functioning as normal on the other network nodes. Our CloudNines™ application can protect your cloud-based infrastructure, VirtualNines™ can protect virtualized environments on your own machines, and EnterpriseNines™ can add Always Available protection to any other network infrastructure. You can mix and match so all these can interoperate seamlessly. For businesses of any size, the result is uptime of virtually 100% regardless of the disasters that may strike any individual node in the Always Available array.

The cloud providers themselves could use the same CloudNines product to protect their systems, virtually eliminating downtime and avoiding headlines like Amazon's. We are currently developing and monitoring on Amazon and other cloud platforms. Our technology is certified for Windows Server^® 2008, compatible with Windows Server^® 2008 Hyper-V™ and Hyper-V™ Server, and certified as VMWare^® ready.

Visit the ZeroNines.com website to find out more about how our disaster-proof architecture can protect businesses of any description from downtime.

Alan Gin – Founder & CEO, ZeroNines

August 10, 2011