August 2, 2012

How Short Outages Become Long Outages

Early in the morning of Friday July 27, 2012, Hosting.com experienced an 11-minute outage. Although service was restored very quickly, many customers weren't prepared and experienced hours of downtime as a result (source).

The key story here is that even though a few minutes of hosting-provider downtime probably falls well within the parameters of the service level agreement (SLA), the customers' actual downtime far exceeded it. I'm going to quote my own blog from just a couple of weeks ago because it sums up the situation accurately:

Your cloud (or other hosting) provider no doubt promises a certain amount of uptime in their service level agreement. Let's imagine that allows one hour of downtime per year. If they have one minor problem it could cause downtime of just a few minutes. But if your systems are not prepared, that interruption could corrupt databases, lose transactions in flight, crash applications, and wreak all manner of havoc. Their downtime glitch will become your costly business disaster unless you are prepared in advance to control it on your end (source).
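
To put that hypothetical in perspective, here is a minimal sketch of the arithmetic, using purely illustrative numbers drawn from the Hosting.com incident (an 11-minute provider outage versus roughly five hours of customer recovery):

```python
# Illustrative arithmetic only: how the same incident looks when measured as
# provider downtime (11 minutes) versus customer recovery time (~5 hours).

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def availability(downtime_minutes, period_minutes=MINUTES_PER_YEAR):
    """Availability (as a percentage) implied by a given amount of downtime."""
    return 100.0 * (1 - downtime_minutes / period_minutes)

provider_outage = 11        # minutes of downtime the provider measures
customer_recovery = 5 * 60  # minutes an unprepared customer is actually down

print(f"Provider-measured availability:    {availability(provider_outage):.3f}%")    # ~99.998%
print(f"Customer-experienced availability: {availability(customer_recovery):.3f}%")  # ~99.943%
```

The provider's number still looks like a healthy SLA figure; the customer's number is what actually shows up on the balance sheet.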

Hosting.com's Service Level Agreement is posted publicly on their website (source). A quick read reveals that it does NOT promise 100% uptime. Datacenters fail, and that's a fact of life. When signing with any hosting or cloud provider, it is vital that you understand exactly who is responsible for what, and whether total downtime is measured according to the unavailability of their infrastructure or the amount of time it takes you to recover.

Recently, the Paris-based International Working Group on Cloud Computing Resiliency (IWGCR) found that costs for outages between 2007 and 2011 among the 13 providers they reviewed exceeded $70 million (source). No SLA from any provider is going to compensate for those kinds of losses; if the industry demanded that, no hosting provider would be able to stay in business. Customers are far better off investing in reliability than expecting dollar-for-dollar restitution after a disaster.

Background: Hosting.com and its Customers

According to the company website, "Hosting.com is a next generation cloud hosting and recovery services company focused on ensuring your mission-critical applications are AlwaysOn™" (source). They are a leading provider of other enterprise hosting solutions and services as well, with datacenters in Dallas, Denver, Irvine, Louisville, Newark, and San Francisco. One source says they host over 65,000 websites (source). Their customers include financial services, healthcare, media, retail, software-as-a-service (SaaS) providers, and content distribution networks (CDNs) (source).

In contrast with other recent outages, where providers' explanations were late or non-existent, Hosting.com CEO Art Zeile stepped up very quickly during this crisis and alerted customers to the problem, its cause, and its effects. Customers won't be thrilled with news like this, but they need clear communication and honesty from their providers. That way they know what to tell their own customers and management, and their overworked internal IT teams have a better chance of taming the chaos. I applaud Mr. Zeile and his actions; the cloud industry at large needs this level of leadership.

The Problem: Human Error, a Power Outage, and a Chain Reaction

Mr. Zeile explained that "An incorrect breaker operation sequence executed by the servicing vendor caused a shutdown of the UPS plant resulting in loss of critical power to one data center suite within the [Newark, Delaware] facility" (source). The power was back on within 11 minutes, but "customer web sites were offline for between one and five hours as their equipment and databases required more time to recover from the sudden loss of power."

I wasn't there but I can surmise what happened. When the power went out, an unspecified number of servers were shut off without proper shutdown procedures. Applications and databases were abruptly terminated. Other applications and databases that depended upon them suddenly lost transactions in flight. They crashed too, taking down other apps and databases in turn. And so on down the line in a classic cascading failure scenario.
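
For readers who haven't watched one of these unfold, here is a toy sketch of that cascade. The service names and dependencies are invented purely for illustration:

```python
# A toy model of a cascading failure: when one component dies abruptly,
# everything that depends on it (directly or indirectly) goes down in turn.
# Service names and dependencies are invented for illustration.

depends_on = {
    "web_app":       ["order_service"],
    "order_service": ["database"],
    "billing":       ["database"],
    "database":      [],
}

def cascade(initially_failed, deps):
    """Return the full set of services that end up down."""
    down = set(initially_failed)
    changed = True
    while changed:
        changed = False
        for service, requirements in deps.items():
            if service not in down and any(r in down for r in requirements):
                down.add(service)
                changed = True
    return down

print(cascade({"database"}, depends_on))
# -> database, order_service, billing, web_app (everything, in this toy example)
```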

Recovery of the customers' crashed apps and databases required hours. Each customer needed its own data and apps restored, and those that were still running probably had to be shut down and then restarted in the proper sequence. Servers had to be checked for damage after their "crash" shutdowns. Apps and data that successfully cut over or failed over to secondaries had to be cut over again, from the secondaries back to the primaries, and I'll bet there were further failures as that happened.

The Solution: Make Your Datacenters Expendable

Many of the apps and data on the system were undoubtedly protected by failover and backup recovery architecture, or by one of the Hosting.com business continuity solutions. Many of these certainly continued running as they successfully failed over to their secondaries. But equally clear is that apps and data for about 1,100 customers (1.7% of the total Hosting.com customer base) did not continue running. Either they were not equipped with adequate business continuity systems, or the failovers failed. One writer quotes Zeile as saying that although Hosting.com offers a backup option "few customers, at the affected location, had elected to purchase it" (source).

I am unaware of any hosting or cloud provider that publicly promises 100% uptime. So the customer must expect some amount of downtime, if only for maintenance. Logically, then, customers need to provide adequate business continuity systems to protect themselves.

Datacenters go offline all the time for any number of reasons. Thus, your business needs to be able to continue talking to customers, sending billing statements, shipping goods, and paying creditors despite untoward events like power outages, fires, human error, hardware failure, and so forth.

ZeroNines does not recommend or use a failover- or backup-based recovery paradigm. We take a different approach aimed at preventing downtime in the first place, rather than recovering from it afterward. In the case of the Hosting.com outage, Always Available™ architecture from ZeroNines offers two solution scenarios:
1)  Hosting.com already operates multiple geographically separated datacenters. Always Available architecture would allow processing to continue on any or all of the remaining five when any one of them goes down.
2)  Hosting.com customers could deploy their own Always Available array that would simultaneously replicate all transactions and data on other Hosting.com datacenters, or in other clouds or with other providers.

In either case, the end user experiences no downtime because the remaining nodes continue processing as usual. The offending datacenter simply drops out of the array until power is restored or until your staff can repair it. The other nodes of the Always Available array will update the damaged node once it is functioning again, and bring it to an identical logical state.

Visit the ZeroNines website to find out more about how our disaster-proof architecture protects businesses of any description from downtime.

Alan Gin – Founder & CEO, ZeroNines

July 30, 2012

Azure and Google and Twitter, Oh My!

The flying monkeys attacked in force on Thursday July 26, 2012, taking down cloud leaders Microsoft, Twitter, and Google Talk. It seems the cloud is no Yellow Brick Road after all, guiding merry executives to some imagined Oz where the sun always shines and outages never happen.

The more I talk to IT planners, the more I find they are looking at reinvesting their cloud savings into business continuity. They rightly hope to compete on reliability, and to protect their businesses by exchanging the extremely high and unpredictable costs of outages for the predictable, lower costs of the cloud plus business continuity. They've clearly got the right idea, especially when you consider the noise that outages like Thursday's can make.

Microsoft Azure
  • Area affected: Europe, via the Dublin datacenter and Amsterdam facility.
  • Duration: About two and a half hours.
  • Cause: Unspecified, but one expert suspects infrastructure troubles.
  • Effects: Loss of cloud service throughout Western Europe. Businesses like SoundGecko were unavailable.
  • Source: WebTechInfo.com
"In Azure’s case on Thursday, the constant availability of power and lack of a software culprit, such as the Feb. 29 one that downed several services, points to 'more of an infrastructure issue' where some undetected single point of failure in a network switch or other device temporarily disrupted availability, said Chris Leigh-Currill, CTO of Ospero, a supplier of private cloud services in Europe." (source)

Google Talk
  • Area affected: Worldwide (source).
  • Duration: About five hours.
  • Cause: Unspecified, but one expert suspects a bad hardware or software upgrade.
  • Effects: System unusable, granting access but providing only error messages.
  • Source: TechNewsWorld.com
"Outages like this 'often happen as the result of a hardware or software upgrade that wasn't properly tested before installation,' Rob Enderle, principal analyst at the Enderle Group, told TechNewsWorld" (source).

Twitter
  • Area affected: Worldwide.
  • Duration: About an hour and possibly more in some areas.
  • Cause: Datacenter failure and a failed failover.
  • Effects: "Users around the world got zilch from us."
  • Source: CNN
"Twitter's vice president of engineering, Mazan Rawashdeh… blamed the outage on the concurrent failure of two data centers. When one fails, a parallel system is designed to take over -- but in this case, the second system also failed, he said" (source).

Am I Repeating Myself? Am I Repeating Myself?

My last blog post was titled "A Flurry of July Outages – And All of Them Preventable". On Thursday we had another flurry, and all of them in the cloud. And I have to say again, with caution, that these too were preventable. I am cautious because we don't know for sure whether Mr. Leigh-Currill and Mr. Enderle are correct in their assumptions about the Azure and Google service interruptions, but if they are, then these outages were almost certainly preventable.

And the Twitter outage was clearly a case of failover failing us once again. That's what it is when "a parallel system that is designed to take over" fails to do so when the primary goes down. It's the same story we see over and over again.

After all these years in the industry I am still amazed that the biggest tech names in the world continue to rely on the ancient failover paradigm. The leaders of these organizations are trusting their reputations, their revenue, their shareholders' profits, their customers' businesses, and potentially people's lives to a disaster recovery process that quite likely won't work.

What Some Companies are Already Doing About It

Earlier in this post I mentioned the companies that are reinvesting cloud savings into reliability. These are typically smaller, more agile companies. They'll be able to take on the tech behemoths based on reliability, because they are thinking beyond mere cost savings and efficiency. They see the cloud as a means of creating their own "failsafe" hosting paradigms.

As most readers of this blog know, Always Available™ technology from ZeroNines replaces failover and backup-based recovery. It is already cloud-friendly. The Twitter outage is exactly the kind of thing we prevent. Companies hosting an Always Available array on Azure would have had virtually no risk of downtime, because all network transactions would have continued processing on other nodes. Always Available prevents data disasters before they happen, whether a power supply fails, or software gets corrupted, or a tornado picks up your Kansas datacenter and relocates it to Munchkinland, or someone melts your servers with a bucket of water, or the flying monkeys carry off the last of your support staff. Why clean up after an expensive disaster if you can prevent it in the first place?

Visit the ZeroNines website to find out more about how our disaster-proof architecture protects businesses of any description from downtime.

Alan Gin – Founder & CEO, ZeroNines

July 18, 2012

A Flurry of July Outages – And All of Them Preventable

It's starting to look like the Amazon outages in June were only the beginning.

A number of spectacular datacenter failures have made the news just in the past week. First one hit, and I decided to blog about it. Then another. Then another. So here's a digest of all of them. Note that all three disasters centered on lost power, either at the utility or within the providers' own facilities. Combine these with the two power-related Amazon outages on June 15 and June 29 and we can see a disturbing trend.

Level 3 Communications, July 10, 2012
  • Facility affected: Central London datacenter
  • Duration: Approximately six hours
  • Cause: Loss of A/C power "to the content delivery network equipment and customer colocation"
  • Effects: At least fifty companies went offline directly as a result, and an unspecified number of other companies that use the datacenter for connectivity and hosting also lost service.
  • Source: ZDNet.com
With this one, it looks like power from the utility provider failed. Reading between the lines, I surmise that two diesel backup generators kicked in, but an uninterruptible power supply failed (dare I say "was interrupted"?). As this ZDNet author sums it up, "Because the Braham Street facility is a major connectivity point for Level 3, companies that use services that plug into the transit provider were also severely affected." So direct customers were knocked offline, and so were their customers in turn. For example, colocation provider Adapt had to alert their customers that service was unavailable. The most telling quote comes from Justin Lewis, operations director for Adapt: "When I saw this I was very surprised — this is not a normal event by any means… You would not expect to have a total failure of this nature in a datacentre."

While it's true that a total failure of a datacenter is unusual, partial failures related to power outages happen all the time, as with Amazon. And guess what happened the very next day…

Shaw Communications, July 11, 2012
  • Facility affected: Shaw Communications HQ and IBM datacenter, downtown Calgary
  • Duration: Two+ days, with lingering effects
  • Cause: Transformer explosion and fire
  • Effects: Hospital datacenter outage, cancellation of 400+ surgeries and medical procedures, inability of the populace to reach 911 emergency services via landline, inability to reach city services via phone, unspecified business site/service outages, and loss of online motor-vehicle and land-title services.
  • Source: Datacenter Dynamics Focus
This one should scare all of us because it illustrates the depth of business and social disaster that can stem from a single-point-of-failure system. You really have to read the whole article to get a feel for the extent of the impact. There is no mention in this article of human injuries or fatalities, so I am optimistically assuming there were none.

Shaw Communications is "one of Canada's largest telcos." On Wednesday the 11th, an explosion and fire disrupted all services at the datacenter, which serves medical centers, businesses, emergency phone service, and several city services.

The big picture is beautifully summed up by DatacenterKnowledge.com:

The incident serves as a wake-up call for government agencies to ensure that the data centers that manage emergency services have recovery and failover systems that can survive [a] series of adversities – the “perfect storm of impossible events” that combine to defeat disaster management plans (source).

Salesforce.com, July 12, 2012
  • Facility affected: The West Coast Datacenter run by Equinix
  • Duration: Approximately seven hours, with performance issues for several days
  • Cause: Loss of power during maintenance, and apparent additional failures
  • Effects: Salesforce.com customers were unable to use the service or experienced poor performance.
  • Source: Information Week
It sounds like the actual power outage was brief, but that it caused ancillary problems. "Equinix company officials acknowledged their Silicon Valley data center had experienced a brief power outage and said some customers may have been affected longer than the loss of power itself" (emphasis mine). If power comes back on and services don't, that's a failed failover and/or cascading software failures. I offer this quote as further proof: "Standard procedures for restoring service to the storage devices were not successful and additional time was necessary to engage the respective vendors to further troubleshoot and ensure data integrity" (source).

What's the REAL Cause?

Loss of power or power systems was the key instigating factor in all three outages. But power loss is a known threat that is supposed to be guarded against, so secondary and backup systems should have prevented service downtime and business disasters once the power outage was under way. Granted, there will occasionally be combinations of failures that cannot be foreseen or prevented.

So I assert that the real cause of business disasters like these is not blown transformers and bad utility service but insufficient preparation. Any one datacenter is vulnerable to these occasional "black swan" events and you have to expect it to be disabled somewhere along the line. The power may go out, but it is up to you to prevent the business disaster.

In order to maintain service, any given datacenter must be expendable, so you and your customers can carry on without it until it is fixed.

Forget about failover. It fails as often as it succeeds, as we can see with Salesforce.com above. And I'd bet my socks that the Level 3 and Shaw Communications outages also featured failovers that didn't work.

Locating everything in one building is just plain irresponsible. That is a carryover from a past era, when geographical separation was dreamt of but not practical. One fire or power outage and entire systems are gone.

An MSP Solution from ZeroNines

Always Available™ technology from ZeroNines could easily have prevented application, data, and service downtime in each of these three disasters. It combines distant geographical separation, redundant/simultaneous processing of all data and transactions, and interoperability between systems to enable uptime in excess of five nines (99.999%), regardless of what happens at any given location.

In conjunction with a major network infrastructure provider, we have recently rolled out an Always Available solution specifically for managed service providers. So if you're a provider like Adapt (see the Level 3 story above), your service should remain fully available even if one datacenter melts down completely. You're no longer dependent on the talents and equipment of your datacenter provider to support the SLAs you sign with your customers.

Visit the ZeroNines website to find out more about how our disaster-proof architecture protects businesses of any description from downtime.

Alan Gin – Founder & CEO, ZeroNines

July 12, 2012

The Cost of Cloud Outages and Planning for the Next One

At least one customer is publicly abandoning the Amazon EC2 cloud after two power-related outages within a month. Online dating site WhatsYourPrice.com is walking out, going back to a more traditional local hosting provider (source). And so the heat is still on in the East, with high temperatures making life difficult and the Amazon cloud beginning to lose customers.

What is the Real Cost of Cloud Outages?

This question is difficult to answer, as little hard data is available. Cloud providers and cloud customers keep the financial numbers closely guarded, whether we are talking about the customer's lost business and recovery costs, or Amazon's losses due to lost revenue, restitution paid to customers, customer attrition, and the increased difficulty of acquiring new customers.

Coincidentally, a report on cloud outages among top providers came out on June 22 2012. In it, the Paris-based International Working Group on Cloud Computing Resiliency (IWGCR) claims that costs for outages between 2007 and 2011 among the 13 providers they reviewed totaled $70 million. Their estimates were based on "hourly costs accepted in the industry" (source).

Downtime and availability rates are reported for Amazon Web Services (AWS), Microsoft, Research in Motion (RIM), and others. Total downtime was 568 hours, and availability was 99.917%, nowhere near the five nines (99.999%) that is becoming the de facto target for acceptable uptime.

Although it's useful to put a line in the sand and publish studies on the cost of outages, such surveys are virtually impossible to do accurately. I don't think even big analysts like Gartner can get a good view into the real costs. Unfortunately, the article does not make clear whether the $70 million was the cost to providers, to customers, or both. The sample size was also small, and apparently there is no information about the size of the affected customers. I know first-hand that many of our clients put their downtime costs at $6 million per hour at the low end, and at $18-24 million per hour on average. And apparently only outages that made the news were included in the report, which leaves a lot of actual downtime out of the equation: small glitches and maintenance windows that journalists never hear about. For all these reasons I believe the actual costs must be considerably higher.

Despite the unavoidable barriers to accurate measurement, such studies are still valuable because they highlight the bottom-line impacts, and because they demonstrate just how difficult it is to estimate the cost of downtime.

40% of Cloud Users are Not Prepared

"Light, medium, and heavy cloud users are running clouds where on average 40 percent of their cloud — data, applications, and infrastructure — is NOT backed up and exposed to outage meltdown" (source).

This was said a couple of weeks ago by Cameron Peron, VP of Marketing at Newvem, a cloud optimization consultancy that specializes in the Amazon cloud. He was referring to his company's clients and the June 15 outage.

If the average cloud customer is anything like these companies then it is no wonder that cloud outages are such a concern. Another writer referred to this kind of planning (or lack of planning) as "stupid IT mistakes" (source).

Who's at Fault?

So when a cloud customer experiences downtime and loses money, who is actually to blame? The cloud provider who failed to deliver 100% uptime, or the cloud customer who was unprepared for the unavoidable downtime?

According to Peron, "Amazon doesn’t make any promises to back up data... The real issue is that many users are under the impression that their data is backed up… but in fact it isn’t due to mismanaged infrastructure configuration." (source)

Cloud customers need to be prepared to use best practices for data protection and disaster prevention/recovery. They need to remember that a cloud is just a virtual datacenter. It is a building crammed full of servers, each of which is home to a number of virtual servers. And all of it is subject to the thousand natural (and unnatural) shocks that silicon is heir to.

Cloud Customers Need to Take Responsibility for Continuity

So here's some friendly advice to WhatsYourPrice.com and others like them: whatever hosting model you choose, get your disaster plan in place. An outage is out there waiting for you, in the form of a bad cooling fan, corrupt database, fire, flood, or human error, whether you're in the cloud, on your own virtual servers, or with a local hosting provider.

Best practices and DR discipline should not be taken for granted simply because the datacenter is outsourced. Many companies that I meet with have gone to the cloud to cut costs, and many of those are reinvesting their savings into providing higher availability. They're looking ahead, trying to avoid disasters, outcompete based on performance, and support customer satisfaction.

Even before the advent of the cloud, new generations of low-cost compute models enabled disaster recovery standards that could prevent a lot of downtime. But they are often poorly executed or ignored altogether. And now, with it being so easy to outsource hosting to "the cloud", it is even easier for companies to shake off responsibility for business continuity, assuming or hoping the folks behind the curtain will take care of everything.

I say that the primary responsibility for surviving outages is the customer's. If you're providing a high-demand service you need to be ready to deliver. It doesn't matter if you can legitimately blame your provider after a disaster; your customers will blame you.

Your cloud (or other hosting) provider no doubt promises a certain amount of uptime in their service level agreement. Let's imagine that allows one hour of downtime per year. If they have one minor problem it could cause downtime of just a few minutes. But if your systems are not prepared, that interruption could corrupt databases, lose transactions in flight, crash applications, and wreak all manner of havoc. Their downtime glitch will become your costly business disaster unless you are prepared in advance to control it on your end.

It's like a tire blowing out on a car; the manufacturer may be responsible to a degree for a wreck, but if your seatbelt was not fastened then all bets are off. Safety systems are there for a reason.

Make it so Any Datacenter is Expendable

ZeroNines offers a solution that enables the complete loss of any datacenter without causing service downtime. We believe that if WhatsYourPrice.com had been using our Always Available™ architecture, their dating service would have continued to operate at full capacity for the duration of the outage, with no impact to the customer experience.

Visit the ZeroNines website to find out more about how our disaster-proof architecture protects businesses of any description from downtime. You may also be interested in our whitepaper "Cloud Computing Observations and the Value of CloudNines™".

Alan Gin – Founder & CEO, ZeroNines

July 5, 2012

Multi-Region Disasters and Expendable Datacenters℠

As I write this, the United States is suffering from a frightening heat wave and the lingering effects of storms that threatened everything east of the Rockies. Over two dozen lives have been lost and the heat might last for several more days. The transportation, emergency response, and utility infrastructures are badly strained; about a million customers are still without power. Major fires are burning in the west. It is distressing to think about this kind of multi-region disaster, but it is happening now and continues to unfold.

And as trivial as it feels to write this, these natural disasters led to another outage of the Amazon EC2 cloud on June 29. This one is very similar to the EC2 outage on June 15, which I blogged about last week (source). Friday's outage happened at the same Virginia datacenter and was caused by the same kind of event.

The Problem: Another Power Outage and Generator Failure

As happened a couple of weeks ago, a storm-induced power outage at the utility company on Friday forced a switchover to Amazon's backup generator, and that generator failed (source). Netflix, Instagram, Pinterest, and others began to experience difficulties and outages. The problems lasted about an hour.

But rather than dissect this one outage, let's take a look at the larger issues surrounding downtime in the cloud.

Cloud Users Must Plan their Own Disaster Recovery

Wired.com had this to say about Friday's event:

In theory, big outages like this aren’t supposed to happen. Amazon is supposed to keep the data centers up and running – something it has become very good at… In reality, though, Amazon data centers have outages all the time. In fact, Amazon tells its customers to plan for this to happen, and to be ready to roll over to a new data center whenever there’s an outage. (source)

Long and short, cloud customers think that they have shed their responsibility for business continuity and handed it to the cloud provider. They're wrong, and Amazon has apparently admitted as much by telling its customers to make their own disaster recovery preparations.

"Stupid IT Mistakes"

Those are the lead words in the title of an article about the June 15 outage (source). In it, the author refers to statistics from cloud optimization firm Newvem that show that 40% of cloud users are not properly prepared for an outage. They don't have any kind of redundancy: they don't back up their data and they deploy to only one region. Frighteningly, this includes large companies as well as small.

Promoting a Failure of a Recovery Plan

Another problem is that Amazon has apparently told its customers to "be ready to roll over to a new data center" (source). This is tacit approval of failover-based disaster recovery systems. But as we saw with the June 15 outage, failovers fail all the time and cannot be relied upon to maintain continuity. In fact, they often contribute to outages.

As for regular backups, that's always a good idea. But a backup location can fail too, particularly if it is hit by the same disaster. And what happens with transactions that occurred after the last backup? Will a recovery based on these backups even succeed? And although backup may eventually get you running again, it can't prevent the costly downtime.

You Can't Prevent All Causes of Failure

I argue again and again that there is no way to prevent all the thousands of small and large errors that can conspire (singly or in combination) to knock out a datacenter, cloud node, or server. Generators, power supplies, bad cables, human error and any number of other small disasters can easily combine to make a big disaster. It's not practical to continually monitor all of these to circumvent every possible failure. IT is chaos theory personified; do all you can, but something's going to break.

Geographic Issues

As we are seeing this week, one disaster or group of disasters can span vast geographic areas. You need to plan your business continuity system so the same disaster can't affect everything. Companies that have located all their IT on the Eastern Seaboard should be sweating it this week, because it's conceivable that the heat wave and storms could cause simultaneous power outages from New York to Virginia to Florida. A primary site, failover site, and backup location could all go down at the same time.

The Real Solution: Geographically Separated Expendable Datacenters℠

Here at ZeroNines we've constructed our business continuity solution around a number of tenets, including:
  • Service must continue despite the complete failure of any one datacenter.
  • Geographical separation is key, to prevent one disaster from wiping out everything.
  • Failover is not an option, because it is extremely unreliable.
  • The solution must be easy and affordable so that the "40%" mentioned above actually use it.
Based on all this, we've developed Always Available™ architecture that enables multiple instances of the same applications, data, and transactions to run equally and simultaneously in multiple locations hundreds or thousands of miles apart. The system does not rely upon or include either failover or restoration from backup. Best of all, an entire datacenter could go offline at any moment and all transactions will simply continue to be processed at other datacenters with no interruption to service. It is affordable, OS agnostic, and operates with existing apps, databases, and infrastructures.

ZeroNines client ZenVault uses Always Available, and they host on Amazon EC2. During these outages, no extraordinary measures were needed: when the EC2 East node goes offline, the two other nodes (in Santa Clara and EC2 West) continue running the service, and they restore the EC2 East node once it comes back online. ZenVault has had true 100% uptime since the day it launched in 2010.

Visit the ZeroNines website to find out more about how our disaster-proof architecture protects businesses of any description from downtime.

Alan Gin – Founder & CEO, ZeroNines

June 28, 2012

Outage at Amazon EC2 Virginia Illustrates the Value of the Expendable Datacenter℠

The Amazon EC2 cloud had a relatively minor outage a couple of weeks ago, on June 14, 2012. As it turns out, it happened in the same Virginia datacenter that spawned the April 2011 and August 2011 outages. I've been on the road, but now that I look into it I see that it's actually a classic outage scenario and a textbook example of cascading failure resulting from a failed failover. It also illustrates just why you need to plan for Expendable Datacenters℠.

Background: Amazon and Their Cloud Service

I blogged about Amazon's big outage last August (source), and described how large a role Amazon plays in the cloud world. I won't recap all that here, but I will say that among its clients are Netflix, Instagram, Reddit, Foursquare, Quora, Pinterest, parts of Salesforce.com, and ZeroNines client ZenVault.

The Problem: A Power Outage

According to the Amazon status page (source), "a cable fault in the high voltage Utility power distribution system" led to a power outage at the datacenter. Primary backup generators successfully kicked in, but after about nine minutes "one of the generators overheated and powered off because of a defective cooling fan." Secondary backup power successfully kicked in, but after about four minutes a circuit breaker opened because it had been "incorrectly configured." At this point, with no power at all, some customers went completely offline. Others that were using Amazon's multi-Availability Zone configurations stayed online but seem to have suffered from impaired API calls, described below. Power was restored about half an hour after it was first lost.

Sites started recovering as soon as power was restored and most customers were back online about an hour after the whole episode began. But it is clear that many weren't really ready for business again because of the cascading effects of the initial interruption.

Subsequent Problems: Loss of In-Flight Transactions

The Amazon report says that when power came back on, some instances were "in an inconsistent state" and that they may have lost "in-flight writes." I interpret this to mean that when the system failed over to the backups, the backup servers were not synchronized with the primaries, resulting in lost transactions. This is typical of a failover disaster recovery system.
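
Here is a minimal sketch of how that happens, assuming a hypothetical primary that replicates to its backup asynchronously. This illustrates the general failover problem, not Amazon's actual replication mechanism:

```python
# Hypothetical illustration of lost "in-flight writes" under asynchronous
# replication: the backup always lags the primary by a couple of writes, so
# whatever has not yet shipped is lost when the primary dies and we fail over.

primary_log = []
secondary_log = []
REPLICATION_LAG = 2  # number of recent writes the secondary has not yet received

def write(txn):
    primary_log.append(txn)
    # Asynchronous replication: ship everything except the most recent writes.
    secondary_log[:] = primary_log[:len(primary_log) - REPLICATION_LAG]

for i in range(5):
    write(f"txn-{i}")

# The primary loses power; the system fails over to the secondary.
lost_in_flight = primary_log[len(secondary_log):]
print("Secondary resumes with:", secondary_log)    # ['txn-0', 'txn-1', 'txn-2']
print("Lost in-flight writes:",  lost_in_flight)   # ['txn-3', 'txn-4']
```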

Another Subsequent Problem: Impaired API Calls

Additionally, during the power outage, API calls related to Amazon Elastic Block Store (EBS) volumes failed. Amazon sums up the effect beautifully: "The datastore that lost power did not fail cleanly, leaving the system unable to flip [failover] the datastore to its replicas in another Availability Zone." Here's a second failed failover within the same disaster.

My Compliments to Amazon EC2

In all seriousness, I truly commend Amazon for publicly posting such a detailed description of the disaster. It looks to me like they handled the disaster quickly and efficiently within the limitations of their system. Unfortunately that system is clearly not suited to the job at hand.

Amazon does a pretty good job at uptime. We (ZeroNines) use the Amazon EC2 cloud ourselves. But we hedge our bets by adding our own commercially available Always Available™ architecture to harden the whole thing against power outages and such. If this outage had affected our particular instances, we would not have experienced any downtime, inconsistency, failed transactions, or other ill effects.

One Solution for Three Problems

Always Available runs multiple instances of the same apps and data in multiple clouds, virtual servers, or other hosting environments. All are hot and all are active.

When the power failed in the first phase of this disaster, two or more identical Always Available nodes would have continued processing as normal. The initial power outage would not have caused service downtime because customers would have been served by the other nodes.

Secondly, those in-flight transactions would not have been lost because the other nodes would have continued processing them. With Always Available there is no failover and consequently no "dead air" when transactions can be lost.

Third, those failed EBS API calls would not have failed because again, they would have gone to the remaining fully functional nodes.

A big issue in this disaster was the "inconsistent state," or lack of synchronization between the primary and the failover servers. Within an Always Available architecture, there is no failover. Each server is continually updating and being updated by all other servers. Synchronization takes place constantly so when one node is taken out of the configuration the others simply proceed as normal, processing all transactions in order. When the failed server is brought back online, the others update it and bring it to the same logical state so it can begin processing again.
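
To make the contrast concrete, here is a highly simplified sketch of that kind of active-active arrangement. This is illustrative pseudocode of the general idea, not ZeroNines' actual implementation; the node names are borrowed from the ZenVault example mentioned elsewhere in this blog:

```python
# A simplified illustration of an active-active array: every online node
# applies every transaction, a failed node simply drops out, and it is
# replayed back to the same logical state when it returns.

class Node:
    def __init__(self, name):
        self.name = name
        self.online = True
        self.log = []                    # ordered transactions applied so far

class ActiveActiveArray:
    def __init__(self, names):
        self.nodes = {n: Node(n) for n in names}
        self.history = []                # global ordered transaction history

    def process(self, txn):
        self.history.append(txn)
        for node in self.nodes.values():
            if node.online:              # offline nodes are skipped, not waited on
                node.log.append(txn)

    def fail(self, name):
        self.nodes[name].online = False

    def recover(self, name):
        node = self.nodes[name]
        node.online = True
        node.log = list(self.history)    # bring it back to an identical logical state

array = ActiveActiveArray(["EC2-East", "EC2-West", "Santa Clara"])
array.process("txn-1")
array.fail("EC2-East")                   # e.g. the power outage described above
array.process("txn-2")                   # service continues on the other nodes
array.recover("EC2-East")                # resynchronize once power is restored
assert array.nodes["EC2-East"].log == ["txn-1", "txn-2"]
```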

The Expendable Datacenter

Another thing I can't help but point out is the string of events that caused the outage in the first place. First a cable failure combines with a fan failure, and that combines with a circuit breaker failure. It's simple stuff that adds up into a disaster. Then software that can't synchronize. Given the complexities of the modern datacenter, how many possible combinations of points of failure are there? Thousands? Millions? I'll go on the record and say that there is no way to map all the possible failures, and no way to guard against them all individually. It's far better to accept the fact that servers, nodes, or entire facilities will go down someday, and that you need to make the whole datacenter expendable without affecting performance. That's what ZeroNines does.

So if you're a cloud customer, take a look at ZeroNines. We can offer virtually 100% uptime whether you host in the cloud, on virtual servers, or in a typical hosted environment. And if you're a cloud provider, you can apply Always Available architecture to your service offering, avoiding disasters like this in the first place.

Check back in a few days and I'll write another post that looks at this outage from a business planning perspective.

Visit the ZeroNines website to find out more about how our disaster-proof architecture protects businesses of any description from downtime.

Alan Gin – Founder & CEO, ZeroNines

March 26, 2012

BATS: “When the Exchange Blows Itself Up, That’s Not a Good Thing”

I can imagine scenarios more spectacular than the BATS failure and IPO withdrawal on Friday, but most of them involve alien invasion, chupacabras, or the expiration of the Mayan calendar.

Every once in a while a technology failure casts such a light (or perhaps a shadow) on how we do things that people actually stop and wonder aloud if the entire paradigm is too flawed to succeed. I don't mean that they look at alternative solutions like they do every time RIM goes down. They actually begin to question the soundness of the industry sector itself and the technological and market assumptions on which it is founded. Take for example this quote about BATS from a veteran equities professional:

"This tells you the system we’ve created over the last 15 years has holes in it, and this is one example of a failure,” Joseph Saluzzi, partner and co-head of equity trading at Themis Trading LLC in Chatham, New Jersey, said in a phone interview. “When the exchange blows itself up, that’s not a good thing.” The malfunctions will refocus scrutiny on market structure in the U.S., where two decades of government regulation have broken the grip of the biggest exchanges and left trading fragmented over as many as 50 venues. BATS, whose name stands for Better Alternative Trading System, expanded in tandem with the automated firms that now dominate the buying and selling of American equities. The withdrawal also raises questions about the reliability of venues formed as competitors to the New York Stock Exchange and Nasdaq Stock Market since the 1990s. [source]

When High Frequency Trading Crashes

BATS was initially built to service brokers and high frequency trading firms [source]. Their Friday crash mirrors a scenario I speculated about in this blog back in May 2010 in regard to an 80-minute TD Ameritrade outage:

Some banks, hedge funds, and other high-power financial firms engaged in High Frequency Trading (HFT) make billions of trades a day over ultra-high speed connections [source]. Many trades live for only a few seconds. Enormous transactions are conceived and executed in half a second, with computers evaluating the latest news and acting on it well before human traders even know what the news is. HFT is having a significant effect on markets; there is evidence that the history-making “Flash Crash” of May 6 2010 was caused and then largely corrected by High Frequency Trading [source]. What would happen if one of these HFT systems was down for an hour and a half? Or even just a minute? [source]

To answer my own question, apparently IPOs can be cancelled, a faulty 100-share trade can halt the trading of a leading stock, and regulators will jump like you stuck them with a pin.

About BATS

BATS Global Markets, Inc. (http://www.batstrading.com) describes itself in its prospectus as primarily a technology company [source]. The BATS exchanges represent an enormous 11 to 12 percent of daily market volume [source]. BATS is backed by Citigroup Inc, Morgan Stanley, Credit Suisse Group [source] and several other large financial firms. BATS planned to go public on Friday March 23, 2012 by selling their own shares on their own exchange that runs on their own technology.

What Happened

To my knowledge, BATS has not issued a definitive statement on the exact technological problem. Here is the best summary I could come up with:

A BATS computer that matches orders in companies with ticker symbols in the A-BFZZZ range "encountered a software bug related to IPO auctions" at 10:45 a.m. New York time [source].

This prevented the exchange from trading its own shares. Then…

A single trade for 100 shares executed on a BATS venue at 10:57 a.m. briefly sent Apple down more than 9 percent to $542.80 [source].

Apple's ticker is AAPL so it was within the group affected by the bug. By the time it was over, BATS shares had dropped from about $16.00 to zero and they withdrew their IPO. All trades in the IPO, Apple, and other affected equities were canceled.

Service Uptime: 99.998% Won't Cut It.

BATS claims it achieved very high uptime in 2011: 99.94% for its BZX exchange and 99.998% for its BYX exchange [source]. This equates to about 5.25 hours of downtime for BZX and about 10.5 minutes of downtime for BYX.
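
As a quick sanity check on those figures, and on the five-nines target mentioned throughout this blog, here is the arithmetic (illustrative only):

```python
# Converting a claimed uptime percentage into downtime per year.

HOURS_PER_YEAR = 365 * 24  # 8,760

def downtime_hours_per_year(uptime_percent):
    return HOURS_PER_YEAR * (1 - uptime_percent / 100.0)

for label, uptime in [("BZX", 99.94), ("BYX", 99.998), ("five nines", 99.999)]:
    hours = downtime_hours_per_year(uptime)
    print(f"{label}: {uptime}% uptime ≈ {hours:.2f} hours "
          f"({hours * 60:.1f} minutes) of downtime per year")
# This reproduces the figures above; five nines works out to roughly 5.3 minutes per year.
```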

Those figures would be admirable for most online services, but my first thought is that for a High Frequency Trading system this could represent billions of dollars in lost trades. I'd like to hope that all of this downtime occurred when the markets were closed.

But even short outages of seconds or even fractions of a second can cause cascading application and database failures that corrupt the system and cause problems for hours or days.

Speculation: How it Could Have Been Prevented

This was not an all-out downtime event like I prophesied back in May 2010 and I don't think that ZeroNines technology could have prevented it. However, a variety of other real-world failures can be just as disastrous, particularly when one considers the shortcomings of failover-based disaster recovery, questionable cloud reliability, and the fact that switches and other hardware will eventually fail.

Allow me to postulate how this might happen one day on another financial firm's systems.

Scenario 1
The software fails because of a corrupted instance of the trading application on one server node. In an Always Available™ configuration that one node could have been shut down, allowing other instances of the application to handle all transactions as normal.

Scenario 2
A network switch fails, leading to extreme latency in some transaction responses. In an Always Available configuration, all transactions are processed simultaneously in multiple places, so the slow responses from the failed node would be ignored in favor of fast responses from the functioning nodes. The latency would not have been noticed by the traders or the market.

Scenario 3
An entire cloud instance has an outage lasting just a few seconds, well within the provider's service level promises. During failover to another virtual server, one application is corrupted and begins sending false information. In an Always Available configuration there is no failover and no failover-induced corruption. Instead, other instances of the application process all the transactions while one is down, and then update the failed instance once it comes back online.

These scenarios are simplifications but you get the point: problems will always occur, but if they are within an Always Available configuration the nodes that continue to function will cover for those that don't.

Playing Nice for the SEC

“When you’re actually the exchange that is coming public and the platform that you touted as being very robust and competitive with any other global exchange fails to work, it’s a real black eye,” said Peter Kovalski, a money manager who expected to receive shares in Bats for Alpine Mutual Funds in Purchase, New York [source].

With all respect to Mr. Kovalski, this is far more than a black eye. It's a business disaster. BATS is on the ropes, and I suspect their underwriters are hurting too. Trading opportunities were lost, many permanently. The SEC is already investigating automated and High Frequency Trading [source], and I'd guess it will be far more cautious about innovation among exchanges, trading platforms, and other systems.

ZeroNines offers one way to keep the wolves at bay: our Always Available technology can enable uptime in excess of 99.999% (five nines) among applications and data within traditional hosting environments, virtualized environments, and the cloud. We can't fix your software bugs, but we can prevent downtime caused by other things.

Visit the ZeroNines website to find out more about how our disaster-proof architecture protects businesses of any description from downtime.

Alan Gin – Founder & CEO, ZeroNines

January 27, 2012

The Legal Ramifications of Cloud Outages

Here's a public service announcement for cloud customers and cloud service providers alike: If you're not doing something to significantly increase the reliability of your cloud systems, you should prepare your legal team.

Take a look at this article; it's a great primer to get everyone started: 5 Key Considerations When Litigating Cloud Computing Disputes by Gerry Silver, partner at Chadbourne & Parke.

I agree with Silver, who sums up the situation nicely when he says that "given the ever-increasing reliance on cloud computing, it is inevitable that disputes and litigation will increase between corporations and cloud service providers."

Understandably, both cloud users and cloud providers will want to dodge responsibility for cloud outages. "The corporation may be facing enormous liability and will seek to hold the cloud provider responsible, while the cloud provider will undoubtedly look to the parties' agreement and the underlying circumstances for defenses" [source].

Looks like the future is bright for attorneys who specialize in cloud issues. After all, a faulty power supply or software glitch could lead to years of court battles.

Five Legal Elements

Silver outlines five key elements for the legal team to consider:
  • Limitation of liability written into service contracts.
  • Whether the Limitation of Liability clause can be circumvented: can the cloud provider be held responsible despite this clause?
  • Contract terms: A breach of contract on either side can greatly affect litigation outcomes.
  • Remedies: During the crisis, the corporation could demand that the cloud provider take extraordinary steps to restore systems and data.
  • Insurance and indemnification: Insurance may cover some losses, and a third party may bear some responsibility for the problem too.

The Disturbing News: Expectations are Low

In my travels, I am still surprised at how little thought goes into the liability associated with an outage, whether it occurs in a datacenter, cloud, or hybrid configuration. Although I embrace everyone's motivation to move to the cloud, I found a couple of points in Silver's article disturbing because they shed light on the obsolete way the tech industry thinks about cloud architecture as it relates to disaster prevention.

1) Just how much foresight is a cloud provider legally expected to have? In the section titled "May the Limitation of Liability Clause Be Circumvented?" Silver describes how "one court recently sustained a claim of gross negligence and/or recklessness in a cloud computing/loss of data case because it was alleged that the provider failed to take adequate steps to protect the data." This raises the question of what constitutes "failure to take adequate steps". Does it mean that the provider did something genuinely negligent like setting up a system with multiple single points of failure? Were they culpable because they had followed best practices and relied upon an industry-standard failover-based recovery system which later failed? Or did they fail to seek out (or create) the most advanced and reliable proactive business continuity system on the planet? Whatever they were using probably seemed good at the time but was clearly not adequate because it failed to protect the customer's data.

I would speculate that a customer’s lawyer would have a pretty high expectation of what "adequate steps" are, but as you will see in my next point the bar is still set pretty low.

2) The expectation is that cloud providers will be using failover, which is 20 years out of date. In the same section, Silver asks "Were back-ups of data stored in different regions? Were banks of computers isolated from one another ready to take over if another zone failed?" This without doubt describes a failover system. Apparently his expectation is that a cloud provider should follow current best practices and use a failover disaster recovery system. But the failover technique was designed decades ago for systems that are now extinct or nearly so. The latest networks are radically more sophisticated than their forebears and consequently have radically different requirements. Even a successful failover is a perilous thing, and failovers fail all the time. If they didn't, Mr. Silver would probably not have found it necessary to write this article. Backups happen only on fixed schedules so the most recent transactions are often lost during a disaster. You can expect legal battles over downtime and data loss to continue because cloud providers and their customers are all using one variation or another of these outdated disaster recovery techniques. So how can a disaster recovery system that is so prone to disaster be considered an "adequate step?"

Like I said, you'd better call a meeting with your legal counsel and get ready.

No Outage, No Litigation

ZeroNines can actually eliminate outages. Our Always Available™ technology processes all network transactions simultaneously and in parallel on multiple cloud nodes or servers that are geographically separated. If something fails and brings down Cloud Node A, Nodes B and C continue processing everything as if nothing had happened. There is no hierarchy and no failover. So if the cloud provider's service never goes offline, there is no violation of the SLA and no cause for litigation.

Our approach to business continuity is far superior to the failover paradigm, offering in excess of five nines (>99.999%) of uptime. It is suitable for modern generations of clouds, virtual servers, traditional servers, colocation hosting, in-house servers, and the applications and databases that clients will want to run in all of these.

So my message to cloud providers is to check out ZeroNines and Always Available as a means of protecting your service from downtime and the litigation that can come with it.

My message to cloud customers is that you can apply ZeroNines and Always Available whether your cloud provider is involved or not. After all, your key interest here is to maintain business continuity, not to win a big settlement over an outage.

And heads-up to the lawyers on both sides: We are setting a new standard in what constitutes "adequate steps".

Visit the ZeroNines website to find out more about how our disaster-proof architecture protects businesses of any description from downtime.

Alan Gin – Founder & CEO, ZeroNines

January 23, 2012

RIM co-CEOs Resign: Is This the Cost of Downtime?

Back in October I commented in this blog about the enormous RIM BlackBerry outage [source]. I wrote that "even a massive outage like this is unlikely to cause the demise of a large and important firm, but combined with other woes like a less-than-competitive product and poor business model it could well be the deciding factor."

And now for the fallout. RIM is still in business, but its beleaguered co-CEOs/co-Chairmen Jim Balsillie and Mike Lazaridis have resigned and taken other positions within the company [source]. I'm sure it was not the outage alone (or all RIM outages put together) that caused this leadership shakeup. But it could well have been the deciding factor.

Outages and CEO Job Security

RIM's product problems are certainly serious. But I see a fundamental difference between 1) the prescience needed to get the right product to market at the right time, and 2) the technical ability to keep an existing product up and running. Customers might to some degree forgive a company whose product is reliable but behind the times. They will abandon one that doesn't work when they need it, even if it is the newest, slickest thing around.

The October outage has RIM "facing a possible class action lawsuit in Canada" [source]. Add the cost of that to the costs of recovery, customer abandonment, lost shareholder value, and so forth. (Stay tuned; I will be commenting on the legal issues around cloud outages in the next few days.)

To put RIM's decline in perspective, the company was worth $70 billion a few years ago but today has a market value of about $8.9 billion [source]. Their stock dropped about 75% last year and was down to $16.28 before the market opened on Monday January 23, 2012 [source].

So according to the rules of modern business, someone has to pay and in this case it is the CEOs.

Now Imagine This at a Smaller Company

Can you imagine a three-day outage at a smaller software company? Or even a one-day outage? Imagine a typical e-commerce technology provider with 50 retail customers, 100 employees, and a SaaS application. If the core application, image server, database server, customer care system, inventory system, orders & fulfillment system, or another key element goes down, that could be the end of them. Many smaller companies do not survive a significant downtime event. And many smaller retailers do not survive if they are unable to do business on a key shopping day such as Black Friday or Cyber Monday.

The same is true if the email system goes down for just a couple of hours. It happens all the time. Email is a key element of workflow and productivity, and what company can afford to sit still for even a couple of hours?

It's not just the CEO whose job is at risk. Here's where an ounce of prevention is worth far more than a pound of cure.

That Rickety Old Failover

Remember my earlier comment about reliable-but-outdated products versus products that fail when you need them? Ironically, the failover disaster recovery model that failed RIM back in October manages to be both outdated and unreliable. It was designed for systems and architectures that no longer bear any resemblance to what businesses are actually using. If failover worked, I would not be writing this, because there would be no need for its replacement.

But if you want to find out about real business continuity and getting away from failover, take a look at ZeroNines. Our Always Available™ architecture processes in multiple cloud locations, on multiple servers, and in multiple nodes. There is no hierarchy so if one goes down the others continue processing all network transactions. ZeroNines can bring application uptime to virtually 100%. It is a complete departure from the failover that RIM is using, and that small businesses everywhere stake their futures upon.

Time Will Tell

"RIM earned its reputation by focusing relentlessly on the customer and delivering unique mobile communications solutions… We intend to build on this heritage to expand BlackBerry's leadership position," RIM's new CEO Thorsten Heins is quoted as saying [source].

Let's hope this "focus on the customer" also includes a strategic initiative to build genuine uptime and availability, or maybe we'll be reading about another new RIM CEO next January.

Visit the ZeroNines website to find out more about how our disaster-proof architecture protects businesses of any description from downtime.

Alan Gin – Founder & CEO, ZeroNines