July 30, 2012

Azure and Google and Twitter, Oh My!

The flying monkeys attacked in force on Thursday, July 26, 2012, taking down services at cloud leaders Microsoft, Twitter, and Google. It seems the cloud is no Yellow Brick Road after all, guiding merry executives to some imagined Oz where the sun always shines and outages never happen.

The more I talk to IT planners, the more I find they are looking at reinvesting their cloud savings into business continuity. They rightly hope to compete on reliability, and to protect their businesses by trading the extremely high, unpredictable costs of outages for the predictable, low costs of the cloud plus business continuity. They've clearly got the right idea, especially when you consider the noise outages like Thursday's can make.

Microsoft Azure
  • Area affected: Western Europe, via the Dublin datacenter and the Amsterdam facility.
  • Duration: About two and a half hours.
  • Cause: Unspecified, but one expert suspects infrastructure troubles.
  • Effects: Loss of cloud service throughout Western Europe. Businesses like SoundGecko were unavailable.
  • Source: WebTechInfo.com
"In Azure’s case on Thursday, the constant availability of power and lack of a software culprit, such as the Feb. 29 one that downed several services, points to 'more of an infrastructure issue' where some undetected single point of failure in a network switch or other device temporarily disrupted availability, said Chris Leigh-Currill, CTO of Ospero, a supplier of private cloud services in Europe." (source)

Google Talk
  • Area affected: Worldwide (source).
  • Duration: About five hours.
  • Cause: Unspecified, but one expert suspects a bad hardware or software upgrade.
  • Effects: Service unusable; users could sign in but received only error messages.
  • Source: TechNewsWorld.com
"Outages like this 'often happen as the result of a hardware or software upgrade that wasn't properly tested before installation,' Rob Enderle, principal analyst at the Enderle Group, told TechNewsWorld" (source).

Twitter
  • Area affected: Worldwide.
  • Duration: About an hour and possibly more in some areas.
  • Cause: Datacenter failure and a failed failover.
  • Effects: "Users around the world got zilch from us."
  • Source: CNN
"Twitter's vice president of engineering, Mazan Rawashdeh… blamed the outage on the concurrent failure of two data centers. When one fails, a parallel system is designed to take over -- but in this case, the second system also failed, he said" (source).

Am I Repeating Myself? Am I Repeating Myself?

My last blog post was titled "A Flurry of July Outages – And All of Them Preventable". On Thursday we had another flurry, and all of them in the cloud. And I have to say again, with caution, that these too were preventable. I am cautious because we don't know for sure whether Mr. Leigh-Currill and Mr. Enderle are correct in their assumptions about the Azure and Google service interruptions, but if they are, then these outages were almost certainly preventable.

And the Twitter outage was clearly a case of failover failing us once again. That's what it is when "a parallel system that is designed to take over" fails to do so when the primary goes down. It's the same story we see over and over again.

After all these years in the industry I am still amazed that the biggest tech names in the world continue to rely on the ancient failover paradigm. The leaders of these organizations are trusting their reputations, their revenue, their shareholders' profits, their customers' businesses, and potentially people's lives to a disaster recovery process that quite likely won't work.

What Some Companies are Already Doing About It

Earlier in this post I mentioned the companies that are reinvesting cloud savings into reliability. These are typically smaller, more agile companies. They'll be able to take on the tech behemoths based on reliability, because they are thinking beyond mere cost savings and efficiency. They see the cloud as a means of creating their own "failsafe" hosting paradigms.

As most readers of this blog know, Always Available™ technology from ZeroNines replaces failover and backup-based recovery. It is already cloud-friendly. The Twitter outage is exactly the kind of thing we prevent. Companies hosting an Always Available array on Azure would have had virtually no risk of downtime, because all network transactions would have continued processing on other nodes. Always Available prevents data disasters before they happen, whether a power supply fails, or software gets corrupted, or a tornado picks up your Kansas datacenter and relocates it to Munchkinland, or someone melts your servers with a bucket of water, or the flying monkeys carry off the last of your support staff. Why clean up after an expensive disaster if you can prevent it in the first place?
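
For readers who like to see the idea in concrete terms, here is a minimal sketch of the active-active pattern I'm describing: every transaction is sent to several geographically separate nodes at once, and the write succeeds as long as any one of them responds. The node URLs and payload format are purely illustrative, and this is a conceptual sketch rather than ZeroNines' actual implementation.

```python
# Hypothetical sketch: an active-active client that sends every transaction
# to several geographically separate nodes at once. Service continues as
# long as ANY node acknowledges; a dead datacenter is simply ignored.
# Node URLs and payload format are illustrative, not ZeroNines' actual API.
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

NODES = [
    "https://eu-west.example.com/tx",   # e.g. Amsterdam
    "https://eu-north.example.com/tx",  # e.g. Dublin
    "https://us-east.example.com/tx",   # e.g. Virginia
]

def send_to_node(url: str, payload: bytes, timeout: float = 3.0) -> bool:
    """Try to post one transaction to one node; failure just means False."""
    try:
        req = urllib.request.Request(
            url, data=payload, headers={"Content-Type": "application/json"}
        )
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except Exception:
        return False  # node down, unreachable, or slow: not fatal

def submit_transaction(tx: dict) -> bool:
    """Fan the transaction out to all nodes in parallel.

    The write is considered successful if at least one node accepted it;
    surviving nodes are expected to bring a failed node back in sync later.
    """
    payload = json.dumps(tx).encode("utf-8")
    with ThreadPoolExecutor(max_workers=len(NODES)) as pool:
        results = list(pool.map(lambda u: send_to_node(u, payload), NODES))
    return any(results)

if __name__ == "__main__":
    ok = submit_transaction({"order_id": 42, "action": "create"})
    print("accepted by at least one datacenter" if ok else "total outage")
```

Contrast that with failover, where the standby site sits idle until something detects that the primary has died, and the whole outcome depends on that detection and switchover working on the worst day of the year.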

Visit the ZeroNines website to find out more about how our disaster-proof architecture protects businesses of any description from downtime.

Alan Gin – Founder & CEO, ZeroNines

July 18, 2012

A Flurry of July Outages – And All of Them Preventable

It's starting to look like the Amazon outages in June were only the beginning.

A number of spectacular datacenter failures have made the news just in the past week. First one hit, and I decided to blog on it. Then another. Then another. So here's a digest of all of them. Note that all three disasters centered on lost power, either at the utility or within the companies' own facilities. Combine these with the two power-related Amazon outages on June 15 and June 29, and a disturbing trend emerges.

Level 3 Communications, July 10, 2012
  • Facility affected: Central London datacenter
  • Duration: Approximately six hours
  • Cause: Loss of A/C power "to the content delivery network equipment and customer colocation"
  • Effects: At least fifty companies went offline directly as a result, and an unspecified number of other companies that use the datacenter for connectivity and hosting also lost service.
  • Source: ZDNet.com
With this one, it looks like power from the utility provider failed. Reading between the lines, I surmise that two diesel backup generators kicked in, but an uninterruptible power supply failed (dare I say "was interrupted"?). As this ZDNet author sums it up, "Because the Braham Street facility is a major connectivity point for Level 3, companies that use services that plug into the transit provider were also severely affected." So direct customers were knocked offline, and so were the customers of those customers. For example, colocation provider Adapt had to alert their customers that service was unavailable. The most telling quote comes from Justin Lewis, operations director for Adapt: "When I saw this I was very surprised — this is not a normal event by any means… You would not expect to have a total failure of this nature in a datacentre."

While it's true that a total failure of a datacenter is unusual, partial failures related to power outages happen all the time, as with Amazon. And guess what happened the very next day…

Shaw Communications, July 11, 2012
  • Facility affected: Shaw Communications HQ and IBM datacenter, downtown Calgary
  • Duration: Two+ days, with lingering effects
  • Cause: Transformer explosion and fire
  • Effects: Hospital datacenter outage, cancellation of 400+ surgeries and medical procedures, inability of the populace to reach 911 emergency services via landline, inability to reach city services via phone, unspecified business site/service outages, and loss of online motor-vehicle and land-title services.
  • Source: Datacenter Dynamics Focus
This one should scare all of us because it illustrates the depth of business and social disaster that can stem from a single-point-of-failure system. You really have to read the whole article to get a feel for the extent of the impact. There is no mention in this article of human injuries or fatalities, so I am optimistically assuming there were none.

Shaw Communications is "one of Canada's largest telcos." On Wednesday the 11th, an explosion and fire disrupted all services at the datacenter, which serves medical centers, businesses, emergency phone service, and several city services.

The big picture is beautifully summed up by DatacenterKnowledge.com:

The incident serves as a wake-up call for government agencies to ensure that the data centers that manage emergency services have recovery and failover systems that can survive [a] series of adversities – the “perfect storm of impossible events” that combine to defeat disaster management plans (source).

Salesforce.com, July 12, 2012
  • Facility affected: The West Coast Datacenter run by Equinix
  • Duration: Approximately seven hours, with performance issues for several days
  • Cause: Loss of power during maintenance, and apparent additional failures
  • Effects: Salesforce.com customers were unable to use the service or experienced poor performance.
  • Source: Information Week
It sounds like the actual power outage was brief, but that it caused ancillary problems. "Equinix company officials acknowledged their Silicon Valley data center had experienced a brief power outage and said some customers may have been affected longer than the loss of power itself" (emphasis mine). If power comes back on and services don't, that's a failed failover and/or cascading software failures. I offer this quote as further proof: "Standard procedures for restoring service to the storage devices were not successful and additional time was necessary to engage the respective vendors to further troubleshoot and ensure data integrity" (source).

What's the REAL Cause?

Loss of power or power systems was the key instigating factor in all three outages. But power loss is a known threat that is supposed to be guarded against, so secondary and backup systems should have prevented service downtime and business disasters once the power outage was under way. Of course, there will occasionally be combinations of failures that cannot be foreseen or prevented.

So I assert that the real cause of business disasters like these is not blown transformers and bad utility service but insufficient preparation. Any one datacenter is vulnerable to these occasional "black swan" events and you have to expect it to be disabled somewhere along the line. The power may go out, but it is up to you to prevent the business disaster.

In order to maintain service, any given datacenter must be expendable, so you and your customers can carry on without it until it is fixed.

Forget about failover. It fails as often as it succeeds, as we can see with Salesforce.com above. And I'd bet my socks that the Level 3 and Shaw Communications outages also featured failovers that didn't work.

Locating everything in one building is just plain irresponsible. That is a carryover from a past era, when geographical separation was dreamt of but not practical. One fire or power outage and entire systems are gone.

An MSP Solution from ZeroNines

Always Available™ technology from ZeroNines could easily have prevented application, data, and service downtime in each of these three disasters. It combines distant geographical separation, redundant/simultaneous processing of all data and transactions, and interoperability between systems to enable uptime in excess of five nines (99.999%), regardless of what happens at any given location.
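
To see why geographic separation plus simultaneous processing pushes availability so high, here is a back-of-the-envelope calculation. It assumes site failures are independent, which is exactly what wide separation is meant to approximate; the per-site figure of 99.9% is illustrative, not a measured ZeroNines number.

```python
# Back-of-the-envelope availability math for an active-active array.
# Assumes failures at different sites are independent -- the point of
# putting them hundreds or thousands of miles apart. Figures are
# illustrative, not measured ZeroNines numbers.

def combined_availability(per_site: float, sites: int) -> float:
    """Service is up unless EVERY site is down at the same time."""
    return 1.0 - (1.0 - per_site) ** sites

HOURS_PER_YEAR = 8766  # average, including leap years

for n in (1, 2, 3):
    a = combined_availability(0.999, n)          # each site: three nines
    downtime_min = (1.0 - a) * HOURS_PER_YEAR * 60
    print(f"{n} site(s): {a:.7%} available, ~{downtime_min:,.1f} min/yr down")

# Expected output (approx.):
# 1 site(s): 99.9000000% available, ~526.0 min/yr down
# 2 site(s): 99.9999000% available, ~0.5 min/yr down
# 3 site(s): 99.9999999% available, ~0.0 min/yr down
```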

In conjunction with a major network infrastructure provider, we have recently rolled out an Always Available solution specifically for managed service providers. So if you're a provider like Adapt (see the Level 3 story above), your service should remain fully available even if one datacenter melts down completely. You're no longer dependent on the talents and equipment of your datacenter provider to support the SLAs you sign with your customers.

Visit the ZeroNines website to find out more about how our disaster-proof architecture protects businesses of any description from downtime.

Alan Gin – Founder & CEO, ZeroNines

July 12, 2012

The Cost of Cloud Outages and Planning for the Next One

At least one customer is publicly abandoning the Amazon EC2 cloud after two power-related outages within a month. Online dating site WhatsYourPrice.com is walking out, going back to a more traditional local hosting provider (source). So the heat is still on in the East, with high temperatures continuing to make life difficult and the Amazon cloud beginning to lose customers.

What is the Real Cost of Cloud Outages?

This question is difficult to answer, as little hard data is available. Cloud providers and cloud customers keep the financial numbers closely guarded, whether we are talking about the customer's lost business and recovery costs, or Amazon's losses due to lost revenue, restitution paid to customers, customer attrition, and the increased difficulty of acquiring new customers.

Coincidentally, a report on cloud outages among top providers came out on June 22, 2012. In it, the Paris-based International Working Group on Cloud Computing Resiliency (IWGCR) claims that costs for outages between 2007 and 2011 among the 13 providers it reviewed totaled $70 million. The estimates were based on "hourly costs accepted in the industry" (source).

Downtime and availability rates are reported for Amazon Web Services (AWS), Microsoft, Research in Motion (RIM), and others. Total downtime was 568 hours, and average availability was 99.917%, nowhere near the five nines (99.999%) that is becoming the de facto target for acceptable uptime.
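
To put those percentages in perspective, here is the simple arithmetic converting them into downtime per year; the gap between 99.917% and five nines is measured in hours, not minutes.

```python
# What the report's 99.917% average means in practice, versus the "five
# nines" target. Simple arithmetic on the published averages; the
# per-provider spread in the IWGCR data will of course vary.
HOURS_PER_YEAR = 8766

for label, availability in [("reported average", 0.99917),
                            ("five nines target", 0.99999)]:
    downtime_hours = (1 - availability) * HOURS_PER_YEAR
    print(f"{label}: {availability:.3%} -> ~{downtime_hours * 60:.0f} min "
          f"({downtime_hours:.1f} h) of downtime per provider per year")

# reported average: 99.917% -> ~437 min (7.3 h) of downtime per provider per year
# five nines target: 99.999% -> ~5 min (0.1 h) of downtime per provider per year
```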

Although it's useful to put a stake in the ground and publish studies on the cost of outages, such surveys are virtually impossible to do accurately. I don't think even big analysts like Gartner can get a good view into the real costs. Unfortunately, the article does not make clear whether the $70 million was the cost to providers, to customers, or both. The sample size was also small, and apparently there is no information about actual customer size. I know first-hand that many of our clients put their downtime costs at $6 million per hour at the low end, averaging $18-24 million per hour. And because only outages that made the news were included, the report leaves out a lot of actual downtime, such as small glitches and maintenance windows that journalists never hear about. For all these reasons the actual costs must be higher.

Despite the unavoidable barriers to accurate measurement, such studies are still valuable: they highlight the bottom-line impacts, and they demonstrate just how difficult it is to estimate the cost of downtime.

40% of Cloud Users are Not Prepared

"Light, medium, and heavy cloud users are running clouds where on average 40 percent of their cloud — data, applications, and infrastructure — is NOT backed up and exposed to outage meltdown" (source).

This was said a couple of weeks ago by Cameron Peron, VP of Marketing at Newvem, a cloud optimization consultancy that specializes in the Amazon cloud. He was referring to his company's clients and the June 15 outage.

If the average cloud customer is anything like these companies then it is no wonder that cloud outages are such a concern. Another writer referred to this kind of planning (or lack of planning) as "stupid IT mistakes" (source).

Who's at Fault?

So when a cloud customer experiences downtime and loses money, who is actually to blame? The cloud provider who failed to deliver 100% uptime, or the cloud customer who was unprepared for the unavoidable downtime?

According to Peron, "Amazon doesn’t make any promises to back up data... The real issue is that many users are under the impression that their data is backed up… but in fact it isn’t due to mismanaged infrastructure configuration" (source).

Cloud customers need to be prepared to use best practices for data protection and disaster prevention/recovery. They need to remember that a cloud is just a virtual datacenter. It is a building crammed full of servers, each of which is home to a number of virtual servers. And all of it is subject to the thousand natural (and unnatural) shocks that silicon is heir to.

Cloud Customers Need to Take Responsibility for Continuity

So here's some friendly advice to WhatsYourPrice.com and others like them: whatever hosting model you choose, get your disaster plan in place. An outage is out there waiting for you, in the form of a bad cooling fan, corrupt database, fire, flood, or human error, whether you're in the cloud, on your own virtual servers, or at a local hosting provider.

Best practices and DR discipline should not be taken for granted simply because the datacenter is outsourced. Many companies that I meet with have gone to the cloud to cut costs, and many of those are reinvesting the savings in higher availability. They're looking ahead, trying to avoid disasters, outcompete on performance, and keep their customers satisfied.

Even before the advent of the cloud, new generations of low-cost compute models enabled disaster recovery standards that could prevent a lot of downtime. But those standards are often poorly executed or ignored altogether. And now, with it being so easy to outsource hosting to "the cloud", it is even easier for companies to shake off responsibility for business continuity, assuming or hoping that the folks behind the curtain will take care of everything.

I say that the primary responsibility for outages is the customer's. If you're providing a high-demand service you need to be ready to deliver. It doesn't matter if you can legitimately blame your provider after a disaster; your customers will blame you.

Your cloud (or other hosting) provider no doubt promises a certain amount of uptime in its service level agreement. Let's imagine that allows one hour of downtime per year. A single minor problem on their end might cause only a few minutes of downtime. But if your systems are not prepared, that interruption could corrupt databases, lose transactions in flight, crash applications, and wreak all manner of havoc. Their downtime glitch will become your costly business disaster unless you are prepared in advance to control it on your end.
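
Being "prepared on your end" can be as mundane as how your application submits writes. As one illustrative example (the endpoint and header names are hypothetical, not any particular provider's API), a retry loop with an idempotency key keeps a brief provider glitch from either dropping a transaction or applying it twice when the retry finally lands:

```python
# One small example of client-side preparedness: retry a write with an
# idempotency key, so a brief provider glitch neither loses the
# transaction nor applies it twice when the retry succeeds. The endpoint
# and header names are hypothetical; adapt to what your provider supports.
import json
import time
import urllib.request
import uuid

def durable_post(url: str, record: dict, attempts: int = 5) -> bool:
    key = str(uuid.uuid4())  # same key on every retry of this one write
    payload = json.dumps(record).encode("utf-8")
    for attempt in range(attempts):
        try:
            req = urllib.request.Request(url, data=payload, headers={
                "Content-Type": "application/json",
                "Idempotency-Key": key,
            })
            with urllib.request.urlopen(req, timeout=5) as resp:
                if 200 <= resp.status < 300:
                    return True
        except Exception:
            pass  # provider hiccup; fall through, back off, and retry
        time.sleep(min(2 ** attempt, 30))  # exponential backoff, capped
    return False  # still journal it locally rather than dropping it

# durable_post("https://api.example-host.com/orders", {"id": 7, "qty": 3})
```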

It's like a tire blowing out on a car; the manufacturer may be responsible to a degree for a wreck, but if your seatbelt was not fastened then all bets are off. Safety systems are there for a reason.

Make it so Any Datacenter is Expendable

ZeroNines offers a solution that enables the complete loss of any datacenter without causing service downtime. We believe that if WhatsYourPrice.com were using our Always Available™ architecture, their dating service would have continued to operate at full capacity for the duration of the outage, with no impact on the customer experience.

Visit the ZeroNines website to find out more about how our disaster-proof architecture protects businesses of any description from downtime. You may also be interested in our whitepaper "Cloud Computing Observations and the Value of CloudNines™".

Alan Gin – Founder & CEO, ZeroNines

July 5, 2012

Multi-Region Disasters and Expendable Datacenters℠

As I write this, the United States is suffering from a frightening heat wave and the lingering effects of storms that threatened everything east of the Rockies. Over two dozen lives have been lost and the heat might last for several more days. The transportation, emergency response, and utility infrastructures are badly strained; about a million customers are still without power. Major fires are burning in the west. It is distressing to think about this kind of multi-region disaster, but it is happening now and continues to unfold.

And as trivial as it feels to write about an outage against that backdrop, these natural disasters led to another outage of the Amazon EC2 cloud on June 29. It was very similar to the EC2 outage on June 15, which I blogged about last week (source). Friday's outage happened at the same Virginia datacenter and was caused by the same kind of event.

The Problem: Another Power Outage and Generator Failure

As happened a couple of weeks ago, a storm-induced power outage at the utility on Friday forced a switchover to Amazon's backup generator, and that generator failed (source). Netflix, Instagram, Pinterest, and others began to experience difficulties and outages. The problems lasted about an hour.

But rather than dissect this one outage, let's take a look at the larger issues surrounding downtime in the cloud.

Cloud Users Must Plan their Own Disaster Recovery

Wired.com had this to say about Friday's event:

In theory, big outages like this aren’t supposed to happen. Amazon is supposed to keep the data centers up and running – something it has become very good at… In reality, though, Amazon data centers have outages all the time. In fact, Amazon tells its customers to plan for this to happen, and to be ready to roll over to a new data center whenever there’s an outage. (source)

The long and short of it: cloud customers think they have shed their responsibility for business continuity and handed it to the cloud provider. They're wrong, and Amazon has apparently admitted as much by telling its customers to make their own disaster recovery preparations.

"Stupid IT Mistakes"

Those are the lead words in the title of an article about the June 15 outage (source). In it, the author refers to statistics from cloud optimization firm Newvem that show that 40% of cloud users are not properly prepared for an outage. They don't have any kind of redundancy: they don't back up their data and they deploy to only one region. Frighteningly, this includes large companies as well as small.

Promoting a Failure-Prone Recovery Plan

Another problem is that Amazon has apparently told its customers to "be ready to roll over to a new data center" (source). This is tacit approval of failover-based disaster recovery systems. But as we saw with the June 15 outage, failovers fail all the time and cannot be relied upon to maintain continuity. In fact, they often contribute to outages.

As for regular backups: those are always a good idea. But a backup location can fail too, particularly if it is hit by the same disaster. And what happens to transactions that occurred after the last backup? Will a recovery based on those backups even succeed? And although backup may eventually get you running again, it can't prevent the costly downtime.
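
The arithmetic on that middle question is sobering. Using made-up but plausible numbers, here is how much data sits at risk between backups:

```python
# Illustrative recovery-point arithmetic: with backup-based recovery,
# everything written since the last backup is at risk. Transaction rates
# and intervals here are invented for the example.
def transactions_at_risk(backup_interval_min: float, tx_per_min: float) -> float:
    """Worst case: the failure hits just before the next backup runs."""
    return backup_interval_min * tx_per_min

for interval in (60, 240, 1440):  # hourly, 4-hourly, nightly backups
    print(f"backup every {interval:>4} min, 500 tx/min -> "
          f"up to {transactions_at_risk(interval, 500):,.0f} transactions lost")

# backup every   60 min, 500 tx/min -> up to 30,000 transactions lost
# backup every  240 min, 500 tx/min -> up to 120,000 transactions lost
# backup every 1440 min, 500 tx/min -> up to 720,000 transactions lost
```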

You Can't Prevent All Causes of Failure

I argue again and again that there is no way to prevent all the thousands of small and large errors that can conspire, singly or in combination, to knock out a datacenter, cloud node, or server. Generators, power supplies, bad cables, human error, and any number of other small failures can easily combine to make a big disaster. It's not practical to continually monitor all of these to circumvent every possible failure. IT is chaos theory personified; do all you can, but something's going to break.

Geographic Issues

As we are seeing this week, one disaster or group of disasters can span vast geographic areas. You need to plan your business continuity system so the same disaster can't affect everything. Companies that have located all their IT on the Eastern Seaboard should be sweating it this week, because it's conceivable that the heat wave and storms could cause simultaneous power outages from New York to Virginia to Florida. A primary site, failover site, and backup location could all go down at the same time.

The Real Solution: Geographically Separated Expendable Datacenters℠

Here at ZeroNines we've constructed our business continuity solution around a number of tenets, including:
  • Service must continue despite the complete failure of any one datacenter.
  • Geographical separation is key, to prevent one disaster from wiping out everything.
  • Failover is not an option, because it is extremely unreliable.
  • The solution must be easy and affordable so that the "40%" mentioned above actually use it.
Based on all this, we've developed the Always Available™ architecture, which enables multiple instances of the same applications, data, and transactions to run equally and simultaneously in multiple locations hundreds or thousands of miles apart. The system does not rely upon or include either failover or restoration from backup. Best of all, an entire datacenter can go offline at any moment and all transactions will simply continue to be processed at the other datacenters, with no interruption to service. It is affordable, OS-agnostic, and operates with existing apps, databases, and infrastructures.

ZeroNines client ZenVault uses Always Available, and it hosts on Amazon EC2. During outages like these, no extraordinary measures are needed: if the EC2 East node goes offline, the two other nodes (in Santa Clara and EC2 West) continue running the service, then bring the EC2 East node back up to date once it comes back online. ZenVault has had true 100% uptime since the day it launched in 2010.
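
For the technically curious, here is a simplified sketch of that last step: bringing a recovered node back up to date by replaying the transactions it missed. It illustrates the concept only; the sequence numbers, log structure, and node names are illustrative and not ZeroNines' actual protocol.

```python
# Simplified sketch of the "restore the node once it comes back online"
# step: surviving nodes keep an ordered transaction log, and a returning
# node replays everything past its last applied sequence number.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    log: list = field(default_factory=list)   # ordered (seq, tx) pairs
    online: bool = True

    def apply(self, seq: int, tx: str) -> None:
        self.log.append((seq, tx))

    def last_seq(self) -> int:
        return self.log[-1][0] if self.log else 0

def catch_up(recovered: Node, survivor: Node) -> None:
    """Replay, in order, every transaction the recovered node missed."""
    missed = [(s, tx) for s, tx in survivor.log if s > recovered.last_seq()]
    for seq, tx in missed:
        recovered.apply(seq, tx)
    recovered.online = True

# EC2 East goes dark after seq 2; Santa Clara and EC2 West keep processing.
east, west = Node("ec2-east"), Node("ec2-west")
for seq, tx in enumerate(["t1", "t2", "t3", "t4"], start=1):
    west.apply(seq, tx)
    if seq <= 2:
        east.apply(seq, tx)        # east only saw the first two
east.online = False

catch_up(east, west)               # east returns and replays t3, t4
assert east.log == west.log
```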

Visit the ZeroNines website to find out more about how our disaster-proof architecture protects businesses of any description from downtime.

Alan Gin – Founder & CEO, ZeroNines