July 18, 2012

A Flurry of July Outages – And All of Them Preventable

It's starting to look like the Amazon outages in June were only the beginning.

A number of spectacular datacenter failures have made the news in just the past week. First one hit, and I decided to blog about it. Then another. Then another. So here's a digest of all of them. Note that all three disasters centered on lost power, either at the utility or within the facility itself. Combine these with the two power-related Amazon outages on June 15 and June 29, and we can see a disturbing trend.

Level 3 Communications, July 10, 2012
  • Facility affected: Central London datacenter
  • Duration: Approximately six hours
  • Cause: Loss of A/C power "to the content delivery network equipment and customer colocation"
  • Effects: At least fifty companies went offline directly as a result, and an unspecified number of other companies that use the datacenter for connectivity and hosting also lost service.
  • Source: ZDNet.com
With this one, it looks like power from the utility provider failed. Reading between the lines, I surmise that two diesel backup generators kicked in, but an uninterruptible power supply failed (dare I say "was interrupted"?). As the ZDNet author sums it up, "Because the Braham Street facility is a major connectivity point for Level 3, companies that use services that plug into the transit provider were also severely affected." So direct customers were knocked offline, and so were their customers' customers. For example, colocation provider Adapt had to alert its customers that service was unavailable. The most telling quote comes from Justin Lewis, operations director for Adapt: "When I saw this I was very surprised — this is not a normal event by any means… You would not expect to have a total failure of this nature in a datacentre."

While it's true that a total failure of a datacenter is unusual, partial failures related to power outages happen all the time, as with Amazon. And guess what happened the very next day…

Shaw Communications, July 11, 2012
  • Facility affected: Shaw Communications HQ and IBM datacenter, downtown Calgary
  • Duration: Two+ days, with lingering effects
  • Cause: Transformer explosion and fire
  • Effects: Hospital datacenter outage, cancellation of 400+ surgeries and medical procedures, inability of the populace to reach 911 emergency services via landline, inability to reach city services via phone, unspecified business site/service outages, and loss of online motor-vehicle and land-title services.
  • Source: Datacenter Dynamics Focus
This one should scare all of us because it illustrates the depth of business and social disaster that can stem from a single-point-of-failure system. You really have to read the whole article to get a feel for the extent of the impact. There is no mention in this article of human injuries or fatalities, so I am optimistically assuming there were none.

Shaw Communications is "one of Canada's largest telcos." On Wednesday the 11th, an explosion and fire disrupted all services at the datacenter, which serves medical centers, businesses, emergency phone service, and several city services.

The big picture is beautifully summed up by DatacenterKnowledge.com:

The incident serves as a wake-up call for government agencies to ensure that the data centers that manage emergency services have recovery and failover systems that can survive [a] series of adversities – the “perfect storm of impossible events” that combine to defeat disaster management plans (source).

Salesforce.com, July 12, 2012
  • Facility affected: The West Coast datacenter run by Equinix
  • Duration: Approximately seven hours, with performance issues for several days
  • Cause: Loss of power during maintenance, and apparent additional failures
  • Effects: Salesforce.com customers were unable to use the service or experienced poor performance.
  • Source: Information Week
It sounds like the actual power outage was brief, but that it caused ancillary problems. "Equinix company officials acknowledged their Silicon Valley data center had experienced a brief power outage and said some customers may have been affected longer than the loss of power itself" (emphasis mine). If power comes back on and services don't, that's a failed failover and/or cascading software failures. I offer this quote as further proof: "Standard procedures for restoring service to the storage devices were not successful and additional time was necessary to engage the respective vendors to further troubleshoot and ensure data integrity" (source).
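If I were writing the runbook, the post-restoration step would look something like the sketch below, in plain Python with hypothetical internal health-check URLs I've made up for illustration. The point is that once power returns, each dependent tier gets verified until it actually answers, and a persistent failure gets escalated rather than assumed away. I'm not claiming this is what Equinix or Salesforce.com actually run; it just illustrates the principle.

```python
# Illustrative sketch only: after power returns, don't assume recovery.
# Verify each dependent service before declaring the incident over.
# Service names and check URLs below are hypothetical placeholders.
import time
import urllib.request

CHECKS = {
    "storage-tier": "https://status.internal.example.com/storage/health",
    "app-tier":     "https://status.internal.example.com/app/health",
    "database":     "https://status.internal.example.com/db/health",
}

def is_healthy(url: str, timeout: float = 3.0) -> bool:
    """A service counts as recovered only if its health endpoint answers 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

def verify_recovery(max_attempts: int = 10, wait_s: float = 30.0) -> bool:
    """Re-check every service after power restoration; a service that is
    still down after max_attempts is a failed failover and gets escalated."""
    for attempt in range(1, max_attempts + 1):
        down = [name for name, url in CHECKS.items() if not is_healthy(url)]
        if not down:
            print("all services verified healthy")
            return True
        print(f"attempt {attempt}: still down: {', '.join(down)}")
        time.sleep(wait_s)
    return False  # power is back but services are not: escalate to the vendors
```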

What's the REAL Cause?

Loss of power or power systems was the key instigating factor in all three outages. But power loss is a known threat that is supposed to be guarded against, so secondary and backup systems should have prevented service downtime and business disasters once the power outage was under way. Of course, there will occasionally be combinations of failures that cannot be foreseen or prevented.

So I assert that the real cause of business disasters like these is not blown transformers and bad utility service but insufficient preparation. Any one datacenter is vulnerable to these occasional "black swan" events, and you have to expect it to be disabled somewhere along the line. The power may go out, but it is up to you to prevent the business disaster.

In order to maintain service, any given datacenter must be expendable, so you and your customers can carry on without it until it is fixed.

Forget about failover. It fails as often as it succeeds, as we can see with Salesforce.com above. And I'd bet my socks that the Level 3 and Shaw Communications outages also featured failovers that didn't work.

Locating everything in one building is just plain irresponsible. That is a carryover from a past era, when geographical separation was dreamt of but not practical. One fire or power outage and entire systems are gone.
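To make the alternative concrete, here is a minimal sketch, in plain Python with hypothetical site URLs of my own choosing, of the general active-active idea: send every request to several geographically separated sites at once and accept the first healthy answer, so that losing any single datacenter costs you nothing but capacity. It illustrates the principle only; it is not any vendor's actual implementation.

```python
# Illustrative sketch only: fan a request out to several geographically
# separated sites and accept the first successful answer. The site URLs
# below are hypothetical placeholders.
from concurrent.futures import ThreadPoolExecutor, as_completed
import urllib.request

SITES = [
    "https://us-west.example.com/api/orders",
    "https://us-east.example.com/api/orders",
    "https://eu-west.example.com/api/orders",
]

def query_site(url: str, timeout: float = 2.0) -> bytes:
    """Fetch the same resource from one site; raises on failure or timeout."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()

def first_healthy_response(urls=SITES) -> bytes:
    """Send the request to every site simultaneously and return the first
    success. The call fails only if *all* sites are down."""
    with ThreadPoolExecutor(max_workers=len(urls)) as pool:
        futures = {pool.submit(query_site, u): u for u in urls}
        errors = []
        for fut in as_completed(futures):
            try:
                return fut.result()
            except Exception as exc:  # one site being down is not an outage
                errors.append((futures[fut], exc))
        raise RuntimeError(f"all sites unavailable: {errors}")
```

The design point is that there is nothing to "fail over" to, because every site is already serving.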

An MSP Solution from ZeroNines

Always Available™ technology from ZeroNines could easily have prevented application, data, and service downtime in each of these three disasters. It combines distant geographical separation, redundant/simultaneous processing of all data and transactions, and interoperability between systems to enable uptime in excess of five nines (99.999%), regardless of what happens at any given location.
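For a sense of scale, here is a quick back-of-envelope calculation (nothing ZeroNines-specific, just arithmetic on the availability figure) showing how small the downtime budget is at five nines: roughly five minutes per year, compared with the six-plus hours of the Level 3 incident alone.

```python
# Quick arithmetic: allowed downtime per year for a given availability target.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # ~525,960 minutes

def downtime_budget(availability: float) -> float:
    """Minutes of downtime per year permitted by an availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR

for label, target in [("three nines", 0.999),
                      ("four nines", 0.9999),
                      ("five nines", 0.99999)]:
    print(f"{label} ({target:.3%}): {downtime_budget(target):6.1f} min/year")
```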

In conjunction with a major network infrastructure provider, we have recently rolled out an Always Available solution specifically for managed service providers. So if you're a provider like Adapt (see the Level 3 story above), your service should remain fully available even if one datacenter melts down completely. You're no longer dependent on the talents and equipment of your datacenter provider to support the SLAs you sign with your customers.

Visit the ZeroNines website to find out more about how our disaster-proof architecture protects businesses of any description from downtime.

Alan Gin – Founder & CEO, ZeroNines
