August 2, 2012

How Short Outages Become Long Outages

Early in the morning of Friday, July 27, 2012, Hosting.com experienced an 11-minute outage. Although service was restored very quickly, many customers weren't prepared and experienced hours of downtime as a result (source).

The key point here is that although a few minutes of hosting provider downtime is probably well within the terms of the service level agreement (SLA), the customers' actual downtime far exceeded it. I'm going to quote my own blog from just a couple of weeks ago because it sums up the situation accurately:

Your cloud (or other hosting) provider no doubt promises a certain amount of uptime in their service level agreement. Let's imagine that allows one hour of downtime per year. If they have one minor problem it could cause downtime of just a few minutes. But if your systems are not prepared, that interruption could corrupt databases, lose transactions in flight, crash applications, and wreak all manner of havoc. Their downtime glitch will become your costly business disaster unless you are prepared in advance to control it on your end (source).

Hosting.com's Service Level Agreement is posted publicly on their website (source). A quick read reveals that it does NOT promise 100% uptime. Datacenters fail, and that's a fact of life. When signing with any hosting or cloud provider, it is vital that you understand exactly who is responsible for what, and whether total downtime is measured according to the unavailability of their infrastructure or the amount of time it takes you to recover.

Recently, the Paris-based International Working Group on Cloud Computing Resiliency (IWGCR) found that outage costs between 2007 and 2011 among the 13 providers it reviewed exceeded $70 million (source). No SLA from any provider is going to compensate for those kinds of losses. If the industry demanded that of them, no hosting provider would be able to stay in business. It is far better for customers to invest in reliability than to expect dollar-for-dollar restitution after a disaster.

Background: Hosting.com and its Customers

According to the company website, "Hosting.com is a next generation cloud hosting and recovery services company focused on ensuring your mission-critical applications are AlwaysOn™" (source). They are a leading provider of other enterprise hosting solutions and services as well, with datacenters in Dallas, Denver, Irvine, Louisville, Newark, and San Francisco. One source says they host over 65,000 websites (source). Their customers include financial services, healthcare, media, and retail companies, as well as software as a service (SaaS) providers and content distribution networks (CDNs) (source).

In contrast with other recent outages at other providers, where explanations were late or non-existent, Hosting.com CEO Art Zeile stepped up very quickly during this crisis and alerted his customers to the problem, its cause, and its effects. Though they won't be thrilled with news like this, customers need clear communication and honesty from their providers. That way they know what to tell their own customers and management, and their overworked internal IT teams will have a better chance of taming the chaos. I applaud Mr. Zeile and his actions. We need this level of leadership to benefit the cloud industry at large.

The Problem: Human Error, a Power Outage, and a Chain Reaction

Mr. Zeile explained that "An incorrect breaker operation sequence executed by the servicing vendor caused a shutdown of the UPS plant resulting in loss of critical power to one data center suite within the [Newark, Delaware] facility" (source). The power was back on within 11 minutes, but "customer web sites were offline for between one and five hours as their equipment and databases required more time to recover from the sudden loss of power."
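
To put those numbers in perspective, here is a rough back-of-the-envelope sketch in Python. The 11-minute outage and the one-to-five-hour recovery range come from Mr. Zeile's statement. The one-hour annual allowance is only the hypothetical figure from my earlier post, not a term from Hosting.com's SLA, and the function and variable names are my own.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def availability(downtime_minutes: float) -> float:
    """Return the fraction of a year a service was up, given its total downtime."""
    return 1 - downtime_minutes / MINUTES_PER_YEAR


def amplification(provider_minutes: float, customer_minutes: float) -> float:
    """Return how many times longer the customer was down than the provider."""
    return customer_minutes / provider_minutes


PROVIDER_OUTAGE = 11           # minutes, from the Hosting.com statement
CUSTOMER_OUTAGES = (60, 300)   # minutes: the reported one-to-five-hour range
HYPOTHETICAL_ALLOWANCE = 60    # minutes per year; an assumption, not a real SLA term

if __name__ == "__main__":
    print(f"Provider: {PROVIDER_OUTAGE} min, "
          f"availability {availability(PROVIDER_OUTAGE):.4%}")
    for minutes in CUSTOMER_OUTAGES:
        print(
            f"Customer: {minutes} min, availability {availability(minutes):.4%}, "
            f"{amplification(PROVIDER_OUTAGE, minutes):.1f}x the provider's outage, "
            f"{minutes / HYPOTHETICAL_ALLOWANCE:.0%} of the hypothetical allowance"
        )
```

Under these assumptions the provider used less than a fifth of a one-hour annual allowance, while a customer who needed five hours to recover was down roughly 27 times longer than the provider was.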

I wasn't there, but I can surmise what happened. When the power went out, an unspecified number of servers were shut off without proper shutdown procedures. Applications and databases were abruptly terminated. Other applications and databases that depended upon them suddenly lost transactions in flight. They crashed too, taking down other apps and databases in turn. And so on down the line in a classic cascading failure scenario.
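
A cascade like this can be modeled as a walk over a dependency graph. The following is a minimal sketch, assuming the simplest possible rule: any service that depends on a failed service fails too. The service names and the `cascade` function are hypothetical illustrations, not anything taken from Hosting.com's environment.

```python
from collections import deque


def cascade(depends_on: dict[str, list[str]], initially_down: set[str]) -> set[str]:
    """Return every service that ends up down, assuming a service fails
    as soon as any one of its dependencies fails (a deliberate simplification)."""
    # Invert the graph so we can look up "who depends on this service?"
    dependents: dict[str, list[str]] = {}
    for service, dependencies in depends_on.items():
        for dependency in dependencies:
            dependents.setdefault(dependency, []).append(service)

    down = set(initially_down)
    queue = deque(initially_down)
    while queue:
        failed = queue.popleft()
        for service in dependents.get(failed, []):
            if service not in down:
                down.add(service)
                queue.append(service)
    return down


# Hypothetical customer stack: each service lists what it depends on.
DEPENDS_ON = {
    "orders-db": [],
    "billing-app": ["orders-db"],
    "storefront": ["billing-app"],
    "static-site": [],
}

if __name__ == "__main__":
    print(sorted(cascade(DEPENDS_ON, {"orders-db"})))
```

In this toy stack, losing the database alone takes the billing application and the storefront down with it, while the independent static site survives. Real systems have retries, caches, and timeouts that soften or delay this effect, so treat the model as an upper bound on the blast radius.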

Recovery of the customers' crashed apps and databases required hours. Each customer needed its own data and apps restored, and those that were still running probably had to be shut down and then restarted in proper sequence. Servers had to be checked for damage after their "crash" shutdowns. Apps and data that had successfully cut over or failed over to secondaries had to be cut over again, from the secondaries back to the primaries, and I'll bet there were further failures as that happened.
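
"Proper sequence" here means that nothing starts until everything it depends on is already up. One way to compute such an order is a topological sort. This is a minimal sketch using Python's standard `graphlib` module (Python 3.9 or later); the service names and the `restart_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter


def restart_order(depends_on: dict[str, list[str]]) -> list[str]:
    """Return a start-up order in which every service appears after all of
    its dependencies. Raises graphlib.CycleError on circular dependencies."""
    # TopologicalSorter expects a mapping of node -> predecessors,
    # which is exactly "service -> things that must start first".
    return list(TopologicalSorter(depends_on).static_order())


# Hypothetical customer stack: each service lists what it depends on.
DEPENDS_ON = {
    "orders-db": [],
    "billing-app": ["orders-db"],
    "storefront": ["billing-app", "orders-db"],
}

if __name__ == "__main__":
    for step, service in enumerate(restart_order(DEPENDS_ON), start=1):
        print(f"{step}. start {service}")
```

Reversing the same list gives a safe shutdown order. Working this out in advance, rather than in the middle of an outage, is part of what separates a recovery measured in minutes from one measured in hours.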

The Solution: Make Your Datacenters Expendable

Many of the apps and data on the system were undoubtedly protected by failover and backup recovery architecture, or by one of the Hosting.com business continuity solutions. Many of these certainly continued running as they successfully failed over to their secondaries. But it is equally clear that apps and data for about 1,100 customers (1.7% of the total Hosting.com customer base) did not continue running. Either they were not equipped with adequate business continuity systems, or the failovers failed. One writer quotes Zeile as saying that although Hosting.com offers a backup option, "few customers, at the affected location, had elected to purchase it" (source).

I am unaware of any hosting or cloud provider who publicly promises 100% uptime. So the customer must expect to have some amount of downtime, if only for maintenance. Logically, customers need to provide adequate business continuity systems to protect themselves.

Datacenters go offline all the time for any number of reasons. Thus, your business needs to be able to continue talking to customers, sending billing statements, shipping goods, and paying creditors despite untoward events like power outages, fires, human error, hardware failure, and so forth.

ZeroNines does not recommend or use a failover- or backup-based recovery paradigm. We take a different approach aimed at preventing downtime in the first place, rather than recovering from it afterward. In the case of the Hosting.com outage, Always Available™ architecture from ZeroNines offers two solution scenarios:
1) Hosting.com already operates multiple geographically separated datacenters. Always Available architecture would allow processing to continue on any or all of the remaining five when any one of them goes down.
2) Hosting.com customers could deploy their own Always Available array that would simultaneously replicate all transactions and data on other Hosting.com datacenters, or in other clouds or with other providers.

In either case, the end user experiences no downtime because the remaining nodes continue processing as usual. The offending datacenter simply drops out of the array until power is restored or until your staff can repair it. The other nodes of the Always Available array will update the damaged node once it is functioning again, and bring it to an identical logical state.
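
The general idea of an array whose nodes all process every transaction, with a failed node dropping out and catching up later, can be illustrated with a toy model. To be clear, this is not the ZeroNines implementation, which this post does not describe. It is a minimal sketch that assumes a single ordered transaction log and in-memory key-value state, and every class and method name in it is hypothetical.

```python
class Node:
    """One member of the array, holding its own full copy of the data."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.online = True
        self.state: dict[str, str] = {}
        self.applied = 0  # how many log entries this node has applied


class Array:
    """Toy multi-node array: every write goes to every online node."""

    def __init__(self, names: list[str]) -> None:
        self.nodes = {name: Node(name) for name in names}
        self.log: list[tuple[str, str]] = []  # ordered (key, value) writes

    def _catch_up(self, node: Node) -> None:
        # Replay every log entry the node has not yet seen, in order.
        for key, value in self.log[node.applied:]:
            node.state[key] = value
        node.applied = len(self.log)

    def write(self, key: str, value: str) -> None:
        live = [node for node in self.nodes.values() if node.online]
        if not live:
            raise RuntimeError("no nodes available")
        self.log.append((key, value))
        for node in live:
            self._catch_up(node)

    def fail(self, name: str) -> None:
        self.nodes[name].online = False  # node drops out of the array

    def rejoin(self, name: str) -> None:
        node = self.nodes[name]
        node.online = True
        self._catch_up(node)  # bring it to the same logical state as its peers


if __name__ == "__main__":
    array = Array(["newark", "denver", "dallas"])
    array.write("order-1", "paid")
    array.fail("newark")                # power is lost in one datacenter
    array.write("order-2", "shipped")   # the remaining nodes keep processing
    array.rejoin("newark")              # power restored; the node is updated
    print(array.nodes["newark"].state == array.nodes["denver"].state)
```

The design point is that no node is a "primary" that the others stand in for, so there is no failover step and no later cut-back to go wrong. A production system would also have to deal with ordering across sites, network partitions, and durable storage, none of which this sketch attempts.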

Visit the ZeroNines website to find out more about how our disaster-proof architecture protects businesses of any description from downtime.

Alan Gin – Founder & CEO, ZeroNines