The Cost of Cloud Outages and Planning for the Next One

At least one customer is publicly abandoning the Amazon EC2 cloud after two power-related outages within a month. Online dating site WhatsYourPrice.com is walking out, going back to a more traditional local hosting provider (source). And so the heat is still on in the East, with high temperatures still making life difficult and with the Amazon cloud beginning to lose customers.

What is the Real Cost of Cloud Outages?

This question is difficult to answer, as little hard data is available. Cloud providers and cloud customers keep the financial numbers closely guarded, whether we are talking about the customer's lost business and recovery costs, or Amazon's losses due to lost revenue, restitution paid to customers, customer attrition, and the increased difficulty of acquiring new customers.

Coincidentally, a report on cloud outages among top providers came out on June 22, 2012. In it, the Paris-based International Working Group on Cloud Computing Resiliency (IWGCR) claims that outages between 2007 and 2011 among the 13 providers it reviewed cost a total of $70 million. The estimates were based on "hourly costs accepted in the industry" (source).

Downtime and availability rates are reported for Amazon Web Services (AWS), Microsoft, Research in Motion (RIM), and others. Total downtime was 568 hours, and availability was 99.917%, nowhere near the five nines (99.999%) that is becoming the de facto target for acceptable uptime.
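To put those percentages in concrete terms, here is a small Python sketch that converts an availability figure into expected downtime per year. It assumes a plain 8,760-hour year (no leap days), and the function name is my own, not something taken from the IWGCR report.

```python
# Convert an availability percentage into expected downtime per year.
# Assumption: a 365-day year (8,760 hours); leap years are ignored.
HOURS_PER_YEAR = 365 * 24


def annual_downtime_hours(availability_percent: float) -> float:
    """Return the hours of downtime per year implied by an availability figure."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for label, pct in [("reported average", 99.917), ("five nines", 99.999)]:
        hours = annual_downtime_hours(pct)
        print(f"{label}: {pct}% -> {hours:.2f} hours/year ({hours * 60:.1f} minutes/year)")
```

Under that assumption, 99.917% works out to a little over 7 hours of downtime per year, while five nines allows only about 5 minutes.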

Although it's useful to draw a line in the sand and publish studies on the cost of outages, such surveys are virtually impossible to do accurately. I don't think that even big analysts like Gartner can get a good view into the real costs. Unfortunately, the cited article does not make it clear whether the $70 million was the cost to providers, to customers, or both. The sample size was also small, and there is apparently no information about the size of the customers affected. I know first-hand that many of our clients claim their downtime costs start at $6 million per hour and average $18-24 million per hour. In addition, apparently only outages that made the news were included in the report, which leaves a lot of actual downtime out of the equation, such as small glitches and maintenance downtime that journalists don't hear about. Because of all this, I believe the actual costs must be higher.

Despite all the unavoidable barriers to accurate measurement, such studies are still valuable because they highlight the bottom-line impacts. They also demonstrate just how difficult it is to estimate the cost of downtime.

40% of Cloud Users are Not Prepared

"Light, medium, and heavy cloud users are running clouds where on average 40 percent of their cloud — data, applications, and infrastructure — is NOT backed up and exposed to outage meltdown" (source).

Cameron Peron, VP Marketing at Newvem, a cloud optimization consultancy that specializes in the Amazon cloud, said this a couple of weeks ago. He was referring to his company's clients and the June 15 outage.

If the average cloud customer is anything like these companies, then it is no wonder that cloud outages are such a concern. Another writer referred to this kind of planning (or lack of planning) as "stupid IT mistakes" (source).

Who's at Fault?

So when a cloud customer experiences downtime and loses money, who is actually to blame? The cloud provider who failed to deliver 100% uptime, or the cloud customer who was unprepared for the unavoidable downtime?

According to Peron, "Amazon doesn’t make any promises to back up data... The real issue is that many users are under the impression that their data is backed up… but in fact it isn’t due to mismanaged infrastructure configuration." (source)

Cloud customers need to be prepared to use best practices for data protection and disaster prevention/recovery. They need to remember that a cloud is just a virtual datacenter. It is a building crammed full of servers, each of which is home to a number of virtual servers. And all of it is subject to the thousand natural (and unnatural) shocks that silicon is heir to.

Cloud Customers Need to Take Responsibility for Continuity

So here's some friendly advice to WhatsYourPrice.com and others like them: whatever hosting model you choose, get your disaster plan in place. An outage is out there, waiting for you in the form of a bad cooling fan, corrupt database, fire, flood, or human error, whether you are in the cloud, on your own virtual servers, or with a local hosting provider.

Best practices and DR discipline should not be taken for granted simply because the datacenter is outsourced. Many companies that I meet with have gone to the cloud to cut costs, and many of those are reinvesting their savings into providing higher availability. They're looking ahead, trying to avoid disasters, outcompete based on performance, and support customer satisfaction.

Even before the advent of the cloud, new generations of low-cost compute models enabled disaster recovery standards that could prevent a lot of downtime. But those standards are often poorly executed or ignored altogether. And now that it is so easy to outsource hosting to "the cloud", it is even easier for companies to shake off responsibility for business continuity, assuming or hoping the folks behind the curtain will take care of everything.

I say that the primary responsibility for outages is the customer's. If you're providing a high-demand service you need to be ready to deliver. It doesn't matter if you can legitimately blame your provider after a disaster; your customers will blame you.

Your cloud (or other hosting) provider no doubt promises a certain amount of uptime in its service level agreement. Let's imagine that it allows one hour of downtime per year. A single minor problem on the provider's side might cause downtime of just a few minutes. But if your systems are not prepared, that interruption could corrupt databases, lose transactions in flight, crash applications, and wreak all manner of havoc. Their downtime glitch will become your costly business disaster unless you are prepared in advance to control it on your end.
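A rough way to see the gap between the provider's numbers and yours is sketched below in Python. The one-hour allowance is the imagined SLA from the paragraph above. The 5-minute glitch and the 4-hour recovery time are purely hypothetical values chosen for illustration, as are the function names.

```python
# Compare a provider's SLA allowance with the outage a customer actually experiences.
# Assumption: a 365-day year (8,760 hours); leap years are ignored.
HOURS_PER_YEAR = 365 * 24


def availability_percent(downtime_hours_per_year: float) -> float:
    """Return the availability percentage implied by a yearly downtime allowance."""
    return 100 * (1 - downtime_hours_per_year / HOURS_PER_YEAR)


def customer_outage_minutes(provider_outage_minutes: float,
                            recovery_minutes: float) -> float:
    """Total service interruption: the provider's outage plus your own recovery work."""
    return provider_outage_minutes + recovery_minutes


if __name__ == "__main__":
    sla = availability_percent(1.0)  # the imagined one-hour-per-year SLA
    print(f"SLA allowing 1 hour/year of downtime: {sla:.4f}% availability")

    # Hypothetical illustration values, not measurements.
    provider_glitch = 5.0        # minutes the provider is actually down
    unprepared_recovery = 240.0  # minutes spent repairing databases, restarting apps
    total = customer_outage_minutes(provider_glitch, unprepared_recovery)
    print(f"Provider down {provider_glitch:.0f} min, customer down {total:.0f} min")
```

In this sketch the provider stays comfortably inside its SLA, yet the unprepared customer is down for several hours. The recovery time is the part you control.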

It's like a tire blowing out on a car; the manufacturer may be responsible to a degree for a wreck, but if your seatbelt was not fastened then all bets are off. Safety systems are there for a reason.

Make it so Any Datacenter is Expendable

ZeroNines offers a solution that allows any datacenter to be lost completely without causing service downtime. We believe that if WhatsYourPrice.com had been using our Always Available™ architecture, their dating service would have continued to operate at full capacity for the duration of the outage, with no impact to the customer experience.

Visit the ZeroNines website to find out more about how our disaster-proof architecture protects businesses of any description from downtime. You may also be interested in our whitepaper "Cloud Computing Observations and the Value of CloudNines™".

Alan Gin – Founder & CEO, ZeroNines

July 12, 2012