July 30, 2012

Azure and Google and Twitter, Oh My!

The flying monkeys attacked in force on Thursday, July 26, 2012, taking down services from cloud leaders Microsoft, Google, and Twitter. It seems the cloud is no Yellow Brick Road after all, guiding merry executives to some imagined Oz where the sun always shines and outages never happen.

The more I talk to IT planners, the more I find they are looking at reinvesting their cloud savings in business continuity. They rightly hope to compete on reliability, and to protect their businesses by trading the extremely high and unpredictable costs of outages for the predictable, low costs of the cloud and business continuity. They've clearly got the right idea, especially when you consider the noise outages like Thursday's can make.

Microsoft Azure
  • Area affected: Western Europe, served by the Dublin and Amsterdam datacenters.
  • Duration: About two and a half hours.
  • Cause: Unspecified, but one expert suspects infrastructure troubles.
  • Effects: Loss of cloud service throughout Western Europe. Businesses like SoundGecko were unavailable.
  • Source: WebTechInfo.com
"In Azure’s case on Thursday, the constant availability of power and lack of a software culprit, such as the Feb. 29 one that downed several services, points to 'more of an infrastructure issue' where some undetected single point of failure in a network switch or other device temporarily disrupted availability, said Chris Leigh-Currill, CTO of Ospero, a supplier of private cloud services in Europe." (source)

Google Talk
  • Area affected: Worldwide (source).
  • Duration: About five hours.
  • Cause: Unspecified, but one expert suspects a bad hardware or software upgrade.
  • Effects: Service unusable; users could sign in but received only error messages.
  • Source: TechNewsWorld.com
"Outages like this 'often happen as the result of a hardware or software upgrade that wasn't properly tested before installation,' Rob Enderle, principal analyst at the Enderle Group, told TechNewsWorld" (source).

Twitter
  • Area affected: Worldwide.
  • Duration: About an hour and possibly more in some areas.
  • Cause: Datacenter failure and a failed failover.
  • Effects: "Users around the world got zilch from us."
  • Source: CNN
"Twitter's vice president of engineering, Mazan Rawashdeh… blamed the outage on the concurrent failure of two data centers. When one fails, a parallel system is designed to take over -- but in this case, the second system also failed, he said" (source).

Am I Repeating Myself? Am I Repeating Myself?

My last blog post was titled "A Flurry of July Outages – And All of them Preventable". On Thursday we had another flurry, and all of them in the cloud. And I have to say again, with caution, that these too were preventable. I am cautious because we don't know for sure whether Mr. Leigh-Currill and Mr. Enderle are correct in their assessments of the Azure and Google service interruptions, but if they are, these outages were almost certainly preventable.

And the Twitter outage was clearly a case of failover failing us once again. That's what it is when "a parallel system that is designed to take over" fails to do so when the primary goes down. It's the same story we see over and over again.
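To make the fragility concrete, here is a minimal, purely hypothetical sketch of the failover pattern described above. The node names and health flags are illustrative assumptions, not anyone's actual architecture; the point is simply that everything hinges on one moment of truth, when the primary dies and the standby must be healthy right then, or users get nothing.

class Node:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def handle(self, request):
        if not self.healthy:
            raise RuntimeError(self.name + " is down")
        return self.name + " processed " + request


def failover_handle(primary, standby, request):
    # Try the primary; promote the standby only after the primary fails.
    # The weak point: the scheme assumes the standby is healthy at the
    # exact moment it is finally asked to do real work.
    try:
        return primary.handle(request)
    except RuntimeError:
        # Single decision point. If the standby is also broken (as in the
        # Twitter incident above), every request fails.
        return standby.handle(request)


if __name__ == "__main__":
    primary = Node("primary-dc", healthy=False)   # primary datacenter fails...
    standby = Node("standby-dc", healthy=False)   # ...and so does its backup
    try:
        print(failover_handle(primary, standby, "tweet #1"))
    except RuntimeError as err:
        print("Outage: " + str(err))  # users around the world get zilch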

After all these years in the industry, I am still amazed that the biggest tech names in the world continue to rely on the ancient failover paradigm. The leaders of these organizations are trusting their reputations, their revenue, their shareholders' profits, their customers' businesses, and potentially people's lives to a disaster recovery process that quite likely won't work.

What Some Companies are Already Doing About It

Earlier in this post I mentioned the companies that are reinvesting cloud savings in reliability. These are typically smaller, more agile companies, and because they are thinking beyond mere cost savings and efficiency, they'll be able to take on the tech behemoths on exactly that ground. They see the cloud as a means of creating their own "failsafe" hosting paradigms.

As most readers of this blog know, Always Available™ technology from ZeroNines replaces failover and backup-based recovery. It is already cloud-friendly. The Twitter outage is exactly the kind of thing we prevent. Companies hosting an Always Available array on Azure would have had virtually no risk of downtime, because all network transactions would have continued processing on other nodes. Always Available prevents data disasters before they happen, whether a power supply fails, or software gets corrupted, or a tornado picks up your Kansas datacenter and relocates it to Munchkinland, or someone melts your servers with a bucket of water, or the flying monkeys carry off the last of your support staff. Why clean up after an expensive disaster if you can prevent it in the first place?
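For contrast with the failover sketch above, here is an equally hypothetical illustration of the general active-active idea this paragraph describes: every transaction is dispatched to all live nodes, so the loss of any one node, datacenter, or cloud never interrupts processing. The node names are made up and this is a conceptual example only, not ZeroNines' actual implementation.

class Node:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def process(self, txn):
        if not self.healthy:
            raise RuntimeError(self.name + " unavailable")
        return self.name + ": committed " + txn


def process_everywhere(nodes, txn):
    # Send the transaction to every node; succeed if any node commits it.
    results, failures = [], []
    for node in nodes:
        try:
            results.append(node.process(txn))
        except RuntimeError as err:
            failures.append(str(err))
    if not results:
        raise RuntimeError("all nodes down: " + "; ".join(failures))
    return results


if __name__ == "__main__":
    # One node in each of three environments; the Azure node is down.
    nodes = [
        Node("azure-west-europe", healthy=False),
        Node("aws-us-east"),
        Node("private-dc"),
    ]
    for line in process_everywhere(nodes, "order #42"):
        print(line)  # the surviving nodes keep processing; no outage

The trade-off is that every node does the work, but there is no single promotion step that can fail at the worst possible moment.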

Visit the ZeroNines website to find out more about how our disaster-proof architecture protects businesses of any description from downtime.

Alan Gin – Founder & CEO, ZeroNines
