The more I talk to IT planners, the more I find they are looking at reinvesting their cloud savings in business continuity. They rightly hope to compete on reliability, and to protect their businesses by trading the extremely high and unpredictable costs of outages for the predictable, low costs of the cloud and business continuity. They've clearly got the right idea, especially when you consider the noise that outages like yesterday's can make.
Microsoft Azure
- Area affected: Europe, via the Dublin datacenter and Amsterdam facility.
- Duration: About two and a half hours.
- Cause: Unspecified, but one expert suspects infrastructure troubles.
- Effects: Loss of cloud service throughout Western Europe. Businesses like SoundGecko were unavailable.
- Source: WebTechInfo.com
Google Talk
- Area affected: Worldwide.
- Duration: About five hours.
- Cause: Unspecified, but one expert suspects a bad hardware or software upgrade.
- Effects: System unusable, granting access but providing only error messages.
- Source: TechNewsWorld.com
Twitter
- Area affected: Worldwide.
- Duration: About an hour and possibly more in some areas.
- Cause: Datacenter failure and a failed failover.
- Effects: "Users around the world got zilch from us."
- Source: CNN
Am I Repeating Myself? Am I Repeating Myself?
My last blog post was titled "A Flurry of July Outages – And All of them Preventable". On Thursday we had another flurry, and all of them in the cloud. And I have to say again, with caution, that these were preventable. I am cautious because we don't know for sure whether Mr. Leigh-Currill and Mr. Enderle are correct in their assumptions about the Azure and Google service interruptions, but if they are, then these outages were almost certainly preventable.
And the Twitter outage was clearly a case of failover failing us once again. That's what it is when the "parallel system that is designed to take over" fails to do so once the primary goes down. It's the same story we see over and over again.
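To make that concrete, here is a minimal Python sketch of the traditional failover loop. The endpoints and the promote/alert helpers are hypothetical placeholders, not Twitter's (or anyone's) actual setup; the point is only to show where the paradigm is fragile.

```python
import time
import urllib.request

# Hypothetical endpoints for illustration only; not anyone's real system.
PRIMARY_HEALTH_URL = "https://primary.example.com/health"
STANDBY_HEALTH_URL = "https://standby.example.com/health"

def is_healthy(url, timeout=2):
    """Return True if the node answers its health check in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

def promote(url):
    """Placeholder: repoint DNS or the load balancer at the standby."""
    print(f"Promoting {url} to primary")

def alert_operators():
    """Placeholder: page a human, because automated recovery has nothing left."""
    print("Standby is also unhealthy; manual intervention required")

def monitor_and_fail_over(poll_seconds=10):
    """Classic failover loop: watch the primary, promote the standby on failure.

    The weakness is built in: recovery starts only after the primary is
    already down, and succeeds only if the standby (and this monitor)
    happen to be healthy at that exact moment.
    """
    while True:
        if not is_healthy(PRIMARY_HEALTH_URL):
            if is_healthy(STANDBY_HEALTH_URL):
                promote(STANDBY_HEALTH_URL)
            else:
                alert_operators()  # the "parallel system" did not take over
        time.sleep(poll_seconds)   # downtime accrues between checks
```

Nothing useful happens here until the primary has already failed, and the whole recovery hinges on the standby and the monitor being healthy at that one moment.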
After all these years in the industry I am still amazed that the biggest tech names in the world continue to rely on the ancient failover paradigm. The leaders of these organizations are trusting their reputations, their revenue, their shareholders' profits, their customers' businesses, and potentially people's lives to a disaster recovery process that quite likely won't work.
What Some Companies are Already Doing About It
Earlier in this post I mentioned the companies that are reinvesting cloud savings into reliability. These are typically smaller, more agile companies. They'll be able to take on the tech behemoths based on reliability, because they are thinking beyond mere cost savings and efficiency. They see the cloud as a means of creating their own "failsafe" hosting paradigms.
As most readers of this blog know, Always Available™ technology from ZeroNines replaces failover and backup-based recovery. It is already cloud-friendly. The Twitter outage is exactly the kind of thing we prevent. Companies hosting an Always Available array on Azure would have had virtually no risk of downtime, because all network transactions would have continued processing on other nodes. Always Available prevents data disasters before they happen, whether a power supply fails, or software gets corrupted, or a tornado picks up your Kansas datacenter and relocates it to Munchkinland, or someone melts your servers with a bucket of water, or the flying monkeys carry off the last of your support staff. Why clean up after an expensive disaster if you can prevent it in the first place?
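For contrast with the failover sketch above, here is a rough Python sketch of active-active processing, assuming hypothetical node URLs. It illustrates the general idea of every transaction being handled by several live nodes at once, not the specifics of the Always Available product.

```python
import concurrent.futures
import urllib.request

# Hypothetical node URLs. This is a rough illustration of the general
# active-active idea, not the actual Always Available implementation.
NODES = [
    "https://node-eu.example.com/api/transactions",
    "https://node-us.example.com/api/transactions",
    "https://node-ap.example.com/api/transactions",
]

def send_to_node(url, payload, timeout=2):
    """Post one transaction to one node; raise on any failure."""
    req = urllib.request.Request(url, data=payload, method="POST")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()

def process_everywhere(payload):
    """Send the same transaction to every live node at once.

    There is no switchover moment: each node is already active and already
    processing. Any one healthy node is enough to serve the caller, and a
    dead node simply contributes nothing until it recovers.
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(NODES)) as pool:
        futures = [pool.submit(send_to_node, url, payload) for url in NODES]
        for future in concurrent.futures.as_completed(futures):
            try:
                return future.result()   # first successful node wins
            except Exception:
                continue                 # that node is down; try the rest
    raise RuntimeError("every node failed, which would be a true multi-site disaster")

# Example: process_everywhere(b'{"transaction": "hello world"}')
```

In that kind of arrangement there is nothing to "fail over" to; losing the Dublin and Amsterdam nodes would simply leave the remaining nodes carrying the load.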
Visit the ZeroNines website to find out more about how our disaster-proof architecture protects businesses of any description from downtime.
Alan Gin – Founder & CEO, ZeroNines