January 22, 2010

Twitter Grows Up and then Falls Down

Do you Twitter? Or Tweet? Or whatever they call it? Gotta admit, I don’t. So I didn’t really pay much attention when I first heard that Twitter had gone down the other day [source]. Life goes on. But what a good thing (I thought) that this had not happened a week ago when Twitter took the spotlight on the world stage as it helped gather money for earthquake relief in Haiti.

That made positive headlines everywhere. But if the outage had occurred during the first critical hours or days of the relief effort, a self-righteous world would instead have sneered at Twitter for having failed, despite the fact that Twitter was never billed as a source of disaster relief. This is a window into an important reality: you’d better plan uptime into your system because you never know when you will be caught in the spotlight.

Background: Twitter Comes of Age

Twitter is a microblogging service that allows short messages of up to 140 characters to be sent to a subscriber’s followers, or be made available to the Twitter community at large. Millions of people use Twitter every day. Data reported in Wikipedia [source] show that over 75% of the messages on Twitter are either “conversational” or “pointless babble.” A small but powerful percentage of messages serve far more serious purposes. Twitter was drafted into service for political campaigning, education, public relations, and emergencies long before the Haiti earthquake. But I see its Haiti relief efforts as the moment it came of age, when Twitter was first used to mobilize money on a mass worldwide scale for a focused, responsible, humanitarian purpose.

The Problem: A Failover Failure

On the morning of Wednesday, January 20, 2010, Twitter became virtually inaccessible. According to InformationWeek, "A sudden failure coupled with problems in switching to a backup system produced a high number of errors for around 90 minutes" [source]. In other words, an unspecified failure in one place forced the system to rely on its “failover” architecture, which in turn failed. This is a classic failover failure.

The Cost: Hard to Quantify but Scary to Contemplate

Because Twitter’s service is free, the outage may carry no direct cost for the company. Indirectly, though, it hurts Twitter’s prospects for obtaining venture capital, building a positive public image, and eventually making money from paid services.

And here’s where we get into very uncertain territory. What would the cost have been if the outage had happened just a few days earlier? Would millions of dollars in aid to Haiti have been delayed or failed to materialize? Would people who were saved by this aid have died? Possibly. At the very least, Twitter would have faced a PR storm far more serious than the one the January 20 outage caused.

The Solution: Ditch the Failover

Failover (also known as cutover) is the de facto recovery solution for dealing with IT disasters, but it contains inherent flaws that often prevent it from working at the very moment it is needed. Vast numbers of companies and other organizations in the U.S. and around the world rely on failover to keep them functioning in the event of their own disasters, be they failed server equipment or regional catastrophes.

ZeroNines has designed an Always Available™ business continuity architecture that does away with failover entirely. No backup system ever needs to kick in with only microseconds of notice. Instead, all network transactions are processed continually, simultaneously, and equally in multiple locations. In short, Always Available is designed to prevent disasters like Twitter’s 90-minute outage on Wednesday. Instead of relying on a cutover event to succeed, Always Available simply continues running the same apps and data at one of two or three additional locations, with no interruption to the user.
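For readers who think in code, here is a minimal Python sketch of the general idea of processing every transaction at several locations at once. It is an illustration only, not the ZeroNines implementation: the Node class, the process_everywhere function, and the site names are all hypothetical, and the sketch ignores how a recovered location would catch up on transactions it missed.

```python
class Node:
    """One processing location with its own copy of the data (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.log = []  # transactions this location has processed

    def process(self, txn):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        self.log.append(txn)
        return f"{txn} ok"


def process_everywhere(nodes, txn):
    """Send the transaction to every location; any one success is enough."""
    results = []
    for node in nodes:
        try:
            results.append(node.process(txn))
        except ConnectionError:
            continue  # a dead location is simply skipped; nothing has to "cut over"
    if not results:
        raise RuntimeError("all locations are down")
    return results[0]


# Three illustrative locations, all active at the same time.
nodes = [Node("site-a"), Node("site-b"), Node("site-c")]
process_everywhere(nodes, "tweet-1")

nodes[0].up = False  # simulate losing one location
reply = process_everywhere(nodes, "tweet-2")  # the user still gets an answer
```

The point of the sketch is that the surviving locations were already doing the work before the failure, so there is no switch to a backup that can itself go wrong.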

A Parting Thought

Systems like Twitter’s, which rely on failover in the event of disasters, currently form the backbone of business and government information systems. Suppose that an earthquake like Haiti’s struck somewhere in the U.S. (which one day it will) and knocked out data centers, communications, and other key infrastructure. If the failover systems fail as they failed Twitter (which they will), then what is the prospect for marshalling aid within our own borders? A scary thing to consider.

Visit the ZeroNines website to find out more about how our disaster-proof architecture can protect businesses (and government agencies) of any description from downtime.

Alan Gin – Founder & CEO, ZeroNines