Already being called one of the largest data failures in recent memory, October’s Sidekick disaster was actually two disasters rolled into one. First, the cloud-based service suffered an outage which stranded thousands of users. Second, the backup/storage system failed and erased the personal data of thousands of users. Every failure like this leads to a round of hand-wringing over the cloud, and this one is no different. It underscores the need for a far more robust cloud architecture, where a failure in one area is truly isolated from the rest of the system and can’t cause an outage.
Background: Sidekick and the Cloud
The Sidekick mobile device is developed by Microsoft subsidiary Danger and is sold and serviced by T-Mobile. It holds a special place in the hearts and hands of a select group of users because its QWERTY keyboard promises ease of use and its cloud-based data storage gives it the appearance of real go-anywhere, do-it-anytime utility. Unlike other handhelds such as the iPhone and BlackBerry, the Sidekick backs up personal data to cloud-based storage at Microsoft and not to your computer’s hard drive. And there’s the seed of the trouble.
The Problem: Hardware Failure Leads to Database Failure
It seems that beginning at about 1:30 AM on Friday, October 2 [source], a “hardware failure… took out both the primary and backup copies of the database that contained Sidekick users' information.” [source] This apparently occurred during an upgrade to the Danger/Microsoft Storage Area Network [source]. When they discovered their Sidekicks weren’t working, many users reset their devices (some under instructions from T-Mobile customer service), which wiped the devices’ hard drives. Combined with the back-end server failure, this led to apparently permanent data loss for anyone who tried to reset their Sidekick.
The Cost to T-Mobile and Microsoft
This is going to cost millions. At least. T-Mobile halted sales of all Sidekicks shortly after the event and is compensating its affected users with a period of free data service [source]. There were the usual rants from users refusing to continue paying on their contracts, and news that T-Mobile was voluntarily letting anyone out of their contract who wanted out [source]. Lawsuits were filed [source]. Sarcasm and criticism run thick online. Whatever the actual facts, this is a marketing disaster of the greatest degree for T-Mobile and Microsoft. There is no way to calculate how many of the approximately 800,000 existing Sidekick customers [source] will jump ship, how many potential new customers will be lost, or what this means for Microsoft’s “Pink” project, the intended follow-on to the Sidekick [source].
The Solution: A Robust Cloud
ZeroNines’ CloudNines™ product enables the cloud to function as it is supposed to, by processing every transaction simultaneously and equally on multiple cloud-based network nodes in an Always Available™ configuration. In the Sidekick disaster, CloudNines would simply have cut off the node with the hardware failure. All processing would have continued on other geographically separated nodes that were running identical active instances of the affected applications and databases. The failure would have been contained. There would have been no service downtime, and no need for ill-advised attempts to reset individual Sidekicks.
Not only would the Sidekick applications have continued operating, but the databases would have too. There would have been no apparent loss of customer data. After the event, one author bitingly asked “But the question remains, why wasn't there a true independent backup of the data?” [source]. ZeroNines and Always Available technology would have made this a moot point.
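To make the idea concrete, here is a minimal toy sketch in Python of the general pattern described above: every write is applied to all active nodes, and a node that fails is cut off while the others keep serving. This is an illustration only and not the CloudNines implementation; the class names (`Node`, `ActiveActiveCluster`), the node names, and the sample data are all hypothetical, and a real system would also have to deal with networking, ordering, and resynchronizing a repaired node.

```python
class NodeFailure(Exception):
    """Raised by a node that can no longer process transactions."""


class Node:
    """Hypothetical stand-in for one geographically separate node."""

    def __init__(self, name):
        self.name = name
        self.data = {}        # this node's full copy of the database
        self.healthy = True   # flipped to False to simulate a hardware failure

    def apply(self, key, value):
        if not self.healthy:
            raise NodeFailure(self.name)
        self.data[key] = value


class ActiveActiveCluster:
    """Applies every write to all active nodes and isolates any that fail."""

    def __init__(self, nodes):
        self.active = list(nodes)
        self.isolated = []

    def write(self, key, value):
        survivors = []
        for node in self.active:
            try:
                node.apply(key, value)
                survivors.append(node)
            except NodeFailure:
                # Cut off the failed node; the remaining nodes carry on.
                self.isolated.append(node)
        self.active = survivors
        if not self.active:
            raise RuntimeError("no healthy nodes left")

    def read(self, key):
        if not self.active:
            raise RuntimeError("no healthy nodes left")
        # Any active node holds an identical copy, so read from the first.
        return self.active[0].data[key]


# Assumed example data: three nodes, one of which fails between two writes.
nodes = [Node("denver"), Node("virginia"), Node("oregon")]
cluster = ActiveActiveCluster(nodes)
cluster.write("contacts:alice", ["bob", "carol"])

nodes[0].healthy = False  # simulate the hardware failure on one node
cluster.write("contacts:alice", ["bob", "carol", "dave"])
```

In this sketch the second write still succeeds on the two surviving nodes, the failed node ends up in `cluster.isolated`, and reads continue to return the latest data, which is the containment behavior the paragraphs above describe.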
As of this writing, T-Mobile and Microsoft have announced that they “have recovered most, if not all, customer data” [source]. I can’t help but breathe a sigh of relief for them even though I am not a Sidekicker myself. But wouldn’t it have been far better to have avoided the problem in the first place?
Visit the ZeroNines website to find out more about how our disaster-proof architecture protects businesses of any description from downtime.
Alan Gin – Founder & CEO, ZeroNines