October 22, 2009

Enabling Cloud Confidence

A week ago I wrote about the Sidekick disaster and how events like that just keep doubts growing, pushing the wholesale adoption of the Cloud further away. This doubt has made it into the mainstream media, where it will taint the opinions of potential cloud users, both consumer and commercial. We at ZeroNines think we have the solution that will enable the cloud to perform as it needs to.

The core problem with outages is not the existence of hazards that can damage servers and knock elements of a network (cloud or otherwise) offline. Storms, fires, and equipment failure will always happen and there is no way to eliminate them. The real problem is the reliance of cloud providers on obsolete failover-based recovery paradigms that simply can’t maintain continuity when disaster does strike.

L.A. Times columnist David Sarno perfectly sums up the cloud’s tenuous situation in his October 18 article “Still hazy on cloud computers' security” [source]. “A series of incidents involving cloud computing over the last several months has poked holes in the hype bubble, raising questions about the cloud's dependability -- and whether it's ready for use by a broader group of workers and businesses.” He is right on target.

Meeting the Need to Fortify

As Sarno puts it, “As e-mail, word processing and data storage continue to move from users' computers to the Web, companies must fortify their servers from a variety of potential disasters -- natural and man-made -- to help ensure that the data and the applications are accessible at all times.” He quotes Google’s SEC filing:

"(Google’s) systems are vulnerable to damage or interruption from earthquakes, terrorist attacks, floods, fires, power loss, telecommunications failures, computer viruses, computer denial of service attacks" as well as sabotage and vandalism...

The good news is that today, ZeroNines' Always Available™ CloudNines™ technology can fortify servers from damage or interruption from earthquakes, terrorist attacks, floods, fires, power loss, telecommunications failures, computer denial of service attacks, as well as sabotage and vandalism. We leave the viruses to others to deal with, but we can add most types of routine maintenance, unplanned maintenance, data migrations, equipment upgrades, software upgrades, and a number of other potential causes of downtime.

Forget Failover

The IT world fatalistically believes that downtime is inevitable, and is something to be lived with and minimized if you’re fortunate. This view predominates because until now the only disaster recovery solution available has been the flawed failover paradigm, which everyone in IT knows can be a disaster unto itself. During a crisis or failover event, cutover can cause additional problems, downtime, and cascading application failures as computing switches from primary to backup systems.

But the IT world has it wrong. Disasters will happen and must be dealt with, but the downtime they cause can be prevented.

Always Available™ Means Virtually 100% Uptime

ZeroNines’ Always Available™ solution eliminates failover and backups, instead providing synchronous identical processing on multiple cloud nodes geographically separated by thousands of miles. If a storm wipes out your East Coast cloud, CloudNines enables processing to continue on clouds in other parts of the country and around the world. If you need to upgrade server software, you can isolate one cloud node, do your upgrade, and bring it back online once it is stable. Our technology has journaling and updating features to assure that all transactions are completed and that any cloud node that goes offline is brought up to the most accurate logical state once it comes back online.
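For readers who think in code, here is a minimal sketch of the general idea: every transaction is applied on all online nodes, and a journal lets an isolated node catch up when it returns. This is an illustration only, not the CloudNines implementation. The Node and Cluster classes, the node names, and the account-balance transactions are all hypothetical, and a real system would also have to handle networking, ordering, and concurrency.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One hypothetical cloud node holding a full copy of the application state."""
    name: str
    online: bool = True
    balances: dict = field(default_factory=dict)
    applied: int = 0  # how many journal entries this node has applied so far

    def apply(self, tx):
        account, delta = tx
        self.balances[account] = self.balances.get(account, 0) + delta
        self.applied += 1


class Cluster:
    """Applies each transaction on every online node and journals it for the rest."""

    def __init__(self, names):
        self.nodes = {name: Node(name) for name in names}
        self.journal = []  # ordered record of every accepted transaction

    def submit(self, account, delta):
        live = [node for node in self.nodes.values() if node.online]
        if not live:
            raise RuntimeError("no node is online to process the transaction")
        tx = (account, delta)
        self.journal.append(tx)
        for node in live:
            node.apply(tx)

    def isolate(self, name):
        # Take one node out of service, for example for a software upgrade.
        self.nodes[name].online = False

    def rejoin(self, name):
        # Replay the journal entries the node missed, then put it back in service.
        node = self.nodes[name]
        for tx in self.journal[node.applied:]:
            node.apply(tx)
        node.online = True


cluster = Cluster(["east", "west", "europe"])
cluster.submit("alice", 100)
cluster.isolate("east")        # east goes offline; west and europe keep working
cluster.submit("alice", -30)
cluster.rejoin("east")         # east replays the transaction it missed
```

After the rejoin, all three nodes hold the same balance for the hypothetical "alice" account, which is the property the journaling and updating features described above are meant to guarantee.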

CloudNines can push application availability beyond the industry-accepted standard of 99.999% (five nines) to virtually 100%. In our ongoing test case, the ZeroNines MyFailSafe environment has never experienced any downtime at all, for any reason. It went live in July 2004 and has since had individual network nodes knocked offline a number of times by hurricanes, power outages, server migrations, and other causes. All applications remained 100% available throughout.
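To put "five nines" in perspective, the short sketch below converts an availability figure into minutes of downtime per year; 99.999% works out to roughly 5.26 minutes. The second function is a textbook approximation, not a ZeroNines measurement: it assumes node failures are completely independent and that the service stays up as long as at least one node is up. The function names and the 99.9% per-node figure are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability):
    """Expected minutes of downtime per year for a given availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def combined_availability(per_node, node_count):
    """Availability when at least one of several nodes must be up.

    Assumes failures are independent, which real deployments only approximate.
    """
    return 1.0 - (1.0 - per_node) ** node_count


five_nines_downtime = downtime_minutes_per_year(0.99999)
three_node_availability = combined_availability(0.999, 3)  # hypothetical 99.9% nodes

print(f"Five nines allows about {five_nines_downtime:.2f} minutes of downtime a year")
print(f"Three independent 99.9% nodes: {three_node_availability:.9f}")
```

The independence assumption is the weak point of this arithmetic, and it is one reason geographic separation matters: nodes thousands of miles apart are far less likely to be hit by the same storm or power loss.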

Will ZeroNines eventually be recognized as a vital cloud-enabling technology? That remains to be seen, but you can bet that is how we see ourselves. If you want to find out how we can make the cloud a viable option for you, let me know.

Visit the ZeroNines website to find out more about how our disaster-proof architecture protects businesses of any description from downtime.

Alan Gin – Founder & CEO, ZeroNines

October 15, 2009

Sidekick: Two Disasters for the Price of One Really Big One

Already being called one of the largest data failures in recent memory, October’s Sidekick disaster was actually two disasters rolled into one. First, the cloud-based service suffered an outage which stranded thousands of users. Second, the backup/storage system failed and erased the personal data of thousands of users. Every failure like this leads to a round of hand-wringing over the cloud, and this one is no different. It underscores the need for a far more robust cloud architecture, where a failure in one area is truly isolated from the rest of the system and can’t cause an outage.

Background: Sidekick and the Cloud

The Sidekick mobile device is developed by Microsoft subsidiary Danger and is sold and serviced by T-Mobile. It holds a special place in the hearts and hands of a select group of users because its QWERTY keyboard promises ease of use and its cloud-based data storage gives it the appearance of real go-anywhere, do-it-anytime utility. Unlike other handhelds such as the iPhone and BlackBerry, the Sidekick backs up personal data to cloud-based storage at Microsoft and not to your computer’s hard drive. And there’s the seed of the trouble.

The Problem: Hardware Failure Leads to Database Failure

It seems that beginning at about 1:30 AM on Friday, October 2 [source], a “hardware failure… took out both the primary and backup copies of the database that contained Sidekick users' information.” [source] This apparently occurred during an upgrade to the Danger/Microsoft Storage Area Network [source]. When they discovered their Sidekicks weren’t working, many users reset their devices (some under instructions from T-Mobile customer service), which wiped the data stored on them. Combined with the back-end server failure, this led to apparently permanent data loss for anyone who reset a Sidekick.

The Cost to T-Mobile and Microsoft

This is going to cost millions. At least. T-Mobile halted sales of all Sidekicks shortly after the event and is compensating its affected users with a period of free data service [source]. There were the usual rants about users refusing to continue paying on their contracts, and news that T-Mobile was voluntarily letting anyone out of their contract who wanted out [source]. Lawsuits were filed [source]. Sarcasm and criticism run thick online. Whatever the actual facts, this is a marketing disaster of the greatest degree for T-Mobile and Microsoft. There is no way to calculate how many of the approximately 800,000 existing Sidekick customers [source] will jump ship, how many potential new customers will be lost, and what this means for Microsoft’s “Pink” project, the intended follow-on to the Sidekick [source].

The Solution: A Robust Cloud

ZeroNines’ CloudNines™ product enables the cloud to function as it is supposed to, by processing every transaction simultaneously and equally on multiple cloud-based network nodes in an Always Available™ configuration. In the Sidekick disaster, CloudNines would simply have cut off the node with the hardware failure. All processing would have continued on other geographically separated nodes that were running identical active instances of the affected applications and databases. The failure would have been contained. There would have been no service downtime, and no need for ill-advised attempts to reset individual Sidekicks.
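The containment idea can be sketched in a few lines: send each write to every active node, and if one node reports a hardware failure, drop it from the active set while the others carry on. This is a simplified illustration under my own assumptions, not a description of how CloudNines or the Sidekick back end actually works. StorageNode, HardwareFailure, write_everywhere, the city names, and the sample record are all hypothetical.

```python
class HardwareFailure(Exception):
    """Raised by a node that can no longer service requests."""


class StorageNode:
    """A hypothetical node holding a complete, active copy of the database."""

    def __init__(self, name):
        self.name = name
        self.records = {}
        self.broken = False  # set to True to simulate a hardware failure

    def write(self, key, value):
        if self.broken:
            raise HardwareFailure(self.name)
        self.records[key] = value

    def read(self, key):
        if self.broken:
            raise HardwareFailure(self.name)
        return self.records[key]


def write_everywhere(active_nodes, key, value):
    """Send the write to every active node and return the nodes that accepted it.

    A node that fails is left out of the returned list, so the failure is
    contained to that node and later requests go only to the survivors.
    """
    survivors = []
    for node in active_nodes:
        try:
            node.write(key, value)
            survivors.append(node)
        except HardwareFailure:
            pass  # cut the failed node off; the other nodes keep processing
    if not survivors:
        raise RuntimeError("every node failed; the write was not stored")
    return survivors


nodes = [StorageNode("seattle"), StorageNode("denver"), StorageNode("atlanta")]
nodes[0].broken = True  # simulate the hardware failure on one node

active = write_everywhere(nodes, "contacts:alice", ["555-0100"])
stored = active[0].read("contacts:alice")  # served by a surviving node
```

In this toy run the write still lands on two healthy nodes and can be read back from either, so the user never sees the failed one. A node that has been cut off would need to be repaired and resynchronized before rejoining.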

Not only would the Sidekick applications have continued operation, but the databases would too. There would have been no apparent loss of customer data. After the event, one author bitingly asked “But the question remains, why wasn't there a true independent backup of the data?” [source]. ZeroNines and Always Available technology would have made this a moot point.

As of this writing, T-Mobile and Microsoft have announced that they “have recovered most, if not all, customer data” [source]. I can’t help but breathe a sigh of relief for them even though I am not a Sidekicker myself. But wouldn’t it have been far better to have avoided the problem in the first place?

Visit the ZeroNines website to find out more about how our disaster-proof architecture protects businesses of any description from downtime.

Alan Gin – Founder & CEO, ZeroNines