What Did One BlackBerry User Say to the Other BlackBerry User?

Nothing, according to Twitter user @giselewaymes (source).

In what has to be every large enterprise IT manager's worst nightmare, a big, high-profile outage grew into a monster, expanded to global proportions, made headlines everywhere, and after three days seemed to have no end in sight. The cause was a failed failover that could have been avoided.

Background: RIM BlackBerry

BlackBerry is produced by Canadian firm Research In Motion (RIM). It is one of the leading smartphones among business users. Its real forte is encrypted mobile email and instant messaging. BlackBerry has about 70 million users worldwide (source). Several high-profile outages and many smaller ones have tarnished its reputation, and this week's outage seems to be pushing the company to the breaking point, if all the buzz on the Internet is to be believed.

The Problem: Failed failover

On Monday morning, October 10, 2011, millions of BlackBerry users in Europe, the Middle East, and Africa lost access to messenger, email, and Internet. The outage spread to every continent and may eventually have affected half of all BlackBerry users (source).

RIM explained things to some degree on their website on Tuesday, October 11: "The messaging and browsing delays that some of you are still experiencing were caused by a core switch failure within RIM’s infrastructure. Although the system is designed to failover to a back-up switch, the failover did not function as previously tested" (source).

In other words, their failover-based disaster recovery system failed. It can be inferred that this led to cascading failures that knocked out other systems in other regions, leading to this worldwide problem. As of Wednesday evening the 12th it was still not fully resolved, with an interesting update posted on their site outlining the status in various parts of the world (source). By Thursday morning it looked like things were finally under control, with service almost back to normal in most areas.

The Cost: Paid compensation and a blow to the business

I don't doubt that RIM will compensate users in one way or another, perhaps in the form of free service (which seems to be the industry's de facto compensation currency). RIM Co-CEO Jim Balsillie said that such a step would be considered but that their immediate focus was fixing the problem (source).

More damaging is the additional blow to RIM's reputation. Lots of users are claiming on Facebook, Twitter, and other online forums that this is the last straw and that they will quit BlackBerry. For many this may be a hollow threat, but there is genuine peril here. "This outage… comes at a particularly bad time for RIM, since it faces increasing competition in the smartphone market… Apple's iPhone and phones on the Google Android operating system have been gaining ground, and the new iPhone 4S goes on sale Friday (October 14)" (source).

The cost can be high outside of RIM as well. "The outage caught much of D.C. off guard Wednesday and underscored the region’s reliance on the BlackBerry — which is still the only federally approved smartphone for employees in some government agencies" (source).

As for RIM itself, back in June there was a flurry of articles suggesting RIM was potentially facing bankruptcy (source). And this week there have been a number of stories about growing momentum for a RIM breakup or merger (source). Even a massive outage like this is unlikely to cause the demise of a large and important firm, but combined with other woes like a less-than-competitive product and a poor business model it could well be the deciding factor.

The Solution: Eliminate failover systems

RIM is in trouble for a number of reasons, but downtime like this does not need to be one of them. I contend that the core problem was not a failed switch but a failed failover. Switches will fail and there is no avoiding that. If you can architect the perfect switch, I invite you to do so and you'll be richer than Bill Gates. It's what happens after the inevitable switch malfunction (or other disaster) that matters most. Failover systems will fail too. RIM's apparently worked fine during a test, but the strain and chaos of a real-world crisis were too much for it. At ZeroNines, we propose eliminating the failover systems in favor of something that will turn failures into virtual non-events.

ZeroNines' Always Available™ technology eliminates the need for failover, processing the same applications and data simultaneously on multiple servers, clouds, and virtual servers separated by thousands of miles. All servers are hot, and all applications are active. So if a switch fails in one network instance there is no need for a risky failover to another. Other instances are already processing the same transactions in parallel and simply continue processing as if nothing had happened. Once the problem with the switch is rectified, that instance is brought back into the Always Available array, is automatically updated, and resumes processing along with the others.
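For readers who think in code, the behavior described above can be illustrated with a toy, in-memory sketch. This is not the Always Available implementation and says nothing about how it works internally; the class names (Instance, ActiveActiveArray), the site names, and the shared transaction log used for catch-up are all assumptions made purely for illustration.

```python
class Instance:
    """One hypothetical processing site holding its own copy of the data."""

    def __init__(self, name):
        self.name = name
        self.online = True
        self.ledger = []  # transactions this site has applied, in order


class ActiveActiveArray:
    """Toy model: every online site processes every transaction."""

    def __init__(self, instances):
        self.instances = {inst.name: inst for inst in instances}
        self.log = []  # ordered record of every accepted transaction

    def submit(self, txn):
        """Apply txn on all online sites; return how many processed it."""
        live = [i for i in self.instances.values() if i.online]
        if not live:
            raise RuntimeError("no instance is online")
        self.log.append(txn)
        for inst in live:
            inst.ledger.append(txn)
        return len(live)

    def fail(self, name):
        """Simulate a switch failure that takes one site offline."""
        self.instances[name].online = False

    def rejoin(self, name):
        """Replay the transactions the site missed, then mark it online."""
        inst = self.instances[name]
        inst.ledger.extend(self.log[len(inst.ledger):])
        inst.online = True


# Usage: one site drops out, the others keep processing, then it catches up.
array = ActiveActiveArray([Instance("site-a"), Instance("site-b"), Instance("site-c")])
array.submit("msg-1")
array.fail("site-a")
served_by = array.submit("msg-2")  # handled by the two remaining sites
array.rejoin("site-a")
print(served_by, array.instances["site-a"].ledger)
```

The point of the sketch is that no step promotes a standby to primary: the surviving sites were already doing the work, so the failure of one site changes only how many copies process each transaction.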

The Numbers

RIM says that its service "has been operational for 99.7% of the time over the last 18 months" (source). That equates to about 1,576.8 minutes of downtime, or 26.28 hours per year.

A good industry standard for uptime is 99.9% or three nines. That is 525.6 minutes of downtime, or 8.76 hours per year.

ZeroNines can provide in excess of five nines of uptime, or 99.999%. That is less than 5.3 minutes of downtime per year.
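The arithmetic behind these figures is simple enough to check yourself. The short sketch below assumes a 365-day year (525,600 minutes) and uses a helper name, downtime_minutes_per_year, that is mine rather than anything standard.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(uptime_percent):
    """Return the yearly downtime, in minutes, implied by an uptime percentage."""
    return MINUTES_PER_YEAR * (100.0 - uptime_percent) / 100.0


# The three uptime levels discussed above.
for label, uptime in [
    ("RIM (as reported)", 99.7),
    ("three nines", 99.9),
    ("five nines", 99.999),
]:
    minutes = downtime_minutes_per_year(uptime)
    print(f"{label}: {uptime}% uptime -> {minutes:.1f} minutes "
          f"({minutes / 60:.2f} hours) of downtime per year")
```

Note that RIM's 99.7% figure covers an 18-month period, so annualizing it this way assumes the downtime was spread evenly over that time.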

I do not know if planned downtime was included in RIM's 99.7% calculation. Companies often do not include planned downtime in their business continuity projections, counting only unplanned outages. But downtime is downtime from a user's perspective, whether caused by an accident or a planned maintenance cycle. ZeroNines protects against both.

In the 12 months since ZenVault Medical went live on an Always Available cloud-based architecture, it has experienced true 100% uptime, with no downtime whatsoever for any reason. That includes planned maintenance, upgrades, and other events that would have taken an ordinary network offline.

Visit the ZeroNines.com website to find out more about how our disaster-proof architecture can protect businesses of any description from downtime.

Alan Gin – Founder & CEO, ZeroNines

October 13, 2011