October 24, 2011

Building Outage Resistance into Network Operations


An article I read the other day in MIT's Technology Review [source] nicely sums up what I've been hearing about cloud operations from dozens of clients, partners, and other colleagues around the country. The cloud is great for development, prototyping, and special projects for enterprises, but don't rely on it for anything serious. As that article says, "For all the unprecedented scalability and convenience of cloud computing, there's one way it falls short: reliability."

But the truth is that the tried-and-true models of network operations aren't all that reliable themselves, and neither are the disaster recovery systems that are supposed to protect them. Granted, they are probably more reliable than the cloud at this point, but downtime is downtime whether it's in the cloud or in a colocation facility. The effect is the same.

What is really needed is outage resistance that is built into network operations themselves, whatever the model.

Why downtime happens

I recently read an interesting whitepaper from Emerson Network Power [source] that describes the seven most common causes of downtime as revealed by a 2010 survey by the Ponemon Institute (http://www.ponemon.org/index.php). The causes are all pretty mundane: UPS problems such as battery failure or exceeded capacity, power distribution unit and circuit breaker failures, cooling problems, human error, and similar things. All of them apply to any data center, whether in-house or in the cloud. None of the exciting stuff like fires, terrorism, or hurricanes made it into the top seven, though of course they could lead to a failure of a battery, circuit breaker, or cooling unit.

The Emerson whitepaper describes best practices that can reduce the likelihood of downtime induced by each of the top seven causes. That is all well and good, but some are very costly, such as remodeling server rooms "to optimize air flow within the data center by adopting a cold-aisle containment strategy." Other recommendations include regular and frequent inspection and testing of backup batteries, installation of circuit breaker monitoring systems, and increased training for staff.

These are good ideas but costly, if not in capital for server room reconfiguration then in staff hours and other recurring costs. The paper contends that problems caused by human error are "wholly preventable," but I believe this is a mistake. No matter how stringent the rules or how well-documented the procedures, someone will take shortcuts, overlook a vital step in the midst of a crisis, or sneak their donut and coffee into the control room. Applications fail under stress, databases fail to restart properly, and any number of other things can and do go wrong. There is no way to write contingencies for each, particularly when the initial failure leads to an unpredictable cascade effect.

And what of the cloud?

I believe the cloud brings tremendous value to developers, SMBs, and other institutions that need low cost and great flexibility. Where else can an online store launch with a configuration that is not only affordable but also ready for both super-slow sales and a drastic ramp-up if sales shoot into the stratosphere? But like most “better, cheaper, faster” initiatives, the cloud has genuine reliability problems. Companies running their own data centers can choose to incur the expense and work of instituting all of Emerson's best practices, since they control the environment. But all they have from their cloud provider (or colocation provider, for that matter) is the Service Level Agreement (SLA). They can't go in themselves and swap out aged batteries or fire the guy who persists in smuggling cinnamon rolls into the NOC.

The Technology Review article tells us that some companies are looking for ways to make their cloud deployments far more disaster-resistant from the start, rather than just relying on their cloud provider's promises [source]. Seattle-based software developer BigDoor experienced service interruptions as a result of the Amazon cloud's big outage in April 2011. Co-founder Jeff Malek said, "For me, [service agreements] are created by bureaucrats and lawyers… What I care about is how dependable the cloud service is, and what a provider has done to prepare for outages" [source].

The same article describes the Amazon SLA and its implications:

Even though outages put businesses at immense risk, public cloud providers still don't offer ironclad guarantees. In its so-called "service-level agreement," Amazon says that if its services are unavailable for more than 0.05 percent of a year (around four hours) it will give the clients a credit "equal to 10% of their bill." Some in the industry believe public clouds like Amazon should aim for 99.999 percent availability, or downtime of only around five minutes a year.
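The arithmetic behind those figures is easy to check. Here is a quick back-of-the-envelope sketch in Python (my own illustration, assuming a 365-day year; note that 0.05 percent unavailability corresponds to 99.95 percent availability):

```python
# Convert an availability percentage into the downtime it allows per year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_per_year(availability_pct: float) -> float:
    """Minutes of downtime per year implied by an availability percentage."""
    return MINUTES_PER_YEAR * (100.0 - availability_pct) / 100.0


for pct in (99.95, 99.999):
    minutes = downtime_per_year(pct)
    print(f"{pct}% available -> {minutes:.1f} minutes/year ({minutes / 60:.2f} hours)")
```

At 99.95 percent availability that works out to roughly 262.8 minutes, or about 4.4 hours, per year, which matches the "around four hours" in Amazon's agreement; at five nines it drops to about 5.3 minutes.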


The outage-resistant cloud

ZeroNines can give you that 99.999% (five nines) or better, whether you are running a cloud or just running in the cloud. Cloud service providers could install an Always Available™ configuration on their publicly-offered services, gaining a strong competitive edge in attracting new customers.

Individual businesses could install an Always Available array on their own networks, synchronizing any combination of cloud deployments, colocation, and in-house network nodes. It also facilitates cloud migration, because you can deploy to the cloud while keeping your existing network up and running as it always has. There is no monumental cloud migration that could take the whole network down and leave the business stranded if there's a glitch in starting an application. Instead, Always Available runs all servers hot and all applications active, enabling entire nodes to fall in and out of the configuration as needed without affecting service. The remaining nodes can update a new or re-started node once it rejoins the system.
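To make the contrast with failover concrete, here is a minimal, hypothetical Python sketch of the general active-active idea described above: every node is hot, each request is dispatched to all of them, and the loss of any single node is absorbed without a failover step. The Node class and node names are invented for illustration; this is not ZeroNines' actual Always Available implementation.

```python
# Illustrative active-active dispatch: send each request to every hot node
# and accept the first successful response, so a downed node is a non-event.
from concurrent.futures import ThreadPoolExecutor, as_completed


class Node:
    """A stand-in for one network instance (cloud, colocation, or in-house)."""

    def __init__(self, name: str, healthy: bool = True):
        self.name = name
        self.healthy = healthy

    def process(self, request: str) -> str:
        if not self.healthy:
            raise RuntimeError(f"{self.name} is down")
        return f"{self.name} processed {request!r}"


def dispatch(nodes: list[Node], request: str) -> str:
    """Send the request to all nodes in parallel; return the first success."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        futures = [pool.submit(node.process, request) for node in nodes]
        failures = []
        for future in as_completed(futures):
            try:
                return future.result()  # any surviving node satisfies the request
            except RuntimeError as err:
                failures.append(str(err))
    raise RuntimeError(f"all nodes failed: {failures}")


if __name__ == "__main__":
    nodes = [Node("cloud-us-east"), Node("colo-denver", healthy=False), Node("in-house")]
    print(dispatch(nodes, "GET /orders/42"))  # succeeds despite the downed node
```

Because the healthy nodes were already processing the same work, removing or re-adding a node changes nothing from the user's point of view.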

ZeroNines client ZenVault Medical (www.zenvault.com/medical) developed and launched their live site in the cloud using an Always Available configuration. Since the day of its launch in September 2010, it has run in the cloud with true 100% uptime, with no downtime at all. That includes maintenance and upgrades. When a problem or maintenance cycle requires a node to be taken offline, ZenVault staffers remove it from the configuration, modify it as necessary, and seamlessly add it back into the mix once it is ready. ZenVault users don't experience any interruptions.

Visit the ZeroNines.com website to find out more about how our disaster-proof architecture can protect businesses of any description from downtime.

Alan Gin – Founder & CEO, ZeroNines

October 13, 2011

What Did One BlackBerry User Say to the Other BlackBerry User?

Nothing, according to Twitter user @giselewaymes (source).

In what has to be every large enterprise IT manager's worst nightmare, a big, high-profile outage grew into a monster, expanded to global proportions, made headlines everywhere, and after three days seemed to have no end in sight. The cause was a failed failover that could have been avoided.

Background: RIM BlackBerry 

BlackBerry is produced by Canadian firm Research In Motion (RIM). It is one of the leading smartphones among business users. Its real forte is encrypted mobile email and instant messaging. BlackBerry has about 70 million users worldwide (source). Several high-profile outages and many smaller ones have tarnished its reputation, and this week's outage seems to be pushing the company to the breaking point, if all the buzz on the Internet is to be believed.

The Problem: Failed failover 

On Monday morning, October 10, 2011, millions of BlackBerry users in Europe, the Middle East, and Africa lost access to messaging, email, and the Internet. The outage spread to every continent and may eventually have affected half of all BlackBerry users (source).

RIM explained things to some degree on their website on Tuesday, October 11: "The messaging and browsing delays that some of you are still experiencing were caused by a core switch failure within RIM’s infrastructure. Although the system is designed to failover to a back-up switch, the failover did not function as previously tested" (source).

In other words, their failover-based disaster recovery system failed. It can be inferred that this led to cascading failures that knocked out other systems in other regions, leading to this worldwide problem. As of Wednesday evening the 12th it was still not fully resolved, with an interesting update posted on their site outlining the status in various parts of the world (source). By Thursday morning it looked like things were finally under control, with service almost back to normal in most areas.

The Cost: Paid compensation and a blow to the business 

I don't doubt that RIM will compensate users in one way or another, perhaps in the form of free service (which seems to be the industry's de facto compensation currency). RIM Co-CEO Jim Balsillie said that such a step would be considered but that their immediate focus was fixing the problem (source).

More damaging is the additional blow to RIM's reputation. Lots of users are claiming on Facebook, Twitter, and other online forums that this is the last straw and that they will quit BlackBerry. For many this may be a hollow threat, but there is genuine peril here. "This outage… comes at a particularly bad time for RIM, since it faces increasing competition in the smartphone market… Apple's iPhone and phones on the Google Android operating system have been gaining ground, and the new iPhone 4S goes on sale Friday (October 14)" (source).

The cost can be high outside of RIM as well. "The outage caught much of D.C. off guard Wednesday and underscored the region’s reliance on the BlackBerry — which is still the only federally approved smartphone for employees in some government agencies" (source).

As for RIM itself, back in June there was a flurry of articles suggesting RIM was potentially facing bankruptcy (source). And this week there have been a number of stories about growing momentum for a RIM breakup or merger (source). Even a massive outage like this is unlikely to cause the demise of a large and important firm, but combined with other woes like a less-than-competitive product and a poor business model, it could well be the deciding factor.

The Solution: Eliminate failover systems

RIM is in trouble for a number of reasons, but downtime like this does not need to be one of them. I contend that the core problem was not a failed switch but a failed failover. Switches will fail and there is no avoiding that. If you can architect the perfect switch, I invite you to do so, and you'll be richer than Bill Gates. It's what happens after the inevitable switch malfunction (or other disaster) that matters most. Failover systems will fail too. RIM's apparently worked fine during testing, but the strain and chaos of a real-world crisis were too much for it. At ZeroNines, we propose eliminating failover systems in favor of something that will turn failures into virtual non-events.

ZeroNines' Always Available™ technology eliminates the need for failover, processing the same applications and data simultaneously on multiple servers, clouds, and virtual servers separated by thousands of miles. All servers are hot, and all applications are active. So if a switch fails in one network instance there is no need for a risky failover to another. Other instances are already processing the same transactions in parallel and simply continue processing as if nothing had happened. Once the problem with the switch is rectified, that instance is brought back into the Always Available array, is automatically updated, and resumes processing along with the others.
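As a rough illustration of that rejoin step, here is a hypothetical Python sketch in which every active node applies the same transaction stream, and a node returning from repair catches up by replaying whatever it missed before resuming normal processing. The names and data structures are purely illustrative and are not the Always Available internals.

```python
# Illustrative catch-up on rejoin: a repaired node replays the transactions
# it missed from the shared stream, then runs in lockstep with its peers.
from dataclasses import dataclass, field


@dataclass
class ActiveNode:
    name: str
    applied: list = field(default_factory=list)  # transactions applied so far

    def apply(self, txn: str) -> None:
        self.applied.append(txn)


def rejoin(node: ActiveNode, shared_log: list) -> None:
    """Bring a returning node up to date by replaying the transactions it missed."""
    for txn in shared_log[len(node.applied):]:
        node.apply(txn)


if __name__ == "__main__":
    shared_log = ["txn-1", "txn-2", "txn-3", "txn-4"]
    healthy = ActiveNode("cloud-a", applied=list(shared_log))
    repaired = ActiveNode("colo-b", applied=["txn-1"])  # went offline after txn-1
    rejoin(repaired, shared_log)
    assert repaired.applied == healthy.applied  # back in lockstep with its peers
    print(f"{repaired.name} resynchronized: {repaired.applied}")
```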

The Numbers

RIM says that its service "has been operational for 99.7% of the time over the last 18 months" (source). That 0.3% of downtime equates to about 1,576.8 minutes, or 26.28 hours, per year.

A good industry standard for uptime is 99.9% or three nines. That is 525.6 minutes of downtime, or 8.76 hours per year.

ZeroNines can provide in excess of five nines of uptime, or 99.999%. That is less than 5.3 minutes of downtime per year.
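For reference, here is the same per-year conversion in Python (a simple sketch that assumes a 365-day year and treats RIM's 99.7% figure as an annual rate):

```python
# Downtime per year implied by each uptime figure quoted above.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

for label, uptime_pct in [("RIM (reported)", 99.7),
                          ("three nines", 99.9),
                          ("five nines", 99.999)]:
    downtime_min = MINUTES_PER_YEAR * (100 - uptime_pct) / 100
    print(f"{label}: {uptime_pct}% uptime -> "
          f"{downtime_min:,.1f} minutes/year ({downtime_min / 60:.2f} hours)")
```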

I do not know if planned downtime was included in RIM's 99.7% calculation. Companies often do not include planned downtime in their business continuity projections, counting only unplanned outages. But downtime is downtime from a user's perspective, whether caused by an accident or a planned maintenance cycle. ZeroNines protects against both.

In the 12 months since ZenVault Medical went live on an Always Available cloud-based architecture, it has experienced true 100% uptime, with no downtime whatsoever for any reason. That includes planned maintenance, upgrades, and other events that would have taken an ordinary network offline.


Visit the ZeroNines.com website to find out more about how our disaster-proof architecture can protect businesses of any description from downtime.

Alan Gin – Founder & CEO, ZeroNines