October 24, 2011

Building Outage Resistance into Network Operations


An article I read the other day in MIT's Technology Review [source] nicely sums up what I've been hearing about cloud operations from dozens of clients, partners, and other colleagues around the country. The cloud is great for development, prototyping, and special projects for enterprises, but don't rely on it for anything serious. As that article says, "For all the unprecedented scalability and convenience of cloud computing, there's one way it falls short: reliability."

But the truth is that the tried-and-true models of network operations aren't all that reliable themselves, and neither are the disaster recovery systems that are supposed to protect them. Granted, they are probably more reliable than the cloud at this point, but downtime is downtime whether it's in the cloud or in a colocation facility. The effect is the same.

What is really needed is outage resistance that is built into network operations themselves, whatever the model.

Why downtime happens

I recently read an interesting whitepaper from Emerson Network Power [source] that describes the seven most common causes of downtime, as revealed by a 2010 survey by the Ponemon Institute (http://www.ponemon.org/index.php). The causes are all pretty mundane: UPS problems such as battery failure or exceeded capacity, power distribution unit and circuit breaker failures, cooling problems, human error, and similar things. All of them apply to any data center, whether in-house or in the cloud. None of the exciting stuff like fires, terrorism, or hurricanes made it into the top seven, though of course those events could lead to the failure of a battery, circuit breaker, or cooling unit.

The Emerson whitepaper describes best practices that can reduce the likelihood of downtime induced by each of the top seven causes. That is all well and good, but some are very costly, such as remodeling server rooms "to optimize air flow within the data center by adopting a cold-aisle containment strategy." Other recommendations include regular and frequent inspection and testing of backup batteries, installation of circuit breaker monitoring systems, and increased training for staff.

These are good ideas but costly, if not in capital for server room reconfiguration then in staff hours and other recurring costs. The paper contends that problems caused by human error are "wholly preventable," but I believe this is a mistake. No matter how stringent the rules or how well-documented the procedures, someone will take shortcuts, overlook a vital step in the midst of a crisis, or sneak their donut and coffee into the control room. Applications fail under stress, databases fail to restart properly, and any number of other things can and do go wrong. There is no way to write contingencies for each, particularly when the initial failure leads to an unpredictable cascade effect.

And what of the cloud?

I believe the cloud brings tremendous value to developers, SMBs, and other institutions that need low cost and great flexibility. Where else can an online store launch with a configuration that is not only affordable but also ready for both super-slow sales and a drastic ramp-up if sales shoot into the stratosphere? But like most “better, cheaper, faster” initiatives, the cloud has genuine reliability problems. A company running its own data center could choose to incur the expense and work of instituting all of Emerson's best practices, since it controls the environment. But all it has from a cloud provider (or colocation provider, for that matter) is the Service Level Agreement (SLA). It can't go in itself and swap out aged batteries, or fire the guy who persists in smuggling cinnamon rolls into the NOC.

The Technology Review article tells us that some companies are looking for ways to make their cloud deployments far more disaster-resistant to start with, rather than just relying on their cloud provider's promises [source]. Seattle-based software developer BigDoor experienced service interruptions as a result of the Amazon cloud's big outage in April 2011. Co-founder Jeff Malek said, "For me, [service agreements] are created by bureaucrats and lawyers… What I care about is how dependable the cloud service is, and what a provider has done to prepare for outages" [source].

The same article describes the Amazon SLA and its implications:

Even though outages put businesses at immense risk, public cloud providers still don't offer ironclad guarantees. In its so-called "service-level agreement," Amazon says that if its services are unavailable for more than 0.05 percent of a year (around four hours) it will give the clients a credit "equal to 10% of their bill." Some in the industry believe public clouds like Amazon should aim for 99.999 percent availability, or downtime of only around five minutes a year.
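For perspective, those percentages translate into concrete downtime budgets. The quick back-of-the-envelope calculation below (a hypothetical Python sketch of mine, not taken from the article) converts availability into allowed downtime per year: 0.05 percent of a year works out to roughly four and a half hours, while 99.999 percent availability leaves only about five minutes.

```python
# Back-of-the-envelope conversion of availability percentages into downtime
# budgets. The 99.95% and 99.999% figures correspond to the thresholds quoted
# above; the arithmetic itself is the only thing this sketch demonstrates.

HOURS_PER_YEAR = 365 * 24  # 8,760 hours

def allowed_downtime_hours(availability_pct):
    """Downtime per year (in hours) permitted by a given availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100.0)

for pct in (99.95, 99.999):
    hours = allowed_downtime_hours(pct)
    print("%.3f%% availability -> %.2f hours (%.1f minutes) of downtime per year"
          % (pct, hours, hours * 60))

# 99.950% availability -> about 4.4 hours per year ("around four hours")
# 99.999% availability -> about 0.09 hours, or roughly five minutes ("five nines")
```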


The outage-resistant cloud

ZeroNines can give you that 99.999% (five nines) or better, whether you are running a cloud or just running in the cloud. Cloud service providers could install an Always Available™ configuration on their publicly-offered services, giving them a strong competitive edge when attracting new customers.

Individual businesses could install an Always Available array on their own networks, synchronizing any combination of cloud deployments, colocation, and in-house network nodes. It also facilitates cloud migration, because you can deploy to the cloud while keeping your existing network up and running as it always has. There is no monumental cloud migration that could take the whole network down and leave the business stranded if there's a glitch in starting an application. Instead, Always Available runs all servers hot and all applications active, enabling entire nodes to fall in and out of the configuration as needed without affecting service. The remaining nodes can update a new or re-started node once it rejoins the system.
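As a rough illustration of the general "all nodes hot" idea described above, here is a minimal, hypothetical sketch (not ZeroNines' actual implementation; the node names and routing logic are invented purely for illustration) of a request pool in which every node serves live traffic and nodes can drop out and rejoin without a cutover event:

```python
# Hypothetical sketch of an active-active node pool: every node is hot and
# serving traffic, and nodes can be removed for maintenance or re-added after
# being brought up to date. Invented for illustration only.

import itertools

class ActiveActivePool:
    """Keeps a set of live nodes and spreads requests across all of them."""

    def __init__(self, nodes):
        self.nodes = list(nodes)              # every node is hot
        self._cycle = itertools.cycle(self.nodes)

    def remove(self, node):
        """Take a node out for maintenance; the remaining nodes keep serving."""
        self.nodes.remove(node)
        self._cycle = itertools.cycle(self.nodes)

    def add(self, node):
        """Re-add a node once it has been updated by its peers."""
        self.nodes.append(node)
        self._cycle = itertools.cycle(self.nodes)

    def route(self, request):
        """Send the request to the next live node (simple round-robin)."""
        node = next(self._cycle)
        return "%s handled %s" % (node, request)

pool = ActiveActivePool(["cloud-node", "colo-node", "in-house-node"])
print(pool.route("GET /orders"))
pool.remove("colo-node")                      # maintenance: service continues
print(pool.route("GET /orders"))
pool.add("colo-node")                         # node rejoins once it is ready
```

Real state synchronization across nodes is of course far more involved than this; the point is simply that no single, all-or-nothing migration or failover event is ever required.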

ZeroNines client ZenVault Medical (www.zenvault.com/medical) developed and launched their live site in the cloud using an Always Available configuration. Since the day of its launch in September 2010 it has run in the cloud with true 100% uptime, including during maintenance and upgrades. When a problem or maintenance cycle requires a node to be taken offline, ZenVault staffers remove it from the configuration, modify it as necessary, and seamlessly add it back into the mix once it is ready. ZenVault users don't experience any interruptions.

Visit the ZeroNines.com website to find out more about how our disaster-proof architecture can protect businesses of any description from downtime.

Alan Gin – Founder & CEO, ZeroNines
