December 12, 2011

BAE, Microsoft, the Cloud, and Planning for When it All Goes Horribly Wrong

"If it fails in Ireland, it goes to Holland. But what if it fails in Holland as well?"
Paraphrase of Charles Newhouse, BAE [source]

Cloud news circuits have been abuzz the last few days over BAE rejecting Microsoft's Office 365 cloud solution because of the Patriot Act. This is the highest-profile rejection of a cloud offering I have seen. I am shocked and dismayed that after all the advancements that have improved continuity in the cloud, the network architectures our cloud service providers are offering are still in the stone age. They're still trying to use failover and pass it off as advanced and reliable. I can only assume that if given a 787 they would try to fly it off a dirt landing strip.

When you read the articles closely, it is clear that the big issue for BAE was data sovereignty. How does one retain control of data during a network disaster, and where does it go when your service provider has to failover from the primary network node to the backup? To quote Charles Newhouse, head of strategy and design at British defense contractor BAE,

"We had these wonderful conversations with Microsoft where we were going to adopt Office 365 for some of our unrestricted stuff, and it was all going to be brilliant. I went back and spoke to the lawyers and said, '[The data center is in] Ireland and then if it fails in Ireland go to Holland.' And the lawyers said 'What happen[s] if they lose Holland as well?'" [source]

And earlier in the same article he described the user experience during a cloud outage:

"A number of high profile outages that users have suffered recently demonstrated just how little control you actually have. When it all goes horribly wrong, you just sit there and hope it is going to get better. There's nothing tangibly you can do to assist" [source].

It's About More than Just the Patriot Act

The big focus in these articles is the Patriot Act. BAE lawyers forbade the use of Office 365 and the Microsoft public cloud because as a U.S. company, Microsoft could be required to turn BAE data over to the U.S. government under terms of the Patriot Act [source].

It is true that the Patriot Act can require cloud service providers like Microsoft (and Amazon, Google, and others) to give the U.S. government the data on their servers, even if those servers are housed outside the United States [source]. Newhouse also said that "the geo-location of that data and who has access to that data is the number one killer for adopting to the public cloud at the moment" [source].

But European governments are already moving to eliminate this loophole. As explained in November on ZDNet.com, a new European directive "will not only modernize the data protection laws, but will also counteract the effects of the Patriot Act in Europe" [source]. Sounds to me like Microsoft's jurisdictional problems will be solved for them. And failing that there is probably some creative and legal business restructuring that would do the trick.

It's Really About Failover and Its Shortcomings

So if European law will provide data sovereignty from a legal standpoint, why reject the Microsoft cloud? It all comes back to "when things go horribly wrong."

When Newhouse describes the Ireland-to-Holland scenario, he is clearly talking about Microsoft failing over from their Ireland data center to their Holland data center. I find it hard to believe that Microsoft thinks the outdated and flawed failover model is suitable for a leading cloud offering. Office 365 and its customers deserve better.

Apparently BAE agrees. It put its foot down and refused to play because the reality does not match the promise.

Failovers often fail, causing the downtime they were supposed to prevent. If the secondary site fails to start up properly (which is very common) or suffers an outage of its own, the business is either a) still offline or b) failed over to yet another location. The customer quickly loses control, network transactions get lost, and their data goes… where? Another server in Europe? Part of an American cloud? How many locations is Microsoft prepared to failover to, and where are they? And with the cloud these issues loom even larger because there is no particular machine that houses the data.

The Solution: Cloud and Data Reliability without Failover

ZeroNines offers two potential scenarios that will solve this problem:

1) Protect the cloud provider's systems from downtime, offering a far more reliable cloud.

2) Protect the business' systems from a cloud provider's downtime.

Our Always Available technology is designed to provide data and application uptime well in excess of five nines. ZenVault Medical has been running in the cloud on Always Available for about 14 months with true 100% uptime. Always Available runs multiple network and cloud nodes in distant geographical areas. All servers and nodes are hot, and all applications are active. If one fails, the others continue processing as before, with no interruption to the business or the user experience. There is no failover, and thus no chance for outages caused by a failed failover.
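
To make the contrast with failover concrete, here is a minimal sketch of the active-active idea, written in Python. It is purely illustrative: the node URLs are hypothetical and this is not ZeroNines' actual protocol. The point is that every transaction goes to all hot nodes at once, so losing any one node changes nothing for the user.

# Illustrative sketch only -- hypothetical node URLs, not ZeroNines' protocol.
from concurrent.futures import ThreadPoolExecutor
import urllib.request

NODES = [
    "https://node-ireland.example.com",  # all three nodes are hot,
    "https://node-holland.example.com",  # all running the same
    "https://node-italy.example.com",    # active applications
]

def send_to_node(node, transaction):
    """Deliver one transaction (a bytes payload) to one node."""
    try:
        req = urllib.request.Request(node + "/tx", data=transaction, method="POST")
        with urllib.request.urlopen(req, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False  # this node is down; the others carry on unaffected

def process(transaction):
    """Send the same transaction to every hot node in parallel.

    There is no failover step to go wrong: as long as any one node
    succeeds, the transaction completes and the user sees no outage.
    """
    with ThreadPoolExecutor(max_workers=len(NODES)) as pool:
        results = list(pool.map(lambda n: send_to_node(n, transaction), NODES))
    return any(results)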

So if Microsoft were to adopt our Always Available technology, a storm like the one that knocked out their data center in Ireland this past August would not affect service. The Ireland node might go down, but all network activities would proceed as usual on other cloud data centers in Holland, Italy, or wherever they have set them up. Users would never know it.

If BAE adopted Always Available, they could bring their Microsoft cloud node into an Always Available array with other cloud nodes or data centers of their own choosing. A failure in one simply means that business proceeds on the others.

The business or the service provider can determine which nodes are brought into the array. BAE could choose to use only European cloud nodes to maintain data sovereignty.

ZeroNines' Always Available technology is built precisely for the moment "when it all goes horribly wrong." The difference is that with ZeroNines, it won't mean downtime.

Visit the ZeroNines website to find out more about how our disaster-proof architecture protects businesses of any description from downtime.

Alan Gin – Founder & CEO, ZeroNines

December 1, 2011

Retail Business Continuity on Black Friday and Cyber Monday

The economy has heaved a sigh of relief after good sales reports from the Thanksgiving weekend. Have you stopped to really think about the importance of reliable IT systems and business continuity during this and other key sales events?

A company really may live or die according to what it or its service provider does in preparation for Black Friday and Cyber Monday. The game is in the hands of the technicians more and more every year.

While these two days can herald great things during a good year, they can also seem like harbingers of doom if things don't go so well. Their grim-sounding names are oddly appropriate, and everyone watches with trepidation.

• Black Friday. Very ominous, evoking images of stock market crashes and other disasters. A few decades ago it came to mean "the day after Thanksgiving in which retailers make enough sales to put themselves 'into the black ink'" [source], which is actually a good thing.

• Cyber Monday. Sounds like something from The Terminator. Actually… "The term 'Cyber Monday' was coined in 2005 by Shop.org, a division of the National Retail Federation [source]." This is the Monday after Thanksgiving, when online sales show a significant spike. Cyber Monday has become a major shopping day and economic indicator in its own right.

Jittery analysts are poised every year with their thumb on the Recession Early Warning button, ready to sound the alarm if the score doesn't add up and the game goes badly. (I think they secretly enjoy this.)

It's All IT's Fault. But No Pressure, Guys! : )

Every year in advance of this season opener, IT Managers beg for money to upgrade servers, replace old circuit breakers and backup batteries, service the cooling systems, and do a thousand other things to help prop up their networks for the onslaught. They also stock up on the coffee, donuts, and Valium that will keep them going through long days and even longer nights of watching, waiting, rebooting, hot swapping, and occasionally panicking over system crashes and failovers. I do not envy them, as the fate of the economy apparently rests upon their shoulders.

If the IT systems go down the business is out of the game and the term "Black Friday" takes on an entirely new meaning. Revenue on Thanksgiving weekend is largely driven by time-sensitive discounts, so shoppers will buy from competitors if a website or point-of-sale (POS) system is down. For those of you running these systems, my heart goes out to you. I have been in similar situations myself many times.

Thanksgiving Weekend Outages Mostly Due to Heavy Traffic

There were a number of reports of ecommerce sites becoming unavailable on Thanksgiving, Black Friday, and Cyber Monday. Victoria's Secret went beyond secret and became downright invisible three separate times, for a total of about 80 minutes [source]. I have read about downtime and poor site performance at many other online retailers as well, including PC Mall and Crutchfield [source]. None of the reports mention a cause for all this downtime, but the implication is simple, old-fashioned traffic overload.

Fire Suppression System Suppresses Sales on eBay

One outage not caused by traffic was ProStores, an online store solution used by lots of smaller operations to run their eBay storefronts. According to a Thanksgiving Day post by ProStores on their discussion board, "the data center fire suppression system tripped the Emergency Power Off (EPO) system causing a loss of power to the data center's raised floor environment" [source]. As is usual in such circumstances, it took most of the day before things could be brought back to normal. I strongly suggest you read their post, as it is an excellent account of the gyrations an IT department has to go through in such situations. I applaud ProStores for being so forthright and providing this information.

Preventing this and Other Outages

Always Available technology from ZeroNines could have prevented the ProStores outage entirely. Yes, that faulty fire suppression system would still have freaked out at that particular data center. But Always Available would have been running one, two, or more instances of the same applications and transactions in the cloud or at other data centers. ProStores clients and their customers would never have known there was a power outage and no sales would have been lost.

ProStores made no mention at all of failover, so I assume they do not have a failover-based recovery system in place. With ZeroNines, that's perfectly fine because we do not use failover either. We make failover unnecessary. We offer disaster avoidance, not disaster recovery. There is no way to prevent all system malfunctions because there are too many complex parts. Next month maybe a circuit breaker will fail. After that, maybe it's a failed hard disk and an application crash. The list goes on.

Girding Your Loins for Next Year

Online retailers wanting to guard against a blackout on Black Friday (or any other day) should consider the modular approach ZeroNines takes. You can apply Always Available to selected high-value systems such as:

  • Webstore servers and databases
  • Product/inventory databases
  • Payment systems
  • Image rendering systems

These will keep you running if something blows up. Close behind are customer service systems and warehousing/fulfillment. These become more important the closer you get to Christmas, as last-minute shoppers tend to need more personal help and there is no leeway for late shipments.

To prevent traffic-related outages, set up proper load balancing. If huge players like J.C. Penney, Apple, Macy's, Sears, Amazon, and Dell can come through Cyber Monday with flying colors [source], you can too. But for the hardware failures, human mistakes, software crashes, and other things that can hit you any day of the year as well, look into ZeroNines.
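
If you are rolling your own load distribution, even a simple health-checked rotation illustrates the principle. Here is a bare-bones Python sketch with hypothetical backend addresses; a real deployment would use a dedicated balancer such as nginx, HAProxy, or a cloud load-balancing service.

# Minimal round-robin load-balancing sketch with TCP health checks.
# Backend addresses are hypothetical.
import itertools
import socket

BACKENDS = [("10.0.0.11", 8080), ("10.0.0.12", 8080), ("10.0.0.13", 8080)]
_rotation = itertools.cycle(BACKENDS)

def is_healthy(host, port, timeout=1.0):
    """Health check: can we open a TCP connection to the backend?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def pick_backend():
    """Return the next healthy backend, skipping any that are down."""
    for _ in range(len(BACKENDS)):
        host, port = next(_rotation)
        if is_healthy(host, port):
            return host, port
    raise RuntimeError("no healthy backends left")

Note that load balancing only spreads the traffic around; when every backend loses power at once, as in the ProStores case, the balancer has nothing left to pick from. That is where an architecture like Always Available comes in.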

Visit the ZeroNines website to find out more about how our disaster-proof architecture protects businesses of any description from downtime.

Alan Gin – Founder & CEO, ZeroNines

October 24, 2011

Building Outage Resistance into Network Operations


An article I read the other day in MIT's Technology Review [source] nicely sums up what I've been hearing about cloud operations from dozens of clients, partners, and other colleagues around the country. The cloud is great for development, prototyping, and special projects for enterprises, but don't rely on it for anything serious. As that article says, "For all the unprecedented scalability and convenience of cloud computing, there's one way it falls short: reliability."

But the truth is that the tried-and-true models of network operations aren't all that reliable themselves, and neither are the disaster recovery systems that are supposed to protect them. Granted, they are probably more reliable than the cloud at this point, but downtime is downtime whether it's in the cloud or in a colocation facility. The effect is the same.

What is really needed is outage resistance that is built into network operations, whatever the model.

Why downtime happens

I recently read an interesting whitepaper from Emerson Network Power [source] that describes the seven most common causes of downtime, as revealed by a 2010 survey by the Ponemon Institute (http://www.ponemon.org/index.php). The causes are all pretty mundane: UPS problems such as battery failure or exceeded capacity, power distribution unit and circuit breaker failures, cooling problems, human error, and similar things. All of them apply to any data center, whether in-house or in the cloud. None of the exciting stuff like fires, terrorism, or hurricanes made it into the top seven, though of course they could lead to a failure of a battery, circuit breaker, or cooling unit.

The Emerson whitepaper describes best practices that can reduce the likelihood of downtime induced by each of the top seven causes. That is all well and good, but some are very costly, such as remodeling server rooms "to optimize air flow within the data center by adopting a cold-aisle containment strategy." Other recommendations include regular and frequent inspection and testing of backup batteries, installation of circuit breaker monitoring systems, and increased training for staff.

These are good ideas but costly, if not in capital for server room reconfiguration then in staff hours and other recurring costs. The paper contends that problems caused by human error are "wholly preventable," but I believe this is a mistake. No matter how stringent the rules or how well-documented the procedures, someone will take shortcuts, overlook a vital step in the midst of a crisis, or sneak their donut and coffee into the control room. Applications fail under stress, databases fail to restart properly, and any number of other things can and do go wrong. There is no way to write contingencies for each, particularly when the initial failure leads to an unpredictable cascade effect.

And what of the cloud?

I believe the cloud brings tremendous value to developers, SMBs, and other institutions that need low cost and great flexibility. Where else can an online store launch with a configuration that is not only affordable but also ready for both super-slow sales and a drastic ramp-up if sales shoot into the stratosphere? But like most “better, cheaper, faster” initiatives, the cloud has genuine reliability problems. A company running their own data center could choose to incur the expense and work of instituting all of Emerson's best practices since they are in control of the environment. But all they have from their cloud provider (or colocation provider for that matter) is their Service Level Agreement (SLA). They can't go in themselves and swap out aged batteries or fire the guy who persists in smuggling cinnamon rolls into the NOC.

The Technology Review article tells us that some companies are looking for ways to make their cloud deployments far more disaster resistant to start with, rather than just relying on their cloud provider's promises [source]. Seattle-based software developer BigDoor experienced service interruptions as a result of the Amazon cloud's big outage in April 2011. Co-founder Jeff Malek said "For me, [service agreements] are created by bureaucrats and lawyers… What I care about is how dependable the cloud service is, and what a provider has done to prepare for outages" [source].

The same article describes the Amazon SLA and its implications:

Even though outages put businesses at immense risk, public cloud providers still don't offer ironclad guarantees. In its so-called "service-level agreement," Amazon says that if its services are unavailable for more than 0.05 percent of a year (around four hours) it will give the clients a credit "equal to 10% of their bill." Some in the industry believe public clouds like Amazon should aim for 99.999 percent availability, or downtime of only around five minutes a year.


The outage-resistant cloud

ZeroNines can give you that 99.999% (five nines) or better, whether you are running a cloud or just running in the cloud. Cloud service providers could install an Always Available™ configuration on their publicly offered services, gaining a real competitive edge in attracting new customers.

Individual businesses could install an Always Available array on their own networks, synchronizing any combination of cloud deployments, colocation, and in-house network nodes. It also facilitates cloud migration, because you can deploy to the cloud while keeping your existing network up and running as it always has. There is no monumental cloud migration that could take the whole network down and leave the business stranded if there's a glitch in starting an application. Instead, Always Available runs all servers hot and all applications active, enabling entire nodes to fall in and out of the configuration as needed without affecting service. The remaining nodes can update a new or re-started node once it rejoins the system.
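
That rejoin step deserves a sketch of its own. Greatly simplified, and with hypothetical names rather than anything from the actual product: the surviving nodes keep a sequenced transaction log, and a node returning from maintenance catches up by replaying whatever it missed before taking live traffic again.

# Simplified rejoin/catch-up sketch -- hypothetical, not the real product.
# Real systems also need conflict resolution, ordering guarantees,
# and log compaction.
class Node:
    def __init__(self, name):
        self.name = name
        self.log = []        # list of (sequence_number, transaction)
        self.last_seq = 0

    def apply(self, seq, tx):
        self.log.append((seq, tx))
        self.last_seq = seq

    def entries_after(self, seq):
        """Serve the log suffix that a rejoining peer is missing."""
        return [(s, tx) for (s, tx) in self.log if s > seq]

    def rejoin(self, peer):
        """Catch up from a hot peer, then resume normal processing."""
        for seq, tx in peer.entries_after(self.last_seq):
            self.apply(seq, tx)

# Node B goes out for maintenance while node A keeps processing.
a, b = Node("A"), Node("B")
a.apply(1, "tx1"); b.apply(1, "tx1")
a.apply(2, "tx2"); a.apply(3, "tx3")   # B misses these while offline
b.rejoin(a)                            # B replays 2 and 3, fully caught up
assert b.last_seq == 3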

ZeroNines client ZenVault Medical (www.zenvault.com/medical) developed and launched their live site in the cloud using an Always Available configuration. Since the day of its launch in September 2010 it has run in the cloud with true 100% uptime, with no downtime at all. That includes maintenance and upgrades. When a problem or maintenance cycle requires a node to be taken offline, ZenVault staffers remove it from the configuration, modify it as necessary, and seamlessly add it back into the mix once it is ready. ZenVault users don't experience any interruptions.

Visit the ZeroNines.com website to find out more about how our disaster-proof architecture can protect businesses of any description from downtime.

Alan Gin – Founder & CEO, ZeroNines

October 13, 2011

What Did One BlackBerry User Say to the Other BlackBerry User?

Nothing, according to Twitter user @giselewaymes (source).

In what has to be every large enterprise IT manager's worst nightmare, a big, high-profile outage grew into a monster, expanded to global proportions, made headlines everywhere, and after three days seemed to have no end in sight. The cause was a failed failover that could have been avoided.

Background: RIM BlackBerry 

BlackBerry is produced by Canadian firm Research In Motion (RIM). It is one of the leading smartphones among business users. Its real forte is encrypted mobile email and instant messaging. BlackBerry has about 70 million users worldwide (source). Several high-profile outages and many smaller ones have tarnished its reputation, and this week's seems to be pushing the company to the breaking point, if all the buzz on the Internet is to be believed.

The Problem: Failed failover 

On Monday morning, October 10, 2011, millions of BlackBerry users in Europe, the Middle East, and Africa lost access to messenger, email, and the Internet. The outage spread to every continent and may eventually have affected half of all BlackBerry users (source).

RIM explained things to some degree on their website on Tuesday, October 11: "The messaging and browsing delays that some of you are still experiencing were caused by a core switch failure within RIM’s infrastructure. Although the system is designed to failover to a back-up switch, the failover did not function as previously tested" (source).

In other words, their failover-based disaster recovery system failed. It can be inferred that this led to cascading failures that knocked out other systems in other regions, leading to this worldwide problem. As of Wednesday evening the 12th it was still not fully resolved, with an interesting update posted on their site outlining the status in various parts of the world (source). By Thursday morning it looked like things were finally under control, with service almost back to normal in most areas.

The Cost: Paid compensation and a blow to the business 

I don't doubt that RIM will compensate users in one way or another, perhaps in the form of free service (which seems to be the industry's de facto compensation currency). RIM Co-CEO Jim Balsillie said that such a step would be considered but that their immediate focus was fixing the problem (source).

More damaging is the additional blow to RIM's reputation. Lots of users are claiming on Facebook, Twitter, and other online forums that this is the last straw and that they will quit BlackBerry. For many this may be a hollow threat, but there is genuine peril here. "This outage… comes at a particularly bad time for RIM, since it faces increasing competition in the smartphone market… Apple's iPhone and phones on the Google Android operating system have been gaining ground, and the new iPhone 4S goes on sale Friday (October 14)" (source).

The cost can be high outside of RIM as well. "The outage caught much of D.C. off guard Wednesday and underscored the region’s reliance on the BlackBerry — which is still the only federally approved smartphone for employees in some government agencies" (source).

As for RIM itself, back in June there was a flurry of articles suggesting RIM was potentially facing bankruptcy (source). And this week there have been a number of stories about growing momentum for a RIM breakup or merger (source). Even a massive outage like this is unlikely to cause the demise of a large and important firm, but combined with other woes like a less-than-competitive product and poor business model it could well be the deciding factor.

The Solution: Eliminate failover systems

RIM is in trouble for a number of reasons, but downtime like this does not need to be one of them. I contend that the core problem was not a failed switch but a failed failover. Switches will fail and there is no avoiding that. If you can architect the perfect switch, I invite you to do so; you'll be richer than Bill Gates. It's what happens after the inevitable switch malfunction (or other disaster) that matters most. Failover systems will fail too. RIM's apparently worked fine during a test, but the strain and chaos of a real-world crisis were too much for it. At ZeroNines, we propose eliminating the failover systems in favor of something that will turn failures into virtual non-events.

ZeroNines' Always Available™ technology eliminates the need for failover, processing the same applications and data simultaneously on multiple servers, clouds, and virtual servers separated by thousands of miles. All servers are hot, and all applications are active. So if a switch fails in one network instance there is no need for a risky failover to another. Other instances are already processing the same transactions in parallel and simply continue processing as if nothing had happened. Once the problem with the switch is rectified, that instance is brought back into the Always Available array, is automatically updated, and resumes processing along with the others.

The Numbers

RIM says that its service "has been operational for 99.7% of the time over the last 18 months" (source). That equates to about 1,576.8 minutes of downtime, or 26.28 hours per year.

A good industry standard for uptime is 99.9% or three nines. That is 525.6 minutes of downtime, or 8.76 hours per year.

ZeroNines can provide in excess of five nines of uptime, or 99.999%. That is less than 5.3 minutes of downtime per year.
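
The arithmetic behind all three figures is the same: a year has 525,600 minutes, and downtime is the unavailable fraction of that. A few lines of Python reproduce the numbers:

# Downtime per year implied by an availability percentage.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes(availability_pct):
    return MINUTES_PER_YEAR * (100.0 - availability_pct) / 100.0

for pct in (99.7, 99.9, 99.999):
    m = downtime_minutes(pct)
    print(f"{pct}% uptime -> {m:,.1f} minutes/year ({m / 60:.2f} hours)")

# 99.7%   -> 1,576.8 minutes/year (26.28 hours)
# 99.9%   ->   525.6 minutes/year (8.76 hours)
# 99.999% ->     5.3 minutes/year (0.09 hours)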

I do not know if planned downtime was included in RIM's 99.7% calculation. Companies often do not include planned downtime in their business continuity projections, counting only unplanned outages. But downtime is downtime from a user's perspective, whether caused by an accident or a planned maintenance cycle. ZeroNines protects against both.

In the 12 months since ZenVault Medical went live on an Always Available cloud-based architecture, it has experienced true 100% uptime, with no downtime whatsoever for any reason. That includes planned maintenance, upgrades, and other events that would have taken an ordinary network offline.


Visit the ZeroNines.com website to find out more about how our disaster-proof architecture can protect businesses of any description from downtime.

Alan Gin – Founder & CEO, ZeroNines

August 10, 2011

Amazon EC2 Outage: Déjà Vu All Over Again

It seems we can always rely on cloud outages to spice up the news feeds. Today, it's another Amazon EC2 Cloud outage, which is a nice departure from the wildly gyrating stock market and the U.S. debt downgrade.

I didn't write about Amazon's big April 2011 EC2 outage simply because I was overwhelmed with other work (along with texts, tweets and emails about the outage). That outage affected big-name customers like Netflix, Foursquare, HootSuite, and Reddit (source). Some EC2 customers' websites were down for as much as two days.

Then just this past weekend, an electrical storm over Dublin, Ireland led to a lightning strike on a transformer and a subsequent explosion, fire, and loss of power at an Amazon data center. Backup generators could not be started. Amazon's European EC2 service was affected for as long as twelve hours. Some Microsoft cloud services were knocked out as well (source).

I am a huge proponent of the cloud; however, I believe reliability can and should improve. As a frequent speaker and panelist at cloud-related events, I find that many in the audience are not convinced that the cloud is reliable enough to meet the needs of mission-critical applications. Outages like this don’t help. However, I am aware of several successful implementations of robust, outage-resistant cloud deployments that simply have not gotten any attention because the clients are not motivated to share how they did it with their competitors. Some of these early adopters took risks and made large investments when the mainstream would not, and they feel they deserve some advantage while they can get it. Naturally enough I think ZeroNines has the right solution, but read on for now.

Background: Amazon as a major cloud provider

Amazon EC2 is the Amazon Elastic Compute Cloud (source). It gives thousands of online service providers and software developers easy access to cloud computing capacity that scales with demand, and customers pay only for what they use. Those customers include Netflix (streaming movies and TV shows), Instagram (photo sharing), Reddit (social networking for sharing news), and Foursquare (location-based social networking).

The Problem: Something's rotten in the state of Virginia

I have not found a clear statement yet that describes the exact cause of the August 8 outage, but PCMag.com says that it "closely mirrors a similar cloud outage Amazon suffered in April" (source). It also happened in the same Virginia data center. The April 2011 outage "happened after Amazon network traffic was 'executed incorrectly.' Instead of shifting to another router, traffic went to a lower-capacity network, taking down servers in Northern Virginia." (source). So Amazon loses points for allowing the same problem to happen twice in the same place, but wins a few back for apparently being ready this time and containing the August 8 outage to minutes rather than days.

The Cost: Revenue and reputation

As always with these outages there is talk of the provider compensating its customers through waived fees and such. Mark that against Amazon's balance sheet. Customers no doubt lost business, and you can mark that against their balance sheets. Reliability issues will chase away customers who don't want to risk their own revenue with a service notorious for crashing. But if the cloud nonetheless offers the best business model, what do these customers do? Press for lower fees and more favorable service level agreements, for one.

The Solution: Prevention, not recovery

If you're an actual or potential cloud user (with any provider), Always Available™ from ZeroNines can protect your existing systems without changing providers, hardware, operating systems, or applications. If there's a disaster in any part of your system, all your networked transactions and applications continue functioning as normal on the other network nodes. Our CloudNines™ application can protect your cloud-based infrastructure, VirtualNines™ can protect virtualized environments on your own machines, and EnterpriseNines™ can add Always Available protection to any other network infrastructure. You can mix and match so all these can interoperate seamlessly. For businesses of any size, the result is uptime of virtually 100% regardless of the disasters that may strike any individual node in the Always Available array.

The cloud providers themselves could use the same CloudNines product to protect their systems, virtually eliminating downtime and avoiding headlines like Amazon's. We are currently developing and monitoring on Amazon and other cloud platforms. Our technology is certified for Windows Server® 2008, compatible with Windows Server® 2008 Hyper-V™ and Hyper-V™ Server, and certified as VMWare® ready.

Visit the ZeroNines.com website to find out more about how our disaster-proof architecture can protect businesses of any description from downtime.

Alan Gin – Founder & CEO, ZeroNines