December 12, 2011

BAE, Microsoft, the Cloud, and Planning for When it All Goes Horribly Wrong

"If it fails in Ireland, it goes to Holland. But what if it fails in Holland as well?"
Paraphrase of Charles Newhouse, BAE [source]

Cloud news circuits have been abuzz the last few days over BAE rejecting Microsoft's Office 365 cloud solution because of the Patriot Act. This is the highest-profile rejection of a cloud offering I have seen. I am shocked and dismayed that after all the advancements that have improved continuity in the cloud, the network architectures our cloud service providers are offering are still in the stone age. They're still trying to use failover and pass it off as advanced and reliable. I can only assume that if given a 787 they would try to fly it off a dirt landing strip.

When you read the articles closely, it is clear that the big issue for BAE was data sovereignty. How does one retain control of data during a network disaster, and where does it go when your service provider has to fail over from the primary network node to the backup? To quote Charles Newhouse, head of strategy and design at British defense contractor BAE,

"We had these wonderful conversations with Microsoft where we were going to adopt Office 365 for some of our unrestricted stuff, and it was all going to be brilliant. I went back and spoke to the lawyers and said, '[The data center is in] Ireland and then if it fails in Ireland go to Holland.' And the lawyers said 'What happen[s] if they lose Holland as well?'" [source]

And earlier in the same article he described the user experience during a cloud outage:

"A number of high profile outages that users have suffered recently demonstrated just how little control you actually have. When it all goes horribly wrong, you just sit there and hope it is going to get better. There's nothing tangibly you can do to assist" [source].

It's About More than Just the Patriot Act

The big focus in these articles is the Patriot Act. BAE lawyers forbade the use of Office 365 and the Microsoft public cloud because as a U.S. company, Microsoft could be required to turn BAE data over to the U.S. government under terms of the Patriot Act [source].

It is true that the Patriot Act can require cloud service providers like Microsoft (and Amazon, Google, and others) to give the U.S. government the data on their servers, even if those servers are housed outside the United States [source]. Newhouse also said that "the geo-location of that data and who has access to that data is the number one killer for adopting to the public cloud at the moment" [source].

But European governments are already moving to eliminate this loophole. As explained in November on ZDNet.com, a new European directive "will not only modernize the data protection laws, but will also counteract the effects of the Patriot Act in Europe" [source]. Sounds to me like Microsoft's jurisdictional problems will be solved for them. And failing that there is probably some creative and legal business restructuring that would do the trick.

It's Really about Failover and its Shortcomings

So if European law will provide data sovereignty from a legal standpoint, why reject the Microsoft cloud? It all comes back to "when things go horribly wrong."

When Newhouse describes the Ireland-to-Holland scenario, he is clearly talking about Microsoft failing over from its Ireland data center to its Holland data center. I find it hard to believe that Microsoft thinks the outdated and flawed failover model is suitable for a leading cloud offering. Office 365 and its customers deserve better.

Apparently BAE agrees. It put its foot down and refused to play because the reality does not match the promise.

Failovers often fail, causing the downtime they were supposed to prevent. If the secondary site fails to start up properly (which is very common) or suffers an outage of its own, the business is either a) still offline or b) failed over to yet another location. The customer quickly loses control, network transactions get lost, and their data goes… where? Another server in Europe? Part of an American cloud? How many locations is Microsoft prepared to fail over to, and where are they? And with the cloud these issues loom even larger because there is no particular machine that houses the data.

The Solution: Cloud and Data Reliability without Failover

ZeroNines offers two potential scenarios that will solve this problem:

1) Protect the cloud provider's systems from downtime, offering a far more reliable cloud.

2) Protect the business' systems from a cloud provider's downtime.

Our Always Available technology is designed to provide data and application uptime well in excess of five nines. ZenVault Medical has been running in the cloud on Always Available for about 14 months with true 100% uptime. Always Available runs multiple network and cloud nodes in distant geographical areas. All servers and nodes are hot, and all applications are active. If one fails, the others continue processing as before, with no interruption to the business or the user experience. There is no failover, and thus no chance for outages caused by a failed failover.
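To make the contrast with failover concrete, here is a minimal Python sketch of the general active-active idea described above. It is an illustration only, not ZeroNines code: the Node class, the submit function, and the node names are all hypothetical. Every node receives every transaction, so when one node dies the others simply carry on and there is no handoff step that could fail.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for one hot node in an active-active array."""
    name: str
    healthy: bool = True
    log: list = field(default_factory=list)

    def process(self, transaction: str) -> bool:
        # A failed node simply drops out; nothing is handed off to a standby.
        if not self.healthy:
            return False
        self.log.append(transaction)
        return True


def submit(nodes: list, transaction: str) -> int:
    """Send the same transaction to every node; return how many processed it."""
    acks = sum(1 for node in nodes if node.process(transaction))
    if acks == 0:
        raise RuntimeError("all nodes are down")
    return acks


# Assumed example array; the locations are illustrative only.
nodes = [Node("ireland"), Node("holland"), Node("italy")]
submit(nodes, "order-1")
nodes[0].healthy = False  # simulate the Ireland node going down
acks = submit(nodes, "order-2")
```

In this toy run the second transaction is still processed by the two surviving nodes. A real system also has to deal with ordering, duplicate suppression, and bringing a recovered node back up to date, none of which this sketch attempts.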

So if Microsoft were to adopt our Always Available technology, a storm like the one that knocked out their data center in Ireland this past August would not affect service. The Ireland node might go down, but all network activities would proceed as usual on other cloud data centers in Holland, Italy, or wherever they have set them up. Users would never know it.

If BAE adopted Always Available, they could bring their Microsoft cloud node into an Always Available array with other cloud nodes or data centers of their own choosing. A failure in one simply means that business proceeds on the others.

The business or the service provider can determine which nodes are brought into the array. BAE could choose to use only European cloud nodes to maintain data sovereignty.
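As a rough illustration of that choice, the Python sketch below filters a list of candidate nodes against an allow-list of regions before they join the array. The node records, the region labels, and the build_array function are assumptions made up for this example and do not describe any real ZeroNines or Microsoft interface.

```python
# Assumed allow-list of regions the business accepts for its data.
EUROPEAN_REGIONS = {"ireland", "holland", "italy"}

# Hypothetical candidate nodes; names and regions are illustrative only.
candidates = [
    {"name": "cloud-a", "region": "ireland"},
    {"name": "cloud-b", "region": "holland"},
    {"name": "cloud-c", "region": "virginia"},
    {"name": "own-dc", "region": "italy"},
]


def build_array(nodes, allowed_regions):
    """Keep only the nodes whose region is on the allow-list."""
    return [node for node in nodes if node["region"] in allowed_regions]


array = build_array(candidates, EUROPEAN_REGIONS)
array_names = [node["name"] for node in array]
```

Here the node outside Europe is never admitted to the array, so there is no situation in which data could land on it.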

ZeroNines' Always Available technology is built precisely for the moment "when it all goes horribly wrong." The difference is that with ZeroNines, it won't mean downtime.

Visit the ZeroNines website to find out more about how our disaster-proof architecture protects businesses of any description from downtime.

Alan Gin – Founder & CEO, ZeroNines

December 1, 2011

Retail Business Continuity on Black Friday and Cyber Monday

The economy has heaved a sigh of relief after good sales reports from the Thanksgiving weekend. Have you stopped to really think about the importance of reliable IT systems and business continuity during this and other key sales events?

A company really may live or die according to what it or its service provider does in preparation for Black Friday and Cyber Monday. The game is in the hands of the technicians more and more every year.

While these two days can herald great things during a good year, they can also seem like harbingers of doom if things don't go so well. Their grim-sounding names are oddly appropriate, and everyone watches with trepidation.

• Black Friday. Very ominous, evoking images of stock market crashes and other disasters. A few decades ago it came to mean "the day after Thanksgiving in which retailers make enough sales to put themselves 'into the black ink'" [source] which is actually a good thing.

• Cyber Monday. Sounds like something from The Terminator. Actually… "The term 'Cyber Monday' was coined in 2005 by Shop.org, a division of the National Retail Federation [source]." This is the Monday after Thanksgiving, when online sales show a significant spike. Cyber Monday has become a major shopping day and economic indicator in its own right.

Jittery analysts are poised every year with their thumb on the Recession Early Warning button, ready to sound the alarm if the score doesn't add up and the game goes badly. (I think they secretly enjoy this.)

It's All IT's Fault. But No Pressure, Guys! : )

Every year in advance of this season opener, IT Managers beg for money to upgrade servers, replace old circuit breakers and backup batteries, service the cooling systems, and do a thousand other things to help prop up their networks for the onslaught. They also stock up on the coffee, donuts, and Valium that will keep them going through long days and even longer nights of watching, waiting, rebooting, hot swapping, and occasionally panicking over system crashes and failovers. I do not envy them, as the fate of the economy apparently rests upon their shoulders.

If the IT systems go down, the business is out of the game and the term "Black Friday" takes on an entirely new meaning. Revenue on Thanksgiving weekend is largely driven by time-sensitive discounts, so shoppers will buy from competitors if a website or point-of-sale (POS) system is down. For those of you running these systems, my heart goes out to you. I have been in similar situations myself many times.

Thanksgiving Weekend Outages Mostly Due to Heavy Traffic

There were a number of reports of ecommerce sites becoming unavailable on Thanksgiving, Black Friday, and Cyber Monday. Victoria's Secret went beyond secret and became downright invisible three separate times, for a total of about 80 minutes [source]. I have read about downtime and poor site performance at many other online retailers as well, including PC Mall and Crutchfield [source]. None of the reports mention the cause of this downtime, but the implication is that it was simple old-fashioned traffic overload.

Fire Suppression System Suppresses Sales on eBay

One outage not caused by traffic hit ProStores, an online store solution used by lots of smaller operations to run their eBay storefronts. According to a Thanksgiving Day post by ProStores on their discussion board, "the data center fire suppression system tripped the Emergency Power Off (EPO) system causing a loss of power to the data center's raised floor environment" [source]. As is usual in such circumstances, it took most of the day before things could be brought back to normal. I strongly suggest you read their post, as it is an excellent account of the gyrations an IT department has to go through in such situations. I applaud ProStores for being so forthright and providing this information.

Preventing this and Other Outages

Always Available technology from ZeroNines could have prevented the ProStores outage entirely. Yes, that faulty fire suppression system would still have freaked out at that particular data center. But Always Available would have been running one, two, or more instances of the same applications and transactions in the cloud or at other data centers. ProStores clients and their customers would never have known there was a power outage and no sales would have been lost.

ProStores made no mention at all of failover, so I assume they do not have a failover-based recovery system in place. With ZeroNines, that's perfectly fine because we do not use failover either. We make failover unnecessary. We offer disaster avoidance, not disaster recovery. There is no way to prevent all system malfunctions because there are too many complex parts. Next month maybe a circuit breaker will fail. After that, maybe it's a failed hard disk and an application crash. The list goes on.

Girding Your Loins for Next Year

Online retailers wanting to guard themselves against a Black Friday blackout (or on any other day) should consider the modular approach ZeroNines takes. You can apply Always Available to selected high-value systems such as:

  • Webstore servers and databases
  • Product/inventory databases
  • Payment systems
  • Image rendering systems

These will keep you running if something blows up. Close behind are customer service systems and warehousing/fulfillment. These become more important the closer you get to Christmas, as last-minute shoppers tend to need more personal help and there is no leeway for late shipments.

To prevent traffic-related outages, set up proper load balancing. If huge players like J.C. Penney, Apple, Macy's, Sears, Amazon, and Dell can come through Cyber Monday with flying colors [source], you can too. But for the hardware failures, human mistakes, software crashes, and other things that can hit you any day of the year as well, look into ZeroNines.
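For readers who have not set up load balancing before, here is a minimal Python sketch of round-robin distribution, one of the simplest strategies. It only shows the idea; the backend names and the make_round_robin helper are hypothetical, and a production site would use a dedicated load balancer with health checks rather than code like this.

```python
from collections import Counter
from itertools import cycle


def make_round_robin(backends):
    """Return a function that hands out backends in a repeating order."""
    if not backends:
        raise ValueError("at least one backend is required")
    pool = cycle(backends)
    return lambda: next(pool)


# Hypothetical web servers sitting behind the balancer.
pick = make_round_robin(["web-1", "web-2", "web-3"])

# Simulate six incoming requests and count where each one lands.
assignments = Counter(pick() for _ in range(6))
```

Each of the three servers ends up with an equal share of the six requests, so no single machine absorbs the whole spike.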

Alan Gin – Founder & CEO, ZeroNines