June 28, 2012
Outage at Amazon EC2 Virginia Illustrates the Value of the Expendable Datacenter℠
The Amazon EC2 cloud had a relatively minor outage a couple of weeks ago, on June 14, 2012. As it turns out, it happened in the same Virginia datacenter that spawned the April 2011 and August 2011 outages. I've been on the road, but now that I look into it I see that it's actually a classic outage scenario and a classic example of cascading failure resulting from a failover. It also illustrates just why you need to plan for Expendable Datacenters℠.
Background: Amazon and Their Cloud Service
I blogged about Amazon's big outage last August (source), and described how large a role Amazon plays in the cloud world. I won't recap all that here, but I will say that among its clients are Netflix, Instagram, Reddit, Foursquare, Quora, Pinterest, parts of Salesforce.com, and ZeroNines client ZenVault.
The Problem: A Power Outage
According to the Amazon status page (source), "a cable fault in the high voltage Utility power distribution system" led to a power outage at the datacenter. Primary backup generators successfully kicked in, but after about nine minutes "one of the generators overheated and powered off because of a defective cooling fan." Secondary backup power successfully kicked in, but after about four minutes a circuit breaker opened because it had been "incorrectly configured." At this point, with no power at all, some customers went completely offline. Others that were using Amazon's multi-Availability Zone configurations stayed online but appear to have suffered from impaired API calls, described below. Power was restored about half an hour after it was first lost.
Sites started recovering as soon as power was restored and most customers were back online about an hour after the whole episode began. But it is clear that many weren't really ready for business again because of the cascading effects of the initial interruption.
Subsequent Problems: Loss of In-Flight Transactions
The Amazon report says that when power came back on, some instances were "in an inconsistent state" and that they may have lost "in-flight writes." I interpret this to mean that when the system failed over to the backups, the backup servers were not synchronized with the primaries, resulting in lost transactions. This is typical of a failover disaster recovery system.
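To see why a failover loses in-flight writes, here is a toy sketch (my own illustration, assuming asynchronous replication; it is not Amazon's actual replication machinery): the primary acknowledges each write immediately and copies it to the backup later, so anything still waiting to be copied at the moment of failure simply vanishes when the backup takes over.

```python
# Hypothetical illustration of lost "in-flight writes" during a failover.
# Assumption: the primary acknowledges writes before the replica has them
# (asynchronous replication), which matches the behavior described above.

class Primary:
    def __init__(self):
        self.committed = []        # writes already acknowledged to clients
        self.unshipped = []        # writes not yet copied to the replica

    def write(self, tx):
        self.committed.append(tx)  # client sees success immediately
        self.unshipped.append(tx)  # replication happens later, in the background

    def ship_one(self, replica):
        if self.unshipped:
            replica.apply(self.unshipped.pop(0))

class Replica:
    def __init__(self):
        self.committed = []

    def apply(self, tx):
        self.committed.append(tx)

primary, replica = Primary(), Replica()
for i in range(5):
    primary.write(f"tx-{i}")
primary.ship_one(replica)          # only tx-0 has reached the replica so far
primary.ship_one(replica)          # ...and now tx-1

# Power fails and the replica is promoted. Everything still "in flight" is gone.
lost = [tx for tx in primary.committed if tx not in replica.committed]
print("acknowledged to clients:", primary.committed)
print("survives the failover:  ", replica.committed)
print("lost in-flight writes:  ", lost)   # tx-2, tx-3, tx-4
```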
Another Subsequent Problem: Impaired API Calls
Additionally, during the power outage, API calls related to Amazon Elastic Block Store (EBS) volumes failed. Amazon sums up the effect beautifully: "The datastore that lost power did not fail cleanly, leaving the system unable to flip [failover] the datastore to its replicas in another Availability Zone." Here's a second failed failover within the same disaster.
My Compliments to Amazon EC2
In all seriousness, I truly commend Amazon for publicly posting such a detailed description of the disaster. It looks to me like they handled the disaster quickly and efficiently within the limitations of their system. Unfortunately that system is clearly not suited to the job at hand.
Amazon does a pretty good job at uptime. We (ZeroNines) use the Amazon EC2 cloud ourselves. But we hedge our bets by adding our own commercially available Always Available architecture to harden the whole thing against power outages and such. If this outage had affected our particular instances, we would not have experienced any downtime, inconsistency, failed transactions, or other ill effects.
One Solution for Three Problems
Always Available runs multiple instances of the same apps and data in multiple clouds, virtual servers, or other hosting environments. All are hot and all are active.
When the power failed in the first phase of this disaster, two or more identical Always Available nodes would have continued processing as normal. The initial power outage would not have caused service downtime because customers would have been served by the other nodes.
Secondly, those in-flight transactions would not have been lost because the other nodes would have continued processing them. With Always Available there is no failover and consequently no "dead air" when transactions can be lost.
Third, those failed EBS API calls would not have failed because again, they would have gone to the remaining fully functional nodes.
A big issue in this disaster was the "inconsistent state," or lack of synchronization between the primary and the failover servers. Within an Always Available architecture, there is no failover. Each server is continually updating and being updated by all other servers. Synchronization takes place constantly so when one node is taken out of the configuration the others simply proceed as normal, processing all transactions in order. When the failed server is brought back online, the others update it and bring it to the same logical state so it can begin processing again.
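Here is a minimal sketch of that active-active idea (the node and method names are my own illustration, not ZeroNines' actual implementation): every healthy node applies every transaction, a downed node is simply skipped, and on rejoin it is brought back to the same logical state from any surviving peer.

```python
# Hypothetical active-active sketch: all nodes are hot, there is no primary
# and no failover. Names and structure are illustrative assumptions only.

class Node:
    def __init__(self, name):
        self.name = name
        self.log = []          # ordered transaction log (the "logical state")
        self.up = True

    def apply(self, tx):
        self.log.append(tx)

    def catch_up(self, peer):
        # On rejoin, copy whatever was missed from any healthy peer.
        self.log = list(peer.log)
        self.up = True

def process(tx, nodes):
    # Every transaction goes to every healthy node, so the survivors keep
    # serving while one node is down and there is no "dead air".
    for node in nodes:
        if node.up:
            node.apply(tx)

nodes = [Node("A"), Node("B"), Node("C")]
process("tx-1", nodes)
nodes[0].up = False            # node A loses power
process("tx-2", nodes)         # B and C continue as normal
process("tx-3", nodes)
nodes[0].catch_up(nodes[1])    # A returns and is resynchronized from B
process("tx-4", nodes)

for n in nodes:
    print(n.name, n.log)       # all three logs end in the same logical state
```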
The Expendable Datacenter
Another thing I can't help but point out is the string of events that caused the outage in the first place. First a cable failure combines with a fan failure, and that combines with a circuit breaker failure. It's simple stuff that adds up into a disaster. Then software that can't synchronize. Given the complexities of the modern datacenter, how many possible combinations of points of failure are there? Thousands? Millions? I'll go on the record and say that there is no way to map all the possible failures, and no way to guard against them all individually. It's far better to accept the fact that servers, nodes, or entire facilities will go down someday, and that you need to make the whole datacenter expendable without affecting performance. That's what ZeroNines does.
So if you're a cloud customer, take a look at ZeroNines. We can offer virtually 100% uptime whether you host in the cloud, on virtual servers, or in a typical hosted environment. And if you're a cloud provider, you can apply Always Available architecture to your service offering, avoiding disasters like this in the first place.
Check back in a few days and I'll write another post that looks at this outage from a business planning perspective.
Visit the ZeroNines website to find out more about how our disaster-proof architecture protects businesses of any description from downtime.
Alan Gin – Founder & CEO, ZeroNines
March 26, 2012
BATS: “When the Exchange Blows Itself Up, That’s Not a Good Thing”
I can imagine scenarios more spectacular than the BATS failure and IPO withdrawal on Friday, but most of them involve alien invasion, chupacabras, or the expiration of the Mayan calendar.
Every once in a while a technology failure casts such a light (or perhaps a shadow) on how we do things that people actually stop and wonder aloud if the entire paradigm is too flawed to succeed. I don't mean that they look at alternative solutions like they do every time RIM goes down. They actually begin to question the soundness of the industry sector itself and the technological and market assumptions on which it is founded. Take for example this quote about BATS from a veteran equities professional:
"This tells you the system we’ve created over the last 15 years has holes in it, and this is one example of a failure,” Joseph Saluzzi, partner and co-head of equity trading at Themis Trading LLC in Chatham, New Jersey, said in a phone interview. “When the exchange blows itself up, that’s not a good thing.” The malfunctions will refocus scrutiny on market structure in the U.S., where two decades of government regulation have broken the grip of the biggest exchanges and left trading fragmented over as many as 50 venues. BATS, whose name stands for Better Alternative Trading System, expanded in tandem with the automated firms that now dominate the buying and selling of American equities. The withdrawal also raises questions about the reliability of venues formed as competitors to the New York Stock Exchange and Nasdaq Stock Market since the 1990s. [source]
When High Frequency Trading Crashes
BATS was initially built to service brokers and high frequency trading firms [source]. Their Friday crash mirrors a scenario I speculated about in this blog back in May 2010 in regard to an 80-minute TD Ameritrade outage:
Some banks, hedge funds, and other high-power financial firms engaged in High Frequency Trading (HFT) make billions of trades a day over ultra-high speed connections [source]. Many trades live for only a few seconds. Enormous transactions are conceived and executed in half a second, with computers evaluating the latest news and acting on it well before human traders even know what the news is. HFT is having a significant effect on markets; there is evidence that the history-making “Flash Crash” of May 6 2010 was caused and then largely corrected by High Frequency Trading [source]. What would happen if one of these HFT systems was down for an hour and a half? Or even just a minute? [source]
To answer my own question, apparently IPOs can be cancelled, a faulty 100-share trade can halt the trading of a leading stock, and regulators will jump like you stuck them with a pin.
About BATS
BATS Global Markets, Inc. (http://www.batstrading.com) describes itself in its prospectus as primarily a technology company [source]. The BATS exchanges represent an enormous 11 to 12 percent of daily market volume [source]. BATS is backed by Citigroup Inc, Morgan Stanley, Credit Suisse Group [source] and several other large financial firms. BATS planned to go public on Friday March 23, 2012 by selling their own shares on their own exchange that runs on their own technology.
What Happened
To my knowledge, BATS has not issued a definitive statement on the exact technological problem. Here is the best summary I could come up with:
A BATS computer that matches orders in companies with ticker symbols in the A-BFZZZ range "encountered a software bug related to IPO auctions" at 10:45 a.m. New York time [source].
This prevented the exchange from trading its own shares. Then…
A single trade for 100 shares executed on a BATS venue at 10:57 a.m. briefly sent Apple down more than 9 percent to $542.80 [source].
Apple's ticker is AAPL so it was within the group affected by the bug. By the time it was over, BATS shares had dropped from about $16.00 to zero and they withdrew their IPO. All trades in the IPO, Apple, and other affected equities were canceled.
Service Uptime: 99.998% Won't Cut It.
BATS claims it experienced very low downtime in 2011: 99.94% for its BZX exchange and 99.998% for its BYX exchange [source]. This equates to about 5.25 hours of downtime for BZX and about 10.5 minutes of downtime for BYX.
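For anyone who wants to check the arithmetic, converting an availability percentage into annual downtime is a one-line calculation (this assumes availability is measured over a full 8,760-hour calendar year, not just market hours):

```python
# Convert an availability percentage into downtime per year.
# Assumption: availability is measured over a full 365-day year.

def annual_downtime_minutes(availability_pct):
    return (1 - availability_pct / 100) * 365 * 24 * 60

for label, pct in [("BZX, 99.94%", 99.94),
                   ("BYX, 99.998%", 99.998),
                   ("five nines, 99.999%", 99.999)]:
    minutes = annual_downtime_minutes(pct)
    print(f"{label}: about {minutes:.1f} minutes ({minutes / 60:.2f} hours) per year")

# BZX, 99.94%:         about 315.4 minutes (5.26 hours) per year
# BYX, 99.998%:        about 10.5 minutes per year
# five nines, 99.999%: about 5.3 minutes per year
```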
Those figures would be admirable for most online services, but my first thought is that for a High Frequency Trading system they could represent billions of dollars in lost trades. I'd like to hope that all this downtime occurred when the markets were closed.
But even short outages of seconds or even fractions of a second can cause cascading application and database failures that corrupt the system and cause problems for hours or days.
Speculation: How it Could Have Been Prevented
This was not an all-out downtime event like I prophesied back in May 2010 and I don't think that ZeroNines technology could have prevented it. However, a variety of other real-world failures can be just as disastrous, particularly when one considers the shortcomings of failover-based disaster recovery, questionable cloud reliability, and the fact that switches and other hardware will eventually fail.
Allow me to postulate how this might happen one day on another financial firm's systems.
Scenario 1
The software fails because of a corrupted instance of the trading application on one server node. In an Always Available™ configuration that one node could have been shut down, allowing other instances of the application to handle all transactions as normal.
Scenario 2
A network switch fails, leading to extreme latency in some transaction responses. In an Always Available configuration, all transactions are processed simultaneously in multiple places, so the slow responses from the failed node would be ignored in favor of fast responses from the functioning nodes. The latency would not have been noticed by the traders or the market.
Scenario 3
An entire cloud instance has an outage lasting just a few seconds, well within the provider's service level promises. During failover to another virtual server, one application is corrupted and begins sending false information. In an Always Available configuration there is no failover and no failover-induced corruption. Instead, other instances of the application process all the transactions while one is down, and then update the failed instance once it comes back online.
These scenarios are simplifications but you get the point: problems will always occur, but if they are within an Always Available configuration the nodes that continue to function will cover for those that don't.
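To make Scenario 2 concrete, here is a small illustrative sketch (my own construction, not a ZeroNines API): the same transaction is handed to every node at once, the first good answer wins, and a slow or dead node is simply ignored.

```python
# Hypothetical fan-out sketch for Scenario 2: send the same transaction to
# every node, keep the first successful response, ignore the slow/failed node.
import concurrent.futures
import time

def make_node(name, delay, fail=False):
    def handle(tx):
        time.sleep(delay)                      # simulated processing + network time
        if fail:
            raise RuntimeError(f"{name} is down")
        return f"{tx} confirmed by {name}"
    return handle

nodes = [make_node("A", 2.0, fail=True),       # behind the failed switch: slow, then errors
         make_node("B", 0.05),
         make_node("C", 0.08)]

def process(tx, timeout=5.0):
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(nodes))
    futures = [pool.submit(handler, tx) for handler in nodes]
    try:
        for fut in concurrent.futures.as_completed(futures, timeout=timeout):
            try:
                return fut.result()            # first good answer wins
            except Exception:
                continue                       # a failed node is simply ignored
        raise RuntimeError("all nodes failed")
    finally:
        pool.shutdown(wait=False)              # do not wait around for the slow node

print(process("trade-42"))                     # answered by node B in about 50 ms
```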
Playing Nice for the SEC
“When you’re actually the exchange that is coming public and the platform that you touted as being very robust and competitive with any other global exchange fails to work, it’s a real black eye,” said Peter Kovalski, a money manager who expected to receive shares in BATS for Alpine Mutual Funds in Purchase, New York [source].
With all respect to Mr. Kovalski, this is far more than a black eye. It's a business disaster. BATS is on the ropes, and I suspect their underwriters are hurting too. Trading opportunities were lost, many permanently. The SEC is already investigating automated and High Frequency Trading [source], and I'd guess it will be far more cautious about innovation among exchanges, trading platforms, and other systems.
ZeroNines offers one way to keep the wolves at bay: our Always Available technology can enable uptime in excess of 99.999% (five nines) among applications and data within traditional hosting environments, virtualized environments, and the cloud. We can't fix your software bugs, but we can prevent downtime caused by other things.
Visit the ZeroNines website to find out more about how our disaster-proof architecture protects businesses of any description from downtime.
Alan Gin – Founder & CEO, ZeroNines
Labels: AAPL, Always Available, Apple, BATS, business continuity, Citigroup, cloud, disaster recovery, downtime, equities, failover, financial, Morgan Stanley, SEC, stock, Suisse, TD AMERITRADE, trading
January 27, 2012
The Legal Ramifications of Cloud Outages
Here's a public service announcement for cloud customers and cloud service providers alike: If you're not doing something to significantly increase the reliability of your cloud systems, you should prepare your legal team.
Take a look at this article; it's a great primer to get everyone started: 5 Key Considerations When Litigating Cloud Computing Disputes by Gerry Silver, partner at Chadbourne & Parke.
I agree with Silver who sums up the situation nicely when he says that "given the ever-increasing reliance on cloud computing, it is inevitable that disputes and litigation will increase between corporations and cloud service providers."
Understandably, both cloud users and cloud providers will want to dodge responsibility for cloud outages. "The corporation may be facing enormous liability and will seek to hold the cloud provider responsible, while the cloud provider will undoubtedly look to the parties' agreement and the underlying circumstances for defenses" [source].
Looks like the future is bright for attorneys who specialize in cloud issues. After all, a faulty power supply or software glitch could lead to years of court battles.
Five Legal Elements
Silver outlines five key elements for the legal team to consider:
- Limitation of liability written into service contracts.
- Whether the Limitation of Liability clause can be circumvented: can the cloud provider be held responsible despite this clause?
- Contract terms: A breach of contract on either side can greatly affect litigation outcomes.
- Remedies: During the crisis, the corporation could demand that the cloud provider take extraordinary steps to restore systems and data.
- Insurance and indemnification: Insurance may cover some losses, and a third party may bear some responsibility for the problem too.
The Disturbing News: Expectations are Low
In my travels, I am still surprised at how little thought goes into the liability associated with an outage whether it be in a data center, cloud or hybrid configuration. Although I embrace everyone’s motivation to move to the cloud, I found a couple of points in Silver's article disturbing because they shed light on the obsolete way the tech industry thinks about cloud architecture as it relates to disaster prevention.
1) Just how much foresight is a cloud provider legally expected to have? In the section titled "May the Limitation of Liability Clause Be Circumvented?" Silver describes how "one court recently sustained a claim of gross negligence and/or recklessness in a cloud computing/loss of data case because it was alleged that the provider failed to take adequate steps to protect the data." This raises the question of what constitutes "failure to take adequate steps". Does it mean that the provider did something genuinely negligent like setting up a system with multiple single points of failure? Were they culpable because they had followed best practices and relied upon an industry-standard failover-based recovery system which later failed? Or did they fail to seek out (or create) the most advanced and reliable proactive business continuity system on the planet? Whatever they were using probably seemed good at the time but was clearly not adequate because it failed to protect the customer's data.
I would speculate that a customer’s lawyer would have a pretty high expectation of what "adequate steps" are, but as you will see in my next point the bar is still set pretty low.
2) The expectation is that cloud providers will be using failover, which is 20 years out of date. In the same section, Silver asks "Were back-ups of data stored in different regions? Were banks of computers isolated from one another ready to take over if another zone failed?" This without doubt describes a failover system. Apparently his expectation is that a cloud provider should follow current best practices and use a failover disaster recovery system. But the failover technique was designed decades ago for systems that are now extinct or nearly so. The latest networks are radically more sophisticated than their forebears and consequently have radically different requirements. Even a successful failover is a perilous thing, and failovers fail all the time. If they didn't, Mr. Silver would probably not have found it necessary to write this article. Backups happen only on fixed schedules so the most recent transactions are often lost during a disaster. You can expect legal battles over downtime and data loss to continue because cloud providers and their customers are all using one variation or another of these outdated disaster recovery techniques. So how can a disaster recovery system that is so prone to disaster be considered an "adequate step?"
Like I said, you'd better call a meeting with your legal counsel and get ready.
No Outage, No Litigation
ZeroNines can actually eliminate outages. Our Always Available technology processes all network transactions simultaneously and in parallel on multiple cloud nodes or servers that are geographically separated. If something fails and brings down Cloud Node A, Nodes B and C continue processing everything as if nothing had happened. There is no hierarchy and no failover. So if this cloud provider's service does not go offline there is no violation of SLAs and no cause for litigation.
Our approach to business continuity is far superior to the failover paradigm, offering in excess of five nines (>99.999%) of uptime. It is suitable for modern generations of clouds, virtual servers, traditional servers, colocation hosting, in-house servers, and the applications and databases that clients will want to run in all of these.
So my message to cloud providers is to check out ZeroNines and Always Available as a means of protecting your service from downtime and the litigation that can come with it.
My message to cloud customers is that you can apply ZeroNines and Always Available whether your cloud provider is involved or not. After all, your key interest here is to maintain business continuity, not to win a big settlement over an outage.
And heads-up to the lawyers on both sides: We are setting a new standard in what constitutes "adequate steps".
Visit the ZeroNines website to find out more about how our disaster-proof architecture protects businesses of any description from downtime.
Alan Gin – Founder & CEO, ZeroNines
Labels: Always Available, attorney, business continuity, cloud, continuity, disaster recovery, downtime, failover, judgement, lawsuit, lawyer, litigation, outage, ZeroNines
January 23, 2012
RIM co-CEOs Resign: Is This the Cost of Downtime?
Back in October I commented in this blog about the enormous RIM BlackBerry outage [source]. I wrote that "even a massive outage like this is unlikely to cause the demise of a large and important firm, but combined with other woes like a less-than-competitive product and poor business model it could well be the deciding factor."
And now for the fallout. RIM is still in business, but its beleaguered co-CEOs/co-Chairmen Jim Balsillie and Mike Lazaridis have resigned and taken other positions within the company [source]. I'm sure it was not the outage alone (or all RIM outages put together) that caused this leadership shakeup. But it could well have been the deciding factor.
Outages and CEO Job Security
RIM's product problems are certainly serious. But I see a fundamental difference between 1) the prescience needed to get the right product to market at the right time, and 2) the technical ability to keep an existing product up and running. Customers might to some degree forgive a company whose product is reliable but behind the times. They will abandon a product that doesn't work when they need it, even if it is the newest, slickest thing around.
The October outage has RIM "facing a possible class action lawsuit in Canada" [source]. Add the cost of that to the costs of recovery, customer abandonment, lost shareholder value, and so forth. (Stay tuned; I will be commenting on the legal issues around cloud outages in the next few days.)
To put RIM's decline in perspective, the company was worth $70 billion a few years ago but today has a market value of about $8.9 billion [source]. Their stock dropped about 75% last year and was down to $16.28 before the market opened on Monday January 23, 2012 [source].
So according to the rules of modern business, someone has to pay and in this case it is the CEOs.
Now Imagine This at a Smaller Company
Can you imagine a three-day outage at a smaller software company? Or even a one-day outage? Imagine a typical e-commerce technology provider with 50 retail customers, 100 employees, and a SaaS application. If the core application, image server, database server, customer care system, inventory system, orders & fulfillment system, or other key element goes down, that could be the end of them. Many smaller companies do not survive a significant downtime event. And many smaller retailers do not survive if they are unable to do business on a key shopping day such as Black Friday or Cyber Monday.
Or even if the email system goes down for a couple hours. It happens all the time. Email is a key element of workflow and productivity and what company can afford to sit still for even a couple hours?
And it's not just the CEO whose job is at risk. Here's where an ounce of prevention is worth far more than a pound of cure.
That Rickety Old Failover
Remember my earlier comment about outdated yet reliable products, versus outdated and unreliable products? Ironically, the failover disaster recovery model that failed RIM back in October is one of those old and unreliable products. It was designed for systems and architectures that no longer bear any resemblance to what businesses are actually using. If failover worked I would not be writing this because there would be no need for its replacement.
But if you want to find out about real business continuity and getting away from failover, take a look at ZeroNines. Our Always Available™ architecture processes in multiple cloud locations, on multiple servers, and in multiple nodes. There is no hierarchy so if one goes down the others continue processing all network transactions. ZeroNines can bring application uptime to virtually 100%. It is a complete departure from the failover that RIM is using, and that small businesses everywhere stake their futures upon.
Time Will Tell
"RIM earned its reputation by focusing relentlessly on the customer and delivering unique mobile communications solutions… We intend to build on this heritage to expand BlackBerry's leadership position," RIM's new CEO Thorsten Heins is quoted as saying [source].
Let's hope this "focus on the customer" also includes a strategic initiative to build genuine uptime and availability, or maybe we'll be reading about another new RIM CEO next January.
Visit the ZeroNines website to find out more about how our disaster-proof architecture protects businesses of any description from downtime.
Alan Gin – Founder & CEO, ZeroNines
December 12, 2011
BAE, Microsoft, the Cloud, and Planning for When it All Goes Horribly Wrong
"If it fails in Ireland, it goes to Holland. But what if it fails in Holland as well?"
Paraphrase of Charles Newhouse, BAE [source]
Cloud news circuits have been abuzz the last few days over BAE rejecting Microsoft's Office 365 cloud solution because of the Patriot Act. This is the highest-profile rejection of a cloud offering I have seen. I am shocked and dismayed that after all the advancements that have improved continuity in the cloud, the network architectures our cloud service providers are offering are still in the stone age. They're still trying to use failover and pass it off as advanced and reliable. I can only assume that if given a 787 they would try to fly it off a dirt landing strip.
When you read the articles closely, it is clear that the big issue for BAE was data sovereignty. How does one retain control of data during a network disaster, and where does it go when your service provider has to failover from the primary network node to the backup? To quote Charles Newhouse, head of strategy and design at British defense contractor BAE,
"We had these wonderful conversations with Microsoft where we were going to adopt Office 365 for some of our unrestricted stuff, and it was all going to be brilliant. I went back and spoke to the lawyers and said, '[The data center is in] Ireland and then if it fails in Ireland go to Holland.' And the lawyers said 'What happen[s] if they lose Holland as well?'" [source]
And earlier in the same article he described the user experience during a cloud outage:
"A number of high profile outages that users have suffered recently demonstrated just how little control you actually have. When it all goes horribly wrong, you just sit there and hope it is going to get better. There's nothing tangibly you can do to assist" [source].
It's About More than Just the Patriot Act
The big focus in these articles is the Patriot Act. BAE lawyers forbade the use of Office 365 and the Microsoft public cloud because as a U.S. company, Microsoft could be required to turn BAE data over to the U.S. government under terms of the Patriot Act [source].
It is true that the Patriot Act can require cloud service providers like Microsoft (and Amazon, Google, and others) to give the U.S. government the data on their servers, even if those servers are housed outside the United States [source]. Newhouse also said that "the geo-location of that data and who has access to that data is the number one killer for adopting to the public cloud at the moment" [source].
But European governments are already moving to eliminate this loophole. As explained in November on ZDNet.com, a new European directive "will not only modernize the data protection laws, but will also counteract the effects of the Patriot Act in Europe" [source]. Sounds to me like Microsoft's jurisdictional problems will be solved for them. And failing that there is probably some creative and legal business restructuring that would do the trick.
It's Really about Failover and its Shortcomings
So if European law will provide data sovereignty from a legal standpoint, why reject the Microsoft cloud? It all comes back to "when things go horribly wrong."
When Newhouse describes the Ireland-to-Holland scenario, he is clearly talking about Microsoft failing-over from their Ireland datacenter to their Holland datacenter. I find it hard to believe that Microsoft thinks the outdated and flawed failover model is suitable for a leading cloud offering. Office 365 and their customers deserve better.
Apparently BAE agrees. It put its foot down and refused to play because the reality does not match the promise.
Failovers often fail, causing the downtime they were supposed to prevent. If the secondary site fails to start up properly (which is very common) or suffers an outage of its own, the business is either a) still offline or b) failed over to yet another location. The customer quickly loses control, network transactions get lost, and their data goes… where? Another server in Europe? Part of an American cloud? How many locations is Microsoft prepared to failover to, and where are they? And with the cloud these issues loom even larger because there is no particular machine that houses the data.
The Solution: Cloud and Data Reliability without Failover
ZeroNines offers two potential scenarios that will solve this problem:
1) Protect the cloud provider's systems from downtime, offering a far more reliable cloud.
2) Protect the business' systems from a cloud provider's downtime.
Our Always Available technology is designed to provide data and application uptime well in excess of five nines. ZenVault Medical has been running in the cloud on Always Available for about 14 months with true 100% uptime. Always Available runs multiple network and cloud nodes in distant geographical areas. All servers and nodes are hot, and all applications are active. If one fails, the others continue processing as before, with no interruption to the business or the user experience. There is no failover, and thus no chance for outages caused by a failed failover.
So if Microsoft were to adopt our Always Available technology, a storm like the one that knocked out their data center in Ireland this past August would not affect service. The Ireland node might go down, but all network activities would proceed as usual on other cloud data centers in Holland, Italy, or wherever they have set them up. Users would never know it.
If BAE adopted Always Available, they could bring their Microsoft cloud node into an Always Available array with other cloud nodes or data centers of their own choosing. A failure in one simply means that business proceeds on the others.
The business or the service provider can determine which nodes are brought into the array. BAE could choose to use only European cloud nodes to maintain data sovereignty.
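As a purely hypothetical illustration of that point (the node names, regions, and policy below are my assumptions, not a real Microsoft or BAE configuration), the array definition could simply exclude any node outside the approved jurisdictions:

```python
# Hypothetical node-array definition filtered for data sovereignty.
# Node names, regions, and the policy itself are illustrative assumptions.

candidate_nodes = [
    {"name": "office365-dublin",    "provider": "Microsoft", "region": "EU"},
    {"name": "office365-amsterdam", "provider": "Microsoft", "region": "EU"},
    {"name": "bae-private-london",  "provider": "BAE",       "region": "EU"},
    {"name": "us-east-backup",      "provider": "Microsoft", "region": "US"},
]

ALLOWED_REGIONS = {"EU"}   # set by legal and compliance, not by IT convenience

array = [n for n in candidate_nodes if n["region"] in ALLOWED_REGIONS]
assert len(array) >= 2, "need at least two active nodes to survive an outage"

print("Always Available array:", [n["name"] for n in array])
# The US node never receives data, so the sovereignty question never arises.
```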
ZeroNines' Always Available technology is built precisely for the moment "when it all goes horribly wrong." The difference is that with ZeroNines, it won't mean downtime.
Visit the ZeroNines website to find out more about how our disaster-proof architecture protects businesses of any description from downtime.
Alan Gin – Founder & CEO, ZeroNines
December 1, 2011
Retail Business Continuity on Black Friday and Cyber Monday
The economy has heaved a sigh of relief after good sales reports from the Thanksgiving weekend. Have you stopped to really think about the importance of reliable IT systems and business continuity during this and other key sales events?
A company really may live or die according to what it or its service provider does in preparation for Black Friday and Cyber Monday. The game is in the hands of the technicians more and more every year.
While these two days can herald great things during a good year, they can also seem like harbingers of doom if things don't go so well. Their grim-sounding names are oddly appropriate, and everyone watches with trepidation.
• Black Friday. Very ominous, evoking images of stock market crashes and other disasters. A few decades ago it came to mean "the day after Thanksgiving in which retailers make enough sales to put themselves 'into the black ink'" [source] which is actually a good thing.
• Cyber Monday. Sounds like something from The Terminator. Actually… "The term 'Cyber Monday' was coined in 2005 by Shop.org, a division of the National Retail Federation [source]." This is the Monday after Thanksgiving, when online sales show a significant spike. Cyber Monday has become a major shopping day and economic indicator in its own right.
Jittery analysts are poised every year with their thumb on the Recession Early Warning button, ready to sound the alarm if the score doesn't add up and the game goes badly. (I think they secretly enjoy this.)
It's All IT's Fault. But No Pressure, Guys! : )
Every year in advance of this season opener, IT Managers beg for money to upgrade servers, replace old circuit breakers and backup batteries, service the cooling systems, and do a thousand other things to help prop up their networks for the onslaught. They also stock up on the coffee, donuts, and Valium that will keep them going through long days and even longer nights of watching, waiting, rebooting, hot swapping, and occasionally panicking over system crashes and failovers. I do not envy them, as the fate of the economy apparently rests upon their shoulders.
If the IT systems go down the business is out of the game and the term "Black Friday" takes on an entirely new meaning. Revenue on Thanksgiving weekend is largely driven by time-sensitive discounts, so shoppers will buy from competitors if a website or point-of-sale (POS) system is down. For those of you running these systems, my heart goes out to you. I have been in similar situations myself many times.
Thanksgiving Weekend Outages Mostly Due to Heavy Traffic
There were a number of reports of ecommerce sites becoming unavailable on Thanksgiving, Black Friday, and Cyber Monday. Victoria's Secret went beyond secret and became downright invisible three separate times, for a total of about 80 minutes [source]. I have read about downtime and poor site performance at many other online retailers as well, including PC Mall and Crutchfield [source]. Universally, there is no mention of the cause of all this downtime, but the implication is that it was simple old-fashioned traffic overload.
Fire Suppression System Suppresses Sales on eBay
One outage not caused by traffic hit ProStores, an online store solution used by lots of smaller operations to run their eBay storefronts. According to a Thanksgiving Day post by ProStores on their discussion board, "the data center fire suppression system tripped the Emergency Power Off (EPO) system causing a loss of power to the data center's raised floor environment" [source]. As is usual in such circumstances, it took most of the day before things could be brought back to normal. I strongly suggest you read their post, as it is an excellent account of the gyrations an IT department has to go through in such situations. I applaud ProStores for being so forthright and providing this information.
Preventing this and Other Outages
Always Available technology from ZeroNines could have prevented the ProStores outage entirely. Yes, that faulty fire suppression system would still have freaked out at that particular data center. But Always Available would have been running one, two, or more instances of the same applications and transactions in the cloud or at other data centers. ProStores clients and their customers would never have known there was a power outage and no sales would have been lost.
ProStores made no mention at all of failover, so I assume they do not have a failover-based recovery system in place. With ZeroNines, that's perfectly fine because we do not use failover either. We make failover unnecessary. We offer disaster avoidance, not disaster recovery. There is no way to prevent all system malfunctions because there are too many complex parts. Next month maybe a circuit breaker will fail. After that, maybe it's a failed hard disk and an application crash. The list goes on.
Girding Your Loins for Next Year
Online retailers wanting to guard themselves against a Black Friday blackout (or on any other day) should consider the modular approach ZeroNines takes. You can apply Always Available to selected high-value systems such as:
- Webstore servers and databases
- Product/inventory databases
- Payment systems
- Image rendering systems
To prevent traffic-related outages, set up proper load balancing. If huge players like J.C. Penney, Apple, Macy's, Sears, Amazon, and Dell can come through Cyber Monday with flying colors [source], you can too. But for the hardware failures, human mistakes, software crashes, and other things that can hit you any day of the year as well, look into ZeroNines.
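To make the load-balancing point concrete, here is a minimal Python sketch of round-robin request routing with a simple health check. The backend names and the /health endpoint are hypothetical and this is not ZeroNines code; a real deployment would use a dedicated load balancer or CDN, but the principle of spreading traffic across every healthy webstore instance is the same.

```python
# Minimal round-robin load balancer sketch (illustrative only; the backend
# hostnames and the /health endpoint are hypothetical examples).
import itertools
import urllib.request

BACKENDS = [
    "http://store-1.example.com",
    "http://store-2.example.com",
    "http://store-3.example.com",
]

def healthy(url: str) -> bool:
    """Return True if the backend answers its health-check endpoint."""
    try:
        with urllib.request.urlopen(url + "/health", timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False

_rotation = itertools.cycle(BACKENDS)

def next_backend() -> str:
    """Pick the next healthy backend in round-robin order."""
    for _ in range(len(BACKENDS)):
        candidate = next(_rotation)
        if healthy(candidate):
            return candidate
    raise RuntimeError("no healthy backends available")

if __name__ == "__main__":
    try:
        print("Routing request to", next_backend())
    except RuntimeError as err:
        print("No backend available:", err)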
Visit the ZeroNines website to find out more about how our disaster-proof architecture protects businesses of any description from downtime.
Alan Gin – Founder & CEO, ZeroNines
October 24, 2011
Building Outage Resistance into Network Operations
An article I read the other day in MIT's Technology Review [source] nicely sums up what I've been hearing about cloud operations from dozens of clients, partners, and other colleagues around the country. The cloud is great for development, prototyping, and special projects for enterprises, but don't rely on it for anything serious. As that article says, "For all the unprecedented scalability and convenience of cloud computing, there's one way it falls short: reliability."
But the truth is that the tried-and-true models of network operations aren't all that reliable themselves, and neither are the disaster recovery systems that are supposed to protect them. Granted, they are probably more reliable than the cloud at this point, but downtime is downtime whether it's in the cloud or in a colocation facility. The effect is the same.
What is really needed is outage resistance that is built into network operations, whatever the model.
Why downtime happens
I recently read an interesting whitepaper from Emerson Network Power [source] that describes the seven most common causes of downtime as revealed by a 2010 survey by the Ponemon Institute (http://www.ponemon.org/index.php). The causes are all pretty mundane: UPS problems such as battery failure or exceeded capacity, power distribution unit and circuit breaker failures, cooling problems, human error, and similar things. All of them apply to any data center, whether in-house or in the cloud. None of the exciting stuff like fires, terrorism, or hurricanes made it into the top seven, though of course any of those could lead to a failure of a battery, circuit breaker, or cooling unit.
The Emerson whitepaper describes best practices that can reduce the likelihood of downtime induced by each of the top seven causes. That is all well and good, but some are very costly, such as remodeling server rooms "to optimize air flow within the data center by adopting a cold-aisle containment strategy." Other recommendations include regular and frequent inspection and testing of backup batteries, installation of circuit breaker monitoring systems, and increased training for staff.
These are good ideas but costly, if not in capital for server room reconfiguration then in staff hours and other recurring costs. The paper contends that problems caused by human error are "wholly preventable" but I believe this is a mistake. No matter how stringent the rules or how well-documented the procedures, someone will take short cuts, overlook a vital step in the midst of a crisis, or sneak their donut and coffee into the control room. Applications fail under stress, databases fail to restart properly, and any number of other things can and do go wrong. There is no way to write contingencies for each, particularly when the initial failure leads to an unpredictable cascade effect.
And what of the cloud?
I believe the cloud brings tremendous value to developers, SMBs, and other institutions that need low cost and great flexibility. Where else can an online store launch with a configuration that is not only affordable but also ready for both super-slow sales and a drastic ramp-up if sales shoot into the stratosphere? But like most “better, cheaper, faster” initiatives, the cloud has genuine reliability problems. A company running their own data center could choose to incur the expense and work of instituting all of Emerson's best practices since they are in control of the environment. But all they have from their cloud provider (or colocation provider for that matter) is their Service Level Agreement (SLA). They can't go in themselves and swap out aged batteries or fire the guy who persists in smuggling cinnamon rolls into the NOC.
The Technology Review article tells us that some companies are looking for ways to make their cloud deployments far more disaster resistant to start with, rather than just relying on their cloud provider's promises [source]. Seattle-based software developer BigDoor experienced service interruptions as a result of the Amazon cloud's big outage in April 2011. Co-founder Jeff Malek said "For me, [service agreements] are created by bureaucrats and lawyers… What I care about is how dependable the cloud service is, and what a provider has done to prepare for outages" [source].
The same article describes the Amazon SLA and its implications:
Even though outages put businesses at immense risk, public cloud providers still don't offer ironclad guarantees. In its so-called "service-level agreement," Amazon says that if its services are unavailable for more than 0.05 percent of a year (around four hours) it will give the clients a credit "equal to 10% of their bill." Some in the industry believe public clouds like Amazon should aim for 99.999 percent availability, or downtime of only around five minutes a year.
The outage-resistant cloud
ZeroNines can give you that 99.999% (five nines) or better, whether you are running a cloud or just running in the cloud. Cloud service providers could install an Always Available™ configuration on their publicly-offered services, providing a highly competitive edge when attracting new customers.
Individual businesses could install an Always Available array on their own networks, synchronizing any combination of cloud deployments, colocation, and in-house network nodes. It also facilitates cloud migration, because you can deploy to the cloud while keeping your existing network up and running as it always has. There is no monumental cloud migration that could take the whole network down and leave the business stranded if there's a glitch in starting an application. Instead, Always Available runs all servers hot and all applications active, enabling entire nodes to fall in and out of the configuration as needed without affecting service. The remaining nodes can update a new or re-started node once it rejoins the system.
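To make the idea of nodes falling in and out of a hot, active configuration more concrete, here is a toy Python sketch. Every class and method name below is a hypothetical illustration of the concept, not the actual Always Available API: all active nodes apply every transaction, a node can be pulled for maintenance without interrupting service, and the surviving nodes bring it back up to date when it rejoins.

```python
# Sketch of an active-active node pool (hypothetical names, not the
# ZeroNines implementation): every node stays hot, a node can leave for
# maintenance and rejoin later, and the survivors bring it up to date.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    log: list = field(default_factory=list)   # transactions this node has applied
    active: bool = True

class ActiveActivePool:
    def __init__(self, nodes):
        self.nodes = nodes

    def apply(self, txn: str) -> None:
        """Every active node processes every transaction."""
        for node in self.nodes:
            if node.active:
                node.log.append(txn)

    def remove(self, name: str) -> None:
        """Take a node offline for maintenance; service continues."""
        self._find(name).active = False

    def rejoin(self, name: str) -> None:
        """Bring a node back; copy missed transactions from a live peer."""
        node = self._find(name)
        peer = next(n for n in self.nodes if n.active and n is not node)
        node.log = list(peer.log)   # catch up from the surviving nodes
        node.active = True

    def _find(self, name: str) -> Node:
        return next(n for n in self.nodes if n.name == name)

pool = ActiveActivePool([Node("cloud-a"), Node("colo-b"), Node("onsite-c")])
pool.apply("order:1001")
pool.remove("colo-b")          # maintenance window; no downtime
pool.apply("order:1002")
pool.rejoin("colo-b")          # rejoins fully up to date
assert pool._find("colo-b").log == ["order:1001", "order:1002"]
```

The point of the sketch is simply that no single node is ever indispensable, so taking one out, whether for a cloud migration step or a repair, never becomes a service event.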
ZeroNines client ZenVault Medical (www.zenvault.com/medical) developed and launched their live site in the cloud using an Always Available configuration. Since its launch in September 2010 it has run in the cloud with true 100% uptime, with no downtime at all. That includes maintenance and upgrades. When a problem or maintenance cycle requires a node to be taken offline, ZenVault staffers remove it from the configuration, modify it as necessary, and seamlessly add it back into the mix once it is ready. ZenVault users don't experience any interruptions.
Visit the ZeroNines.com website to find out more about how our disaster-proof architecture can protect businesses of any description from downtime.
Alan Gin – Founder & CEO, ZeroNines
October 13, 2011
What Did One BlackBerry User Say to the Other BlackBerry User?
Nothing, according to Twitter user @giselewaymes (source).
In what has to be every large enterprise IT manager's worst nightmare, a big high profile outage grew into a monster, expanded to global proportions, made headlines everywhere, and after three days seemed to have no end in sight. The cause was a failed failover that could have been avoided.
Background: RIM BlackBerry
BlackBerry is produced by Canadian firm Research In Motion (RIM). It is one of the leading smart phones among business users. Its real forte is encrypted mobile email and instant messaging. BlackBerry has about 70 million users worldwide (source). Several high-profile outages and many smaller ones have tarnished its reputation, and this week's seems to be pushing the company to the breaking point if all the buzz on the Internet is to be believed.
The Problem: Failed failover
On Monday morning October 10 2011, millions of BlackBerry users in Europe, the Middle East, and Africa lost access to messaging, email, and the Internet. The outage spread to every continent and may eventually have affected half of all BlackBerry users (source).
RIM explained things to some degree on their website on Tuesday October 11: "The messaging and browsing delays that some of you are still experiencing were caused by a core switch failure within RIM’s infrastructure. Although the system is designed to failover to a back-up switch, the failover did not function as previously tested" (source).
In other words, their failover-based disaster recovery system failed. It can be inferred that this led to cascading failures that knocked out other systems in other regions, leading to this worldwide problem. As of Wednesday evening the 12th it was still not fully resolved, with an interesting update posted on their site outlining the status in various parts of the world (source). By Thursday morning it looked like things were finally under control, with service almost back to normal in most areas.
The Cost: Paid compensation and a blow to the business
I don't doubt that RIM will compensate users in one way or another, perhaps in the form of free service (which seems to be the industry's de-facto compensation currency). RIM Co-CEO Jim Balsillie said that such a step would be considered but that their immediate focus was fixing the problem (source).
More damaging is the additional blow to RIM's reputation. Lots of users are claiming on Facebook, Twitter, and other online forums that this is the last straw and that they will quit BlackBerry. For many this may be a hollow threat, but there is genuine peril here. "This outage… comes at a particularly bad time for RIM, since it faces increasing competition in the smartphone market… Apple's iPhone and phones on the Google Android operating system have been gaining ground, and the new iPhone 4S goes on sale Friday (October 14)" (source).
As for RIM itself, back in June there was a flurry of articles suggesting RIM was potentially facing bankruptcy (source). And this week there have been a number of stories about growing momentum for a RIM breakup or merger (source). Even a massive outage like this is unlikely to cause the demise of a large and important firm, but combined with other woes like a less-than-competitive product and poor business model it could well be the deciding factor.
The Solution: Eliminate failover systems
RIM is in trouble for a number of reasons but downtime like this does not need to be one of them. I contend that the core problem was not a failed switch but a failed failover. Switches will fail and there is no avoiding that. If you can architect the perfect switch, I invite you to do so and you'll be richer than Bill Gates. It's what happens after the inevitable switch malfunction (or other disaster) that matters most. Failover systems will fail too. RIM's apparently worked fine during a test but the strain and chaos of a real-world crisis was too much for it. At ZeroNines, we propose eliminating the failover systems in favor of something that will turn failures into virtual non-events.
ZeroNines' Always Available™ technology eliminates the need for failover, processing the same applications and data simultaneously on multiple servers, clouds, and virtual servers separated by thousands of miles. All servers are hot, and all applications are active. So if a switch fails in one network instance there is no need for a risky failover to another. Other instances are already processing the same transactions in parallel and simply continue processing as if nothing had happened. Once the problem with the switch is rectified, that instance is brought back into the Always Available array, is automatically updated, and resumes processing along with the others.
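The contrast with failover can be illustrated with a short, hypothetical Python sketch. This is an assumption-laden toy, not the Always Available implementation: every transaction is dispatched to all active instances at once, so the loss of one instance never triggers a switchover, because the others were already doing the same work.

```python
# Toy illustration of "all servers hot, all applications active": each
# transaction goes to every instance in parallel, so a failed instance
# never requires a failover. Instance behavior is simulated; nothing
# here is the real Always Available implementation.
from concurrent.futures import ThreadPoolExecutor

INSTANCES = ["us-east", "eu-west", "ap-south"]
FAILED = {"us-east"}            # pretend this site just lost its core switch

def process(instance: str, txn: str) -> str:
    if instance in FAILED:
        raise ConnectionError(f"{instance} is unreachable")
    return f"{txn} committed on {instance}"

def broadcast(txn: str) -> list:
    """Send the transaction to every instance; keep whatever succeeds."""
    results = []
    with ThreadPoolExecutor() as pool:
        futures = {pool.submit(process, i, txn): i for i in INSTANCES}
        for fut in futures:
            try:
                results.append(fut.result())
            except ConnectionError:
                pass            # that node is down; the others carry on
    if not results:
        raise RuntimeError("all instances failed")
    return results

print(broadcast("payment:42"))  # succeeds despite the us-east failure
```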
The Numbers
RIM says that its service "has been operational for 99.7% of the time over the last 18 months" (source). That equates to about 1,576.8 minutes of downtime, or 26.28 hours per year.
A good industry standard for uptime is 99.9% or three nines. That is 525.6 minutes of downtime, or 8.76 hours per year.
ZeroNines can provide in excess of five nines of uptime, or 99.999%. That is less than 5.3 minutes of downtime per year.
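For readers who want to check those figures, the arithmetic is simple: downtime per year is just the number of minutes in a year multiplied by the fraction of time the service is unavailable. A short, purely illustrative Python snippet reproduces the numbers above:

```python
# Convert an uptime percentage into downtime per year.
MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600

def downtime(uptime_pct: float):
    """Return (minutes, hours) of downtime per year for a given uptime %."""
    minutes = MINUTES_PER_YEAR * (1 - uptime_pct / 100)
    return round(minutes, 1), round(minutes / 60, 2)

for label, pct in [("RIM (reported)", 99.7),
                   ("Three nines", 99.9),
                   ("Five nines", 99.999)]:
    mins, hrs = downtime(pct)
    print(f"{label}: {pct}% uptime -> {mins} minutes/year ({hrs} hours/year)")
```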
I do not know if planned downtime was included in RIM's 99.7% calculation. Companies often do not include planned downtime in their business continuity projections, counting only unplanned outages. But downtime is downtime from a user's perspective, whether caused by an accident or a planned maintenance cycle. ZeroNines protects against both.
In the last 12 months since ZenVault Medical went live on an Always Available cloud-based architecture it has experienced true 100% uptime, with no downtime whatsoever for any reason. That includes planned maintenance, upgrades, and other events that would have taken an ordinary network offline.
Visit the ZeroNines.com website to find out more about how our disaster-proof architecture can protect businesses of any description from downtime.
Alan Gin – Founder & CEO, ZeroNines
August 10, 2011
Amazon EC2 Outage: Déjà Vu All Over Again
It seems we can always rely on cloud outages to spice up the news feeds. Today, it's another Amazon EC2 Cloud outage, which is a nice departure from the wildly gyrating stock market and the U.S. debt downgrade.
I didn't write about Amazon's big April 2011 EC2 outage simply because I was overwhelmed with other work (along with texts, tweets and emails about the outage). That outage affected big-name customers like Netflix, Foursquare, HootSuite, and Reddit (source). Some EC2 customers' websites were down for as much as two days.
Then just this past weekend an electrical storm over Dublin, Ireland led to a lightning strike on a transformer and a subsequent explosion, fire, and loss of power at an Amazon data center. Backup generators could not be started. Amazon's European EC2 service was affected for as long as twelve hours. Some Microsoft cloud services were knocked out as well (source).
I am a huge proponent of the cloud; however, I believe reliability can and should improve. As a frequent speaker and panelist at cloud-related events, I find that many in the audience are not convinced that the cloud is reliable enough to meet the needs of mission-critical applications. Outages like this don’t help. However, I am aware of several successful implementations of robust, outage-resistant cloud deployments that simply have not gotten any attention because the clients are not motivated to share how they did it with their competitors. Some of these early adopters took risks and made large investments when the mainstream would not, and they feel they deserve some advantage while they can get it. Naturally enough I think ZeroNines has the right solution, but read on for now.
Background: Amazon as a major cloud provider
Amazon EC2 is the Amazon Elastic Compute Cloud (source). It provides thousands of online service providers and software developers easy access to cloud computing capacity that is variable in size. Customers pay only for what they use. Their customers include Netflix (streaming movies and TV shows), Instagram (photo sharing), Reddit (social networking for sharing news), and Foursquare (location-based social networking).
The Problem: Something's rotten in the state of Virginia
I have not found a clear statement yet that describes the exact cause of the August 8 outage, but PCMag.com says that it "closely mirrors a similar cloud outage Amazon suffered in April" (source). It also happened in the same Virginia data center. The April 2011 outage "happened after Amazon network traffic was 'executed incorrectly.' Instead of shifting to another router, traffic went to a lower-capacity network, taking down servers in Northern Virginia." (source). So Amazon loses points for allowing the same problem to happen twice in the same place, but wins a few back for apparently being ready this time and containing the August 8 outage to minutes rather than days.
The Cost: Revenue and reputation
As always with these outages there is talk of the provider compensating its customers through waived fees and such. Mark that against Amazon's balance sheet. Customers no doubt lost business, and you can mark that against their balance sheets. Reliability issues will chase away customers who don't want to risk their own revenue with a service notorious for crashing. But if the cloud nonetheless offers the best business model, what do these customers do? Press for lower fees and more favorable service level agreements for one.
The Solution: Prevention, not recovery
If you're an actual or potential cloud user (with any provider), Always Available™ from ZeroNines can protect your existing systems without changing providers, hardware, operating systems, or applications. If there's a disaster in any part of your system, all your networked transactions and applications continue functioning as normal on the other network nodes. Our CloudNines™ application can protect your cloud-based infrastructure, VirtualNines™ can protect virtualized environments on your own machines, and EnterpriseNines™ can add Always Available protection to any other network infrastructure. You can mix and match so all these can interoperate seamlessly. For businesses of any size, the result is uptime of virtually 100% regardless of the disasters that may strike any individual node in the Always Available array.
The cloud providers themselves could use the same CloudNines product to protect their systems, virtually eliminating downtime and avoiding headlines like Amazon's. We are currently developing and monitoring on Amazon and other cloud platforms. Our technology is certified for Windows Server® 2008, compatible with Windows Server® 2008 Hyper-V™ and Hyper-V™ Server, and certified as VMWare® ready.
Visit the ZeroNines.com website to find out more about how our disaster-proof architecture can protect businesses of any description from downtime.
Alan Gin – Founder & CEO, ZeroNines
October 11, 2010
Announcing ZenVault Medical: Your Cloud-Based, Secure, Encrypted Personal Health Record
I had a heart attack back in 2008. I was lucky: my local emergency room facility and the intensive-care hospital I was transferred to happened to share my medical records in electronic format. But only about 10% of U.S. hospitals use electronic records, so if this had happened away from home I probably would have died, because no other doctor or hospital would have known about my pre-existing medical conditions.
It was suddenly very easy for me to see the need for a system that would allow consumers to take their medical records with them wherever they go. Not only for emergencies but for everyday reference. Some quick Googling revealed Personal Health Record (PHR) solutions from Microsoft (HealthVault), Google (Google Health) and a large number of others, but consumer adoption was low. I also discovered that the Electronic Medical Records (EMRs) used by hospitals and doctors were no solution because they are inaccessible to consumers and practitioners outside the system.
I enlisted the help of my personal doctors, friends, and classmates who work in the healthcare field, as well as other technologists who consult to large medical organizations around the country. All told, we consulted 36 experts who freely gave us their opinions about the issues surrounding EMRs and how a comprehensive PHR should be designed to deliver high value to consumers while potentially saving lives. I summarize the issues below and describe how we address them.
So today we at ZeroNines introduced ZenVault Medical (www.zenvault.com/medical), a Cloud-based, private, encrypted, online PHR for consumers that you can access through a computer or mobile device. In addition to helping people with their medical care, it’s a great example of how the Cloud and other cutting-edge technologies can come together to create a unique and valuable consumer product.
Background: The Need for Digital Medical Records
If you’re like most people, your medical records are scattered among a number of doctors and they are hard to get to. The Obama administration wants the country to convert to Electronic Medical Records. The goal is to improve healthcare and cut costs by making an individual’s collection of medical records available electronically at any hospital or doctor’s office, cutting down on paper volume, saving time, and increasing accessibility particularly in emergencies. This truly needs to happen – my own experience proves that – but the issue is how.
The Problem: Security, Privacy, and Reliability
Questions surrounding security and privacy make many citizens and consumer advocates reluctant to jump on board. Will such a system be run by the government or by business? Who will have access? Will sensitive personal information about illnesses, prescriptions, and treatments be turned over to insurance companies? To marketers? To employers? Can any body of law successfully regulate how such highly personal information is handled and protected, enabling it to benefit the individual yet keeping it out of the hands of those who would profit by violating privacy? Is it even the government’s place to get involved with personal medical records? And what technology is secure enough to handle all this?
Security: Any medical records system needs to keep hackers at bay. Well-publicized data breaches at Microsoft and Google call into question their ability to protect medical privacy. Frankly, I had decided to subscribe to one of these systems before we came up with ZenVault, but I was concerned about who might be accessing my records and selling them to insurance companies and marketing firms.
Privacy: Many companies offering free digital medical records turn around and sell customer data to pharmaceutical and insurance companies. And a September 16 2010 article in the Wall Street Journal described a data breach wherein a Google engineer broke the company’s privacy policies by accessing private customer information.
Reliability: If anything needs 100% uptime, it’s medical applications. Take a look at some of the high-profile downtime events discussed in the rest of this blog and then imagine the cost in lives and well-being if they had affected hospital emergency rooms.
The Solution: Customer Control of a Safe, Secure, and Always Available™ Personal Health Record
Simply putting control of the health record in the hands of the individual consumer or patient addresses the bulk of these concerns. If no one can read the record but the customer, that’s most of the battle won. So what is the difference between ZenVault Medical and other consumer-facing PHRs like Google Health and HealthVault?
Security: ZenVault encrypts stored records with a patent-pending variant of the NSA-approved encryption protocols that protect top-secret information. ZenVault does not employ a “key ring” that stores customer encryption keys, which means there is no copy available for anyone to find and use to rummage through your data. The customer creates his or her own unique encryption key, so only they can access and edit their private medical records. SSL-secured sessions protect data in transit from computers, smartphones, and tablets. (A generic sketch of this customer-held-key idea appears after this list.)
Privacy: ZenVault never shares information. Period. We don’t sell it, rent it, or give it away, not even in a “sanitized” format like some admit to doing. We charge consumers for our service and our business model is based on customer trust. If they don’t trust us we lose. In fact, our encryption system prevents even our own engineers and administrators from reading patient data, so we couldn’t sell it even if we wanted to. How’s that for a guarantee?
Reliability: ZenVault uses ZeroNines' Always Available™ technology, designed to protect the world's most sensitive financial and military computer systems. There is virtually no "downtime" or data loss with ZenVault. A Cloud-based infrastructure helps keep costs down, ensures scalability, and supports universal accessibility. Use of Always Available allays any concerns over Cloud reliability. In fact, we intend to use ZenVault as an example of a highly reliable, high-usage application deployed in the Cloud. Read more about Always Available on the ZeroNines.com website.
Convenience: Users can update or read their records anywhere they have Internet access. They can send their records to any doctor with just a few clicks using a secure message system. Have you ever wasted time at a doctor appointment filling out a clipboard full of medical history forms? Use ZenVault to send them your PHR instead! Doctors can send patients their records, lab results, and x-rays with equal ease.
Affordable: A free account is available, offering a basic PHR with full security, encryption, and privacy protection. A premium account adds advanced features for a small monthly charge.
Secure Emergency Room Access: ZenVault offers emergency rooms their own accounts with their own special encryption keys. They get controlled access to six key fields in a patient’s record such as history of heart disease, drug sensitivities, and emergency contact information. This gives them the basic information they need to save a life and contact loved ones yet protects the majority of personal information until the patient or their family elects to release it.
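As promised above, here is a generic Python sketch of the customer-held-key idea, using the open-source cryptography package. To be clear, this is not ZenVault's patent-pending scheme, and every name in it is illustrative: it simply shows how a key derived from the user's passphrase, and never stored by the service, leaves the service holding nothing but ciphertext.

```python
# Generic sketch of client-held-key encryption (NOT ZenVault's actual
# scheme): the key is derived from the user's passphrase and never stored
# server-side, so only the user can decrypt the record.
# Requires the third-party "cryptography" package.
import base64, hashlib, os
from cryptography.fernet import Fernet

def derive_key(passphrase: str, salt: bytes) -> bytes:
    """Derive a symmetric key from the user's passphrase (PBKDF2-HMAC-SHA256)."""
    raw = hashlib.pbkdf2_hmac("sha256", passphrase.encode(), salt, 200_000)
    return base64.urlsafe_b64encode(raw)

salt = os.urandom(16)                      # stored alongside the ciphertext
key = derive_key("correct horse battery staple", salt)

record = b"Allergies: penicillin. History: MI 2008."
ciphertext = Fernet(key).encrypt(record)   # what the service actually stores

# Only someone who knows the passphrase can recover the plaintext.
assert Fernet(derive_key("correct horse battery staple", salt)).decrypt(ciphertext) == record
```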
Take Your Personal Health Record with You
If you have Internet access, you can use ZenVault. I hope none of you ever has a medical emergency like the one that sent me to the hospital two years ago. But if you do, ZenVault could save your life by putting the needed information in the right place, at the right time. I have no doubt that one day a universal health record database will be a reality, but until then you can have all the benefits while keeping control yourself. Try it out and let me know what you think: www.zenvault.com/medical.
Visit the ZeroNines.com website to find out more about how our disaster-proof architecture can protect businesses of any description from downtime.
Alan Gin – Founder & CEO, ZeroNines