December 22, 2009

Most Businesses Don’t Know What Downtime Costs Them

I recently came across the results of a survey about the need for application availability among businesses [source]. Conducted by ITIC and Stratus Technologies and released in April 2009, the survey sought to find out how much application uptime businesses think they need, and what they intend to do about it.

The survey found that, overall, IT executives are aware of the growing need for high-availability applications and the infrastructure to support them. But budgets are too small to pay for that infrastructure, and most companies do not know what their downtime is costing them. This makes it difficult for those same executives to make a budgetary case for implementing high-uptime solutions.

Downtime is a business killer. As an example, consider that of the 350 companies in the World Trade Center before the 1993 truck bombing, 150 were out of business a year later because of the disruption. [source: Gartner/RagingWire report cited in “Without the wires,” Fabio Campagna, Disaster Recovery Journal, Winter 2002].

The big lesson here: there is a significant competitive advantage in investing in uptime.

Here are some key facts from the survey, and my thoughts about them.

1) “Two out of five businesses – 40% – report that their major business applications require higher availability rates than they did two or three years ago. However an overwhelming 81% are unable to quantify the cost of downtime and only a small 5% minority of businesses are willing to spend whatever it takes to guarantee the highest levels of application availability 99.99% and above.”

Clearly, the field is wide open for companies to pull ahead if they go for four or five nines of uptime (or more), particularly those who serve vital and highly regulated sectors such as financial, healthcare, defense, data hosting, and so forth. A company that falls out of compliance with strict regulations like Sarbanes-Oxley or HIPAA can be driven to the brink by fines, the costs of regaining compliance, and lost business.

2) “The survey results uncovered many “disconnects” between the levels of application reliability that corporate enterprises profess to need and the availability rates their systems and applications actually deliver.” In other words, businesses are not getting the uptime they require, whether it is to meet SLAs or simply conduct everyday business.

In reality, the uptime that company leaders “profess to need” is probably insufficient. Considering that a downtime event of only a few seconds can cause a cascading failure in applications and databases, they probably need uptime of practically 100% to avoid a bigger disaster. Once that first domino falls, maybe you can grab it and stand it back up, but all the rest are already falling. The damage is done.

3) “Some 41% said they would be satisfied with conventional 99% to 99.9% (the equivalent of two or three nines) availability for their most critical applications.”

I can’t imagine a company being without its “most critical application” for anywhere from almost nine hours a year (at 99.9% uptime) to more than three and a half days (at 99%). Companies have gone out of business after less downtime than that. I can’t help but believe that the executives who answered this way are somehow out of touch with the realities of their environment. Maybe they are in industries where expectations are really low. But can you think of a bank or stock brokerage or hospital where one- or two-day outages a couple of times a year are the norm? I can’t. And that is probably because such companies cease to exist.

Contrast that with this: “An overwhelming 81% of survey respondents said the number of applications that demand high availability has increased in the past two-to-three years.” High availability is typically considered to be four nines (99.99% availability and above) or less than 53 minutes of downtime per year. Yet 41% of respondents say they would be satisfied with only two or three nines? Astounding.
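
For readers who want to make the “nines” arithmetic concrete, here is a quick back-of-the-envelope script. It is a sketch only; real SLAs differ in how they count planned maintenance and partial outages.

    # Convert availability percentages into yearly downtime budgets.
    # Assumes a 365-day year; real SLAs vary on what counts as downtime.
    HOURS_PER_YEAR = 365 * 24  # 8,760

    for label, availability in [("two nines", 0.99), ("three nines", 0.999),
                                ("four nines", 0.9999), ("five nines", 0.99999)]:
        downtime = HOURS_PER_YEAR * (1 - availability)
        if downtime >= 1:
            print(f"{label}: {downtime:.1f} hours of downtime per year")
        else:
            print(f"{label}: {downtime * 60:.1f} minutes of downtime per year")

Two nines allows three and a half days of outage per year; five nines allows barely five minutes.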

The Disaster of Disaster Recovery

IT executives typically prepare for downtime by implementing some variation of the backup/failover paradigm, even though most are aware it is unlikely to work. I invite you to read the ZeroNines whitepaper “The Disaster of Disaster Recovery” (available on the ZeroNines.com website) which looks at the causes of downtime and explores the shortcomings of the predominant failover disaster recovery technique. It also discusses the ZeroNines alternative, which can bring uptime beyond any measure of “nines” to virtually 100%.

ZeroNines Technology, Inc. is not affiliated with ITIC (the Information Technology Intelligence Corp.) or with Stratus Technologies.

Visit the ZeroNines.com website to find out more about how our disaster-proof architecture can protect businesses (and government agencies) of any description from downtime.

Alan Gin – Founder & CEO, ZeroNines

November 23, 2009

Fixing The FAA’s Single Point of Failure

“The difficulties started when a single circuit board in a piece of networking equipment at a computer center in Salt Lake City failed around 5 a.m…” [source]

All too often it seems that the biggest problems are caused by the smallest failures. This blog is full of posts about how generator transfer switches, router programming changes, and problematic network hardware can bring businesses to their knees. Now a single circuit board failure causes havoc among airlines, airports, and air travelers.

I stand by my earlier assertion: trying to guarantee application and data uptime by eliminating all possible sources of failure is not possible. The more complex a system gets, the more likely it is that some part will fail, and it is impossible to identify every potential failure point in advance. But there is a way to prevent these little disasters from becoming big ones.

Background: The Flight Plan Management System

The failed FAA computer system was the National Airspace Data Interchange Network [source], which manages flight plans and ground traffic. The Salt Lake City facility is one of two nationwide computer centers that collect flight plans; the other is in Atlanta. This was the third time since June 2007 that the system had failed [source].

The Problem: Hardware Failure Blocks Access

When the circuit board failed on November 19, 2009, access to data and communications was blocked, making flight plans filed by airlines inaccessible [source]. Air traffic controllers had to enter flight plans manually in several parts of the U.S. The problem was fixed about five hours later.

The Cost: Mostly to the Airlines

Because the FAA is a government agency, no direct fiscal impact can be readily estimated. However, the cost to the airlines has to be considerable, since many flights were canceled or delayed. Airline stocks were down that day – whether the computer failure was the cause or not – and our poor beleaguered airlines can’t help but suffer when something like this happens. They were still down even after the problem had been fixed [source]. And of course individual travelers, myself included, bear the brunt too, in the form of delays, costlier alternative travel, and unplanned hotel stays – not to mention missed business meetings, which can cost a business a lot more than a replacement airline ticket. The domino effect of airline delays is a disaster unto itself.

The Solution: Sidestep the Single Point of Failure

Doug Church, a spokesman for the National Air Traffic Controllers Association, said: "We think it's a single-point failure that occurred somewhere in the system. One single glitch was able to shut down the entire system." [source]

This is perhaps the scariest statement about the whole affair. The simple fact that they went dark shows that their backup systems also failed. This is not surprising; most disaster recovery systems use the “failover” or “cutover” technique which is outdated, unreliable and can lead to cascading failures and increased downtime. Such occurrences are frighteningly common.

At ZeroNines we propose a different approach. Instead of trying to catch a downtime event with a failover recovery, like a ninja trying to catch an arrow, we simply double up all the processing in multiple data centers around the country or around the world. Each processes the same thing at the same time, so if “a single circuit board in a piece of networking equipment at a computer center in Salt Lake City” fails, the additional networking equipment in Atlanta or Omaha or Dusseldorf or wherever keeps on processing.

The likelihood of application or data downtime – where users lose access to the tools and information they need to do their jobs – drops to virtually zero because the chances of all data centers, or clouds, or virtual environments failing simultaneously is statistically negligible. In this instance, had the National Airspace Data Interchange Network been protected by our Always Available™ technology, the Atlanta network node would simply have continued processing while Salt Lake City was repaired and brought back online. Then the system would have automatically updated Salt Lake with all the transactions that had occurred in its absence.
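
The math behind “statistically negligible” is worth a quick illustration. Assuming the nodes share no common failure mode (which is the whole point of the geographic separation), outage probabilities multiply. The figures below are illustrative, not measurements of any real deployment.

    # Independent outage probabilities multiply: even three nodes that are
    # individually only 99% available are, together, effectively six nines.
    # Illustrative only; assumes no shared failure mode between nodes.
    node_availability = 0.99
    for n in (1, 2, 3):
        p_all_down = (1 - node_availability) ** n
        print(f"{n} node(s): combined availability {(1 - p_all_down):.6%}")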

Visit the ZeroNines website to find out more about how our disaster-proof architecture can protect businesses (and government agencies) of any description from downtime.

Alan Gin – Founder & CEO, ZeroNines

October 22, 2009

Enabling Cloud Confidence

A week ago I wrote about the Sidekick disaster and how events like that keep doubts growing, pushing wholesale adoption of the cloud further away. This doubt has made it into the mainstream media, where it will taint the opinions of potential cloud users, both consumer and commercial. We at ZeroNines think we have the solution that will enable the cloud to perform as it needs to.

The core problem with outages is not the existence of hazards that can damage servers and knock elements of a network (cloud or otherwise) offline. Storms, fires, and equipment failure will always happen and there is no way to eliminate them. The real problem is the reliance of cloud providers on obsolete failover-based recovery paradigms that simply can’t maintain continuity when disaster does strike.

L.A. Times columnist David Sarno perfectly sums up the cloud’s tenuous situation in his October 18 article “Still hazy on cloud computers' security” [source]. “A series of incidents involving cloud computing over the last several months has poked holes in the hype bubble, raising questions about the cloud's dependability -- and whether it's ready for use by a broader group of workers and businesses.” He is right on target.

Meeting the Need to Fortify

As Sarno puts it, “As e-mail, word processing and data storage continue to move from users' computers to the Web, companies must fortify their servers from a variety of potential disasters -- natural and man-made -- to help ensure that the data and the applications are accessible at all times.” He quotes Google’s SEC filing:

"(Google’s) systems are vulnerable to damage or interruption from earthquakes, terrorist attacks, floods, fires, power loss, telecommunications failures, computer viruses, computer denial of service attacks" as well as sabotage and vandalism...

The good news is that today, ZeroNines' Always Available™ CloudNines™ technology can fortify servers against damage or interruption from earthquakes, terrorist attacks, floods, fires, power loss, telecommunications failures, and denial-of-service attacks, as well as sabotage and vandalism. We leave the viruses to others to deal with, but to that list we can add most types of routine maintenance, unplanned maintenance, data migrations, equipment upgrades, software upgrades, and a number of other potential causes of downtime.

Forget Failover

The IT world fatalistically believes that downtime is inevitable, and is something to be lived with and minimized if you’re fortunate. This view predominates because until now the only disaster recovery solution available has been the flawed failover paradigm, which everyone in IT knows can be a disaster unto itself. During a crisis or failover event, cutover can cause additional problems, downtime, and cascading application failures as computing switches from primary to backup systems.

But the IT world has it wrong. Disasters will happen and must be dealt with, but the downtime they cause can be prevented.

Always Available™ Means Virtually 100% Uptime

ZeroNines’ Always Available™ solution eliminates failover and backups, instead providing synchronous identical processing on multiple cloud nodes geographically separated by thousands of miles. If a storm wipes out your East Coast cloud, CloudNines enables processing to continue on clouds in other parts of the country and around the world. If you need to upgrade server software, you can isolate one cloud node, do your upgrade, and bring it back online once it is stable. Our technology has journaling and updating features to assure that all transactions are completed and that any cloud node that goes offline is brought up to the most accurate logical state once it comes back online.
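
To make the journaling idea concrete, here is a minimal sketch. The names and structure are mine, chosen for illustration; this is not the actual CloudNines implementation.

    # Minimal journaling sketch: an ordered record of committed transactions
    # lets a node that was offline replay what it missed and reach the same
    # logical state as its peers.
    journal = []  # ordered record of committed transactions

    class CloudNode:
        def __init__(self, name):
            self.name = name
            self.applied = 0  # how far into the journal this node has gotten

        def catch_up(self):
            while self.applied < len(journal):
                txn = journal[self.applied]
                # ... apply txn to this node's local state here ...
                self.applied += 1

    east, west = CloudNode("east"), CloudNode("west")
    journal.append("txn-1"); east.catch_up(); west.catch_up()
    # east is isolated for a software upgrade while traffic continues:
    journal.append("txn-2"); journal.append("txn-3"); west.catch_up()
    east.catch_up()  # back online: replays txn-2 and txn-3
    assert east.applied == west.applied == len(journal)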

CloudNines can push application availability beyond the industry-accepted standard of 99.999% (five nines) to virtually 100%. In our ongoing test case, the ZeroNines MyFailSafe environment has never experienced any downtime at all, for any reason. It went live in July 2004 and has had individual network nodes knocked offline a number of times by hurricanes, power outages, server migrations, and other causes. All applications experienced full 100% availability throughout.

Will ZeroNines eventually be recognized as a vital cloud-enabling technology? That remains to be seen but you can bet that is how we see ourselves. If you want to find out how we can make the cloud a viable option for you, let me know.

Visit the ZeroNines website to find out more about how our disaster-proof architecture protects businesses of any description from downtime.

Alan Gin – Founder & CEO, ZeroNines

October 15, 2009

Sidekick: Two Disasters for the Price of One Really Big One

Already being called one of the largest data failures in recent memory, October’s Sidekick disaster was actually two disasters rolled into one. First, the cloud-based service suffered an outage which stranded thousands of users. Second, the backup/storage system failed and erased the personal data of thousands of users. Every failure like this leads to a round of hand-wringing over the cloud, and this one is no different. It underscores the need for a far more robust cloud architecture, where a failure in one area is truly isolated from the rest of the system and can’t cause an outage.

Background: Sidekick and the Cloud

The Sidekick mobile device is developed by Microsoft subsidiary Danger and is sold and serviced by T-Mobile. It holds a special place in the hearts and hands of a select group of users because its QWERTY keyboard promises ease of use and its cloud-based data storage gives it the appearance of real go-anywhere, do-it-anytime utility. Unlike other hand-helds such as the iPhone and BlackBerry, the Sidekick backs up personal data to cloud-based storage at Microsoft and not to your computer’s hard drive. And there’s the seed of the trouble.

The Problem: Hardware Failure Leads to Database Failure

It seems that beginning at about 1:30 AM on Friday, October 2 [source], a “hardware failure… took out both the primary and backup copies of the database that contained Sidekick users' information.” [source] This apparently occurred during an upgrade to the Danger/Microsoft Storage Area Network [source]. When they discovered their Sidekicks weren’t working, many users reset their devices (some under instructions from T-Mobile customer service), which wiped the devices’ local storage. Combined with the back-end server failure, this led to apparently permanent data loss for anyone who reset their Sidekick.

The Cost to T-Mobile and Microsoft

This is going to cost millions. At least. T-Mobile halted sales of all Sidekicks shortly after the event and is compensating its affected users with a period of free data service [source]. There were the usual rants about users refusing to continue paying on their contracts, and news that T-Mobile was voluntarily letting anyone out of their contract who wanted out [source]. Lawsuits were filed [source]. Sarcasm and criticism run thick online. Whatever the actual facts, this is a marketing disaster of the greatest degree for T-Mobile and Microsoft. There is no way to calculate how many of the approximately 800,000 existing Sidekick customers [source] will jump ship, how many potential new customers will be lost, and what this means for Microsoft’s “Pink” project, the intended follow-on to the Sidekick [source].

The Solution: A Robust Cloud

ZeroNines’ CloudNines™ product enables the cloud to function as it is supposed to, by processing every transaction simultaneously and equally on multiple cloud-based network nodes in an Always Available™ configuration. In the Sidekick disaster, CloudNines would simply have cut off the node with the hardware failure. All processing would have continued on other geographically separated nodes that were running identical active instances of the affected applications and databases. The failure would have been contained. There would have been no service downtime, and no need for ill-advised attempts to reboot individual Sidekicks.
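
Here is a minimal sketch of that idea, with hypothetical node names: every write lands on all live, geographically separated nodes at once, so cutting off the failed node costs neither service nor data.

    # Every write is applied on all live nodes simultaneously, so losing
    # one node never loses the data. Hypothetical names, illustrative only.
    class Node:
        def __init__(self, name):
            self.name, self.online, self.data = name, True, {}

    nodes = [Node("seattle"), Node("virginia"), Node("dublin")]

    def write(user, record):
        live = [n for n in nodes if n.online]
        for node in live:
            node.data[user] = record  # same write, every live node
        return [node.name for node in live]

    write("user-42", "contacts-v1")
    nodes[0].online = False           # the SAN upgrade goes wrong...
    write("user-42", "contacts-v2")
    print(nodes[1].data)              # ...but the data is still live elsewhere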

Not only would the Sidekick applications have continued operation, but the databases would too. There would have been no apparent loss of customer data. After the event, one author bitingly asked “But the question remains, why wasn't there a true independent backup of the data?” [source]. ZeroNines and Always Available technology would have made this a moot point.

As of this writing, T-Mobile and Microsoft have announced that they “have recovered most, if not all, customer data” [source]. I can’t help but breathe a sigh of relief for them even though I am not a Sidekicker myself. But wouldn’t it have been far better to have avoided the problem in the first place?

Visit the ZeroNines website to find out more about how our disaster-proof architecture protects businesses of any description from downtime.

Alan Gin – Founder & CEO, ZeroNines

September 12, 2009

Gmail Maintenance Leads to Router Overload

It is often the mundane problems that cause the most trouble. The two-hour Gmail outage on Tuesday, September 1, 2009, had a fairly unspectacular cause, but its effects are shaking a tech giant’s plans and causing some commentators to wring their hands over the acceptance of SaaS offerings in general.

Whatever its effects on the industry, this is one of several outages in the past year which are harming Google’s efforts to sell its email services as a corporate tool. At the very least it cost them a lot of money. Fortunately, such outages are avoidable.

Background: Google Hopes for Significant Gmail Revenue

Gmail is Google’s free email app, and is used worldwide by millions of people. Gmail also has paid services, and Google is trying to build it up into a corporate app that can generate significant revenue. Analysts and customers alike have been watching it closely over the years to see if it really can grow into a reliable corporate power tool, but have been disappointed by a number of recent outages.

The Problem: A Classic Cascade Failure

The September 1 outage “was caused by a classic cascade in which servers became overwhelmed with traffic in rapid succession” [source]. Google had taken several Gmail servers offline for maintenance. Recent changes to the routers were intended to increase routing efficiency, but instead caused some of them to become overloaded. Traffic got shunted to an ever-smaller pool of available routers until the system collapsed.
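
A toy simulation shows how quickly such a cascade runs to completion once the pool is too small for the load. All numbers here are invented for illustration; they are not Google’s.

    # Toy cascade: traffic is spread over a shrinking pool of routers, and
    # each overload removes another router until none remain.
    traffic = 1000     # requests per second, arbitrary units
    capacity = 150     # load at which a single router becomes overloaded
    routers = 6        # pool already reduced by the maintenance window

    while routers > 0:
        load = traffic / routers
        print(f"{routers} routers, {load:.0f} req/s each")
        if load <= capacity:
            print("pool is stable")
            break
        routers -= 1   # the most loaded router overloads and drops out
    else:
        print("no routers left: total outage")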

The Cost to Google

Google wants to get more customers onto its paid Gmail service. The outage reinforces the image of Gmail as not stable enough for business use and makes it harder to persuade corporate users to actually pay for it.

By way of compensation, Google “…added three days to year-long subscriptions to its corporate Google Apps email service, which costs $50 per-user-per-year.” [source] This equates to approximately $50 million. Unfortunately, users would rather have uptime than compensation, and Google got a lot of bad publicity which will make it harder to get business users to switch from other offerings. [source]

The Solution: A Network that Can Absorb Failures

“Google said it would focus on making sure that the request routers have sufficient headroom to handle future spikes in demand, as well as figuring out a way to make sure that problems in one sector can be isolated without bringing down the entire service.” [source]

Isolation of problem servers or nodes is a core function of ZeroNines’ Always Available™ technology. Had Google been using Always Available, it could have tested the new router/server configuration in isolation while the rest of the network continued to operate as usual. The new configuration could have been rolled out one server at a time without interrupting service. If one or more of the newly configured routers became unstable, that failure would have been confined to just that sector, and the rest of the Gmail network could have continued processing in its usual way.
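
In rough pseudocode, the rollout pattern looks something like the sketch below. Every helper here is a hypothetical stand-in for a real orchestration call; it is not Google’s or ZeroNines’ actual tooling.

    # Rolling upgrade: isolate one node, upgrade and verify it in isolation,
    # and only then move on. Any failure stays confined to the isolated node.
    def isolate(node):  print(f"isolating {node}")     # peers absorb its traffic
    def rejoin(node):   print(f"rejoining {node}")     # node catches up, resumes
    def rollback(node): print(f"rolling back {node}")  # restore previous config

    def rolling_upgrade(nodes, upgrade, healthy):
        for node in nodes:
            isolate(node)
            upgrade(node)
            if not healthy(node):
                rollback(node)
                rejoin(node)
                raise RuntimeError(f"upgrade failed on {node}; rollout halted")
            rejoin(node)  # the live pool never saw the risky change

    rolling_upgrade(["router-1", "router-2", "router-3"],
                    upgrade=lambda n: print(f"upgrading {n}"),
                    healthy=lambda n: True)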

Visit the ZeroNines website to find out more about how our disaster-proof architecture protects businesses of any description from downtime.

Alan Gin – Founder & CEO, ZeroNines

August 10, 2009

PayPal™ Network Equipment Failure Strands Consumers & Merchants

“How can it be that a single piece of network hardware brings down a business critical part of a network? It used to be that all financial institutions ensured that they remained available at all times regardless of cost. Is it possible that the current crop of engineers don’t make this a must have feature of their designs?” (source)

These tough questions were posted by a user on the PayPal™ blog after a network equipment failure took down the popular payment service for a full hour worldwide on August 3, 2009 (source). Today I’ll take a quick look at this outage and offer a solution for preventing similar disasters.

Background: PayPal and its Impact

PayPal is one of those breakthrough solutions that has enabled e-commerce to take off as it has. Its low fraud rate and ease of use may make it the prototype for apps that could replace credit card accounts as the preferred means of making online payments. By way of illustrating PayPal’s importance, here are some stats from the company’s media site:

  • PayPal's net Total Payment Volume for 2008, the total value of transactions, was $60 billion.
  • PayPal has 73 million active registered accounts (184 million total accounts).
  • PayPal supports payments in 19 currencies.
  • PayPal's revenues now represent 32% of eBay Inc. companywide revenues.

The Problem: Network Equipment Failure

According to an August 3 announcement by PayPal’s SVP Technology…

“At around 10:30 am PT Monday, a network hardware failure resulted in a service interruption for all PayPal users worldwide. Everyone in our organization focused immediately on identifying the issue and getting PayPal up and running again. We accomplished that in about an hour. By approximately 3 pm PT, full service was restored across our platform." (source)

At the rate PayPal transacts, one hour of downtime means about $7,000,000 in lost or delayed transactions (source). Some comments from that and other blogs nicely illustrate the downstream effects:

"We have been down for the better part of the day. We are still down. I am a very unhappy customer. This failure has cost me thousands of dollars." (source)

"I am beginning to question my use of the product considering there does not appear to be a high availability solution in place." (source)

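That $7,000,000-per-hour figure is easy to sanity-check against PayPal’s published 2008 numbers from the list above. The calculation assumes transaction volume is spread evenly around the clock, which it is not, so treat it as an order-of-magnitude estimate.

    # Rough check of the ~$7M/hour figure using the $60 billion 2008 Total
    # Payment Volume cited above. Assumes volume is spread evenly over the
    # year, so this is an order-of-magnitude estimate only.
    tpv_2008 = 60e9            # dollars of payment volume per year
    hours_per_year = 365 * 24
    print(f"${tpv_2008 / hours_per_year:,.0f} per hour")  # about $6.8 million
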
The Solution: Remove the Single Point of Failure

ZeroNines’ Always Available™ technology is the high-availability solution this PayPal customer was wishing for. It could have prevented last week’s downtime event because it processes network transactions synchronously across multiple data centers, clouds, or virtualized environments, through multiple network paths. There is no hierarchy and no single point of failure. In case of an equipment failure, power outage, application crash, storm, or other catastrophe in one area, processing would simply continue via the other network nodes, switches, and data centers. The business disaster does not occur, because users never lose access to the apps, data, and services they need.

Always Available would also have enabled automatic update of the errant server once it was brought back online, preventing PayPal from having to use “everyone in their organization” just to get things going again. Of course the IT department still needs to replace the failed equipment, but that can be done in isolation as the rest of the business carries on as usual.

Visit the ZeroNines website to find out more about how our disaster-proof architecture protects businesses of any description from downtime.

Alan Gin – Founder & CEO, ZeroNines

July 8, 2009

Seattle Data Center Fire Unnecessarily Shuts Down Businesses and Online Services

Cascade failure. If you’re in IT, that’s a particularly frightening term. In the case of last week’s Seattle data center fire, the term is especially appropriate, since it was literally a cascade of water that wrecked everything and knocked a number of businesses and online services offline. Here’s a look at this disaster and a way it could have been prevented.

Background: Fisher Plaza, a Major Hosting Facility

Fisher Plaza is “a self-styled carrier hotel in Seattle, and home to multiple datacenter and colocation providers.” [source] A partial list of organizations hosted there includes payment service provider Authorize.net (which itself has 238,000 merchant customers), the Port of Seattle’s email system, Swedish Hospital’s internal IT systems, the Pacific Science Center website, the geocaching.com website, major TV and radio station KOMO, the online Facebook game Bejeweled Blitz, and dozens of other businesses [source].

The Problem: Fire Leads to Cascade Failure

Early on Friday morning, July 3, 2009, Fisher Plaza’s main generator/transfer switch failed. This caused an overload. This caused a fire. This triggered the fire suppression system and brought firefighters to the scene, both of which shot water into the generator room. The generators stopped, and we deduce that power from the grid was shut off too. The UPS and the cooling system also failed. Temperatures in the facility rose high enough to wreck some servers and destroy data [source].

Think about the downstream effects. 238,000 merchants potentially have their transactions interrupted or lost because Authorize.net’s servers are forced offline. One can only hope they had their own functioning backup plan. A hospital’s IT system became unavailable; I have no information on what impact this had on patient care. And apparently KOMO had to transmit from a mobile unit in their parking lot [source]. It is not hard to imagine the impact to these and other organizations.

The Solution: Fire- and Flood-Proof Hosting

No, ZeroNines does not wrap servers in asbestos. There is no way to know what bizarre little accident will happen next, so preventing every failure is impossible. Some failures will trigger chain reactions that become major IT disasters.

What we do is prevent a catastrophe in one place from knocking out a business everyplace. In this case, if any of the clients or tenants at Fisher Plaza had been using our technology, their data, transactions, apps, and other assets would all have been processing simultaneously and in perfect replication in other data centers hundreds or thousands of miles away.

This is not a cutover scenario. Processing would not have “switched” from Seattle to elsewhere. It simply would have stopped in Seattle and continued in real time in San Jose, or Denver, or Singapore, or wherever else they placed their data centers. There would be no loss of business continuity. Their businesses would not have gone down, and the real disaster – lost connectivity, productivity, and revenue – would not have taken place.

Visit the ZeroNines website to find out more about how our disaster-proof architecture protects businesses of any description from downtime.

Alan Gin – Founder & CEO, ZeroNines

June 30, 2009

Uptime and the Cloud Crowd at CSIA

A few days ago, Jake Smith of Intel and I presented at the Colorado Software Industry Association (CSIA) monthly meeting in Denver (source). We talked about cloud computing and the elements that will determine its rate of adoption: the needs of businesses, their expectations of cloud performance, and the real-world limitations of the cloud that are currently stalling its adoption. The biggest issue is reliability, and I introduced ZeroNines’ technology as a potential solution. It was a great crowd, and their hunger for a reliable cloud was obvious.

Businesses need their applications and data to be available all the time. So far, clouds and cloud providers have not succeeded in proving that they can actually offer that. The industry needs to overcome the cloud’s downtime problems before serious business can be done on it. I believe the Big Three (Amazon, Azure, and Google) will refocus their efforts on providing highly available cloud infrastructures and market this capability accordingly.

The Cause is Academic

Of course every network is subject to threats and failures that can cause downtime, and there’s no getting away from that. It doesn’t take an earthquake to knock vital networked apps offline; some recent high-profile cloud provider outages have shown that all it takes is a failed OS upgrade. New and unexpected problems crop up every day. But the cause of an outage is really only academic for the business relying on the cloud. Service should simply continue because the business needs it to.

The scary thing is that the current disaster recovery paradigm (failover) is insufficient for protecting businesses when these things happen, and can’t be relied upon to prevent downtime or even to deliver a speedy recovery. In addition, there is an increase in catastrophic risk from poorly architected virtualized environments, most notably from server consolidation, which is a core technology of the cloud.

The Solution is Continuity

At the CSIA meeting, we introduced the crowd to our Always Available™ technology, which maintains cloud continuity by synchronizing and protecting multiple private, public or hybrid clouds. It can mix cloud computing and physical hosting via datacenters hundreds or thousands of miles apart. The distance prevents any single regional disaster from damaging more than one data center. There is no server hierarchy, so all transactions run simultaneously and equally on all cloud and server nodes. Best of all, they update each other constantly in real time so if one goes down the others simply continue processing with no interruption to service.

To protect against an outage during an upgrade, I would postulate the following solution: Isolate one cloud or network node in an Always Available configuration and do your upgrade there, while the other nodes manage the clients’ transactions. Test the upgrade and slowly roll it out to the other nodes. If things start to go haywire, isolate the misbehaving node, solve your problems, and start the rollout again. There would be no need to risk the entire service on an untested upgrade.

Always Available works for cloud customers as well as service providers. It is provider- and platform-agnostic, so you can mix and match all you need to.

Visit the ZeroNines website to find out more about how our disaster-proof architecture protects businesses of any description from downtime.

Alan Gin – Founder & CEO, ZeroNines

January 12, 2009

Hurricane Charley Couldn’t Stop the Email

In this first Disaster Litany Posting, I look at a sequence of events that is near and dear to ZeroNines. Our own real-world experience with a hurricane, power outages, and an email system will show just how our downtime-preventing technology works, and serve as a pattern for the solutions we suggest for other disasters.

Background: MyFailSafe™ Email System

ZeroNines offers Always Available technology that can virtually eliminate downtime among networked applications, data, and other assets. To test our technology, we created the MyFailSafe Email Service and launched it on our Always Available network in July of 2004. This was specifically intended to test Always Available in the real world, by running MyFailSafe just like any other email service is run, with real customers and real traffic, and subject to the same threats that any other network or email system is vulnerable to.

The Problem: A Hurricane

All readers who remember Hurricane Charley please raise your hands… For those of you who don’t, Charley hit Florida on August 13, 2004. According to Wikipedia, it killed about thirty people and caused $15 billion in damage. Widespread flooding, wind damage, power outages, and other problems crippled much of the state for several days. I don’t have statistics on downtime among private business networks or service providers, but it’s a safe bet that it was serious.

Charley hit about a month after we launched MyFailSafe. It caused electrical grid fluctuations that drained the Orlando local exchange carrier battery backup systems, isolating the Orlando node of the ZeroNines Always Available infrastructure. Our own battery system prevailed and still had a 75% charge when commercial power was reliably restored, but the site could not communicate for 16 hours because of LEC downtime.

The Solution: Hurricane-Proof Architecture

During these 16 hours, when our Orlando node was effectively offline, the MyFailSafe email service did not experience any downtime at all. Any user whose power was still on and whose desk was not under water experienced true 100% uptime throughout, whether they were in Florida, Colorado, Canada, Asia, or anywhere else.

How? Our Always Available deployment has additional nodes and data centers in Colorado and California. All applications, transactions, data exchanges, and other network activities run equally and simultaneously on these multiple secure application servers, geographically separated by hundreds of miles. In IT parlance, all servers are hot, and all instances of all applications are active. There is no server hierarchy, and consequently no single point of failure. When the Orlando node fell silent, all MyFailSafe processing continued uninterrupted on the others. There was no need for failover or recovery because these other nodes were far from the storm, they never went down, and continuity was maintained.

Since activation on July 15, 2004, the MyFailSafe network has never experienced any downtime for any reason, including this and other hurricanes, two migrations from server colocation providers to clouds, a data center move, and an email worm attack that interrupted email service from AOL and other major providers. These potential disasters, which forced our servers offline, had no power to bring our applications down. All applications and information retained 100% availability throughout.

Contact ZeroNines to find out more about how our disaster-proof architecture protects businesses of any description from downtime.

Alan Gin – Founder & CEO, ZeroNines

January 3, 2009

A Litany of Disasters: Downtime Events and How to Avoid Them

“Aviation in itself is not inherently dangerous. But to an even greater degree than the sea, it is terribly unforgiving of any carelessness, incapacity, or neglect."
-- Anonymous

Years ago, I saw those words on a poster of a World War One aircraft stuck about 20 feet off the ground in the limbs of a tree. If we were to update this and adapt it to the business user’s desktop, it would lose its poetic charm but strike home with a whole new audience:

“Networked assets in themselves are not inherently dangerous. But to an even greater degree than stuff on your hard drive, they are terribly unforgiving of any carelessness, incapacity, or neglect."

The warning is clear: Disaster may be only inches away, particularly for the unprepared. It’s a lot harder to recover after some accident knocks out a hundred or a thousand users than it is to re-boot your own machine.

In this blog, we will be looking at some actual disasters that have struck organizations when their networks have taken a hit from storms, fires, attacks, and far more mundane threats like human error and equipment failure.

For a business, there may be little correlation between the physical effects of a disaster and its financial impact. Imagine a business dependent upon a distant data center in the U.S. Tornado Belt. One good storm could leave their personnel and property untouched, yet destroy their ability to do business by wiping out their data, applications, and transactions. Elsewhere, an earthquake could cause deplorable loss of life and property damage, yet leave a business relatively unharmed if its networked computing capabilities remain intact. And an otherwise strong corporation could suffer irreparable damage by something as quiet as a software failure or equipment malfunction, which to the outside world does not qualify as a “disaster” at all.

I’ll be describing some instances where ZeroNines’ solutions for networks, virtualized environments, and clouds could have prevented disastrous downtime, and helped avoid unwanted headlines and losses to productivity, reputation, and revenue. Our approach does not use any kind of failover or cutover, since those occur after the downtime event and are not true disaster prevention. After all, it’s far better to avoid the downtime in the first place than to try to recover from it afterward.

Next week: How MyFailSafe really did provide fail-safe email during Hurricane Charley.

Contact ZeroNines to find out more about how our disaster-proof architecture protects businesses of any description from downtime.

Alan Gin – Founder & CEO, ZeroNines