November 23, 2009

Fixing The FAA’s Single Point of Failure

“The difficulties started when a single circuit board in a piece of networking equipment at a computer center in Salt Lake City failed around 5 a.m…” [source]

All too often it seems that the biggest problems are caused by the smallest failures. This blog is full of posts about how generator transfer switches, router programming changes, and problematic network hardware can bring businesses to their knees. Now a single circuit board failure causes havoc among airlines, airports, and air travelers.

I stand by my earlier assertion: it is not possible to guarantee application and data uptime by eliminating every possible source of failure. The more complex a system gets, the more likely it is that some part will fail, and it is impossible to identify every potential failure point in advance. But there is a way to prevent these little disasters from becoming big ones.

Background: The Flight Plan Management System

The failed FAA computer system was the National Airspace Data Interchange Network [source], which manages flight plans and ground traffic. The Salt Lake City facility is one of two nationwide computer centers that collect flight plans; the other is in Atlanta. This was the third time since June 2007 that the system had failed [source].

The Problem: Hardware Failure Blocks Access

When the circuit board failed on November 19, 2009, access to data and communications was blocked, making flight plans filed by airlines inaccessible [source]. Air traffic controllers had to enter flight plans manually in several parts of the U.S. The problem was fixed about five hours later.

The Cost: Mostly to the Airlines

Because the FAA is a government agency, its direct financial loss is hard to estimate. The cost to airlines, however, must be considerable, since many flights were canceled or delayed. Airline stocks were down that day, whether or not the computer failure was the cause, and they were still down after the problem had been fixed [source]. Our poor beleaguered airlines can’t help but suffer when something like this happens. And of course individual travelers, such as myself, bear the brunt too in the form of delays, costlier alternative travel, and unplanned hotel stays, not to mention missed business meetings, which can cost a business a lot more than a replacement airline ticket. The domino effect of airline delays is a disaster unto itself.

The Solution: Sidestep the Single Point of Failure

Doug Church, a spokesman for the National Air Traffic Controllers Association, said…"We think it's a single-point failure that occurred somewhere in the system," he said. "One single glitch was able to shut down the entire system." [source]

This is perhaps the scariest statement about the whole affair. The simple fact that the system went dark shows that its backup systems also failed. This is not surprising; most disaster recovery systems use the “failover” or “cutover” technique, which is outdated and unreliable and can lead to cascading failures and increased downtime. Such occurrences are frighteningly common.

At ZeroNines we propose a different approach. Instead of trying to catch a downtime event with a failover recovery, like a ninja trying to catch an arrow, we simply double up all the processing in multiple data centers around the country or around the world. Each data center processes the same transactions at the same time, so if “a single circuit board in a piece of networking equipment at a computer center in Salt Lake City” fails, the additional networking equipment in Atlanta or Omaha or Dusseldorf or wherever keeps on processing.
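To make the idea concrete, here is a minimal sketch of processing every transaction at every site. It is an illustration only, not the ZeroNines implementation: the `Node` class, the `submit` function, and the flight plan IDs are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One data center that processes every transaction (hypothetical model)."""
    name: str
    up: bool = True
    processed: list = field(default_factory=list)

    def process(self, txn: str) -> str:
        if not self.up:
            raise ConnectionError(f"{self.name} is unreachable")
        self.processed.append(txn)
        return f"{txn} accepted"


def submit(nodes, txn):
    """Send txn to every node; succeed if at least one node handles it."""
    results = []
    for node in nodes:
        try:
            results.append(node.process(txn))
        except ConnectionError:
            continue  # a failed node does not block the others
    if not results:
        raise RuntimeError("all nodes are down")
    return results[0]


nodes = [Node("Salt Lake City"), Node("Atlanta")]
submit(nodes, "FP-001")            # both sites process the flight plan
nodes[0].up = False                # the circuit board fails
answer = submit(nodes, "FP-002")   # Atlanta keeps processing
```

Nothing has to detect the failure and switch over. The surviving site was already doing the work, so the caller still gets an answer.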

The likelihood of application or data downtime, where users lose access to the tools and information they need to do their jobs, drops to virtually zero because the chance of all data centers, or clouds, or virtual environments failing simultaneously is statistically negligible. In this instance, had the National Airspace Data Interchange Network been protected by our Always Available™ technology, the Atlanta network node would simply have continued processing while Salt Lake City was repaired and brought back online. Then the system would have automatically updated Salt Lake with all the transactions that had occurred in its absence.
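A rough sketch of both points, the arithmetic and the catch-up, might look like the following. It rests on loud assumptions: site failures are treated as independent, the 0.001 downtime figures are made up for illustration, and the repaired site's log is assumed to be a prefix of the healthy site's log. The function names are hypothetical and do not describe the actual product.

```python
import math


def all_down_probability(per_site_downtime):
    """Chance that every site is down at once, assuming independent failures."""
    return math.prod(per_site_downtime)


def resync(healthy_log, stale_log):
    """Append to stale_log the transactions it missed, in original order.

    Assumes stale_log is a prefix of healthy_log (illustrative simplification).
    """
    missed = healthy_log[len(stale_log):]
    stale_log.extend(missed)
    return missed


# Illustrative figures only: each site unavailable 0.1% of the time.
p = all_down_probability([0.001, 0.001])

atlanta = ["FP-001", "FP-002", "FP-003"]   # kept processing throughout
salt_lake = ["FP-001"]                     # went dark after the first plan
missed = resync(atlanta, salt_lake)        # replay what Salt Lake missed
```

Under these assumptions, two sites that are each down one time in a thousand are both down about one time in a million, and the repaired site ends up with the same transaction history as the one that never stopped.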

Visit the ZeroNines website to find out more about how our disaster-proof architecture can protect businesses (and government agencies) of any description from downtime.

Alan Gin – Founder & CEO, ZeroNines