March 26, 2012

BATS: “When the Exchange Blows Itself Up, That’s Not a Good Thing”

I can imagine scenarios more spectacular than the BATS failure and IPO withdrawal on Friday, but most of them involve alien invasion, chupacabras, or the expiration of the Mayan calendar.

Every once in a while a technology failure casts such a light (or perhaps a shadow) on how we do things that people actually stop and wonder aloud if the entire paradigm is too flawed to succeed. I don't mean that they look at alternative solutions like they do every time RIM goes down. They actually begin to question the soundness of the industry sector itself and the technological and market assumptions on which it is founded. Take for example this quote about BATS from a veteran equities professional:

"This tells you the system we’ve created over the last 15 years has holes in it, and this is one example of a failure,” Joseph Saluzzi, partner and co-head of equity trading at Themis Trading LLC in Chatham, New Jersey, said in a phone interview. “When the exchange blows itself up, that’s not a good thing.” The malfunctions will refocus scrutiny on market structure in the U.S., where two decades of government regulation have broken the grip of the biggest exchanges and left trading fragmented over as many as 50 venues. BATS, whose name stands for Better Alternative Trading System, expanded in tandem with the automated firms that now dominate the buying and selling of American equities. The withdrawal also raises questions about the reliability of venues formed as competitors to the New York Stock Exchange and Nasdaq Stock Market since the 1990s. [source]

When High Frequency Trading Crashes

BATS was initially built to service brokers and high frequency trading firms [source]. Their Friday crash mirrors a scenario I speculated about in this blog back in May 2010 in regard to an 80-minute TD Ameritrade outage:

Some banks, hedge funds, and other high-power financial firms engaged in High Frequency Trading (HFT) make billions of trades a day over ultra-high speed connections [source]. Many trades live for only a few seconds. Enormous transactions are conceived and executed in half a second, with computers evaluating the latest news and acting on it well before human traders even know what the news is. HFT is having a significant effect on markets; there is evidence that the history-making “Flash Crash” of May 6 2010 was caused and then largely corrected by High Frequency Trading [source]. What would happen if one of these HFT systems was down for an hour and a half? Or even just a minute? [source]

To answer my own question, apparently IPOs can be cancelled, a faulty 100-share trade can halt the trading of a leading stock, and regulators will jump like you stuck them with a pin.

About BATS

BATS Global Markets, Inc. (http://www.batstrading.com) describes itself in its prospectus as primarily a technology company [source]. The BATS exchanges represent an enormous 11 to 12 percent of daily market volume [source]. BATS is backed by Citigroup Inc, Morgan Stanley, Credit Suisse Group [source] and several other large financial firms. BATS planned to go public on Friday March 23, 2012 by selling their own shares on their own exchange that runs on their own technology.

What Happened

To my knowledge, BATS has not issued a definitive statement on the exact technological problem. Here is the best summary I could come up with:

A BATS computer that matches orders in companies with ticker symbols in the A-BFZZZ range "encountered a software bug related to IPO auctions" at 10:45 a.m. New York time [source].

This prevented the exchange from trading its own shares. Then…

A single trade for 100 shares executed on a BATS venue at 10:57 a.m. briefly sent Apple down more than 9 percent to $542.80 [source].

Apple's ticker is AAPL so it was within the group affected by the bug. By the time it was over, BATS shares had dropped from about $16.00 to zero and they withdrew their IPO. All trades in the IPO, Apple, and other affected equities were canceled.

Service Uptime: 99.998% Won't Cut It.

BATS claims it experienced very low downtime in 2011: 99.94% for its BZX exchange and 99.998% for its BYX exchange [source]. This equates to about 5.25 hours of downtime for BZX and about 10.5 minutes of downtime for BYX.

Though admirable for most online services, my first thought is that for a High Frequency Trading system this could represent billions of dollars in lost trades. I'd like to hope that all this downtime occurred when the markets were closed.

But even short outages of seconds or even fractions of a second can cause cascading application and database failures that corrupt the system and cause problems for hours or days.

Speculation: How it Could Have Been Prevented

This was not an all-out downtime event like I prophesied back in May 2010 and I don't think that ZeroNines technology could have prevented it. However, a variety of other real-world failures can be just as disastrous, particularly when one considers the shortcomings of failover-based disaster recovery, questionable cloud reliability, and the fact that switches and other hardware will eventually fail.

Allow me to postulate how this might happen one day on another financial firm's systems.

Scenario 1
The software fails because of a corrupted instance of the trading application on one server node. In an Always Available™ configuration that one node could have been shut down, allowing other instances of the application to handle all transactions as normal.

Scenario 2
A network switch fails, leading to extreme latency in some transaction responses. In an Always Available configuration, all transactions are processed simultaneously in multiple places, so the slow responses from the failed node would be ignored in favor of fast responses from the functioning nodes. The latency would not have been noticed by the traders or the market.

Scenario 3
An entire cloud instance has an outage lasting just a few seconds, well within their service level promises. During failover to another virtual server, one application is corrupted and begins sending false information. In an Always Available configuration there is no failover and no failover-induced corruption. Instead, other instances of the application process all the transactions while one is down, and then update the failed instance once it comes back online.

These scenarios are simplifications but you get the point: problems will always occur, but if they are within an Always Available configuration the nodes that continue to function will cover for those that don't.

Playing Nice for the SEC

“When you’re actually the exchange that is coming public and the platform that you touted as being very robust and competitive with any other global exchange fails to work, it’s a real black eye,” Peter Kovalski, a money manager who expected to receive shares in Bats for Alpine Mutual Funds in Purchase, New York [source].

With all respect to Mr. Kovalski, this is far more than a black eye. It's a business disaster. BATS is on the ropes, and I suspect their underwriters are hurting too. Trading opportunities were lost, many permanently. The SEC is already investigating automated and High Frequency Trading [source], and I'd guess it will be far more cautious about innovation among exchanges, trading platforms, and other systems.

ZeroNines offers one way to keep the wolves at bay: our Always Available technology can enable uptime in excess of 99.999% (five nines) among applications and data within traditional hosting environments, virtualized environments, and the cloud. We can't fix your software bugs, but we can prevent downtime caused by other things.

Visit the ZeroNines website to find out more about how our disaster-proof architecture protects businesses of any description from downtime.

Alan Gin – Founder & CEO, ZeroNines