September 12, 2009

Gmail Maintenance Leads to Router Overload

It is often the mundane problems that cause the most trouble. The two-hour Gmail outage on Tuesday, September 1, 2009, had a fairly unspectacular cause, but its effects are shaking a tech giant’s plans and causing some commentators to wring their hands over the acceptance of SaaS offerings in general.

Whatever its effects on the industry, this is one of several outages in the past year that are harming Google’s efforts to sell its email services as a corporate tool. At the very least, it cost Google a lot of money. Fortunately, such outages are avoidable.

Background: Google Hopes for Significant Gmail Revenue

Gmail is Google’s free email app, and is used worldwide by millions of people. Gmail also has paid services, and Google is trying to build it up into a corporate app that can generate significant revenue. Analysts and customers alike have been watching it closely over the years to see if it really can grow into a reliable corporate power tool, but have been disappointed by a number of recent outages.

The Problem: A Classic Cascade Failure

The September 1 problem “was caused by a classic cascade in which servers became overwhelmed with traffic in rapid succession” [source]. Google had taken several Gmail servers offline for maintenance. Recent changes to routers were intended to increase routing efficiency, but instead caused some routers to become overloaded. Traffic was shunted to a shrinking pool of available routers until the system collapsed.
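
The cascade mechanism can be illustrated with a toy model. The sketch below is not Google's routing logic. It assumes, purely for illustration, that traffic is split evenly across the routers still running and that a router drops out as soon as its share exceeds its capacity. The function name, capacities, and load figures are all hypothetical.

```python
def simulate_cascade(capacities, total_load):
    """Toy cascade model: load is split evenly across live routers,
    and any router whose share exceeds its capacity fails.

    Returns (failure_order, surviving_router_indexes).
    """
    alive = dict(enumerate(capacities))  # router index -> capacity
    failed = []
    while alive:
        share = total_load / len(alive)
        overloaded = [r for r, cap in alive.items() if share > cap]
        if not overloaded:
            break  # every remaining router can carry its share
        for r in overloaded:
            del alive[r]
            failed.append(r)
        # The survivors now split the same load among fewer routers.
    return failed, sorted(alive)


# Five routers carry the load with nothing to spare.
full_pool = [100, 110, 130, 170, 200]
print(simulate_cascade(full_pool, 500))

# Take the largest router offline for maintenance: the same load
# now knocks out the remaining routers in successive rounds.
print(simulate_cascade(full_pool[:-1], 500))
```

In this model the full pool survives, but removing a single router for maintenance pushes the weakest routers over their limit, and each failure raises the share carried by the rest until none are left.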

The Cost to Google

Google wants to get more customers onto its paid Gmail service. The outage reinforces the image of Gmail as not stable enough for business use and makes it harder to persuade corporate users to actually pay for it.

By way of compensation, Google “…added three days to year-long subscriptions to its corporate Google Apps email service, which costs $50 per-user-per-year.” [source] This equates to approximately $50 million. Unfortunately, users would rather have uptime than compensation, and Google received a lot of bad publicity that will make it harder to get business users to switch from other offerings. [source]
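
For readers who want to work through the compensation arithmetic, here is a minimal sketch. It uses only the two figures quoted above ($50 per user per year, three extra days) and assumes a 365-day subscription year. The total depends on the number of paid seats, which is not stated here, so the seat count in the example is a hypothetical placeholder.

```python
def credit_per_user(annual_price=50.0, extra_days=3, days_in_year=365):
    """Dollar value of the extra subscription days for one user."""
    return annual_price * extra_days / days_in_year


def total_credit(paid_seats, annual_price=50.0, extra_days=3):
    """Total value of the credit across a given number of paid seats."""
    return paid_seats * credit_per_user(annual_price, extra_days)


# Value of three extra days on a $50-per-year subscription.
print(round(credit_per_user(), 2))

# Total for a hypothetical 1,000 paid seats (placeholder, not a real figure).
print(round(total_credit(1000), 2))
```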

The Solution: A Network that Can Absorb Failures

“Google said it would focus on making sure that the request routers have sufficient headroom to handle future spikes in demand, as well as figuring out a way to make sure that problems in one sector can be isolated without bringing down the entire service.” [source]

Isolation of problem servers or nodes is a core function of ZeroNines’ Always Available™ technology. If Google had been using Always Available, it could have tested the new router/server configuration in isolation while the rest of the network was left to operate in the usual way. The new configuration could have been rolled out one server at a time without interrupting service. If one or more of the newly configured routers had become unstable, that failure would have been confined to just that sector and the rest of the Gmail network could have continued processing in its usual way.
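
The general idea of a one-node-at-a-time rollout with isolation can be sketched as follows. This is a generic illustration, not ZeroNines' Always Available implementation or anything Google runs. The function name, the node names, and the `is_healthy` check are hypothetical stand-ins for whatever health test an operator would actually use.

```python
from typing import Callable, Dict, List


def staged_rollout(nodes: List[str],
                   is_healthy: Callable[[str], bool]) -> Dict[str, List[str]]:
    """Apply a new configuration to one node at a time.

    If a node fails its health check after the change, it is isolated
    and the rollout stops, leaving the remaining nodes on the old
    configuration so they keep serving traffic.
    """
    updated: List[str] = []
    isolated: List[str] = []
    remaining = list(nodes)
    while remaining:
        node = remaining.pop(0)
        if is_healthy(node):
            updated.append(node)
        else:
            isolated.append(node)  # confine the failure to this node
            break                  # halt the rollout; the rest stay untouched
    return {"updated": updated, "isolated": isolated, "unchanged": remaining}


# Hypothetical run: the second router misbehaves under the new configuration.
result = staged_rollout(["r1", "r2", "r3", "r4"], lambda node: node != "r2")
print(result)
```

In this run only one router is taken out of service, and the two routers that were never touched continue on the old configuration.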

Visit the ZeroNines website to find out more about how our disaster-proof architecture protects businesses of any description from downtime.

Alan Gin – Founder & CEO, ZeroNines