Tuesday, July 03, 2012

Amazon #FAIL in the recent storms. Failover needs to be part of your cloud-based product design

Yes, the recent storms in the Mid-Atlantic were bad: 100-mile-an-hour winds and heavy rain. Here in the Baltimore-DC area there are still many people without power. But Amazon suffered a failure that took out some major services.
David Linthicum puts it really well. 
I agree. Somebody at Amazon has been failing. Maybe the services are "oversold" or "overcommitted", so that losing part of the capacity leaves some customers with insufficient resources. However, if major players like Netflix and Pinterest can't get the resources they need, then something must have been seriously wrong. These data centers should be designed to withstand severe weather; it is not an unusual occurrence.
Amazon must have been making some serious trade-offs (aka gambles) in managing costs and profitability, and those gambles have obviously come back to bite it.
Taking things into our own hands
This latest incident has underlined that cloud providers are not bulletproof. As such, it is incumbent on cloud users to consider redundancy in their designs, right from the outset. As always, the challenge is cost.
If you want to use Amazon services, you can configure your deployment to operate across multiple regions. The challenge with this is that moving data between regions has a bandwidth cost that you don't see if you operate within a single region. Replication between regions can be technically challenging, depending on your configuration, and it can also get expensive.
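To get a feel for the bandwidth side of that trade-off, a rough back-of-the-envelope estimate can help. The sketch below is only an illustration: the function name, the daily data volume and the per-GB price are all made-up placeholders, not Amazon's actual rates, so plug in the figures from your own provider's price list.

```python
def monthly_replication_cost(gb_changed_per_day: float,
                             price_per_gb: float,
                             days: int = 30) -> float:
    """Estimate the monthly cost of copying changed data to a second region.

    Both inputs are assumptions you supply yourself: how much data changes
    each day, and what your provider charges per GB moved between regions.
    """
    if gb_changed_per_day < 0 or price_per_gb < 0 or days < 0:
        raise ValueError("inputs must not be negative")
    return gb_changed_per_day * price_per_gb * days


# Hypothetical figures, for illustration only:
# 50 GB of changes per day at a placeholder price of 0.02 per GB.
estimate = monthly_replication_cost(gb_changed_per_day=50, price_per_gb=0.02)
print(f"Estimated monthly replication cost: {estimate:.2f}")
```

The number itself matters less than the habit: working it out at design time means the replication bill is a decision rather than a surprise.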
Using multiple cloud providers is not simple or straightforward either. Rackspace is promoting OpenStack, and Microsoft has indicated some intention to work with OpenStack.
While it is feasible to configure your platform on different cloud platforms, the challenge is managing demand across them, particularly if you want the ability to load balance or have instant failover.
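At its simplest, failover comes down to keeping a priority-ordered list of places your service runs and sending traffic to the first one that passes a health check. The following is a minimal sketch of that idea, not a production design: the endpoint names are hypothetical, and the health check is passed in as a function so that you can substitute whatever probe suits your own setup.

```python
from typing import Callable, Optional, Sequence


def pick_endpoint(endpoints: Sequence[str],
                  is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first endpoint, in priority order, that passes its health check.

    Returns None if every endpoint fails, so the caller can decide what
    to do when nothing is available.
    """
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except Exception:
            # A health check that raises is treated the same as a failed check.
            continue
    return None


# Hypothetical endpoints: a primary region, a second region, another provider.
ENDPOINTS = [
    "primary-region.example.com",
    "secondary-region.example.com",
    "other-provider.example.com",
]

# Simulate the primary region being down.
down = {"primary-region.example.com"}
chosen = pick_endpoint(ENDPOINTS, lambda endpoint: endpoint not in down)
print(chosen)
```

The selection logic is the easy part. The hard part, as noted above, is making sure the second and third entries in that list have the data and the capacity to take over when they are called on.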
These days, when there is a strong urge to shorten the time to market, it can still pay dividends in ongoing operational costs to think about scalability, resilience and failover as part of the initial design of your product. Failure to do so can lead to embarrassing outages that are not of your own choosing.