Saturday, Apr 23 2011

April 21: In 2 Bullet Points

There has been so much posted on the AWS outage in the US East region. Some great detailed blogs on designing high availability for the cloud, stories from people who survived, the modern equivalent of CNN moments for those who didn’t, and arguments over why this was the app owner’s fault and not Amazon’s, and vice versa. As always, there is a lot of fluff around the cloud. Who is at fault and who is to blame? Is the cloud a failed concept? A hyped-up load of bollocks?
There are two sides to every story, and anyone with a practical bone in their body will realise it.

1. It was a bug, and a failure by Amazon’s architects.

“AWS region is a single API end point and control plane, one AZ EBS fail took out EBS control plane for a short while, that’s the big bug” – adrianco

The cloud should not fail in this manner, because Amazon is supposed to be designing a “service” that stays up, built on components that are “designed to fail”. Amazon is as responsible for its customers’ satisfaction as anyone else. Lots of unsatisfied customers = #fail. I don’t care what the SLA commitments are, because if you rely on SLAs you are asking for problems.

That being said, the bug has been found, and I am sure the rearchitecting process is in progress. Amazon will be a stronger service provider for it.

2. Buyer beware.

No matter how resilient your infrastructure vendor (whether it be Tandem, IBM, Whitebox PC or Amazon) tells you your platform is, design your app for failure. You are ultimately responsible for your customers’ experience, so you make the risk assessment of how much to invest in resiliency at the app level. If your business is critical, dig into how a service is architected, look for the weak point and architect around it. The guys at Everyblock said it pretty cleanly (as did a few others): Amazon gave guidance on how to deploy apps for more resilience. They didn’t foresee the whole issue, but they gave some insight into regions, availability zones and a number of the data and network service features, as guidance that you need to design around some weaknesses.
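To make “design your app for failure” a little more concrete, here is a rough Python sketch of the idea, not anyone’s production code. It tries a request against an ordered list of zones and falls through to the next one when a zone is down. The names call_with_failover, AllZonesFailed and fake_request are made up for illustration, and the zone names are just example labels; a real app would plug in its own service call and catch narrower errors.

```python
from typing import Callable, Dict, Sequence, TypeVar

T = TypeVar("T")


class AllZonesFailed(Exception):
    """Raised when no zone in the list could serve the request."""


def call_with_failover(zones: Sequence[str], request: Callable[[str], T]) -> T:
    """Try each zone in order and return the first successful result."""
    errors: Dict[str, Exception] = {}
    for zone in zones:
        try:
            return request(zone)
        except Exception as exc:  # assumption: any exception means "zone unusable"
            errors[zone] = exc
    raise AllZonesFailed(f"all zones failed: {errors}")


def fake_request(zone: str) -> str:
    # Hypothetical stand-in for a real service call; pretends one zone is down.
    if zone == "us-east-1a":
        raise ConnectionError("zone unavailable")
    return f"served from {zone}"


# Include a zone in a second region, since a regional problem can hit several zones at once.
result = call_with_failover(["us-east-1a", "us-east-1b", "us-west-1a"], fake_request)
print(result)
```

The point is the ordering of the list: how many zones and regions you put in it is exactly the risk assessment described above.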

I am sure there are customers with genuine grievances: those that designed for the failure of Amazon based on the guidance given and the information available, and still failed. I would like to hear those stories, but I am sure we never will.

Contributed by: Brad Vaughan