1. It was a bug, and a failure by Amazon's architects.
“AWS region is a single API end point and control plane, one AZ EBS fail took out EBS control plane for a short while, that’s the big bug” – adrianco
The cloud should not fail in this manner, because the provider is supposed to be designing a "service" that stays up on top of components that are "designed to fail". Amazon is as responsible for its customers' satisfaction as anyone else. Lots of unsatisfied customers = #fail. I don't care what the SLA commitments are, because if you rely on SLAs you are asking for problems.
That being said, the bug has been found, and I am sure the rearchitecting process is already in progress. Amazon is a stronger service provider for it.
2. Buyer beware.
No matter how resilient your infrastructure vendor (whether it be Tandem, IBM, a whitebox PC, or Amazon) tells you your platform is, design your app for failure. You are ultimately responsible for your customers' experience, and therefore you make the risk assessment of how much to invest in resiliency at the app level. If your business is critical, then dig into how a service is architected, look for the weak points, and architect around them. The guys at Everyblock said it pretty cleanly (and a few others have said similar): Amazon gave guidance on how to deploy apps for more resilience. They didn't foresee the whole issue, but they gave some insight into regions, availability zones, and a number of the data and network service features, as guidance that you need to design around some weaknesses.
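To make "design your app for failure" concrete, here is a minimal, hypothetical sketch of app-level failover across availability zones. The zone names and the health-check logic are illustrative assumptions for this example, not Amazon's published guidance verbatim; a real deployment would probe actual service endpoints rather than a status table.

```python
# Hypothetical sketch: fail over to a healthy availability zone.
# Zone names and health data are assumptions made up for illustration.

def pick_healthy_zone(zones, is_healthy):
    """Return the first zone whose health check passes, or None if all fail."""
    for zone in zones:
        if is_healthy(zone):
            return zone
    return None

# Simulate one zone being down, as in the EBS incident discussed above.
zone_status = {"us-east-1a": False, "us-east-1b": True, "us-east-1c": True}
zones = ["us-east-1a", "us-east-1b", "us-east-1c"]

active = pick_healthy_zone(zones, lambda z: zone_status[z])
print(active)  # prints "us-east-1b": traffic shifts away from the failed zone
```

The point of the sketch is simply that the failover decision lives in your application, not in the provider's SLA: if every zone in the list fails the check, the function returns None and it is still your code that decides what degraded mode to offer.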
I am sure there are customers with genuine grievances: those that designed for Amazon failures based on the guidance given and the information available, and still failed. I would like to hear those stories, but I am sure we never will.
Contributed by: Brad Vaughan