Learning the right lessons from the Amazon cloud outage

Updated: May 02, 2011

We're now experiencing a backlash from this crash. People are reconsidering the wisdom of moving to the cloud, and to public clouds in particular. Perhaps the large infrastructure vendors, who warned their customers about the security and reliability issues of public clouds in order to sell more gear for building private clouds, were right after all?

Not so fast. If we place the Amazon crash into its proper context, we are in a better position to learn the right lessons from this crisis, rather than reacting out of fear to an event taken out of that context. Here, then, are some essential lessons we should take away from the crash:

  • There is no such thing as 100 percent reliability. In fact, nothing in IT is 100 percent: no code is 100 percent bug-free, no system is 100 percent crashproof, and no security is 100 percent impenetrable. Just because Amazon came up snake eyes on this throw of the dice doesn't mean that public clouds are any less reliable than they were before the crisis. Whether investing in the stock market or building a high-availability IT infrastructure, the best way to lower risk is to diversify. You got eggs? The more baskets the better.
  • This particular crisis is unlikely to happen ever again. We can safely assume that Amazon has some wicked smart cloud experts, and that they had already built a cloud architecture that could withstand most challenges. Suffice it to say, therefore, that the latest crisis had an unusual and complex set of causes. It also goes without saying that those experts are working feverishly to root out those causes, so that this particular set of circumstances won't happen again.

  • The unknown unknowns are by definition unpredictable. Even though the particular sequence of events that led to the current crisis is unlikely to happen again, it is relatively likely that other, entirely unpredictable issues will arise in the future. But such issues might very well affect private, hybrid, or community clouds just as much as they might affect the public cloud again. In other words, bailing on public clouds to take refuge in the supposedly safer private cloud arena is an exercise in futility.
  • The most important lesson for Amazon to learn is more about visibility than reliability. The weakest part of Amazon's cloud offerings is the lack of visibility it provides its customers. This "never mind the man behind the curtain" attitude is part of how Amazon supports the cloud abstraction I discussed in the previous ZapFlash. But now it's working against Amazon and its customers. For Amazon to build on its success, it must open the kimono a bit and provide its customers a level of management visibility into its internal infrastructure that it's been uncomfortable delivering to this point.
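
The "more baskets" point in the first lesson can be made concrete with a little arithmetic. The sketch below is only an illustration: the 99 percent availability figure is a made-up number, not Amazon's, the function names are hypothetical, and the calculation assumes that providers fail independently of one another, which real outages with shared causes may well violate.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(availabilities):
    """Chance that at least one provider is up.

    Assumes failures are independent (an assumption, not a given):
    the service is down only when every provider is down at once.
    """
    p_all_down = 1.0
    for a in availabilities:
        if not 0.0 <= a <= 1.0:
            raise ValueError("availability must be between 0 and 1")
        p_all_down *= 1.0 - a
    return 1.0 - p_all_down


def expected_downtime_hours(availabilities):
    """Expected hours per year with every provider down simultaneously."""
    return (1.0 - combined_availability(availabilities)) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical figure: each basket is up 99 percent of the time.
    for baskets in (1, 2, 3):
        hours = expected_downtime_hours([0.99] * baskets)
        print(f"{baskets} basket(s): about {hours:.3f} hours down per year")
```

Under these assumptions each extra basket cuts expected downtime by a factor of one hundred. The unknown unknowns discussed above are exactly the kind of thing that can break the independence assumption, so the real-world benefit is likely to be smaller than the arithmetic suggests.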