We’ve become so used to the cloud being always on and available that yesterday’s Amazon Web Services (AWS) S3 outage, or, as Amazon characterized it on Twitter, “S3 is experiencing high error rates”, caught many by surprise.  The outage manifested itself in interesting ways, surfacing some non-obvious dependencies on S3.  An application could be generally up and running, but images stored in S3 weren’t accessible and couldn’t be displayed.  Many apps that rely on S3 storage were affected.  There were even claims that half the internet was down.  You can find a balanced post-outage article here.

So, in a world where the assumption is that the Internet is ubiquitous, always on and available, what steps should application architects and developers take to protect the application and its users when these types of outages happen?  When some elements of an app are working and others (especially storage) are not, there’s always the concern that you can end up in an inconsistent state.  This gets even more complicated when caching schemes in web apps, mobile apps and browsers come into play.  Keep in mind that, with all of the redundancy built into cloud services like AWS, these types of outages are rare.  Still, it happened yesterday, and it’s not the first time.

Here are some thoughts around mitigating these risks.

  • Monitor to find out as soon as possible.
    • Build in ongoing health monitoring of your application’s services and the services they depend on (within the application and leveraging AWS monitoring services)
      • Coarse-grained monitoring is fine as a starting point
      • In an increasingly microservices-oriented world this is just good practice
  • Think about the User Experience during a failure.
    • If the services are critical to your application (and as architects and developers we should know which ones are), let the users know that some services are currently unavailable (perhaps providing the option of notifying them when services are back up). A first cut could be to consider every service as critical
    • If the services aren’t critical (for example, displaying images, or editing when most of the app is about presenting information), tell the user that the service is currently unavailable, that you’re working on it, and that the rest of the application is still OK to use
    • If the app is just not available, have a default page that states that clearly (likely already there for most apps)
  • Have a Business Continuity and Disaster Recovery plan.
    • You should be prepared for outages that are more likely than an AWS outage.
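
As a rough illustration of the first two points, here is a minimal Python sketch of coarse-grained health monitoring combined with a user-facing notice. Everything in it is an assumption made for illustration: the `Service` class, the `probe` and `user_notice` functions, and the message wording are hypothetical, and each `check` stands in for whatever real probe you would use (for example, a small read against the storage bucket).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Service:
    name: str                   # hypothetical label, e.g. "image storage"
    critical: bool              # can the app be used without this service?
    check: Callable[[], bool]   # returns True when the service looks healthy


def probe(services: List[Service]) -> Dict[str, bool]:
    """Run every health check; a check that raises counts as unhealthy."""
    status: Dict[str, bool] = {}
    for svc in services:
        try:
            status[svc.name] = bool(svc.check())
        except Exception:
            status[svc.name] = False
    return status


def user_notice(services: List[Service], status: Dict[str, bool]) -> str:
    """Build the message to show users; an empty string means all is well."""
    down = [svc for svc in services if not status.get(svc.name, False)]
    if not down:
        return ""
    names = ", ".join(svc.name for svc in down)
    if any(svc.critical for svc in down):
        return (f"Some services are currently unavailable: {names}. "
                "Please try again later.")
    return (f"Currently unavailable: {names}. We are working on it, "
            "and the rest of the application is still OK to use.")
```

Treating every service as critical, as suggested above for a first cut, just means setting `critical=True` everywhere and relaxing it later as you learn which failures the application can tolerate.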

Dealing with a cloud outage should fall under your Risk Management Plan, including strategies for risk acceptance, mitigation, avoidance, and transfer. We’ve given you some ideas for reducing the impact of an outage, but how do you plan to reduce the probability of an outage affecting your service?  Amazon can help with this through its support for multiple regions (rather than just multiple availability zones within a single region).  The effort and cost of supporting multiple regions need to be balanced against the cost of downtime.  Understanding the architectural limitations of a cloud system and their implications is important to this balancing act.
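
To make the multi-region idea concrete, here is a minimal Python sketch of a read that falls back to a second region when the first one fails. It assumes the data has already been copied to the second region by some replication mechanism, which is where most of the effort and cost mentioned above would sit. The names `read_with_fallback` and `AllRegionsFailed` are hypothetical, and each reader is a placeholder for a real storage client bound to one region.

```python
from typing import Callable, List, Sequence, Tuple


class AllRegionsFailed(Exception):
    """Raised when no configured region could serve the object."""


def read_with_fallback(
    key: str,
    readers: Sequence[Tuple[str, Callable[[str], bytes]]],
) -> Tuple[str, bytes]:
    """Try each (region, reader) pair in order and return the first success.

    Returns the region that answered together with the object's bytes, so
    the caller can log or alert when a fallback region had to be used.
    """
    errors: List[str] = []
    for region, read in readers:
        try:
            return region, read(key)
        except Exception as exc:  # any failure means: try the next region
            errors.append(f"{region}: {exc}")
    raise AllRegionsFailed("; ".join(errors))
```

This only covers reads. Writes during an outage are harder, because the regions can drift apart, which is the inconsistent-state concern raised earlier.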

AWS has been very reliable to date and such outages are rare. However, it is worth building in some simple monitoring and notification capability, so that users are told about a problem and are protected from using the application in a potentially compromised state.  This is good practice regardless of how and where the application is deployed.  This outage just highlights that.