AWS S3 Outage:

If you’ve been reading reports that websites and applications such as Quora, GitHub, Slack, Docker registry hub, Bitbucket, Coursera and many others were down on Tuesday, February 28, 2017, and wondering what happened, you’re not alone. It all began when S3 buckets in the US-East-1 (Northern Virginia) Region of AWS became inaccessible at about 09:45 AM PST. The AWS Service Health Dashboard took quite a while to show the status of services to the world, because the static content of the dashboard is itself hosted on AWS S3.

The issue was intermittent at first, and later the buckets became inaccessible altogether. It was partially resolved at 01:12 PM PST, when users could again perform GET and DELETE operations on their buckets, and the services were fully functional by 01:49 PM PST. By the time the official update came at around 02:10 PM PST, social media was abuzz with this disaster and its repercussions. In this blog post, we examine what exactly went wrong and how businesses can minimize the impact of such an AWS S3 outage.

Why Did It Happen?

AWS S3 (Simple Storage Service) is used extensively for hosting static websites, storing the static content of websites, holding data for analytics workloads, and keeping backups. In this outage, not only were websites and hosted services affected, but a whole lot of IoT devices as well, something we have seldom experienced so far.
AWS has 16 regions in total, and US-East-1 is one of them. AWS quotes 99.99% availability and 99.999999999% durability for its standard S3 service, and data within a region is spread across different data centers to survive the failure of any one of them. That protection does not help, however, when S3 becomes unavailable across an entire region, as it did here.
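To put the availability figure in perspective, here is a rough back-of-the-envelope calculation. It uses the 99.99% figure and the approximate start and end times quoted in this post; treating the percentage as a yearly downtime budget is a simplifying assumption made only for illustration.

```python
from datetime import datetime

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year that a given availability figure permits."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Approximate times taken from this post (both PST, same day).
outage_start = datetime(2017, 2, 28, 9, 45)
outage_end = datetime(2017, 2, 28, 13, 49)
outage_minutes = (outage_end - outage_start).total_seconds() / 60

budget = allowed_downtime_minutes(99.99)
print(f"Yearly downtime budget at 99.99%: {budget:.2f} minutes")
print(f"Length of this outage: {outage_minutes:.0f} minutes")
print(f"Outage used {outage_minutes / budget:.1f}x the yearly budget")
```

Under these assumptions, a single outage of roughly four hours exceeds what 99.99% availability would allow over an entire year, which is why relying on one region alone is risky.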

Disaster recovery is one of the most overlooked aspects of cloud computing, owing to the incremental costs involved at the time of setting up. While there is no immediate return on investment, think of it as an insurance policy: something that you hope you’ll never need, but cannot do without!

Here’s a simple approach that helps minimize the impact of such outages on your business operations in the future:

Cross-region replication allows you to replicate your data, or the entire static website, to a bucket in a separate AWS region. It is easy to configure, but you need to keep an eye on the costs involved.
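As a minimal sketch of what the replication setup could look like, the snippet below builds a replication configuration and, optionally, applies it with boto3 (the AWS SDK for Python, a third-party package). The bucket names, the IAM role ARN and the rule ID are hypothetical placeholders, and the sketch assumes an IAM role that S3 can assume for replication already exists.

```python
def build_replication_config(role_arn: str, destination_bucket: str, prefix: str = "") -> dict:
    """Build a replication configuration that copies objects to another bucket.

    An empty prefix means every object in the source bucket is replicated.
    """
    return {
        "Role": role_arn,  # IAM role S3 assumes to replicate objects
        "Rules": [
            {
                "ID": "replicate-to-secondary-region",  # hypothetical rule name
                "Prefix": prefix,
                "Status": "Enabled",
                "Destination": {"Bucket": f"arn:aws:s3:::{destination_bucket}"},
            }
        ],
    }


def enable_replication(source_bucket: str, config: dict) -> None:
    """Apply the configuration. Requires boto3 and valid AWS credentials."""
    import boto3  # imported here so the builder above works without the SDK

    s3 = boto3.client("s3")
    # Replication requires versioning on the source bucket
    # (the destination bucket needs versioning enabled as well).
    s3.put_bucket_versioning(
        Bucket=source_bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )
    s3.put_bucket_replication(Bucket=source_bucket, ReplicationConfiguration=config)


# Hypothetical names, for illustration only.
config = build_replication_config(
    role_arn="arn:aws:iam::123456789012:role/example-replication-role",
    destination_bucket="example-site-replica",
)
# enable_replication("example-site-primary", config)  # uncomment to apply
```

Keeping the configuration builder separate from the call that applies it makes it easy to review the settings before touching a live bucket.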

Enabling static website hosting on the new bucket lets you serve the same website from it. The new bucket is a replica of your primary bucket, and this gives you another publicly routable S3 endpoint.
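Turning on website hosting for the replica bucket might be sketched as follows, again using boto3. The bucket name and the document names (`index.html`, `error.html`) are assumptions for illustration; use whatever your primary bucket is configured with.

```python
def build_website_config(index_document: str = "index.html",
                         error_document: str = "error.html") -> dict:
    """Build a static website configuration for an S3 bucket."""
    return {
        "IndexDocument": {"Suffix": index_document},  # served for directory-style requests
        "ErrorDocument": {"Key": error_document},     # served when a request fails
    }


def enable_website_hosting(bucket: str, config: dict) -> None:
    """Apply the website configuration. Requires boto3 and valid AWS credentials."""
    import boto3  # imported here so the builder above works without the SDK

    boto3.client("s3").put_bucket_website(Bucket=bucket, WebsiteConfiguration=config)


website_config = build_website_config()
# enable_website_hosting("example-site-replica", website_config)  # hypothetical bucket name
```

The replica should use the same index and error documents as the primary, so that visitors see identical behaviour whichever bucket serves them.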

A Route 53 health check with DNS failover is the mechanism that monitors the health of the primary bucket and, should it fail, initiates a failover route to the secondary bucket. Moreover, once your primary bucket is back online, Route 53 automatically points back to it. The entire process is automated and does not require any human intervention once set up.
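The failover routing described above could be expressed as a pair of Route 53 failover records, sketched below. The domain name, endpoints, health check ID and hosted zone ID are all hypothetical placeholders, and the sketch uses plain CNAME records for simplicity. It assumes the health check on the primary endpoint has already been created, and it does not address the bucket-naming requirements that S3 website endpoints impose on custom domains.

```python
from typing import Optional


def failover_record(name: str, endpoint: str, role: str,
                    health_check_id: Optional[str] = None, ttl: int = 60) -> dict:
    """Build one failover record set; role must be 'PRIMARY' or 'SECONDARY'."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be 'PRIMARY' or 'SECONDARY'")
    record = {
        "Name": name,
        "Type": "CNAME",
        "SetIdentifier": f"{role.lower()}-site",  # distinguishes the two records
        "Failover": role,
        "TTL": ttl,  # a short TTL lets clients pick up a failover quickly
        "ResourceRecords": [{"Value": endpoint}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return record


def build_failover_change_batch(name: str, primary_endpoint: str,
                                secondary_endpoint: str, health_check_id: str) -> dict:
    """Build a change batch that creates or updates both failover records."""
    records = [
        failover_record(name, primary_endpoint, "PRIMARY", health_check_id),
        failover_record(name, secondary_endpoint, "SECONDARY"),
    ]
    return {
        "Comment": "Failover between primary and replica website endpoints",
        "Changes": [{"Action": "UPSERT", "ResourceRecordSet": r} for r in records],
    }


def apply_change_batch(hosted_zone_id: str, change_batch: dict) -> None:
    """Submit the change batch. Requires boto3 and valid AWS credentials."""
    import boto3  # imported here so the builders above work without the SDK

    boto3.client("route53").change_resource_record_sets(
        HostedZoneId=hosted_zone_id, ChangeBatch=change_batch
    )


# Hypothetical values, for illustration only.
batch = build_failover_change_batch(
    name="www.example.com.",
    primary_endpoint="primary-endpoint.example.net",
    secondary_endpoint="secondary-endpoint.example.net",
    health_check_id="00000000-0000-0000-0000-000000000000",
)
# apply_change_batch("ZEXAMPLE123", batch)  # uncomment to apply
```

Only the primary record carries the health check here: while the check passes, Route 53 answers with the primary endpoint, and it switches to the secondary record when the check fails.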
The above approach is an example of the ‘Multi-Site’ Cloud DR Strategy – especially useful when you want to ensure minimum downtime with the least human intervention. You can read more about this and other such strategies on our Cloud DR Page. Feel free to contact BluePi for custom Cloud Disaster Recovery Services for your business.

Sources also say that, along with the S3 outage, it was impossible to launch new EC2 instances in the US-East-1 region, and other services such as Elastic File System, Elastic Load Balancer, Simple Email Service, Relational Database Service, Lambda, Elastic MapReduce and Elastic Beanstalk were all impacted. AWS has not yet released any information about the root cause of the outage, but there were continuous updates on its service dashboard page and its Twitter handle. We’ll keep a close watch on this in the days ahead.

Until we get the next update, Think Cloud, Think Cloud Disaster Recovery, Think BluePi!

Himanshu Gupta, BluePi

Himanshu Gupta is an AWS Certified Technical Professional, Microsoft Certified Specialist: Azure Infrastructure Solutions, and a Microsoft Certified Professional; currently working as a System Administrator at BluePi.
His key areas of interest include Cloud Computing, Databases, Containers and Infrastructure as Code.
