Report on August 1st Production Outage

Posted by Baron Schwartz on Aug 10, 2014 2:48:00 AM

On Friday, August 1st, at about 5pm EDT, we had an outage in production that made our application and APIs partially unavailable. Though functionality was quickly restored, we lost data for a window of time. I want to personally apologize to our customers. We take availability, correctness, and performance very seriously, and we have made sure this problem won’t happen again.

What happened was fairly simple. An engineer was working on automation around AWS provisioning with Ansible, specifically the ec2 provisioning module. He was looking for a way to describe the launched EC2 instances in Ansible variables and to ensure that only one instance with a given tag exists at any one time. The module supports this through the exact_count parameter, used together with count_tag to define which tag to count.
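For illustration, a correct task of that kind looks roughly like the sketch below. The AMI ID, region, and Name tag value are hypothetical, not our real configuration.

```yaml
# Keep exactly one instance tagged Name=api-proxy running.
# count_tag tells the module which tag to count toward exact_count.
- name: Ensure exactly one api-proxy instance exists
  ec2:
    image: ami-0123456789abcdef0   # hypothetical AMI ID
    instance_type: t2.micro
    region: us-east-1
    exact_count: 1
    count_tag:
      Name: api-proxy
    instance_tags:
      Name: api-proxy
```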

The problem was a simple omission in an automation script. The engineer set exact_count to 1 but forgot to set count_tag to count by the Name tag. This caused the Ansible run to (attempt to) terminate every instance in production except one! Fortunately, most instances were termination-protected, so only a few instances were actually terminated. We have redundancy built into our architecture, and each of the terminated instances had a secondary, so we were able to recover with a couple of proxy configuration changes and were back up within 5 minutes of the issue being reported.
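For contrast, here is a hypothetical reconstruction of the faulty task (not our actual script), with exact_count set but count_tag missing:

```yaml
# BUG: exact_count is set, but count_tag is missing, so the count is not
# scoped to the Name tag. In this incident, the effect was that Ansible
# attempted to terminate every running instance in production except one.
- name: Ensure exactly one api-proxy instance exists
  ec2:
    image: ami-0123456789abcdef0   # hypothetical AMI ID
    instance_type: t2.micro
    region: us-east-1
    exact_count: 1
    instance_tags:
      Name: api-proxy
```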

We know there’s never a single root cause, but this incident pointed out that we’d gradually, and without noticing it, developed a specific part of our codebase by testing in production, in a manner of speaking. It sounds silly, but it’s easy to miss the truth about what you’re doing when you’re managing a lot of different things.

To prevent this from happening again, we’ve created an isolated AWS account for infrastructure-as-code development. We’re also adding termination protection to all instances, as well as taking some other, smaller measures.
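As a sketch of the termination-protection piece (hypothetical values again, and the termination_protection option assumes a version of the ec2 module that supports it), a launch task can request protection directly:

```yaml
# Launch or verify an instance with API termination protection enabled,
# so an accidental terminate request is refused by AWS.
- name: Launch api-proxy with termination protection
  ec2:
    image: ami-0123456789abcdef0   # hypothetical AMI ID
    instance_type: t2.micro
    region: us-east-1
    termination_protection: yes    # assumes module support for this option
    exact_count: 1
    count_tag:
      Name: api-proxy
    instance_tags:
      Name: api-proxy
```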

On a broader level, we’ve had several far-reaching infrastructure projects underway for months. We’re about half-migrated to a new data pipeline backend that’s much less tightly coupled and much more resilient, and when we complete this, the number and type of incidents that could cause temporary or permanent unavailability or data loss will be greatly reduced.

We always appreciate the supportive feedback from customers, but at the same time we know we did let you down in a very real way, and we take it to heart. Use the comments to ask me anything else you’d like to know, and let me know any suggestions you have.
