Analysis of Outages on Nov 15 and Nov 17, 2015

Posted by Baron Schwartz on Nov 20, 2015 5:09:06 PM

We’ve had a couple of issues with some of our server infrastructure recently, which have affected portions of our customer base. In this blog post I want to explain what has happened, why, and what we’re doing to correct and prevent it.

I am writing a combined report of these issues because the first one wasn’t fully understood when the second one happened, and because the issues largely have the same contributing factors.

I apologize to our customers who have been impacted. Monitoring is supposed to be more highly available than the monitored systems. I know firsthand how damaging it can be when you can’t access your monitoring data. I take this very seriously and the whole team is working hard to prevent it from recurring.

Summary of Incidents

  • On November 15, 2015, from 19:15 until 22:30 Eastern time, some customer data ingest was delayed. Up to 25% of customer environments were affected at peak.
  • On November 17, 2015, from 17:00 until 18:11 Eastern time, a similar incident occurred, with similar impact to a different set of customers.

The incidents were resolved by performing failovers and restarts.

There is no single root cause of either incident; rather, a variety of factors contributed:

  • High load caused by background processes
  • An apparent bug in MySQL that causes the server to freeze
  • Lack of good failover processes for the affected MySQL instances
  • More instability than expected in EC2 instances
  • Unclear responsibilities, communications, and procedures for incident handling

Background

VividCortex’s largest and highest-load system is our sharded datastore, which contains mostly time series data received from agents. It is write-heavy and is optimized for loss avoidance and fast reads of large amounts of data. It consists of:

  • Agent programs running on customer servers
  • Amazon’s Elastic Load Balancer
  • Proxy servers, using Nginx and HAProxy to route API requests
  • Externally facing API servers
  • Internal services (internal APIs)
  • Kafka for fast persistent ingest of raw metrics
  • Kafka consumers to process data from Kafka and build materialized views in MySQL
  • Redis as a write buffer in front of MySQL for some metric metadata
  • Partitioned MySQL databases for storing and processing all long-lived data
  • Background services for downsampling and reprocessing metrics, anomaly detection, enforcing retention policies, and many other tasks
  • Several other services such as DNS

All of these systems work together to provide a high-performance time series database for our customers.
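
To make this a little more concrete, here is a rough sketch of one piece of that pipeline: Redis acting as a write buffer in front of MySQL for metric metadata. This is purely illustrative and is not our production code; the client libraries (go-redis, database/sql with the go-sql-driver/mysql driver), the key name, and the table are assumptions chosen for the example.

```go
// Illustrative sketch only: Redis as a write buffer in front of MySQL.
package buffer

import (
	"context"
	"database/sql"
	"time"

	_ "github.com/go-sql-driver/mysql" // MySQL driver for database/sql
	"github.com/redis/go-redis/v9"
)

const queueKey = "metric-metadata-queue" // illustrative key name

// Enqueue pushes a serialized piece of metadata onto a Redis list, so callers
// return quickly instead of waiting on a MySQL write.
func Enqueue(ctx context.Context, rdb *redis.Client, item string) error {
	return rdb.LPush(ctx, queueKey, item).Err()
}

// Drain is run by a background worker: it pops buffered items and persists
// them to MySQL.
func Drain(ctx context.Context, rdb *redis.Client, db *sql.DB) error {
	for {
		// BRPOP blocks briefly so the worker idles cheaply when the queue is empty.
		vals, err := rdb.BRPop(ctx, time.Second, queueKey).Result()
		if err == redis.Nil {
			continue // nothing buffered right now
		}
		if err != nil {
			return err
		}
		// vals[0] is the key, vals[1] is the popped value.
		if _, err := db.ExecContext(ctx,
			"INSERT IGNORE INTO metric_metadata (payload) VALUES (?)", vals[1]); err != nil {
			// Put the item back at the tail so it is retried and not lost.
			_ = rdb.RPush(ctx, queueKey, vals[1]).Err()
			return err
		}
	}
}
```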

One of the most important architectural decisions implicit in the above is that in the event of a problem, our first priority is avoiding data loss. It is often acceptable for data to be delayed a bit, but data loss is a very serious problem. You can see this decision reflected in several areas:

  1. Our agents buffer and retry a bounded amount of metrics if they are not able to send them right away.
  2. If the agents can’t send the buffered data, they fall back to Amazon S3, where they securely and anonymously store the data they were unable to send (a sketch of this behavior follows the list).
  3. Successful API calls to store data immediately insert it into Kafka, which is essentially a distributed, replicated log file.
  4. Data that could not be sent directly and was stored in S3 is recovered by a special worker and inserted into Kafka as well.
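
To make items 1 and 2 concrete, here is a minimal sketch of the buffer-retry-then-S3 behavior in an agent. Again, this is illustrative rather than our actual agent code: the structure, buffer size, and the pluggable S3 upload function are assumptions made for the example.

```go
// Illustrative sketch only: not VividCortex's actual agent code.
package agent

import (
	"bytes"
	"errors"
	"log"
	"net/http"
	"time"
)

const maxBuffered = 1000 // bound on the number of buffered batches (illustrative)

type batch []byte // a serialized batch of metrics

type sender struct {
	apiURL   string
	buffer   []batch
	uploadS3 func(batch) error // fallback sink; in practice an S3 PutObject call
}

// send tries the API first; on failure it buffers the batch for a later retry.
func (s *sender) send(b batch) {
	if err := s.post(b); err == nil {
		return
	}
	if len(s.buffer) >= maxBuffered {
		// Buffer full: spill the oldest batch to S3 rather than dropping it.
		if err := s.uploadS3(s.buffer[0]); err != nil {
			log.Printf("S3 fallback failed: %v", err)
		}
		s.buffer = s.buffer[1:]
	}
	s.buffer = append(s.buffer, b)
}

// flush retries buffered batches; anything still unsendable goes to S3,
// where a recovery worker (item 4) later picks it up.
func (s *sender) flush() {
	remaining := s.buffer[:0]
	for _, b := range s.buffer {
		if err := s.post(b); err == nil {
			continue // sent successfully on retry
		}
		if err := s.uploadS3(b); err != nil {
			log.Printf("S3 fallback failed: %v", err)
			remaining = append(remaining, b) // keep it for the next flush
		}
	}
	s.buffer = remaining
}

func (s *sender) post(b batch) error {
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Post(s.apiURL, "application/octet-stream", bytes.NewReader(b))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return errors.New(resp.Status)
	}
	return nil
}
```

The design point this sketch tries to show is that data which cannot be delivered promptly is parked somewhere durable (the bounded buffer, then S3) rather than discarded, which is how the loss-avoidance priority described above plays out at the agent.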

The large stream of metrics is therefore asynchronous and redundant, and there is always some delay in processing it; the question is just whether the delay is noticeable to humans. We strive to keep this short and normally it is essentially zero. We monitor and alert on delays in reading data from Kafka.
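
For readers curious what “delay in reading data from Kafka” looks like operationally, here is a rough sketch of a consumer-lag check. It is not our monitoring code: it assumes the sarama Go client and measures lag in messages per partition, which is a common proxy for the time delay we actually care about.

```go
// Illustrative sketch only, assuming the sarama Kafka client.
package lagcheck

import (
	"fmt"

	"github.com/Shopify/sarama"
)

// consumerLag reports, per partition, how many messages a consumer group
// still has to read: the newest broker offset minus the group's next offset.
func consumerLag(brokers []string, group, topic string) (map[int32]int64, error) {
	client, err := sarama.NewClient(brokers, sarama.NewConfig())
	if err != nil {
		return nil, err
	}
	defer client.Close()

	om, err := sarama.NewOffsetManagerFromClient(group, client)
	if err != nil {
		return nil, err
	}
	defer om.Close()

	partitions, err := client.Partitions(topic)
	if err != nil {
		return nil, err
	}

	lags := make(map[int32]int64, len(partitions))
	for _, p := range partitions {
		newest, err := client.GetOffset(topic, p, sarama.OffsetNewest)
		if err != nil {
			return nil, err
		}
		pom, err := om.ManagePartition(topic, p)
		if err != nil {
			return nil, err
		}
		next, _ := pom.NextOffset() // next offset the group would consume
		pom.Close()
		if next < 0 {
			next = 0 // the group has not committed anything yet
		}
		lags[p] = newest - next
	}
	return lags, nil
}

// alertIfBehind illustrates a simple alert rule: flag any partition that is
// more than `threshold` messages behind the head of the topic.
func alertIfBehind(lags map[int32]int64, threshold int64) error {
	for p, lag := range lags {
		if lag > threshold {
			return fmt.Errorf("partition %d is %d messages behind", p, lag)
		}
	}
	return nil
}
```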

Once we read the data out of Kafka, the fundamental assumption is that in most situations, MySQL is a stable and reliable building block of infrastructure. It is, to borrow a phrase, boring technology. Redis has also proven to be boring and reliable. We use both of them in very simple ways, and most of our time series processing occurs in our APIs and internal services, not in MySQL or Redis per se.

The assumption that MySQL “just works” unless you do something unwise to it has held true over most of my career. As a result, our architecture reflects this experience and assumption. In particular, our systems are not built for automatic failover of MySQL. My experience has been that such systems usually cause far more trouble than they prevent, due to the enormous cost of a false-positive failover.

Instead of automatic failover, we have designed our system (as it currently exists) to require manual failover. Note that we do maintain replicated copies, both within our primary AWS region in northern Virginia and in Oregon. We have hot standby systems, but no automatic failover to them.

We have always known that we would need to change this someday. Automated MySQL failover can be made to work well if approached with great care; many companies have done so, perhaps most prominently Facebook. However, there’s a simple calculation to be made: impact versus likelihood, balanced against effort. My internal risk calculator told me this was a problem to solve at a larger scale, because we were unlikely to see the need until we had many more shards.

Where The Assumptions Failed

There are at least two problems with my assumptions.

The first is that we’re running our servers on EC2, and recently we’ve been experiencing more unreliability than we are used to. Several of our EC2 instances have become completely unavailable, including some that hold a lot of data. These instances cannot even be restarted from the AWS console or APIs and have to be rebuilt from scratch. When this happens, which is typically under high load, we prefer to try to rescue the instance if it’s running something stateful like MySQL, because rebuilding it and syncing a copy of the data onto it is slow.

The second is that we’re having some trouble with MySQL that we have not yet been able to diagnose, because we cannot access the instances while the trouble is happening. As far as we can tell, it’s a bug in MySQL. We believe it is likely a deadlock within the server, because the server goes completely idle and has to be killed with kill -9. After a restart, it works fine again. We’ve been researching the problem but haven’t positively identified what is happening, or whether others have seen and reported this bug, if that’s what it is.
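
As an aside, the kind of external probe that distinguishes a wedged mysqld (queries never return) from one that is merely busy looks roughly like the sketch below. The DSN, timeout, and packaging are placeholders, not our actual tooling; it assumes Go’s database/sql with the go-sql-driver/mysql driver.

```go
// Hypothetical watchdog sketch: a trivial query with a hard deadline. A
// timeout here is a strong hint that the server is hung, as described above.
package watchdog

import (
	"context"
	"database/sql"
	"fmt"
	"time"

	_ "github.com/go-sql-driver/mysql" // MySQL driver for database/sql
)

// probe opens a connection and runs SELECT 1 with a deadline.
func probe(dsn string, timeout time.Duration) error {
	db, err := sql.Open("mysql", dsn)
	if err != nil {
		return err
	}
	defer db.Close()

	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	var one int
	if err := db.QueryRowContext(ctx, "SELECT 1").Scan(&one); err != nil {
		return fmt.Errorf("mysql liveness probe failed (possible hang): %w", err)
	}
	return nil
}
```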

Why This Affected Customers

These problems are not supposed to affect our customers. This is where I believe we, as an organization, have the most to learn and our biggest opportunity to improve. Understanding why we’ve failed to keep your monitoring systems available despite a couple of failed servers is very important.

I’d like to stress that this is not a one-time event or project; it will be an ongoing effort and a continual part of the normal way we do business going forward.

At this time, the team and I believe the following factors have contributed the most to the incidents. There’s a lot to this, so I’ll keep it fairly high level:

  • The architecture we’ve built assumes less failure than we’re experiencing, and rapid growth has brought us to that level of failure more quickly than expected.
  • We have allowed specialization in specific teams and roles; for example, not everyone knows which servers do what, or how to find out quickly.
  • We’ve let some institutional knowledge and experience rest with the willing, instead of involving more people. If John’s always done it and Jane has never been required to, John’s unavailability is a problem.
  • Some capabilities, such as the ability to provision new EC2 instances, are only granted to specific people, and we’ve granted that to too few. (There is a security tradeoff to be considered here.)
  • Some things are not fully automated. We’re big believers in ChatOps and infrastructure as code, through systems such as Jenkins, Hubot and Ansible, but for “rare” things that have turned out to be more frequent recently, we haven’t made the investment in automation.
  • We have not invested enough time in documentation, particularly for playbooks that matter the most in these hopefully-infrequent scenarios. Likewise we have not done enough fire drills to practice replacing our infrastructure when parts of it fail.
  • Our preference to rescue, rather than replace, a database server has led to longer delays before we give up and perform a failover. We need to embrace failure as a more normal part of life in the cloud.
  • The way we’ve organized our teams has encouraged some responsibilities to be implicit. If it’s everyone’s job, it’s no one’s job.
  • We’re writing microservices at the code level, but we’re not microservice-oriented at the team level. Some of our services are shared and as such they are monitored, deployed, diagnosed, and bugfixed by “other people.”

Addressing those points is a long-term goal towards which I’m going to continually lead the team. In the near term, the immediate steps we’re taking are:

  1. Investigating how we can change our server workload to avoid triggering what appears to be the cause of the bug we haven’t been able to diagnose. Availability = MTBF / (MTBF + MTTR), and if we can increase MTBF we can improve this situation a lot (a small illustration follows this list).
  2. Creating a public status page where we can communicate and archive status information. We’re not good enough at communicating outages, either internally or externally. The tools and processes we’ve developed thus far haven’t proven to work well, and several team members have called this out explicitly in recent discussions. Communication is the backbone of remediation.
  3. Increasing access to our infrastructure so more people can join in quickly if there’s an incident.
  4. Building more automation and playbooks for the scenarios we’ve seen recently.
  5. Rehearsing database failovers, in particular.
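
To illustrate the availability formula in item 1, here is a tiny self-contained example with made-up numbers; none of these figures are measurements of our systems.

```go
// Purely illustrative arithmetic for Availability = MTBF / (MTBF + MTTR).
package main

import "fmt"

func availability(mtbf, mttr float64) float64 {
	return mtbf / (mtbf + mttr)
}

func main() {
	const hoursPerMonth = 730.0
	// Example: roughly one failure a month (MTBF ≈ 730h), with a 3h recovery
	// versus a 30-minute one, versus failing half as often.
	fmt.Printf("3h MTTR:    %.4f\n", availability(hoursPerMonth, 3))   // ≈ 0.9959
	fmt.Printf("0.5h MTTR:  %.4f\n", availability(hoursPerMonth, 0.5)) // ≈ 0.9993
	fmt.Printf("2x MTBF:    %.4f\n", availability(2*hoursPerMonth, 3)) // ≈ 0.9979
}
```

Both levers matter: recovering faster and failing less often each move availability noticeably, which is why the steps above address both.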

We will also continue to perform reviews and analysis not only of what’s happened, but of what could happen and how to prepare for it. VividCortex has been focusing a lot on rapid iteration and product development since our launch. We’ve received a huge amount of feedback and as a result have changed our direction greatly. At this point, however, we need to take the next steps in our journey and stop feeling as though all decisions are temporary. We’re growing up, fast, and we will work hard on building a culture that creates systems and processes to help us be secure, high-performance, and stable while accelerating the pace of innovation and continuing to be customer-driven and lean. (I believe those goals are reinforcing, not conflicting.)

To all those customers whose data was unavailable or delayed during these outages, again I apologize. I hope what I’ve written above provides more insight into how we build and operate our systems, why server failures caused customer-visible downtime, and what we are doing to prevent it going forward. I also hope you will reach out to me with any advice and suggestions, because I know many of you have solved similar problems. I believe we’re actually not too far from where we need to be (this is a course correction, not a turnaround), but as a group, you have much more experience than we do.

Finally, all of us welcome your feedback on this post-mortem itself, which was a team effort as well.

Thanks for being loyal and encouraging supporters of VividCortex through thick and thin. We’ll continue to work hard to earn your trust and business.
