On April 8th, we had a serious outage and we lost some performance metrics from your (our customers’) systems. I know how important this is to you, and above all I’m sorry. All of us are, and we’ve worked long and hard to fix the causes as quickly as we prudently could.
We’ve done a detailed internal analysis, and I’ve decided to keep this brief, because the main points don’t require a lot of detail to communicate. Background: our monitoring agent programs send time series data to our APIs, which write it to Kafka and then return an OK to the agent. If the APIs have issues, the agents don’t get an OK, and they use a redundant, separate fallback mechanism. This architecture is designed to trade latency for availability, with the goal of uptime and avoidance of data loss above all. Here’s what happened, in a nutshell.
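To make the contract concrete, here is a minimal sketch of the agent-side logic described above. The function and parameter names are illustrative, not our actual agent code: the point is simply that the agent treats anything other than an OK as a signal to use the fallback path, so no data is dropped.

```python
# Hypothetical sketch of the agent's send logic. api_send and
# fallback_send stand in for the real transports.

def send_metrics(points, api_send, fallback_send):
    """Trade latency for availability: prefer the primary API, but
    never drop data. If the API doesn't ack with OK, use the
    redundant fallback path instead."""
    try:
        if api_send(points) == "OK":
            return "primary"
    except Exception:
        pass                      # a transport error counts as "no OK"
    fallback_send(points)         # slower, separate, durable path
    return "fallback"

# Usage: a healthy API is used directly; a failing one routes to fallback.
stored = []
assert send_metrics([1, 2], lambda p: "OK", stored.extend) == "primary"
assert send_metrics([1, 2], lambda p: "ERROR", stored.extend) == "fallback"
assert stored == [1, 2]
```

The key property is that the agent decides based on the ack it receives; as the timeline below shows, the outage exposed a path where that ack stopped meaning what it was supposed to mean.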
- 9:53AM Eastern Daylight Time: We made what we thought were routine, nondisruptive configuration changes to Kafka servers. The purpose was to prepare for growth and keep our provisioning code clean. Instead, Kafka went completely offline and our APIs couldn’t write into it. The cluster was broken in a way that took a long time to recover even after the change was corrected; in the end, recovery took about an hour.
- 10:28AM: We discovered that our fallback mechanism wasn’t working. Writes to Kafka were being done in an asynchronous code path. Instead of writing to Kafka, getting a failure, and reporting the failure back to the agents, the API calls were passing the data to a background process and reporting an OK. So the agents didn’t activate their fallback.
- 10:28AM: We shut down the APIs in question to stop the data from being dropped on the floor. The agents started using the fallback.
- 11:01AM: Everything was back and fully operational, but we were far behind on processing the data from the fallback location. Catching up took anywhere from a few minutes to several hours, depending on the customer.
- That evening, we fixed the bug that made writes asynchronous. We tested and deployed this not long afterward.
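The failure mode in that timeline can be sketched in a few lines. This is an illustration under assumed names, not our actual service code: `buggy_handler` stands in for the asynchronous path we had, and `fixed_handler` for the synchronous behavior we restored that evening.

```python
import queue

# Hypothetical miniature of the bug: the API hands the payload to a
# background writer via a queue and acks immediately, so a failed
# Kafka write is never reported back to the agent.
pending = queue.Queue()

def buggy_handler(payload):
    pending.put(payload)        # fire-and-forget handoff to a background process
    return "OK"                 # acked before the write outcome is known

def fixed_handler(payload, kafka_write):
    try:
        kafka_write(payload)    # synchronous: the outcome is known here
    except Exception:
        return "ERROR"          # the agent sees the failure and falls back
    return "OK"

def offline_write(_):
    raise RuntimeError("kafka offline")

# With Kafka "offline", the buggy path still acks, silently dropping data:
assert buggy_handler({"ts": 1}) == "OK"
# The fixed path reports the failure, so the agent activates its fallback:
assert fixed_handler({"ts": 1}, offline_write) == "ERROR"
```

The asynchronous version isn’t wrong in general; it’s wrong here because the OK is a durability promise to the agent, and a promise can only be made after the write is known to have succeeded.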
This outage was a rude awakening for us in several ways. First, we’ve become accustomed to Kafka never being a problem. It was our fault; Kafka is amazing, but we took it for granted. Second, discovering that our fallback mechanism was circumvented was another jolt. We’ve beaten up on that in a number of different ways, both deliberately and during incidents, and it’s always worked. Of course, in hindsight it’s obvious that we overlooked this code path precisely because Kafka never fails. Finally, since we built this architecture, we have gotten used to never losing any customer data. We were confident that it worked as intended, and because of that we didn’t notice immediately that the problem was more than a delay.
The overall context for this is rapid growth in the amount of data we’re ingesting from customers. Running VividCortex is very much like the story about changing the wings while the plane is in the air. Landing and parking in the hangar is not an option. It’s in-flight, it’s full of customers, and we have to upgrade it from a Canadair Regional Jet to a Boeing 747.
Amid all of this, we’re building new features, improving performance by orders of magnitude, and generally growing our infrastructure rapidly. We have many projects in flight to address all of this: there are improvements to time series data processing; a series of splits and separations of data and services; work underway to support 1000x growth with the existing architecture; an investigation into a time series backend store that could be a better long-term fit for the requirements we’ve discovered we have; and so on.
So we fixed “the bug,” but what if there’s another one? What are we doing to prevent this kind of thing going forward? Although I’m very proud of what our team has accomplished, frankly there are still areas where we’re not as good as we should be. We’re working towards a virtually indestructible, practically infinitely scalable system, and we will get to the point where we’re running Chaos Monkey. That’s not just aspirational. As Adrian Cockcroft says, a monitoring company needs to be more available than the systems it monitors. We’ll achieve that.
In summary, I’m sorry for this outage. We’ve fixed the prima facie reasons it happened, and we’re working diligently on the meta-reasons, as we have always done. If you have any questions, please email me personally at firstname.lastname@example.org. And thanks for being loyal customers: you are the reason we do this.