Adrian Cockcroft really nailed it when he said that a monitoring system has to be more reliable than what it’s monitoring. I don’t mind admitting that in our first year or so, we had some troubles with losing telemetry data. Customers were never sure whether their systems were offline, the agents were down, or we were not collecting the data. Even a few seconds of missing data is glaringly obvious when you have 1-second resolution data. There’s nowhere to hide.
It was embarrassing and we made it a top priority to fix. And fix it we did. This isn’t news, but we never wrote about it, so it’s time. Hopefully this is helpful to someone else building systems like ours, where the workload is hard in unusual ways, and all sorts of interesting things break in ways you wouldn’t expect. Here’s how we built a system that’s highly resilient at scale, and doesn’t lose data.
Agent In-Memory Spooling
The first set of changes we made were to our agents. We added a small, very short-lived round-robin in-memory buffer and coded the agents to handle specific API responses and network problems. If there’s a temporary failure, the chunk of data goes into the buffer and gets retried. This works well for transient “hiccups” but is a dangerous thing to do in general.
This is actually the most obvious of the changes, which explains why we did it first! It also explains why we got so many requests from customers for this kind of thing. Every time a customer’s firewall would break our outbound connections, we’d troubleshoot it and the customer would say “can you make the agents spool to disk?” It’s a good suggestion but it’s also a foot-gun. We put a lot of effort into making sure our agents don’t cause troubles on customer systems. Spooling anything to disk is much more dangerous in my experience than the “safe” things we do that have occasionally caused edge-case problems.
In a diverse customer base, the most banal of things will blow up badly… but after a few months we had things working really well. However, we still had fundamental challenges in our backend systems that were causing troubles regardless of how resilient the agents were.
Our APIs were initially a monolith. There are a lot of problems with monolithic APIs, and that’s worth a blog post someday. For purposes of never losing data, breaking into smaller, tightly purposed APIs is really important. This way they can all be operated separately.
Still more important is separating read and write paths. Reads can tend to be long-running and potentially use a lot of resources, which are difficult to constrain in specific scenarios. Writes need to just put the data somewhere durable ASAP and finish so they’re not tying up resources. These two conflict; reads can block the resources the writes need, causing e.g. waiting for a database connection, or worse still, dying un-serviced while we reboot the API to fix a resource-hogging read. You can read more about the challenges and solutions at our blog post about seeing in-flight requests and blockers in real-time.
After separating our monolith into smaller services, separating reads and writes, and including our opensource libraries for managing in-flight requests, we had a much more resilient system. But there was still one major problem.
Decoupling From The Database
Our APIs were still writing directly to the database, meaning that any database downtime or other problems were guaranteed to cause us to lose incoming data as soon as the agent’s round-robin buffer filled up. We had a short window for downtime-causing changes, but no more.
The “obvious” solution to this is a queueing system, like RabbitMQ or similar. However, after seeing those in action at a lot of customers while I was a consultant, I didn’t like them very much. It’s not that they don’t work well. They usually do, although indeed they do fail in very difficult ways when things go wrong. What bothers me about them is that they are neither here-nor-there architecturally and instead of simplifying the architecture, they make it more complex in a lot of cases.
What I wanted, I thought, was not a queue but a message bus. The queue is okay inasmuch as it decouples the direct dependency between components in the architecture, but a message bus implies ordering and organizing principles that I didn’t see expressed in message queues. I wanted a “river of data” flowing one direction, from which everyone could drink.
And then we found Kafka and realized we didn’t want a bus or river, we wanted a log. I’ll leave you to read more on the log as a unifying abstraction if you haven’t yet. I intuitively knew that Kafka was the solution we were looking for. In previous jobs I’d built similar things using flat files in a filesystem (which is actually an incredibly simple, reliable, high performance way to do things). We discussed amongst ourselves and all came to the same conclusion.
Kafka actually took us a while to get into production; more than six months, I think. There were sharp edges and problems with client libraries in Go and so on. Those were solved and we got it up and running. We had one instance where we bled pretty heavily on a gotcha in partition management and node replacement. Maybe a couple other minor things I’m forgetting. Other than that, Kafka has been exactly what it sounds like.
Kafka is a huge part of why we don’t lose data anymore. Our APIs do the minimal processing and then write the data into Kafka. Several very important problems are solved, easily and elegantly: HA, decoupling, architectural straightforwardness.
More Agent Changes
But we weren’t done yet. While talking with Adrian Cockcroft (one of our advisors, who works with us on a weekly basis) we brought up another customer networking issue where some data didn’t get sent from the agents and expired from the buffer. Although this issue had been a customer problem, we knew there were still mistakes we could make that would cause problems too:
- We could forget to renew our SSL key.
- We could forget to pay our DNS provider.
- We could accidentally terminate our EC2 instances that load-balance and proxy.
There are still single points of failure and there always will be. What if we set up a backup instance of our APIs, we wondered? With completely separate resources end-to-end? Separate DNS names and providers, separate hosting, separate credit cards for billing, and so on? Agents could send data to these as a backup if the main instance were down.
I know, you’re probably thinking “just go through the main instances and make them have no SPOFs!” but we were doing a what-if thought experiment, “what if we do a separate one instead, will we get 99% of the benefit at a tiny fraction of the cost and effort of really hardening our main systems?” You see, each incremental 9 of availability is exponential in cost and effort.
It was just a thought, and it led somewhere great: instead of duplicating our entire infrastructure, rely on one of the most robust systems on the Internet. If you guessed Amazon S3, you’re right.
It was Adrian’s suggestion: if the APIs are down or unreachable for some reason, and we’re about to expire data from the ring buffer, instead pack the data up, encrypt it, and write it to a write-only S3 bucket. Monitor S3 for data being written to it (which should “never happen” of course) and pull that data out, verify everything about it, and push it into Kafka.
The beauty of this system is that it has very few moving parts. We wouldn’t want to use it as our primary channel for getting data into our backend, but it’s great for a fallback. We’ve architected it to layer anonymity and high security on top of S3’s already high security, and of course it’s configurable so we can disable it if customers dislike it.
As a bonus, we found one set of agents were sending data to S3 that shouldn’t have been, and found a bug in our round-robin buffer! This is always the worry about infrequently-used “emergency flare” code–it’s much more likely to have bugs than code that runs constantly.
Your mileage may vary, but in our case we’ve achieved the level of resilience and high availability we need, for a large and fast-moving inbound stream, with commodity/simple components, by doing the following:
- Make agents spool locally, and send to S3 as a last-ditch effort
- Decompose APIs into smallish “macroservices” bundles
- Run critical read and write paths through entirely separate channels
- Decouple writes from the databases with Kafka
I’d love to hear your feedback in the comments!