Blog

Published by Alex Slotnick on May 4, 2017 10:16:57 AM

Percona Live 2017 Replay: Baron Schwartz's Keynote Address

Last week, Percona Live 2017 gave us the chance to meet with our friends and peers at the best open-source database conference in the world. Hundreds of thought leaders and experts came together to discuss and present cutting-edge ideas in our industry, demonstrating how rapidly monitoring and data technologies continue to evolve.

Read More
Published by Alex Slotnick on Nov 20, 2015 5:01:57 PM

The Value of Log Structured Merge Trees

Indexes are usually built by way of a data structure; typically, that structure takes the form of a “tree.” Most commonly, the structure of choice is a B-Tree, which is a hierarchical organization defined by the arrangement and interactions of its nodes.

Read More
Published by Alex Slotnick on Sep 9, 2015 11:01:48 AM

VividCortex on Tech.Co: Can Your Business Benefit From a Cloud Database?

Should you move your database to the cloud? Over on Tech.Co, VividCortex's CEO and founder Baron Schwartz has published an article that covers some of the major benefits of such a shift. Cloud computing has changed tremendously over the past decade, and the potential for databases in the cloud is only growing. Baron's article explains what some of those changes might mean for you and your data.

Read More
Published by Baron Schwartz on May 26, 2015 6:31:00 AM

How We Ensure VividCortex Never Loses Data

Adrian Cockcroft really nailed it when he said that a monitoring system has to be more reliable than what it’s monitoring. I don’t mind admitting that in our first year or so, we had some troubles with losing telemetry data. Customers were never sure whether their systems were offline, the agents were down, or we were not collecting the data. Even a few seconds of missing data is glaringly obvious when you have 1-second resolution data. There’s nowhere to hide.

It was embarrassing and we made it a top priority to fix. And fix it we did. This isn’t news, but we never wrote about it, so it’s time. Hopefully this is helpful to someone else building systems like ours, where the workload is hard in unusual ways, and all sorts of interesting things break in ways you wouldn’t expect. Here’s how we built a system that’s highly resilient at scale, and doesn’t lose data.

Agent In-Memory Spooling

The first set of changes we made were to our agents. We added a small, very short-lived round-robin in-memory buffer and coded the agents to handle specific API responses and network problems. If there’s a temporary failure, the chunk of data goes into the buffer and gets retried. This works well for transient “hiccups” but is a dangerous thing to do in general.
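
To illustrate the idea, here's a minimal sketch in Go of that kind of fixed-size, round-robin buffer with a retry pass. The chunk type and the send function are simplified stand-ins, not our production agent code:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// chunk is a placeholder for a batch of metrics awaiting delivery.
type chunk struct {
	collected time.Time
	payload   []byte
}

// ringBuffer holds a bounded number of chunks; when it's full, the oldest
// chunk is overwritten, so memory use stays fixed and stale data expires.
type ringBuffer struct {
	slots []chunk
	head  int
	count int
}

func newRingBuffer(size int) *ringBuffer {
	return &ringBuffer{slots: make([]chunk, size)}
}

func (r *ringBuffer) push(c chunk) {
	r.slots[r.head] = c
	r.head = (r.head + 1) % len(r.slots)
	if r.count < len(r.slots) {
		r.count++
	}
}

// pop returns the oldest buffered chunk, if any.
func (r *ringBuffer) pop() (chunk, bool) {
	if r.count == 0 {
		return chunk{}, false
	}
	idx := (r.head - r.count + len(r.slots)) % len(r.slots)
	r.count--
	return r.slots[idx], true
}

// send stands in for the real API call; here it always fails transiently.
func send(c chunk) error {
	return errors.New("api temporarily unreachable")
}

func main() {
	buf := newRingBuffer(8) // small and short-lived on purpose

	// On a failed send, park the chunk in the buffer instead of losing it.
	c := chunk{collected: time.Now(), payload: []byte(`{"metrics":"..."}`)}
	if err := send(c); err != nil {
		buf.push(c)
	}

	// Periodically retry whatever is buffered; a chunk is only lost if it
	// keeps failing long enough for the ring to overwrite it.
	if pending, ok := buf.pop(); ok {
		if err := send(pending); err != nil {
			buf.push(pending) // still failing: re-buffer and try again later
		}
	}
	fmt.Println("chunks still buffered:", buf.count)
}
```

The bounded size is what makes this safe to run on customer systems: memory use can't grow without limit, and nothing ever touches the disk.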

This is actually the most obvious of the changes, which explains why we did it first! It also explains why we got so many requests from customers for this kind of thing. Every time a customer’s firewall would break our outbound connections, we’d troubleshoot it and the customer would say “can you make the agents spool to disk?” It’s a good suggestion but it’s also a foot-gun. We put a lot of effort into making sure our agents don’t cause troubles on customer systems. Spooling anything to disk is much more dangerous in my experience than the “safe” things we do that have occasionally caused edge-case problems.

In a diverse customer base, the most banal of things will blow up badly… but after a few months we had things working really well. However, we still had fundamental challenges in our backend systems that were causing troubles regardless of how resilient the agents were.

API Changes

Our APIs were initially a monolith. There are a lot of problems with monolithic APIs, and that's worth a blog post someday. For the purposes of never losing data, breaking the monolith into smaller, tightly purposed APIs is really important, because then each of them can be operated separately.

Still more important is separating read and write paths. Reads tend to be long-running and can use a lot of resources, which are difficult to constrain in specific scenarios. Writes just need to put the data somewhere durable ASAP and finish, so they're not tying up resources. The two conflict: reads can hold the resources writes need, leaving writes waiting for a database connection or, worse, dying un-serviced while we reboot the API to fix a resource-hogging read. You can read more about the challenges and solutions in our blog post about seeing in-flight requests and blockers in real time.
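
As a rough illustration of what that separation buys you, here's a sketch, with hypothetical handlers and ports rather than our actual API, of read and write paths served by entirely separate listeners, so a resource-hogging read can be throttled or restarted without ever blocking writes:

```go
package main

import (
	"io"
	"log"
	"net/http"
	"time"
)

// enqueue is a placeholder for handing the payload off to durable storage.
func enqueue(payload []byte) {}

func main() {
	// Write path: accept the payload, hand it off somewhere durable, and
	// return immediately so the connection isn't tied up.
	writeMux := http.NewServeMux()
	writeMux.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, "bad request", http.StatusBadRequest)
			return
		}
		enqueue(body)
		w.WriteHeader(http.StatusAccepted)
	})

	// Read path: long-running, resource-hungry queries live behind a
	// separate listener (and in practice a separate process), so they can
	// be rebooted without dropping a single write.
	readMux := http.NewServeMux()
	readMux.HandleFunc("/query", func(w http.ResponseWriter, r *http.Request) {
		// ... long-running query against the backing store ...
		w.Write([]byte("results"))
	})

	writeSrv := &http.Server{Addr: ":8081", Handler: writeMux, ReadTimeout: 5 * time.Second}
	readSrv := &http.Server{Addr: ":8082", Handler: readMux}

	go func() { log.Fatal(readSrv.ListenAndServe()) }()
	log.Fatal(writeSrv.ListenAndServe())
}
```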

After separating our monolith into smaller services, separating reads and writes, and including our open-source libraries for managing in-flight requests, we had a much more resilient system. But there was still one major problem.

Decoupling From The Database

Our APIs were still writing directly to the database, meaning that any database downtime or other problems were guaranteed to cause us to lose incoming data as soon as the agent’s round-robin buffer filled up. We had a short window for downtime-causing changes, but no more.

The “obvious” solution to this is a queueing system, like RabbitMQ or similar. However, after seeing those in action at a lot of customers while I was a consultant, I didn’t like them very much. It’s not that they don’t work well; they usually do, although they do fail in very difficult ways when things go wrong. What bothers me about them is that they are neither here nor there architecturally: instead of simplifying the architecture, they make it more complex in a lot of cases.

What I wanted, I thought, was not a queue but a message bus. The queue is okay inasmuch as it decouples the direct dependency between components in the architecture, but a message bus implies ordering and organizing principles that I didn’t see expressed in message queues. I wanted a “river of data” flowing one direction, from which everyone could drink.

And then we found Kafka and realized we didn’t want a bus or river, we wanted a log. I’ll leave you to read more on the log as a unifying abstraction if you haven’t yet. I intuitively knew that Kafka was the solution we were looking for. In previous jobs I’d built similar things using flat files in a filesystem (which is actually an incredibly simple, reliable, high performance way to do things). We discussed amongst ourselves and all came to the same conclusion.

Kafka actually took us a while to get into production; more than six months, I think. There were sharp edges and problems with client libraries in Go and so on. Those were solved and we got it up and running. We had one instance where we bled pretty heavily on a gotcha in partition management and node replacement. Maybe a couple other minor things I’m forgetting. Other than that, Kafka has been exactly what it sounds like.

Kafka is a huge part of why we don’t lose data anymore. Our APIs do the minimal processing and then write the data into Kafka. Several very important problems are solved, easily and elegantly: HA, decoupling, architectural straightforwardness.
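
To give a concrete feel for how little the write path does, here's a minimal sketch of an API handler's hand-off to Kafka. The broker addresses and topic name are made up, and the sarama Go client is used purely as an example; the specific client library isn't the point:

```go
package main

import (
	"log"

	"github.com/Shopify/sarama"
)

// publishMetrics does the minimal work the write path needs to do: put
// the payload somewhere durable (a Kafka topic) and return.
func publishMetrics(producer sarama.SyncProducer, payload []byte) error {
	_, _, err := producer.SendMessage(&sarama.ProducerMessage{
		Topic: "metrics-ingest", // hypothetical topic name
		Value: sarama.ByteEncoder(payload),
	})
	return err
}

func main() {
	cfg := sarama.NewConfig()
	cfg.Producer.RequiredAcks = sarama.WaitForAll // don't ack until replicas have it
	cfg.Producer.Return.Successes = true          // required by the sync producer

	producer, err := sarama.NewSyncProducer([]string{"kafka-1:9092", "kafka-2:9092"}, cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer producer.Close()

	if err := publishMetrics(producer, []byte(`{"host":"db1","metrics":"..."}`)); err != nil {
		log.Printf("write to kafka failed: %v", err)
	}
}
```

Once the message is acknowledged, downstream consumers can process it from the log at their own pace.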

More Agent Changes

But we weren’t done yet. While talking with Adrian Cockcroft (one of our advisors, who works with us on a weekly basis) we brought up another customer networking issue where some data didn’t get sent from the agents and expired from the buffer. Although this issue had been a customer problem, we knew there were still mistakes we could make that would cause problems too:

  • We could forget to renew our SSL key.
  • We could forget to pay our DNS provider.
  • We could accidentally terminate our EC2 instances that load-balance and proxy.

There are still single points of failure and there always will be. What if we set up a backup instance of our APIs, we wondered? With completely separate resources end-to-end? Separate DNS names and providers, separate hosting, separate credit cards for billing, and so on? Agents could send data to these as a backup if the main instance were down.

I know, you’re probably thinking “just go through the main instances and make them have no SPOFs!” But we were doing a what-if thought experiment: what if we built a separate one instead? Would we get 99% of the benefit at a tiny fraction of the cost and effort of really hardening our main systems? You see, each incremental 9 of availability is exponential in cost and effort.

It was just a thought, and it led somewhere great: instead of duplicating our entire infrastructure, rely on one of the most robust systems on the Internet. If you guessed Amazon S3, you’re right.

It was Adrian’s suggestion: if the APIs are down or unreachable for some reason, and we’re about to expire data from the ring buffer, instead pack the data up, encrypt it, and write it to a write-only S3 bucket. Monitor S3 for data being written to it (which should “never happen” of course) and pull that data out, verify everything about it, and push it into Kafka.
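
The agent-side half of that fallback might look roughly like the following sketch. The bucket name, key layout, and encrypt placeholder are illustrative only, not our actual implementation:

```go
package main

import (
	"bytes"
	"fmt"
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// encrypt is a placeholder: the real data would be encrypted before it
// ever leaves the host.
func encrypt(payload []byte) []byte { return payload }

// spillToS3 is the last-ditch path: if the APIs are unreachable and a
// chunk is about to expire from the in-memory buffer, write it to a
// write-only bucket so it can be verified and replayed into Kafka later.
func spillToS3(svc *s3.S3, agentID string, payload []byte) error {
	key := fmt.Sprintf("spool/%s/%d", agentID, time.Now().UnixNano())
	_, err := svc.PutObject(&s3.PutObjectInput{
		Bucket: aws.String("example-agent-spool"), // hypothetical bucket
		Key:    aws.String(key),
		Body:   bytes.NewReader(encrypt(payload)),
	})
	return err
}

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-east-1")}))
	if err := spillToS3(s3.New(sess), "agent-42", []byte(`{"metrics":"..."}`)); err != nil {
		log.Printf("s3 fallback failed too: %v", err)
	}
}
```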

The beauty of this system is that it has very few moving parts. We wouldn’t want to use it as our primary channel for getting data into our backend, but it’s great for a fallback. We’ve architected it to layer anonymity and high security on top of S3’s already high security, and of course it’s configurable so we can disable it if customers dislike it.

As a bonus, we found that one set of agents was sending data to S3 when it shouldn't have been, which led us to a bug in our round-robin buffer! This is always the worry with infrequently used "emergency flare" code: it's much more likely to have bugs than code that runs constantly.

Conclusions

Your mileage may vary, but in our case we’ve achieved the level of resilience and high availability we need, for a large and fast-moving inbound stream, with commodity/simple components, by doing the following:

  • Make agents spool locally, and send to S3 as a last-ditch effort
  • Decompose APIs into smallish “macroservices” bundles
  • Run critical read and write paths through entirely separate channels
  • Decouple writes from the databases with Kafka

I’d love to hear your feedback in the comments!

Read More
Published by Baron Schwartz on Feb 24, 2015 3:16:00 AM

Schemaless Databases Don't Exist

There’s no such thing as a schemaless database. I know, lots of people want a schemaless database, and lots of companies are promoting their products as schemaless DBMSs. And schemaless DBMSs exist. But schemaless databases are mythical beasts because there is always a schema somewhere. Usually in multiple places, which I will later claim is what causes grief.

Read More
Published by Baron Schwartz on Jan 27, 2015 6:18:00 AM

How VividCortex's Agents Manage Logs

It can be scary to run agents on your critical servers. Misbehaving agent software can cause harm, including pegging your CPUs, filling up your disks, or eating all your memory and making your server swap. Fortunately, at VividCortex we have many years of experience with these problems and we designed our agents to avoid them from day one.

Our agents are self-limiting in every aspect of resource consumption, including log files. Unlike some software you may have experience with, VividCortex’s agents won’t fill your disks with logs or temp files, nor will they cause a lot of I/O. We ensure this through a variety of measures, but there are three basic techniques I’d like to mention here.

  1. The agents don’t generate or keep spools of data. We don’t spool metrics to disk, for example, nor do we create temporary files or other caches. Our metrics aggregator agent will retry failed transmissions up to 5 times by default, but that represents a very minimal amount of memory and doesn’t impact disk at all.
  2. We only log minimal information. It’s much better to avoid logging than to deal with the consequences of verbose logging. In normal operation, VividCortex’s agents will only log a periodic indication that they’re successfully communicating with our APIs. It’s basically a liveness heartbeat.
  3. The log messages that the agents do write won’t grow indefinitely. They essentially have a built-in logrotate mechanism. They will truncate and expire their own logs. This protects you against filling up your disks.

This last bit may sound silly (why don’t we just use standard logrotate instead?), but it’s actually very important. My experience as a consultant taught me that relying on system-provided functionality is a path to madness, because facilities like logrotate have bugs and undesired behaviors in a variety of circumstances. Furthermore, many of these can be triggered by nonstandard system configurations beyond our ability to anticipate, detect, and control. For these reasons we ship a known-good, clean implementation of everything we need (except for certain core system libraries), baked right into our agents.
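
To show what a built-in, self-limiting log mechanism can look like, here's a minimal sketch of a size-capped, self-rotating log writer in Go. The size limit and file layout are made up for illustration, not taken from our agents:

```go
package main

import (
	"log"
	"os"
)

// cappedLog is a self-rotating log writer: when the file grows past
// maxBytes, it is renamed to a single ".old" file and a fresh file is
// started, so logging can never fill the disk.
type cappedLog struct {
	path     string
	maxBytes int64
	f        *os.File
}

func openCappedLog(path string, maxBytes int64) (*cappedLog, error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0644)
	if err != nil {
		return nil, err
	}
	return &cappedLog{path: path, maxBytes: maxBytes, f: f}, nil
}

func (c *cappedLog) Write(p []byte) (int, error) {
	if info, err := c.f.Stat(); err == nil && info.Size()+int64(len(p)) > c.maxBytes {
		c.f.Close()
		os.Rename(c.path, c.path+".old") // keep exactly one previous generation
		f, err := os.OpenFile(c.path, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0644)
		if err != nil {
			return 0, err
		}
		c.f = f
	}
	return c.f.Write(p)
}

func main() {
	w, err := openCappedLog("agent.log", 10<<20) // cap at roughly 10 MB
	if err != nil {
		log.Fatal(err)
	}
	log.SetOutput(w)
	log.Println("heartbeat: APIs reachable") // the periodic liveness message
}
```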

Our agents are fully configurable, so all of the above are subject to tweaking and tuning if customers want. However, out of the box, our agents are configured to “just work” and be very respectful of system resources. One of the fastest ways to kill a system is to fill up its disk or cause a lot of I/O. High-performance servers, in particular, often run very close to a performance cliff, a delicate state where just a little bad behavior can cause the whole thing to tip over and die or freeze.

And how do we know that our agents don’t cause these problems? Because we capture per-process performance data at 1-second resolution, including CPU, memory, network, disk, and more, so if there’s a problem, we can see it. And because our Adaptive Fault Detection technology detects micro-stalls as short as 1 second, we also know when a badly behaved program impacts others. We do see this all the time with other software, and sometimes even with other monitoring software. But not with our agents.

Sign up for your free trial and experience the benefits of VividCortex for yourself!

Read More
