How VividCortex's Agents Manage Logs

Posted by Baron Schwartz on Jan 27, 2015 6:18:00 AM

It can be scary to run agents on your critical servers. Misbehaving agent software can cause harm, including pegging your CPUs, filling up your disks, or eating all your memory and making your server swap. Fortunately, at VividCortex we have many years of experience with these problems and we designed our agents to avoid them from day one.

Our agents are self-limiting in every aspect of resource consumption, including log files. Unlike some software you may have experience with, VividCortex’s agents won’t fill your disks with logs or temp files, nor will they cause a lot of I/O. We ensure this through a variety of measures, but there are three basic techniques I’d like to mention here.

  1. The agents don’t generate or keep spools of data. We don’t spool metrics to disk, for example, nor do we create temporary files or other caches. Our metrics aggregator agent will retry failed transmissions up to 5 times by default, but that takes only a very small amount of memory and doesn’t touch the disk at all.
  2. We only log minimal information. It’s much better to avoid logging than to deal with the consequences of verbose logging. In normal operation, VividCortex’s agents will only log a periodic indication that they’re successfully communicating with our APIs. It’s basically a liveness heartbeat.
  3. The log files that the agents do write won’t grow indefinitely. The agents essentially have a built-in logrotate mechanism: they truncate and expire their own logs. This protects you against filling up your disks.
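To make the first technique concrete, here is a minimal sketch of a sender that retries in memory and never spools to disk. It is not VividCortex's actual code; the class name `BoundedSender`, the `max_pending` cap, and the choice to drop the oldest payload when the buffer is full are all assumptions made for illustration. Only the default of 5 retries comes from the description above.

```python
from collections import deque


class BoundedSender:
    """Hypothetical sender: failed payloads wait in a capped in-memory queue."""

    def __init__(self, send, max_retries=5, max_pending=100):
        self._send = send                # callable that raises OSError on failure
        self._max_retries = max_retries  # retries allowed after the first attempt
        # A deque with maxlen silently discards the oldest entry when full,
        # so memory use stays bounded no matter how long an outage lasts.
        self._pending = deque(maxlen=max_pending)

    def submit(self, payload):
        """Queue a payload for the next flush."""
        self._pending.append((payload, 0))

    def pending(self):
        """Number of payloads currently held in memory."""
        return len(self._pending)

    def flush(self):
        """Attempt every queued payload once; return how many were given up on."""
        dropped = 0
        for _ in range(len(self._pending)):
            payload, failures = self._pending.popleft()
            try:
                self._send(payload)
            except OSError:
                failures += 1
                if failures > self._max_retries:
                    dropped += 1  # retries exhausted: discard, never write to disk
                else:
                    self._pending.append((payload, failures))
        return dropped
```

The trade-off in this sketch is deliberate: during a long outage some data is lost, but the host's disk and memory are never put at risk.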

This last bit may sound silly – why don’t we just use standard logrotate instead? – but it’s actually very important. My experience as a consultant taught me that relying on system-provided functionality is a path to madness, because facilities like logrotate have bugs and undesired behaviors in a variety of circumstances. Furthermore, many of these might be triggered by nonstandard system configurations that are beyond our ability to anticipate, detect, and control. For these reasons we ship a known-good, clean implementation of everything we need (except for certain core system libraries), baked right into our agents.
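As a rough illustration of in-process log rotation, the sketch below uses `RotatingFileHandler` from Python's standard library rather than an external logrotate job. This is not how VividCortex's agents are implemented; the logger name, the 1 KB size limit, and the two backup files are arbitrary values chosen to keep the example small.

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler


def make_capped_logger(path, max_bytes=1024, backups=2):
    """Return a logger whose files roll over in-process, plus its handler."""
    logger = logging.getLogger("agent-sketch")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False
    # The handler rolls over before a write would reach max_bytes and keeps
    # at most `backups` old files, deleting anything older.
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
    handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
    logger.addHandler(handler)
    return logger, handler


def disk_usage(directory):
    """Total size in bytes of all files directly inside `directory`."""
    return sum(
        os.path.getsize(os.path.join(directory, name))
        for name in os.listdir(directory)
    )


def demo(messages=2000, max_bytes=1024, backups=2):
    """Write many heartbeat lines and return the bytes left on disk."""
    with tempfile.TemporaryDirectory() as directory:
        logger, handler = make_capped_logger(
            os.path.join(directory, "agent.log"), max_bytes, backups
        )
        try:
            for i in range(messages):
                logger.info("heartbeat %d ok", i)
        finally:
            logger.removeHandler(handler)
            handler.close()
        return disk_usage(directory)
```

With these settings, disk use is bounded by `(backups + 1) * max_bytes` regardless of how many messages are written, as long as each individual message is smaller than `max_bytes`.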

Our agents are fully configurable, so all of the above is subject to tweaking and tuning if customers want. However, out of the box, our agents are configured to “just work” and be very respectful of system resources. One of the fastest ways to kill a system is to fill up its disk or cause a lot of I/O. High-performance servers, in particular, often run very close to a performance cliff, a delicate state where just a little bad behavior can cause the whole thing to tip over and die or freeze.

And how do we know that our agents don’t cause these problems? Well, because we capture per-process performance data at 1-second resolution – including CPU, memory, network, disk and more – if there’s a problem, we can see it. And because our Adaptive Fault Detection technology detects micro-stalls as short as 1 second, we also know when a badly behaved program impacts others. And we do see this, all the time – with other software, and sometimes even with other monitoring software. But not with our agents.

Sign up for your free trial and experience the benefits of VividCortex for yourself!
