How VividCortex Achieves Gapless Aggregator Restarts

Posted by Alex Slotnick on Jun 14, 2016 10:58:30 AM

As many users already know, we run a set of agents in each box of each customer's environment. All but two of these agents are set up to continuously fetch statistics from the connected OS and the different database instances that the agents are able to find.

The two agents that do not fetch stats act as supervisors instead, monitoring and controlling all the other agents in the box. One of these is an agent called vc-aggregator (read a bit more about this agent here, here, and here), which serves as the aggregation point for all the metrics the other agents read.

In the past, we ran into an issue whenever we needed to restart the aggregator's process: the restart left a small gap in the host's charts. The reason? Only complete minutes can be pushed to the API. A restart interrupts the current minute and leaves it incomplete, so that fragment of a minute cannot be pushed.
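To make the "complete minutes only" rule concrete, here is a minimal sketch in Go. The `minuteBucket` type and its methods are hypothetical names invented for illustration, not the aggregator's actual code; the sketch only assumes that a minute is made up of 60 one-second samples and is pushable once all of them are present.

```go
package main

import "fmt"

// minuteBucket is a hypothetical container for one minute of
// per-second samples (an assumption, not the agent's real type).
type minuteBucket struct {
	seen  [60]bool
	value [60]float64
}

// add records the sample for the given second (0-59) of the minute.
func (m *minuteBucket) add(second int, v float64) {
	m.seen[second] = true
	m.value[second] = v
}

// complete reports whether every second of the minute has a sample.
func (m *minuteBucket) complete() bool {
	for _, ok := range m.seen {
		if !ok {
			return false
		}
	}
	return true
}

func main() {
	var m minuteBucket
	// Simulate a restart at second 40: seconds 40-59 never arrive.
	for s := 0; s < 40; s++ {
		m.add(s, 1.0)
	}
	fmt.Println("pushable:", m.complete())
}
```

With only 40 of the 60 seconds collected, `complete` returns false, which is the situation a restart used to leave behind.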

A single missed minute is a relatively minor problem, but it's an unfortunate eyesore when it occurs, and we don't want customers to miss even a single minute if it can be avoided -- single-second granularity is one of VividCortex's most powerful features, and we never like seeing gaps of any size in a customer's information.

This image demonstrates what a restart with a gap looks like. Notice the issue?


The considerable chunk of white space running through the chart at about 5:12:00pm shows the kind of gap we're not thrilled to see. Again, an omission like this represents a relatively minor hiccup, but, nonetheless, a hiccup we'd prefer to avoid.

Fortunately, the fix is fairly simple. We're still pushing to the API once a minute, but we're now also able to save the current minute -- the minute that is being interrupted by the restart -- up to the point of the restart, so the next vc-aggregator can recover it.

The "after" image below demonstrates what restarts look like now, fixed and gapless. You can see where the restart occurred, around 1:26:30, because vc-aggregator's memory usage drops at that point. But note that despite the restart, there is no longer any time missing from the chart.


Our customers' data is now complete, and our second-by-second monitoring stays consistent across a variety of activity throughout our system, giving VividCortex users deep insight into their data. No more gaps!
