The Power and Ease of Adaptive Fault Detection

Posted by Alex Slotnick on Nov 30, 2015 2:39:41 PM

Adaptive Fault Detection is a prime example of how efficiently VividCortex can help you understand and optimize your system. We define a “fault” as a special kind of system stall: a period during which applications are asking the server to perform a great deal of work, but that work is getting bottlenecked and therefore not completing.

This understanding of faults is based on Queueing Theory (if you need a refresher on QT, be sure to check out our recent and highly accessible ebook, Everything You Need to Know About Queueing Theory); we detect faults by using advanced statistics and machine learning. VividCortex’s Fault Detection is completely adaptive and self-tuning – it doesn’t require any configuration. The program can detect faults as short as one second in duration. Even the most attentive user would likely fail to notice system stalls so small, but with our Adaptive Fault Detection, they’re easily diagnosed and solved.
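To make the definition of a fault concrete, here is a minimal Python sketch of the idea: flag any one-second sample where work piles up (concurrency jumps well above its recent baseline) while completions fail to rise with it. This is not VividCortex's algorithm, which relies on more advanced statistics and machine learning and needs no tuning. The function name `find_stalls`, the `window` and `factor` parameters, and the sample data are all hypothetical and exist only for illustration.

```python
from statistics import median


def find_stalls(concurrency, completions, window=30, factor=3.0):
    """Return indices of one-second samples that look like stalls.

    Illustrative heuristic only (not VividCortex's method). A sample is
    flagged when concurrency exceeds `factor` times the median of the
    previous `window` samples, while completions are no higher than
    their own median over that same window.
    """
    stalls = []
    for i in range(window, len(concurrency)):
        base_conc = median(concurrency[i - window:i])
        base_done = median(completions[i - window:i])
        # Work is queueing up: far more queries in flight than usual.
        piling_up = concurrency[i] > factor * max(base_conc, 1)
        # ...but the server is not finishing more work than usual.
        not_draining = completions[i] <= base_done
        if piling_up and not_draining:
            stalls.append(i)
    return stalls


# Synthetic one-second samples: a steady server with a single stall.
concurrency = [4] * 60   # queries in flight during each second
completions = [200] * 60  # queries completed during each second
concurrency[45] = 141
completions[45] = 50

stalled_seconds = find_stalls(concurrency, completions)
print(stalled_seconds)
```

Requiring both conditions is the point of the sketch: a server that is simply busy, with high concurrency and correspondingly high completions, is not flagged, whereas a second in which queries accumulate without finishing is.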

But why is it important for users to lock onto such small problems? Well, system performance problems almost always start small and, over time, snowball into much more serious issues. Catching them early is the best way to prevent major performance problems and outages.

When you’re using Adaptive Fault Detection, VividCortex displays faults in an easily understood timeline, running from left to right.

fault-detection-1.png

In the example here, you can see three stalls – marked as vertical bars along the timeline – that occurred in one of our production database servers. The widgets beneath the timeline (shown in the image below) illustrate what was happening in the server at the moment of each fault. When you select a fault, a red line appears on each chart to indicate the precise instant when that fault occurred.

fault-detection-2 img.jpg

Right away, you’ll notice a few telltale signs of a fault. For instance, take a look at the chart tracking MySQL concurrency: when you hover over the fault in the timeline, you can see that at the moment of the fault, the concurrency spiked up to 141 queries, all trying to run simultaneously – much more than this machine can handle.

Now, let’s look at the fault on the far right of the timeline; we see that at the time it occurred, the newdownsample program had just started running, and we can see that there was a notably high amount of Disk Throughput and CPU activity in the server as a result.

fault-detection-6 img.jpg

And, again, when we hover over the pixel representing the exact instant of the fault, we see that MySQL concurrency spiked dramatically – in this case, to a clogged 154 queries running at a single moment.

fault-detection-3 img.jpg

Looking further down the collection of summary widgets, we find Top MySQL Queries. Here, we see that the third query listed is quite abusive, and that it also arises from the newdownsample program. That abusive query whisks resources away from other processes.

fault-detection-4 img.jpg

Clearly, it would be beneficial to examine this query more closely. Doing so is simple: all you need to do is click to drill down into the widget, to see more top running queries during the relevant time range. You can then select that query and continue drilling down into it with the various tools at your disposal.

fault-detection-5 img.jpg

Also note that we’ve just introduced our new Profiler tool, which takes our older Top Queries and Top Processes tools to a whole new level. You’ll be able to drill down with more precision and customization than ever before. The APIs are faster too! To read more about the Profiler tool, check out our recent announcement here.

And finally, to watch VividCortex’s Adaptive Fault Detection in action, check out Baron Schwartz’s demo in the video below.
