Unexpected downtime is one of your worst nightmares, but most attempts to find problems before they happen are threshold-based. Thresholds create noise, and alerts fire false positives so often that you may miss actual problems.
When we began building VividCortex, we introduced Adaptive Fault Detection, a feature to detect problems through a combination of statistical anomaly detection and queueing theory. It’s our patent-pending technique to detect system stalls in the database and disk. These are early indicators of serious problems, so it’s really helpful to find them. (Note: “fault” is kind of an ambiguous term for some people. In the context we’re using here, it means a stall/pause/freeze/lockup).
The initial version of fault detection enabled us to find hidden problems nobody suspected, but as our customer base diversified, we found more situations that could fool it. We’ve released a new version that improves upon it. Let’s see how.
How It Works
The old fault detection algorithm was based on statistics, exponentially weighted moving averages, and queueing theory. The new implementation ties together concepts from queueing theory, time series analysis and forecasting, and statistical machine learning. The addition of machine learning is what enables it to be even more adaptive (i.e. even less hard-coded).
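To make the EWMA idea concrete, here's a minimal sketch of the general technique of adaptive anomaly detection with exponentially weighted moving estimates. This is an illustration only, not VividCortex's actual algorithm; the `alpha` and `k` values are arbitrary assumptions.

```python
# Sketch: flag a sample as anomalous when it deviates from an
# exponentially weighted moving mean by more than k moving standard
# deviations. The baseline adapts to the metric, so "typical" is
# learned per time series rather than hard-coded.

class EwmaDetector:
    def __init__(self, alpha=0.1, k=4.0):
        self.alpha = alpha  # smoothing factor: higher adapts faster
        self.k = k          # tolerance, in moving standard deviations
        self.mean = None
        self.var = 0.0

    def observe(self, x):
        """Return True if x looks anomalous relative to recent history."""
        if self.mean is None:  # first sample: just initialize the baseline
            self.mean = x
            return False
        diff = x - self.mean
        anomalous = self.var > 0 and abs(diff) > self.k * self.var ** 0.5
        # Standard EWMA recurrences for mean and variance.
        self.mean += self.alpha * diff
        self.var = (1 - self.alpha) * (self.var + self.alpha * diff * diff)
        return anomalous
```

Because the mean and variance both track the series, the same detector tolerates wide swings on a noisy metric while still catching a small deviation on a stable one.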
Take a look at the following screenshot of some key metrics on a system during a fault. Notice how much chaos there is in the system overall. For example, note the bursts of network throughput just before and after the fault. Despite this, we would not have detected a fault if work were still getting done. We’re able to reliably detect single-second problems in systems that a human would struggle to make any sense of.
Adaptive fault detection is not based on simple thresholds on metrics such as threads_running. Rather, its algorithm adapts dynamically to work for time series ranging from fairly stable (such as MySQL Concurrency shown above) to highly variable (such as MySQL Queries in the example above). Note how different those metrics are. What does “typical” even mean in such a system?
At the same time, we clearly identify and highlight both the causes and the effects in the system. For example, a screenshot of a different part of the user interface for the same time period highlights how badly a variety of queries were impacted. The fault stalled them.
If we drill down into the details page for one of those queries, we can see that the average latency around the time of the fault is significantly higher, implying that it’s taking more time to get the same amount of work done.
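The queueing-theory intuition behind that observation is Little's law: concurrency = throughput × latency. If latency spikes while requests keep arriving, running threads pile up even though less work completes. A hypothetical illustration (the numbers below are made up for the example):

```python
# Little's law: L = lambda * W, where L is average concurrency,
# lambda is throughput (queries/sec), and W is average latency (sec).

def concurrency(throughput_qps, latency_s):
    return throughput_qps * latency_s

# Healthy server: 1000 qps at 5 ms latency -> ~5 queries in flight.
normal = concurrency(1000, 0.005)

# Stalled server: same arrival rate, but latency jumps to 250 ms,
# so ~250 queries are now stacked up inside the server.
stalled = concurrency(1000, 0.250)
```

This is why a stall shows up simultaneously as higher latency, higher concurrency, and less completed work.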
That’s an example of a very short stall, but long stalls are important too.
Detecting Longer Faults
Some customers experienced slow-building stalls in their systems. The new fault detection algorithm is better able to detect such multi-second faults; the chart below shows one.
The algorithm can also detect even longer faults. Sometimes these are subtle unless you “zoom out” to see how things have slowly been getting stuck over time. Trick question: what’s stalling our server here?
Okay, it’s xtrabackup. Not really a trick question :-)
You might think this kind of thing is easy to detect. “Just throw an alarm when threads_running is more than 50,” you say. If you try that, though, you’ll see why we invented Adaptive Fault Detection. It’s not easy to balance sensitivity and specificity.
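To see why, consider what a static rule does on two hypothetical servers (the workloads below are invented for illustration; 50 is the arbitrary cutoff from the quote above):

```python
# A static threshold misfires in both directions: a busy-but-healthy
# server routinely exceeds the cutoff (false positives), while a small
# server can be completely stalled far below it (false negatives).

THRESHOLD = 50  # the arbitrary "threads_running > 50" rule

def static_alert(threads_running):
    return threads_running > THRESHOLD

# Healthy high-concurrency server: brief excursions over 50 are normal.
busy_server = [40, 55, 62, 48, 58]      # no stall, yet three alerts fire
false_positives = sum(static_alert(t) for t in busy_server)

# Stalled low-concurrency server: work has stopped with only 12 threads.
stalled_server = [3, 4, 12, 12, 12]     # real stall, zero alerts
missed = not any(static_alert(t) for t in stalled_server)
```

Any fixed number you pick trades one failure mode for the other, which is exactly the sensitivity/specificity balance a static threshold cannot strike across diverse systems.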
In addition to the improvements you’ll see, we’ve made a lot of changes to the code as well. Because the code is better organized and diagnostic tools are readily available, we can easily add support for different kinds of faults. And because it is testable, we can make sure we are truly measuring system work, the monitoring metric that matters most.
We occasionally find new and interesting kinds of stalls that we want to capture, and we are now in a position to more generically detect such tricky scenarios.
In summary, the improved fault detection algorithm finds entirely new classes of previously undetectable problems for our customers: bona fide “perfect storms” of complex configuration and query interactions.