At Velocity last week, I spoke about how we quantify abnormality in a system’s time-series metrics cheaply, in real time, and at high frequency.
Note that this is not the same thing as our Adaptive Fault Detection algorithm. Our abnormality algorithm is one of the low-level building blocks of the adaptive fault detection algorithm. But as I pointed out in the talk, if you look at a system’s metrics in short time intervals, you will find abnormalities constantly. That’s why abnormality is a blunt instrument, not good enough to significantly reduce false alarms. If you alert on abnormalities, you’ll get a lot of spam, just like you will with thresholds on a metric.
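To make the spam problem concrete, here is a minimal sketch. It is not the abnormality algorithm from the talk. It assumes a simple stand-in: a point is "abnormal" if it lies more than three standard deviations from the mean of a trailing window. The function name `abnormal_points`, the window size, and the synthetic data are all hypothetical choices for illustration.

```python
import random
import statistics


def abnormal_points(values, window=60, sigmas=3.0):
    """Return indices whose value deviates more than `sigmas` standard
    deviations from the mean of the preceding `window` samples.

    Assumed stand-in for an abnormality score, not the method from the talk.
    """
    flagged = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mean = statistics.fmean(history)
        stdev = statistics.pstdev(history)
        if stdev == 0:
            # Perfectly flat history: any change at all counts as abnormal.
            if values[i] != mean:
                flagged.append(i)
        elif abs(values[i] - mean) > sigmas * stdev:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    # Synthetic "healthy" metric: pure Gaussian noise, nothing actually wrong.
    rng = random.Random(42)
    healthy = [rng.gauss(100, 5) for _ in range(10_000)]
    flags = abnormal_points(healthy)
    print(f"{len(flags)} abnormal points in {len(healthy)} healthy samples")
```

Even on a series with no fault at all, a check like this can be expected to flag a small fraction of points. At one sample per second, that small fraction adds up to a steady stream of alerts, which is the sense in which abnormality alone is a blunt instrument.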
Still, abnormality is at least a place to start, right? In the progression towards true fault detection, you can think of it this way: a fault is more specific than an abnormality, and an abnormality is more specific than a threshold being crossed. This assumes that you agree with our definition of a fault, which is framed in terms of the system not getting its assigned work done.
The slides are embedded below. Comments? Questions? I’ll do my best to answer.
For further information on this topic, read 4 Statistical Process Control Rules to Help You Find Abnormalities in Your System.