Alerting when a metric crosses a threshold is one of the worst things you can do.
Why? Because threshold-based monitoring is prone to false positives and false negatives, which correspond to false alarms and missed alarms (a.k.a. useless noise and useless silence).
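To make the two failure modes concrete, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration: the CPU-utilization samples, the 90% threshold, the function name `threshold_alerts`, and the labels "healthy" and "degraded". It is not taken from Nagios or any real monitoring tool.

```python
# Hypothetical threshold: alert when CPU utilization exceeds 90%.
THRESHOLD = 90.0


def threshold_alerts(samples, threshold=THRESHOLD):
    """Return the indices at which a naive threshold check would fire."""
    return [i for i, value in enumerate(samples) if value > threshold]


# Assumed scenario 1: the system is healthy, but a short, harmless
# spike (say, a scheduled batch job) crosses the threshold.
healthy_with_spike = [40.0, 42.0, 95.0, 41.0, 39.0]

# Assumed scenario 2: the system is degrading, with utilization climbing
# well above its usual level, but it never crosses the threshold.
degraded_below_threshold = [40.0, 55.0, 70.0, 85.0, 88.0]

# False positive: an alert fires although nothing is wrong.
print(threshold_alerts(healthy_with_spike))        # [2]

# False negative: no alert fires although something is wrong.
print(threshold_alerts(degraded_below_threshold))  # []
```

Raising the threshold silences the first alert but makes the second case even harder to catch, and lowering it does the reverse.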
One of the worst things about most monitoring systems is the incredible amount of noise they generate. IT staff members react to this predictably: they filter out as many alerts as they can, and ignore most of the rest. As a result, a system that didn’t really work well to begin with becomes even less useful.
This problem is universal. When I speak at conferences, I ask how many people are using Nagios. Usually well over half of the hands go up. I then ask how many people do not have email filters set up to shuffle some Nagios alerts to /dev/null. There are usually one or two hands remaining, and everyone laughs uncomfortably.
We talk about root cause analysis for systems, but what’s the root cause of the email filters? It’s largely due to thresholds. Simple alive/dead status checks are easy to get right, but a lot of Nagios-style alerting is based on thresholds.
Threshold-based alerting never works. In theory it could work once in a while, but in practice it never does. In my next post I’ll explain why.
Picture credit to Wwarby.