Monitoring systems need to be created for humans to use. Lots of things are theoretically possible, and if practice doesn’t match theory, many engineers (myself included) can blame the victim. Be honest — as an engineer, have you ever built something perfect and elegant, with the capability to serve a user perfectly, and then found that the user didn’t use it the way you designed it?
That’s why I’ve been suggesting that thresholds are a problem in monitoring systems.
Thresholds stack the deck against humans. Thresholds have weaknesses, I think everyone agrees, but as engineers, we believe an escape hatch is enough. Surely we can work around the weaknesses of thresholds without much trouble? Shlomi’s comment on my earlier blog post speaks to this point:
You can choose to receive an alert on a threshold only after it recurs for a given amount of seconds/minutes. Thus, it’s OK for, say, the slave to lag behind master for over 1 minute; lag happens. But it’s not OK if the situation is like this for 20 minutes straight.
Also, it is possible to make the error dependant of other variables, most notably the time. It is OK for system to have high load between 1:00am and 3:00am, where nightly maintenance is at work.
I’m not picking on Shlomi. I know him personally and consider him a friend and a talented software engineer and DBA. What Shlomi is saying, essentially, is that if you curate your thresholds carefully, you can avoid the problems of threshold-based alerting.
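To make that kind of curation concrete, here is a minimal sketch of the two rules from the comment: alert only when a breach is sustained, and ignore a nightly maintenance window. The function name `should_alert` and its parameters are hypothetical, made up for illustration; this is not how Nagios or any particular tool implements it.

```python
from datetime import datetime, time, timedelta

def should_alert(samples, threshold, sustain, quiet_start=None, quiet_end=None):
    """Return True if the value stays above `threshold` for at least `sustain`.

    samples: list of (datetime, value) pairs, sorted by time (assumed input shape).
    quiet_start / quiet_end: optional time-of-day window (same day, no wrap
    past midnight) during which breaches are ignored, e.g. nightly maintenance.
    """
    breach_start = None
    for ts, value in samples:
        in_quiet = (
            quiet_start is not None
            and quiet_end is not None
            and quiet_start <= ts.time() < quiet_end
        )
        if value > threshold and not in_quiet:
            if breach_start is None:
                breach_start = ts          # first sample of this breach
            if ts - breach_start >= sustain:
                return True                # breach has lasted long enough
        else:
            breach_start = None            # breach ended or is excused; reset
    return False

# Replication lag of 90 seconds, sampled once a minute for 25 minutes.
noon = datetime(2024, 1, 1, 12, 0)
lagging = [(noon + timedelta(minutes=i), 90) for i in range(25)]
print(should_alert(lagging, threshold=60, sustain=timedelta(minutes=20)))
```

Even this small version has three numbers to choose per check (the threshold, the duration, and the window), and each has to be chosen again for every system that behaves differently.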
But that doesn’t happen in the vast majority of real monitoring deployments.
Why not? That’s a really good question. Rational, smart people do irrational things because they’re human. As humans, we’re all a little bit confused, lazy, inexperienced, forgetful, dysfunctional, or just generally imperfect, in many different ways. And a systems administrator or DBA is a busy person juggling lots of competing demands, with plenty of stress to go around. Dealing with Nagios in a tender and loving way, polishing the last imperfections from its configuration files with a soft cloth, is the last thing most of us actually spend our time doing.
I think we can agree that thresholds are at best an approximation to a proxy of the problem they’re trying to solve. The problem, in a nutshell, is to tell us that a system is in a bad state. We can’t really detect “bad” for sure, so we settle for “abnormal.” Abnormal is hard to define, but a threshold seems like a reasonable approximation at first glance. Now we have to think about how the system normally behaves — which is hard to get right and easy to get wrong — and we realize, as Shlomi points out, that normal is highly variable. As a result, we configure what we believe are really loose thresholds, far out at the edges of normal territory. At this point, we’re building a Rube Goldberg machine “threshold->abnormal->bad”, but it’s easy to lose sight of that and trick ourselves into thinking that the threshold was what we originally wanted.
That’s one way that our human-ness gets us in trouble.
Here’s another: our mistaken belief that most systems are similar, and that dissimilarities are outliers. If you think about that for a second, you’d say “of course most systems are different, I’d never make the mistake of believing that they are similar.” But after observing hundreds of companies over the last few years, I disagree. Most of us unconsciously behave as if most systems are similar, when that’s the furthest thing from the truth. Example: a large web company I visited last week splits and rebalances shards constantly. Every time they do so, they end up with systems that are not only wildly different, but have no history of “normalness” to serve as a foundation for any kind of decisions. As a result, their internal systems initially throw off false readings about the new systems’ trends and where they’re headed. Nothing weird about that at all.
Here’s another way we make thresholds a straw man, and monitor the threshold as though it’s the real thing. Another smart friend of mine, Singer Wang, asked in a comment (on the same previous post),
So you don’t want to get alerted when your Percona Server (MySQL) data drive reaches 90% or 95% full?
The real goal of a threshold on disk fullness is to avoid the disk running out of space. But is 90% or 95% the right number? No. Many disks are huge, with very stable utilization, and 5% or 10% free space is a vast amount of capacity. A threshold like that throws off false positives all the time (I have personal experience with this). Other disks are mostly unused until backups run every day, at which point they spike nearly full. I’ve experienced that too. We could customize the alerting for every disk drive in our entire environment, but see my previous discussion about why curated thresholds never really happen in the real world.
All of this misses the point. What are we trying to accomplish with disk fullness monitoring? Trying to avoid running out of space. Getting back to basics, what do I really need? I need enough warning to avoid an impending out-of-space problem. The real issue here is a prediction or forecasting problem. I don’t want to know when my disk is almost full, I want to know when it will be full soon. Say, a week before it’s predicted to happen. Framing the issue this way makes it obvious that 90% or 95% is reactive, not proactive.
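As a rough illustration of the forecasting framing, the sketch below fits a straight line to recent disk usage and estimates how many days remain until the disk is full. It is a plain least-squares extrapolation chosen for simplicity, not a description of any product's method; the names `days_until_full` and `needs_warning`, the sample format, and the one-week horizon are assumptions for the example. Real usage is rarely this linear (think of the daily backup spikes mentioned earlier), so treat it as a starting point only.

```python
def days_until_full(samples, capacity):
    """Estimate days from the last sample until usage reaches `capacity`.

    samples: list of (day, used) pairs, with `used` in the same unit as
    `capacity` (assumed input shape). Returns None when there is too little
    data or when usage is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None                      # all samples taken at the same time
    # Ordinary least-squares slope: growth in usage per day.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None                      # not growing, so no predicted fill date
    intercept = mean_u - slope * mean_t
    day_full = (capacity - intercept) / slope
    return max(0.0, day_full - samples[-1][0])

def needs_warning(samples, capacity, horizon_days=7):
    """True when the disk is predicted to fill within `horizon_days`."""
    remaining = days_until_full(samples, capacity)
    return remaining is not None and remaining <= horizon_days

# Usage grows by 10 GB a day on a 200 GB disk.
history = [(0, 100), (1, 110), (2, 120), (3, 130)]
print(days_until_full(history, capacity=200))
print(needs_warning(history, capacity=200))
```

Note that the same code treats a huge, stable disk at 95% and a small, fast-growing disk at 60% very differently, which is the behavior a fixed percentage cannot give you.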
I could go on. I spoke to another company last week, probably one of the best-run IT shops in America, whose staff members were able to quote from memory that 87% of their alerts for the previous week were false positives because they were expected due to planned maintenance. Most organizations don’t have a clue about those kinds of numbers. Monitoring is a religion at this company. They do it very, very well, and yet only 13% of their alerts are valid? If they work so hard on their monitoring, and can’t nail it perfectly, isn’t it obvious that the rest of the IT world is likely to do a worse job of it than they do?
These problems arise largely because monitoring systems are designed to work as though humans are machines. We’re not. We need monitoring systems that treat us like people. Make the hard things easy and the impossible things possible, as Perl’s creator Larry Wall has said. These are systemic problems that must have systemic solutions. That is to say, fix the monitoring system, don’t blame the human victims. You can’t solve this by using the same booby-trapped technologies and asking the humans to try harder to overcome the pitfalls. We’re setting ourselves up to fail and we just keep doing it. When did foot-gunning ourselves become a “best practice”?
Adaptive Fault Detection is one part of the solution we propose to that problem at VividCortex. It’s only one of a series of innovations we think monitoring badly needs. Monitoring and performance management tools, in general, have shown very little imagination in the last decade or two. Getting rid of the need for thresholds and monitoring the thing you cared about in the first place, namely whether the system is truly experiencing a fault (as opposed to just being abnormal or outside of something you expected), is a good start, but there’s a lot more to the story.
So the answer is yes, it matters. Monitoring needs to be built for humans. In my next post I’ll explain a little bit more about Adaptive Fault Detection and how it works.
Pic from pf_vanf