It’s a pretty fair assumption that if your database is big enough and complex enough to produce metrics that warrant a monitoring system, it’s also complex enough to produce tons of data that are ultimately more distracting than relevant. It’s not unusual to look at a bevy of monitoring possibilities and feel overwhelmed, uncertain about where to focus. Of course, every database is different, but there are some fundamental truths you should consider when you ask yourself, “What should I monitor?” Some of these ideas might seem simple, but if you don’t keep them in mind, you’d be surprised how easy it can be to lose sight of the big picture.
Know Your Goal. What are you trying to achieve by monitoring in the first place? Ultimately, the real question of effective monitoring is, “What work is getting done?” Systems do work for you; the goal of monitoring is to make sure those systems are behaving correctly, for example that queries are executing. If the work is getting done, there is no immediate problem that needs to be resolved. There might be an impending problem, such as a resource reaching capacity (e.g., a disk filling up), but you have plenty of time to address that before the work itself is affected in a practical way.
So, what is your goal? To make sure you’re looking at a proper representation of the work (often queries) executing within your system and that your attention is focused and calibrated so that when a real obstacle to that work appears, you can see it right away and address it accordingly.
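As a rough illustration of "work first, resources second," here is a minimal Python sketch. Everything in it is an assumption made for the example: the `Snapshot` fields, the `classify` function, and the threshold values are hypothetical and would need to be tuned for a real system. The idea is simply that an interruption to the work itself deserves immediate attention, while an approaching capacity limit can be handled on a slower path.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """One hypothetical sample of database health (all fields are assumptions)."""
    queries_per_sec: float     # queries completed per second
    error_rate: float          # fraction of queries that failed, 0.0 to 1.0
    disk_used_fraction: float  # fraction of disk in use, 0.0 to 1.0


def classify(s, min_qps=1.0, max_error_rate=0.05, disk_warn=0.85):
    """Return 'page', 'ticket', or 'ok'. Thresholds are illustrative only."""
    # The work itself is affected: queries have stopped or are failing.
    if s.queries_per_sec < min_qps or s.error_rate > max_error_rate:
        return "page"
    # Work is fine, but a resource is approaching capacity: not urgent.
    if s.disk_used_fraction >= disk_warn:
        return "ticket"
    return "ok"


print(classify(Snapshot(500.0, 0.001, 0.90)))  # work is fine, disk is filling
print(classify(Snapshot(0.0, 0.0, 0.50)))      # no queries completing
```

Note that the disk check only runs when the work checks pass, so a full disk that is already breaking queries shows up as a work problem rather than as a capacity warning.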
Cut through the Fluff. Recognize that your time and your team’s time are valuable: you don’t want to spend hours sifting through metrics that are ultimately not helping you. This seems obvious, but it’s easy to lose sight of this simple principle: aim for what’s important and actionable. There are two common “resources” that can lead you off-track:
1) Bad plugins. If you’re using some pre-packaged Nagios plugin, there’s a good chance that its alerts are less helpful than you think. Many of these are based on outputs from things like SHOW STATUS or the MySQL reference manual and include far too many alerts for way too many variables.
2) “Top-N Metrics to Monitor” lists that turn up in a Google search. These lists are almost guaranteed to include things that will cause lots of false alarms. As with the bad plugins, it’s easy to compile these lists and spit out dozens of things to watch (a sizable database will have hundreds of candidates for alerts), but the real trick is to understand which metrics provide the most value and the fewest false positives. It’s easy for a list author to put together a set of alerts that will definitely trigger, but it’s another task altogether to compile triggers with value.
With that in mind, treat the list above as a primer on what you should not look at. The metrics these sources push are secondary and not actionable. If you send an alert for something that can’t be fixed, you’re doing it wrong!
Which brings us to our next rule:
False Positives are the Archenemy. Intuitively, you might think that “bad monitoring” is monitoring that misses too many metrics and therefore misses big problems when they appear. Unfortunately this isn’t the case. “Bad monitoring” is often monitoring that exposes itself to too many false positives, meaning that when something important does come up, it can get lost in the flood. Your tolerance for false positives should be much, much lower than you might naturally assume.
This talk by Dan Slimmon is an excellent exploration of false positives and some of the ways they can cause a system to become self-defeating.
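To put numbers on why false positives are so damaging, here is a small worked example in Python. It is not taken from the talk; the `alert_precision` function and the input figures are assumptions chosen for illustration. It applies Bayes' rule to answer one question: when an alert fires, how likely is it that something is really wrong?

```python
def alert_precision(base_rate, sensitivity, false_positive_rate):
    """Fraction of fired alerts that correspond to a real problem.

    base_rate:           fraction of checks during which a real problem exists
    sensitivity:         probability the alert fires when there is a problem
    false_positive_rate: probability the alert fires when there is no problem
    """
    true_alerts = base_rate * sensitivity
    false_alerts = (1.0 - base_rate) * false_positive_rate
    total = true_alerts + false_alerts
    return true_alerts / total if total > 0 else 0.0


# Hypothetical numbers: problems are present 1% of the time, the check
# catches 99% of them, and it misfires on 5% of healthy samples.
precision = alert_precision(base_rate=0.01, sensitivity=0.99,
                            false_positive_rate=0.05)
print(f"{precision:.1%} of alerts are real")
```

Under these assumed inputs, only about one alert in six points at a real problem, even though the check sounds accurate on paper. Because real problems are rare, even a modest false positive rate dominates the alert stream.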
Indifference and Laziness are the Arch-Archenemy. The real reason false positives are so dangerous is that they can quickly lead to indifference and laziness in the people responding to them. Just imagine: if you’re getting dozens of alerts that turn out to be nothing but noise, you become desensitized, and when something important really does come up, nobody will be ready or willing to respond. And doesn’t that defeat the purpose of monitoring altogether?
To find out more about the principles behind effective monitoring, check out this recording of our “What Should I Monitor?” webinar with Baron Schwartz, VividCortex’s CEO, including Baron’s tried-and-true list of the top ten areas you should be concerned about monitoring.