Baron wrote an excellent blog post for O’Reilly on detecting problems in a system. It covers the state of monitoring, a systematic approach to problem detection, and alerting strategies. Check out the article for yourself, or read these tidbits:
On false positives
In the end, the system administrator is on the horns of a dilemma: should I monitor for every possible failure mode I can imagine? Or just the ones I’ve seen before? What about the ones I can’t imagine and haven’t seen yet? And if I monitor as completely as I think I can, will I get too many false positives? If you haven’t been down this path before, I can give you a spoiler alert: here be false-positive dragons.
On a system workload focus
That’s why I think workload should be regarded as of primary importance. We have servers to do work for us. We don’t measure a system’s success by how busy the CPUs and disks are, or how low the cache hit ratio is. We measure success by how much work the system can do for us, and how consistently. In other words, we want to know the speed and quality of getting-work-done.
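To make "speed and quality of getting-work-done" a little more concrete, here is a minimal sketch of a workload summary in Python. It is not from Baron's article. The function name `summarize_workload`, its parameters, and the choice of throughput plus mean and 99th-percentile latency are all assumptions made for illustration; any measure of how much work completes, and how consistently, would serve the same purpose.

```python
def summarize_workload(latencies_s, window_s):
    """Summarize requests completed during one observation window.

    latencies_s: response time of each completed request, in seconds
                 (hypothetical input, e.g. parsed from a request log).
    window_s:    length of the observation window, in seconds.
    """
    if not latencies_s or window_s <= 0:
        raise ValueError("need at least one latency sample and a positive window")

    ordered = sorted(latencies_s)
    n = len(ordered)

    # Nearest-rank 99th percentile: the smallest sample that is >= 99% of
    # all samples. Integer arithmetic computes ceil(0.99 * n) exactly.
    rank = max(1, -(-99 * n // 100))

    return {
        # How much work got done per second.
        "throughput_per_s": n / window_s,
        # How fast a typical request was.
        "mean_latency_s": sum(ordered) / n,
        # How bad the slow tail was; a gap between this and the mean
        # suggests inconsistent service.
        "p99_latency_s": ordered[rank - 1],
    }


if __name__ == "__main__":
    # Four requests completed in a two-second window.
    print(summarize_workload([0.2, 0.1, 0.4, 0.3], window_s=2))
```

Tracking numbers like these over time describes the work the system is doing directly, rather than inferring health from how busy the CPUs and disks are.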
You can read the entirety of the article here.