This post was featured in our July 2015 anthology of most popular VividCortex blog posts of all time. To see more of our top content, check out that list here.
I’ve often been asked a question like the following: “I have installed a template to create graphs (charts) of my systems, but I don’t know what they mean. Can you explain them? Which ones should I watch?” This happens a lot with tools that output a generic page full of charts on a complex system like MySQL, which has a lot of metrics and therefore a lot of charts.
This isn’t the best way to use graphs. The problem is that graphs encourage people to approach things from an unproductive angle.
They do this by appearing to give an answer before you’ve asked a question. That encourages a backwards troubleshooting method, which quickly turns into trial-and-error and logical mistakes, such as confusing effects with causes. Although this sometimes works (hey, even broken clocks are sometimes right), it’s not efficient, it’s not complete, and when it goes wrong, much mischief can result.
Instead, here’s how I’d think of charts: it’s nice to know they’re there if you want them. When you observe a problem in your systems, a good methodical approach to troubleshooting is to formulate some hypothesis about the problem, based on what you know of the system and its functioning, and then look for evidence that corroborates or disproves the hypothesis. (If you want to be super-disciplined, read up on the null hypothesis, which is the gold standard.)
Smart troubleshooters follow a disciplined process that helps them avoid fallacies and pitfalls. My approach to solving problems is heavily influenced by Cary Millsap’s Method R, for example. But even smarter troubleshooters pay attention to intuition too, and quickly try to assess possible shortcuts.
Charts can be a great help for this. Here’s a (very simple, not bulletproof) example:
- Response times on the web tier are very high all of a sudden.
- Experience shows that the web tier can hang if the disk is too busy.
- Is the disk too busy? No, the charts show that the disk is idle.
- This shortcut is not productive. Back to the beginning of the flowchart.
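The shortcut above can be written down as a small sketch: state the hypothesis first, then ask the metric a yes/no question. Everything named here is an assumption made for illustration (the `Hypothesis` class, the `disk_utilization_pct` metric name, the 80% threshold, and the sample values); it is not taken from any particular monitoring tool.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, Sequence


@dataclass
class Hypothesis:
    """A question asked of one metric (hypothetical structure)."""
    statement: str
    metric: str
    # Returns True if the samples are consistent with the hypothesis.
    is_consistent: Callable[[Sequence[float]], bool]


def check(hypothesis: Hypothesis, metrics: Dict[str, Sequence[float]]) -> str:
    """Return one of the two outcomes: refuted, or worth a closer look."""
    samples = metrics[hypothesis.metric]
    if hypothesis.is_consistent(samples):
        return "needs further investigation"
    return "refuted"


# Hypothesis drawn from experience: a busy disk can hang the web tier.
# The 80% threshold is an arbitrary illustrative cutoff.
disk_busy = Hypothesis(
    statement="The web tier is slow because the disk is too busy",
    metric="disk_utilization_pct",
    is_consistent=lambda samples: mean(samples) > 80.0,
)

# Made-up recent samples showing an idle disk.
recent_metrics = {"disk_utilization_pct": [2.0, 3.5, 1.0, 4.0]}

verdict = check(disk_busy, recent_metrics)
print(f"{disk_busy.statement}: {verdict}")
```

Note that the check can only refute the hypothesis or leave it open. A busy disk would be consistent with the hypothesis without proving it.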
By contrast, just staring at a page full of charts is relatively unhelpful. Suppose you see two that have a very similar shape. Is one of them the cause, and the other an effect? Or the reverse? Or are both showing the effect of a cause you’re not seeing? Or are they just noise? You can waste a lot of time on this kind of guesswork.
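To see why similar shapes prove so little, here is a minimal sketch with invented numbers. Two hypothetical metrics, `queries_per_sec` and `disk_reads_per_sec`, are both derived from a hidden driver (`user_load`) that would not appear on the page of charts. Neither metric causes the other, yet they correlate almost perfectly.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hidden cause that nobody charted (all values are made up).
user_load = [10, 20, 35, 50, 40, 25, 15, 30]

# Small fixed "noise" so the two series are not exact multiples.
noise_a = [1, -1, 2, -2, 1, -1, 2, -2]
noise_b = [-0.5, 0.5, 0.5, -0.5, -0.5, 0.5, 0.5, -0.5]

# Two charted metrics, each driven independently by the hidden load.
queries_per_sec = [2 * u + 5 + e for u, e in zip(user_load, noise_a)]
disk_reads_per_sec = [0.5 * u + 1 + e for u, e in zip(user_load, noise_b)]

r = pearson(queries_per_sec, disk_reads_per_sec)
print(f"correlation between the two charted metrics: {r:.3f}")
```

The coefficient comes out close to 1 even though the only causal link runs through a series that is not on either chart, which is exactly the trap of reasoning from shape alone.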
It goes beyond that, too. When you look at a page full of graphs, it can be tempting to assume that you’re seeing everything there is to see. Not so. You’re only seeing graphs someone curated, which may not cover all the metrics that the system provides. And the system almost certainly doesn’t provide all of the metrics it could, or all the ones you need. If you ask a developer of one of these systems, you’ll probably find out that the metrics you assume must be important and meaningful are mostly accidental: each was added by a developer to solve a specific issue at a point in time.
In fact, this is such a big problem that I’ve sometimes said that a page full of graphs is intrinsically meaningless. That is, the graphs have no meaning in and of themselves; they only have a meaning when you ask a question for which they can provide some supporting evidence.
So, to sum up: don’t look for meaning in charts; that’s asking for an answer without first asking the question, to paraphrase Brendan Gregg. Instead, formulate a meaningful question and then use the charts to arrive at one of two outcomes: a) the hypothesis is refuted, or b) further investigation is needed.