How Accurate Are Computed Columns?

Posted by Baron Schwartz on Apr 7, 2014 5:54:00 AM

If you’ve used our Top Queries feature to profile your MySQL server’s queries, perhaps you noticed the little icon to choose what columns display in the table. Click that icon and you can add columns for computing per-query CPU, IO, and other resources. Here’s a screenshot:

Computed_Columns

The result is very powerful for scientifically measuring and analyzing server load and how much each type of query contributes to it. We wrote about this in an earlier blog post.

These extra columns are not measured directly. They are computed with a special type of regression we invented, which we call weighted linear regression. As a result, there is some inaccuracy. How much? If you’re going to rely on the results, you need to know how good they are.

The usual metrics of quality for regression are things like R-squared, T-statistics, standard error, and so on. The problem is, these are all but useless to anyone who doesn’t have a firm grounding in statistics. And even if you are a statistician, it’s labor-intensive to read and understand such metrics. Not only that, but these types of numbers give only a very limited picture of quality.

We wanted something that’s simultaneously much simpler and clearer, and communicates much more information about the quality of the results and how much they should be trusted.

Perhaps the most obvious way to show uncertainty in these types of measurements is to show a +/- indicator. We tried this. It’s hard to read, and just like the statistical metrics, it doesn’t convey much information.

To solve this problem, we turned as usual to data visualization. A human can look at a scatterplot of one variable versus another and quickly understand things such as:

  • Are the variables related?
  • How strong and clean is the relationship?
  • Are there outliers?
  • How many data points are there?
  • Are there interesting or unusual patterns in the data?

Take a look at the following screenshot and see if you agree (it’s the same Top Queries view as above). With the click of a mouse, you can toggle between looking at the magnitude or quantity of the computed column, and looking at the quality or trustworthiness of that computation.

Computed_Columns_2

The obvious question is what’s being plotted. On the X axis we’re plotting the variable you’re ranking by (usually Total Time), and on the Y axis we’re plotting this query’s portion of the column, e.g. CPU or IO. The scatterplots aren’t labeled because they don’t need to be. That’s the beauty of scatterplots. You can understand them without labels. Oh, and you can also see things like the relationship between queries; look at the consistent pattern of the top two queries (notice the count of #1 is half the count of #2, too).

In that server, the relationship between some of the variables is very strong and clean. In other cases, not so much. Again, the scatterplots instantly help you understand that. Here’s another server with a more mixed workload:

Computed_Columns_3

How much work does it take to interpret your performance analysis and metrics tools? Do you have to scan your eyes down rows and columns of text and numbers, or can you glance at the tool and take away the essence of the data? Do your tools help you validate their results? Or do they just give you page after page of graphs?

At VividCortex we work very hard to make your tools do the work for you, instead of requiring you to become an expert in simple low-level things.

If you haven’t tried our offering, why not? It’s simple and there’s no commitment. Get started in a few minutes and find out things you never knew about your servers.

Recent Posts

Posts by Topic

see all