At VividCortex, we deal with a lot of data, day in and day out. It’s not just our customer data: we actually consume a lot of data and reports produced by others as well. For example, there are reports in Salesforce’s CRM, various analytics solutions we use, industry reports we read, and on and on. Speaking personally, I probably study at least 10 charts a day in depth to try to understand their meaning. And that number only grows as our business and activity grow.
Many of these charts could be tremendously improved with a better choice of data visualization. The right data visualization tells a story instantly. The wrong one leaves people puzzling over what the chart means, racking their brains to extract the information. At best such a chart is hard to interpret; at worst it hides vital information or even tells lies.
At VividCortex we work hard to communicate dense information visually and instantly, and we’ve become very fond of the scatterplot. In my unscientific opinion, people producing reports and slideshows should use scatterplots far more than they do. I see so many pie, bar, and line charts that would be much better as scatterplots.
Without beating a dead horse, let’s take a look at a few charts that confuse. You’ve probably seen some of these kinds of things before, but just to illustrate…
I’ve been reading some industry reports on sales management, which include a lot of survey results on things like compensation and performance. Exhibit A is a chart from one of these.
What does this pie chart mean? Words really fail me here. If it weren’t for the fact that it would intrude upon this post’s content, I’d insert an animated GIF here to try to express myself. True, I’ve removed the text, labels and other context, but trust me, this chart is no better or easier to interpret with context. This is just a cardinal sin of data visualization (not to mention the fact that it’s a 3D pie chart… please, someone send this person one of Tufte’s books).
Exhibit B is a bar chart from a similar report, which attempts to show the relationship between sales reps who meet quota, and the company’s attrition rate in sales:
Again, the story this chart tells is extremely hard to interpret. I might claim it’s nearly impossible.
Now let’s see how we might improve these charts with scatterplots.
Scatterplots To The Rescue
Both of the charts shown above are trying to tell essentially the same story: the relationship between two things. If this is the goal of a chart, I suggest that you consider a scatterplot. It shows Thing 1 on the X-axis, and Thing 2 on the Y-axis.
You really can’t get much better than that: if you want to show the relationship between two things, put them on the axes. The visualization corresponds directly to what you want to convey.
Then, of course, you simply plot each of your data points in X-Y space.
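Here’s a minimal sketch of that recipe in Python, using only the standard library to emit an SVG. The function name scatter_svg, the layout constants, and the sample numbers (quota attainment versus attrition) are all made up for illustration; they don’t come from any of the reports discussed here.

```python
from xml.sax.saxutils import escape


def scatter_svg(points, x_label, y_label, width=400, height=300, pad=40):
    """Return an SVG string with one circle per (x, y) pair. Hypothetical helper."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    # Fall back to a span of 1 if every value on an axis is identical.
    x_span = (x_max - x_min) or 1
    y_span = (y_max - y_min) or 1

    def to_px(x, y):
        px = pad + (x - x_min) / x_span * (width - 2 * pad)
        # SVG's y coordinate grows downward, so flip it.
        py = height - pad - (y - y_min) / y_span * (height - 2 * pad)
        return px, py

    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">']
    # The two axes: Thing 1 along the bottom, Thing 2 up the left side.
    parts.append(f'<line x1="{pad}" y1="{height - pad}" x2="{width - pad}" y2="{height - pad}" stroke="black"/>')
    parts.append(f'<line x1="{pad}" y1="{pad}" x2="{pad}" y2="{height - pad}" stroke="black"/>')
    parts.append(f'<text x="{width / 2}" y="{height - 8}" text-anchor="middle">{escape(x_label)}</text>')
    parts.append(
        f'<text x="12" y="{height / 2}" text-anchor="middle" '
        f'transform="rotate(-90 12 {height / 2})">{escape(y_label)}</text>'
    )
    # One mark per observation: no binning, no averaging.
    for x, y in points:
        px, py = to_px(x, y)
        parts.append(f'<circle cx="{px:.1f}" cy="{py:.1f}" r="3" fill="steelblue"/>')
    parts.append("</svg>")
    return "\n".join(parts)


# Invented sample data: (% of reps meeting quota, % annual attrition) per company.
points = [(55, 30), (62, 27), (70, 22), (78, 20), (85, 14), (93, 12), (100, 8)]
svg = scatter_svg(points, "Reps meeting quota (%)", "Sales attrition (%)")
```

Write the svg string to a file and open it in a browser to see the plot. Every input pair becomes exactly one circle, which is the whole point: nothing is aggregated away.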
Benefits of a scatterplot:
- It shows a lot of data in full fidelity. It doesn’t aggregate or reduce information to a couple of values. Most charts you’ll see, like the pie charts and bar charts above, are the aggregates of dozens or hundreds of numbers. But scatterplots show all the numbers, not their aggregates. This gives you, the reader, much more power. It doesn’t force you to agree or disagree with the author’s intended interpretation, but lets you make up your own mind.
- It’s information-dense without being hard to read. The nuances come through clearly because your retinas do the work. You can see groupings, relationships and trends (like the up-and-right trend above) at a glance.
- It shows the quality of the relationship between the values plotted. You can see how tight or linear the trend is, visually. At the same time you can also see outliers instantly. What’s that Hat doing in the bottom middle of the chart, away from the rest of the points? If you tried to express this numerically, you’d end up with correlation coefficients, confidence intervals, and the like. Those are really hard to interpret.
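To make that last point concrete, here’s a small sketch with invented numbers. It computes a plain Pearson correlation coefficient by hand (pearson_r is just an illustrative helper) for a perfectly linear dataset, then again after adding a single stray point.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Invented data: eight points lying exactly on the line y = 2x.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2, 4, 6, 8, 10, 12, 14, 16]
r_clean = pearson_r(xs, ys)

# The same data plus one outlier at (9, 0).
r_with_outlier = pearson_r(xs + [9], ys + [0])

print(round(r_clean, 3), round(r_with_outlier, 3))
```

With these numbers, one outlier drags the coefficient from 1.0 down to 0.4. Reading "0.4" in a report, you couldn’t tell whether the relationship is loose everywhere or tight with one oddball. On a scatterplot you’d see it immediately.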
We use scatterplots in several places in VividCortex’s product. One example is query samples, which plot timestamp versus latency.
Another place is in our computed columns, which use our special regression algorithm to show how related two metrics are, e.g. query latency versus CPU. This is great, but regression isn’t a magic wand you can just wave, so humans need to inspect and see whether the results even make sense. A scatterplot is the best way to do that. Tell me if anything but a scatterplot would show the information you can see in the following two plots:
The first plot clearly has two kinds of data, both strongly clustered, as well as an outlying point. The second plot has one strongly clustered dataset with some outlying points (including the ones at the bottom left). I don’t think you can get that information from anything other than a scatterplot. If you’re interested, you can read more on the statistical basis of our regression algorithm.
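Here’s a rough sketch of why a human needs to look at the points. It uses ordinary least squares as a stand-in (this is not VividCortex’s regression algorithm) and two invented clusters of (CPU, latency) values. The helper ols_slope and all the numbers are assumptions for illustration only.

```python
def ols_slope(points):
    """Slope of the ordinary least-squares line through (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


# Two invented clusters. Inside each one, y does not depend on x at all.
cluster_a = [(1, 2), (2, 1), (1, 1), (2, 2)]
cluster_b = [(10, 11), (11, 10), (10, 10), (11, 11)]

slope_a = ols_slope(cluster_a)
slope_b = ols_slope(cluster_b)
slope_all = ols_slope(cluster_a + cluster_b)

print(slope_a, slope_b, round(slope_all, 3))
```

Within each cluster the slope is zero, yet the fit over all eight points has a slope close to 1. The regression isn’t wrong, exactly, but it describes the gap between two groups rather than a trend within either one, and only the plot makes that obvious.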
If you’re trying to show the relationship between two kinds of quantities, I suggest you default to scatterplots instead of hiding information behind pie charts, bar charts, or otherwise aggregating and categorizing information needlessly. Be especially wary of pie charts as a dataviz tool. There is almost always a better alternative to a pie chart, even if it’s not a scatterplot!
Here are some resources you might find helpful:
- A meta-chart of choosing a chart type and the Encyclopedia of Slide Layouts
- Naomi B. Robbins
- Books by Nancy Duarte, such as slide:ology
- The D3 examples gallery
- Details on VividCortex’s regression algorithm
PS: Please, go ahead and critique this article in the comments. It’s pretty much a law that any article on dataviz is going to make some dataviz gaffes, no?