5 Ways Data Visualizations can Lie

Published in Towards Data Science · 9 min read · May 26, 2017

In an increasingly data-driven world, charts have started appearing everywhere from news feeds to activity trackers. Some are beautifully visualized (the NY Times Visualization Lab does amazing work), but more often than not the quality is lacking and the results are confusing or even misleading.

We use visualizations to compress data to create intuitive understanding of trends. We will go through some of the basic mistakes that can be made when presenting data, but first a quick look at the anatomy of a chart:

Elements of a visualization can be modified in ways that can either emphasize or diminish the impact of the data. A change to the number of gridlines or ticks can emphasize granularity, labels can be carefully crafted to create bias, or color choices can evoke subconscious emotional reactions. I will explore ways visualizations can be modified including:

Axis Cropping
Axis Scaling
Binning
Pie Charts
2 Axis Plots

Each of these techniques can also be used, in specific situations, to help a visualization tell a clearer story. Like most things in data science, they are tools that can be used or abused.

Axis Cropping

The X and Y axes are the key to understanding the scale and relationships in a plot. An axis can be cropped in the X or Y dimension (or both) to show a subset of the data. Depending on the program and the type of plot, this can happen accidentally as well as deliberately. I think all of us could find plots that cut data to make their point look more valid than it is.

Here are three histograms of the same data. Limiting the axis not only reduces how much is visible, but also shifts the ratios between bars. In the middle example, cropping the y axis makes the increase look steeper than it does in the unedited version.
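The ratio shift can be checked with a little arithmetic. Here is a minimal sketch, assuming two made-up bar values and a hypothetical helper called `apparent_ratio` (neither comes from the histograms in this post). A bar's drawn height is its value minus the axis baseline, so raising the baseline inflates the visual ratio between two bars.

```python
def apparent_ratio(a, b, baseline=0.0):
    """Ratio of drawn bar heights (b over a) when the y-axis starts at `baseline`.

    A bar's drawn height is (value - baseline), so the visual ratio
    differs from the true ratio b / a whenever baseline != 0.
    """
    if baseline >= min(a, b):
        raise ValueError("baseline must be below both values")
    return (b - baseline) / (a - baseline)


# Hypothetical values for illustration only.
low, high = 50.0, 55.0

true_ratio = apparent_ratio(low, high)                    # axis starts at 0
cropped_ratio = apparent_ratio(low, high, baseline=45.0)  # axis starts at 45

print(f"true ratio:    {true_ratio:.2f}")     # 1.10
print(f"cropped ratio: {cropped_ratio:.2f}")  # 2.00
```

A 10% difference reads as a doubling once the axis starts at 45, which is why the tick labels matter so much on a cropped chart.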

Sometimes there is a good reason for cropping. If the idea of a visualization is to tell a story, this can be an invaluable tool to emphasize a change. This is a great article laying out when it is actually helpful:

It's OK not to start your y-axis at zero

We make thousands of charts a year at Quartz, and when we receive complaints about them, it's usually that the y-axis…

qz.com

However, I have seen multiple examples where cropping is used to confuse people by exaggerating the relative scale to make differences look larger than they are in reality (made even worse when the numbers on the axis are omitted entirely). This is an amazing post about how a chart was heavily edited to make something look different from the actual results:

Butcher: which part of the leg do you want? Me: All of it, in five pieces please

This ABC News chart seemed to have taken over the top of my Twitter feed so I better comment on it. Someone at ABC News…

junkcharts.typepad.com

Cropping a graph’s axis can hide all sorts of problems. When I was trying my first machine learning problem, the Kaggle Titanic Competition, there were multiple missing ages in the provided data set. The histogram below shows the distribution of ages (for the passengers that had one).

I read through other kernels and took inspiration from their approaches to the problem. One submission in particular used Mice (a package in R) to fill in the ages. I copied what they had done and ran it to understand how it worked. Every time, I got a huge spike at around age 16, but the visualizations in their notebook did not display any sort of spike. On the left is what I found when I ran it, and on the right is what I saw in their notebook:

It took hours before I realized their histogram was set to hide any ages below 16, thereby obscuring the spike in the data. I used an alternative approach to filling in the ages, but it was an educational moment for me.
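One habit that would have saved those hours is counting how many points a cropped range leaves out. The sketch below is plain Python rather than the R code from that notebook. The ages are made up for illustration, and `histogram_counts` is a hypothetical helper, not part of any library.

```python
import math


def histogram_counts(values, lower, upper, width):
    """Count values in [lower, upper) using fixed-width bins.

    Returns (counts, dropped), where `dropped` is the number of values
    outside the range, which therefore never appear in the plot.
    """
    n_bins = math.ceil((upper - lower) / width)
    counts = [0] * n_bins
    dropped = 0
    for v in values:
        if lower <= v < upper:
            counts[int((v - lower) // width)] += 1
        else:
            dropped += 1
    return counts, dropped


# Made-up ages with a cluster at 15; NOT the Titanic data.
ages = [2, 5, 9, 14, 15, 15, 15, 15, 15, 22, 25, 31, 38, 47, 60]

full_counts, full_dropped = histogram_counts(ages, 0, 80, 10)
cropped_counts, cropped_dropped = histogram_counts(ages, 16, 80, 10)

print(full_counts, full_dropped)        # [3, 6, 2, 2, 1, 0, 1, 0] 0
print(cropped_counts, cropped_dropped)  # [2, 1, 1, 1, 1, 0, 0] 9
```

The cropped view looks flat and unremarkable, yet it silently discards 9 of the 15 values, including the whole cluster.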

Axis Scaling

When comparing graphs, it is important to check whether they are set to the same scale. The axes are often automatically scaled to best fit the data, depending on which visualization program is being used. When charts do not contain identical data, an inch in one chart could equal 10 and in another it could equal 30.

I once learned this the hard way when I was working on a dataset for a client. After drawing several conclusions, I was getting ready to finalize the presentation when I realized that the two charts I was comparing were on completely different scales. I caught it in time, but I modified the conclusions in my presentation to be a bit more cautious.

I have also been working on Burlington, Vermont’s open data for police violations (see my post here) and ran across the same problem. Here are two different streets in Burlington with the count of violations for each.

Above, it looks like Church Street beats Main Street on trespassing by a factor of four, but below, when we put them on the same scale, it is clear that the amounts of trespassing are much closer. It is also much clearer that, thanks to traffic incidents, Main Street has more violations in total, something the initial charts would not tell you at a glance.

Unlike the first problem, this is more often a mistake. The default behavior of visualization tools often introduces the problem automatically. Just remember the importance of comparing apples to apples when looking at multiple plots, and to look at the axes.
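To see how much auto-scaling can distort a comparison, here is a small sketch. The counts are invented to mimic the shape of the story above and are not the actual Burlington figures. `shared_top` and `drawn_height` are hypothetical helpers. Many plotting libraries offer a built-in way to share an axis (in matplotlib, for example, `plt.subplots(..., sharey=True)`), so in practice you rarely need to compute this by hand.

```python
def shared_top(*series, pad=0.05):
    """One upper y-limit that fits every series, padded by the fraction `pad`."""
    return max(max(s) for s in series) * (1 + pad)


def drawn_height(value, top):
    """Fraction of the chart height a bar occupies when the axis runs 0..top."""
    return value / top


# Invented counts for illustration; NOT the real open data.
church = {"trespass": 40, "traffic": 12, "noise": 8}
main = {"trespass": 35, "traffic": 160, "noise": 10}

# Auto-scaling: each chart's axis tops out at its own largest bar.
auto_ratio = (drawn_height(church["trespass"], max(church.values()))
              / drawn_height(main["trespass"], max(main.values())))

# Shared scale: both charts use the same upper limit.
top = shared_top(church.values(), main.values())
shared_ratio = (drawn_height(church["trespass"], top)
                / drawn_height(main["trespass"], top))

print(f"trespass bars, auto-scaled: {auto_ratio:.2f}x")    # 4.57x
print(f"trespass bars, shared axis: {shared_ratio:.2f}x")  # 1.14x
```

With separate auto-scaled axes, the two trespassing bars look more than four times apart. On a shared axis the gap shrinks to the true ratio of 40 to 35.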

Binning

Histograms are a very useful way to understand the distribution of data. In some ways they trade absolute accuracy for general understanding by counting the number of data points that fall within a certain range. This is called binning: a bin count of 10 splits the range of the data into 10 equal-width buckets.

There are several formulas that can be used to determine the bin size: Sturges’ Formula, the Rice Rule and the Freedman-Diaconis Choice are all good places to start. But let’s first consider what different binning can look like with a sequence of histograms:
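For reference, here is a rough sketch of those three rules in plain Python, using their commonly quoted forms. The function names are my own, and the quartile method (the default of `statistics.quantiles`) is an assumption. Other tools may compute quartiles slightly differently and give a different Freedman-Diaconis result.

```python
import math
import statistics


def sturges_bins(n):
    """Sturges' formula: k = ceil(log2(n)) + 1."""
    return math.ceil(math.log2(n)) + 1


def rice_bins(n):
    """Rice rule: k = ceil(2 * n^(1/3))."""
    return math.ceil(2 * n ** (1 / 3))


def freedman_diaconis_bins(values):
    """Freedman-Diaconis: bin width h = 2 * IQR / n^(1/3), k = ceil(range / h)."""
    n = len(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    width = 2 * (q3 - q1) / n ** (1 / 3)
    if width == 0:
        raise ValueError("IQR is zero; the rule gives no usable bin width")
    return math.ceil((max(values) - min(values)) / width)


# Hypothetical sample: the integers 1..100.
sample = list(range(1, 101))

print(sturges_bins(len(sample)))       # 8
print(rice_bins(len(sample)))          # 10
print(freedman_diaconis_bins(sample))  # 5
```

Even on the same 100 points the three rules suggest 8, 10 and 5 bins, so a rule is a starting point rather than an answer.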

In the examples above, it is not hard to see how changing the bins can change the insights. Through careful binning, some irregularities can be obscured or outliers at specific values smoothed over.

The y axis scale also changes with the number of bins, and this must be considered. If there are 500 total values split into 5 bins versus 100 bins, the average count per bin drops from 100 to 5, so the y axis shrinks by a factor of 20. If histograms are being compared against each other, make sure the bins are the same and the axes are as well.
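The 500-value case can be worked through directly. This is a sketch on made-up, evenly spread data, with hypothetical helpers `bin_counts` and `to_density`. It also shows one common remedy, which is rescaling counts to a density so that histograms with different bin counts share a comparable y axis.

```python
def bin_counts(values, n_bins):
    """Equal-width bin counts over [min, max]; the last bin includes the max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


def to_density(counts, total_width):
    """Rescale counts so the bar areas sum to 1, whatever the bin count."""
    n = sum(counts)
    bin_width = total_width / len(counts)
    return [c / (n * bin_width) for c in counts]


# Made-up data: 500 evenly spread values, 0..499.
values = list(range(500))
span = max(values) - min(values)

coarse = bin_counts(values, 5)
fine = bin_counts(values, 100)

print(max(coarse), max(fine))  # 100 5

# As densities, the two histograms peak at the same height.
print(f"{max(to_density(coarse, span)):.6f}")
print(f"{max(to_density(fine, span)):.6f}")
```

The raw counts peak at 100 and 5 for identical data, while the density versions line up.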

Pie Charts

Pie charts are generally a horrible way to show complex information. Our brains do not deal well with comparing slices when there are more than three or so. To make matters worse, many lack appropriate labeling or use confusing effects that look cool but in reality are illegible.

Take a look at the mock pie chart on the left without labels and try to guess the various proportional relationships and then on the right where the labels are added.

The green slice is actually equal to a quarter of the yellow one, and pink is a third of the value of the purple slice. It is hard for our brains to transpose and compare values when they are in pie chart form.

The dislike of pie charts is widely shared and many articles have been written on the subject. Even the order of the slices can play mental tricks. Below, Maine and Vermont are actually the same value, but it would be hard to guess that offhand.

Here is another example of five fairly close slices, one of which is about 3% more than the others. Try guessing which one.

If you chose New Hampshire, you would be wrong; it was actually Maryland. The 3D effect visually adds more volume to that slice, tricking your eyes. Without labels giving the percentages, there would be little to no chance of guessing accurately.
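One alternative is to show the same shares as sorted, labeled bars, where lengths are easy to compare. Below is a tiny text-only sketch. The values are invented (four equal slices and one about 3 points larger), the fifth state name is my own addition, and `shares` and `text_bars` are hypothetical helpers.

```python
def shares(values):
    """Convert raw values to percentages of the total."""
    total = sum(values.values())
    return {name: 100 * v / total for name, v in values.items()}


def text_bars(values, width=40):
    """Render a sorted horizontal bar chart as plain text lines."""
    pct = shares(values)
    top = max(pct.values())
    lines = []
    for name, p in sorted(pct.items(), key=lambda kv: kv[1], reverse=True):
        bar = "#" * round(width * p / top)
        lines.append(f"{name:<14}{bar} {p:.1f}%")
    return lines


# Invented values for illustration only.
slices = {
    "Maine": 20,
    "Vermont": 20,
    "New Hampshire": 20,
    "Maryland": 23,
    "Ohio": 20,
}

for line in text_bars(slices):
    print(line)
```

Sorted by size and labeled with percentages, the largest category is obvious at a glance, which the 3D pie makes nearly impossible.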

2 Axis Plots

Charts can actually have two independent vertical axes, one on the left and one on the right. Sometimes this kind of chart can be used to show correlation between two series, but the relationship may be tenuous if not nonexistent.

By using the axis cropping and axis scaling mentioned above, small inclines can become cliffs and non-conforming data can be chopped to draw false correlations and comparisons. Below is a made-up example:

It is not hard to imagine that someone might not notice the two separate scales and believe New Hampshire’s Count 1 and Count 2 are the same when there is a difference of over 250. The lack of clarity can be further exacerbated when one axis is a percentage and the other is something like currency. Here is a particularly good example of how bad charts were used to make an inaccurate point.
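The mechanics are simple enough to sketch. A point's height on the chart depends only on where its value sits between that axis's minimum and maximum, so two very different values can land at exactly the same height. The axis ranges and counts below are invented, and `axis_position` is a hypothetical helper. In matplotlib, a second y axis like this is typically created with `ax.twinx()`.

```python
def axis_position(value, axis_min, axis_max):
    """Vertical position of `value` as a fraction of the axis height."""
    return (value - axis_min) / (axis_max - axis_min)


# Invented numbers for illustration only.
left_axis = (0, 400)   # scale used for Count 1
right_axis = (0, 800)  # scale used for Count 2
count_1, count_2 = 300, 600

pos_1 = axis_position(count_1, *left_axis)
pos_2 = axis_position(count_2, *right_axis)

print(pos_1, pos_2)       # 0.75 0.75
print(count_2 - count_1)  # 300
```

Both points sit three quarters of the way up the chart, even though one value is double the other.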

Using two axes can be useful when drawing accurate correlations between two sets of data, but when someone does it, be sure to check the scale of each axis and understand any other modifications that may have been made.

Conclusion

We use visualizations to explore data, and understanding how mistakes, purposeful or not, are made will make us better at drawing accurate conclusions. The best approach is to be skeptically thorough: look at all the parts before taking a visualization for granted.

While this post focused on visual tricks, there is also the side of statistical manipulation. This includes things like hacking up data sets, ignoring relevant outliers, and many other techniques that can be applied long before the data even makes it into a chart.

It was hard to write this and not look at politics and media at a time when the phrase “fake news” is being thrown around so often. On both sides, visualizations are manipulated and skewed to support conclusions that the data just doesn’t support.

Just as we had to learn skepticism with the first internet scams in the 90s (that poor Nigerian prince…), we now must develop the same sophistication in understanding data. As data grows deeper and deeper into the heart of, well, pretty much everything, you fail to understand it at your peril. And knowing how to be skeptical can make all the difference.