Bad Data Visualizations and How to Fix Them

Using data visualization principles to fix misleading and uninformative charts

Dea Bardhoshi
Towards Data Science

Photo by Firmbee.com on Unsplash

Building data visualizations is the stage in the data science cycle where you get to present your findings after you have worked on understanding and cleaning a dataset. You have probably wondered how best to show the data graphically, and how the choices you make, whether colors, titles, labels, or units, affect how the audience perceives your results. So, what makes a visualization good or bad?

Intuitively, a good visualization should convey information about its contents clearly and accurately. That might seem like an easy enough task, but with a multitude of tools for building graphs, it is easy to get caught up in making needlessly complex charts that are either hard to understand or would be better replaced by a simpler, more concise chart. Both design and content therefore deserve attention when you convey your data to an audience.

Let’s consider one example of a visualization that does not meet these criteria. First, let’s import Plotly, our package of choice, and load the carshare dataset from its built-in datasets:

At first glance, there are latitude and longitude columns whose names also contain the word “centroid”, so they probably represent the center of some kind of polygon. The data also has a column called peak_hour, which seems to be the hour, in 24-hour format, of peak car share availability, and a car_hours column, which isn’t immediately understandable. The Plotly documentation tells us that the dataset describes car share availability over one month in Montreal, so car_hours is perhaps a mean of the number of cars for each peak hour in a month. We want to explore the data further, but before we do that, let’s see whether it needs cleaning.

Does the hour data have any weird values outside of 0 to 23?

No, it doesn’t. What about the car_hours column? Does it have impossible values such as lower than 0?

And we get back an empty Series. So, it appears that the data is in good shape, and we can move on to doing some visualizations on it.

Example 1: Histogram

One key variable in this dataset is car_hours, which we have assumed to mean the count of car sharing vehicles in the peak hour for a location. Since this is a numerical variable, we can use a histogram to visualize it. To show how much the choices you make in constructing a graph matter, let’s start with a really low bin count.

What can we notice from this histogram? Firstly, it is skewed to the right: far more peak car counts fall in the lowest bins than above 2000, and only a very small number of locations have peak car counts larger than 3000. How do we know whether this is informative, or how many bins we should use? Is our present graph violating any of our good graph design principles? Arguably, yes: since we are comparing the frequencies of different car counts, a really small number of bins hides how these peak-hour counts are distributed.

Luckily, we can use Sturges’ rule, a good rule of thumb for determining the number of bins: K = 1 + 3.322 log10(N), where K is the number of bins and N is the number of observations. In this case, for N = 249 (as in 249 peak car count observations), K ≈ 8.96, or about 9. So, let’s see that in action.
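The arithmetic is easy to check with a small sketch that uses only Python's standard library. The function name `sturges_bins` is my own, and rounding to the nearest integer is an assumption about how to turn K into a usable bin count.

```python
import math


def sturges_bins(n_observations: int) -> int:
    """Suggested number of histogram bins: K = 1 + 3.322 * log10(N), rounded."""
    return round(1 + 3.322 * math.log10(n_observations))


# Unrounded value of K for the 249 observations in the carshare data
raw_k = 1 + 3.322 * math.log10(249)
print(round(raw_k, 2))    # 8.96
print(sturges_bins(249))  # 9
```

The resulting count can then be passed to `px.histogram` through its `nbins` argument.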

Now that the histogram is a bit more refined, we can see more specific trends in the data. For instance, it appears that the most common value is between 500 and 1000 cars per area (in the peak hours), and the histogram is right-skewed, but now it’s clearer how different counts subdivide into bins.

So, can we improve this chart further? Well, with Python visualization packages like Plotly, we have control over many of the chart’s properties, such as colors, labels and titles. Are the current labels providing enough information for a viewer to clearly understand what the chart is showing? Probably not, since we already established that the car_hours label is a bit obscure. So let’s fix them and also add a descriptive title:

Example 2: Overlaying the Histograms

Next, let’s visualize how car_hours is distributed over the day hours versus the night hours. In this graph, I will define hours from 5 am to 8 pm as day and 8 pm to 4 am as night. Here is the code for doing the filtering and showing the histogram:

I have intentionally made a mistake here. Can you tell what it is? It’s the colors chosen for the overlaid histograms: red and green. Red-green color blindness is the most common type of color blindness, and your charts need to be accessible to your entire audience. That often means thinking about the design of the components in a visualization, not only about the content it presents. There is one more fix: adding the right labels in the legend and a proper title. So let’s fix all of these things.

Do we notice any interesting trends here? First off, both histograms seem to have a similar spread over the x values they cover. Secondly, much of the mass of both histograms seems concentrated in the 500 to 1500 range, which is a bit surprising, as we would expect less availability of car share services at night than during the day.

Example 3: Interactivity and Extra Elements

To add extra functionality to your charts, you have many options to choose from. One example is adding tooltips, which show the value of a specific point when you hover over it. This can be useful when examining outliers, such as the bins in the 3000s in the histograms above, since tooltips give extra insight into those data points. Plotly creates these tooltips automatically when you create a histogram, and they add to the overall informativeness of your chart.

How do you know when to add an extra element such as a tooltip then? Do you run the risk of making a visualization overly complex or is the element in itself indispensable in giving the graph meaning? For the last example, let’s look at a 3D chart of Asian countries’ life expectancy data from Plotly’s built-in datasets.

And above is the result. It certainly looks visually appealing and more modern than the previous visualizations. But the question is: does the third dimension add any information? Could this chart be broken down into two more understandable 2D charts? I think yes. In my opinion, it is easier to see the relationship between year and life expectancy, or between year and population, if we view each of these pairs on its own. While we notice a clear upward trend in life expectancy over the years, it is hard to determine precisely what the differences between countries are or how increases in population affect life expectancy. Overall, I think this chart is hard to interpret, so you would want to break it down into two or three line plots.

Now that we have seen a few examples of how to improve our visualizations, what are the main takeaways? First, think about the design of the graph, including its colors and labels, keeping in mind that you want to make it as easy and direct as possible for the audience to understand. Next, resist the temptation to reach for sophisticated-looking charts, and favor simpler ones that show relationships between variables more clearly.

Thank you for reading and I hope you enjoyed this story!

👩‍💻 Data Science UC Berkeley '23 | 🏙 Data Science, Urban Planning, Civic Technology | ✍️ Newsletter: https://deabardhoshi.substack.com/