
Know These 5 Common Data Visualization Mistakes

Things to avoid when making your next plot

Photo by Jamie Street on Unsplash

Data visualization is one of the most powerful tools in analytics. It’s the best way to make sense of data and communicate it to others. Powerful as it is, a tool is only as useful as the hands it is in. In visualizing data, as in any other field, there are rules, best practices, and guidelines. Of course, there are also mistakes.

I want to go over five of the most common mistakes that I’ve made and seen. I’ll focus on how these mistakes look in ggplot2, but the principles of data visualization are universal, so these lessons apply to other tools as well: Python, Excel, Tableau, etc.


1. Forgetting titles/titles too small

You want your visualization to make sense as a standalone image: if it ends up being shared, there should be as little room for misinterpretation as possible. This means you’ll need to add sufficient context, at minimum a descriptive title and good axis labels.

Because a default graph does not include a title, and the x- and y-axis titles default to the variable names, a common mistake is forgetting to update them.

Let’s look at a simple example of a ggplot somebody might make:

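# Load ggplot2, dplyr (for %>% and filter), and forcats (for fct_reorder)
library(tidyverse)
# Set global theme for the rest of this article
theme_set(theme_bw())
# Default scatter plot: no title, axis titles are the raw variable names
mtcars %>% 
  ggplot(aes(x = mpg, y = hp)) +
  geom_point()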

If you saw the visualization above, you might be able to deduce that it shows vehicle data and the relationship between miles per gallon (mpg) and horsepower (hp). Still, the plot is ambiguous without more context, and the axis titles are small and hard to read.

How can we improve this? Let’s use more descriptive titles and increase the font size:

mtcars %>% 
  ggplot(aes(x = mpg, y = hp)) +
  geom_point() +
  labs(title = 'Comparison of Miles Per Gallon vs. Horsepower for 32 cars',
       x = 'Miles Per Gallon',
       y = 'Horsepower') +
  theme_bw(base_size = 20)

2. Using too many data labels

A rule of thumb when visualizing data is to get the point across in the simplest terms possible. Often, this means removing irrelevant data that detracts from the main point. A frequent offender is including far too many data labels.

The example below is an exaggeration, but we typically want to use labels to highlight a point. Too many, and they become noise. In the fixed version, only the extreme points are labeled, so the reader is able to take away a key point.

# Base scatter plot of 1,000 random points, reused in the examples below
p <-
  data.frame(x = rnorm(1000),
             y = rnorm(1000)) %>% 
  ggplot(aes(x, y)) +
  geom_point()
# Example of mistake: every point gets a label
p + geom_text(aes(label = y))
Bad
# Fixed: label only the points that are extreme on both axes
p + 
  geom_text(data = . %>% filter(abs(x) >= 1.9, abs(y) >= 1.9),
            aes(label = y))
Better

3. Having unwieldy numbers

This mistake goes hand in hand with the previous one.

When displaying numbers, be aware of how many significant figures you use. This goes both for very large numbers (e.g. 1,344,323,400 -> 1.34 bn) and for very precise numbers (e.g. 1.12321931230 -> 1.12).

p + 
  geom_text(data = . %>% filter(abs(x) >= 1.9, abs(y) >= 1.9),
            aes(label = round(y, 2)),
            hjust = 0)
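
The code above handles the "very precise" case with round(). For the "very large" case, a minimal base R sketch might look like the following. Note that abbreviate_number() and its suffixes are hypothetical names made up for this illustration, not part of ggplot2 or any package:

```r
# Hypothetical helper (not from any package): shorten numbers to a few
# significant figures and append a magnitude suffix.
abbreviate_number <- function(x, digits = 3) {
  units <- c("", "k", "M", "bn")               # one suffix per power of 1,000
  idx <- floor(log10(abs(x)) / 3)              # which power of 1,000 x falls in
  idx <- pmax(pmin(idx, length(units) - 1), 0) # clamp to the available suffixes
  paste0(signif(x / 1000^idx, digits),
         ifelse(idx > 0, " ", ""),
         units[idx + 1])
}

abbreviate_number(1344323400)     # "1.34 bn"
abbreviate_number(1.12321931230)  # "1.12"
```

A helper like this could then be used inside a label mapping, for example aes(label = abbreviate_number(y)), in the same way round() is used above.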

4. Using too many categories

If your legend takes up a lot of room, it’s a sign that you’re using too many categories. Think of ways to reduce the number of categories so that your legend contains 10 items at most. This might involve grouping categories into an "other" bucket, or coming up with another way of categorizing/displaying categories.

In the visualization below, it’s impossible to match each dot with its corresponding color, so the legend becomes a waste of space.

mtcars %>% 
  ggplot(aes(x = mpg, y = hp, color = rownames(.))) +
  geom_point() +
  labs(title = 'Comparison of Miles Per Gallon vs. Horsepower for 32 cars',
       x = 'Miles Per Gallon',
       y = 'Horsepower',
       color = 'Car') +
  theme_bw(base_size = 20)

The clear solution here is to just label the categories in the graph directly rather than relying on the reader to match on colors:

library(ggrepel)
mtcars %>% 
  mutate(car = rownames(.)) %>% 
  ggplot(aes(x = mpg, y = hp)) +
  geom_point() +
  labs(title = 'Comparison of Miles Per Gallon vs. Horsepower for 32 cars',
       x = 'Miles Per Gallon',
       y = 'Horsepower') +
  theme_bw(base_size = 20) +
  geom_text_repel(aes(label = car))
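
When direct labels are not an option, the "other" bucket mentioned earlier is another way out. Below is a rough base R sketch, assuming (purely for illustration) that the first word of each row name can stand in for the car's make and that any make appearing only once is folded into "Other". The variable names and the cutoff of 2 are arbitrary choices for this example:

```r
# Treat the first word of each model name as its make (an assumption made
# for this example; mtcars has no make column)
make <- sub(" .*", "", rownames(mtcars))

# Count the cars per make and keep only makes that appear at least twice
counts <- table(make)
keep <- names(counts)[counts >= 2]

# Everything else is collapsed into a single "Other" bucket
make_grouped <- ifelse(make %in% keep, make, "Other")

table(make_grouped)
```

Mapping color to make_grouped instead of the full car name leaves six legend entries rather than 32. If you already use forcats, fct_lump_n() offers a similar lumping step.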

5. Using a bad image resolution

This is the most subjective mistake on the list, but typically you want to shoot for a resolution that allows the reader to digest the data more easily. In RStudio, this is as simple as using the "Zoom" button in the image pane and resizing the image until you find a resolution that you like.

RStudio

In the example below, I’ve created the same graph with two different resolutions – the first graph is a bit cramped and harder to read, while the second graph expands the graph vertically to give some more space between the labels and bars.

# Horizontal bar chart of mpg per car, ordered by mpg
mtcars %>% 
  ggplot(aes(x = fct_reorder(rownames(.), mpg), y = mpg)) +
  geom_col() +  # equivalent to geom_bar(stat = 'identity')
  coord_flip()
The bars here are cramped together - harder to read
Easier to read
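
If you would rather set the size in code than resize by hand, ggsave() from ggplot2 accepts explicit dimensions. The sketch below saves the same chart twice. The file names, the sizes, and the dpi are arbitrary values picked for illustration, and the files go to a temporary directory:

```r
library(ggplot2)

# Same bar chart as above, rebuilt without the pipe so the block stands alone
cars <- data.frame(car = rownames(mtcars), mpg = mtcars$mpg)
p_bars <- ggplot(cars, aes(x = reorder(car, mpg), y = mpg)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Miles Per Gallon")

out_dir <- tempdir()  # assumption: write to a throwaway directory

# Short canvas: 32 bars squeezed into 3 inches of height
ggsave(file.path(out_dir, "mpg_cramped.png"), plot = p_bars,
       width = 6, height = 3, dpi = 150)

# Taller canvas: same plot with more vertical room per bar
ggsave(file.path(out_dir, "mpg_tall.png"), plot = p_bars,
       width = 6, height = 8, dpi = 150)
```

Fixing the dimensions in code also makes the output reproducible, since it no longer depends on how large the plot pane happened to be.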

Conclusion

Hopefully, this article has helped you with some of your visualization woes. Anyone who is new to data visualization is bound to make mistakes early on, so I hope you found a takeaway or two that you can use in your future graphs!

If you’re interested in learning more about visualization, I have written another article going over some useful tips that I’ve learned over the years:

8 Tips for Better Data Visualization

