
Ten random but useful things to know about ggplot2

ggplot2 is like a language within a language, but for data science graphics it’s the best thing going today

All images in this article are author generated

Last week I did some charts in ggplot2 to illustrate something to some analyst colleagues. The charts immediately resonated and worked perfectly for my purposes. Soon afterwards I got a message from one of the analysts – ‘How do I do this in Excel?’.

I explained that it was unlikely that they could create this chart in Excel – not in any straightforward way – and I took this as an opportunity to encourage the analyst to investigate and learn a Data Science language. For me, packages like ggplot2 are among the most compelling reasons to learn programming. In a few simple lines of code you can create amazing, tailored graphics of almost any statistical phenomenon. You can style them to look really beautiful, you can integrate them into any document, and you can share the code to make them easily and instantly reproducible by others.

That said, ggplot2 has a particular grammar which takes some getting used to and so it’s not something that can be picked up instantly. It takes practice, trial and error to become confident with ggplot2, but once you are there the world of statistical charting is your oyster.

I wanted to use this article to demonstrate a few things I do in ggplot2 on a regular basis. I hope it helps encourage you to play with this package more and you may even learn a few tricks you didn’t know about.

  1. Use aesthetic inheritance to make your code simpler

ggplot2 charts work through aesthetic inheritance. Aesthetics are effectively a mapping between the graphics and the data. You use aesthetics to tell ggplot2 what elements of the data to use for what features of the chart. For example, to create a simple scatter chart that shows mpg vs wt for the mtcars dataset, and to color the points according to cyl, you need to pass three aesthetics to ggplot():

library(ggplot2)
g1 <- ggplot(data = mtcars, 
             aes(x = wt, y = mpg, color = as.factor(cyl))) +
  geom_point()

Any aesthetics you put into that first ggplot() statement will be passed on to all subsequent graphics commands unless you specifically indicate otherwise. So if we want to draw fit lines separately for each cyl group, we just need to add geom_smooth():

g1 + geom_smooth()

Maybe you didn’t want that and just wanted a fit line for the entire sample? Then just take the aesthetic elements you don’t want inherited out of the original ggplot() statement and put them in the specific geom function where you want to use them:

g2 <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = as.factor(cyl)))

This will create an identical scatter plot, but now when you add geom_smooth() the color grouping will no longer be inherited and a more general fit will be graphed:

g2 + geom_smooth()
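
There is also a middle route, sketched below as an illustration rather than as the only way to do it: keep the color mapping in ggplot() and switch it off for a single layer by setting that aesthetic to NULL inside the layer's own aes(). The method = "lm" and color = "black" settings are my choices for the sketch, not part of the charts above.

```r
library(ggplot2)

# Color is mapped once in ggplot(), so every layer inherits it by default
p <- ggplot(data = mtcars,
            aes(x = wt, y = mpg, color = as.factor(cyl))) +
  geom_point() +
  # aes(color = NULL) drops the inherited color mapping for this layer only,
  # so a single fit line is drawn for the whole sample
  geom_smooth(aes(color = NULL), method = "lm", formula = y ~ x,
              color = "black")

p
```

The points keep their cyl colors while the smooth layer ignores the grouping, which saves moving the mapping around when only one layer should behave differently.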

  2. Chart any function without data

geom_function() is a relatively new addition to ggplot2 and allows you to plot a simulation of any function you define. This could be a built-in common statistical function:

ggplot() +
  xlim(-5, 5) +
  geom_function(fun = dnorm)

Or it could be a user-defined function. For example, an interesting function to look at is sin(1/x) which I was asked to sketch during an interview for my undergraduate mathematics program way back when. You can use the n argument to specify how many points to simulate – in this case I’ll use 100,000 points:

ggplot() +
  xlim(-1, 1) +
  geom_function(fun = function(x) sin(1/x), n = 1e5)
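
If the function takes extra parameters, a small sketch like the following passes them through the args argument of geom_function() instead of wrapping the function in an anonymous one. The mean and sd values here are arbitrary and only for illustration.

```r
library(ggplot2)

# args passes extra named arguments on to fun,
# so this draws dnorm(x, mean = 2, sd = 0.5)
p <- ggplot() +
  xlim(-5, 5) +
  geom_function(fun = dnorm, args = list(mean = 2, sd = 0.5))

p
```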

  3. Overlay a density curve onto a histogram

When graphing distributions it is often nice to see both the histogram and the density curve. You might be tempted to do this by simply adding geom_histogram() and then geom_density(), but the problem is that geom_histogram() uses count rather than density by default. So it’s important to define density as your y-aesthetic so that both geoms work well together. Note also the use of graphic elements like fill and alpha to customize color and opacity.

# get some data on SAT scores and GPAs for 1000 students
sat_data <- read.csv("https://www.openintro.org/data/csv/satgpa.csv")
# after_stat(density) replaces the deprecated ..density.. notation
g3 <- ggplot(data = sat_data, aes(x = sat_sum, y = after_stat(density))) +
  geom_histogram(fill = "lightblue") +
  geom_density(fill = "pink", alpha = 0.4)
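
As a rough variation on the same idea, the density mapping can live in the histogram layer alone, since geom_density() already plots density by default. This sketch uses the built-in faithful dataset as a stand-in so that it runs without a download; the bins = 30 setting is just an explicit version of the default.

```r
library(ggplot2)

p <- ggplot(data = faithful, aes(x = waiting)) +
  # rescale the histogram so its bars integrate to 1, like the density curve
  geom_histogram(aes(y = after_stat(density)), bins = 30,
                 fill = "lightblue") +
  geom_density(fill = "pink", alpha = 0.4)

p
```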

  4. Overlay unrelated graphic elements by de-inheriting aesthetics

Sometimes you may want to illustrate a theoretical or comparison model by overlaying a graphic that does not inherit the aesthetics of the prior graphics. This is where the argument inherit.aes is really useful. Most geom functions have this argument as TRUE by default, but setting it to FALSE allows you to overlay something unrelated onto your chart. For example, let’s say I wanted to overlay a theoretical perfect normal distribution:

sat_mean <- mean(sat_data$sat_sum)
sat_sd <- sd(sat_data$sat_sum)
g3 +
  geom_function(
    fun = function(x) dnorm(x, mean = sat_mean, sd = sat_sd), 
    linetype = "dashed",
    inherit.aes = FALSE
  )
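
The same argument helps when a layer draws from a different data frame that lacks the columns used in the main mapping. The sketch below is an assumed example on mtcars, not one of the charts above: the centre data frame and its column names are made up for illustration.

```r
library(ggplot2)

# A one-row data frame holding the overall mean point (hypothetical helper)
centre <- data.frame(mean_wt = mean(mtcars$wt),
                     mean_mpg = mean(mtcars$mpg))

p <- ggplot(data = mtcars,
            aes(x = wt, y = mpg, color = as.factor(cyl))) +
  geom_point() +
  # centre has no cyl column, so the inherited color mapping must be dropped
  geom_point(data = centre, aes(x = mean_wt, y = mean_mpg),
             inherit.aes = FALSE, shape = 4, size = 5)

p
```

Without inherit.aes = FALSE, the second layer would try to evaluate as.factor(cyl) against centre and fail because the column is not there.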

  5. Use geom_ribbon() to communicate ranges and uncertainty

In my opinion geom_ribbon() is one of the most underused geoms in ggplot2. You’ve probably seen it in use since it is the engine behind the shaded confidence range in geom_smooth(), but I’ve used it a lot in many contexts. Let’s say we build a model to predict SAT from high school GPA in our prior dataset. We can use geom_smooth() to nicely show the confidence interval for the mean, but we can then layer geom_ribbon() over that to nicely show a much wider prediction interval.

model <- lm(sat_sum ~ hs_gpa, data = sat_data)
predictions <- predict(model, data.frame(hs_gpa = sat_data$hs_gpa), 
                       interval = "prediction")
ggplot(data = sat_data, aes(x = hs_gpa)) +
  geom_point(aes(y = sat_sum), color = "lightblue") +
  geom_smooth(aes(y = sat_sum), color = "red") +
  geom_ribbon(aes(ymin = predictions[,'lwr'], 
                  ymax = predictions[,'upr']), 
                  fill = "pink",
                  alpha = 0.3)
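
An alternative that I find a little more robust, sketched here on mtcars so that it runs offline, is to bind the predictions onto the data frame first. The ribbon limits then travel with each row rather than relying on a separate matrix staying in the same order. The plot_data name and the mpg ~ wt model are assumptions for the example.

```r
library(ggplot2)

model <- lm(mpg ~ wt, data = mtcars)

# predict() returns a matrix with columns fit, lwr and upr;
# cbind() attaches them to the original rows
plot_data <- cbind(
  mtcars,
  predict(model, newdata = mtcars, interval = "prediction")
)

p <- ggplot(data = plot_data, aes(x = wt)) +
  geom_point(aes(y = mpg), color = "lightblue") +
  geom_line(aes(y = fit), color = "red") +
  geom_ribbon(aes(ymin = lwr, ymax = upr), fill = "pink", alpha = 0.3)

p
```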

  6. Use geom_jitter() to give more ‘scatter’ to your scatterplot

A lot of data is ‘clumped’ because of its inherent scale. For example, in the chart above you can see that hs_gpa is somewhat clumped, forcing the scatter plot into vertical lines. This can cause data points to be hidden behind other data points, and can make your sample size look smaller than it is. geom_jitter() is a really useful convenience function that adds random jitter to your points to help with this issue. Just replace geom_point() with geom_jitter() and experiment with the width argument to get the amount of jitter you want:

g4 <- ggplot(data = sat_data, aes(x = hs_gpa)) +
  geom_jitter(aes(y = sat_sum), color = "lightblue", width = 0.05) +
  geom_smooth(aes(y = sat_sum), color = "red") +
  geom_ribbon(aes(ymin = predictions[,'lwr'], 
                  ymax = predictions[,'upr']), 
                  fill = "pink",
                  alpha = 0.3)
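
Two things are worth knowing here: jitter is random, so the chart changes slightly on every render, and geom_jitter() also moves points vertically unless you tell it not to. The sketch below, on mtcars with arbitrary width and seed values, uses position_jitter() to fix the seed and to keep the y values exact.

```r
library(ggplot2)

p <- ggplot(data = mtcars, aes(x = cyl, y = mpg)) +
  # height = 0 leaves y untouched; seed makes the jitter reproducible
  geom_point(position = position_jitter(width = 0.2, height = 0, seed = 42),
             color = "lightblue")

p
```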

  7. Annotate text on your chart

geom_text() allows you to add useful text to the chart to help with understanding. Let’s say we want to label the prediction interval at an appropriate point on the x-y scale with text of a similar color:

g5 <- g4 +
  geom_text(x = 4.3, y = 100, 
            label = "Prediction\nInterval", 
            color = "pink")
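
For a one-off label like this, annotate() is a possible alternative. When geom_text() is given fixed x, y and label values it still draws the label once for every row of the plot data, which can make the text look heavy, whereas annotate() draws it exactly once. A minimal sketch on mtcars, with a made-up label and position:

```r
library(ggplot2)

p <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "lightblue") +
  # annotate() builds its own one-row layer, so the label is drawn once
  annotate("text", x = 5, y = 30,
           label = "Heavier\ncars", color = "red")

p
```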

  8. Theme your chart to improve the look and feel

Built-in themes are super useful for changing the look and feel of your chart in one simple command. I’m a fan of a nice clean look, and so I am a big user of theme_minimal(). Combined with nice labeling, this can quickly get you to the look you want. Here’s an example:

g5 + 
  theme_minimal() +
  labs(x = "High School GPA", 
       y = "SAT",
       title = "Relationship between GPA and SAT")

As well as ggplot2’s built-in themes, you can also try the themes in the ggthemes package, which includes some themes of popular data publications:

library(ggthemes)
g6 <- g5 + 
  theme_economist_white() +
  labs(x = "High School GPA", 
       y = "SAT",
       title = "Relationship between GPA and SAT")

  9. Get detailed with elements

When you are styling for a high stakes presentation and you want to get into the intricate details of axes, text, ticks and the like, you can edit elements to get the details right. Let’s say I want bold axis titles and a bigger title font on the above chart.

g6 +
  theme(axis.title = element_text(face = "bold"),
        plot.title = element_text(size = 20))

  10. Use the patchwork package to combine multiple ggplots easily

If you’ve created multiple ggplots and you want to combine them, I find the patchwork package a really easy way to do this, and more intuitive than gridExtra::grid.arrange(). It works well in R Markdown, but it is not limited to it: patchwork works in any R session, and you can save the combined image as a separate file with ggsave(). Let’s say that I want to patch the images g4, g5 and g6 above together, with g4 and g5 side-by-side on the top row and g6 spanning the second row. I can do this using patchwork as follows:

library(patchwork)
(g4 | g5) / 
  g6
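
As far as I can tell the same layout also works from a plain R script or console session, and the result can be written straight to a file. The sketch below uses three simple mtcars charts as stand-ins for g4, g5 and g6; the plot names, title and output file are assumptions for the example.

```r
library(ggplot2)
library(patchwork)

# Three stand-in charts (hypothetical replacements for g4, g5 and g6)
p1 <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
p2 <- ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point()
p3 <- ggplot(mtcars, aes(x = mpg)) + geom_histogram(bins = 10)

# | places plots side by side, / stacks rows;
# plot_annotation() adds a title for the whole layout
combined <- (p1 | p2) / p3 +
  plot_annotation(title = "Three views of mtcars")

# Save the combined layout as a single image
out_file <- tempfile(fileext = ".png")
ggsave(out_file, plot = combined, width = 8, height = 6)
```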

Obviously these ten tips are not meant to be comprehensive, but I hope they illustrate some of the reasons why I love working in ggplot2 and will inspire you to try some new things and get to know the package better.


_Originally I was a Pure Mathematician, then I became a Psychometrician and a Data Scientist. I am passionate about applying the rigor of all those disciplines to complex people questions. I’m also a coding geek and a massive fan of Japanese RPGs. Find me on LinkedIn or on Twitter. Also check out my blog on drkeithmcnulty.com or my soon to be released textbook on People Analytics._

