ggplot2 is like a language within a language, but for data science graphics it’s the best thing going today

Last week I did some charts in ggplot2 to illustrate something to some analyst colleagues. The charts immediately resonated and worked perfectly for my purposes. Soon afterwards I got a message from one of the analysts – ‘How do I do this in Excel?’.
I explained that it was unlikely that they could create this chart in Excel – not in any straightforward way – and I took this as an opportunity to encourage the analyst to investigate and learn a Data Science language. For me, packages like ggplot2 are among the most compelling reasons to learn programming. In a few simple lines of code you can create amazing, tailored graphics of almost any statistical phenomenon. You can style them to look really beautiful, you can integrate them into any document, and you can share the code to make them easily and instantly reproducible by others.
That said, ggplot2 has a particular grammar which takes some getting used to, so it’s not something that can be picked up instantly. It takes practice and trial and error to become confident with ggplot2, but once you are there the world of statistical charting is your oyster.
I wanted to use this article to demonstrate a few things I do in ggplot2 on a regular basis. I hope it encourages you to play with this package more, and you may even learn a few tricks you didn’t know about.
1. Use aesthetic inheritance to make your code simpler
ggplot2 charts work through aesthetic inheritance. Aesthetics are effectively a mapping between the graphics and the data. You use aesthetics to tell ggplot2 what elements of the data to use for what features of the chart. For example, to create a simple scatter chart that shows mpg vs wt for the mtcars dataset, and to color the points according to cyl, you need to pass three aesthetics to ggplot():
library(ggplot2)
g1 <- ggplot(data = mtcars,
             aes(x = wt, y = mpg, color = as.factor(cyl))) +
  geom_point()

Any aesthetics you put into that first ggplot() statement will be passed on to all subsequent graphics commands unless you specifically indicate otherwise. So if we want to draw fit lines separately for each cyl group, we just need to add geom_smooth():
g1 + geom_smooth()

Maybe you didn’t want that and just wanted a fit line for the entire sample? Then just take the aesthetic elements you don’t want inherited out of the original ggplot() statement and put them in the specific geom function where you want to use them:
g2 <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = as.factor(cyl)))
This will create an identical scatter plot, but now when you add geom_smooth() the color grouping will no longer be inherited and a more general fit will be graphed:
g2 + geom_smooth()
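
One way to check which layers actually inherit the color grouping is to look at the computed layer data with layer_data(). The sketch below rebuilds the two plots under the hypothetical names p_inherit and p_local, and it uses method = "lm" (my choice for a quiet, quick fit, not the default used above), so treat it as an illustration rather than the only way to verify this.

```r
library(ggplot2)

# Color mapped in ggplot(): every layer inherits it, so the smoother is grouped
p_inherit <- ggplot(mtcars, aes(x = wt, y = mpg, color = as.factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# Color mapped only in geom_point(): the smoother sees no grouping
p_local <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = as.factor(cyl))) +
  geom_smooth(method = "lm", formula = y ~ x)

# Layer 2 is the smoother; count the distinct groups it was fitted for
groups_inherit <- length(unique(layer_data(p_inherit, 2)$group))
groups_local <- length(unique(layer_data(p_local, 2)$group))
```

With the inherited mapping the smoother is fitted once per cyl group, while the local mapping leaves it with a single group.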

2. Chart any function without data
geom_function() is a relatively new addition to ggplot2 and allows you to plot a simulation of any function you define. This could be a built-in common statistical function:
ggplot() +
  xlim(-5, 5) +
  geom_function(fun = dnorm)

Or it could be a user-defined function. For example, an interesting function to look at is sin(1/x), which I was asked to sketch during an interview for my undergraduate mathematics program way back when. You can use the n argument to specify how many points to simulate – in this case I’ll use 100,000 points:
ggplot() +
  xlim(-1, 1) +
  geom_function(fun = function(x) sin(1/x), n = 1e5)
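
If the function you want to chart takes extra parameters, the args argument of geom_function() is one way to supply them without writing a wrapper. This is a small sketch with made-up parameter values (mean = 1, sd = 2) and the hypothetical names p_fun and curve_points.

```r
library(ggplot2)

# Two normal densities on one chart; `args` passes extra arguments to `fun`
p_fun <- ggplot() +
  xlim(-5, 5) +
  geom_function(fun = dnorm, n = 201) +
  geom_function(fun = dnorm, args = list(mean = 1, sd = 2),
                n = 201, linetype = "dashed")

# Each layer is evaluated at n points across the x limits
curve_points <- layer_data(p_fun, 2)
```

Printing p_fun draws both curves, and curve_points holds the x and y values that were computed for the dashed one.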

3. Overlay a density curve onto a histogram
When graphing distributions it is often nice to see both the histogram and the density curve. You might be tempted to do this by simply adding geom_histogram() and then geom_density(), but the problem is that geom_histogram() uses count rather than density by default. So it’s important to define density as your y-aesthetic so that both geoms work well together. Note also the use of graphic elements like fill and alpha to customize color and opacity.
# get some data on sat scores for 1000 students
sat_data <- read.csv("https://www.openintro.org/data/csv/satgpa.csv")
# after_stat(density) replaces the deprecated ..density.. notation
g3 <- ggplot(data = sat_data, aes(x = sat_sum, y = after_stat(density))) +
  geom_histogram(fill = "lightblue") +
  geom_density(fill = "pink", alpha = 0.4)
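
To see why the density scale lets the two geoms share an axis, it can help to check that the histogram bars then cover a total area of 1, just like a density curve. The sketch below uses simulated scores rather than the SAT data (so it runs offline); the names sim, p_density and bars are my own.

```r
library(ggplot2)

set.seed(1)
# Simulated scores (hypothetical data, not the SAT dataset)
sim <- data.frame(score = rnorm(500, mean = 100, sd = 15))

p_density <- ggplot(sim, aes(x = score, y = after_stat(density))) +
  geom_histogram(bins = 30, fill = "lightblue") +
  geom_density(fill = "pink", alpha = 0.4)

# On the density scale, bar height times bar width sums to 1
bars <- layer_data(p_density, 1)
total_area <- sum(bars$density * (bars$xmax - bars$xmin))
```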

4. Overlay unrelated graphic elements by de-inheriting aesthetics
Sometimes you may want to illustrate a theoretical or comparison model by overlaying a graphic that does not inherit the aesthetics of the prior graphics. This is where the argument inherit.aes is really useful. Most geom functions have this argument as TRUE by default, but setting it to FALSE allows you to overlay something unrelated onto your chart. For example, let’s say I wanted to overlay a theoretical perfect normal distribution:
sat_mean <- mean(sat_data$sat_sum)
sat_sd <- sd(sat_data$sat_sum)
g3 +
  geom_function(
    fun = function(x) dnorm(x, mean = sat_mean, sd = sat_sd),
    linetype = "dashed",
    inherit.aes = FALSE
  )
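
The same argument is handy when the overlay comes from a different data frame. As a rough sketch on mtcars, the layer below marks the mean of each cyl group with a cross; centroids, p_overlay and overlay_points are names I made up for this example.

```r
library(ggplot2)

# Mean weight and mpg per cylinder group (a separate, smaller data frame)
centroids <- aggregate(cbind(wt, mpg) ~ cyl, data = mtcars, FUN = mean)

p_overlay <- ggplot(mtcars, aes(x = wt, y = mpg, color = as.factor(cyl))) +
  geom_point() +
  # This layer brings its own data and mapping and ignores the plot-level aes()
  geom_point(data = centroids, aes(x = wt, y = mpg),
             inherit.aes = FALSE, shape = 4, size = 4)

overlay_points <- layer_data(p_overlay, 2)
```

Because the second layer does not inherit the color mapping, the crosses are drawn in the default point color instead of being colored by group.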

5. Use geom_ribbon() to communicate ranges and uncertainty
In my opinion geom_ribbon() is one of the most underused geoms in ggplot2. You’ve probably seen it in use, since it is the engine behind the shaded confidence range in geom_smooth(), but I’ve used it in many other contexts. Let’s say we build a model to predict SAT from high school GPA in our prior dataset. We can use geom_smooth() to show the confidence interval for the mean, and then layer geom_ribbon() over that to show a much wider prediction interval.
model <- lm(sat_sum ~ hs_gpa, data = sat_data)
predictions <- predict(model, data.frame(hs_gpa = sat_data$hs_gpa),
                       interval = "prediction")
ggplot(data = sat_data, aes(x = hs_gpa)) +
  geom_point(aes(y = sat_sum), color = "lightblue") +
  geom_smooth(aes(y = sat_sum), color = "red") +
  geom_ribbon(aes(ymin = predictions[,'lwr'],
                  ymax = predictions[,'upr']),
              fill = "pink",
              alpha = 0.3)
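
An alternative to indexing the prediction matrix inside aes() is to bind the interval columns to the data first, so the ribbon maps plain column names. This is a minimal sketch on mtcars (so it runs without a download); model_mt, pred and p_ribbon are hypothetical names.

```r
library(ggplot2)

model_mt <- lm(mpg ~ wt, data = mtcars)

# predict() returns a matrix with columns fit, lwr and upr; bind it to the data
pred <- cbind(mtcars,
              predict(model_mt, newdata = mtcars, interval = "prediction"))

p_ribbon <- ggplot(pred, aes(x = wt)) +
  geom_ribbon(aes(ymin = lwr, ymax = upr), fill = "pink", alpha = 0.3) +
  geom_line(aes(y = fit), color = "red") +
  geom_point(aes(y = mpg), color = "lightblue")
```

Keeping the bounds in the same data frame means the ribbon stays aligned with the points even if the data is later filtered or reordered.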

6. Use geom_jitter() to give more ‘scatter’ to your scatterplot
A lot of data is ‘clumped’ because of its inherent scale. For example, in the chart above you can see that hs_gpa is somewhat clumped, forcing the scatter plot into vertical lines. This can cause data points to be hidden behind other data points and can make your sample size look smaller than it is. geom_jitter() is a really useful convenience function that adds random jitter to your points to help with this issue. Just replace geom_point() with geom_jitter() and experiment with the width argument to get the amount of jitter you want:
g4 <- ggplot(data = sat_data, aes(x = hs_gpa)) +
  geom_jitter(aes(y = sat_sum), color = "lightblue", width = 0.05) +
  geom_smooth(aes(y = sat_sum), color = "red") +
  geom_ribbon(aes(ymin = predictions[,'lwr'],
                  ymax = predictions[,'upr']),
              fill = "pink",
              alpha = 0.3)
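
Two details are worth knowing when jittering. As far as I know, geom_jitter() also jitters vertically by default unless height is set, and the result changes on every render unless you fix a seed. The sketch below uses position_jitter() on mtcars to control both; the values and the names p_jitter, jittered and jittered_again are illustrative.

```r
library(ggplot2)

# cyl takes only the values 4, 6 and 8, so points pile up in three columns
p_jitter <- ggplot(mtcars, aes(x = cyl, y = mpg)) +
  geom_point(
    position = position_jitter(width = 0.2, height = 0, seed = 42),
    color = "lightblue"
  )

# Building the layer twice gives the same positions because the seed is fixed
jittered <- layer_data(p_jitter, 1)
jittered_again <- layer_data(p_jitter, 1)
```

With height = 0 only the x positions move, so the y values you read off the chart are still the real ones.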

7. Annotate text on your chart
geom_text() allows you to add useful text to the chart to help with understanding. Let’s say we want to label the prediction interval at an appropriate point on the x-y scale with text of a similar color:
g5 <- g4 +
  geom_text(x = 4.3, y = 100,
            label = "Prediction\nInterval",
            color = "pink")
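
One thing to be aware of is that geom_text() with fixed coordinates still inherits the plot data, so the same label is drawn once per row, which can make the text look heavy or blurry. For a single label, annotate() is a common alternative. The sketch below compares the two on mtcars; the label text, coordinates and object names are made up for illustration.

```r
library(ggplot2)

base <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "lightblue")

# geom_text() with fixed x/y still inherits the data: one label per row
p_text <- base +
  geom_text(x = 5, y = 30, label = "Heavy cars", color = "pink")

# annotate() creates a single standalone label
p_annotate <- base +
  annotate("text", x = 5, y = 30, label = "Heavy cars", color = "pink")

n_text <- nrow(layer_data(p_text, 2))
n_annotate <- nrow(layer_data(p_annotate, 2))
```

Here n_text equals the number of rows in mtcars, while n_annotate is 1.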

8. Theme your chart to improve the look and feel
Built-in themes are super useful for changing the look and feel of your chart in one simple command. I’m a fan of a nice clean look, and so I am a big user of theme_minimal(). Combined with nice labeling, this can quickly get you to the look you want. Here’s an example:
g5 +
  theme_minimal() +
  labs(x = "High School GPA",
       y = "SAT",
       title = "Relationship between GPA and SAT")

As well as ggplot2’s built-in themes, you can also try the themes in the ggthemes package, which includes some themes of popular data publications:
library(ggthemes)
g6 <- g5 +
  theme_economist_white() +
  labs(x = "High School GPA",
       y = "SAT",
       title = "Relationship between GPA and SAT")

9. Get detailed with elements
When you are styling for a high stakes presentation and you want to get into the intricate details of axes, text, ticks and the like, you can edit elements to get the details right. Let’s say I want bold axis titles and a bigger title font on the above chart.
g6 +
  theme(axis.title = element_text(face = "bold"),
        plot.title = element_text(size = 20))
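
If you find yourself repeating the same element tweaks, one option is to store them as a theme object and add it to any chart. This is only a sketch: theme_report and p_styled are hypothetical names, and the chart uses mtcars rather than the SAT data.

```r
library(ggplot2)

# A reusable style: start from a built-in theme, then adjust single elements
theme_report <- theme_minimal() +
  theme(axis.title = element_text(face = "bold"),
        plot.title = element_text(size = 20))

# Add the stored theme to a chart like any other component
p_styled <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(x = "Weight", y = "Miles per gallon", title = "Weight vs. mpg") +
  theme_report
```

The order matters here: the theme() call comes after theme_minimal() so that the individual element settings are not overwritten by the complete theme.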

10. Use the patchwork package to combine multiple ggplots easily in R Markdown
If you’ve created multiple ggplots and you want to combine them in some sort of way, I find the patchwork package a really easy way to do this and more intuitive than gridExtra::grid.arrange(). I use it in R Markdown, although it should work in any R session, and once you’ve knitted your document you can always save the patchworked image as a separate file afterwards. Let’s say that I want to patch the images g4, g5 and g6 above together, with g4 and g5 side-by-side on the top row and g6 spanning the second row. I can do this using patchwork in an R Markdown document as follows:
library(patchwork)
(g4 | g5) /
g6
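
Here is a self-contained variant of the same layout that does not depend on the SAT charts, with a shared title and panel tags added through plot_annotation(). The three mtcars charts and the names p1, p2, p3 and combined are placeholders of my own.

```r
library(ggplot2)
library(patchwork)

p1 <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
p2 <- ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point()
p3 <- ggplot(mtcars, aes(x = mpg)) + geom_histogram(bins = 10)

# `|` places plots side by side, `/` stacks rows;
# plot_annotation() titles and tags the whole layout
combined <- (p1 | p2) / p3 +
  plot_annotation(title = "Three views of mtcars", tag_levels = "A")
```

Printing combined draws the layout, and passing it to ggsave() as the plot argument should write it to a file.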

Obviously these ten tips are not meant to be comprehensive, but I hope they illustrate some of the reasons why I love working in ggplot2 and will inspire you to try some new things and get to know the package better.
_Originally I was a Pure Mathematician, then I became a Psychometrician and a Data Scientist. I am passionate about applying the rigor of all those disciplines to complex people questions. I’m also a coding geek and a massive fan of Japanese RPGs. Find me on LinkedIn or on Twitter. Also check out my blog on drkeithmcnulty.com or my soon to be released textbook on People Analytics._