Animating regression models in R using broom and ggplot2

It might not help the message, but it looks good on Twitter

Chris Bow
Towards Data Science

--

My first article in Towards Data Science was the result of a little exercise I set myself to keep the little grey cells ticking over. This is something of a similar exercise, albeit a bit more relevant to a project I’ve been working on. As I spend my time working in a marketing department, I have to get used to wearing [at least] two hats.

Often, these hats are mutually exclusive. In this case, the disagreement is in the form of another piece of animated data visualisation. As with the animated Scottish rugby champions graph, this example doesn’t really benefit from adding the animation as another dimension to the plot.

The graph is simply to show the trends for some metrics to do with UK university fundraising over time. I only really need x and y to represent the value and the year, but where’s the fun in that? That’s the sort of thing we can plot ridiculously easily using ggplot2:

ggplot(fund_tidy, aes(x = year, 
y = value,
colour = kpi)) +
geom_line()

Why not use this as a bit more of a learning exercise? I’ve played about with the gganimate package before, but never really spent any quality time with it. This seemed like a good opportunity.

The dataviz conflict

And that brings us on to the butting of hats. I don’t think that an animated plot is the best way to represent these data. I don’t know if it technically counts as non-data ink, but you get the idea: it’s just not necessary. If x and y were already taken and I wanted to show how those two values changed over time, animation presents those changes in a way that’s easy to understand. In this case, it’s redundant.

For further adventures where marketing meets data science, follow Chris on Twitter.

However, a lot of graphs are made not to represent the data as simply and accurately as possible, but to get attention. In many cases, particularly in the world of the marketing agency, there is a tendency to turn what could be presented as a clear, straightforward bar chart, into a full-on novelty infographic. Tourist footfall over time represented as a cartoon foot with the size of the toe representing the value for each year anyone? But that’s a story for another day.

The truth is, animation catches the eye, and it can increase the dwell time, allowing the reader time to take in the title, axes labelling, legends and the message. Possibly. As well as increasing exposure to any branding. I do have some principles though, so I wouldn’t ever intentionally set out to make a graph that was misleading. Playing with the colour schemes and layout to make it look a bit sleeker? Absolutely, but the data has to come first.

A CASE of trends

I had been doing some university fundraising work looking at historic Ross-CASE reports, and thought it would be interesting to look at how some of the key performance indicators had changed over time. I’d looked at some of the main ones before, but hadn’t looked at a few others, and thought it might be interesting to look at them together. And it would be some good ggplot2 and gganimate practice. So let us begin.

n.b. As the aim of this exercise was to compare underlying trends and spend more time with gganimate, not to produce a publication-quality figure, hence a somewhat ‘cavalier’ attitude to y axis labelling!

No onion skinning here

As ever, importing my pre-made dataset and having a quick look was first on the agenda:

# import yearly data (total, summed values, not means or medians)
# dataset compiled from historical Ross-CASE reports
library(readr)fund_df <- read_csv("year_sum.csv")# quick look at datalibrary(dplyr)glimpse(fund_df)
Observations: 12
Variables: 6
$ year <int> 2005, 2006, 2007, 2008, 2009, 2...
$ new_funds_raised <int> 452, 548, 682, 532, 600, 693, 7...
$ cash_received <int> 324, 413, 438, 511, 506, 560, 5...
$ fundraising_staff <int> 660, 734, 851, 913, 1043, 1079,...
$ contactable_alumni <dbl> 5.7, 6.2, 6.9, 7.7, 8.3, 8.0, 8...
$ contact_alum_x100 <dbl> 570, 620, 690, 770, 830, 800, 8...
library(ggplot2)ggplot(fund_df, aes(x = year,
y = new_funds_raised)) +
geom_line()

Okay, we have a dataset that seems to look how I would expect it to from previous work, so hopefully I’ve not screwed things up at the first hurdle. Onward!

As the values for contactable_alumni were a couple of orders of magnitude away from the rest of the values, I created a new column where those were multiplied by 100 to put them on the same scale. I then gathered the data into a tidy, ‘long’, format:

# create contactable alumni x100 variable to place values on equivalent scalefund_df <-
fund_df %>%
mutate(contact_alum_x100 = contactable_alumni * 100)
# create tidy dataframelibrary(tidyr)fund_tidy <-
fund_df %>%
gather(kpi, value, - year) %>%
mutate(kpi = as.factor(kpi))
glimpse(fund_tidy)
Observations: 60
Variables: 3
$ year <int> 2005, 2006, 2007, 2008, 2009, 2010, 2011, 20...
$ kpi <fct> new_funds_raised, new_funds_raised, new_fund...
$ value <dbl> 452, 548, 682, 532, 600, 693, 774, 681, 807,...

With the data transformed, we were ready to create our first animated plot, remembering to start by filtering out out original contactable_alumni variable:

# create animated plotlibrary(gganimate)
library(transformr)
first_animate <-
fund_tidy %>%
filter(kpi != "contactable_alumni") %>%
ggplot(aes(x = year,
y = value,
colour = kpi)) +
geom_line() +
# this next line is where the magic happens:
transition_reveal(kpi, year) +
labs(title = "Trends in University Fundraising KPIs Over Time",
subtitle = "Data from Ross-CASE reports",
x = "Year",
y = 'Value',
caption = "y axis labelling omitted due to differences in scale between KPIs",
colour = "KPI") +
scale_colour_discrete(labels = c("Cash received",
"Contactable alumni",
"Fundraising staff",
"New funds raised")) +
scale_y_discrete(labels = NULL) +
theme_chris()

And we’re off. But is that as good as it could be? I don’t think so. The main thing for me is that, as we’re interested in trends, we should have trendlines on there as well. How to go about that…?

To do that in a non-animated way, we’d simply add a geom_smooth() to our plotting code:

# create non-animated plot with trendlinesfund_tidy %>%
filter(kpi != "contactable_alumni") %>%
ggplot(aes(x = year,
y = value,
colour = kpi)) +
geom_line() +
geom_smooth(method = "lm", linetype = "dashed", se = FALSE) +
labs(title = "Trends in University Fundraising KPIs Over Time",
subtitle = "Data from Ross-CASE reports",
x = "Year",
y = 'Value',
caption = "y axis labelling omitted due to differences in scale between KPIs",
colour = "KPI") +
scale_colour_discrete(labels = c("Cash received",
"Contactable alumni",
"Fundraising staff",
"New funds raised")) +
scale_y_discrete(labels = NULL) +
theme_chris()

But can we simply do that and add the transition_reveal() line to animate that in the same way? Not in any convenient way I could find and believe me, there were some impressive plots produced during the course of my failures. There may very well be an easy way to do it with geom_smooth(), but five minutes searching Stackoverflow didn’t find it, and I had another idea. My next thought was to create the trendlines as a separate stage in the process, building another dataframe from which to build my animated plot:

#---- create linear model and augmented dataframe ----# build pre-filtered dataframefund_tidy2 <-
fund_tidy %>%
filter(kpi != "contactable_alumni")
# build linear modellin_mod <- lm(value ~ year + kpi, data = fund_tidy2)# augment linear model to produce tidy dataframe with fitted valueslibrary(broom)aug_mod <- augment(lin_mod)# create animated graphaug_animate <-
aug_mod %>%
ggplot(aes(x = year,
y = value,
colour = kpi)) +
geom_line(aes(group = kpi, y = .fitted), size = 0.5, linetype = "dashed") +
geom_point(size = 2) +
geom_line(aes(group = kpi)) +
transition_reveal(kpi, year) +
labs(title = "Trends in University Fundraising KPIs Over Time",
subtitle = "Data from Ross-CASE reports",
x = "Year",
y = 'Value',
caption = "y axis labelling omitted due to differences in scale between KPIs",
colour = "KPI") +
scale_colour_discrete(labels = c("Cash received",
"Contactable alumni",
"Fundraising staff",
"New funds raised")) +
theme_chris()
# animate and saveaug_animated <- animate(aug_animate, height = 500, width = 800)anim_save("aug_animated.gif", animation = aug_animated)

Oh dear, of course, that hasn’t worked. Well, the animation part has worked exactly as we wanted, but the trendlines are wrong. Due to the way we built the model, we have have created a parallel slopes type of linear regression. In doing that, we’ve lost the key finding of the data: that the number of fundraising staff is rising faster than the acquisition of new funds.

Multiple models

In order to get what we want, we’ll have to build a separate model for each KPI. This is very easy to do using tidy principles in R. By grouping by KPI and nesting in a tibble, we can build multiple models quickly and easily using the map function from the purrr package.

An alternative would be to create a separate dataframe per KPI, build and augment models separately, then bind them back together. That’s a great solution, but not really scalable. That said, it’s what I did first before I decided to bite the bullet and try and do it the ‘proper’ way:

#---- build multiple models for animated plot with trendlines ----# build nested tibblefund_nested <-
fund_tidy2 %>%
group_by(kpi) %>%
nest()
# build separate regression modelsfund_models <-
fund_nested %>%
mutate(lm_mod = map(data,
~lm(formula = value ~ year,
data = .x)))
# augment models and unnest tibblefund_models_aug <-
fund_models %>%
mutate(aug = map(lm_mod, ~augment(.x))) %>%
unnest(aug)
case_animate <-
fund_models_aug %>%
ggplot(aes(x = year,
y = value,
colour = kpi)) +
geom_line(aes(group = kpi, y = .fitted), size = 0.5, linetype = "dashed") +
geom_point(size = 2) +
geom_line(aes(group = kpi)) +
transition_reveal(kpi, year) +
labs(title = "Trends in University Fundraising KPIs Over Time",
subtitle = "Data from Ross-CASE reports",
x = "Year",
y = 'Value',
caption = "y axis labelling omitted due to differences in scale between KPIs",
colour = "KPI") +
scale_colour_discrete(labels = c("Cash received",
"Contactable alumni",
"Fundraising staff",
"New funds raised")) +
scale_fill_discrete() +
theme_chris()

And that’s about what we want. I’ve added a separate geom_point() for a bit of aesthetic niceness, and there are a few more things we could do to make things a bit prettier, but, generally, mission accomplished and knowledge reinforced!

Follow Chris on Twitter.

The code in this article can be found on GitHub.

--

--

Former immunologist turned data scientist and marketer. Proponent of applying scientific thinking to non-scientific problems. Consultant for Cairney & Company.