The world’s leading publication for data science, AI, and ML professionals.

R Toolkit for People Analytics: Telling Your Headcount Story

Common challenges in People Analytics solved using R

Working in People Analytics, you are often asked to tell the story of your company’s headcount and how the company evolved to what it is today. I often see this presented as a waterfall chart, which can be great, but it gets murky when trying to share year-over-year changes and to a less technical audience.

To address this need, I created iterative plots highlighting each year with some additional context. The plots can then be added to a PowerPoint to go through each year at a time, or can be animated together into a gif. Let’s make it together!

Challenge: Tell the story of how our headcount has changed year-over-year to what it is today.

Steps:

  1. Load necessary packages and data
  2. Calculate monthly headcount
  3. Add relevant context for each year
  4. Create a plot
  5. Set up to automatically create plot for each year
  6. Adjust theme and plot formatting

1. Load necessary packages and data

For this challenge, we will need the following packages:

  • tidyverse
  • hrbrthemes (for making our plots prettier)

To create our visual, we will need a file that contains a unique identifier (i.e., employee ID), hire date, and termination date. I am going to use mock data for this example (at the bottom I have included the code used to make the mock data if you want to follow along step-by-step).

# load packages
library(tidyverse)
library(hrbrthemes)

# load data
employee_data <- mock_data 

# alternatively you could use something like employee_data <- read.csv("input.csv")

As a side note, I generally assign a variable to my original read in of the data and then create a new variable that I use for future manipulations. This isn’t always necessary but makes things go faster when working with a large dataset so you don’t have to reload the data each time you need to change your code.

For the calculations to work correctly, we need to make sure that R knows that hire date and termination date are in fact dates. In general, working with dates in R can be a real pain but for the purpose of this challenge, we need to format the date columns as dates and that there are no NAs.

df <- employee_data %>%
  mutate(Hire.Date = as.Date(Hire.Date, format = "%m/%d/%Y"),
         Termination.Date = as.Date(Termination.Date, format = "%m/%d/%Y"))

In my inputs file, employees that are still active have a blank for termination date, because of course, they have not yet terminated. R is going to get all kinds of fussy if we have blanks in a date column so we are going to add a line of code that assign a date far in the future.

df <- employee_data %>%
  mutate(Hire.Date = as.Date(Hire.Date, format = "%m/%d/%Y"),
         Termination.Date = as.Date(Termination.Date, format = "%m/%d/%Y")) %>%
  mutate(Termination.Date = if_else(is.na(Termination.Date), 
         as.Date("2100-12-31"), Termination.Date))

This last line is saying anywhere that there is an NA/blank in the termination date column, assign a date far in the future. In this case, December 31, 2100. Let us all hope I’m not still working then.

2. Calculate monthly headcount

Hopefully this step looks simple, but I had a doozy of a time figuring it out so please be patient with yourself.

First we are going to create a sequence that will have a date for each month, then we will set up a data frame to be a placeholder for our monthly headcounts, and finally we are going to use a sapply function to calculate headcount for each month. Here we go!

Create a sequence for a date for each month (e.g., 1/1/2023, 1/2/2023, etc.):

month_seq <- seq(from = min(df$hire_date),
                 to = max(df$hire_date),
                 by = "1 month")

This is saying start at the minimum hire date, go to the maximum hire date, and sequence by month. This leaves us with one value for each month in our data. Here is what it looks like:

Now we are going to use that sequence to create a starter date frame that we can then add headcount to.

headcount_data <- data.frame(Date = month_seq)

Okay, now for the tricky part. We are going to calculate the count of active employees on each date in our headcount_data data frame. So how many employees are active on 1/1/2018, 2/2/2018, etc.

Let’s say we are wanting to calculate for 1/1/2018. We need to find the number of employees who have a hire date less than or equal to 1/1/2018 and a termination date greater than 1/1/2018. In other words, the number of employees that have already been hired and have not yet terminated.

Then we just use sapply to do that for each date in headcount_data.

headcount_data <- headcount_data %>%
  mutate(Active.Employees = sapply(Date, function(x) {
           sum(x >= df$hire_date & (is.na(df$termination_date) | x < df$termination_date))
         })) 

Still with me? If you’ve gotten everything working to this point, give yourself a big pat on the back! If you’re running into an issue, also give yourself a big pat on the back for making it this far and go to the full code here to see if you can spot any inconsistencies with your code.

3. Add relevant context

This is where the storytelling part comes into things. Depending on your experience with your familiarity with the organization, you might need to interview some subject matter experts or highly tenured employees. Essentially you are wanting to add important context that helps explain any increases or decreases in your headcount.

I want to add context for each year (you could also do monthly) so I am going to add a year column to headcount_data.

headcount_data <- headcount_data %>% 
  mutate(year = as.integer(year(Date))

This adds a column with the year for each date:

Now, we are going to add context for each year. Let’s say for 2020 we want to add the context "COVID-19" and want that to be shown for each month in 2020.

To do this, we are going to use case_when to add a column "context" based on year.

headcount_data <- headcount_data %>%
  mutate(context = case_when(
    year == 2018 ~ "Context for 2018",
    year == 2019 ~ "Context for 2019",
    year == 2020 ~ "COVID-19",
    TRUE ~ "No additional context"
))

In the above, we are saying that for each row where the year is 2018, we want the context column to be "Context for 2018." You can add context for each year of interest, and then in the TRUE clause, you can specify what you want the context to be for any of the years not specified above.

At this point, your headcount_data should look like this:

Now for the fun part! We get to start plotting.

4. Create plot

For starters, we are going to create a basic area chart with all of our data using ggplot. We will have Date on our x-axis and Active.Employees on our y-axis so we can see our headcount overtime.

headcount_data %>%
  ggplot(aes(x = Date, y = Active.Employees)) +
  geom_area()

That will give you this basic plot:

Now we start the process of some basic zhushing before moving onto some more advanced zhushing:

  1. Add annotations
  2. Add title and subtitle

We will be adding annotations with the ending headcount and year (this will become more relevant when we make a plot for each year). Let’s start by assigning them to variables to make it easier to update for each year:

# annotations
annotation_ending_year <- max(headcount_data$year)
annotation_ending_headcount <- max(headcount_data$Active.Employees)

# titles
labels_title <- "Our Headcount Story"
labels_subtitle <- last(headcount_data$context)

Now we are going to add those to our basic plot:

headcount_data %>%
  ggplot(aes(x = Date, y = Active.Employees)) +
  geom_area() +
  labs(title = labels_title,
       subtitle = labels_subtitle) +
  annotate("text", 
           x = max(headcount_data$Date),
           y = max(headcount_data$Active.Employees),
           label = annotation_ending_headcount,
           hjust = -.25)

This will give us our basic plot with some additional context:

Now that we have created our basic plot, we want to automatically create an additive plot for each year. So there will be a plot 2018 to the end of 2018, 2018 to the end of 2019, 2018 to the end of 2020, etc.

5. Automatically create a plot for each year

We are going to use a for loop to create a plot for each year in our data set.

Basically, we are going to get each unique year in our dataset in a vector called "years." Then for each year in "years" we are going to create a subset of our data and then a plot of that subset. That probably sounds confusing but looking at the code will probably make more sense.

First some set-up:

# create a vector for unique years
years <- unique(headcount_data$year)

# empty list for plots to go to
plots <- list()

Now for the loop! It might look overwhelming, but just take it one step at a time:

# loop over the each year in years and create plots
for (i in 2:length(years)) {
  # create subset adding one year at a time
  subset_df <- headcount_data %>% 
    filter(year <= years[i])

  # calculations for annotation
  annotation_ending_year <- max(subset_df$Date)

  annotation_ending_active <- subset_df %>% 
    filter(Date == ending_year) %>% 
    select(Active.Employees) %>% 
    as.numeric()

  # create a plot (p) using the subset
  p <- subset_df %>%
    ggplot(aes(x = Date, y = Active.Employees)) +
    geom_area() +
    labs(title = labels_title,
         subtitle = labels_subtitle) +
    annotate("text", 
             x = max(subset_df$Date),
             y = max(subset_df$Active.Employees),
             label = ending_active,
             hjust = -.25)

  # save each plot 
  ggsave(p, 
         file = paste("example_plot_", years[i], ".png"), 
         height = 6, width = 8, units = "in")
}

You should now have an a plot saved for each year in your working directory called "example_plot_year." I like having a separate plot for each year so I can put each one in a slide and pause as folks have questions. Alternatively, you can animate the plots together and create a gif or use a screen recorder like ScreenToGif and get something like this:

We did it!!!! All that is left is to add some styling to make the plot more true to your brand and add a rectangle to highlight the most recent year.

6. Adjust theme and plot formatting

The first thing that I want to do is add a rectangle that highlights the most recent year. This will help the viewer know where to focus and will update in each plot so we can look at one year at a time, within the larger context.

We will do this by adding another annotate layer "rect" that will look something like this:

annotate("rect", xmin = , xmax = , ymin = , ymax = )

This is another aspect that took me a while to get it to the way I wanted but the keys here are:

X-axis: I want the rectangle to start at the first (aka floor) date within the given year (aka max year in our subset of data) and to end at the last (aka ceiling) date within the given year (aka max year in our subset of data). So for the plot for 2019, we want the rectangle to start at 1/1/2019 and end at 12/1/2019.

annotate("rect", 
          xmin = floor_date(max(subset_df$Date), "year"), 
          xmax = ceiling_date(max(subset_df$Date), "year")

Y-axis: I want the rectangle to start at the y-axis and to end above the final headcount for that year so it is easier to read without being crowded. Looking at 2019 again, I want the rectangle to start right at the y-axis and to end above (+300) from the final headcount of 240.

    annotate("rect",
             xmin = floor_date(max(subset_df$Date), "year"),
             xmax = ceiling_date(max(subset_df$Date), "year"),
             ymin = -Inf, ymax = ending_active + 300)

Styling: Final, I’ll make the box be gray and change the alpha to 0.1 so it is fairly transparent and you can see the area chart underneath:

    annotate("rect",
             xmin = floor_date(max(subset_df$Date), "year"),
             xmax = ceiling_date(max(subset_df$Date), "year"),
             ymin = -Inf, ymax = ending_active + 300,
             alpha = .1, color = "gray", fill = "gray")

Limiting the axes: To help make the transitions smoother, I’m going to put limits on the x and y-axes so each plot is scaled the same.

scale_x_date(breaks = "1 year", date_labels = "%Y",
                 expand = c(.1,.1),
                 limits = c(min(headcount_data$Date), max(headcount_data$Date)))

Wahoo! We are so close, now I’m going to make some changes to the theme and then pour myself a glass of wine. Now is the time to put your own spin on things, mine ended up looking something like this:

Here is the code for my final for loop:

# loop over the each year in years and create plots
for (i in 2:length(years)) {
  # create subset adding one year at a time
  subset_df <- headcount_data %>%
    filter(year <= years[i])

  # calculations for annotation
  ending_year <- max(subset_df$Date)

  ending_active <- subset_df %>%
    filter(Date == ending_year) %>%
    select(Active.Employees) %>%
    as.numeric()

  # create a plot (p) using the subset
  p <- subset_df %>%
    ggplot(aes(x = Date, y = Active.Employees)) +
    geom_area(fill = "#457b9d") +
    labs(title = "Our Headcount Story",
         subtitle = paste(years[i],":", last(subset_df$context)),
         x = "", y = "") +
    scale_x_date(breaks = "1 year", date_labels = "%Y",
                 expand = c(.1,.1),
                 limits = c(min(headcount_data$Date), max(headcount_data$Date))) +
    theme_classic(base_family = "Arial") +
    theme(plot.title = element_text(size = 24, face = "bold", color = "#457b9d"),
          plot.subtitle = element_text(size = 18),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          axis.ticks.y = element_blank(),
          axis.text.y = element_blank(),
          axis.line.y = element_blank()) +
    annotate("text", x = ending_year,
             y = ending_active, label = ending_active,
             vjust = -1.25, hjust = -.25, color = "#457b9d") +
    annotate("rect",
             xmin = floor_date(max(subset_df$Date), "year"),
             xmax = ceiling_date(max(subset_df$Date), "year"),
             ymin = -Inf, ymax = ending_active + 300,
             alpha = .1, color = "gray", fill = "gray")

  # save each plot
  ggsave(p,
         file = paste("example_plot_final", years[i], ".png"),
         height = 6, width = 8, units = "in")

}

ALLLLLL DOOOOONNNNEEEE!

We now have a dynamic view of how our headcount has changed overtime with additional context in our subtitle. Some ideas for future iterations: making the plots with gganimate, adding a % change to each year’s headcount, changing the color of the graph if headcount has increased or decreased, adding a projection of growth trend line, the possibilities are endless!

Did you try making one? If so, I would love to see what you came up with!

Full code on Github, here.


Want more general People Analytics resources?

12+ Free Resources to Get Started in People Analytics


If you want more resources like this and access to all the great content across this site, you can use my link to sign up for $5 a month (I’ll get a small commission at no additional cost to you).

Join Medium with my referral link – Jenna Eagleson

Jenna Eagleson My background is in Industrial-Organizational Psychology and I have found my home in People Analytics. Data viz is what makes my work come to life. I have fun learning and developing with Power BI, R, Tableau, and other tools I come across. I would love to hear more about your journey! Reach me on Linkedin or Twitter.


Related Articles