
Beginner’s Guide to Enhancing Visualizations in R

Develop your visualization skills in R with this practical, step-by-step tutorial.

Photo by Carlos Muza on Unsplash

Learning to build complete visualizations in R is like any other data science skill: it’s a journey. RStudio’s ggplot2 is a useful package for telling a data story, so if you are newer to ggplot2 and would love to develop your visualization skills, you’re in luck. I developed a quick, practical guide to help beginners advance their understanding of ggplot2 and design a couple of polished, business-insightful graphs, because early success with visualizations can be very motivating!

This tutorial assumes you have completed at least one introduction to ggplot2, like this one. If you haven’t, I encourage you to complete one first to get some basics down.

By the end of this tutorial you will:

  • Deepen your understanding of how to enhance visualizations in ggplot2
  • Become familiar with navigating the ggplot2 cheat sheet (useful tool)
  • Build two original, polished visuals shown below through a simple, step-by-step format
Visualization #2

Before we begin, here are a couple of tools that can support your learning. The first is the ‘R Studio Data Visualization with ggplot2 cheat sheet’ (referred to as ‘cheat sheet’ from now on). We will reference it throughout so you learn to navigate it for future use.

The second is a ggplot2 Quick Guide I made to help me build ggplots on my own faster. It’s not comprehensive, but it may help you more quickly understand the big picture of ggplot2.

Let’s go!

For this tutorial, we will use the IBM HR Employee Attrition dataset, available here. This data offers (fictitious) business insight and requires no preprocessing. Sweet!

Let’s install libraries and import the data.

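# install ggthemes once (ggplot2 and scales must already be installed)
install.packages("ggthemes")
# load libraries
library(ggplot2)
library(scales)
library(ggthemes)
# import data (replace the placeholder folder with your own path)
data <- read.csv(file.path('C:/YourFilePath', 'data.csv'), stringsAsFactors = TRUE)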

Then check the data and structure.

# view first 6 rows (the default for head())
head(data)
# check structure
str(data)

Upon doing so, you will see that there are 1470 observations with 35 employee variables. Let’s start visual #1.

Visual #1

HR wants to know how monthly income is related to employee attrition by job role.

Step 1. Data, Aesthetics, Geoms

For this problem, ‘JobRole’ is our X variable (discrete) and ‘MonthlyIncome’ is our Y variable (continuous). ‘Attrition’ (yes/no) is our z variable.

Check side 1 of your cheat sheet under ‘Two Variables: Discrete X, Continuous Y,’ and note the various graphs. We will use geom_bar(). On the cheat sheet, it’s listed as geom_bar(stat = ‘identity’). This would give us total monthly income of all employees. We instead want average, so we insert (stat = ‘summary’, fun = mean).
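Before plotting the real data, it can help to see the difference between the two stats on a tiny example. The sketch below uses a made-up data frame (the `toy` data, its column names, and its values are assumptions for illustration, not part of the IBM dataset) and layer_data() to read back the bar heights ggplot2 computes.

```r
library(ggplot2)

# Toy data, made up for illustration (not the IBM dataset)
toy <- data.frame(
  role   = c('Analyst', 'Analyst', 'Manager', 'Manager'),
  income = c(2000, 4000, 9000, 11000)
)

# stat = 'identity' stacks every row, so each bar shows the total per role
p_total <- ggplot(toy, aes(x = role, y = income)) +
  geom_bar(stat = 'identity')

# stat = 'summary' with fun = mean collapses each role to its average
p_mean <- ggplot(toy, aes(x = role, y = income)) +
  geom_bar(stat = 'summary', fun = mean)

# layer_data() returns the values ggplot2 actually draws for a layer
mean_heights <- layer_data(p_mean)$y
mean_heights  # one average per role, in alphabetical order of role
```

Printing p_total and p_mean side by side shows the same two bars at very different heights, which is why the choice of stat matters here.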

# essential layers
ggplot(data, aes(x = JobRole, y = MonthlyIncome, fill=Attrition)) +
  geom_bar(stat = 'summary', fun = mean) #Gives mean monthly income

We obviously can’t read the names, which leads us to step 2…

Step 2. Coordinates and Position Adjustments

When names are too long, it often helps to flip the x and y axes. To do so, we will add coord_flip() as a layer, as shown below. We will also unstack the bars to better compare Attrition, by adding position = ‘dodge’ within geom_bar() in the code. Refer to the ggplot2 cheat sheet side 2, ‘Coordinate Systems’ and ‘Position Adjustments’ to see where both are located.

# unstack bars and flip axes
ggplot(data, aes(x = JobRole, y = MonthlyIncome, fill=Attrition)) +
  geom_bar(stat = 'summary', fun = mean, position = 'dodge') +
  coord_flip()

Step 3. Reorder bars from highest to lowest

Now, let’s reorder the bars from highest to lowest Monthly Income to help us better analyze by Job Role. Add the reorder code below within the aesthetics line.

# reordering job role
ggplot(data, aes(x = reorder(JobRole, MonthlyIncome), y = MonthlyIncome, fill = Attrition)) +
  geom_bar(stat = 'summary', fun = mean, position = 'dodge') +
  coord_flip()
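To see what reorder() does on its own, here is a minimal sketch on made-up values (the `role` and `income` vectors are assumptions for illustration). By default, reorder() sorts the factor levels by the mean of the second argument, in ascending order.

```r
# Toy values, made up for illustration
role   <- factor(c('Analyst', 'Manager', 'Clerk', 'Analyst', 'Manager', 'Clerk'))
income <- c(3000, 10000, 2000, 5000, 12000, 2500)

# Factor levels are alphabetical by default
levels(role)

# reorder() sorts levels by the mean of income within each role (FUN = mean is the default)
ordered_role <- reorder(role, income)
levels(ordered_role)

# Negating the values reverses the order
reversed_role <- reorder(role, -income)
levels(reversed_role)
```

After coord_flip(), the first level is drawn at the bottom, so the ascending order produced by reorder() places the highest average income at the top of the chart.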

Step 4. Change bar colors and width

Let’s change the bar colors to "match the company brand." This must be done manually, so find scale_fill_manual() on side 2 of the cheat sheet, under "Scales." It lists colors in base R. You can try some, but they aren’t "company colors." I obtained the color #s below from color-hex.com.

Also, we will narrow the bar widths by adding ‘width=.8’ within geom_bar() to add visually-appealing space.

ggplot(data, aes(x = reorder(JobRole, MonthlyIncome), y = MonthlyIncome, fill = Attrition)) +
  geom_bar(stat='summary', fun=mean, width=.8, position='dodge') +
  coord_flip() +
  scale_fill_manual(values = c('#96adbd', '#425e72'))
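One optional refinement, sketched here on made-up data: scale_fill_manual() also accepts a named vector, which ties each color to a specific level instead of relying on level order. The `toy` data frame and the `brand_colors` name are assumptions for illustration.

```r
library(ggplot2)

# Toy data, made up for illustration (not the IBM dataset)
toy <- data.frame(
  role      = c('Analyst', 'Analyst', 'Manager', 'Manager'),
  attrition = c('No', 'Yes', 'No', 'Yes'),
  income    = c(3000, 2500, 10000, 9000)
)

# Names match the levels of the fill variable, so the mapping is explicit
brand_colors <- c(No = '#96adbd', Yes = '#425e72')

p <- ggplot(toy, aes(x = role, y = income, fill = attrition)) +
  geom_bar(stat = 'summary', fun = mean, width = .8, position = 'dodge') +
  scale_fill_manual(values = brand_colors)

# Inspect the fill colors ggplot2 assigned to the bars
drawn_fills <- unique(layer_data(p)$fill)
drawn_fills
```

With a named vector, the colors stay attached to ‘No’ and ‘Yes’ even if the factor levels are later reordered.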

Step 5. Title and Axis Labels

Now let’s add a title and axis labels. We don’t need an x label since the job titles explain themselves. See the code for how we handled it. Also, check out "Labels" on side 2 of the cheat sheet.

ggplot(data, aes(x = reorder(JobRole, MonthlyIncome), y = MonthlyIncome, fill = Attrition)) +
  geom_bar(stat='summary', fun=mean, width=.8, position='dodge') +
  coord_flip() +
  scale_fill_manual(values = c('#96adbd', '#425e72')) +
  xlab(' ') +  #Removing x label
  ylab('Monthly Income in USD') +
  ggtitle('Employee Attrition by Job Role & Income')

Step 6. Add Theme

A theme will kick it up a notch. We will add a theme layer at the end of our code, as shown below. When you start typing ‘theme’ in R, it shows options. For this graph, I chose theme_clean() from the ggthemes package.

#Adding theme after title
ggtitle('Employee Attrition by Job Role & Income') +
  theme_clean()

Step 7. Reduce graph height and make outlines invisible

Just two easy tweaks. First, we will remove the graph and legend outlines. Second, the graph seems tall, so let’s reduce the height via aspect.ratio within theme(). Here is the full code for the final graph.

ggplot(data, aes(x = reorder(JobRole, MonthlyIncome), y = MonthlyIncome, fill = Attrition)) +
  geom_bar(stat='summary', fun=mean, width=.8, position='dodge') +
  coord_flip() +
  scale_fill_manual(values = c('#96adbd', '#425e72')) +
  xlab(' ') +
  ylab('Monthly Income in USD') +
  ggtitle('Employee Attrition by Job Role & Income') +
  theme_clean() +
  theme(aspect.ratio = .65,
    plot.background = element_rect(color = 'white'),
    legend.background = element_rect(color = 'white'))

Nice. We see that Research Directors who left the company earned a higher average monthly income than those who stayed. The opposite is the case for other job roles.
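A reading like this can be cross-checked against the numbers behind the bars. The sketch below computes group means with base R’s aggregate() on a made-up data frame that borrows the dataset’s column names (the rows and values are assumptions for illustration); the same call should work on the real data.

```r
# Toy data with the dataset's column names; the values are made up
toy <- data.frame(
  JobRole       = c('Analyst', 'Analyst', 'Analyst', 'Manager', 'Manager', 'Manager'),
  Attrition     = c('No', 'No', 'Yes', 'No', 'Yes', 'Yes'),
  MonthlyIncome = c(3000, 5000, 2500, 10000, 12000, 14000)
)

# Mean monthly income for every JobRole / Attrition combination,
# which is what each dodged bar in the chart represents
income_means <- aggregate(MonthlyIncome ~ JobRole + Attrition, data = toy, FUN = mean)
income_means
```

Comparing the ‘Yes’ and ‘No’ rows for each role gives the same comparison the bars show, as exact figures.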

You’ve accomplished a lot. Ready for another go? The Visual #2 walk-through will be a piece of cake.

Visual #2

For the second visual, we want to know if employee attrition has any relationship to monthly income, years since last promotion, and work-life balance. Another multivariate analysis.

Step 1. Data, Aesthetics, Geoms

For this problem, ‘MonthlyIncome’ is our X variable and ‘YearsSinceLastPromotion’ is our Y variable. Both are continuous, so check side 1 of your cheat sheet under ‘Two Variables: Continuous X, Continuous Y.’ We will use geom_smooth(), which draws a smoothed trend line of the kind often added to scatter plots to reveal patterns. ‘Attrition’ will again be differentiated by color.
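A note on what geom_smooth() fits: when no method is given, it picks a smoother based on the size of the data (loess for smaller data, a GAM for larger data), so the line is usually curved. If a straight regression line is wanted instead, the method can be set explicitly. The sketch below does this on made-up data (the `toy` data frame is an assumption for illustration) and reads the fitted line back with layer_data().

```r
library(ggplot2)

# Toy data lying exactly on the line y = 2x + 1, made up for illustration
toy <- data.frame(x = 1:10, y = 2 * (1:10) + 1)

# method = 'lm' requests a straight linear fit;
# se = FALSE removes the confidence shading
p_lm <- ggplot(toy, aes(x = x, y = y)) +
  geom_smooth(method = 'lm', formula = y ~ x, se = FALSE)

# The points ggplot2 computed for the fitted line
fitted_line <- layer_data(p_lm)
head(fitted_line[, c('x', 'y')])
```

This tutorial keeps the default smoother, which follows local patterns in the data more closely than a single straight line.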

ggplot(data, aes(x=MonthlyIncome, y=YearsSinceLastPromotion, color=Attrition)) +
  geom_smooth(se = FALSE) # se = FALSE removes confidence shading

We can see that employees who leave are promoted less often. Let’s delve deeper and compare by work-life balance. For this 4th variable, we need to use ‘Faceting’ to view subplots by work-life balance level.

Step 2. Faceting to add subplots to the canvas

Check out ‘Faceting’ on side 2 of the cheat sheet. We will use facet_wrap() for a rectangular layout.

ggplot(data, aes(x = MonthlyIncome, y = YearsSinceLastPromotion, color=Attrition)) +
  geom_smooth(se = FALSE) +
  facet_wrap(WorkLifeBalance~.)

The facet grids look good, but what do the numbers mean? The data description explains the codes for ‘WorkLifeBalance’: 1 = ‘Bad’, 2 = ‘Good’, 3 = ‘Better’, 4 = ‘Best’. Add them in step 3.

Step 3. Add Labels to Facet Subplots

To add subplot labels, we first define the names in a named character vector, then pass it to the labeller() function through the ‘labeller’ argument of facet_wrap().

# define WorkLifeBalance values
wlb.labs <- c('1' = 'Bad Balance', '2' = 'Good Balance', '3' = 'Better Balance', '4' = 'Best Balance')
#Add values to facet_wrap()
ggplot(data, aes(x = MonthlyIncome, y = YearsSinceLastPromotion, color=Attrition)) +
  geom_smooth(se = FALSE) +
  facet_wrap(WorkLifeBalance~., 
    labeller = labeller(WorkLifeBalance = wlb.labs))
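An alternative that some people prefer, sketched here as an option rather than the approach used in this tutorial, is to convert the numeric codes to a labeled factor before plotting, so the facet strips pick up the labels automatically. The `wlb` vector below is a made-up stand-in for data$WorkLifeBalance.

```r
# Toy vector standing in for data$WorkLifeBalance (made up for illustration)
wlb <- c(3, 1, 4, 2, 3)

# Same code-to-label mapping as in the tutorial
wlb.labs <- c('1' = 'Bad Balance', '2' = 'Good Balance',
              '3' = 'Better Balance', '4' = 'Best Balance')

# levels are the codes, labels are the display names
wlb_factor <- factor(wlb, levels = names(wlb.labs), labels = unname(wlb.labs))

levels(wlb_factor)
as.character(wlb_factor)
```

The labeller approach leaves the data untouched, while the factor approach changes the column once and carries the labels into every later plot.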

Step 4. Labels and Title

Add your labels and title at the end of your code.

facet_wrap(WorkLifeBalance~.,
    labeller = labeller(WorkLifeBalance = wlb.labs)) +
xlab('Monthly Income') +
ylab('Years Since Last Promotion') +
ggtitle('Employee Attrition by Workplace Factors')

Step 5. Add Space Between Labels and Tick Markers

When I look at the graph, the x and y labels seem too close to the tick markers. A simple trick is to insert a newline code (\n) within the label names.

xlab('\nMonthly Income') +  #Adds space above label
ylab('Years Since Last Promotion\n')  #Adds space below label
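The same spacing can also be set through the theme, which keeps the label text itself clean. This is a minimal sketch on made-up data, not the method used in this tutorial; the `toy` data frame and the 12-point margins are assumptions for illustration.

```r
library(ggplot2)

# Toy data, made up for illustration
toy <- data.frame(income = c(1000, 2000, 3000), years = c(1, 2, 3))

p <- ggplot(toy, aes(x = income, y = years)) +
  geom_point() +
  xlab('Monthly Income') +
  ylab('Years Since Last Promotion') +
  # margin() takes top, right, bottom, left (in points by default):
  # t adds space above the x title, r adds space to the right of the y title
  theme(axis.title.x = element_text(margin = margin(t = 12)),
        axis.title.y = element_text(margin = margin(r = 12)))

p
```

The newline trick is quicker to type, while margin() allows finer control over the exact amount of space.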

Step 6. Theme

Loading the ggthemes package with library(ggthemes) gave you more theme options. For a modern look, I went with theme_fivethirtyeight(). Simply add it at the end.

ggtitle('Employee Attrition by Workplace Factors') +
  theme_fivethirtyeight()

Step 7. Override a Theme Default

What happened to our x and y labels? Well, the default for theme_fivethirtyeight() doesn’t have labels. But we can easily override that with a second theme() layer at the end of your code as shown below.

theme_fivethirtyeight() +
theme(axis.title = element_text())

Not bad. But…people may not be able to tell if ‘Better Balance’ and ‘Best Balance’ are for the top or bottom grids right away. Let’s also change our legend location in step 8.

Step 8. Add Space Between Grids and Change Legend Location

Adding space between top and bottom grids and changing the legend location both occur within the second theme() line. See side 2 of cheat sheet under ‘Legends.’

theme_fivethirtyeight() +
theme(axis.title = element_text(),
  legend.position = 'top',
  legend.justification = 'left',
  panel.spacing = unit(1.5, 'lines'))

Step 9. Change Line Color

It would be awesome to change the line colors to pack a visual punch. Standard R colors don’t quite meet our needs, so we will change them manually, just like we did with Visual #1. I obtained the color #s from color-hex.com, which has become a useful tool for us.

Here is the full code for the second visual.

ggplot(data, aes(x = MonthlyIncome, y = YearsSinceLastPromotion, color=Attrition)) +
  geom_smooth(se = FALSE) +
  facet_wrap(WorkLifeBalance~., 
    labeller = labeller(WorkLifeBalance = wlb.labs)) +
  xlab('\nMonthly Income') +  
  ylab('Years Since Last Promotion\n') +
  ggtitle('Employee Attrition by Workplace Factors') +
  theme_fivethirtyeight() +
  theme(axis.title = element_text(),
    legend.position = 'top',
    legend.justification = 'left',
    panel.spacing = unit(1.5, 'lines')) +
  scale_color_manual(values = c('#999999','#ffb500'))

Another job well done. We see that employees in roles lacking work-life balance seem to stay if promotions are more frequent. The difference in attrition is less noticeable in good or higher work-life balance levels.

In this tutorial, we gained skills needed for ggplot2 visual enhancement, became more familiar with the R Studio ggplot2 cheat sheet, and built two nice visuals. I hope that the step-by-step explanations and cheat sheet referencing were helpful and enhanced your confidence using ggplot2.

Many are helping me as I advance my data science and machine learning skills, so my goal is to help and support others in the same way.

