
R for Beginners: Learn How To Visualize Data Like a Pro

A case-study-based introduction that will get you started with the ggplot2 library to create high-quality graphics and learn about the…

R TUTORIAL


If you have ever wanted to analyze data and add to your skillset a toolkit used by professional data scientists, then read on.

Rather than presenting a dry list of commands, this tutorial will use a specific case study as a motivating example to teach you everything you need to know about ggplot2, the de facto standard for creating high-quality graphics in R. It is a third-party library that is part of the tidyverse ecosystem. While you could plot in R using the base library, you will most likely end up using ggplot2 for any actual project. Therefore, I highly recommend that you get straight into using this library for plotting.

This tutorial is suitable for anyone wishing to learn R for their data science project. It does not assume any background knowledge in programming or R. The only prerequisite is a working computer with internet access and with R and RStudio installed. If they are not installed yet, you can check my brief video tutorial here that will help you set up R and then RStudio in no time. Once you set up your computer, come back to this tutorial, and let’s get cracking.

What if I am a Python User?

If you are used to Python, you may be less familiar with the grammar of graphics paradigm, which provides a more systematic way of visually presenting data. This is akin to grammar in a language, where we form sentences by combining nouns, verbs, articles, etc. according to certain rules. I therefore still encourage you to read through and understand this new paradigm. You will never look at graphics the same way again.

ggplot2 is an R implementation of the grammar of graphics paradigm. An equivalent is available in Python in a somewhat lesser-known library, plotnine.


Case Study Motivation

I want to introduce you to the late Hans Rosling, a Swedish statistician best known for his famous talks (~10 million views on a single YouTube video), in which he used actual data to challenge commonly held world views.

In his class on global health, he quizzed his students to identify which country in a given pair had the higher mortality. On average, the students got 1.8 correct answers (out of a possible maximum score of 5). He then repeated this quiz on a select group of professors (they hand out Nobel Prizes), who got a mean score of 2.4. The professors did slightly better than the students, but this score is still worse than that of someone (a chimpanzee, for example) who answers randomly with no knowledge of the world whatsoever and would be expected to get 2.5 out of 5.

How come a chimpanzee would do better than high-achieving students and a select few Professors?

This is because our worldview is dominated by sensationalist media that focuses on negative events, giving us a distorted worldview. The mission of Gapminder Foundation, founded by Hans Rosling, is to use data to correct this distorted worldview. We have tons of data about the world collected by various organizations including United Nations Departments. However, the challenge thus far has been to access this data buried in tables in obscure servers and present it appropriately.

In this case study, we will use data from the Gapminder Foundation to answer 3 key questions:

1: Is there any relationship between life expectancy and GDP per capita?

2: How has life expectancy changed in the last 50 years across the world?

3: Is the world really divided into two distinct groups, the West and the rest (the developing countries)?

Using this case study, we will create six kinds of plots: scatter plot, line chart, boxplot, histogram, density plot, and bar chart.


Grammar of Graphics

Before we get into the plots, here is a very brief detour on the grammar of graphics implemented by ggplot2, the bare minimum that will help you get going.

A graphic is a visual representation of data. It helps us map data to the aesthetic attributes of the geometric object in question. Each of the graph types mentioned earlier is a "geometric object", and every geometric object has some aesthetic properties. For example, a line chart is a geometric object, while the x and y positions and the color are its associated aesthetics. Every data graphic needs three essential components:

  • data
  • geometric object (specified by geom)
  • attributes (specified by aes)

If you have the above three, you can combine them to form a meaningful graphical object. Don’t worry if this feels a little abstract. This will all start making sense once you see the associated code and realize how ridiculously simple it is to plot sophisticated graphics, simply by combining these three components.
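
As a minimal sketch of how the three components fit together, the snippet below plots a tiny made-up data frame. The data frame `toy` and its columns `hours` and `score` are invented purely for illustration and are not part of the case study.

```r
library(ggplot2)

# A tiny, made-up data frame used only for illustration (an assumption,
# not data from the case study).
toy <- data.frame(
  hours = c(1, 2, 3, 4, 5),
  score = c(52, 58, 65, 71, 80)
)

# data + aesthetic mapping + geometric object = a plot
toy_plot <- ggplot(data = toy,                   # 1. data
                   aes(x = hours, y = score)) +  # 2. aes: columns mapped to x and y
  geom_point()                                   # 3. geom: draw each row as a point
toy_plot
```

Swapping geom_point() for another geom, such as geom_line(), changes the type of graphic while the data and the aesthetic mapping stay the same.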


Scatterplot

We will start by plotting a scatterplot. A scatterplot is one of the basic plot types and helps us visualize the relationship between two numerical quantities. We will use a scatterplot to answer our first question, the relationship between GDP per capita and life expectancy. Both these quantities are numerical, and therefore this is a suitable plot to give us an idea of any relationship between the two.

Data Loading and Inspection

Let’s start by importing the data. We will use data from the Gapminder Foundation. You can get data directly from their website. However, for this tutorial, we are going to use a third-party library called gapminder. (This library is developed by Jenny Bryan, who now works on the tidyverse team with RStudio).

Open RStudio and start by creating a new R script file. Let’s start by installing the third-party package. You only need to do this the first time.

install.packages("gapminder")

After installation, load the package. You will need to do this every time you open a new RStudio session.

library(gapminder)

Also load the tidyverse package that loads several packages as part of the whole ecosystem including ggplot2 (you can load each of the individual packages separately too, if you so wish). If this is your first time, then install the tidyverse packages first.

install.packages("tidyverse") # required only the first time
library(tidyverse)

The gapminder package has several datasets within it. The one we will use is called "gapminder". Go ahead and inspect the dataset. There are several commands to inspect the data. You can use head and tail commands that will provide you with the first few or last few rows in the dataset. However, the command that I highly recommend using is the summary command.

head(gapminder, n=8) # list the first 8 rows in the dataset
tail(gapminder, n=10) # list the last 10 rows in the dataset
summary(gapminder) # recommended method

The summary command returns an overall summary of all columns in the dataset, which gives you a good bird’s eye view. As the output below shows, the gapminder dataset has 6 columns. We can also see the number of rows in each category for the categorical columns, and summary statistics (mean/median/max/min etc.) for the numerical columns.

The output of the summary command providing a bird's eye view of the dataset (Image by Author)
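
A few other base R commands can complement summary when you inspect a dataset. The sketch below assumes only that the gapminder package is installed. The choice of commands is illustrative rather than a fixed recipe.

```r
library(gapminder)

dim(gapminder)                # number of rows and columns
str(gapminder)                # column names, types, and the first few values
levels(gapminder$continent)   # categories of a categorical (factor) column
range(gapminder$year)         # first and last year covered by the data
```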

Basic Plot

Let’s now create our first scatterplot. We will use the ggplot function, and then pass the three essential components:

  • data (the data we wish to plot)
  • geom (the type of object we expect to see in the plot such as a line, bar, points)
  • aes (aesthetic attributes of the geometric object such as position, color, shape and size)

Let’s go ahead and pass these three components to the ggplot function and assign the resulting object to a new variable, p.

p <- ggplot( data=gapminder, aes(x = gdpPercap, y = lifeExp)) + geom_point()
p

This is what your first plot, p, should look like.

Your first scatterplot (Image by Author)

We will now gradually build this figure by adding additional layers and soon you will see how transformative it gets with minimal code.

Scale x-axis to a Logarithmic Scale

While there does seem to be a relationship between GDP per capita and life expectancy, the x-axis scale seems compressed. We could change the x-axis scale to a logarithmic scale simply by adding an extra command to the existing p object.

p <- p + scale_x_log10()

The scatterplot after transforming the x-axis to a logarithmic scale (Image by Author)

This looks better, and we can see a clear relationship now.
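
One rough way to put a number on this visual impression is to compare the linear correlation before and after the log transform. This is only a sketch: the Pearson correlation measures straight-line association, and the names `r_raw` and `r_log` are chosen here for illustration.

```r
library(gapminder)

# Pearson correlation with life expectancy, before and after the log transform
r_raw <- cor(gapminder$gdpPercap, gapminder$lifeExp)
r_log <- cor(log10(gapminder$gdpPercap), gapminder$lifeExp)
c(raw = r_raw, log10 = r_log)
```

A noticeably larger value for the log-transformed version would agree with what the plot suggests: life expectancy rises roughly linearly with the logarithm of GDP per capita.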

Color Attribute for Categorical Data

While a scatterplot shows a relationship between numerical values, we can still include categorical information by using the color attribute. Let’s represent each of the continents separately.

p <- p + aes(color=continent)

The scatterplot now showing each continent in a different color (Image by Author)

Other Aesthetics

There are several additional aesthetics that can be used to map data to a visual cue. Generally, I wouldn’t expect anyone to remember them all. There is an official cheat sheet you can get (from the RStudio site). Once you get used to this and practice, it will become second nature. You will remember the aesthetics you use frequently, and you will be more than capable of consulting the official documentation or cheat sheet to find any aesthetic that is appropriate for your application.

In our case, let’s try to use one more aesthetic, the size attribute, and use that to represent population size.

p + aes(size=pop)

The scatterplot now showing the size attribute (Image by Author)

Add Jitter to Reveal Overlapping Points

A limitation of the scatterplot is that it cannot differentiate a location with a very large number of overlapping points from a location with only a few points. Two ways to deal with such cases are:

  • Change the transparency of each data point in the scatter plot. This is controlled by a parameter named alpha that can be passed to the geom_point() command. By default, the value of alpha is 1, which corresponds to 100% opacity. You can now try changing the value of alpha in geom_point and see what you get.
Hint: geom_point(alpha = 0.1)
  • Add jitter to the points, which is equivalent to giving each data point a small nudge so that it moves slightly in a random direction. To create a jittered plot, all we need to do is replace the command geom_point with geom_jitter and specify the amount of jitter using the width and height attributes. Here is an example of using jitter with specific values assigned to height and width. Go ahead and play with these values and see how much difference it makes when you change the height and width attributes in jitter. Understand that jitter is added for visual appearance only.
p + geom_jitter(width=2,height=2)
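
Because p already contains a geom_point() layer, adding geom_jitter() to it draws the jittered points on top of the original ones. The sketch below rebuilds the plot so that geom_jitter() is the only point layer and combines it with transparency. The name `p_jitter` and the width, height, and alpha values are illustrative assumptions. As far as I understand, jitter is applied after the scale transformation, so on a log10 x-axis the width is measured in powers of ten and small values are usually enough.

```r
library(ggplot2)
library(gapminder)

# Rebuild the scatterplot from scratch so that geom_jitter() replaces
# geom_point() instead of being drawn on top of it.
p_jitter <- ggplot(data = gapminder,
                   aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_jitter(width = 0.05, height = 1, alpha = 0.3) +  # illustrative values
  scale_x_log10()
p_jitter
```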

Add Trend Line

While you can visually get an idea of the trend by just looking at the data points, you can also plot an actual line fitted to the data points. All you need to do is simply add an extra layer called geom_smooth(). In the code below, we have passed an additional value to the "line width" parameter and this results in fitting a thick line to every group of data points.

p + geom_smooth(lwd=3)

Adding a trend line to each group of data points (Image by Author)

With just the scatterplot above, we can clearly answer our first question. There is a strong relationship between life expectancy and GDP per capita. This is true for all 5 continents.
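
When no method is given, geom_smooth() chooses a flexible smoother for you. If you would rather see a straight line per continent, a linear fit can be requested explicitly. This is a sketch of one alternative, not the plot shown in the figure. The name `p_lm` and the alpha value are illustrative assumptions.

```r
library(ggplot2)
library(gapminder)

# One straight trend line per continent, fitted by least squares
# on the log10-transformed x-axis.
p_lm <- ggplot(data = gapminder,
               aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point(alpha = 0.3) +
  scale_x_log10() +
  geom_smooth(method = "lm", se = FALSE)  # se = FALSE hides the confidence band
p_lm
```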


Line Charts

Like scatterplots, line graphs also show a relationship between two numerical quantities. They are typically more useful when one of the variables is ordered (e.g. a timeline). We will use a line chart to investigate how life expectancy has changed over the last 50 years.

All we need to do is use geom_line() instead of geom_point().

p <- ggplot( data=gapminder, aes(x = year, y = lifeExp)) + geom_line()

This will give us the following plot.

line plot without any grouping and using all data (Image by Author)

However, we would instead like to plot the data separately for each country. We can therefore group the data by country using the "group" aesthetic.

p <- p + aes(group=country)

Line plot, one for each country (Image by Author)

This looks better, and we can see an increase from ~1950 to ~2010. However, we would like to see the trend separately for each continent as the graph currently looks quite cluttered.

p <- p + aes(color=continent)

Line plots for each country, separated by continent (Image by Author)

Faceting

Rather than separating different categories by color, we can also plot them separately in small subplots. We can do this with a faceting layer. The figure below is generated by adding a facet layer to the original command. Each subplot uses the same x and y axes, and the data is split by "continent", the variable passed to facet_wrap.

p <- p + facet_wrap(~continent)

Life expectancy over the years, one subplot for each continent using faceting (Image by Author)

You can also subset data and look at a select list of countries, using the "subset" function. Let’s compare a select list of countries and plot their life expectancy over the years. I have arbitrarily selected Afghanistan, China, Japan, Kuwait, Malawi, Pakistan, Sweden, and United States.

countries <- c('Afghanistan', 'China', 'Japan', 'Kuwait', 'Malawi', 'Pakistan', 'Sweden', 'United States')
ggplot(subset(gapminder, country %in% countries), aes(x=year, y=lifeExp, colour=country)) + geom_line(lwd=1.5)

The life expectancy of a few selected countries in the past 40+ years (Image by Author)

The above line charts clearly show us that life expectancy has increased over the last few decades for the vast majority of countries on every continent.
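
A quick way to check how widely this holds is to compare each country's life expectancy in the first and last recorded year. The sketch below uses base R only. The names `first_year`, `last_year`, and `change` are chosen here for illustration.

```r
library(gapminder)

# Life expectancy in the first and last recorded year for every country
first_year <- subset(gapminder, year == min(year), select = c(country, lifeExp))
last_year  <- subset(gapminder, year == max(year), select = c(country, lifeExp))

# One row per country, with both values side by side
change <- merge(first_year, last_year, by = "country",
                suffixes = c("_first", "_last"))
change$diff <- change$lifeExp_last - change$lifeExp_first

# Countries whose life expectancy was lower at the end than at the start
change[change$diff < 0, ]
```

Any rows printed by the last command are exceptions to the overall upward trend and are worth a closer look with a line chart of their own.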


Boxplots

Boxplots give us a sense of the distribution of a numerical variable, and they are quite useful for comparing multiple groups by plotting boxplots side by side. Let’s use boxplots to visualize the distribution of life expectancy in each continent.

ggplot(data=gapminder, aes(x=continent, y=lifeExp)) + geom_boxplot()

Basic boxplot showing the distribution of life expectancy in the 5 continents (Image by Author)

You could add the raw points on top of the boxplot using geom_point() or geom_jitter(). Here is an example of using the geom_jitter() function and changing the outlier color so that we can distinguish the outliers from the remaining raw data.

ggplot(data=gapminder, aes(x=continent, y=lifeExp)) + geom_boxplot(outlier.colour = "red") + geom_jitter(width=0.2)

Basic boxplot showing the distribution of life expectancy in the 5 continents in addition to the raw points with jitter (Image by Author)

This plot clearly shows that, while the median life expectancy is higher in Europe and Oceania than in the other continents, there are some countries in other continents that have a much higher life expectancy than a few countries in Europe and Oceania.
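
The thick line inside each box marks the median, so the comparison can also be read off numerically. The following is a small sketch in base R. The name `medians` is chosen here for illustration, and the values are pooled over all years, just like the boxplot.

```r
library(gapminder)

# Median life expectancy per continent, pooled over all years
medians <- tapply(gapminder$lifeExp, gapminder$continent, median)
sort(medians)

# Minimum and maximum per continent, to see how much the groups overlap
tapply(gapminder$lifeExp, gapminder$continent, range)
```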


Histogram

Another plot that can give an idea of the distribution of a variable is a histogram. We can create a histogram using geom_histogram and map the "fill" attribute to the continent, which lets us compare continents by using different colors.

ggplot(data=gapminder, aes(fill=continent, x=lifeExp)) + geom_histogram()

Histogram plot showing the distribution of life expectancy in each continent (Image by Author)
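
By default, geom_histogram() splits the data into 30 bins and prints a message suggesting that you pick a better value with binwidth. The sketch below sets the bin width explicitly. The value of 5 years and the name `p_hist` are illustrative assumptions.

```r
library(ggplot2)
library(gapminder)

# Same histogram, but with an explicit bin width of 5 years (illustrative choice)
p_hist <- ggplot(data = gapminder, aes(fill = continent, x = lifeExp)) +
  geom_histogram(binwidth = 5)
p_hist
```

Note that the bars of the different continents are stacked on top of each other by default, so the total height of a bar is the count across all continents.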

A plot closely related to the histogram is the probability density plot. Here, we will use geom_density and specify an alpha value to adjust the transparency for better viewing of overlapping plots. Let’s also use the xlab and ylab functions to appropriately label the two axes.

ggplot(data=gapminder, aes(fill=continent, x=lifeExp))+ geom_density(alpha = 0.2)+xlab('Life Expectancy')+ylab('Probability Density Function')

This will give us the following plot:

Probability density function plot showing the distribution of life expectancy in each continent (Image by author)

The boxplot, the histogram, and the density plots make it increasingly clear that there seems to be a continuum rather than a dichotomy between the West and the rest.


Barcharts

Both boxplots and histograms help us visualise the distribution of numerical variables. However, we may sometimes need to get an idea of the distribution of categorical data. This is equivalent to counting the number of records in each category of a given categorical variable. geom_bar() lets us plot bar charts and, by default, it counts the number of records in each category of a given variable. Let’s plot a bar chart using the continent variable, which will then give us the number of records we have for each continent.

ggplot(data=gapminder, aes(x=continent)) + geom_bar()

Barplot showing the number of total records we have for each continent (Image by Author)

Lastly, we can also modify the code above to find the number of records we have in a given year (2007), and flip the coordinates so that we see the bars horizontally.

ggplot(data=subset(gapminder, year==2007), aes(x=continent)) + geom_bar() + coord_flip()

Horizontal barplot showing the number of total records we have for each continent in a given year (Image by Author)
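
The counts behind both bar charts can be reproduced with the base R table function, which is a handy cross-check on what geom_bar() computes by default. This is a minimal sketch that assumes only that the gapminder package is loaded.

```r
library(gapminder)

# Number of records per continent (what geom_bar() counts by default)
table(gapminder$continent)

# The same count restricted to a single year, i.e. one record per country
table(subset(gapminder, year == 2007)$continent)
```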

Summary

This tutorial introduced you to the ggplot2 library to help you visualize your data with six different types of plots (visual summary provided below). You were given a complete walkthrough using data from the Gapminder Foundation. All the code listed above is also available on my GitHub.

Scatterplot, line chart, boxplot, histogram, density plot, and bar chart, the purpose of each, and the associated geometric object command (Image by Author)

Read every story from Ahmar Shah, PhD (Oxford) (and thousands of other writers on Medium)

