Data Cleaning with R and the Tidyverse: Detecting Missing Values

John Sullivan
Towards Data Science
13 min read · Mar 21, 2019

Data cleaning is one of the most important aspects of data science.

As a data scientist, you can expect to spend up to 80% of your time cleaning data.

In a previous post I walked through a number of data cleaning tasks using Python and the Pandas library.

That post got so much attention, I wanted to follow it up with an example in R.

In this post you’ll learn how to detect missing values using the tidyr and dplyr packages from the Tidyverse.

The Tidyverse is the best collection of R packages for data science, so you should become familiar with it.

Getting Started

A good way to start any data science project is to get a feel for the data.

This is just a quick look to see the variable names and expected variable types. Looking at the dimensions of the data is also useful.

Exploratory data analysis (EDA) is extremely important, so it deserves its own blog post. We won’t go over a full EDA in this article.

Before we get started, head on over to our github page to grab a copy of the data. Make sure to put a copy in the same working directory where your R code will be.

Here’s a quick look at our data:

This is a small customer churn dataset.

For purposes of learning, this dataset shows some great real-world examples of missing values.

To start, load the tidyverse library and read in the CSV file.

library(tidyverse)

# set working directory
path_loc <- "C:/Users/Jonathan/Desktop/data cleaning with R post"
setwd(path_loc)

# reading in the data
df <- read_csv("telecom.csv")

Usually the data is read into a dataframe, but the tidyverse actually uses tibbles.

These are similar to dataframes, but also slightly different. To learn more about tibbles, check out this chapter from R for Data Science.
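
To see one of the differences for yourself, here's a quick sketch comparing the two, using a small made-up data frame rather than our telecom data:

```r
library(tibble)

# a plain data.frame and its tibble equivalent
df_classic <- data.frame(x = 1:3, y = c("a", "b", "c"))
tbl <- as_tibble(df_classic)

# tibbles are still data.frames underneath, but they print
# only the first 10 rows along with each column's type
class(df_classic)  # "data.frame"
class(tbl)         # "tbl_df" "tbl" "data.frame"
```

Printing a tibble in the console shows its dimensions and column types up front, which is handy for large datasets.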

I like to use the glimpse function to look at the variable names and types.

# taking a quick look
glimpse(df)
> glimpse(df)
Observations: 10
Variables: 5
$ customerID chr "7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "7...
$ MonthlyCharges dbl 29.85, 56.95, NA, 42.30, 70.70, NaN, 89.10, ...
$ TotalCharges chr "109.9", "na", "108.15", "1840.75", NA, "820...
$ PaymentMethod chr "Electronic check", "Mailed check", "--", "B...
$ Churn chr "yes", "yes", "yes", "no", "no", "yes", "no"...

We can see that there are 5 variables:

  • customerID
  • MonthlyCharges
  • TotalCharges
  • PaymentMethod
  • Churn

There’s also a description of the type for each variable:

  • customerID: chr which stands for character, another name for a string
  • MonthlyCharges: dbl which stands for double, which is a numeric type
  • TotalCharges: chr, character
  • PaymentMethod: chr, character
  • Churn: chr, character

There are 10 observations, which means there are 10 rows of data.

Now that we’ve taken a quick look to become familiar with the data, let’s go over some basic data manipulation.

A Grammar of Data Manipulation: dplyr

Before we get started with missing values, let’s go over the dplyr library.

This is just a quick introduction, so be sure to check out the official dplyr documentation as well as Chapter 5 Data Transformation from R for Data Science.

This library uses a “grammar of data manipulation,” which means it provides a set of functions with logical verb names for what you want to do.

For example, maybe you want to only look at customers that churned. You can filter the data on Churn values equal to “yes”.

We can quickly do that using the filter function from dplyr.

# filter on customers that churned
df %>%
  filter(Churn == "yes")
# A tibble: 5 x 5
customerID MonthlyCharges TotalCharges PaymentMethod Churn
chr dbl chr chr chr
1 7590-VHVEG 29.8 109.9 Electronic check yes
2 5575-GNVDE 57.0 na Mailed check yes
3 3668-QPYBK NA 108.15 -- yes
4 9305-CDSKC NaN 820.5 -- yes
5 6713-OKOMC NA N/A NA yes

Taking a look we can see that R returned an organized tibble that only includes customers that churned.

If you’re not familiar with the %>% operator, also known as the “pipe operator” check out this great blog post.

The pipe is a useful operator that comes from the magrittr package. It allows us to organize our code by eliminating nested parentheses so that we can make our code more readable.

For example, let’s say we had the following calculation:

# nested functions
log(sin(exp(2)))
> log(sin(exp(2)))
[1] -0.1122118

With all of the parentheses, this isn’t very readable. Now let’s look at a piped example.

# piped functions
2 %>%
  exp() %>%
  sin() %>%
  log()

It’s easy to see that the piped example is much more readable.

Okay, back to dplyr.

We just used the filter function to quickly filter out rows with a Churn value equal to “yes”.

Maybe we also want to just select the customerID and TotalCharges columns. We can quickly do that as well using the select function.

# filter on customers that churned,
# select customerID and TotalCharges columns
df %>%
  filter(Churn == "yes") %>%
  select(customerID, TotalCharges)
# A tibble: 5 x 2
customerID TotalCharges
chr chr
1 7590-VHVEG 109.9
2 5575-GNVDE na
3 3668-QPYBK 108.15
4 9305-CDSKC 820.5
5 6713-OKOMC N/A

We can see just how easy it is to manipulate our data using these dplyr functions.

Chaining functions together vertically makes our code extremely readable.

This way of coding might seem a little strange at first, but after a little practice it will become extremely useful.

Standard Missing Values

Now that we’re a little bit more familiar with the pipe operator and dplyr, let’s dive right in to detecting missing values.

We’ll start by looking at standard missing values that R recognizes.

Go ahead and take a look at the MonthlyCharges column.

We can see that there are three missing values.

There are two empty cells, and one with “NaN”. These are obviously missing values.

We can see how R recognizes these using the is.na function.

First let’s print out that column and then apply is.na.

# looking at MonthlyCharges
df$MonthlyCharges
is.na(df$MonthlyCharges)
> df$MonthlyCharges
[1] 29.85 56.95 NA 42.30 70.70 NaN 89.10 NA 104.80
[10] 54.10
> is.na(df$MonthlyCharges)
[1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE

We can see that the two empty cells were recognized as “NA” and the other missing value was identified by R as “NaN”.

When we run the is.na function, R recognizes both types of missing values. We can see this because there are three TRUE values returned when we run is.na.

It’s important to note the difference between “NA” and “NaN”. We can use the help function to take a closer look at both values.

# using the help function to learn about NA
help(NA)

Taking a look at the help pane (the bottom-right window in RStudio), we can see that “NA” or “Not Available” is used for missing values.

“NaN” or “Not a Number” is used for numeric calculations. If a value is undefined, such as 0/0, “NaN” is the appropriate way to represent this.

There is also an is.nan function. Try running this with both “NA” and “NaN”. You’ll see that it returns a value of TRUE for “NaN” but FALSE for “NA”.

The is.na function on the other hand is more generic, so it will detect both types of missing values.
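
Here's a tiny demonstration of the difference, using a throwaway vector:

```r
# NA means "not available"; NaN is an undefined numeric result like 0/0
x <- c(1, NA, NaN)

is.na(x)   # FALSE  TRUE  TRUE  -- catches both NA and NaN
is.nan(x)  # FALSE FALSE  TRUE  -- catches only NaN
```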

Let’s go ahead and use dplyr to summarize our data a little bit.

We can use the distinct function to look at the distinct values that show up in the MonthlyCharges column.

# looking at the distinct values
df %>%
  distinct(MonthlyCharges)
# A tibble: 9 x 1
MonthlyCharges
dbl
1 29.8
2 57.0
3 NA
4 42.3
5 70.7
6 NaN
7 89.1
8 105.
9 54.1

We can see there are 9 distinct values. There are 10 rows of data, but “NA” shows up twice, leaving 9 distinct values.

If we want to get a quick count of the distinct values we can use the summarise function.

# counting unique values
df %>%
  summarise(n = n_distinct(MonthlyCharges))
# A tibble: 1 x 1
n
int
1 9

This returns a simple tibble with a column that we named “n” for the count of distinct values in the MonthlyCharges column.

What we’re really after is the count of missing values. We can use the summarise function along with is.na to count the missing values.

# counting missing values
df %>%
  summarise(count = sum(is.na(MonthlyCharges)))
# A tibble: 1 x 1
count
int
1 3

As we saw above, the number of missing values is 3.
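
As a side note, if you want the missing-value count for every column at once, newer versions of dplyr (1.0+) let you do that with across. A sketch, using a small stand-in tibble rather than our full dataset:

```r
library(dplyr)

# a small stand-in tibble with some missing values
df_demo <- tibble::tibble(
  MonthlyCharges = c(29.85, NA, NaN, 42.30),
  Churn          = c("yes", "yes", NA, "no")
)

# count NAs in every column in one pass
df_demo %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
# MonthlyCharges = 2 (is.na counts NaN too), Churn = 1
```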

Maybe we want to do multiple things at once. Let’s say we want to get a count of unique values, as well as missing values, and also the median value of MonthlyCharges.

Here’s how we can do that using summarise:

# counting unique, missing, and median values
df %>%
  summarise(n   = n_distinct(MonthlyCharges),
            na  = sum(is.na(MonthlyCharges)),
            med = median(MonthlyCharges, na.rm = TRUE))
# A tibble: 1 x 3
n na med
int int dbl
1 9 3 57.0

This produces an organized little tibble of our summary data.

Now that we’ve identified the missing values, let’s replace them with the median value of MonthlyCharges. To do that, we can use the mutate function from dplyr.

# mutate missing values
df %>%
  mutate(MonthlyCharges = replace(MonthlyCharges,
                                  is.na(MonthlyCharges),
                                  median(MonthlyCharges, na.rm = TRUE)))
# A tibble: 10 x 5
customerID MonthlyCharges TotalCharges PaymentMethod Churn
chr dbl chr chr chr
1 7590-VHVEG 29.8 109.9 Electronic check yes
2 5575-GNVDE 57.0 na Mailed check yes
3 3668-QPYBK 57.0 108.15 -- yes
4 7795-CFOCW 42.3 1840.75 Bank transfer no
5 9237-HQITU 70.7 NA Electronic check no
6 9305-CDSKC 57.0 820.5 -- yes
7 1452-KIOVK 89.1 1949.4 Credit card no
8 6713-OKOMC 57.0 N/A NA yes
9 7892-POOKP 105. 3046.05 Electronic check no
10 8451-AJOMK 54.1 354.95 Electronic check no

We can see that the missing values were replaced with the median value 57 in three different spots.

Just to double check that this worked, let’s print out the whole tibble again.

df
# A tibble: 10 x 5
customerID MonthlyCharges TotalCharges PaymentMethod Churn
chr dbl chr chr chr
1 7590-VHVEG 29.8 109.9 Electronic check yes
2 5575-GNVDE 57.0 na Mailed check yes
3 3668-QPYBK NA 108.15 -- yes
4 7795-CFOCW 42.3 1840.75 Bank transfer no
5 9237-HQITU 70.7 NA Electronic check no
6 9305-CDSKC NaN 820.5 -- yes
7 1452-KIOVK 89.1 1949.4 Credit card no
8 6713-OKOMC NA N/A NA yes
9 7892-POOKP 105. 3046.05 Electronic check no
10 8451-AJOMK 54.1 354.95 Electronic check no

It looks like all the missing values are back. So what happened?

This brings up an important point. The dplyr package won’t modify the data in place.

Basically this means if we apply a mutate to some of the data with just a pipe operator, it will show us a modified view of the data, but it won’t be a permanent modification.

To permanently modify the data, we need to assign the mutate to the original data using the assignment operator <-.

Here’s how we would do that:

# mutate missing values, and modify the dataframe
df <- df %>%
  mutate(MonthlyCharges = replace(MonthlyCharges,
                                  is.na(MonthlyCharges),
                                  median(MonthlyCharges, na.rm = TRUE)))

Now if we take another look at the data, it should be modified.

df
# A tibble: 10 x 5
customerID MonthlyCharges TotalCharges PaymentMethod Churn
chr dbl chr chr chr
1 7590-VHVEG 29.8 109.9 Electronic check yes
2 5575-GNVDE 57.0 na Mailed check yes
3 3668-QPYBK 57.0 108.15 -- yes
4 7795-CFOCW 42.3 1840.75 Bank transfer no
5 9237-HQITU 70.7 NA Electronic check no
6 9305-CDSKC 57.0 820.5 -- yes
7 1452-KIOVK 89.1 1949.4 Credit card no
8 6713-OKOMC 57.0 N/A NA yes
9 7892-POOKP 105. 3046.05 Electronic check no
10 8451-AJOMK 54.1 354.95 Electronic check no

This time the MonthlyCharges column was modified permanently. Keep in mind that when you want to permanently mutate your data with dplyr, you need to assign the mutate to the original data.

Non-Standard Missing Values

A lot of times you won’t be lucky enough to have all standard missing value types that R will recognize right away.

Let’s take a quick look at the next column, TotalCharges, to see what I mean.

We can see there are three different missing values: “na”, “NA”, and “N/A”.

In the previous example we saw that R recognized “NA” as a missing value, but what about “na” and “N/A”?

Let’s take a look at this column and use is.na to see if R recognizes all of these as missing values.

# looking at missing values
df$TotalCharges
is.na(df$TotalCharges)
> is.na(df$TotalCharges)
[1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE

Looking at the results we can see that R only identified “NA” as a missing value.

Let’s use the summarise function to see how many missing values R found.

# counting missing values
df %>%
  summarise(count = sum(is.na(TotalCharges)))
# A tibble: 1 x 1
count
int
1 1

The result confirms that R only found one missing value.

We’ll need to replace both “na” and “N/A” with NA to make sure that R recognizes all of these as missing values.

Let’s use the mutate function to replace these with the correct missing value types. Keep in mind that we need to use the assignment operator to make sure the changes are permanent.

# replacing with standard missing value type, NA
df <- df %>%
  mutate(TotalCharges = replace(TotalCharges, TotalCharges == "na", NA)) %>%
  mutate(TotalCharges = replace(TotalCharges, TotalCharges == "N/A", NA))

If we take a look at this column again, we can see that now all of the missing values have been correctly identified by R.

# taking another look
df$TotalCharges
is.na(df$TotalCharges)
> df$TotalCharges
[1] "109.9" NA "108.15" "1840.75" NA "820.5"
[7] "1949.4" NA "3046.05" "354.95"
> is.na(df$TotalCharges)
[1] FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE

Now we can see that R picked up all three missing values.
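
As an aside, dplyr’s na_if function expresses the same replacement a little more compactly; it converts one matching value to NA per call:

```r
library(dplyr)

charges <- c("109.9", "na", "108.15", "N/A", "354.95")

# each na_if call turns one specific string into NA
cleaned <- charges %>%
  na_if("na") %>%
  na_if("N/A")

cleaned  # "109.9" NA "108.15" NA "354.95"
```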

Before we replace the missing values, there’s still another problem.

R thinks that the column values are characters. We can confirm this with the glimpse function.

> glimpse(df$TotalCharges)
chr [1:10] "109.9" NA "108.15" "1840.75" NA "820.5" "1949.4" NA ...

Let’s change these to numeric types.

# changing to numeric type
df$TotalCharges <- as.numeric(df$TotalCharges)
glimpse(df$TotalCharges)
> df$TotalCharges <- as.numeric(df$TotalCharges)
> glimpse(df$TotalCharges)
num [1:10] 110 NA 108 1841 NA ...

Finally, let’s finish up by replacing the missing values with the median.

# replace missing values with median
df <- df %>%
  mutate(TotalCharges = replace(TotalCharges,
                                is.na(TotalCharges),
                                median(TotalCharges, na.rm = TRUE)))
df$TotalCharges
> df$TotalCharges
[1] 109.90 820.50 108.15 1840.75 820.50 820.50 1949.40 820.50
[9] 3046.05 354.95

An even simpler way to change all of the missing values is to change the column to numeric before doing anything else.

Let’s import the data again so that we have the missing values again.

# importing the data again
df <- read_csv("telecom.csv")
df$TotalCharges
> df$TotalCharges
[1] "109.9" "na" "108.15" "1840.75" NA "820.5"
[7] "1949.4" "N/A" "3046.05" "354.95"

Now let’s try changing the column to numbers.

# change TotalCharges to numeric type
df$TotalCharges <- as.numeric(df$TotalCharges)
df$TotalCharges
> df$TotalCharges <- as.numeric(df$TotalCharges)
Warning message:
NAs introduced by coercion
> df$TotalCharges
[1] 109.90 NA 108.15 1840.75 NA 820.50 1949.40 NA
[9] 3046.05 354.95

This time all of the different missing value types were changed automatically.

Although this is a little shorter, it isn’t always my preferred solution.

This worked for our specific example, but if you’re trying to detect anomalies or other dirty data, this might not be a good solution.

Always make sure to read the R console for warnings like this; they can provide valuable information.
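
Another option worth knowing about: readr’s read_csv has an na argument that lets you declare the missing-value strings up front, so the cleaning happens at import time. Here’s a sketch that writes a small temporary file to stand in for telecom.csv:

```r
library(readr)

# write a tiny CSV to a temp file so the example is self-contained
path <- tempfile(fileext = ".csv")
writeLines(c("TotalCharges", "109.9", "na", "N/A", "--"), path)

# declare every missing-value string at import time
df_demo <- read_csv(path, na = c("", "NA", "na", "N/A", "--"))

df_demo$TotalCharges  # 109.9 NA NA NA, already parsed as numeric
```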

More Missing Values

So far we’ve looked at standard missing values like “NA” and non-standard values like “n/a” and “N/A”.

There’s numerous other ways to represent missing data.

Maybe I was manually entering data and chose to use “--” for missing values.

On the other hand, maybe you prefer to just leave the cell blank.

Let’s learn about detecting some of these more unusual types of missing values.

Take a look at the PaymentMethod column:

We can see that there’s three missing values.

Two are represented with “--” and one is just an empty cell.

Let’s see what R thinks about these:

# looking at PaymentMethod
df$PaymentMethod
is.na(df$PaymentMethod)
> is.na(df$PaymentMethod)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE

R was only able to identify one of the missing values, the empty cell.

Let’s go ahead and use mutate to change “--” to NA.

# replacing "--" with NA
df <- df %>%
  mutate(PaymentMethod = replace(PaymentMethod, PaymentMethod == "--", NA))
is.na(df$PaymentMethod)
df$PaymentMethod
> df$PaymentMethod
[1] "Electronic check" "Mailed check" NA
[4] "Bank transfer" "Electronic check" NA
[7] "Credit card" NA "Electronic check"
[10] "Electronic check"

Now we can see that all three missing values show up.

So far we’ve either left missing values alone, or replaced them with a median.

What about dealing with missing values in a column of character types?

Since all of the entries in the PaymentMethod column are strings, there’s no median value.

Rather than just exclude the missing values, let’s convert the NAs to a new category, called “unavailable”.

# replace NA with "unavailable"
df <- df %>%
  mutate(PaymentMethod = replace(PaymentMethod, is.na(PaymentMethod), "unavailable"))
df$PaymentMethod
> df$PaymentMethod
[1] "Electronic check" "Mailed check" "unavailable"
[4] "Bank transfer" "Electronic check" "unavailable"
[7] "Credit card" "unavailable" "Electronic check"
[10] "Electronic check"

Now we can see that our three missing values, NA, have been converted to a new category, “unavailable”.
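
The tidyr package (loaded along with the rest of the tidyverse) also has a replace_na function that does the same thing more directly:

```r
library(tidyr)

pay <- c("Electronic check", NA, "Mailed check", NA)

# swap every NA for the supplied replacement value
replace_na(pay, "unavailable")
# "Electronic check" "unavailable" "Mailed check" "unavailable"
```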

Sometimes there’s a reason why values are missing, so it’s good to keep that information to see how it influences the results in our machine learning models.

We won’t get in to those details in this post, but keep in mind that throwing out missing values might not always be a good idea.
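
One simple way to keep that information is to record which rows were missing before imputing. A sketch, using a small stand-in tibble:

```r
library(dplyr)

df_demo <- tibble::tibble(
  PaymentMethod = c("Mailed check", NA, "Credit card")
)

# flag the missing rows first, then impute,
# so the "missingness" survives the replacement
df_demo <- df_demo %>%
  mutate(PaymentMethod_missing = is.na(PaymentMethod),
         PaymentMethod = replace(PaymentMethod,
                                 is.na(PaymentMethod),
                                 "unavailable"))
```

The new PaymentMethod_missing column can later be fed to a model as a feature in its own right.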

Conclusion

In this post we learned about data cleaning, one of the most important skills in data science.

Specifically, we looked at detecting different types of missing values.

We also learned about replacing both numeric and character type missing values.

You can expect to spend up to 80% of your time cleaning data, so this is a valuable skill to have.

For information on data cleaning and detecting missing values with Python, check out this post.
