
5 Basic Commands to Get Started with dplyr in R

dplyr is R’s counterpart to Python’s Pandas library, enabling easy data exploration and manipulation

Photo by Jeff Siepman on Unsplash

I started out my Data Science journey learning how to use the Pandas library and truthfully, there is everything to love about it: it is easy to use, straightforward and has functionality for just about any task that involves manipulating and exploring a data frame.

Heck, I even made a full video series on YouTube teaching other people how to use Pandas. Feel free to check it out (shameless plug)!

Pandas Zero to Hero – A Beginner’s Tutorial to Using Pandas

However, lately, I find myself spending more and more time on R, primarily because I am preparing for my actuarial exams, but also because I am curious to learn the Pandas-equivalent tools that people are using in R. It didn’t take long at all before I stumbled upon the dplyr library.

Now, if you are looking for a Python versus R debate, this article is not it. Personally, I think the more tools you can add to your toolbox, the better mechanic you become. In data science terms, the more languages or libraries you know, the more flexible and effective you become as a data scientist.

Plus, in the real world, it is not always up to you to dictate which language you want to use for a particular project so it really doesn’t hurt to expand your knowledge and skillset.


Dplyr functionality

As I have mentioned, for those of you who are already familiar with Pandas, dplyr is very similar in terms of functionality. One can argue that dplyr is more intuitive to write and interpret, especially when using the chaining syntax, which we will discuss later on.

In the event that you are completely new, don’t worry because, in this article, I will share 5 basic commands to help you get started with dplyr and those commands include:

  1. Filter
  2. Select
  3. Arrange
  4. Mutate
  5. Summarise

In addition to these commands, I will also demonstrate the base R approach to obtain the same result in an effort to highlight the readability of dplyr.

Without further ado, let’s begin!


Import dplyr library and drinks dataset

If you don’t already have dplyr installed on your computer, you can do so via the following command.

install.packages("dplyr")

Once you have installed the library, we can now proceed to import dplyr as well as the dataset that we will be using for this particular tutorial, the drinks by country dataset.

library(dplyr)
drinks = read.csv("http://bit.ly/drinksbycountry")
head(drinks) 

The dataset contains information about the alcohol consumption of 193 countries in the world.


Command 1: Filter

The filter command allows us to keep rows that match some specified criteria. Multiple criteria are commonly combined with and/or logic.

And criteria

Suppose we would like to view countries in Asia that have zero beer servings.

# Base R approach
drinks[drinks$continent == "Asia" & drinks$beer_servings == 0, ]
# Dplyr approach
filter(drinks, continent == "Asia", beer_servings == 0)

Or criteria

Suppose now we want to view countries with either zero spirit servings or zero wine servings.

# Base R approach
drinks[drinks$spirit_servings == 0 | drinks$wine_servings == 0, ]
# Dplyr approach
filter(drinks, spirit_servings == 0 | wine_servings == 0)
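
To make the and/or logic a little more concrete, here is a minimal sketch on a tiny made-up data frame (the toy object and its values are my own invention and only borrow column names from the drinks dataset). It shows that comma-separated conditions in filter behave like &, and that %in% can stand in for several == conditions joined by |.

```r
library(dplyr)

# Made-up toy data that only mimics the column names of the drinks dataset
toy <- data.frame(
  country       = c("A", "B", "C", "D"),
  continent     = c("Asia", "Asia", "Europe", "Africa"),
  beer_servings = c(0, 25, 0, 10),
  stringsAsFactors = FALSE
)

# Comma-separated conditions are combined with AND ...
and_comma <- filter(toy, continent == "Asia", beer_servings == 0)

# ... so this is the same as writing & explicitly
and_amp <- filter(toy, continent == "Asia" & beer_servings == 0)

# %in% is a compact alternative to continent == "Asia" | continent == "Europe"
asia_or_europe <- filter(toy, continent %in% c("Asia", "Europe"))

and_comma
asia_or_europe
```

Remember that comparisons need a double equals sign; a single = inside filter is treated as a named argument and dplyr will raise an error.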

Command 2: Select

Next, we have the select command which allows us to select columns by name.

Suppose we would like to view the first 6 rows of the country and total litres of pure alcohol columns.

# Base R approach
head(drinks[, c("country", "total_litres_of_pure_alcohol")])
# Dplyr approach
head(select(drinks, country, total_litres_of_pure_alcohol))

To make the select command more flexible, we also have the following helper functions to match columns by name:

  • starts_with
  • ends_with
  • matches
  • contains

Here, I want to select columns that contain the word servings.

head(select(drinks, contains("servings")))
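
The other helpers work the same way. Below is a rough sketch on a made-up two-row data frame (again, toy and its values are purely illustrative) so that the selected column names are easy to check by eye. Note that matches expects a regular expression, while the other helpers take a literal string.

```r
library(dplyr)

# Made-up toy data with the same column names as the drinks dataset
toy <- data.frame(
  country                      = c("A", "B"),
  beer_servings                = c(10, 20),
  spirit_servings              = c(5, 0),
  wine_servings                = c(1, 2),
  total_litres_of_pure_alcohol = c(1.2, 2.3)
)

# Columns whose names begin with "total"
cols_start <- names(select(toy, starts_with("total")))

# Columns whose names end with "servings"
cols_end <- names(select(toy, ends_with("servings")))

# Columns whose names match a regular expression
cols_regex <- names(select(toy, matches("^(beer|wine)_")))

# A minus sign drops a column instead of keeping it
cols_drop <- names(select(toy, -country))

cols_start
cols_end
cols_regex
cols_drop
```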

Bonus: Chaining multiple operations via %>%

I mentioned earlier that dplyr code can be more readable than the Pandas equivalent. This advantage is largely attributable to being able to chain multiple operations together via the %>% operator.

Rather than nesting different operations together into a single line of code, which can sometimes become confusing and hard to read, the chaining method allows for a more intuitive approach to writing and interpreting operations.

A good way to think about this process is to imagine that you are at a factory and each operation is like a worker who is part of this ginormous manufacturing process and responsible for a very specific task. Once a worker completes their task, they pass the result on to the next worker in line, and so on until the final product is delivered.

It is also worth sharing the RStudio keyboard shortcut for %>%: Ctrl+Shift+M on Windows (Cmd+Shift+M on macOS).

Now, suppose we would like to select both the country and total litres of pure alcohol columns and filter rows that have a pure alcohol volume of more than 12 litres.

# Nesting method
filter(select(drinks, country, total_litres_of_pure_alcohol), total_litres_of_pure_alcohol > 12)
# Chaining method 
drinks %>% 
    select(country, total_litres_of_pure_alcohol) %>% 
    filter(total_litres_of_pure_alcohol > 12)

As we can see, chaining significantly increases code readability especially when there are many commands in play.

Chaining can also be used to replace nesting in R commands outside of dplyr. For example, suppose we want to calculate the root mean square error between two sets of numbers to five decimal places.

# Create two vectors
set.seed(42)
x = runif(10); y = runif(10)
# Nesting method
round(sqrt(mean((x-y)^2)), 5)
# Chain method
(x-y)^2 %>% mean() %>% sqrt() %>% round(5)

Here, you should get an RMSE of 0.36466.
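
If you are wondering what the operator actually does, the short version is that x %>% f(y) is just another way of writing f(x, y): the left-hand side is passed in as the first argument of the function on the right. Here is a small sketch with made-up numbers to illustrate that, plus the dot placeholder for cases where the value should go somewhere other than the first argument.

```r
library(dplyr)  # dplyr re-exports the %>% operator from magrittr

# Made-up numbers, purely for illustration
x <- c(1, 4, 9, 16)

# Nested: read from the inside out
nested <- round(sqrt(sum(x)), 2)

# Chained: read from left to right; each result becomes the
# first argument of the next function, so round(2) means round(<result>, 2)
chained <- x %>% sum() %>% sqrt() %>% round(2)

# The dot placeholder sends the left-hand side to a different argument,
# here the digits argument of round()
via_dot <- 2 %>% round(5.4772, digits = .)

nested
chained
via_dot
```

All three results should be identical, since they describe the same calculation written in different ways.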


Command 3: Arrange

Next, we have the arrange command which is pretty self-explanatory. It helps order rows in ascending or descending order.

Suppose we would like to order beer servings in each country from the lowest serving to the highest serving.

# Base R approach
drinks[order(drinks$beer_servings), c("country", "beer_servings")]
# Dplyr approach
drinks %>% 
    select(country, beer_servings) %>% 
    arrange(beer_servings)

Alternatively, for descending order, we need to use desc.

drinks %>% 
    select(country, beer_servings) %>% 
    arrange(desc(beer_servings))
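
arrange also accepts more than one column, in which case later columns are used to order rows within groups of the earlier ones. The sketch below uses a made-up data frame (toy and its values are hypothetical) to sort by continent first and then by beer servings from highest to lowest, alongside a base R equivalent using order.

```r
library(dplyr)

# Made-up toy data, chosen so there are no ties within a continent
toy <- data.frame(
  country       = c("A", "B", "C", "D"),
  continent     = c("Asia", "Europe", "Asia", "Europe"),
  beer_servings = c(50, 10, 20, 30),
  stringsAsFactors = FALSE
)

# Dplyr approach: continent ascending, then beer servings descending
sorted_dplyr <- toy %>%
    arrange(continent, desc(beer_servings))

# Base R approach: negate the numeric column to sort it in descending order
sorted_base <- toy[order(toy$continent, -toy$beer_servings), ]

sorted_dplyr
```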

Command 4: Mutate

Command 4 is mutate, which allows us to create new variables that are functions of the existing variables in the data frame.

Suppose we would like to add a new variable called average alcohol, which is simply the average of the beer, spirit and wine servings, rounded to the nearest whole number.

# Base R approach
drinks$avg_alcohol = round((drinks$beer_servings + drinks$spirit_servings + drinks$wine_servings) / 3)
head(drinks[, c("country", "beer_servings", "spirit_servings", "wine_servings", "avg_alcohol")])
# Dplyr approach
drinks %>% 
    select(country, beer_servings, spirit_servings, wine_servings) %>% 
    mutate(avg_alcohol = round((beer_servings + spirit_servings + wine_servings) / 3)) %>% 
    head()
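
One difference between the two approaches that is easy to miss: the base R assignment changes the data frame in place, whereas mutate returns a new data frame and leaves the original untouched unless you assign the result. A minimal sketch on made-up data (toy and toy_with_avg are hypothetical names of my own):

```r
library(dplyr)

# Made-up toy data, chosen so the averages are whole numbers
toy <- data.frame(
  country         = c("A", "B"),
  beer_servings   = c(90, 30),
  spirit_servings = c(60, 0),
  wine_servings   = c(30, 3)
)

# mutate returns a modified copy; assign it to keep the new column
toy_with_avg <- toy %>%
    mutate(avg_alcohol = round((beer_servings + spirit_servings + wine_servings) / 3))

# The original data frame does not gain the new column
"avg_alcohol" %in% names(toy)           # FALSE
"avg_alcohol" %in% names(toy_with_avg)  # TRUE

toy_with_avg
```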

Command 5: Summarise

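Last but not least, the summarise command reduces variables to single values using what are called aggregate functions. Some examples of aggregate functions include minimum, maximum, median, mode and mean (note that base R has no built-in function for the statistical mode; its mode() function reports an object’s storage type instead).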

Here, let’s see an example where we would like to compute the average wine servings by continent.

# Base R approach
aggregate(wine_servings ~ continent, drinks, mean)
# Dplyr approach
drinks %>% 
    group_by(continent) %>% 
    summarise(avg_wine = mean(wine_servings, na.rm = TRUE))
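
summarise is not limited to one aggregate at a time. As a rough sketch on made-up data (toy and the column names n_countries, avg_wine and max_wine are my own choices), we can compute several summaries per group in a single call, including a row count via n().

```r
library(dplyr)

# Made-up toy data with two continents
toy <- data.frame(
  continent     = c("Asia", "Asia", "Europe", "Europe", "Europe"),
  wine_servings = c(0, 10, 100, 200, 300),
  stringsAsFactors = FALSE
)

# One row per continent, with several aggregates side by side
by_continent <- toy %>%
    group_by(continent) %>%
    summarise(
        n_countries = n(),                 # number of rows in each group
        avg_wine    = mean(wine_servings), # average within each group
        max_wine    = max(wine_servings)   # largest value within each group
    )

by_continent
```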

Conclusion

To summarise, in this article, we have learned 5 basic commands of the dplyr library, which enable us to explore and transform data frames. Those 5 commands are:

  1. Filter
  2. Select
  3. Arrange
  4. Mutate
  5. Summarise

Furthermore, we also saw how we can deploy the chaining method via the %>% operator to make our code easier to write and read.

I hope you enjoyed this article and gained something useful out of it. Feel free to check out my other tutorial articles on how to use R.

Back to Basics – Linear Regression in R

Customer Segmentation Using K-Means Clustering in R

Take care and keep learning!

