Hi, I am Gregor, a data scientist, which means I spend most of my time assessing and cleaning data. I enjoy working with Python/Pandas and R/tidyverse equally in my projects. Since we used R and the tidyverse in our most recent project, I would like to share the most basic but most-used functions for manipulating data sets.
In the next section, I outline my technical setup so that you can run the examples in this article yourself right away. Then, in section 3, I present the seven functions using the Gapminder data set. If you have any questions or comments, please feel free to share them with me.
2 Setup
To showcase the functions, I will use the Gapminder data set. It contains data about life expectancy, GDP per capita, and population per country, spanning many decades.
The seven functions are part of the package dplyr developed by Hadley Wickham et al. It is part of the tidyverse package ecosystem. In my opinion, it makes R such a powerful and clean Data Science platform. If you want to know more about the tidyverse, I highly recommend the free book "R for Data Science".
dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges.
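A minimal sketch of the setup. It assumes the data comes from the CRAN package gapminder, which the article does not state explicitly, and that both packages are already installed.

```r
# Assumption: the Gapminder data is taken from the CRAN package "gapminder".
# If needed: install.packages(c("dplyr", "gapminder"))
library(dplyr)
library(gapminder)

# Preview the first ten rows of the data set
preview <- gapminder %>% head(10)
preview
```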
Gapminder data set (10 rows); image by the author
3 Seven Most Basic yet Most Often Used Data Wrangling Functions
The seven functions allow you to select and rename specific columns, sort and filter your data set, create and calculate new columns, and summarize values. I will use the Gapminder data for each function so that the examples are easy to follow and to apply to your own data sets. Please note that the further we go, the more I will combine these functions.
3.1 select() – Selecting Columns in your Data Set
Selecting only the columns continent, year, and pop.
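This selection could look roughly as follows, assuming the data is loaded from the gapminder package:

```r
library(dplyr)
library(gapminder)  # assumption: data from the CRAN package "gapminder"

# Keep only the three named columns, in the order given
selected <- gapminder %>%
  select(continent, year, pop)

head(selected, 10)
```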
Gapminder data set (10 rows); image by the author
Selecting all columns but the year column.
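A sketch of dropping a single column with a minus sign, again assuming the gapminder package as the data source:

```r
library(dplyr)
library(gapminder)  # assumption: data from the CRAN package "gapminder"

# A leading minus removes the column and keeps all others
without_year <- gapminder %>%
  select(-year)

head(without_year, 10)
```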
Gapminder data set (10 rows); image by the author
Selecting all columns that start with co using starts_with(). Please have a look at the documentation for additional useful functions, including ends_with() or contains().
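A minimal sketch of a prefix-based selection with the starts_with() helper, assuming the gapminder package as the data source:

```r
library(dplyr)
library(gapminder)  # assumption: data from the CRAN package "gapminder"

# starts_with() matches column names by prefix;
# here it picks up country and continent
co_columns <- gapminder %>%
  select(starts_with("co"))

head(co_columns, 10)
```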
Gapminder data set (10 rows); image by the author
3.2 rename() – Renaming Columns
Rename the columns year to Year and lifeExp to Life Expectancy.
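A sketch of the renaming step, assuming the gapminder package as the data source. Note that rename() expects new_name = old_name, and that a name containing a space has to be wrapped in backticks.

```r
library(dplyr)
library(gapminder)  # assumption: data from the CRAN package "gapminder"

# Syntax is new_name = old_name; backticks allow a space in the name
renamed <- gapminder %>%
  rename(Year = year, `Life Expectancy` = lifeExp)

head(renamed, 10)
```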
Gapminder data set (10 rows); image by the author
3.3 arrange() – Sorting your Data Set
Sort by year.
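Sorting by a single column could be sketched like this (ascending is the default), assuming the gapminder package as the data source:

```r
library(dplyr)
library(gapminder)  # assumption: data from the CRAN package "gapminder"

# arrange() sorts in ascending order by default
by_year <- gapminder %>%
  arrange(year)

head(by_year, 10)
```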
Gapminder data set (10 rows); image by the author
Sort by lifeExp and then by year (descending).
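A sketch of sorting by two columns. I read "descending" as applying to year only, which is an assumption; wrap lifeExp in desc() as well if both should be descending. The data source is again assumed to be the gapminder package.

```r
library(dplyr)
library(gapminder)  # assumption: data from the CRAN package "gapminder"

# Sort ascending by lifeExp; ties are broken by year in descending order
by_life_year <- gapminder %>%
  arrange(lifeExp, desc(year))

head(by_life_year, 10)
```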
Gapminder data set (10 rows); image by the author
3.4 filter() – Filtering Rows in your Data Set
Filter rows with the year 1972.
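A minimal sketch of filtering on one condition, assuming the gapminder package as the data source:

```r
library(dplyr)
library(gapminder)  # assumption: data from the CRAN package "gapminder"

# Keep only the rows whose year equals 1972
rows_1972 <- gapminder %>%
  filter(year == 1972)

head(rows_1972, 10)
```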
Gapminder data set (10 rows); image by the author
Filter rows with the year 1972 and with a life expectancy below average.
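A sketch of combining two conditions, assuming the gapminder package as the data source. Conditions separated by a comma are combined with AND. Note that mean(lifeExp) is evaluated over the whole data set here, not only over the 1972 rows; that this is the intended "average" is my assumption.

```r
library(dplyr)
library(gapminder)  # assumption: data from the CRAN package "gapminder"

# Both conditions must hold; mean(lifeExp) is the mean over all rows
below_avg_1972 <- gapminder %>%
  filter(year == 1972, lifeExp < mean(lifeExp))

head(below_avg_1972, 10)
```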
Gapminder data set (10 rows); image by the author
Filter rows with the year 1972, a below-average life expectancy, and the country being either Bolivia or Angola.
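The OR condition can be sketched with the | operator; country %in% c("Bolivia", "Angola") would be an equivalent alternative. As before, the gapminder package as the data source and the overall mean as the "average" are assumptions.

```r
library(dplyr)
library(gapminder)  # assumption: data from the CRAN package "gapminder"

# Comma-separated conditions are ANDed; | expresses the OR on country
two_countries <- gapminder %>%
  filter(
    year == 1972,
    lifeExp < mean(lifeExp),
    country == "Bolivia" | country == "Angola"
  )

two_countries
```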
Gapminder data set (2 rows); image by the author
3.5 mutate() – Generate new Columns in your Data Set
Create a column that combines continent and country information, and another column that shows the rounded lifeExp information.
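A sketch of adding two columns in one mutate() call, assuming the gapminder package as the data source. The new column names continent_country and lifeExp_rounded and the " - " separator are hypothetical choices for illustration.

```r
library(dplyr)
library(gapminder)  # assumption: data from the CRAN package "gapminder"

# Column names and separator below are illustrative, not from the article
mutated <- gapminder %>%
  mutate(
    continent_country = paste(continent, country, sep = " - "),
    lifeExp_rounded = round(lifeExp)
  )

head(mutated, 10)
```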
Gapminder data set (10 rows); image by the author
3.6 summarize() – Create Summary Calculations in your Data Set
For the whole data set, calculate the mean and standard deviation of population and life expectancy.
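A sketch of an ungrouped summary, which collapses the data set into a single row. It assumes the gapminder package as the data source; the names of the result columns are my own choice.

```r
library(dplyr)
library(gapminder)  # assumption: data from the CRAN package "gapminder"

# Without grouping, summarize() returns one row for the whole data set
overall <- gapminder %>%
  summarize(
    pop_mean = mean(pop),
    pop_sd = sd(pop),
    lifeExp_mean = mean(lifeExp),
    lifeExp_sd = sd(lifeExp)
  )

overall
```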
Gapminder data set (summarized); image by the author
3.7 group_by() – Group your Data Set and Create Summary Calculations
The summarize() function is only so useful without the group_by() function. Using both together is a powerful way to create new data sets. In the example below, I will group the data set by continent and then create summaries for population and lifeExp.
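Grouping before summarizing could be sketched as follows; the result has one row per continent. The gapminder package as the data source and the result column names are assumptions.

```r
library(dplyr)
library(gapminder)  # assumption: data from the CRAN package "gapminder"

# One summary row per continent
by_continent <- gapminder %>%
  group_by(continent) %>%
  summarize(
    pop_mean = mean(pop),
    pop_sd = sd(pop),
    lifeExp_mean = mean(lifeExp),
    lifeExp_sd = sd(lifeExp)
  )

by_continent
```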
Gapminder data set (grouped and summarized); image by the author
It is also possible to group by more than one column. In the next example, I use group_by() with continent and year.
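A sketch of grouping by two columns, giving one row per combination of continent and year. It assumes the gapminder package as the data source and dplyr 1.0.0 or later for the .groups argument, which I added to return an ungrouped result.

```r
library(dplyr)
library(gapminder)  # assumption: data from the CRAN package "gapminder"

# One summary row per continent/year pair; .groups = "drop" removes
# the remaining grouping from the result (requires dplyr >= 1.0.0)
by_continent_year <- gapminder %>%
  group_by(continent, year) %>%
  summarize(
    pop_mean = mean(pop),
    lifeExp_mean = mean(lifeExp),
    .groups = "drop"
  )

head(by_continent_year, 10)
```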
Gapminder data set (grouped and summarized); image by the author
4 Conclusion
In this article, I showed you my most-used R functions for manipulating data sets. I provided at least one example for each function, which will hopefully serve as a good basis for trying them out yourself. If you want to know more about R and dplyr, please make sure you check out the official documentation as well as the beautiful R for Data Science book.
Please let me know what you think and what your most-used functions are. Thank you!