R is an Open Source language with a ton of libraries to explore. In this post, you will find some of the most famous ones and where you can learn more about them.

Over the past decade, learning open-source languages has become the de facto standard for working in Data Science. Open-source tools are free, scalable and have the extensive support of thousands of individuals in cooperative communities.
R is one of the most famous open-source languages and has become the tool of choice for millions of Data Scientists around the world. One of its main advantages is the large community behind a multitude of libraries that are constantly updated and improved, capturing the latest developments in machine learning and data science research.
There are literally hundreds of libraries one can use when working with R – for beginners this may seem confusing, particularly once you notice that many libraries serve the same purpose and do similar things.
This post will help you understand which libraries are frequently used by data scientists around the world, giving you an extensive list of the most famous ones in the context of data science.
Ranging from data wrangling to training Machine Learning models, the following libraries are part of most data science scripts in production nowadays. If you are new to R, building a study plan around them will give you a good way to improve your R skills and tackle a wide range of data science projects out there.
Let’s start!
Hint: to be able to run the examples in this article, don’t forget to download and install the libraries mentioned throughout the post.
Data Wrangling – Dplyr
Our first library comes from the tidyverse universe of packages. Dplyr is a data wrangling library, famous for unlocking the power of the pipe operator (%>%) inside R.
R syntax is not particularly clean when it comes to dealing with complex data pipelines. Adding several filters or merging different data frames can make your code messy and confusing. Dplyr powers up your data wrangling game, enabling you to write complex data pipelines swiftly.
Also, with dplyr you can easily chain functions one after the other instead of nesting them in the typical composite format f(g(x)), which quickly scales into messy R code that is almost impossible to maintain or debug and creates a lot of technical debt.
Let’s look at the following example with the iris dataset – imagine I would like to average the Sepal.Width after filtering the rows for the virginica species. Using base R code, we would probably do the following:
mean(iris[iris$Species=='virginica', 'Sepal.Width'])
That doesn’t seem too complex! But now imagine that we want to add another condition, keeping only the virginicas with a Petal Width lower than 2:
mean(iris[(iris$Species=='virginica') & (iris$Petal.Width < 2), 'Sepal.Width'])
This code is a bit more confusing. With Dplyr, we can simply:
iris %>%
  filter(Species == 'virginica', Petal.Width < 2) %>%
  summarise(mean(Sepal.Width))
How cool is that? The %>% operator works by acting on the elements returned by the previous function: summarise is applied only to the output of filter. Inside filter, we pass both conditions separated by a comma, which is more practical and cleaner than chaining multiple & conditions.
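To give one more illustrative sketch of how these verbs chain together (assuming dplyr is loaded), we could compute the average Sepal.Width per species in a single readable pipeline by adding a group_by step:
library(dplyr)

# Average sepal width per species, restricted to flowers with narrower petals --
# each verb works on the output of the one before it
iris %>%
  filter(Petal.Width < 2) %>%
  group_by(Species) %>%
  summarise(avg_sepal_width = mean(Sepal.Width))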
Dplyr is really cool and one of the most important bits to learn if you want to improve your code and make it more readable for others. Some resources you can use to learn more about it:
- The library’s official page.
- The Dplyr cheat sheet.
- The first section of my R For Data Science Udemy course.
Data Access – RODBC
When you want to retrieve data directly from databases, RODBC is your friend. This library enables you to connect to tables inside Database Management Systems through ODBC (Open Database Connectivity) channels and pull data straight from the database, without going through any csv, xlsx or json interfaces. With RODBC, you use SQL to turn data living in a DBMS directly into an R dataframe.
To use RODBC you need to:
- Configure an ODBC connection in your system to the DBMS you want to access;
- Set up the credentials to access the database server. Of course, this means that you need valid permissions to access the data.
And that’s it! Super easy.
For instance, in the example below I am creating a connection to a local MySQL server using the root username:
connection <- odbcConnect("mysqlconn", uid="root")
After creating it, I can immediately use some SQL code to retrieve data directly from the MySQL server:
result1 <- sqlQuery(connection, "SELECT * from sakila.city")
I’m using the SQL query SELECT * from sakila.city to retrieve the entire city table from the sakila database. The result is stored in dataframe format in my result1 object! After this, I can apply all the R code I know to this object, because I’m working with a dataframe!
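One housekeeping detail the snippet above skips: ODBC channels should be closed when you are done with them. RODBC provides odbcClose for that – a quick sketch, reusing the connection object from above:
# Release the ODBC channel once the data has been pulled
odbcClose(connection)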
RODBC can access most database systems, such as Microsoft SQL Server, PostgreSQL, MySQL, SQLite, etc.
Paired with sqldf, these two libraries can interpret SQL code inside R and bring huge productivity gains to your data science and analytics projects.
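As a small, hedged sketch of the sqldf side of that statement – sqldf runs SQL directly against in-memory R data frames, no database required (assuming the package is installed):
# install.packages("sqldf")   # assumption: sqldf is not yet installed
library(sqldf)

# Query a plain R data frame with SQL; the result comes back as a data frame
sqldf("SELECT cyl, AVG(mpg) AS avg_mpg FROM mtcars GROUP BY cyl")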
You can learn more about RODBC:
- RODBC’s official documentation.
- R Bloggers RODBC example.
- Reading External Data Section from my R Programming Course.
Data Visualization – GGPlot2
R base contains plotting functions that you can use as soon as you install R – some examples are plot or barplot, which draw line and bar plots, respectively. These functions are cool, but they have two main limitations:
- Each function has its own arguments to feed data and set up the canvas of the plot.
- Adding elements (headings, labels, etc.) to the plot is pretty cumbersome and confusing.
Luckily, there is another library straight out of the tidyverse that is, arguably, the most famous R library of all time: GGplot2, commonly called GGplot.
Mastering GGplot is a skill in itself. One of its main features is enabling the user to switch between plot types with minimal changes to the code. For instance, let’s imagine we want to plot the internet usage per minute time series using a line plot – we are using R’s WWWusage sandbox dataset:
internet <- data.frame(
  internetusage = WWWusage
)
To plot a line plot, I can use GGplot’s native functions:
ggplot(
  data = internet,
  aes(x = row.names(internet), y = internetusage)
) + geom_line(group = 1) +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank())
GGplot is modular – breaking down the elements we see above:
- We start with the ggplot function, which contains the "base" of our plot. This is where we state the data and how we map the x and y dimensions.
- Then, we add a geom_line. We’ve just added a new "module" to our plot, stating that we want a line plot.
- Finally, we add small aesthetic changes to our plot using theme. In our case, we are hiding all labels and info on the x-axis.
We can keep adding modules to our GGplot using a + sign.
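For instance – purely as an illustration of the modular idea – we could bolt a title and a lighter built-in theme onto the same plot:
# Same base plot, plus two extra modules: titles via labs() and theme_minimal()
ggplot(
  data = internet,
  aes(x = row.names(internet), y = internetusage)
) + geom_line(group = 1) +
  labs(title = "Internet usage per minute", y = "Users") +
  theme_minimal() +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank())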
But wait – I’ve told you that it was really simple to switch between types of plots, so how can we change this into another plot type?
Super simple! For instance, to change this plot into a scatter plot, we just need to replace geom_line with geom_point:
ggplot(
  data = internet,
  aes(x = row.names(internet), y = internetusage)
) + geom_point() +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank())
Together with plotly (another cool plotting library!), GGplot is probably the first choice of most R users for drawing plots. Compared with GGplot, base R makes it harder to switch between different plot types and simply can’t compete in terms of flexibility and code smoothness.
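And since plotly was just mentioned: if you already have a ggplot object, plotly can turn it into an interactive widget with a single call – a minimal sketch, assuming the plotly package is installed:
library(plotly)

# Build the static ggplot first, then hand it to plotly
p <- ggplot(internet, aes(x = row.names(internet), y = internetusage)) +
  geom_point()

# ggplotly() converts it into an interactive, zoomable plot
ggplotly(p)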
Some resources for learning GGplot:
- R Graph Gallery – GGplot;
- GGplot2’s cheat sheet.
- Plotting Data section from my R Programming Course.
Model Training – rpart and randomForest/ranger
If we are talking about libraries for Data Scientists, we must include libraries to train models, right? Let’s start with three of the most famous libraries that train tree-based models.
Rpart is the go-to library to train decision trees. With rpart you can train a simple decision tree with a single line of code:
rpart(
  formula = Species ~ .,
  data = iris
)
With default parameters, rpart trains the following decision tree:

We can also override other tree parameters in rpart, using the control argument:
rpart(
  formula = Species ~ .,
  data = iris,
  control = list(minsplit = 1, minbucket = 1, cp = .001)
)
Here, we are overriding three hyperparameters of our Decision Tree – minsplit, minbucket and cp. This yields a deeper tree:

Rpart is a great way to train your first model with R. It’s also a great way to wrap your head around the huge impact of hyper-parameters, particularly in tree-based models.
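If you want to go one step further, you can draw the fitted tree and score new rows. The sketch below assumes the rpart.plot package (a separate dependency, not covered in this post) for the visualisation; the prediction part uses the standard predict interface:
library(rpart)
# install.packages("rpart.plot")   # assumption: extra package used only for plotting the tree
library(rpart.plot)

deep_tree <- rpart(
  formula = Species ~ .,
  data = iris,
  control = list(minsplit = 1, minbucket = 1, cp = .001)
)

# Draw the fitted tree with its splits and class labels
rpart.plot(deep_tree)

# type = "class" returns predicted labels rather than class probabilities
predictions <- predict(deep_tree, newdata = iris, type = "class")
table(predictions, iris$Species)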
After stumbling upon the concept of Decision Trees, you will naturally end up studying Random Forests – a more stable version of the former.
Random Forests are the "democratic" version of Decision Trees. They are more stable and less prone to overfitting. In R, we also have libraries that will train these models with a single line of code: the randomForest and ranger libraries.
The randomForest library was the first R implementation of Leo Breiman’s original paper. It is, however, slower than ranger. You can train a Random Forest using the randomForest library by doing:
randomForest(
  formula = Species ~ .,
  data = iris,
  ntree = 10
)
It’s that simple! Here, we are training a Random Forest with 10 different decision trees. We can also tweak the hyper-parameters by adding new arguments to the function:
randomForest(
  formula = Species ~ .,
  data = iris,
  ntree = 10,
  mtry = 2
)
In this example, we are sampling 2 random columns as candidates at each split of every tree we build. You can check the full list of parameters by typing ?randomForest in your R console.
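A nice bonus of random forests is the variable importance they compute while training. As a small sketch (the importance = TRUE flag is an extra argument not used above):
library(randomForest)

rf_model <- randomForest(
  formula = Species ~ .,
  data = iris,
  ntree = 10,
  mtry = 2,
  importance = TRUE   # also compute permutation-based importance
)

# How much each feature contributes, as mean decrease in accuracy and in Gini
importance(rf_model)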
Alternatively, ranger is a faster implementation of Random Forest – if you have high-dimensional data (lots of rows or columns), ranger is a better choice:
ranger(
  formula = Species ~ .,
  data = iris,
  num.trees = 10
)
We can also give new hyper-parameters to ranger using new arguments on the function:
ranger(
  formula = Species ~ .,
  data = iris,
  num.trees = 10,
  mtry = 2
)
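Scoring with ranger works slightly differently from randomForest: predict returns an object whose predictions element holds the labels. A quick sketch, scoring the training data just to show the mechanics:
library(ranger)

rf <- ranger(
  formula = Species ~ .,
  data = iris,
  num.trees = 10
)

# predict() returns a ranger prediction object; the labels live in $predictions
preds <- predict(rf, data = iris)
head(preds$predictions)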
rpart, randomForest and ranger are great starting points to understand how we can train models with different hyper-parameters using R. Wouldn’t it be cool if we had some abstraction layer, similar to what GGplot does for plots, i.e., a library that would let us switch between models much faster?
Luckily, we do! It’s called caret!
Before talking about caret, some resources to learn about these libraries:
- Rpart official documentation;
- randomForest official documentation;
- ranger official documentation;
- Tree-based Models Section of my R For Data Science Udemy course.
Model Training – Caret
The caret library is one of the coolest libraries related to machine learning in R. As we’ve seen before, there are certain libraries that can train specific models, such as rpart (for decision trees) and randomForest/ranger (for Random Forests). To switch between these models we need to change the functions and probably tweak some of the parameters we are feeding to them.
Caret abstracts the models into a generic train function that can be used with different models by providing a method argument. The main difference from other libraries is that the model is now abstracted as an argument inside caret, instead of being a standalone function.
With caret, it’s also very simple to compare models’ performance and results – a standard task in machine learning projects.
As an example, let’s see how to train a Random Forest model using the train function from caret – for simplicity, we are using the iris dataset:
caret_model <- train(
  Species ~ .,
  data = iris,
  method = "ranger",
  num.trees = 4
)
Nice! We just use method to define the model! In this case, we used ranger to train a random forest based on the ranger library implementation. How easy is it to train a decision tree on the same dataset? We just change the method!
decision_tree <- train(
  Species ~ .,
  data = iris,
  method = "rpart"
)
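Because every model trained with caret comes back as the same kind of object, scoring also looks identical regardless of the underlying library – a short sketch, reusing the two models trained above:
# Same predict() call whatever the method behind the model was
head(predict(caret_model, newdata = iris))
head(predict(decision_tree, newdata = iris))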
You can literally choose hundreds of models when using caret.
This flexibility makes caret one of the most impressive libraries you can use inside R. Other advantages of caret include:
- A nice implementation of cross-validation methods;
- A smooth way to perform grid searches over hyper-parameters – both illustrated in the sketch below.
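Here is a minimal sketch of both ideas at once: 5-fold cross-validation plus a small grid over rpart’s cp hyper-parameter (the grid values are just illustrative choices):
library(caret)

# 5-fold cross-validation instead of the default bootstrap resampling
ctrl <- trainControl(method = "cv", number = 5)

# A few complexity parameter values to try for the rpart method
grid <- expand.grid(cp = c(0.001, 0.01, 0.1))

tuned_tree <- train(
  Species ~ .,
  data = iris,
  method = "rpart",
  trControl = ctrl,
  tuneGrid = grid
)

# Cross-validated accuracy for each value of cp
tuned_tree$results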
You can learn more about caret on the following links:
- Caret’s official documentation;
- Machine Learning Plus blog post about caret;
Model Training – h2o
Caret is suited for most machine learning models. Nevertheless, when you want to get a bit more advanced, h2o is your friend.
h2o contains really cool implementations of feed-forward neural networks and other advanced models. If you want to run experiments and perform advanced tweaking of your models, h2o is a great place to start.
Of all the libraries presented here, h2o is definitely the most advanced one. By studying it, you will stumble upon a lot of new concepts, such as distributed and scalable environments. These features make h2o well suited for machine learning deployments, something that caret or other ML libraries may struggle with.
Let’s build a quick example of a simple Neural Network using h2o on the iris dataset:
h2o.init()
iris.frame <- as.h2o(iris)
model <- h2o.deeplearning(
  x = c('Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width'),
  y = 'Species',
  training_frame = iris.frame,
  hidden = c(2, 2),
  epochs = 10
)
Due to its distributed nature, using h2o is not as simple as using other machine learning libraries. In the code above, we are doing the following:
- h2o.init() – starts the local h2o cluster.
- as.h2o(iris) – converts the iris dataframe into an h2o frame.
- h2o.deeplearning() – trains the neural network using the iris h2o frame.
Although caret supports neural networks, h2o has more flexibility when tweaking layers, activation functions and other parameters. Above, we are using 2 hidden layers with 2 nodes each – with the hidden parameter, we could scale this to even deeper networks just by modifying the vector given in this argument.
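Once trained, the model and the data both live inside the h2o cluster, so evaluation and scoring also go through h2o functions – a short sketch reusing the model and frame from above:
# Training metrics computed by the cluster (confusion matrix, log loss, etc.)
h2o.performance(model)

# Score the h2o frame; the result is itself an h2o frame, so convert back when needed
predictions <- h2o.predict(model, newdata = iris.frame)
head(as.data.frame(predictions))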
Use the following resources to dig a bit deeper into h2o:
- KDNuggets’ h2o using R blog post;
- h2o official documentation;
And we’re done! Thank you for taking the time to read this post. Do you often use other libraries in your work that are not mentioned here? Write them down in the comments below!
I’ve set up an introduction to R and a Bootcamp on learning Data Science on Udemy. Both courses are tailored for beginners and I would love to have you around!