
7 great functions for your Exploratory Analysis in R

7 functions to improve your EDA process in R

These functions will elevate your Exploratory Analysis to the next level

Photo by Carlos Muza on Unsplash

The EDA (Exploratory Data Analysis) phase of the Data Mining framework is one of the main activities when it comes to extracting information from a dataset. Whatever your ultimate goal is, be it Neural Networks, Statistical Analysis, or Machine Learning, everything should start with a good understanding and overview of the data you’re dealing with.

One of the main characteristics of an EDA is that it is a somewhat open process that depends on the toolbox and the inventiveness of the Data Scientist. Unfortunately, this is both a blessing and a curse, as a poorly done EDA may hide relevant relationships in the data or even impair the study’s validity.

Some of the activities that are usually carried out in an EDA (this is by no means an exhaustive list):

  • Analysis of numerical variables: average, minimum and maximum values, data distribution.
  • Analysis of categorical variables: list of categories, frequency of records in each category.
  • Diagnosis of outliers and how they impact the distribution of data for each variable.
  • Analysis of correlations between predictive variables.
  • Relationship between predictive variables vs. the outcome variable.
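
As a rough illustration of the first few activities, here is a minimal base-R sketch. It uses the built-in iris dataset as a stand-in (an assumption made only so the snippet runs on its own; it is not the wine data used later in this article).

```r
# Stand-in data: the built-in iris dataset (4 numeric columns, 1 factor).
df <- iris

# Numerical variables: mean, minimum, and maximum of each numeric column.
num_cols <- df[sapply(df, is.numeric)]
num_summary <- t(sapply(num_cols, function(x) {
  c(mean = mean(x), min = min(x), max = max(x))
}))
print(num_summary)

# Categorical variables: frequency of records in each category.
cat_freq <- table(df$Species)
print(cat_freq)

# Correlations between the numeric variables (Pearson is the default).
cor_matrix <- cor(num_cols)
print(round(cor_matrix, 2))
```

The functions presented below cover the same ground, but with far less code and with a more visual output.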

There is no magic formula when it comes to EDA, but there are certainly some packages and functions to keep in mind when analyzing your data to maintain the perfect balance of agility and flexibility in your analysis.

The dataset

The dataset used in this article is well known to anyone who has studied or practiced Data Science. It’s found here and it contains information on wines, made available in Portugal in 2008. It contains 1599 observations with 11 physicochemical attributes of Portuguese red wines and 1 target variable, a discrete numerical quality index from 0 to 10.

An outlook of the dataset, 1599 obs. x 12 vars (Image by author)

Packages and functions

The R programming language is one of the most widely used languages among data enthusiasts. After extensive research, I have compiled a list of some of the best functions you should know to perform EDA on your data, focusing on functions that allow a more visual analysis, with tables and graphs.

Overview of the dataset

The visdat package has two interesting functions for a quick and practical overview of your dataset. The vis_dat function shows the variables, the number of observations, and the type of each variable, in a way that’s very easy to interpret.

visdat::vis_dat(wine, sort_type = FALSE)
vis_dat for visualizing the type of the variables (Image by author)

The vis_miss function, from the same package, lets you view the missing values in each variable, thus giving an overview of the integrity of the dataset.

vis_miss for viewing the missing values (Image by author)

Note: the dataset in question did not have any variable with Missing values, so some Missing Values were generated "artificially" for better visualization.
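
For reference, here is one way missing values could be generated artificially before calling vis_miss. This is only a sketch: it uses the built-in mtcars data as a stand-in, and the choice of columns and the number of blanked rows are arbitrary assumptions, not the procedure used for the image above.

```r
# Stand-in data: built-in mtcars (no missing values by default).
set.seed(42)   # make the artificial gaps reproducible
df <- mtcars

# Blank out 5 randomly chosen rows in each of two (arbitrarily chosen) columns.
for (col in c("mpg", "hp")) {
  df[sample(nrow(df), 5), col] <- NA
}

# Missing values per variable, as plain counts.
print(colSums(is.na(df)))

# Draw the missingness map, if visdat is installed.
if (requireNamespace("visdat", quietly = TRUE)) {
  print(visdat::vis_miss(df))
}
```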

Visualizing numeric variables

The package called funModeling provides great functions to plot useful information about your dataset, from basic information about the variables to more specific information such as the gain of information that each variable provides, and the relationship between the predictive variables and the result variable.

One of the functions to keep in mind is the plot_num function, which plots a histogram of each numeric variable. There are several similar functions in other packages, and even ways to do the same directly through ggplot2, but plot_num greatly simplifies the task.

plot_num for plotting distributions of numerical variables (Image by author)

With that, a single plot gives you an overview of all your numerical variables (how readable it is depends, of course, on the number of variables).
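
A minimal call looks like the sketch below. The built-in mtcars data is used as a stand-in so the snippet is self-contained, and the call is skipped if funModeling is not installed.

```r
# Stand-in data: built-in mtcars, where all 11 columns are numeric.
df <- mtcars
num_vars <- sum(sapply(df, is.numeric))
cat("Numeric variables to plot:", num_vars, "\n")

# One histogram per numeric variable, drawn together in a single plot.
if (requireNamespace("funModeling", quietly = TRUE)) {
  funModeling::plot_num(df, bins = 10)
}
```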

Visualizing categorical variables

As with numerical variables, it is important to have an overview of the categorical variables in your dataset. The inspect_cat function of the inspectdf package lets you plot a summary of all categorical variables at once, showing the most common categories within each one.

inspect_cat for viewing most common categories in categorical variables (Image by author)

Note: the dataset used in the article did not have any categorical variables, so the image is illustrative and it was taken from here.
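
To give an idea of the workflow, the sketch below builds a small, made-up data frame with two categorical columns (the values are invented for illustration only), computes the category frequencies in base R, and then passes the data to inspect_cat and show_plot if inspectdf is installed.

```r
# Hypothetical stand-in data with two categorical columns.
df <- data.frame(
  colour  = factor(c("red", "red", "white", "red", "rose", "white")),
  region  = c("Douro", "Minho", "Minho", "Douro", "Douro", "Douro"),
  alcohol = c(9.4, 9.8, 10.1, 9.5, 11.2, 10.0),
  stringsAsFactors = FALSE
)

# Base-R check: relative frequency of each category, most common first.
cat_cols <- names(df)[sapply(df, function(x) is.factor(x) || is.character(x))]
freqs <- lapply(df[cat_cols], function(x) {
  sort(prop.table(table(x)), decreasing = TRUE)
})
print(freqs)

# The same information as a single plot, if inspectdf is installed.
if (requireNamespace("inspectdf", quietly = TRUE)) {
  print(inspectdf::show_plot(inspectdf::inspect_cat(df)))
}
```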

Outlier identification and treatment

The dlookr package offers very interesting functions for analysis, data processing, and reporting, and brings some of the best solutions when it comes to EDA in R. One of the aspects in which this package shines is the information it provides about outliers in numerical variables.

Firstly, the diagnose_outlier function generates a data frame with information for each variable: the count of outliers, the ratio of outliers to total observations, and the average with and without outliers.

dlookr::diagnose_outlier(wine)
diagnose_outlier for viewing information on outliers (Image by author)

The same package also offers the plot_outlier function, which plots the value distribution of each variable with and without the aforementioned outliers.

dlookr::plot_outlier(wine)
plot_outlier to see how the variables look with and without outliers (Image by author)

As can be seen in the chlorides variable, several high values will certainly affect the results when applying statistical models, especially when some models have the assumption that the data is normally distributed.

Note: it is important to remember that outliers should not always be removed. In this case, they may indicate a specific subcategory of wines, especially given the high concentration of outliers in this variable (7% of the values).
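
To make the quantities in the outlier summary more concrete (count, ratio, and mean with and without outliers), here is a base-R sketch. It assumes the usual boxplot rule (values beyond 1.5 times the interquartile range from the hinges) as the definition of an outlier. The helper outlier_summary and its column names are hypothetical, and dlookr’s exact computation may differ.

```r
# Hypothetical helper: summarise outliers in a numeric vector,
# assuming the boxplot (1.5 x IQR) rule defines an outlier.
outlier_summary <- function(x) {
  x <- x[!is.na(x)]
  out <- boxplot.stats(x)$out   # values beyond the boxplot whiskers
  data.frame(
    outliers_cnt   = length(out),
    outliers_ratio = 100 * length(out) / length(x),  # percent of observations
    mean_with      = mean(x),
    mean_without   = mean(x[!x %in% out])
  )
}

# Toy vector: nine ordinary values and one extreme one.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)
s <- outlier_summary(x)
print(s)
```

In this toy vector a single extreme value almost triples the mean, which is the kind of effect worth investigating before deciding whether to keep or treat the outliers.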

Correlation visualization

There are many packages with functions to generate plots of correlations between variables in a dataset, but few provide a complete visualization of several factors like the chart.Correlation function of the PerformanceAnalytics package.

PerformanceAnalytics::chart.Correlation(wine[, 2:7], histogram = TRUE, pch = 15)
chart.Correlation for viewing interactions between variables (Image by author)

It presents:

  • Numerical correlations (Pearson’s coefficient) between numerical variables in the dataset, with larger fonts for larger correlations
  • A mini-scatterplot between each of the pairs of variables
  • A histogram and density plot of each variable

Note: the function only accepts a data frame containing only numeric variables as input, so you should filter the columns beforehand if your dataset contains categorical predictive variables. One suggestion is the keep function (available in the purrr package):

library(purrr)  # provides keep() and re-exports the %>% pipe
wine %>%
    keep(is.numeric)

That way, you keep only the numeric variables in the dataset.

Automated report

Some packages also have functions that automate the generation of EDA reports on your dataset. The currently available report options vary in the breadth and depth of the analyses presented, but all show some kind of summary of the dataset variables, information about missing values, histograms and bar graphs of each variable, etc.

Possibly the best function I tested is the create_report function of the DataExplorer package. For a more standard analysis (which is already quite comprehensive), it allows you to generate a report with just one line of code:

DataExplorer::create_report(wine)
sample of the create_report function from the DataExplorer package, for creating a complete EDA report (Image by author)

Luckily, this function goes far beyond that, since it is possible to customize various aspects of the report, from changing the layout and themes to adjusting specific parameters or choosing exactly which graphics should be included in the report.
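
As a sketch of what such customization might look like, the snippet below switches off two sections through configure_report and passes an outcome variable, a title, and an output file name to create_report. The argument names are written as I recall them from the DataExplorer documentation, so check them against ?create_report and ?configure_report for your installed version. The mtcars data and the file name are stand-ins chosen only for illustration.

```r
# Stand-in data: built-in mtcars, with mpg playing the role of the outcome.
df <- mtcars

if (requireNamespace("DataExplorer", quietly = TRUE)) {
  # Choose which sections go into the report.
  cfg <- DataExplorer::configure_report(
    add_plot_prcomp = FALSE,   # skip the principal component section
    add_plot_qq     = FALSE    # skip the QQ plots
  )

  # Rendering writes an HTML file and needs rmarkdown and pandoc,
  # so in this sketch it only runs in an interactive session.
  if (interactive()) {
    DataExplorer::create_report(
      df,
      y            = "mpg",               # outcome variable
      config       = cfg,
      output_file  = "eda_report.html",   # hypothetical file name
      report_title = "EDA report"
    )
  }
}
```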

Note: it is always important to remember that no single solution covers all the bases when it comes to data analysis and visualization, and automated report generation functions should not be treated as one either.

TLDR

  • Do you need a high-level overview of the dataset? visdat::vis_dat (overview) and visdat::vis_miss (missing values).
  • Do you need information about the numeric variables? funModeling::plot_num.
  • How about the categorical variables? inspectdf::inspect_cat.
  • Correlations between variables? PerformanceAnalytics::chart.Correlation.
  • How about an automated and configurable report? DataExplorer::create_report.

And how about you? Is there a can’t-miss function to automate or aid in the visualization of your data in the Exploratory Analysis? Let me know in the comment section! 😁

Sources and acknowledgments

I would first like to thank the developers responsible for creating and maintaining these incredible packages, which can all be found in R’s official repository, and also the authors of the following sources consulted in my research:

Exploring Categorical Data With Inspectdf

Part 2: Simple EDA in R with inspectdf – Little Miss Data

The Landscape of R Packages for Automated Exploratory Data Analysis

