
More Robust Multivariate EDA with Statistical Testing

Improve robustness in identifying relationships between variables by incorporating relevant hypothesis testing methods

Pararawendy Indarjo · Towards Data Science · Apr 16, 2024


Exploratory Data Analysis (EDA) is a fundamental skill for data scientists. To emphasize its significance, I would argue that EDA skills are more important than ML modeling skills. Why? Because EDA is relevant in a broader context than ML modeling.

Come across new data you need to familiarize yourself with? Do EDA. Want to gain insights from data? Do EDA. Want to build an ML model? You still need to perform EDA to prepare the modeling dataset.

As you may already be aware, one step in EDA is multivariate analysis. This exercise aims to identify relationships between variables in the dataset. Knowing such relationships (if they exist) allows one to take appropriate action based on the data’s high-level context.

Having said that, we should be aware of noise-induced relationships. A seemingly interesting relationship between two variables observed in a bivariate visualization may be due to noise rather than a true, meaningful relationship.

Fortunately, statistics provides tools to help us determine whether an observed relationship is truly meaningful or most likely due to noise. Yes, we’re talking about statistical hypothesis-testing methods.

In this blog post, I’ll discuss how to use statistical testing methods in addition to standard bivariate visualization to improve the robustness of multivariate EDA exercises.

The remainder of this article will be organized as follows.

  1. Dataset preparation.
  2. Identifying numeric-to-numeric relationships: scatter plot + correlation test.
  3. Identifying numeric-to-categoric relationships: KDE plot + one-way ANOVA.
  4. Identifying categoric-to-categoric relationships: count plot + chi-square test.

Dataset Preparation

We’ll use the popular MPG dataset. It’s a publicly available dataset under a Creative Commons 4.0 license (which allows sharing and adaptation of the dataset for any purpose).

The dataset contains car fuel efficiency figures (miles per gallon, or MPG) and other vehicle attributes. The data dictionary can be found on its UCI Machine Learning Repository page.

We load the dataset and remove missing values (which are minimal in this dataset and therefore acceptable to remove). For method demonstration purposes, we also create a new categorical column called efficiency from the mpg column. It has a value of “yes” if the mpg is at least 25, and “no” otherwise.

# import libraries
import seaborn as sns
import pandas as pd

# load dataset via seaborn lib
df = sns.load_dataset("mpg")

# inspect missing values
df.isna().sum() #result: minimal, hence OK to remove

# remove missing values
df.dropna(inplace=True)

# create a new categorical column based on mpg column
df["efficiency"] = df["mpg"].apply(lambda x: "yes" if x >= 25 else "no")

# dataframe info
df.info()

# dataframe head
df.head()
df.info() output (Image by Author)
df.head() output (Image by Author)

Regarding the goal of multivariate EDA on this dataset, we naturally want to know which factors influence car fuel efficiency. To that end, we will answer the following questions:

  1. What numerical features influence mpg performance?
  2. Do mpg profiles vary depending on origin?
  3. Do different origins result in varying profiles of car efficiency?

Numeric-to-Numeric Relationship

For the first case of multivariate EDA, let’s discuss identifying the relationship between two numerical variables. Here, it is well known that we can use a scatter plot to visually inspect any relationship that exists between the variables.

As previously stated, not all observed patterns are guaranteed to be meaningful. In the numeric-to-numeric case, we can supplement the scatter plot with the Pearson correlation test. First, we calculate the Pearson correlation coefficient for the plotted variables. Second, we determine whether the obtained coefficient is significant by computing its p-value.

The latter step is an important sanity check on whether a given correlation coefficient is large enough to be considered meaningful (i.e., whether there is a linear relationship between the plotted variables). This is especially true with small samples. For example, if we only have 10 data points, the correlation coefficient must be at least about 0.63 to be considered significant (ref)!
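To see where that threshold comes from, here is a minimal sketch (my own illustration, not part of the original analysis) that inverts the t-test underlying the Pearson p-value:

import numpy as np
from scipy.stats import t

def critical_r(n, alpha=0.05):
    # The test statistic is t = r * sqrt(n - 2) / sqrt(1 - r^2),
    # with n - 2 degrees of freedom; solving for r gives the smallest
    # |r| that is significant in a two-sided test at the given alpha
    t_crit = t.ppf(1 - alpha / 2, df=n - 2)
    return t_crit / np.sqrt(n - 2 + t_crit**2)

print(round(critical_r(10), 2))  # ~0.63 for n = 10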

In Python, we can use the pearsonr function from the scipy library to perform this correlation test.

In the following code, we draw a scatter plot of each numerical feature against the mpg column. In each title, we print the correlation coefficient, followed by a double asterisk (**) if the coefficient is significant (p-value < 0.05).

import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

# prepare variables to inspect
numeric_features = ['cylinders', 'displacement', 'horsepower',
                    'weight', 'acceleration', 'model_year']
target = 'mpg'

# Create a figure and axes
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(12, 6))

# Loop through the numerical columns and plot each scatter plot
for i, col in enumerate(numeric_features):
    # Calculate Pearson correlation coefficient
    corr_coeff, p_val = pearsonr(df[col], df[target])

    # Scatter plot using seaborn
    sns.scatterplot(data=df, x=col, y=target, ax=axes[i//3, i%3])

    # Set title with Pearson correlation coefficient;
    # print ** after the correlation if the coefficient is significant
    axes[i//3, i%3].set_title(f'{col} vs {target} (Corr: {corr_coeff:.2f} {"**" if p_val < 0.05 else ""})')

plt.tight_layout()
plt.show()
Numerical features vs mpg (Image by Author)

Observe that all plot titles contain a double asterisk, indicating that the correlations are significant. Thus, we can conclude the following:

  1. Cylinders, displacement, horsepower, and weight have a strong negative correlation with mpg. This means that for each of these variables, a higher value corresponds to lower fuel efficiency.
  2. Acceleration and model year have a medium positive correlation with mpg. This means that longer acceleration times (slower cars) and more recently produced cars are associated with higher fuel efficiency.

Numeric-to-Categoric Relationship

Next, we’ll investigate if the mpg profiles differ depending on the origin. Note that origin is a categorical variable. As a result, we’re considering the numeric-to-categorical case.

A KDE (kernel density estimation) plot, essentially a smoothed version of a histogram, can be used to visualize the mpg distribution with a breakdown for each origin value.

In terms of statistical testing, we can use one-way ANOVA. The hypothesis we want to test is whether there are significant mean differences in mpg between different car origins.

In Python, we can use the f_oneway function from the scipy library to perform one-way ANOVA.
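The function takes one array-like sample per group and returns the F statistic together with the p-value. A minimal sketch of the call (my own illustration, assuming the df prepared earlier):

from scipy.stats import f_oneway

# one mpg sample per origin value (europe, japan, usa)
groups = [g["mpg"].values for _, g in df.groupby("origin")]

f_stat, p_value = f_oneway(*groups)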

In the following code, we create a KDE plot of mpg with breakdowns for different origin values. Next, we run one-way ANOVA and display the p-value in the title.

import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import f_oneway

# Create a KDE plot with hue
sns.set(style="whitegrid")
ax = sns.kdeplot(data=df, x="mpg", hue="origin", fill=True)

# Calculate one-way ANOVA p-value
p_value = f_oneway(*[df[df['origin'] == cat]['mpg'] for cat in df['origin'].unique()])[1]

# Set title with one-way ANOVA p-value
ax.set_title(f'KDE Plot mpg by origin (One-way ANOVA p-value: {p_value:.4f})')

plt.show()
KDE plot of MPG by origin (Image by Author)

The p-value in the plot above is less than 0.05, indicating that the differences in mean mpg across origins are significant. At a high level, we can interpret the plot like this: in general, cars made in the USA are less fuel efficient than cars made elsewhere (the peak of the USA mpg distribution sits to the left of the other origins’ peaks).
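Note that one-way ANOVA only tells us that at least one group mean differs; it does not tell us which origins differ from each other. If that level of detail matters, a pairwise post-hoc test such as Tukey’s HSD is a common follow-up. A minimal sketch using statsmodels (a library not used elsewhere in this post):

from statsmodels.stats.multicomp import pairwise_tukeyhsd

# pairwise mean-mpg comparisons between origins,
# with the family-wise error rate controlled at 5%
tukey = pairwise_tukeyhsd(endog=df["mpg"], groups=df["origin"], alpha=0.05)
print(tukey.summary())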

Categoric-to-Categoric Relationship

Finally, we will evaluate the scenario in which we have two categorical variables. Considering our dataset, we’ll see if different origins produce different car efficiency profiles.

In this case, a count plot with breakdown is the appropriate bivariate visualization. We’ll show the frequency of cars for each origin, broken down by efficiency flag (yes/no).

In terms of the statistical testing method to use, the chi-square test is the way to go. Using this test, we want to validate whether different car origins have different distributions of efficient vs. inefficient cars.

In Python, we can use the chisquare function from the scipy library. However, unlike the previous cases, we must first prepare the data. Specifically, we need to calculate the “expected frequency” of each origin-efficiency value combination.

For readers who want a more in-depth explanation of the expected frequency concept and the chi-square test’s overall mechanics, I recommend reading my earlier blog post on the subject.
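As an aside, scipy also ships chi2_contingency, which computes these expected frequencies internally from a table of observed counts. A minimal sketch, assuming the df prepared earlier:

import pandas as pd
from scipy.stats import chi2_contingency

# observed counts of each origin-efficiency combination
observed = pd.crosstab(df["origin"], df["efficiency"])

chi2, p_value, dof, expected = chi2_contingency(observed)

In this post, however, we will build the expected frequencies by hand to keep the mechanics explicit.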

The code to perform this data preparation is given below.

# create frequency table of each origin-efficiency pair
chi_df = (
    df[['origin', 'efficiency']]
    .value_counts()
    .reset_index()
    .sort_values(['origin', 'efficiency'], ignore_index=True)
)

# calculate expected frequency for each pair:
# (row total for the origin) * (column total for the efficiency flag) / n
n = chi_df['count'].sum()

exp = []
for i in range(len(chi_df)):
    sum_row = chi_df.loc[chi_df['origin'] == chi_df['origin'][i], 'count'].sum()
    sum_col = chi_df.loc[chi_df['efficiency'] == chi_df['efficiency'][i], 'count'].sum()
    e = sum_row * sum_col / n
    exp.append(e)

chi_df['exp'] = exp
chi_df
chi_df result (Image by Author)

Finally, we can run the code below to draw the count plot of car origins with a breakdown by efficiency flag. Furthermore, we use chi_df to perform the chi-square test and obtain the p-value.

import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import chisquare

# Create a count plot with hue
sns.set(style="whitegrid")
ax = sns.countplot(data=df, x="origin", hue="efficiency", fill=True)

# Calculate chi-square p-value
p_value = chisquare(chi_df['count'], chi_df['exp'])[1]

# Set title with chi-square p-value
ax.set_title(f'Count Plot efficiency vs origin (chi2 p-value: {p_value:.4f})')

plt.show()
Count plot efficiency vs origin (Image by Author)

The plot indicates that there are differences in the distribution of efficient cars across origins (p-value < 0.05). We can see that American cars are mostly inefficient, whereas Japanese and European cars follow the opposite pattern.

Summary

In this blog post, we learned how to supplement bivariate visualizations with appropriate statistical testing methods. Doing so improves the robustness of multivariate EDA by filtering out noise-induced relationships that would otherwise pass visual inspection of bivariate plots.
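To wrap up, here is a small helper sketch (my own recap, assuming the df prepared earlier) that routes a pair of columns to the appropriate test from this post and returns its p-value:

import pandas as pd
from scipy.stats import pearsonr, f_oneway, chi2_contingency

def relationship_pvalue(df, x, y):
    # numeric-numeric -> Pearson; numeric-categoric -> one-way ANOVA;
    # categoric-categoric -> chi-square on the contingency table
    x_num = pd.api.types.is_numeric_dtype(df[x])
    y_num = pd.api.types.is_numeric_dtype(df[y])
    if x_num and y_num:
        return pearsonr(df[x], df[y])[1]
    if not x_num and not y_num:
        return chi2_contingency(pd.crosstab(df[x], df[y]))[1]
    num, cat = (x, y) if x_num else (y, x)
    return f_oneway(*[g[num].values for _, g in df.groupby(cat)])[1]

print(relationship_pvalue(df, "weight", "mpg"))         # numeric-to-numeric
print(relationship_pvalue(df, "mpg", "origin"))         # numeric-to-categoric
print(relationship_pvalue(df, "origin", "efficiency"))  # categoric-to-categoric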

I hope this article will help you during your next EDA exercise! All in all, thanks for reading, and let’s connect with me on LinkedIn! 👋
