
8 Seaborn Plots for Univariate Exploratory Data Analysis (EDA) in Python

Learn how to visualize and analyze one variable at a time using seaborn and matplotlib

Photo by Pixabay from Pexels

Data exploration, particularly exploratory data analysis (EDA), is my favorite part of a data science project. A new set of data brings a lingering curiosity and excitement, along with the chance to discover subtle relationships and unexpected trends.

In an older post, I covered the 11 essential EDA code blocks that I use every time on a new data set. Today’s article covers seaborn plots for univariate analysis (examining one feature at a time). You can use these plots at any stage of the data science process.

Loading the libraries and data

We are going to explore the vehicles dataset¹ from Kaggle. Note that in this article, the words ‘variable’, ‘feature’, and ‘column’ all mean the same thing.

Open a new Jupyter notebook and import the required libraries. Seaborn relies on matplotlib and we’ll import them both. Also, set the style and font.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
sns.set_style('darkgrid')
sns.set(font_scale=1.3)

Next, we load the csv file and run some EDA code lines to get a general overview of the data.

cars = pd.read_csv('Car details v3.csv')
cars.head()
Image by author
cars.shape
#(8128, 13)
cars.dtypes
Image by author

From these results, we learn the size of the data set (8,128 rows and 13 columns) and the column names with their data types. The first five rows give a glimpse of the values in the data.

Univariate analysis

The prefix ‘uni’ means one, so ‘univariate analysis’ is the analysis of one variable at a time.

For numeric features, we want to know the range of values present and how often these values (or groups of values) occur.

For categorical features, we want to know the number of unique classes and how frequently they occur.

Part 1: Numeric features

These are features with numbers that you can perform mathematical operations on. They are further divided into discrete (countable integers with clear boundaries) and continuous (can take any value, even decimals, within a range).

We first display summary statistics for each numeric feature using df.describe(). This returns the count, mean, standard deviation, minimum, quartiles, and maximum of each numeric column.

cars.describe()
Summary stats of numeric features

To get a more intuitive understanding of these distributions, we need to visualize them.

1. Histograms – sns.histplot()

A histogram groups values into ranges (or bins), and the height of a bar shows how many values fall in that range.

Histogram by author

From a histogram, we will get the following:

  • Range of the data. The minimum and maximum values are on opposite edges of the histogram. Highly concentrated regions are also apparent. Tall bars are where most data points fall whereas sparsely represented ranges appear as gaps or short bars.
  • Shape or skewness of the feature. A feature can be right-skewed (the tail extends to the right), left-skewed (the tail extends to the left), normally distributed (symmetric around one central peak), or irregular (no apparent pattern, or multiple peaks).
  • Presence of outliers. These appear as isolated bars on the far left or right.
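
The three readings above can also be checked numerically. The following is a minimal sketch on made-up data: the `prices` values are synthetic and not taken from the vehicles dataset. It uses `np.histogram` to get the bar heights a histogram would draw and `Series.skew()` to put a number on the skewness.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (hypothetical values, not the real dataset)
rng = np.random.default_rng(42)
prices = pd.Series(rng.lognormal(mean=13, sigma=0.8, size=1000))

# Range of the data: the two edges of the histogram
print("min:", round(prices.min()), "max:", round(prices.max()))

# Bin counts: the bar heights a 10-bin histogram would show
counts, edges = np.histogram(prices, bins=10)
print(counts)

# A positive skewness value indicates a longer right tail
skewness = prices.skew()
print("skewness:", round(skewness, 2))
```

Bins on the far right that hold only a handful of points correspond to the isolated bars that hint at outliers.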

The code below creates a seaborn histogram for our target variable selling price.

sns.histplot(x='selling_price', data=cars);
Histogram by author

The code below utilizes a loop to create individual histograms for all numeric variables.

cols = 3
rows = 3
# Numeric columns are those that are not of type 'object'
num_cols = cars.select_dtypes(exclude='object').columns
fig = plt.figure(figsize=(cols*5, rows*5))
for i, col in enumerate(num_cols):
    # One subplot per numeric column, filled row by row
    ax = fig.add_subplot(rows, cols, i+1)
    sns.histplot(x=cars[col], ax=ax)

fig.tight_layout()
plt.show()

2. KDE plot – sns.kdeplot()

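The ‘kernel density estimate’ (KDE) plot is a smoothed version of a histogram. It places a small smooth curve (a kernel) on each data point and adds these curves up into one continuous density curve, so that the total area under the curve equals 1.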
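
To show what the smoothing does, here is a minimal hand-rolled sketch of a Gaussian kernel density estimate. The five data points and the bandwidth of 0.8 are arbitrary choices for illustration; seaborn picks a bandwidth automatically, so its curve will not match this one exactly.

```python
import numpy as np

# Five made-up observations (hypothetical values)
points = np.array([1.0, 2.0, 2.5, 4.0, 7.0])
bandwidth = 0.8  # smoothing width, chosen arbitrarily here
grid = np.linspace(-3, 11, 500)

# One Gaussian bump per observation, evaluated on the grid
z = (grid[:, None] - points[None, :]) / bandwidth
bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))

# The density estimate is the average of the bumps
density = bumps.mean(axis=1)

# The area under the curve should be close to 1
step = grid[1] - grid[0]
area = density.sum() * step
print(round(area, 3))
```

A larger bandwidth gives a smoother curve that can hide peaks, while a smaller one gives a bumpier curve that follows individual points.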

KDE plot by author

It is best used when comparing a variable’s distribution between groups of another variable, a concept known as segmented univariate distribution.

The code below compares how engine sizes are distributed among the fuel types. We pass hue='fuel' to split the data by the fuel types.

sns.kdeplot(x='engine_cc', data=cars, hue='fuel')
kdeplot by author

3. KDE with Histogram plot – sns.histplot(kde=True)

We can display a histogram with a KDE curve as below. See this post for an understanding of kde and histogram.

sns.histplot(x='selling_price', data=cars, kde=True)
Histogram with kde by author

4. Rug plot – sns.rugplot()

A rug plot draws ticks on the x-axis that show the location of individual data points.

Rugplot by author

Dense clusters of ticks show where most observations fall. The height of the ticks carries no meaning.

Rug plots complement histograms when it comes to outliers because we can see where the outlier data points fall. The code below creates a rugplot and histogram for the kilometers driven feature. Note the outlier positions.

sns.rugplot(x='km_driven', data=cars, height=.03, color='darkblue')
sns.histplot(x='km_driven', data=cars, kde=True)
Rugplot and histogram by author

5. Box plots – sns.boxplot()

A boxplot shows the distribution, center, and skewness of a numeric feature. It divides the data into four sections that each contain approximately 25% of the data.

Boxplot illustration created by author

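Outliers, if present, appear as dots beyond either whisker. The whiskers extend from the box to the lowest and highest values that are not outliers (by default, values within 1.5 times the interquartile range of the box edges), so they mark the minimum and maximum only when there are no outliers. The box spans the interquartile range (IQR) and holds the middle 50% of the data.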
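
The box, whiskers, and outlier dots can be computed directly. This is a minimal sketch, assuming the common 1.5 × IQR rule (the default `whis=1.5` in seaborn and matplotlib) and a small set of made-up mileage values.

```python
import numpy as np

# Made-up mileage values with one extreme point (hypothetical data)
mileage = np.array([12, 14, 15, 16, 17, 18, 19, 20, 21, 23, 42], dtype=float)

# The box runs from the first to the third quartile
q1, median, q3 = np.percentile(mileage, [25, 50, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn as outlier dots
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = mileage[(mileage < lower_fence) | (mileage > upper_fence)]

# Whiskers stop at the most extreme points still inside the fences
inside = mileage[(mileage >= lower_fence) & (mileage <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

print(q1, median, q3, iqr)                  # 15.5 18.0 20.5 5.0
print(outliers, whisker_low, whisker_high)  # [42.] 12.0 23.0
```

Here the upper whisker ends at 23, not at the maximum of 42, because 42 lies beyond the upper fence of 28 and is drawn as a separate dot.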

Boxplots take up less space than histograms because they are less detailed. They also mark the quartile locations and are good for quick comparisons between different features or segments.

The code below creates a boxplot of the mileage feature.

sns.boxplot(x=cars['mileage_kmpl'])
Boxplot by author

Suppose you want to compare the distribution of two columns that are related; perhaps they have the same measuring unit. We can create a box plot and pass the two columns in the data as below.

sns.boxplot(data=cars.loc[:, ['engine_cc', 'max_power_bhp']])
Boxplots by author

The code below creates boxplots for all the numeric variables in a loop.

cols = 3
rows = 3
num_cols = cars.select_dtypes(exclude='object').columns
fig = plt.figure(figsize=(15, 9))
for i, col in enumerate(num_cols):
    ax = fig.add_subplot(rows, cols, i+1)
    sns.boxplot(x=cars[col], ax=ax)

fig.tight_layout()
plt.show()
Boxplots for all features by author

6. Violin plot – sns.violinplot()

The violin plot combines a box plot and a kernel density plot. This means that in addition to showing the quartiles, it also lays out the underlying distribution, such as the presence and location of different peaks.
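
The difference is easiest to see on data with two peaks. The sketch below uses synthetic values (two made-up clusters of years, not the real `year` column) and draws a box plot and a violin plot of the same data on a shared axis.

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic two-peaked data (hypothetical values)
rng = np.random.default_rng(1)
values = np.concatenate([
    rng.normal(2005, 2, size=300),
    rng.normal(2016, 2, size=300),
])

fig, axes = plt.subplots(2, 1, figsize=(8, 6), sharex=True)
sns.boxplot(x=values, ax=axes[0])     # shows the quartiles only
sns.violinplot(x=values, ax=axes[1])  # also shows the two peaks
axes[0].set_title("Box plot")
axes[1].set_title("Violin plot")
fig.tight_layout()
```

The box plot summarizes both clusters in a single wide box, while the violin widens around each cluster and narrows in between.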

Image from source

The code below creates a seaborn violin plot of the year data.

sns.violinplot(x=cars["year"])
Violin plot by author

7. Strip plot – sns.stripplot()

A strip plot is a scatter plot of a single feature that shows the spread of its individual observations.

Dense locations indicate areas with many overlapping points, and you can quickly spot outliers. However, unlike a box plot, a strip plot does not make the center of the data easy to locate, and it works best for smaller datasets.
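
When points overlap heavily, jitter and transparency can make dense regions easier to read. This is a minimal sketch on synthetic prices (made-up values); the `jitter`, `alpha`, and `size` settings are illustrative choices, not recommended defaults.

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic right-skewed prices (hypothetical values)
rng = np.random.default_rng(7)
prices = rng.lognormal(13, 0.8, size=500)

fig, ax = plt.subplots(figsize=(8, 3))
# jitter spreads points vertically; alpha makes overlaps appear darker
sns.stripplot(x=prices, jitter=0.3, alpha=0.4, size=3, ax=ax)
ax.set_xlabel("selling_price (synthetic)")
```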

The code below creates a strip plot of selling price.

sns.stripplot(x=cars["selling_price"]);
Strip plot by author

Part 2: Categorical features

These are columns with a limited number of possible values. Examples are sex, country, or age group.

Before creating the plots, we’ll first run the summary statistics that show information such as the number of unique classes per feature. This will inform us which features can be effectively visualized. If there are too many classes, the plots are cluttered and unreadable.
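
The same check can be done programmatically. Here is a minimal sketch on a tiny made-up frame (the rows and the threshold of 3 are hypothetical) that counts the unique classes per non-numeric column and keeps only the columns with few enough classes to plot.

```python
import pandas as pd

# Tiny made-up frame (hypothetical rows, not the real dataset)
demo = pd.DataFrame({
    "name": ["Swift Dzire", "Rapid", "City", "i20", "Swift VXI", "Alto"],
    "fuel": ["Diesel", "Diesel", "Petrol", "Diesel", "Petrol", "Petrol"],
    "transmission": ["Manual", "Manual", "Manual", "Manual", "Automatic", "Manual"],
    "year": [2014, 2014, 2006, 2010, 2007, 2017],
})

# Number of unique classes in each non-numeric column
n_unique = demo.select_dtypes(exclude="number").nunique()
print(n_unique)

# Keep the columns with few enough classes for a readable plot
max_classes = 3  # arbitrary threshold for this toy example
plottable = n_unique[n_unique <= max_classes].index.tolist()
print(plottable)  # ['fuel', 'transmission']
```

A column like `name`, where almost every row has its own value, is filtered out because its plot would have one bar per row.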

cars.describe(include='object')
Image by author

Now we can visualize the four categorical features highlighted above.

8. Count plot – sns.countplot()

A count plot compares the classes of a categorical feature by how often they occur. Think of a bar chart in which the height of each bar shows the number of times that class occurs in the data.

Count plot by author

The code below uses a loop to create count plots for the categorical features with fewer than 10 unique classes.

cols = 4
rows = 1
fig = plt.figure(figsize=(16, 6))
# Keep only the categorical columns with fewer than 10 unique classes
all_cats = cars.select_dtypes(include='object')
cat_cols = all_cats.columns[all_cats.nunique() < 10]
for i, col in enumerate(cat_cols):
    ax = fig.add_subplot(rows, cols, i+1)
    sns.countplot(x=cars[col], ax=ax)
    # Rotate the tick labels of the current subplot
    plt.xticks(rotation=45, ha='right')

fig.tight_layout()
plt.show()
Countplots by author

9. Pie chart – plt.pie()

A pie chart displays the percentage distribution of a categorical variable in a circular graph.

Pie chart by author

Pie charts are not very popular in the visualization community. For one, the chart looks cluttered when there are more than four groups. For another, the relative sizes of the slices are sometimes hard to judge by eye.

Seaborn does not implement pie charts. We’ll use the matplotlib version.

Unlike sns.countplot(), plt.pie() does not calculate the counts under the hood. We will therefore get the counts using Series.value_counts().

df = cars['transmission'].value_counts()
df
# Manual       7078
# Automatic    1050
# Name: transmission, dtype: int64

Next, we create the pie chart using plt.pie() and pass it the values for each group, the labels for each slice (optional), and how to display the values inside the slices (optional).

plt.pie(df, labels=df.index, autopct="%.0f%%");
Pie chart by author

autopct="%.0f%%" is a format string applied to each slice's percentage. The first part (%.0f) is a placeholder for the floating-point value, rounded to 0 decimal places. The last part (%%) is an escaped percent sign, which prints a literal % after the number.
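
The format string can be tried outside matplotlib. This is a minimal sketch, assuming matplotlib applies a string `autopct` to each slice's percentage with Python's `%` operator; the counts are the `value_counts()` output shown earlier.

```python
# Counts from the value_counts() output for 'transmission'
counts = {"Manual": 7078, "Automatic": 1050}
total = sum(counts.values())

# Apply the same format string to each group's percentage of the total
labels = {name: "%.0f%%" % (100 * n / total) for name, n in counts.items()}
print(labels)  # {'Manual': '87%', 'Automatic': '13%'}
```

Changing the format to "%.1f%%" would print one decimal place instead.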

Conclusion

Univariate analysis covers just one aspect of data exploration. It examines the distribution of each feature on its own to reveal its range, shape, and any unusual values. The next step is to understand the relationships and interactions between the features, also called bivariate and multivariate analysis.

I hope you enjoyed the article. All the code and data files are in this GitHub link.


References

  1. ‘Vehicle dataset’ by Nehal Birla, Nishant Verma and Nikhil Kushwaha is licensed under ‘DbCL’
