
Data exploration, particularly exploratory data analysis (EDA), is my favorite part of a data science project. A new set of data brings a lingering curiosity and excitement, along with the opportunity to discover subtle relationships and unexpected trends.
In an older post, I covered the 11 essential EDA code blocks that I use every time on a new data set. Today’s article focuses on seaborn visualization plots for univariate analysis (focusing on one feature at a time). Use these plots at any stage of the data science process.
Loading the libraries and data
We are going to explore the vehicles dataset¹ from Kaggle. Also note that in this article, the words ‘variable’, ‘feature’, and ‘column’ all mean the same thing.
Open a new Jupyter notebook and import the required libraries. Seaborn relies on matplotlib, so we’ll import them both. We also set the plot style and the font scale.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
sns.set_style('darkgrid')
sns.set(font_scale=1.3)
Next, we load the CSV file and run some EDA code lines to get a general overview of the data.
cars = pd.read_csv('Car details v3.csv')
cars.head()

cars.shape
#(8128, 13)
cars.dtypes

From these results, we learn the size of our data set (8128 rows and 13 columns) and the column names with their data types, and we see the first five rows to get a glimpse of the values in the data.
Univariate analysis
The prefix ‘Uni’ means one, so ‘univariate analysis’ is the analysis of one variable at a time.
For numeric features, we want to know the range of values present and how often these values (or groups of values) occur.
For categorical features, we want to know the number of unique classes and how frequently they occur.
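As a rough illustration of these two questions, here is a minimal sketch on a small made-up DataFrame. The `toy` frame, its values, and the bin edges are invented for illustration and are not taken from the vehicles dataset.

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration only.
toy = pd.DataFrame({
    "price": [4.5, 3.7, 1.6, 2.3, 1.3, 4.5],
    "fuel": ["Diesel", "Diesel", "Petrol", "Diesel", "Petrol", "Petrol"],
})

# Numeric feature: the range of values and how often groups of values occur.
price_min, price_max = toy["price"].min(), toy["price"].max()
price_bins = pd.cut(toy["price"], bins=[0, 2, 4, 6]).value_counts().sort_index()

# Categorical feature: the number of unique classes and their frequencies.
n_classes = toy["fuel"].nunique()
class_counts = toy["fuel"].value_counts()

print(price_min, price_max)
print(price_bins)
print(n_classes)
print(class_counts)
```

The plots in the rest of the article answer the same questions visually.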
Part 1: Numeric features
These are features with numbers that you can perform mathematical operations on. They are further divided into discrete (countable integers with clear boundaries) and continuous (can take any value, even decimals, within a range).
We first display summary statistics for each numeric feature using df.describe(). This shows us the actual statistics, such as the count, mean, standard deviation, minimum, quartiles, and maximum.
cars.describe()

To get a more intuitive understanding of these distributions, we need to visualize them graphically.
1. Histograms – sns.histplot()
A histogram groups values into ranges (or bins), and the height of a bar shows how many values fall in that range.

From a histogram, we will get the following:
- Range of the data. The minimum and maximum values are on opposite edges of the histogram. Highly concentrated regions are also apparent. Tall bars are where most data points fall whereas sparsely represented ranges appear as gaps or short bars.
- Shape or skewness of the feature. A feature can be right-skewed (tail is towards the right), left-skewed (tail is towards the left), normally distributed (symmetric around one central peak), or irregular (no apparent pattern, often with multiple peaks).
- Presence of outliers. These appear as isolated bars on the far left or right.
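The numbers behind a histogram can also be inspected directly. The sketch below uses a small made-up sample and NumPy's `np.histogram` to show the bin counts, the range, and a skewness estimate. The sample values and the choice of five bins are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample with one extreme value (made-up numbers).
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 20])

# Five equal-width bins spanning the full range of the data.
counts, edges = np.histogram(values, bins=5)

data_range = (values.min(), values.max())
skewness = pd.Series(values).skew()  # a positive value suggests a right tail

print(data_range)
print(edges)
print(counts)
print(skewness)
```

Here most values land in the first bin, two bins are empty, and the isolated count in the last bin is the kind of bar that hints at an outlier.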
The code below creates a seaborn histogram for our target variable, selling price.
sns.histplot(x='selling_price', data=cars);

The code below utilizes a loop to create individual histograms for all numeric variables.
cols = 3
rows = 3
# Select every non-text (numeric) column
num_cols = cars.select_dtypes(exclude='object').columns
fig = plt.figure(figsize=(cols*5, rows*5))
for i, col in enumerate(num_cols):
    # One subplot per numeric column, filled left to right
    ax = fig.add_subplot(rows, cols, i+1)
    sns.histplot(x=cars[col], ax=ax)
fig.tight_layout()
plt.show()

2. KDE plot – sns.kdeplot()
The ‘kernel density estimate’ plot creates a smooth version of a histogram. It places a small smooth curve (a kernel) on every data point and adds these up into one continuous curve whose total area is 1.
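To make the idea of a smoothed histogram concrete, here is a minimal hand-rolled Gaussian KDE in NumPy. It is a sketch of the general technique, not seaborn's implementation. The sample, the `simple_kde` helper, and the fixed bandwidth of 0.5 are all assumptions for illustration.

```python
import numpy as np

# Hypothetical sample (made-up values).
data = np.array([1.0, 1.5, 2.0, 4.0, 4.5])
bandwidth = 0.5  # assumed smoothing width; seaborn chooses its own by default


def simple_kde(x, sample, h):
    """Average of Gaussian bumps of width h centred on each sample point."""
    z = (x[:, None] - sample[None, :]) / h
    bumps = np.exp(-0.5 * z**2) / (h * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)


grid = np.linspace(-2, 8, 1001)
density = simple_kde(grid, data, bandwidth)

# The area under a density curve should be close to 1 (Riemann-sum approximation).
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```

A larger bandwidth gives a smoother curve that hides detail, while a smaller one follows the individual points more closely.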

It is best used when comparing a variable’s distribution between groups of another variable, a concept known as segmented univariate distribution.
The code below compares how engine sizes are distributed among the fuel types. We pass hue='fuel' to split the data by fuel type.
sns.kdeplot(x='engine_cc', data=cars, hue='fuel')
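The same segmented view can be checked numerically with a groupby. The sketch below uses a made-up frame whose column names mirror the article, with invented values.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the article, values are made up.
toy = pd.DataFrame({
    "fuel": ["Diesel", "Diesel", "Diesel", "Petrol", "Petrol", "Petrol"],
    "engine_cc": [1500, 2000, 2500, 800, 1000, 1200],
})

# One row of summary statistics per fuel type.
by_fuel = toy.groupby("fuel")["engine_cc"].describe()
print(by_fuel[["count", "mean", "min", "max"]])
```

Each row of `by_fuel` summarizes one of the curves that the hue split draws.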

3. KDE with Histogram plot – sns.histplot(kde=True)
We can display a histogram with a KDE curve as below. See this post for an understanding of kde and histogram.
sns.histplot(x='selling_price', data=cars, kde=True)

4. Rug plot – sns.rugplot()
A rug plot draws ticks on the x-axis that show the location of individual data points.

Dense clusters of ticks mark where most observations fall. The height of the ticks carries no meaning.
Rug plots complement histograms when it comes to outliers because we can see where the outlier data points fall. The code below creates a rug plot and histogram for the kilometers driven feature. Note the outlier positions.
sns.rugplot(x='km_driven', data=cars, height=.03, color='darkblue')
sns.histplot(x='km_driven', data=cars, kde=True)

5. Box plots – sns.boxplot()
A boxplot shows the distribution, center and skewness of a numeric feature. It divides the data into four sections that each contain approximately 25% of the data.

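Outliers, if present, appear as dots on either end. The whiskers that extend from the box reach the smallest and largest values that are not flagged as outliers (by default, points within 1.5 times the interquartile range of the box edges), so they equal the minimum and maximum only when there are no outliers. The box depicts the interquartile range (IQR) and holds the middle 50% of the data.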
Boxplots take up less space than histograms as they are less detailed. They also define quartile locations and are good for quick comparisons between different features or segments.
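The pieces of a box plot can be reproduced by hand. Below is a minimal sketch with NumPy on a made-up sample, assuming the common convention of fences at 1.5 times the IQR beyond the quartiles.

```python
import numpy as np

# Hypothetical sample with one unusually large value (made-up numbers).
values = np.array([10, 12, 13, 14, 15, 16, 17, 18, 20, 45])

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1  # height of the box

# Fences at 1.5 * IQR beyond the quartiles (assumed convention).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme points that are still inside the fences.
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything beyond the fences is drawn as an individual outlier dot.
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3, iqr)
print(whisker_low, whisker_high)
print(outliers)
```

In this sample the upper whisker stops at 20 rather than at the maximum, because 45 lies beyond the upper fence and is treated as an outlier.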
The code below creates a boxplot of the mileage feature.
sns.boxplot(x=cars['mileage_kmpl'])

Suppose you want to compare the distribution of two columns that are related; perhaps they have the same measuring unit. We can create a box plot and pass a DataFrame holding just those two columns to the data parameter, as below.
sns.boxplot(data=cars.loc[:, ['engine_cc', 'max_power_bhp']])

The code below creates boxplots for all the numeric variables in a loop.
cols = 3
rows = 3
# Select every non-text (numeric) column
num_cols = cars.select_dtypes(exclude='object').columns
fig = plt.figure(figsize=(15,9))
for i, col in enumerate(num_cols):
    # One subplot per numeric column, filled left to right
    ax = fig.add_subplot(rows, cols, i+1)
    sns.boxplot(x=cars[col], ax=ax)
fig.tight_layout()
plt.show()

6. Violin plot – sns.violinplot()
The violin plot features a combination of a box plot and a kernel density plot. This means that in addition to showing the quartiles, it also lays out the underlying distribution such as presence and location of different peaks.

The code below creates a seaborn violin plot of the year data.
sns.violinplot(x=cars["year"])

7. Strip plot – sns.stripplot()
A strip plot implements a scatter plot to show the spread of individual observations for a feature.
Dense locations indicate areas with many overlapping points, and you can quickly spot outliers. However, unlike with a box plot, it is hard to establish the center of the data, and strip plots work best for smaller datasets.
The code below creates a strip plot of selling price.
sns.stripplot(x=cars["selling_price"]);

Part 2: Categorical features
These are columns with a limited number of possible values. Examples are sex, country, or age group.
Before creating the plots, we’ll first run the summary statistics that show information such as the number of unique classes per feature. This will inform us which features can be effectively visualized. If there are too many classes, the plots are cluttered and unreadable.
cars.describe(include='object')
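For readers unfamiliar with the output of this call, here is a minimal sketch on a made-up frame. The values are invented, and the text columns are cast to the object dtype explicitly so the example behaves the same across pandas versions.

```python
import pandas as pd

# Hypothetical toy data (made-up values); text columns cast to object dtype.
toy = pd.DataFrame({
    "name": ["Swift", "Rapid", "City", "i20", "Swift"],
    "fuel": ["Diesel", "Diesel", "Petrol", "Diesel", "Petrol"],
    "price": [4.5, 3.7, 1.6, 2.3, 1.3],
}).astype({"name": "object", "fuel": "object"})

# Rows of the summary: count, unique (number of classes),
# top (most frequent class) and freq (how often `top` occurs).
summary = toy.describe(include="object")
print(summary)
```

The `unique` row is the one to check before plotting: a column with many classes will produce a cluttered chart.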

Now we can visualize the four categorical features highlighted above.
8. Count plot – sns.countplot()
A count plot compares different classes of a categorical feature and how often they occur. Think of a bar chart with the bar height showing number of times each class occurs in the data.

The code below uses a loop to create count plots for the categorical features with fewer than 10 unique classes.
cols = 4
rows = 1
fig = plt.figure(figsize=(16,6))
all_cats = cars.select_dtypes(include='object')
# Keep only the categorical columns with fewer than 10 unique classes
cat_cols = all_cats.columns[all_cats.nunique() < 10]
for i, col in enumerate(cat_cols):
    ax = fig.add_subplot(rows, cols, i+1)
    sns.countplot(x=cars[col], ax=ax)
    # Rotate the tick labels of the current subplot
    plt.xticks(rotation=45, ha='right')
fig.tight_layout()
plt.show()
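The column filter used in the loop can be tried in isolation. The sketch below applies the same idea to a made-up frame. The column contents and the `max_classes` name are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data; all columns cast to object dtype like text columns.
toy = pd.DataFrame({
    "name": [f"car_{i}" for i in range(12)],    # 12 unique values: too many to plot
    "fuel": ["Diesel", "Petrol"] * 6,             # 2 unique values
    "owner": ["First", "Second", "Third"] * 4,    # 3 unique values
}).astype("object")

max_classes = 10  # assumed threshold, matching the loop above
all_cats = toy.select_dtypes(include="object")
n_unique = all_cats.nunique()

# Boolean mask over the column index: keep columns below the threshold.
cat_cols = all_cats.columns[n_unique < max_classes]
print(list(cat_cols))
```

Only the low-cardinality columns survive the filter, so each count plot stays readable.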

9. Pie chart – plt.pie()
A pie chart displays the percentage distribution of a categorical variable in a circular graph.

Pie charts are not very popular with the visualization community. For one, the graph appears cluttered when there are more than four groups. Two, the relative sizes of the slices are sometimes not intuitively clear.
Seaborn does not implement pie charts. We’ll use the matplotlib version.
Unlike seaborn’s count plot, plt.pie() does not calculate the counts under the hood. We will therefore get the counts first using Series.value_counts().
df = cars['transmission'].value_counts()
df
# Manual       7078
# Automatic    1050
# Name: transmission, dtype: int64
Next, we create the pie chart using plt.pie() and pass it the values for each group, the labels for each slice (optional), and how to display the values inside the slices (optional).
plt.pie(df, labels=df.index, autopct="%.0f%%");

autopct="%.0f%%" is a format string that is applied to each slice’s percentage. %.0f is a placeholder for the floating-point value rounded to 0 decimal places. The last part (two percentage signs) is an escaped percentage sign: the first % escapes the second, so a single literal % is printed after the number.
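The format string can be tried outside of matplotlib. The sketch below applies it to the transmission counts reported above, using plain Python string formatting.

```python
# Counts reported above for the transmission column.
counts = {"Manual": 7078, "Automatic": 1050}
total = sum(counts.values())

autopct = "%.0f%%"

# Convert each count to a percentage of the total, then format it.
labels = {name: autopct % (100 * n / total) for name, n in counts.items()}
print(labels)
```

Changing the string to "%.1f%%" would keep one decimal place instead.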
Conclusion
Univariate analysis covers just one aspect of data exploration. It examines the distribution of individual features to determine their importance in the data. The next step is to understand the relationships and interactions between the features, also called bivariate and multivariate analysis.
I hope you enjoyed the article. All the code and data files are in this GitHub link.
Thank you for reading, and I wish you the best in your data journey!
References
- ‘Vehicle dataset’ by Nehal Birla, Nishant Verma and Nikhil Kushwaha is licensed under ‘DbCL’