
8 Must-Know Data Visualizations for Better Data Analysis

Explained with examples using Seaborn

Photo by Lucas Benjamin on Unsplash

Data visualization is a very important part of Data Science. It is quite useful in exploring and understanding the data. In some cases, visualizations are much better than plain numbers at conveying information.

The relationships among variables, the distribution of variables, and underlying structure in data can easily be discovered using data visualization techniques.

In this post, we will learn about the 8 most commonly used types of data visualizations. I will use Seaborn to create visualizations and also try to explain what kind of information we can infer.

We will use the grocery and direct marketing datasets available on Kaggle to create the visualizations.

The grocery dataset contains information about customer purchases at grocery stores. The direct marketing dataset contains relevant data of a marketing campaign done via direct mail.

Let’s start by reading the datasets into pandas dataframes.

import numpy as np
import pandas as pd
import seaborn as sns
sns.set(style='darkgrid')
grocery = pd.read_csv("/content/Groceries_dataset.csv", parse_dates=['Date'])
print(grocery.shape)
grocery.head()
Grocery dataset (image by author)

The dataset contains about 40k rows and 3 columns. We have member number, date of purchase, and the purchased items as columns.
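It is worth checking that the date column was parsed as intended. The sketch below uses a few made-up rows instead of the real file. The "Member_number" column name and the day-first date format are assumptions for illustration; if the real file stores dates day-first, passing dayfirst=True keeps an ambiguous date such as 05-01-2015 from being read as 1 May.

```python
import io
import pandas as pd

# Made-up rows standing in for the real file. The 'Member_number' name and the
# day-first date format are assumptions for illustration.
csv_text = """Member_number,Date,itemDescription
1808,21-07-2015,tropical fruit
2552,05-01-2015,whole milk
2300,19-09-2015,pip fruit
"""

# dayfirst=True tells pandas to read 05-01-2015 as 5 January, not 1 May.
demo = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"], dayfirst=True)

print(demo.dtypes)
print(demo["Date"].dt.month.tolist())
```

If the printed dtype of "Date" is a datetime type and the months look right, date filters like the one used later in the line plot will behave as expected.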

marketing = pd.read_csv("/content/DirectMarketing.csv")
print(marketing.shape)
marketing.head()
Marketing dataset (image by author)

The marketing dataset consists of 1000 observations (i.e. rows) and 10 features (i.e. columns). The focus is on the "AmountSpent" column which indicates how much a customer has spent so far.


1. Line plot

Line plots visualize the relation between two variables. One of them is usually the time. Thus, we can see how a variable changes over time.

In our dataset, we can visualize the number of items purchased over time. First, we will calculate the number of items purchased on each day. The groupby function of pandas will help us do that.

items = (
    grocery[['Date', 'itemDescription']]
    .groupby('Date').count().reset_index()
)
items.rename(columns={'itemDescription':'itemCount'}, inplace=True)
items.head()
Item count per day (image by author)
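To see what the grouping step does, here is a minimal sketch on a tiny, hypothetical purchase table (the rows are invented). It uses size() with a named result, which is one possible alternative to count() followed by rename.

```python
import pandas as pd

# Hypothetical purchases: one row per purchased item.
purchases = pd.DataFrame({
    "Date": pd.to_datetime(["2015-07-01", "2015-07-01", "2015-07-02",
                            "2015-07-02", "2015-07-02", "2015-07-04"]),
    "itemDescription": ["milk", "bread", "milk", "soda", "yogurt", "bread"],
})

# size() counts rows per group; naming the result avoids a separate rename step.
daily = purchases.groupby("Date").size().reset_index(name="itemCount")

# Days with no purchases are absent from the result, so a line plot would
# connect 2015-07-02 directly to 2015-07-04.
print(daily)
```

Each row of the result is one point on the line plot: the date on the x-axis and the item count on the y-axis.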

We can now plot the item count over time. For demonstration purposes, I will only use the last 6 months.

sns.relplot(x = 'Date', y = 'itemCount', 
            data=items[items.Date > '2015-06-01'], 
            kind='line', height=5, aspect=2)
Lineplot (image by author)

Line plots can be produced using the relplot function by passing ‘line’ to the kind parameter.

Note: The purpose of this article is to explain different kinds of visualizations. Thus, we will not focus on customizing or editing the plots (e.g. fontsize, labels, colors, and so on)


2. Scatter plot

A scatter plot is also a relational plot. It is commonly used to visualize the values of two numerical variables. We can observe if there is a correlation between them.

We will visualize the "Salary" and "AmountSpent" columns of the marketing dataset.

Scatter plots can also be produced using the relplot function. The kind parameter takes ‘scatter’ as argument.

sns.relplot(x='Salary', y='AmountSpent', hue='Married', data=marketing, kind='scatter', height=7)
Scatter plot (image by author)

The hue parameter allows us to compare different categories based on the numerical variables. We passed the "Married" column to the hue parameter. Thus, the data points that belong to different categories of this column (single and married) are marked with different colors.

It seems like there is a positive correlation between the spent amount and salary which is expected. The more money you have, the more you are likely to spend. We also see that married people are likely to earn and spend more money than single people.
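A visual impression of correlation can be backed up with a number. The sketch below computes the Pearson correlation overall and within each category. The data is synthetic (generated so that spending rises with salary), so the values it prints say nothing about the real marketing dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the marketing data: spending rises with salary plus noise.
n = 200
salary = rng.uniform(10_000, 100_000, size=n)
spent = 0.02 * salary + rng.normal(0, 300, size=n)
demo = pd.DataFrame({
    "Salary": salary,
    "AmountSpent": spent,
    "Married": rng.choice(["Married", "Single"], size=n),
})

# Pearson correlation over all rows.
overall = demo["Salary"].corr(demo["AmountSpent"])

# The same measure within each category of the hue variable.
by_group = {
    name: group["Salary"].corr(group["AmountSpent"])
    for name, group in demo.groupby("Married")
}

print(round(overall, 2), by_group)
```

A value close to 1 corresponds to the upward-sloping cloud of points seen in a scatter plot.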


3. Histogram

Histograms are used to visualize the distribution of a continuous variable.

In the previous plot, we saw that married people earn more money than single people in general. Let’s plot the distribution of the "Salary" column for married and single people separately.

sns.displot(x='Salary', hue='Married', data=marketing, kind='hist', aspect=1.5)
Histogram of salary (image by author)

We used the displot function which is used to create distribution plots. The kind parameter is set as ‘hist’ to produce histograms.

The salary of married people looks approximately normally distributed. There are many single people with very low salaries.
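The bars of a histogram are simply counts per bin. As a rough sketch of that idea, the snippet below bins two synthetic salary samples (the distribution parameters are invented) using shared bin edges, so the two groups can be compared bar by bar.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic salaries for two groups (made-up parameters).
married = rng.normal(70_000, 15_000, size=500)
single = rng.normal(40_000, 20_000, size=500).clip(min=5_000)

# Shared bin edges so the two histograms are comparable bar by bar.
edges = np.histogram_bin_edges(np.concatenate([married, single]), bins=10)
married_counts, _ = np.histogram(married, bins=edges)
single_counts, _ = np.histogram(single, bins=edges)

print(edges.round(-3))
print(married_counts)
print(single_counts)
```

Each count is the height of one bar. Changing the number of bins changes the picture, so it is worth trying a few values before drawing conclusions about the shape.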


4. Kernel density plot (KDE)

KDE plots are also used to visualize distributions. Instead of using discrete bins like histograms, KDE plots smooth the observations with a Gaussian kernel. As a result, a continuous density estimate is produced.

In order to compare histogram and KDE plots, let’s create the KDE plot version of the histogram in the previous section.

sns.displot(x='Salary', hue='Married', data=marketing, kind='kde', aspect=1.5)
Kde plot of salary (image by author)
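The smoothing idea can be sketched by hand: place a small Gaussian bump on every observation and average the bumps. The snippet below does this with NumPy on synthetic salaries. The bandwidth rule used here (Scott's rule of thumb) is an assumption for the sketch and is not necessarily what seaborn ends up using for a given plot.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

rng = np.random.default_rng(2)
samples = rng.normal(50_000, 10_000, size=300)  # synthetic salaries

# Rule-of-thumb bandwidth (Scott's rule in one dimension); an assumption here.
bandwidth = samples.std(ddof=1) * len(samples) ** (-1 / 5)

# Evaluate the estimate on a grid that extends a little past the data.
grid = np.linspace(samples.min() - 4 * bandwidth,
                   samples.max() + 4 * bandwidth, 400)
density = gaussian_kde_1d(samples, grid, bandwidth)

# A density should integrate to roughly 1 over its support.
area = np.sum(density) * (grid[1] - grid[0])
print(round(area, 3))
```

A larger bandwidth gives a smoother curve that may hide detail, while a smaller one follows the individual observations more closely.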

5. Box plot

A box plot provides an overview of the distribution of a variable. It shows how values are spread out by means of quartiles and outliers.

Let’s check the distribution of the ‘AmountSpent’ column based on different age groups.

We will use the catplot function (cat stands for categorical) and pass ‘box’ to the kind parameter.

sns.catplot(x='Age', y='AmountSpent', data=marketing, kind='box', height=6, aspect=1.3)
Box plot of Age (image by author)

The range of spent amount is narrower for young people than for other age groups. All age groups have outliers on the upper side. The line in the middle of each box shows the median (the middle value) of the variable.
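The elements of a box plot can be computed directly. Here is a minimal sketch on ten made-up values, using the conventional rule that points farther than 1.5 times the interquartile range from the box are drawn as outliers.

```python
import numpy as np

# Ten made-up spending values with one unusually large purchase.
values = np.array([200, 350, 400, 450, 500, 550, 600, 700, 800, 2500])

# The box spans the first to third quartile; the line inside is the median.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Conventional whisker rule: whiskers reach the most extreme points that lie
# within 1.5 * IQR of the box; anything beyond is drawn as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3)
print(outliers)
```

Only the value 2500 falls outside the fences here, so it would appear as a single point above the upper whisker.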


6. Boxen plot

A boxen plot is similar to a box plot but shows more information about the distribution. It can be considered a box plot with a higher resolution.

Let’s create the boxen plot version of the visualization in the previous section so that we can compare.

The only difference in syntax is on the kind parameter which becomes ‘boxen’ in this case.

sns.catplot(x='Age', y='AmountSpent', data=marketing, kind='boxen', height=6, aspect=1.3)
Boxen plot of Age (image by author)

The boxen plot shows more quantiles than the box plot. Thus, it provides more information about the distribution, especially in the tails.

A boxen plot can be preferred over a box plot when working with large datasets or when we need more detailed information about the distribution.
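The extra boxes correspond to quantiles that go progressively deeper into the tails. As a rough sketch of that idea, the snippet below computes nested quantile pairs on synthetic, right-skewed data. How many levels seaborn actually draws is decided by its own rule and is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(3)
values = rng.lognormal(mean=7, sigma=0.6, size=5_000)  # synthetic, right-skewed

# Each level halves the tail probability: 1/4 (the usual box), 1/8, 1/16, 1/32.
levels = {}
for depth in range(2, 6):
    tail = 0.5 ** depth
    lower, upper = np.quantile(values, [tail, 1 - tail])
    levels[f"1/{2 ** depth}"] = (lower, upper)

for name, (lower, upper) in levels.items():
    print(name, round(lower), round(upper))
```

Every interval contains the previous one, which is why the boxes get narrower and longer toward the tails. Estimating the deeper levels reliably needs many observations, which fits the advice to prefer boxen plots for large datasets.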


7. Bar plot

A bar plot provides an overview of the central tendency of a variable. We get an idea about the mean value of the variable as well as the uncertainty about the mean.

It will be clearer when we actually create one. We will use the catplot function with the ‘bar’ option for the kind parameter.

The following is the bar plot of a categorical variable (‘Children’) and a numerical variable (‘AmountSpent’).

sns.catplot(x='Children', y='AmountSpent', data=marketing, kind='bar', height=5, aspect=1.5)
Bar plot of Children and AmountSpent (image by author)

The heights of the bars show the mean value for each category. If we are asked to estimate the spent amount for a person with no children, the average value of about 1400 is a reasonable estimate. Similarly, for a person with 2 children, the estimate will be around 900.

The lines on top of the bars are called error bars and they indicate the level of uncertainty about the estimate. By default, seaborn draws a 95% confidence interval for the mean. We are more certain about the estimate for 0 children than the estimate for 2 children.
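One common way to obtain such an interval is the bootstrap: resample the data with replacement many times and look at how much the mean varies. The sketch below is a hand-written version of that idea on synthetic groups. The helper name, the group sizes, and the distribution parameters are all hypothetical, and this is not seaborn's internal code.

```python
import numpy as np

def mean_with_bootstrap_ci(values, n_boot=1000, level=95, seed=0):
    """Return the mean and a percentile-bootstrap confidence interval for it."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement and record the mean of each resample.
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    half = (100 - level) / 2
    low, high = np.percentile(boot_means, [half, 100 - half])
    return values.mean(), low, high

rng = np.random.default_rng(4)
large_group = rng.normal(1400, 900, size=400)  # many observations
small_group = rng.normal(900, 900, size=40)    # few observations

mean_l, low_l, high_l = mean_with_bootstrap_ci(large_group)
mean_s, low_s, high_s = mean_with_bootstrap_ci(small_group)

print(round(mean_l), round(low_l), round(high_l))
print(round(mean_s), round(low_s), round(high_s))
```

The group with fewer observations gets the wider interval, which is the usual reason one error bar is longer than another.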


8. 2D Histogram

2D histograms combine two histograms on a grid (x-axis and y-axis). Thus, we are able to visualize how often pairs of values occur together. In other words, we visualize the joint distribution of a pair of variables.

We can easily create a 2D histogram using the displot function.

sns.displot(marketing, x='Salary', y='AmountSpent', kind='hist',
            height=6, aspect=1.2)
2D histogram (image by author)

A 2D histogram provides an overview of how two variables are distributed together. The darker regions contain more data points. We can say most people are in the lower region of both the ‘AmountSpent’ and ‘Salary’ columns.
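The shading of each cell is a count of the points that fall into it. A minimal sketch with NumPy on synthetic data (the salary and spending values are generated, not taken from the marketing dataset) shows the grid of counts behind such a plot.

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic data: right-skewed salaries, spending that rises with salary.
n = 1_000
salary = rng.gamma(shape=2.0, scale=25_000, size=n)
spent = 0.02 * salary + rng.normal(0, 200, size=n)

# counts[i, j] is the number of points in salary bin i and spending bin j.
counts, salary_edges, spent_edges = np.histogram2d(salary, spent, bins=8)

# Locate the fullest cell: the row index follows salary, the column index spending.
i, j = np.unravel_index(counts.argmax(), counts.shape)

print(counts.astype(int))
print(salary_edges[i], salary_edges[i + 1], spent_edges[j], spent_edges[j + 1])
```

The fullest cell is the one that would be drawn darkest in the plot.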


Conclusion

We have covered 8 basic yet very functional visualization types. Seaborn keeps the syntax simple.

The concept of grouping visualizations with similar functionality makes it easier to learn the library. For instance, different distribution plots can be created with the displot function by changing the kind parameter.

These basic visualizations can be created with almost any visualization library. The important points are to know when to use them and understand what they tell us.

Thank you for reading. Please let me know if you have any feedback.

