
Data visualization is a very important part of Data Science. It is quite useful in exploring and understanding the data. In some cases, visualizations are much better than plain numbers at conveying information.
The relationships among variables, the distribution of variables, and underlying structure in data can easily be discovered using data visualization techniques.
In this post, we will learn about the 8 most commonly used types of data visualizations. I will use Seaborn to create the visualizations and explain what kind of information we can infer from each.
We will use the grocery and direct marketing datasets available on Kaggle to create the visualizations.
The grocery dataset contains information about customer purchases at grocery stores. The direct marketing dataset contains data from a marketing campaign done via direct mail.
Let’s start by importing the libraries and reading the datasets into pandas dataframes.
import numpy as np
import pandas as pd
import seaborn as sns
sns.set(style='darkgrid')
grocery = pd.read_csv("/content/Groceries_dataset.csv", parse_dates=['Date'], dayfirst=True)  # assumes the file stores dates as DD-MM-YYYY
print(grocery.shape)
grocery.head()

The dataset contains about 40k rows and 3 columns. We have member number, date of purchase, and the purchased items as columns.
marketing = pd.read_csv("/content/DirectMarketing.csv")
print(marketing.shape)
marketing.head()

The marketing dataset consists of 1000 observations (i.e. rows) and 10 features (i.e. columns). The focus is on the "AmountSpent" column which indicates how much a customer has spent so far.
1. Line plot
Line plots visualize the relationship between two variables. One of them is usually time, so we can see how a variable changes over time.
In our dataset, we can visualize the number of items purchased over time. First, we will calculate the number of items purchased on each day. The groupby function of pandas will help us do that.
items = (
    grocery[['Date', 'itemDescription']]
    .groupby('Date').count().reset_index()
)
items.rename(columns={'itemDescription':'itemCount'}, inplace=True)
items.head()
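To make the groupby step concrete, here is a minimal sketch on a small made-up dataframe. The rows are invented for illustration and are not taken from the Kaggle file; only the column names follow the grocery dataset. Counting the rows per date gives the number of items purchased on each day.

```python
import pandas as pd

# Hypothetical purchases, invented for illustration only
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2015-01-01", "2015-01-01", "2015-01-02",
                            "2015-01-02", "2015-01-02"]),
    "itemDescription": ["whole milk", "yogurt", "soda", "rolls/buns", "soda"],
})

# Count the rows per date and name the resulting column in one step
toy_items = (
    toy.groupby("Date")["itemDescription"]
    .count()
    .reset_index(name="itemCount")
)
print(toy_items)
```

Passing name to reset_index is one way to avoid the separate rename call; both approaches should give the same two-column result.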

We can now plot the item count over time. For demonstration purposes, I will only use the purchases made after June 1, 2015.
sns.relplot(x = 'Date', y = 'itemCount',
data=items[items.Date > '2015-06-01'],
kind='line', height=5, aspect=2)

Line plots can be produced using the relplot function by passing ‘line’ to the kind parameter.
Note: The purpose of this article is to explain different kinds of visualizations. Thus, we will not focus on customizing or editing the plots (e.g. font size, labels, colors, and so on).
2. Scatter plot
A scatter plot is also a relational plot. It is commonly used to visualize the values of two numerical variables, so that we can observe whether there is a correlation between them.
We will visualize the "Salary" and "AmountSpent" columns of the marketing dataset.
Scatter plots can also be produced using the relplot function. The kind parameter takes ‘scatter’ as argument.
sns.relplot(x='Salary', y='AmountSpent', hue='Married', data=marketing, kind='scatter', height=7)

The hue parameter allows us to compare different categories based on the numerical variables. We passed the "Married" column to the hue parameter. Thus, the data points that belong to different categories of this column (single and married) are marked with different colors.
It seems like there is a positive correlation between the amount spent and salary, which is expected: the more money you have, the more you are likely to spend. We also see that married people tend to earn and spend more money than single people.
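A scatter plot suggests a correlation visually, and a single number can back up that impression. The sketch below is only an illustration: the column names mirror the marketing dataset, but the values are randomly generated and the linear relationship is an assumption built into the synthetic data. It computes the Pearson correlation coefficient with the pandas corr method.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the marketing data (not the real values):
# spending is assumed to grow linearly with salary, plus random noise
salary = rng.uniform(10_000, 100_000, size=500)
amount_spent = 0.02 * salary + rng.normal(0, 300, size=500)
toy = pd.DataFrame({"Salary": salary, "AmountSpent": amount_spent})

# Pearson correlation: values near +1 indicate a strong positive linear relationship
r = toy["Salary"].corr(toy["AmountSpent"])
print(round(r, 2))
```

On the real data, the same call would be marketing['Salary'].corr(marketing['AmountSpent']).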
3. Histogram
Histograms are used to visualize the distribution of a continuous variable.
In the previous plot, we saw that married people earn more money than single people in general. Let’s plot the distribution of the "Salary" column for married and single people separately.
sns.displot(x='Salary', hue='Married', data=marketing, kind='hist', aspect=1.5)

We used the displot function, which creates distribution plots. The kind parameter is set to ‘hist’ to produce histograms.
The salaries of married people look roughly bell-shaped, close to a normal distribution. There are many single people with very low salaries.
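Behind the plot, a histogram is just a count of observations per bin. As a rough sketch, the same counting can be done with numpy on synthetic values (the salaries below are randomly generated, and the choice of 10 bins is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
values = rng.normal(50_000, 15_000, size=1_000)  # synthetic salaries, not real data

# Split the observed range into 10 equal-width bins and count the values in each
counts, edges = np.histogram(values, bins=10)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:10.0f} to {right:10.0f}: {count}")
```

Each bar in the plotted histogram corresponds to one of these counts, and the bin edges determine where the bars start and end.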
4. Kernel density plot (KDE)
KDE plots are also used to visualize distributions. Instead of using discrete bins like histograms, KDE plots smooth the observations with a Gaussian kernel. As a result, a continuous density estimate is produced.
In order to compare histogram and KDE plots, let’s create the KDE plot version of the histogram in the previous section.
sns.displot(x='Salary', hue='Married', data=marketing, kind='kde', aspect=1.5)
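To illustrate what smoothing with a Gaussian kernel means, here is a hand-rolled sketch of a kernel density estimate. It is not Seaborn's implementation: the helper name gaussian_kde_sketch, the made-up observations, and the fixed bandwidth of 0.5 are all assumptions for illustration, whereas Seaborn selects a bandwidth automatically.

```python
import numpy as np

def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardised distance from every grid point to every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

samples = np.array([1.0, 1.5, 2.0, 4.0, 4.2])  # made-up observations
grid = np.linspace(-4, 10, 1401)
density = gaussian_kde_sketch(samples, grid, bandwidth=0.5)

# A density estimate should integrate to approximately 1
area = np.sum(density) * (grid[1] - grid[0])
print(round(area, 3))
```

A larger bandwidth gives a smoother curve and a smaller one follows the individual observations more closely, which is the main tuning choice in a KDE plot.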

5. Box plot
A box plot provides an overview of the distribution of a variable. It shows how values are spread out by means of quartiles and outliers.
Let’s check the distribution of the ‘AmountSpent’ column based on different age groups.
We will use the catplot function (cat stands for categorical) and pass ‘box’ to the kind parameter.
sns.catplot(x='Age', y='AmountSpent', data=marketing, kind='box', height=6, aspect=1.3)

The range of the amount spent is narrower for young people than for other age groups. All age groups have outliers on the upper side. The line in the middle of each box shows the median (the middle value) of the variable.
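The quantities a box plot draws can be computed directly. The following sketch uses a handful of made-up numbers and the common 1.5 × IQR rule for flagging outliers, which matches the default whisker setting in Seaborn's box plots:

```python
import numpy as np

values = np.array([2, 4, 4, 5, 6, 7, 8, 9, 10, 30], dtype=float)  # made-up data

# The box spans the first to the third quartile; the line inside is the median
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1  # interquartile range

# Points beyond 1.5 * IQR from the box are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print("quartiles:", q1, median, q3)
print("outliers:", outliers)
```

Here only the value 30 falls outside the fences, so it is the single point that would appear above the upper whisker.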
6. Boxen plot
A boxen plot is similar to a box plot but shows more information about the distribution. It can be considered a box plot with a higher resolution.
Let’s create the boxen plot version of the visualization in the previous section so that we can compare.
The only difference in syntax is the kind parameter, which becomes ‘boxen’ in this case.
sns.catplot(x='Age', y='AmountSpent', data=marketing, kind='boxen', height=6, aspect=1.3)

The boxen plot shows more quantiles than the box plot. Thus, it provides more information about the distribution, especially in the tails.
A boxen plot can be preferred over a box plot when working with large datasets or when we need more detailed information about the distribution.
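The idea of showing more quantiles can be sketched by repeatedly halving the tail probability beyond the quartiles. This is only an approximation of the concept on synthetic data: the number of levels used here is an arbitrary assumption, and Seaborn decides how many boxes to draw based on its own settings and the sample size.

```python
import numpy as np

rng = np.random.default_rng(2)
values = rng.exponential(scale=1_000, size=10_000)  # synthetic, right-skewed data

# A box plot summarizes the data with three cut points
print("quartiles:", np.percentile(values, [25, 50, 75]))

# Letter-value idea: keep halving the tail probability beyond the quartiles
letter_values = {}
for tail in (1/4, 1/8, 1/16, 1/32):
    lower, upper = np.quantile(values, [tail, 1 - tail])
    letter_values[tail] = (lower, upper)
    print(f"tail {tail:.4f}: {lower:8.1f} to {upper:8.1f}")
```

Each pair of quantiles corresponds to one nested box, so the outer boxes describe progressively smaller fractions of the data in the tails.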
7. Bar plot
A bar plot provides an overview of the central tendency of a variable. We get an idea about the mean value of the variable as well as the uncertainty about the mean.
It will be clearer when we actually create one. We will use the catplot function with the ‘bar’ option for the kind parameter.
The following is the bar plot of a categorical variable (‘Children’) and a numerical variable (‘AmountSpent’).
sns.catplot(x='Children', y='AmountSpent', data=marketing, kind='bar', height=5, aspect=1.5)

The heights of the bars show the mean value for each category. If we are asked for the amount spent by a person with no children, we can estimate it with the average value, which is about 1400. That is a reasonable estimate, although not necessarily the most frequent value. Similarly, for a person with 2 children, the estimate will be around 900.
The lines on top of the bars are called error bars, and they indicate the level of uncertainty about the estimate. By default, Seaborn draws a 95% confidence interval for the mean. The error bar for 0 children is shorter than the one for 2 children, so we are more certain about the estimate for 0 children.
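An error bar of this kind can be approximated with a bootstrap: resample the data with replacement many times, compute the mean of each resample, and take the middle 95% of those means. The sketch below uses randomly generated amounts rather than the marketing data, and 1,000 resamples is an assumed setting chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
spent = rng.gamma(shape=2.0, scale=600.0, size=200)  # synthetic amounts, not real data

# Resample with replacement many times and record the mean of each resample
boot_means = np.array([
    rng.choice(spent, size=spent.size, replace=True).mean()
    for _ in range(1_000)
])

# The middle 95% of the resampled means gives the interval for the error bar
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean: {spent.mean():.0f}, 95% interval: {low:.0f} to {high:.0f}")
```

Categories with fewer observations or more spread tend to produce wider intervals, which is why some bars carry longer error bars than others.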
8. 2D Histogram
A 2D histogram divides the plane spanned by two variables (x-axis and y-axis) into a grid of bins and counts the observations in each cell. Thus, we are able to visualize the density of overlaps or concurrence. In other words, we visualize the distribution of a pair of variables.
We can easily create a 2D histogram using the displot function.
sns.displot(marketing, x='Salary', y='AmountSpent', kind='hist',
height=6, aspect=1.2)

A 2D histogram provides an overview of how two variables change concurrently. The darker regions contain more data points. We can say that most people are in the lower region of both the ‘AmountSpent’ and ‘Salary’ columns.
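The counting behind a 2D histogram can be sketched with numpy's histogram2d function. The data below is synthetic (the variable names only echo the marketing columns), and the 5 × 5 grid is an arbitrary choice for readability:

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic stand-ins for the two columns, not the real values
salary = rng.uniform(10_000, 100_000, size=1_000)
amount_spent = 0.02 * salary + rng.normal(0, 300, size=1_000)

# Count how many observations fall into each cell of a 5 x 5 grid
counts, x_edges, y_edges = np.histogram2d(salary, amount_spent, bins=(5, 5))

print(counts.astype(int))
```

Each entry of counts corresponds to one cell in the plot, and cells with higher counts are the ones drawn in darker colors.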
Conclusion
We have covered 8 basic yet very functional visualization types. Seaborn keeps the syntax simple.
Grouping visualizations with similar functionality makes it easier to learn the library. For instance, different visualizations of distributions can be created with the displot function by changing the kind parameter.
These basic visualizations can be created with almost any visualization library. The important points are to know when to use them and understand what they tell us.
Thank you for reading. Please let me know if you have any feedback.