A Practical Guide To Data Visualization: Part 1

Using Python, Seaborn, and Matplotlib

Photo by Sharon McCutcheon from Pexels

If you are reading this, you have probably already figured out (judging from the headline) that I will be talking about Data Visualization, and you have probably created some visualizations or intend to do so.

Sounds like you?

Read on!!!

Data Visualizations are beautiful ways to communicate an observation during exploratory data analysis.

I mean, a picture is worth more than a thousand words, right?

I recently hopped on a Kaggle hackathon hosted by Data Science Nigeria to mark the completion of an AI beginner course. This was my first submission to any Kaggle competition, and I feel super excited to have participated.

While working on the project, I spent some time researching more resources on data visualization and somewhere in my mind, I knew I would love to share this with you, so I had you on my mind all through the process of building this project.

And here I am today, sharing my discoveries with you…

This Practical Guide is a series that will equip you with the basics of Data Visualization, and I encourage you to code along with me!!!

We are going to take a look at a few graphs using Python, Seaborn, and Matplotlib.

Not familiar with the term Data Visualization?? Here is a quick definition.

Data Visualization can be summarized into these four lines:

  • It is a representation of information, be it a trend, a pattern, or outliers, in a given dataset.
  • It is an interdisciplinary field.
  • It involves the use of charts, graphs, plots, and other visual elements.
  • It is an efficient way to represent large amounts of information.

Take a chart, for instance. When we see one, we can quickly spot trends and outliers and thus internalize the information easily. The chart ultimately summarizes a story that you probably won’t be able to understand by staring at a pandas DataFrame or a spreadsheet.

Ever tried to stare at a spreadsheet for long?

How easy was it to make sense of it?

The focus of this article will be on the use of the Python, Matplotlib, and Seaborn libraries, and the reference dataset is from the Data Science Nigeria hackathon hosted on Kaggle. You can have a look at it HERE.

Take Note…

It’s gonna be a long and sweet ride!!!

TL;DR

Grab your coffee and let’s do this!!!

Photo by cottonbro from Pexels

CONTENT OUTLINE

  1. Import Libraries
  2. Importing and reading the Dataset
  3. Setting Figure Aesthetics
  4. Other Customization
  5. Working with colors
  6. Problem Statement
  7. List of Plots covered in the series
  8. Line Plot
  9. Scatter Plot
  10. Count Plot
  11. Box Plot
  12. Categorical Plot
  13. Pair Plots
  14. Creating Multiple Plots
  15. Importance of having domain knowledge before commencing a project.
  16. Conclusion

If you are not familiar with Seaborn and Matplotlib, I suggest you go through their official documentation.

In the meantime, Seaborn is a Python library that is built on Matplotlib and they are both used for data visualization.

Okay, let’s dive in!!!

Import Libraries

pip install opendatasets --upgrade --quiet

I installed opendatasets because I worked directly on Google Colab and my dataset came directly from Kaggle.

pssst… If you are working directly in your Jupyter Notebook, you can skip this code.

import pandas as pd
import opendatasets as od
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
  • %matplotlib inline ensures that our plots are displayed and embedded within the Jupyter Notebook itself, rather than appearing in pop-up windows.

DATASET

I used the opendatasets library to import the dataset directly from Kaggle.

in_out_train =  '/content/data-science-nigeria-patient-treatment/train.csv'
in_out_test = '/content/data-science-nigeria-patient-treatment/test.csv'
sample_submission = '/content/data-science-nigeria-patient-treatment/sample_submission.csv'

Read CSV

train=pd.read_csv(in_out_train)
test=pd.read_csv(in_out_test)
sample = pd.read_csv (sample_submission)

Now that we have our dataset, let’s take a quick look at the first few rows of the train CSV.

train.head(10)
Image by Author
  • You can also try train.tail(10) to see the last ten rows.

Target Variable

We need to identify our target variable for this project using a few lines of code.

target = [col for col in train.columns if col not in test.columns]
target

[out]

['SOURCE']
  • Our target variable for this project is the ‘SOURCE’ column. Hence, we will reference it all through our visualization.
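To see why this comprehension isolates the target, here is a minimal sketch on two tiny made-up frames (the names toy_train and toy_test and their values are illustrative assumptions, not the hackathon data). The only column present in the training frame but absent from the test frame is the label.

```python
import pandas as pd

# Hypothetical stand-ins for the real train/test files
toy_train = pd.DataFrame({"AGE": [34, 61], "SEX": ["F", "M"], "SOURCE": [0, 1]})
toy_test = pd.DataFrame({"AGE": [45], "SEX": ["M"]})

# Columns that exist in train but not in test
toy_target = [col for col in toy_train.columns if col not in toy_test.columns]
print(toy_target)
```

The same idea should carry over to any competition-style dataset where the test file simply omits the column to be predicted.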

Create a plot to see the number of categories present in the ‘SOURCE’ column

plt.figure(figsize=(10, 6), tight_layout=True)  # Set figure size
sns.countplot(x='SOURCE', data=train)  # one bar per category of the target
Image by Author

The plot above is referred to as a count plot. At a glance, it shows the number of observations in each category of the ‘SOURCE’ column, which also gives us an idea of how balanced the target variable is.
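If you would like the numbers behind the bars, a rough sketch with pandas could look like the one below. The series toy_source is a made-up stand-in for train['SOURCE'], so the printed values are illustrative only.

```python
import pandas as pd

# Hypothetical labels standing in for train['SOURCE']
toy_source = pd.Series([0, 0, 0, 1, 1], name="SOURCE")

counts = toy_source.value_counts()                # absolute count per class
shares = toy_source.value_counts(normalize=True)  # fraction of rows per class
print(counts)
print(shares)
```

Shares close to 0.5 each would suggest a balanced target, while a lopsided split hints at class imbalance.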

pssttt…

As much as we want to create visualizations for easy understanding, we also do not want to spend so much time setting the features on each plot.

e.g.

  • title
  • fonts
  • style etc.

It can be time-consuming, hence it is good practice to customize your plots from the beginning.

On this note, let’s have a look at setting our figure aesthetics.

Figure Aesthetics

We can modify the plot appearance by modifying the style or the scale.

  1. STYLE:
  • set_style()
  • axes_style()
  2. SCALE:
  • set_context()
  • plotting_context()
sns.set_style('whitegrid')  # adjust the style
sns.set_context('paper')  # modify the scale
plt.figure(figsize=(10, 6), tight_layout=True)  # Set figure size
sns.countplot(x='SOURCE', data=train)
Image by Author

First, we set the background style of our plots by using sns.set_style(). The default is ‘darkgrid’, and other options include ‘whitegrid’, ‘dark’, ‘white’, and ‘ticks’.

This will give our graphs a better overall look.

Next, we modify the scale of the plot by calling the sns.set_context() function.

The default for this is ‘notebook’. Other preset contexts include ‘paper’, ‘talk’, and ‘poster’.

  • Explore these options to see the one that best suits your project.
plt.figure(figsize=(10, 8), tight_layout=True)  # set figure size
sns.set_style('ticks')  # adjust the style
sns.set_context('poster')  # modify the scale
sns.countplot(x='SOURCE', data=train)
Image by Author

I had initially set the figure size as in the previous plots, i.e. plt.figure(figsize=(10, 6), tight_layout=True), but I noticed the y-axis count ended at "1500" rather than "1750". So when you change the scale (sns.set_context(‘poster’)), always take note of what happens to your plot.
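If you only want a style or scale for a single figure, the axes_style() and plotting_context() functions listed earlier can, as far as I know, be used in a with-block instead of changing the global settings. The sketch below uses made-up data and is only meant to show the pattern.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Temporarily apply a style and a scale to one figure only
with sns.axes_style("ticks"), sns.plotting_context("talk"):
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.plot([1, 2, 3], [2, 4, 3])  # made-up values
    ax.set_title("Styled inside the with-block")

# Called outside a with-block, the function returns a dict-like set of parameters
whitegrid_params = sns.axes_style("whitegrid")
print(whitegrid_params["axes.grid"])
```

Plots created after the with-block fall back to whatever set_style() and set_context() were last called with.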

Okay, let’s continue…

Sometimes, I prefer not to have spines on the top and right axes, so I call the despine() function.

Removing these spines benefits the ‘white’ and ‘ticks’ styles of set_style() the most.

plt.figure(figsize=(10, 8), tight_layout=True)  # set figure size
sns.countplot(x='SOURCE', data=train)
sns.despine()  # remove the top and right spines
Image by Author

Can you spot the difference between the two plots above?

plt.figure (figsize= (10,8), tight_layout= True) #set figure size
sns.set_style('darkgrid')   # adjust the style
plt.rc('grid', c ='r', ls ='-', lw = 0.5)  # gridstyle
plt.rc('axes', titlesize=18)  # axes title fontsize
plt.rc('axes', labelsize=14)   # x and y labels fontsize
plt.rc('xtick', labelsize=12)  # fontsize of the tick labels
plt.rc('ytick', labelsize=12)  # fontsize of the tick labels
plt.rc('legend', fontsize=12)  # legend fontsize
plt.rc('font', size=12)        # modify text sizes
sns.countplot(x='SOURCE', data=train)
Image by Author
  • In the code above, we set our font sizes (title, axes, and legend) once with plt.rc(), so we do not need to repeat them in the subsequent plots.

Sounds good, huh?

However, to switch to the default parameters, you can call sns.set_theme()
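To convince yourself that plt.rc() really changes a global setting, you can inspect plt.rcParams before and after a reset. This is a small sketch using Matplotlib's own plt.rcdefaults(), which restores Matplotlib's defaults, whereas sns.set_theme() applies Seaborn's default theme.

```python
import matplotlib.pyplot as plt

plt.rc("axes", titlesize=18, labelsize=14)  # same kind of call as in the block above
size_after_rc = plt.rcParams["axes.titlesize"]
print(size_after_rc, plt.rcParams["axes.labelsize"])

plt.rcdefaults()  # restore Matplotlib's built-in defaults
size_after_reset = plt.rcParams["axes.titlesize"]
print(size_after_reset)
```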

# Another quick option is to set the parameters in a dictionary and call it on the plt.rc() as shown below.
#font = {'family' : 'serif',
#'weight' : 'light'}
# pass in the font dict as kwargs
#plt.rc('font', **font)
#sns.countplot(train['SOURCE'], label = 'COUNTS')
# You can uncomment the code above to see what the visualization looks like.

I have compiled a list of some aliases to make your work faster and less cumbersome.

df = {'Alias': ['lw', 'ls', 'c', 'fc', 'ec', 'ha'], 'Property': ['linewidth', 'linestyle', 'color', 'facecolor', 'edgecolor', 'horizontalalignment']}
short_forms = pd.DataFrame(df)
short_forms
Image by Author

Quick Reminder;

Don’t forget to code along…

Other Customization

Personally, I do not like to work with the default colors so I tweak to my satisfaction.

The same applies to the figure size and appearance.

Take note;

"It is necessary to label our x and y axes and also add a title to our graphs"

  • let’s have a look at how to achieve this.
plt.figure(figsize=(12, 8), tight_layout=True)
sns.set(font_scale=1.5)
sns.countplot(x='SOURCE', data=train)
plt.xlabel('Target')  # the categories of the target are on the x-axis
plt.ylabel('Count')   # the number of observations is on the y-axis
plt.title('Target Column Unique Values')
plt.show()
Image by Author
  • The tight_layout parameter comes into play when we have side-by-side subplots. I will talk more about that later in the article.

Are you still following me?

Working With Colours

Combining seaborn with Matplotlib makes color usage better and more visually appealing.

sns.color_palette() #calling the default seaborn colour palettes.
Image by Author

These color palettes behave like lists, so we can pick a specific color by its index and get back an RGB tuple.

e.g. (1.0, 0.6235294117647059, 0.6078431372549019)
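As a quick sketch of indexing into a palette: the snippet below picks one color from a named palette. The palette name ‘pastel’ and the index 3 are my assumptions for illustration, since the article does not say which call produced the RGB values above.

```python
import seaborn as sns

pastel = sns.color_palette("pastel")  # list-like object of RGB tuples
rgb = pastel[3]                       # pick one color by its index
print(rgb)
print(pastel.as_hex()[3])             # the same color as a hex string
```

The RGB tuple can be passed anywhere Matplotlib or Seaborn accepts a color, for example as color=rgb.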

Image by Author

Other variations of the default palette include:

  • ‘colorblind’
  • ‘deep’
  • ‘muted’

Seaborn also has some other interesting color palettes, such as:

  • ‘Paired’
  • ‘Set2’
  • ‘hls’

Refer to the documentation for more details:

https://seaborn.pydata.org/tutorial/color_palettes.html

Another interesting class of colors in Seaborn is the Perceptually uniform colors.

  • ‘rocket’ (suitable for heatmaps)
  • ‘mako’ (suitable for heatmaps)
  • ‘flare’
  • ‘crest’
sns.color_palette("rocket", as_cmap=True)

We really have a lot of colors to play with, so why stick to the conventional blue and orange?

We will definitely try out this customization as we proceed.

Let’s have a look at the problem statement.

Problem Statement

Since the focus of this article will be on data visualization, we won’t be doing any data cleaning or Exploratory Data Analysis.

But it would be great to have an idea of the problem statement…

You think so right?

Me too!!!

" As a part of the HealthIsWealth hospital automation system, you have been contracted as a professional data scientist who will build a system that would predict and estimate whether the patient should be categorized as an in care patient or an out care patient with the help of several data points about the patients, their conditions and lab tests "

The problem statement is self-explanatory and honestly, I had never thought about the process of categorizing patients into inpatient and outpatient care, but after working on this project, not only do I have an idea of the process but a clearer understanding of its importance.

In the latter part of this article, I will shed more light on why it is necessary to have domain knowledge of whatever project you are working on as a Data Analyst/Scientist.

Here is a list of the Plots covered in this Series.

  1. Line Plot
  2. Scatter Plot
  3. Count Plot
  4. Box Plot
  5. Categorical Plots
  6. Pair Plots
  7. Bar Plot
  8. Pie Chart
  9. Histogram
  10. Heat Map

We will also have a look at;

  • Importing Images
  • Creating Multiple Plots

Let’s go!!!

Line Plots

  • A line plot is one of the simplest and most widely used charts.
  • It shows the observations in a dataset in the form of data points or markers connected by straight lines.
sns.set_theme()  # setting sns parameters to the defaults

The code above resets the theme.

sns.set_style('ticks')   # set the style
sns.set_context ('paper')  # set the scale

For our first plot, we will create a hypothetical example of the estimated fare from Alausa to Obalende during the last quarter of a year and the beginning of another year.

time = [4.00, 5.00, 5.30, 6.00, 6.30, 7.15, 8.00]
days= range(1,8)
est_fare_Q1= [250,300,400, 550, 600, 500,450]
est_fare_Q4= [250,350,500, 600, 600, 550,500]

Plotting a Line Chart is as simple as running a few lines of code as shown below;

plt.figure (figsize= (10,8), tight_layout= True)
plt.plot(time)
Image by Author

This does not really give us much information, so let’s add more details.

plt.figure(figsize=(10, 8), tight_layout=True)
plt.plot(time, est_fare_Q1)
plt.rc('grid', c='b', ls='-', lw=0.5)  # gridstyle
plt.xlabel('Time (hour)')
plt.ylabel('Estimated Q1 Fare (naira)')
Image by Author
  • Now we have more information. At a glance, we can tell what values are on the x and the y-axes and we can see the relationship between them.
  • Let’s have a look at multiple plots on a line plot.
plt.figure(figsize=(10, 8), tight_layout=True)
plt.plot(time, est_fare_Q1)
plt.plot(time, est_fare_Q4)
plt.xlabel('Time (hour)')
plt.ylabel('Estimated Fare (naira)')
Image by Author
  • We can clearly see the two lines, but we can’t tell which one is for Q1 and which is for Q4.
  • To solve this, we add a legend. It enables us to distinguish between the two lines above.
plt.figure (figsize= (10,8), tight_layout= True)
plt.plot(time,est_fare_Q1, marker= 'x')
plt.plot(time,est_fare_Q4, marker= 'o')
plt.title ('Graph Of Estimated Fare Between The 1st and 4th Quarter Of The Year')
plt.legend(["est_fare_Q1", "est_fare_Q4"])
plt.xlabel('Time (hour)')
plt.ylabel('Estimated Fare (naira)')
Image by Author

In the above plot, I added some markers to further distinguish between Q1 and Q4.
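Passing a list to plt.legend() relies on the lines being plotted in the same order as the names in the list. A hedged alternative, sketched below with the same fare data, is to attach a label= to each line and let the legend pick the labels up. The object-oriented fig, ax style here is my choice and not something the plots above require.

```python
import matplotlib.pyplot as plt

time = [4.00, 5.00, 5.30, 6.00, 6.30, 7.15, 8.00]
est_fare_Q1 = [250, 300, 400, 550, 600, 500, 450]
est_fare_Q4 = [250, 350, 500, 600, 600, 550, 500]

fig, ax = plt.subplots(figsize=(10, 8))
ax.plot(time, est_fare_Q1, marker="x", label="est_fare_Q1")
ax.plot(time, est_fare_Q4, marker="o", label="est_fare_Q4")
legend = ax.legend()  # uses the label= given to each line

legend_labels = [text.get_text() for text in legend.get_texts()]
print(legend_labels)
```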

But that’s not all the details we can add.

Here are more options on how to style your markers.

  • color or c: Set the color of the line (supported colors)
  • linestyle or ls: Choose between a solid or dashed line
  • linewidth or lw: Set the width of a line
  • markersize or ms: Set the size of markers
  • markeredgecolor or mec: Set the edge color for markers
  • markeredgewidth or mew: Set the edge width for markers
  • markerfacecolor or mfc: Set the fill color for markers
  • alpha: Opacity of the plot

Let’s apply them to see how this looks on our plots!!!

plt.figure (figsize= (10,8), tight_layout= True)
plt.plot(time,est_fare_Q1, marker='x', c='b', ls='-', lw=2, ms=8, mew=2, mec='navy')
plt.plot(time,est_fare_Q4, marker='s', c='r', ls='--', lw=3, ms=10, alpha=.5)
plt.title ('Graph Of Estimated Fare Between The 1st and 4th Quarter Of The Year')
plt.legend(["est_fare_Q1", "est_fare_Q4"])
plt.xlabel('Time (hour)')
plt.ylabel('Estimated Fare (naira)')
Image by Author

Matplotlib also accepts a shorthand format string as the third argument to plt.plot():

fmt = ‘[marker][line][color]’

plt.figure (figsize= (10,8), tight_layout= True)
plt.plot(time,est_fare_Q1, 'o--r')
plt.plot(time,est_fare_Q4, 's--b')
plt.title ('Graph Of Estimated Fare Between The 1st and 4th Quarter Of The Year')
plt.legend(["est_fare_Q1", "est_fare_Q4"])
plt.xlabel('Time (hour)')
plt.ylabel('Estimated Fare (naira)')
Image by Author

Shorter and Faster!

Well, data visualization is not only about creating visuals, we are trying to pass across a message right?

A quick look at the insights from this plot;

The line plot above shows:

  • The estimated fare rises from 4:00 am in both quarters. It peaks at about 6:00 am during the fourth quarter (Q4) of the year, while it peaks later, at about 6:30 am, during the first quarter (Q1).
  • In both quarters, the fare has dropped by 8:00 am, to an estimated 500 naira in Q4 and 450 naira in Q1.

"As a Data Analyst, at this point, you might become curious to know why the fare prices are higher during Q4. Then you begin to search if there are any events, festivals, or any general price hike in the country, e.g. fuel price."

As simple as a line plot is, we were able to get some information from it.
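A reading of the chart can also be double-checked numerically. The sketch below reuses the hypothetical fare lists and reports where each series peaks and how far apart the two quarters are at each time point. The helper name peak is my own.

```python
time = [4.00, 5.00, 5.30, 6.00, 6.30, 7.15, 8.00]
est_fare_Q1 = [250, 300, 400, 550, 600, 500, 450]
est_fare_Q4 = [250, 350, 500, 600, 600, 550, 500]

def peak(times, fares):
    """Return (time, fare) at the first occurrence of the highest fare."""
    top = max(fares)
    return times[fares.index(top)], top

q1_peak = peak(time, est_fare_Q1)
q4_peak = peak(time, est_fare_Q4)
print("Q1 peak:", q1_peak)
print("Q4 peak:", q4_peak)

# Difference Q4 - Q1 at each time point
gap = [q4 - q1 for q1, q4 in zip(est_fare_Q1, est_fare_Q4)]
print(gap)
```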

Now we can proceed to our original data set, we will begin with a line graph for the ‘AGE’ column.

plt.figure (figsize= (10,8), tight_layout= True)
plt.plot(train['AGE'])
Image by Author

oops!!!

This does not tell us anything meaningful about the ‘SOURCE’ column.

Before we continue, it will be great to set the basic features we want for our project.

# Customizing Our Plots
sns.set_style('white')    # set the style
sns.set_context ('talk')     # set the scale
plt.rc('grid', c ='b', ls ='-', lw = 0.5)  # gridstyle
plt.rc('axes', titlesize=18)    # axes title fontsize
plt.rc('axes', labelsize=14)     # x and y labels fontsize
plt.rc('xtick', labelsize=12)    # fontsize of the tick labels
plt.rc('ytick', labelsize=12)     # fontsize of the tick labels
plt.rc('legend', fontsize=12)     # legend fontsize
plt.rc('font', size=12)           # modify text sizes

We then proceed to visualize the relationship between the ‘AGE’ and the ‘SOURCE’ columns.

Image by Author

The result above is neither appealing nor informative because there are too many combinations of the two properties within the dataset.

A line plot connects every point to the next one, so it is not suitable for data with too many combinations of values; no insights can be derived from it, just as we have seen in the example above.

A better option for variables with many combinations would be the Scatter Plot!!!

Scatter Plots

  • It shows the relationship between two numerical variables.
  • Unlike the Line Graph, Scatter Plots are informative even when the variables have too many combinations.

We can use a scatter plot to visualize the relationship between the ‘AGE’ and the ‘LEUCOCYTE’ columns.

plt.figure (figsize= (10,8), tight_layout= True)
sns.scatterplot(x=train['AGE'], y=train['LEUCOCYTE'], hue= train['SOURCE']);
Image by Author
  • Before we note our insights, let’s customize the scatter plot above.
plt.figure(figsize=(10, 8), tight_layout= True)
plt.title('RELATIONSHIP BETWEEN LEUCOCYTE AND AGE')
sns.scatterplot(x=train['AGE'], y=train['LEUCOCYTE'], hue= train['SOURCE'],s=70);
plt.show()
Image by Author

Adding hue to the plot above makes it more informative and we can note the following insights;

  • There are more out-care patients (represented as 1 and in orange) from age 40 and above.
  • There is a distinct cluster with some outliers.
  • The oldest patient (around 100) has a low leucocyte count and is categorized as an in-care patient.

Setting hue to the ‘SOURCE’ column makes it easier to see what is happening to the in-care and out-care patient categories.

# Another option is to try the lines of code below;
plt.figure(figsize=(12,10), tight_layout=True)
ax = sns.scatterplot(data= train, x='AGE', y='LEUCOCYTE', alpha= 0.6 , hue='SOURCE', palette='Set2', s=70)
ax.set(xlabel='AGE', ylabel='LEUCOCYTE')
plt.title('RELATIONSHIP BETWEEN LEUCOCYTE AND AGE')
plt.show()
Image by Author
  • Here, I tweaked the color palette (‘Set2’) and made the points slightly transparent (alpha=0.6).

A Scatter plot is my go-to plot for a target variable with numerous combinations.

Count Plot

  • Count Plot is an example of a categorical estimate plot.
  • It shows the counts of observations in each categorical bin using bars.
  • We can also set the hue to see the counts of the observations relative to the target variable.
plt.figure (figsize= (10,8), tight_layout= True)
plt.title ('SOURCE COLUMN WITH RESPECT TO THE SEX COLUMN')
sns.countplot (x= 'SEX', hue= 'SOURCE', data =train, palette= 'dark');
plt.show()
Image by Author

Notice the palette was set to ‘dark’. It is still the regular blue and orange, but it appears darker, more saturated, and prettier.

So all you need is a little tweak here and there, and voilà!!!

You have an interesting plot.

Okay back to our analysis…

From the chart above, we can note the following observations;

  • The number of female and male in-care patients(0) is very close. There are more male out-care patients(1) than female out-care patients.
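To read the exact numbers behind a count plot with hue, a cross-tabulation is one option. The sketch below uses a tiny made-up frame (toy, not the hackathon data) with the same column names.

```python
import pandas as pd

# Hypothetical rows standing in for the train dataset
toy = pd.DataFrame({
    "SEX": ["F", "F", "M", "M", "M", "F"],
    "SOURCE": [0, 1, 0, 1, 1, 0],
})

# Rows: SEX, columns: SOURCE, cells: number of observations
table = pd.crosstab(toy["SEX"], toy["SOURCE"])
print(table)
```

Each cell of the table corresponds to the height of one bar in the count plot.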

QUESTION 1.

Is there a major difference in the male lab test results that leads to a smaller share of male patients being placed in in-patient care?

# We can also have a horizontal arrangement for a count plot.
plt.figure (figsize= (12,10), tight_layout= True)
plt.title ('SOURCE COLUMN WITH RESPECT TO THE SEX COLUMN')
sns.countplot (y= 'SEX', hue= 'SOURCE', data =train, palette= 'Set2');
plt.show()
Image by Author

Another useful plot to consider using is the Box Plot.

Box Plot

A boxplot is used to compare the distribution of continuous variables. It clearly shows the outliers, the skewness, and the spread of the data, as well as summary statistics such as:

  • minimum value
  • first quartile (Q1)
  • median (Q2)
  • third quartile (Q3)
  • maximum value

The interquartile range can also be calculated from it.

It is an example of a categorical distribution plot and hence comes in handy when comparing multiple sets of values.

  • The median value is represented via a line inside the box.
  • The "whiskers" represent the minimum & maximum values (sometimes excluding outliers, which are represented as black diamonds).

SYNTAX:

When making a boxplot grouped by a categorical variable, we need two arguments: the name of the categorical variable (SOURCE) and the name of the numerical variable (LEUCOCYTE).

plt.figure(figsize=(10, 8), tight_layout=True)
plt.title('Box Plot')
sns.boxplot (x= 'SOURCE', y= 'LEUCOCYTE', data =train, palette= 'Set2', linewidth=2);
plt.show()
Image by Author
train.describe() #calling the .describe() function gives you the statistics of the numerical columns.
Image by Author

Checking For Outliers In A Boxplot (1.5 IQR Rule)

Given that IQR = Q3 - Q1, minimum = Q1 - 1.5 × IQR, and maximum = Q3 + 1.5 × IQR.

  • Data with values less than the minimum value (Q1 - 1.5 × IQR) or more than the maximum value (Q3 + 1.5 × IQR) is said to contain some outliers or some sort of inaccuracies.

For the LEUCOCYTE column;

Q3 = 10.4, Q1 = 5.7

  • Hence, the IQR = 10.4 - 5.7 = 4.7
  • minimum value = 5.7 - 1.5 × 4.7 = -1.35
  • maximum value = 10.4 + 1.5 × 4.7 = 17.45

Therefore, for the leucocyte column (considering the first rectangle, i.e. in-care patients represented as 0), most of the values fall close to the median value.

  • The out-care patients have a higher median value.
  • The estimated minimum value is -1.35 and the estimated maximum value is 17.45.
  • In other words, values in the dataset smaller than -1.35 or larger than 17.45 are treated as outliers.

Considering our dataset, the least leucocyte value for the in-care patients is 1.2. Since 1.2 > -1.35, we don’t have a lower outlier.

  • The highest value in the dataset is 76.7, which is higher than 17.45 (the maximum value).
  • Hence, the dataset contains upper outliers.
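The arithmetic of the 1.5 IQR rule can be sketched in a few lines of plain Python. The function name iqr_fences is my own, and the inputs are the quartiles and extreme values quoted above.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) bounds of the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

lower, upper = iqr_fences(q1=5.7, q3=10.4)
print(round(lower, 2), round(upper, 2))

# Smallest and largest leucocyte values mentioned in the text
values = [1.2, 76.7]
outliers = [v for v in values if v < lower or v > upper]
print(outliers)
```

With real data, the quartiles would come from train.describe() or a quantile call rather than being typed in by hand.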

A histogram plot of the distribution above will surely give us more insights.

QUESTION 2

Can you take a look at the values for the out-care patients (represented as 1, the orange box)?

Share your observations in the comment section.

Categorical Scatterplots

This category of plots gives a better idea of the number of data points present at a certain point on the graph and also gives insights into how closely or sparsely packed the dataset is.

sns.catplot(x=..., y=..., data=...)

There are two types of categorical scatterplots:

  • stripplot()
  • swarmplot()

When calling sns.catplot(), the strip plot is the default ‘kind’ setting. So to create a swarm plot, you need to set kind=‘swarm’.

Also note,

A swarm plot is generally preferred when dealing with a small dataset; otherwise, the resulting plot will appear too clustered and less helpful for our analysis.
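Here is a minimal sketch of switching between the two kinds. The frame toy_tips is a tiny made-up dataset of my own, used only so the snippet runs without downloading anything.

```python
import pandas as pd
import seaborn as sns

# Hypothetical data with one categorical and one numerical column
toy_tips = pd.DataFrame({
    "time": ["Lunch", "Lunch", "Dinner", "Dinner", "Dinner"],
    "tip": [1.5, 2.0, 3.0, 3.5, 2.5],
})

strip_grid = sns.catplot(x="time", y="tip", data=toy_tips)                # default: kind="strip"
swarm_grid = sns.catplot(x="time", y="tip", data=toy_tips, kind="swarm")  # swarm plot instead
```

Note that sns.catplot() creates its own figure, so a preceding plt.figure() call is not needed with it.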

Now, let’s work with our train dataset to see what we can infer from a categorical plot.

plt.figure(figsize=(12, 10))
sns.set_style('whitegrid')
sns.swarmplot(x='SEX', y='LEUCOCYTE', data=train, palette='husl', hue='SOURCE')
plt.xlabel("Sex")
plt.ylabel("Leucocyte")
plt.title("Leucocyte per sex")
plt.show()
Image by Author

As I had earlier stated, this plot is best suited for a smaller dataset. As dataset size increases, the categorical scatter plots become limited in the information they can provide about the distribution of values within each category.

To better understand the usage of this plot, we will use one of Seaborn’s inbuilt datasets.

df = sns.load_dataset("tips")
df.head(7)
Image by Author
sns.catplot(x= 'time', y= 'tip', data=df) # default setting i.e the stripplot.
Image by Author

We can also set the jitter parameter. It controls whether a small random offset (jitter) is applied so that overlapping points are spread out.

sns.catplot(x= 'time', y= 'tip',jitter= False, data=df)
Image by Author

Can you spot the difference in the plot, with and without setting the jitter?

Before explaining our insights, let’s make this plot more aesthetically pleasing.

Moreover, if you are sure you want a swarm plot, you can simply call sns.swarmplot() directly rather than setting the kind in sns.catplot() each time.

Okay, let’s have a look at this.

plt.figure (figsize= (10,8), tight_layout= True)
sns.swarmplot(x='time', y='tip', data=df,hue= 'sex')
plt.xlabel('Time')
plt.ylabel('Tip')
plt.title("Tip Per Meal Time")
plt.show()

Okay, I think it would be nice to plot a graph of the tip per day since the day column has more categories and we can get more insights.

sns.set_style('dark')
plt.figure(figsize=(12, 10))
sns.swarmplot(x='tip', y='day', data=df, hue='time', palette='husl', size=8)
plt.xlabel('Tip')  # tips are on the x-axis
plt.ylabel('Day')  # days are on the y-axis
plt.title("Tip Per Day")
plt.show()
Image by Author

With categorical scatter plots, we can have a better idea of the distribution of the tips across each day of the week, i.e. how closely or sparsely packed the dataset is.

A box plot for instance does not necessarily show us the spread at each point. So really the plot you use for your visualization depends on the questions, insights, and results you want to achieve.

If you have read to this point, big thumbs up to you!!!

Don’t stop reading…

Pair Plots

Certainly one of my favorite plots! Very informative and gives me hints on the columns I want to do a deeper analysis on.

Here are some quick points about Pair Plots

  • A great way to end this tutorial will be to talk about the pair plots. It’s a really simple and easy way to visualize the relationships between the variables.
  • It creates some sort of matrix that can hint at what columns to consider for feature selection or feature engineering.
  • Pair plots are somewhat similar to heatmaps in the sense that they both produce a matrix-like, correlation-style result. I personally like to use a pair plot or a heatmap early in my EDA process, since it can hint at the correlation between my variables, inform my hypotheses, and sometimes guide my EDA flow.
sns.set_style('ticks')
sns.pairplot(train, hue="SOURCE", diag_kind='kde', kind='scatter', palette='Set2')
plt.show()
Image by Author

The pair plot above shows the relationships and distribution of each continuous variable in our train dataset.

So even if you are not familiar with the project domain you are working on, at a glance, you can see the correlation between the variables.

This insight can inform your research, hypothesis, and perhaps the regression analysis to consider.

For instance, in the pair plot above, setting hue=‘SOURCE’ helps us to see each distribution relative to the ‘SOURCE’ column, which matters because it is our target variable.

So I will spend some extra time explaining some of the Insights from this plot.

  1. The diagonal simply shows the distribution of each variable on its own (a density curve here, because diag_kind='kde').
  2. The ‘SEX’ column is missing because it is not a numerical (continuous) variable.
  3. From my initial research, I had discovered that leucocytes are the white blood cells and they play a major role in defending the body against germs and diseases.

Now, considering the ‘AGE’ and ‘LEUCOCYTE’ columns on the pair plots;

  • I can immediately spot the outliers (confirmed using the box plot)
  • At the bottom of the plot, where the age is lower (20 downwards), I notice there are more green marks (in-care patients, mint green) with relatively few orange marks. (Category A)
  • Also, the ‘LEUCOCYTE’ of the patients within this category is less than 25
  • On the other hand, the older age seems to have both the orange and the green marks with the orange (out-care patients) being more dominant especially within a certain age range. (CATEGORY B)
  • The value of the ‘LEUCOCYTE’ also seems to have increased with a higher ‘AGE’.

I could also begin to ask questions such as ;

  • Since the majority of the patients in CATEGORY A are below 20 and have ‘LEUCOCYTES’ below 25, does it mean that the lower the ‘LEUCOCYTES’, the higher the probability of a patient being categorized as an in-care patient?
  • Why do the younger patients have lower ‘LEUCOCYTES’ compared to the older ones?
  • At what age range do we observe an increase in ‘LEUCOCYTES’ according to our plots?
  • All these questions can keep coming and then inform what I need to research on (in line with the domain).
  • I can therefore decide to isolate these two variables by plotting a scatter plot or a box plot to understand the distribution clearly.
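One way to follow up on the age-range question numerically is to bin the ages and summarize the leucocyte values per bin. The sketch below uses a small made-up frame and arbitrary bin edges; both are assumptions for illustration, not values from the hackathon data.

```python
import pandas as pd

# Hypothetical rows standing in for the train dataset
toy = pd.DataFrame({
    "AGE": [5, 15, 25, 45, 65, 85],
    "LEUCOCYTE": [6.0, 7.0, 8.0, 9.0, 10.0, 12.0],
})

# Bin ages into bands, then take the median leucocyte value per band
toy["AGE_BAND"] = pd.cut(toy["AGE"], bins=[0, 20, 40, 60, 100])
median_by_band = toy.groupby("AGE_BAND", observed=True)["LEUCOCYTE"].median()
print(median_by_band)
```

A clear jump between two neighbouring bands would be a hint about where to look more closely with a scatter plot or box plot.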

Take some time to study these pair plots and let me know some of the inferences you can make.

psttt…Never underestimate the power of domain research. A quick google search could do the trick.

You get my point, right?

You can also try the next code, set the hue = ‘SEX’, and note the observations.

#sns.pairplot(train, hue="SEX", diag_kind='kde', kind='scatter', palette='Set2')
#plt.show()

Creating Multiple Plots

I find this idea very cool and interesting: creating multiple charts in a single grid by using plt.subplots().

Here’s a quick walk-through of a single grid showing the different types of charts we’ve covered in this tutorial.

Take note;

  • Creating subplots requires using set_title, set_xlabel, and set_ylabel on each axes, as against plt.title, plt.xlabel, and plt.ylabel.
  • Seaborn functions such as sns.scatterplot take an extra argument, e.g. ax=axes[0,1].

fig, axes = plt.subplots(2, 3, figsize=(22, 16))  # number of rows, columns and figure dimensions
# First Plot: Line Graph
axes[0, 0].plot(time, est_fare_Q1, 'o--r')
axes[0, 0].plot(time, est_fare_Q4, 's--b')
axes[0, 0].set_title('Estimated Fare For The 1st and 4th Quarter Of The Year')
axes[0, 0].legend(["est_fare_Q1", "est_fare_Q4"])
axes[0, 0].set_xlabel('Time (hour)')
axes[0, 0].set_ylabel('Estimated Fare (naira)')
# Second Plot: Scatter Plot
sns.scatterplot(x=train['AGE'], y=train['LEUCOCYTE'], hue=train['SOURCE'], s=55, palette='Set2', ax=axes[0, 1])
axes[0, 1].set_title('LEUCOCYTE vs AGE')
# Third Plot: Count Plot
sns.countplot(x='SEX', hue='SOURCE', data=train, palette='dark', ax=axes[0, 2])
axes[0, 2].set_title('SOURCE COLUMN WITH RESPECT TO THE SEX COLUMN')
# Fourth Plot: Box Plot
sns.boxplot(x='SOURCE', y='LEUCOCYTE', data=train, palette='Set2', linewidth=2, ax=axes[1, 0])
axes[1, 0].set_title('Box Plot')
# Fifth Plot: Count Plot (horizontal arrangement)
sns.countplot(y='SEX', hue='SOURCE', data=train, palette='colorblind', ax=axes[1, 1])
axes[1, 1].set_title('SOURCE COLUMN WITH RESPECT TO THE SEX COLUMN')
# Sixth Plot: Categorical Plot (tip on the x-axis, day on the y-axis)
sns.swarmplot(x='tip', y='day', data=df, hue='time', palette='husl', size=7, ax=axes[1, 2])
axes[1, 2].set_xlabel('Tip')
axes[1, 2].set_ylabel('Day')
axes[1, 2].set_title("Tip Per Day")
# Control overlapping of the figures; called last so it accounts for every title and label
plt.tight_layout(pad=4)
Image by Author

Key Things To Note While Creating Multiple Graphs On A Single Grid

  1. Fig Size: The figsize you set should comfortably accommodate your graphs, so that the grid does not look cluttered or become difficult to read and, more importantly, so that the individual graphs do not overlap.
  2. Subplot Dimension: Your subplot dimension is key to achieving a readable set of charts. In the above example, I initially had the following settings:

fig, axes = plt.subplots(3, 3, figsize=(16,8))

I realized some of the titles were overlapping, so I adjusted those titles and changed my settings to what I eventually used in the graph above:

fig, axes = plt.subplots(2, 3, figsize=(20,14))

  3. Tight_layout (padding): The padding is as important as the other two points discussed above.

plt.tight_layout(pad=2)  # my initial padding

plt.tight_layout(pad=4)  # final padding

  • One of the major goals of creating multiple plots is to visualize the distribution of specific variables on different plots at a glance. Hence, you should ensure this purpose is not defeated.
  • Make sure your individual plots are readable and the overall plot is informative and pleasing to the eyes.

You can try more figsize, row, column, and padding settings for this project and see which combination looks better.
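
One way to compare settings quickly is to wrap the grid in a small helper and call it with different values. The sketch below is only a rough aid: draw_grid is a hypothetical helper name, the plots are placeholder sine curves instead of the project data, and pairing each grid size with a padding value is my own assumption for the demo.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so this runs anywhere; remove in a notebook
import matplotlib.pyplot as plt
import numpy as np


def draw_grid(rows, cols, figsize, pad):
    """Draw a rows x cols grid of placeholder line plots with the given layout settings."""
    # squeeze=False keeps `axes` two-dimensional even for a single row or column
    fig, axes = plt.subplots(rows, cols, figsize=figsize, squeeze=False)
    x = np.linspace(0, 10, 50)
    for i, ax in enumerate(axes.flat, start=1):
        ax.plot(x, np.sin(x + i))
        ax.set_title(f"Placeholder plot {i}")
        ax.set_xlabel("x")
        ax.set_ylabel("y")
    # Apply the padding after every title and label has been set
    fig.tight_layout(pad=pad)
    return fig


# (rows, cols, figsize, pad) combinations to compare
candidates = [
    (3, 3, (16, 8), 2),
    (2, 3, (20, 14), 4),
]
figures = [draw_grid(*settings) for settings in candidates]
```

Display each figure with `plt.show()` or save it with `fig.savefig(...)`, then keep the combination whose titles and labels stay clear of one another.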

SUMMARY

Importance Of Having A Domain Knowledge Prior To A Data Science Project

Prior to commencing a data science project, it is good practice to have at least a basic knowledge of the domain. A quick Google search and/or a conversation with someone who works in that domain should do the trick.

Since the health field isn’t my primary domain, I simply did a Google search to get some knowledge on inpatient and outpatient care.

What’s the difference between inpatient and outpatient care?

Inpatient care requires a patient to stay in a hospital overnight, while an outpatient is simply monitored for a few hours and does not have to stay overnight.

As mentioned in the problem statement, whether a patient is categorized as an in-care patient or an out-care patient depends on the information about the patient, their conditions, and their lab test results.

You will agree with me that this process is very important and should not be mismanaged.

Right???

It would be a disaster if, for instance, a pregnant woman went for her regular check-ups, yet the doctor never referred to her previous case files to understand her medical history. The doctor might then simply tell her that everything is fine and she is free to go home…

Photo by MART PRODUCTION from Pexels

Or perhaps a patient’s lab results are correct, but the model/system used to categorize the patient (in-care or out-care) cannot accurately predict which category is appropriate.

I definitely do not wish to build such a model, and I am sure you are in sync with me, right?

So this got me thinking about the lab results in the dataset and the possible questions, observations, and insights I might get through the data visualization.

The subject of data visualization combines both art and data science. While data visualization can be creative and pleasing to look at, it should also be functional in its visual communication of the data. It should clearly tell a story and clearly communicate your insights to the readers.

I suggest you try some of these plots yourself. Check HERE for the dataset used and the complete code on my GitHub.

CONGRATS ON MAKING IT TO THE END!!!

CONCLUSION

  • It is best practice to have some knowledge of the domain of the project you are working on.
  • Remember that the aim of your data visualization is to communicate to your readers, so ensure it is both informative and aesthetically pleasing.
  • The type of dataset, task, questions, and insights will guide the types of plots you use to communicate to your readers/stakeholders.

Hope you enjoyed reading this article as much as I enjoyed writing it.

Don’t forget to hit the clap icon and drop your comments below!

I will be happy to connect with you on LinkedIn.

CHEERS!

