
7 Points to Create Better Scatter Plots with Seaborn

How to make the most of a scatter plot

Photo by Tom Thain on Unsplash

Data visualization is of crucial importance in Data Science. It helps us explore the underlying structure within a dataset as well as the relationships between variables. We can also use data visualization techniques to report our findings more effectively.

How we deliver a message through data visualization is also important. We can make plots more informative or appealing with small adjustments. Data visualization libraries provide several parameters to customize the generated plots.

In this article, we will go over 7 points to customize a scatter plot in the Seaborn library. Scatter plots are mainly used to visualize the relationship between two continuous variables. They provide an overview of the correlation between the variables.

We will be creating several scatter plots using the Melbourne housing dataset available on Kaggle. Let’s first import the libraries.

import numpy as np
import pandas as pd
import seaborn as sns
sns.set(style='darkgrid')

The next step is to read the dataset into a Pandas dataframe.

cols = ['Price','Distance','Rooms','Type','Landsize','Regionname']
melb = pd.read_csv("/content/melb_data.csv", 
                   usecols=cols).sample(n=300)
melb = melb[melb.Landsize < 3000]
(image by author)

I have selected only a small sample from the dataset and included only 6 columns. The last line filters out the rows that can be considered outliers with regard to land size.
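Because sample is called without a seed, each run selects different rows, so the plots will differ from run to run. Here is a minimal sketch of a reproducible version. It assumes a synthetic stand-in dataframe (demo, with made-up values) instead of the Kaggle file, and the seed value 42 is arbitrary.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Melbourne housing data (values are synthetic).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Price": rng.integers(300_000, 2_000_000, size=1000),
    "Landsize": rng.integers(0, 5000, size=1000),
})

# Passing random_state makes the sample identical on every run.
sample_a = demo.sample(n=300, random_state=42)
sample_b = demo.sample(n=300, random_state=42)

# Same outlier filter as above: keep rows with a land size below 3000.
filtered = sample_a[sample_a.Landsize < 3000]
print(len(sample_a), len(filtered))
```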

We will be using the relplot function of Seaborn. It is a figure-level interface for drawing two kinds of relational plots: scatter and line. The type of plot is selected using the kind parameter.

Let’s create a simple scatter plot of the price and distance columns with the default settings. We can then go over some tips to make the scatter plots more informative and appealing.

sns.relplot(data=melb, x='Price', y='Distance', kind='scatter')
(image by author)

The distance column indicates the distance to the central business district (CBD). Each dot represents an observation (i.e. a house). We observe a negative correlation between the price and distance to CBD.


1. Adjusting the size

The size of a visualization is an important feature that should be easy to customize. The height and aspect parameters of the relplot function are used to change the size of a visualization. The height parameter sets the height of each facet in inches, and the aspect parameter is the ratio of width to height.

sns.relplot(data=melb, x='Price', y='Distance', kind='scatter',
            height=6, aspect=1.2)
(image by author)
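To see how height and aspect translate into the figure size, here is a small sketch. It uses a synthetic stand-in dataframe (demo, with made-up values) rather than the real data, and the Agg backend line is only there so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the melb dataframe (values are made up).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Price": rng.uniform(3e5, 2e6, 200),
    "Distance": rng.uniform(1, 40, 200),
})

g = sns.relplot(data=demo, x="Price", y="Distance", kind="scatter",
                height=6, aspect=1.2)

# Facet width = height * aspect, so this single-facet figure
# should be about 7.2 inches wide and 6 inches tall.
width, height = g.figure.get_size_inches()
print(width, height)
```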

2. Separating categories with hue

It can be more informative to represent different categories in a column separately. For instance, we can distinguish house types by using a different color for each type. Such tasks can be done with the hue parameter.

sns.relplot(data=melb, x='Price', y='Distance', kind='scatter',
            height=6, aspect=1.2, hue='Type')
(image by author)

We clearly see that the houses of type u (unit) are closer to the CBD and cheaper in general. The t (townhouse) type is roughly in the middle. As we move away from the CBD, the houses get more expensive and larger.


3. Separating categories with row or col

We can also use multiple subplots to separate different categories. The col parameter is used to represent each category as a new column. Similarly, the row parameter does the same using rows.

sns.relplot(data=melb, x='Price', y='Distance', kind='scatter',
            height=4, aspect=1.5, col='Type')
(image by author)
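The row parameter works the same way but stacks the subplots vertically. A minimal sketch, again on a synthetic stand-in dataframe (demo is hypothetical, with made-up values):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the melb dataframe (values are made up).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Price": rng.uniform(3e5, 2e6, 200),
    "Distance": rng.uniform(1, 40, 200),
    "Type": rng.choice(["h", "t", "u"], 200),
})

# One subplot per house type, arranged as rows instead of columns.
g = sns.relplot(data=demo, x="Price", y="Distance", kind="scatter",
                height=3, aspect=2, row="Type")

# The grid of axes has one row per category and a single column.
n_rows, n_cols = g.axes.shape
print(n_rows, n_cols)
```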

4. Size of dots

Scatter plots represent data points (i.e. rows) with dots. We can use the size of the dots to deliver information as well. For instance, if the Rooms column is passed to the size parameter, the size of a dot grows with the number of rooms in a house.

sns.relplot(data=melb, x='Price', y='Distance', kind='scatter',
            height=6, aspect=1.2, size='Rooms')
(image by author)

The general trend is to have larger houses as we move away from the CBD. It makes sense because space becomes more of a concern in the city centre.
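If the default dot sizes are too similar to tell apart, the sizes parameter can set the smallest and largest marker size. The sketch below uses a synthetic stand-in dataframe (demo, with made-up values), and the range (20, 200) is just an illustrative choice.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the melb dataframe (values are made up).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Price": rng.uniform(3e5, 2e6, 200),
    "Distance": rng.uniform(1, 40, 200),
    "Rooms": rng.integers(1, 6, 200),
})

# sizes=(min, max) sets the marker size range that Rooms is mapped onto.
g = sns.relplot(data=demo, x="Price", y="Distance", kind="scatter",
                height=6, aspect=1.2, size="Rooms", sizes=(20, 200))
```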


5. Color

Colors are essential pieces of visualizations. In a scatter plot, we have two options to change the color of the dots. If the hue parameter is used, we pass a palette to change the colors.

sns.relplot(data=melb, x='Price', y='Distance', kind='scatter',
            height=6, aspect=1.2, hue='Type', palette='cool')
(image by author)
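The palette parameter also accepts a dictionary that maps each category to a color, which is handy when a category should keep the same color across several plots. A sketch on a synthetic stand-in dataframe (demo is hypothetical, and the color choices are arbitrary):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the melb dataframe (values are made up).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Price": rng.uniform(3e5, 2e6, 200),
    "Distance": rng.uniform(1, 40, 200),
    "Type": rng.choice(["h", "t", "u"], 200),
})

# Explicit category-to-color mapping (colors chosen only for illustration).
type_colors = {"h": "tab:blue", "t": "tab:orange", "u": "tab:green"}

g = sns.relplot(data=demo, x="Price", y="Distance", kind="scatter",
                height=6, aspect=1.2, hue="Type", palette=type_colors)
```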

Without the hue parameter, we can simply use the color parameter to choose the desired color for the dots.

sns.relplot(data=melb, x='Price', y='Distance', kind='scatter',
            height=6, aspect=1.2, color='orange')
(image by author)

6. Pairwise relationships

The pairplot function of Seaborn can be used to generate a grid of scatter plots to explore the pairwise relationships between variables. By default, it includes all the numerical variables. However, we can change it by selecting only the columns of interest.

sns.pairplot(melb[['Price','Distance','Landsize']], height=3)
(image by author)

It is important to note that the height parameter of the pairplot function adjusts the size of the subplots, not the entire grid.

Each subplot except for the ones on the diagonal represents the relationship between the columns indicated on the x-axis and y-axis. By default, the subplots on the diagonal show the histogram of columns. We can change the type of plot drawn on the diagonal by using the diag_kind parameter.

Histograms are mainly used to check the distribution of a continuous variable. A histogram divides the value range into discrete bins and shows the number of data points (i.e. rows) in each bin.
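The binning behind a histogram can be sketched with NumPy. The prices here are synthetic and the choice of 10 bins is arbitrary.

```python
import numpy as np

# Synthetic prices standing in for the Price column (values are made up).
rng = np.random.default_rng(0)
prices = rng.uniform(3e5, 2e6, 300)

# Split the value range into 10 equal-width bins and count the rows in each.
counts, edges = np.histogram(prices, bins=10)

# 10 bins are delimited by 11 edges, and every row falls into exactly one bin.
print(len(edges), counts.sum())
```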


7. Customizing the pairplot

The pairplot function can also be customized to carry more information. For instance, we can add a hue variable just like we have done with the relplot function.

sns.pairplot(melb[['Price','Distance','Landsize','Type']], hue='Type', height=3)
(image by author)

When the hue parameter is used, the plots on the diagonal automatically become kernel density estimate (KDE) plots.


Conclusion

We have covered 7 tips for making scatter plots with Seaborn more informative and appealing. There are other techniques to further customize these visualizations, but the 7 tips in this article will be enough in most cases.

Data visualizations are highly important in data science. They are not only helpful for reporting and delivering results but also a powerful tool for data analysis.

To make the most of data visualization, we sometimes need to go beyond the default settings of a function or library. Therefore, we should learn how to customize or adjust them.

Thank you for reading. Please let me know if you have any feedback.

