
Practical Data Visualization Guide: Seaborn vs Ggplot2

Hands-on tutorial with examples

Photo by Jørgen Håland on Unsplash

Data visualization is a substantial part of data science. It helps us understand the data better by unveiling the relationships among variables. Well-designed visualizations can also reveal the underlying structure within a dataset.

In this article, we will compare two popular data visualization libraries: Seaborn for Python and ggplot2 for R.

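We will use the famous Titanic dataset to create the visualizations. You can download the "train.csv" file from Kaggle to follow along. The code below reads the data from a file named "titanic.csv", so rename the download or adjust the path accordingly.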

The first step is to import the libraries and create a data frame. We will use the Pandas library in Python and the data.table package in R to handle data manipulation.

import numpy as np
import pandas as pd
import seaborn as sns
sns.set(style='darkgrid')
# Read the data and drop identifier-like columns that are not used in the plots
titanic = pd.read_csv("/content/titanic.csv")
titanic.drop(['PassengerId', 'Name', 'Ticket'], 
             axis=1, inplace=True)
# Show the first five rows
titanic.head()
(image by author)
> library(ggplot2)
> library(data.table)
> titanic <- fread("/home/soner/Downloads/datasets/titanic.csv")
> titanic[, c("PassengerId", "Name", "Ticket"):=NULL]
> head(titanic)
(image by author)

We now have the dataset saved in a proper data structure. Let’s start by creating a scatter plot.

A scatter plot is a relational plot that is commonly used to visualize the values of two numerical variables. It lets us see whether the two variables are correlated.

Seaborn:

sns.relplot(data=titanic, x="Age", y="Fare", hue="Survived",
            kind='scatter', aspect=1.4)
Seaborn scatter plot (image by author)

The relplot function of Seaborn creates different kinds of relational plots, such as scatter plots and line plots. The type of plot is specified with the kind parameter. We pass the columns to be plotted on the x axis and y axis to the x and y parameters, respectively. The hue parameter separates the data points based on the categories in the given column by using a different color for each category. Finally, the aspect parameter sets the ratio of width to height for the figure.

Ggplot2:

> ggplot(data = titanic) + 
+     geom_point(mapping = aes(x = Age, y = Fare, color = Survived))
Ggplot2 scatter plot (image by author)

The first step is the ggplot function, which creates an empty graph. The data is passed to the ggplot function. The second step adds a new layer to the graph based on the given mappings and plot type. The geom_point function creates a scatter plot. The columns to be plotted are specified in the aes function. The color aesthetic plays the same role as the hue parameter in Seaborn. Note that when Survived is stored as a number, ggplot2 maps it to a continuous color scale; mapping factor(Survived) instead gives one discrete color per category.

We do not observe a distinct relationship between age and fare, which is expected.

We use color (hue in Seaborn) to separate the data points based on the Survived column. It seems that passengers who paid more had a higher chance of surviving.
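These visual impressions can be cross-checked with a couple of numbers. The sketch below is only an illustration: the toy data frame holds made-up values standing in for the real Titanic data, and the helper name summarize is hypothetical. It computes the Pearson correlation between Age and Fare and the median Fare for each value of Survived.

```python
import pandas as pd

# Toy stand-in for the Titanic frame; the values are made up for illustration.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, None, 35.0, 54.0, 2.0],
    "Fare": [7.25, 71.28, 8.46, 53.10, 51.86, 21.08],
    "Survived": [0, 1, 0, 1, 0, 1],
})

def summarize(df):
    """Return (Age-Fare Pearson correlation, median Fare per Survived value)."""
    # Series.corr only uses rows where both Age and Fare are present.
    age_fare_corr = df["Age"].corr(df["Fare"])
    # Median fare within each survival group.
    median_fare = df.groupby("Survived")["Fare"].median()
    return age_fare_corr, median_fare

corr, median_fare = summarize(toy)
print(round(corr, 3))
print(median_fare)
```

Calling summarize(titanic) on the real data frame would give the actual figures. A correlation close to zero would agree with the scatter plot, and a higher median fare among survivors would agree with the color pattern.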


We can create a histogram to check the distribution of a numerical variable. A histogram divides the value range into discrete bins and uses bars to show the number of data points (or values) in each bin.

Let’s also show the passengers who survived and those who did not in separate plots.
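To make the binning idea concrete, here is a minimal sketch in plain Python. It is not how Seaborn or ggplot2 implement histograms internally; the function name histogram_counts and the sample ages are made up for illustration, and the bins are assumed to be of equal width.

```python
def histogram_counts(values, bins):
    """Count how many values fall into each of `bins` equal-width bins."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("need at least two distinct values to form bins")
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value is placed in the last bin rather than past the end.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    # bins + 1 edges delimit the bins.
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, counts

ages = [2, 9, 18, 22, 24, 30, 35, 41, 58, 80]  # made-up ages
edges, counts = histogram_counts(ages, bins=4)
print(edges)
print(counts)
```

Each entry in counts corresponds to the height of one bar, and consecutive pairs in edges mark where that bar starts and ends on the x axis.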

Seaborn:

sns.displot(data=titanic, x="Age", col="Survived", kind="hist")
Histogram of Age (image by author)

The col parameter creates one subplot for each category in the given column and places the subplots side by side as columns. If we use the row parameter instead, the subplots are stacked as rows.

Ggplot2:

> t <- ggplot(titanic, aes(Age)) + 
+     geom_histogram(bins=10, fill='lightblue')
> t + facet_grid(cols=vars(Survived))
Histogram of Age (image by author)

In ggplot2, we can use the facet_grid function to create a grid of subplots based on the categories in the given columns. It is similar to the FacetGrid object in Seaborn.


For the last example, we will create a larger grid of plots using both the row and col parameters. In the scatter plots above, we saw that there are a couple of outliers in the Fare column. We will first filter out these observations and then generate the plots.

Seaborn:

titanic = titanic[titanic.Fare < 300]
sns.relplot(data=titanic, x="Age", y="Fare", kind="scatter",
            hue="Survived", row="Sex", col="Pclass",
            height=4)
(image by author)

We clearly see that passengers in class 1 were more likely to survive than the others. Another finding is that female passengers had a higher chance of surviving than male passengers.
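A pivot table is one way to put numbers behind what the grid of plots suggests. The sketch below uses a toy data frame with made-up values, so its output says nothing about the real passengers; running the same pivot_table call on the titanic data frame would give the actual survival rates.

```python
import pandas as pd

# Toy stand-in for the Titanic frame; the values are made up for illustration.
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3, 1, 2],
    "Sex": ["female", "male", "female", "male",
            "female", "male", "male", "male"],
    "Survived": [1, 0, 1, 0, 1, 0, 1, 0],
})

# The mean of a 0/1 column is the share of survivors in each group.
rates = toy.pivot_table(index="Sex", columns="Pclass",
                        values="Survived", aggfunc="mean")
print(rates)
```

The rows and columns of the resulting table line up with the rows and columns of the plot grid, which makes the two easy to compare.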

Ggplot2:

The same grid of plots can be created with ggplot2 as follows:

> titanic <- titanic[Fare < 300]
> t <- ggplot(titanic, aes(x=Age, y=Fare, color=Survived)) + geom_point()
> t + facet_grid(cols=vars(Pclass), rows=vars(Sex))
(image by author)

Although the syntax is different, the approach is similar. In both libraries, we add dimensions to a plot through color (hue in Seaborn) and through row and column facets.


Conclusion

Both Seaborn and ggplot2 are powerful and versatile data visualization libraries. I think both are more than enough to perform typical data visualization tasks.

Which one to use comes down to your choice of programming language. Since both Python and R are predominant in the data science ecosystem, either one will do the job for you.

Thank you for reading. Please let me know if you have any feedback.

