
How To Do Exploratory Data Analysis in Python With One Line of Code

Save hours on your next project with Pandas Profiling


Let’s imagine that you have to do an exploratory data analysis. Where do you start? You load the dataset, print the data, and check the number of observations, data types, missing values, duplicated values, and summary statistics. Then, you probably need to create a few data visualizations to better understand what you have. All this takes multiple lines of code and maybe hours. What if I told you that you could save your precious time, skip all these steps, and focus only on analyzing the data? Magic? No, that’s Pandas Profiling.
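For comparison, the manual steps listed above might look something like this in plain pandas. This is only a minimal sketch on a tiny made-up DataFrame; the column names and values are illustrative assumptions, not taken from a real dataset.

```python
import pandas as pd

# Tiny made-up dataset, for illustration only
df = pd.DataFrame({
    "age": [22.0, 38.0, None, 35.0, 22.0],
    "fare": [7.25, 71.28, 7.92, 53.10, 7.25],
    "sex": ["male", "female", "female", "female", "male"],
})

n_rows, n_cols = df.shape              # number of observations and variables
dtypes = df.dtypes                     # data type of each column
missing = df.isna().sum()              # missing values per column
n_duplicates = df.duplicated().sum()   # rows that repeat an earlier row
summary = df.describe()                # summary statistics for numeric columns

print(n_rows, n_cols)
print(dtypes)
print(missing)
print(n_duplicates)
print(summary)
```

Even on a toy dataset, that is five separate calls before any plotting starts.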

There are quite a few blog posts about Pandas Profiling on Medium. However, this is my take on this great library. Surprisingly, most data scientists I know either don’t know about Pandas Profiling or have heard of it but never tried it.

Looking at the download trends for Pandas Profiling and Pandas, we can see that Pandas is dozens of times more popular than Pandas Profiling. This is, of course, expected. However, we can also see that Pandas Profiling is not as popular as it should be. Well, if you don’t know about it, let’s explore it together, and if you already know about it, let me show you why you should be using it in every project you work on.

Pandas Profiling

In short, Pandas Profiling is a low-code library that allows us to perform a thorough exploratory data analysis. With one line of code, you save yourself all the steps that I mentioned at the beginning of this article and get a beautiful, interactive HTML report that you can view in a notebook or share with anyone. It’s like turning your Jupyter Notebook into a Tableau dashboard. Let’s get started!

How To Use Pandas Profiling

First, you need to install Pandas Profiling. You can install it using pip by typing the following lines of code:

pip install -U pandas-profiling[notebook]
jupyter nbextension enable --py widgetsnbextension

If you prefer using conda, you just need to type the following code:

conda create -n pandas-profiling
conda activate pandas-profiling
conda install -c conda-forge pandas-profiling

Now that you have it installed, it’s time to have fun. For this demonstration, I will use the famous Titanic dataset, and you can find it here.

# Import pandas and Pandas Profiling
import pandas as pd
from pandas_profiling import ProfileReport

# Load the Titanic dataset (the file name is a placeholder; point it to wherever you saved the CSV)
df = pd.read_csv('titanic.csv')

That’s all the setup you need to start using Pandas Profiling. Pretty easy, right? Now, let’s create the report. I will type the following line of code, and we will be good to go.

profile = ProfileReport(df, title='Titanic Report')
profile

There we go. The report took 5.67 seconds to generate on my computer, and it only needed one line of code. I will break it down into parts so that we can better analyze what we are getting.

Overview

First, we get an overview of the dataset showing the number of variables, missing values, and duplicate rows, as well as how many of the variables are numeric and how many are categorical.

Then, we have a tab showing warnings about the dataset. These warnings will be of great help if your final project is related to machine learning. The tab flags variables that have high cardinality, high correlation, a high percentage of missing values, a high percentage of zeros, and more. In my opinion, these two features alone would already make Pandas Profiling worth it.
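To make these warnings more concrete, here is a rough sketch of how a few of them could be checked by hand in pandas. The DataFrame is made up, and the 0.5 threshold is an arbitrary value chosen for this illustration; it is not meant to reproduce the library's own rules or defaults.

```python
import pandas as pd

# Made-up data, for illustration only
df = pd.DataFrame({
    "ticket": ["A1", "B2", "C3", "D4", "E5", "F6"],   # every value is distinct
    "cabin": [None, None, None, "C85", None, None],  # mostly missing
    "parch": [0, 0, 0, 1, 0, 0],                     # mostly zeros
})

# Arbitrary threshold for this sketch, NOT the library's default
THRESHOLD = 0.5

# Share of distinct values in each non-numeric column
distinct_ratio = df.select_dtypes(exclude="number").nunique() / len(df)
# Share of missing values in each column
missing_ratio = df.isna().mean()
# Share of zeros in each numeric column
zero_ratio = df.select_dtypes(include="number").eq(0).mean()

high_cardinality = distinct_ratio[distinct_ratio > THRESHOLD].index.tolist()
mostly_missing = missing_ratio[missing_ratio > THRESHOLD].index.tolist()
mostly_zeros = zero_ratio[zero_ratio > THRESHOLD].index.tolist()

print("High cardinality:", high_cardinality)
print("Mostly missing:", mostly_missing)
print("Mostly zeros:", mostly_zeros)
```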

Variables

Now, we can explore the variables. Pandas Profiling will show us a breakdown of each variable, including statistical information such as the minimum, the maximum, the number of infinite values, and the percentage of values that are distinct.

If we click on Toggle details, we will see more complex information, including range, coefficient of variation, skewness, standard deviation, percentiles, etc.
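If you want to see what some of these statistics are, the sketch below computes a few of them by hand for a single numeric column. The values are made up, and the calculations use the pandas defaults (for example, the sample standard deviation), which I am assuming for illustration rather than claiming they match the report exactly.

```python
import pandas as pd

# Made-up ages, for illustration only
age = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0], name="age")

value_range = age.max() - age.min()   # range
std = age.std()                       # sample standard deviation (pandas default, ddof=1)
cv = std / age.mean()                 # coefficient of variation
skewness = age.skew()                 # positive when the right tail is longer
percentiles = age.quantile([0.05, 0.25, 0.5, 0.75, 0.95])

print(value_range, std, cv, skewness)
print(percentiles)
```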

Are you using 5+ lines of code to get a decent graph to understand a variable better? Pandas Profiling will help you with that. Clicking on Histogram, you will be able to visualize the variable’s distribution. Keep in mind that all of this was created in about 5 seconds!

Instead of typing df['age'].value_counts(normalize=True), you can see the frequency of each value in a variable with one click. Oh, and it looks way better.

Finally, we can easily see extreme values (or outliers) in the dataset. For example, it seems like there were people who were 0.92 years old. This could mean that they were babies or that the dataset is wrong. Either way, you wouldn’t find this easily if you were doing it manually.
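For reference, a manual version of this check could look like the sketch below, which lists the smallest and largest values of a column. Apart from the 0.92 mentioned above, the ages are made up for illustration.

```python
import pandas as pd

# Ages for illustration; only 0.92 comes from the example above
age = pd.Series([0.92, 22.0, 38.0, 26.0, 35.0, 80.0, 54.0, 2.0], name="age")

lowest = age.nsmallest(3)    # three smallest values, in ascending order
highest = age.nlargest(3)    # three largest values, in descending order

print(lowest)
print(highest)
```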

Interactions

Another very cool feature that Pandas Profiling offers is seeing how the variables interact with each other. However, the Titanic dataset is not the best option to analyze this feature. For this reason, I will use the World Happiness Report 2021.

To do so, all you need to do is choose the two features that you want to compare, and Pandas Profiling will create instantaneous scatter plots. How cool is that?

Correlations

Since we can see how each feature interacts with the others, it makes sense to imagine that we can also see a correlation table. Pandas Profiling goes a few steps further and shows five different correlation tables: Pearson’s, Spearman’s, Kendall’s, Phik, and Cramér’s V. Not sure what these all are? No problem. Clicking on Toggle correlation descriptions, you will get an explanation for each of them.
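To get a feel for how two of these measures differ, here is a small sketch using pandas' built-in DataFrame.corr. The column names and numbers are made up for illustration. pandas also accepts method="kendall", which relies on SciPy being installed, while Phik and Cramér's V are not built into pandas.

```python
import pandas as pd

# Made-up data, for illustration only
df = pd.DataFrame({
    "gdp": [1.0, 2.0, 3.0, 4.0, 5.0],
    "life_expectancy": [1.0, 4.0, 9.0, 16.0, 25.0],  # grows with gdp, but not linearly
    "corruption": [5.0, 4.0, 3.0, 2.0, 1.0],         # falls linearly as gdp grows
})

pearson = df.corr(method="pearson")    # measures linear association
spearman = df.corr(method="spearman")  # measures monotonic (rank) association

print(pearson)
print(spearman)
```

Spearman's coefficient between gdp and life_expectancy is a perfect 1 because the relationship is strictly increasing, while Pearson's is slightly lower because the relationship is not a straight line.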

First and last rows

The developers behind Pandas Profiling wanted to make sure that they covered every step of an exploratory data analysis. Thus, they included the first and last rows of the dataset, which is a very nice touch.

Sharing the report

Another advantage of Pandas Profiling is that it makes it very easy to share the report. Instead of sharing a Jupyter Notebook where the other user will have to run all the code, you share a file that is ready for analysis. You can save the report by typing profile.to_file('report_name.html'), and a new file will be created in the same folder as the notebook you are working in.

Conclusions

Pandas Profiling has become more popular in the past few months. However, it’s not as popular as it deserves to be. You should learn how to do an EDA manually to improve your coding skills, but let’s admit that you can save at least one hour of your time with Pandas Profiling.

You will not be less professional if you use this great library. In fact, it’s a smart way to save some time and focus on what matters: the analysis! I highly recommend trying this library, and if you do, let me know how it goes. Happy coding!

