
How to Do a Ton of Analysis in Python in the Blink of An Eye.

Cut down your data exploration time to one-tenth of its original duration using these Python exploratory data analysis tools.

Image by Daniel Hannah from Pixabay

I remember the good old college days when we spent weeks analyzing survey data in SPSS. It’s interesting to see how far we have come since then.

Today, we can do all of that, and a lot more, with a single command before you even blink.

That’s a remarkable improvement!

This short article will share three impressive Python libraries for exploratory data analysis (EDA). Not a Python pro? Don’t worry! You can benefit from these tools even if you know nothing about Python.

They could save you weeks of data exploration and improve its quality. You will also have far fewer hair-pulling moments.

The first one is the most popular, the second is my favorite, and the last one is the most flexible. Even if you already know these libraries, the CLI wrapper I introduce in this post may help you use them at lightning speed.

The most popular Python exploratory data analysis library.

With over 7.7k stars on GitHub, Pandas-Profiling is the most popular exploratory data analysis tool on our list. It’s easy to install, straightforward to use, and impeccable in its results.

You can use either PyPI or Conda to install Pandas-Profiling.

pip install pandas-profiling
# conda install -c conda-forge pandas-profiling

The installation also adds a pandas_profiling command to your terminal. Within seconds, it generates an HTML report with tons of analysis about your dataset.
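The same report can also be produced from Python. The snippet below is a minimal sketch, not code from the original post: the helper names default_report_name and profile_csv and the output file name are made up for illustration. Newer releases of the project are published as ydata-profiling, so the import falls back between the two package names.

```python
from pathlib import Path


def default_report_name(csv_path: str) -> str:
    """Derive an output name, e.g. data/titanic.csv -> titanic_profile.html."""
    return f"{Path(csv_path).stem}_profile.html"


def profile_csv(csv_path: str, title: str = "Profiling Report") -> str:
    """Read a CSV file and write a self-contained HTML profiling report."""
    # Imported lazily so the helper above works without the heavy dependencies.
    import pandas as pd

    try:
        from ydata_profiling import ProfileReport  # name used by newer releases
    except ImportError:
        from pandas_profiling import ProfileReport  # name used in this article

    df = pd.read_csv(csv_path)
    output = default_report_name(csv_path)
    ProfileReport(df, title=title).to_file(output)
    return output


# Example call (assumes titanic.csv sits in the working directory):
# profile_csv("titanic.csv", title="Titanic survivors")
```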

The blink moment: Here’s a demo that shows how it works. We run the analysis on the popular Titanic survivor dataset and store the report in an HTML file. We then use our favorite browser to open it. Here is a live version you can play around with.

Illustration by the Author.

When you open the file or the live link above, it will look like the following.

Screenshot by the Author.

The variables section is a comprehensive analysis of every variable in your dataset. It includes descriptive statistics, histograms, and the common and extreme values of each variable.

In the interactions section, you can choose any two variables and create a scatterplot.

It’s a single-page dependency-free web app. You can host it with any static site hosting provider because the generated HTML is a self-contained application.

One of my favorite parts of this report is the correlation section. It creates a heatmap of the correlations between variables. You can choose the type of correlation to use in the heatmap.

My favorite EDA library.

Though it has only 1.7k stars on GitHub, Sweetviz fascinates me in many ways. The obvious magnet is the library’s super-cool interactive HTML output. But my love for this tool is for other reasons.

You can install the library with the command below:

pip install sweetviz

Sweetviz doesn’t ship with a command-line interface. But the code below creates a CLI wrapper around the library. If you want to learn more about creating nifty CLIs for your data science projects, check out my previous article on the topic.

The complete code is available in the GitHub repository. Non-Python users can follow the instructions there to get started quickly.
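For readers without access to the repository, here is a rough sketch of what such a wrapper could look like. It is not the original script: it uses argparse from the standard library, and the function names and the report file name are assumptions made for illustration. Only sv.analyze and show_html come from Sweetviz itself.

```python
#!/usr/bin/env python3
"""Sketch of a `sweet` command: profile one CSV file with Sweetviz."""
import argparse
import sys
from pathlib import Path


def report_name(dataset: str) -> str:
    """Name the report after the dataset: titanic.csv -> titanic_report.html."""
    return f"{Path(dataset).stem}_report.html"


def analyze(dataset: str) -> str:
    """Generate the Sweetviz report and open it in the default browser."""
    # Lazy imports keep `sweet --help` usable before the libraries are installed.
    import pandas as pd
    import sweetviz as sv

    output = report_name(dataset)
    sv.analyze(pd.read_csv(dataset)).show_html(filepath=output, open_browser=True)
    return output


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="sweet", description="Quick EDA with Sweetviz")
    parser.add_argument("dataset", help="path to a CSV file")
    return parser


def main(argv=None) -> None:
    argv = sys.argv[1:] if argv is None else argv
    parser = build_parser()
    if not argv:  # no arguments: show usage instead of failing
        parser.print_help()
        return
    args = parser.parse_args(argv)
    print(f"Report written to {analyze(args.dataset)}")


if __name__ == "__main__":
    main()
```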

The primary usage of Sweetviz with the CLI.

For the above script to work:

  1. copy the content to a file called sweet (note that the file doesn’t have an extension);
  2. make the file executable, which you can do with chmod +x sweet; and
  3. add the current directory to the system path with export PATH=$PATH:$PWD.

The blink moment: This creates the CLI we need to generate EDAs more quickly. Here’s the primary usage of it.

Illustration by the Author.

The above example generates a detailed report about the dataset and opens it in the browser. The output may look like the one below. A live version is available too.

Illustration by the Author.

You may notice that Sweetviz gives almost the same information as Pandas-Profiling. Sweetviz, too, generates a self-contained HTML file. You can host it with static hosting solutions such as GitHub Pages.

Sweetviz is my favorite because of two remarkable features: dataset comparisons and target variables. We’ll look at them one by one and then together.

Comparing datasets with Sweetviz in a CLI.

Update the sweet file we created with the content below. You can paste it below the ‘MORE FEATURES’ line. This function adds a comparison capability to your CLI.
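As an illustration only (the original function is in the repository), a comparison helper might look like the sketch below. The names named_frame, compare, and compare_parser and the output file name are hypothetical. The call to sv.compare with a [dataframe, name] pair follows the Sweetviz documentation.

```python
import argparse
from pathlib import Path


def named_frame(path: str) -> list:
    """Load a CSV and pair it with a display name, the form sv.compare accepts."""
    import pandas as pd  # lazy import: only needed when a report is built

    return [pd.read_csv(path), Path(path).stem]


def compare(source: str, other: str, output: str = "compare_report.html") -> str:
    """Build one Sweetviz report that shows both datasets side by side."""
    import sweetviz as sv

    report = sv.compare(named_frame(source), named_frame(other))
    report.show_html(filepath=output, open_browser=True)
    return output


def compare_parser() -> argparse.ArgumentParser:
    """Command-line arguments for the comparison: two CSV paths."""
    parser = argparse.ArgumentParser(description="Compare two datasets with Sweetviz")
    parser.add_argument("source", help="path to the primary CSV file")
    parser.add_argument("other", help="path to the CSV file to compare against")
    return parser


# Example (file names are placeholders):
# args = compare_parser().parse_args(["titanic.csv", "titanic_sample.csv"])
# compare(args.source, args.other)
```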

The blink moment: Here’s how it works. It takes two files as arguments and generates the report as it did earlier. For this example, I created a second file by sampling the Titanic dataset. In a real-life scenario, you may have a different version of the same file.
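If you want to build a similar second file yourself, the sketch below draws a reproducible random sample of rows from a CSV file. It uses only the standard library, and the function name, file names, and default fraction are assumptions. With pandas, DataFrame.sample(frac=0.5) followed by to_csv would do the same job.

```python
import csv
import random


def sample_csv(source: str, destination: str, fraction: float = 0.5, seed: int = 42) -> int:
    """Copy a random fraction of the data rows in `source` to `destination`.

    The header row is always kept. Returns the number of data rows written.
    """
    with open(source, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        rows = list(reader)

    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    sampled = rng.sample(rows, round(len(rows) * fraction))

    with open(destination, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(sampled)
    return len(sampled)


# Example (file names are placeholders):
# sample_csv("titanic.csv", "titanic_sample.csv", fraction=0.5)
```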

Illustration by the Author.

The generated output now looks different. It now contains a comparison value displayed at every level. You can see it clearly in this live version.

Illustration by the Author.

Making such a comparison with two datasets might take significant effort otherwise.

Another cool thing about Sweetviz is its target variable setting. With this, you can generate a report where every cut is examined against a target variable. The update to the code below will let you do it with the CLI.
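A hedged sketch of such an update follows. The helper names and the output file name are invented for illustration. The target_feat argument is the Sweetviz parameter for marking a target, and according to the Sweetviz documentation only boolean and numerical columns can be targets.

```python
def check_target(columns, target: str) -> str:
    """Fail early, with a readable message, if the target column is missing."""
    if target not in columns:
        available = ", ".join(map(str, columns))
        raise ValueError(f"Column {target!r} not found. Available columns: {available}")
    return target


def analyze_with_target(dataset: str, target: str, output: str = "target_report.html") -> str:
    """Build a Sweetviz report in which every variable is shown against `target`."""
    # Lazy imports so check_target can be used without the libraries installed.
    import pandas as pd
    import sweetviz as sv

    df = pd.read_csv(dataset)
    check_target(list(df.columns), target)
    sv.analyze(df, target_feat=target).show_html(filepath=output, open_browser=True)
    return output


# Example (file and column names are placeholders):
# analyze_with_target("titanic.csv", "Survived")
```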

The blink moment: Now, you can specify the dataset name and a target variable in the CLI. Here’s the demo and the output (live version).

Illustration by the Author.
Illustration by the Author.

I’ve specified the ‘Survived’ variable as the target variable. Now, alongside every variable, you can also study the variability of the target.

In most cases, you’ll want to see how your target variable has changed across different versions of your dataset. It’s only another blink with Sweetviz.

Dataset comparison with a target variable

The code below will update the CLI to accept three arguments. The first is the primary dataset, the second is the comparison dataset, and the last is the target variable.
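One way to write this, offered as a sketch rather than the original code, is shown below. The parser and function names are hypothetical. The library call is sv.compare with its target_feat argument.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Three positional arguments: primary dataset, comparison dataset, target."""
    parser = argparse.ArgumentParser(description="Compare two datasets against a target")
    parser.add_argument("primary", help="path to the primary CSV file")
    parser.add_argument("comparison", help="path to the comparison CSV file")
    parser.add_argument("target", help="name of the target column")
    return parser


def compare_with_target(primary: str, comparison: str, target: str,
                        output: str = "compare_target_report.html") -> str:
    """Build one report that compares both datasets and tracks `target` in each."""
    # Lazy imports so the parser works without the libraries installed.
    import pandas as pd
    import sweetviz as sv

    # Each dataset is passed as [dataframe, display name].
    first = [pd.read_csv(primary), Path(primary).stem]
    second = [pd.read_csv(comparison), Path(comparison).stem]
    report = sv.compare(first, second, target_feat=target)
    report.show_html(filepath=output, open_browser=True)
    return output


# Example (file and column names are placeholders):
# args = build_parser().parse_args(["titanic.csv", "titanic_sample.csv", "Survived"])
# compare_with_target(args.primary, args.comparison, args.target)
```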

The blink moment: You can run it with the sample dataset we created earlier for the comparison and the ‘Survived’ column as the target.

Illustration by the Author.

The output now has both the comparison dataset and the analysis against a target variable. This could be extremely useful in most professional work, especially when you keep working on the same dataset, update it with new observations, and focus on a single variable. Here is the live version to test.

Illustration by the Author.

The flexible EDA playground.

If you can spend a few more blinks but need more control over your analysis, here is what you need. Pandas GUI creates a graphical wrapper around your data frame. Instead of writing code, you can use a convenient interface. Pandas GUI is more of an exploration playground than a quick exploration tool.

You can install it with PyPI:

pip install pandasgui

Like Sweetviz, Pandas GUI doesn’t come with a CLI. Although starting it isn’t tricky, the CLI wrapper below could help you if you aren’t a Python user.

As we did for Sweetviz, create a file named pgui with the above content and make it executable with chmod +x pgui. You don’t have to add the current directory to the path again, as we already did that. The command below will start the UI.
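A minimal sketch of such a wrapper is shown here. It is an assumption about what the script could look like, not the original: the function names are invented, and the only Pandas GUI call used is show.

```python
#!/usr/bin/env python3
"""Sketch of a `pgui` command: open a CSV file in Pandas GUI."""
import argparse
import sys


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="pgui", description="Open a CSV file in Pandas GUI")
    parser.add_argument("dataset", help="path to a CSV file")
    return parser


def launch(dataset: str) -> None:
    """Load the CSV into a data frame and hand it to the Pandas GUI window."""
    # Lazy imports keep `pgui --help` usable before the libraries are installed.
    import pandas as pd
    from pandasgui import show

    show(pd.read_csv(dataset))


def main(argv=None) -> None:
    argv = sys.argv[1:] if argv is None else argv
    parser = build_parser()
    if not argv:  # no arguments: show usage instead of failing
        parser.print_help()
        return
    launch(parser.parse_args(argv).dataset)


if __name__ == "__main__":
    main()
```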

pgui titanic.csv
Illustration by the Author.

An interactive application pops up. With this tool, you can do analyses that are impossible with the two other tools I’ve mentioned.

For example, here is a contour plot of survivors against their age.

Illustration by the Author.

We won’t go into more detail about Pandas GUI here. But the video below from their official docs will help you learn more about it.

Conclusion

Apart from interpreting the results, exploratory data analysis is repetitive for the most part. Gone are the days when we struggled with SPSS and Excel to do trivial things. Today, we can do a lot more than that in the blink of an eye.

In this article, I’ve discussed three strikingly convenient Python libraries for EDA. Pandas-Profiling is the most popular one among them. Sweetviz creates a self-contained HTML application that I find handy. Lastly, we discussed Pandas GUI, a tool that allows you to control your analysis.

Along with the libraries, we’ve also discussed creating CLI wrappers to make them more convenient. These wrappers allow non-Python users to benefit from the tools as well.

Installation and usage are straightforward for all three libraries. With the repetitive tasks of EDA being taken care of, you may focus your attention on the more exciting stuff.

Be armed to surprise your audience before they blink.


