
3 Not So Common Yet Functional Python Libraries for Data Science

That make your life easier.

Photo by Samantha Gades on Unsplash

One of the reasons Python dominates data science is the rich selection of libraries it offers. The active Python community keeps maintaining and improving these libraries, which helps Python stay on top.

Some of the most commonly used Python libraries for data science are Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn, TensorFlow, and PyTorch. They can be considered the FAANG of the Python library ecosystem.

Just as there are many successful companies outside the FAANG, Python has other libraries that come in handy in particular cases. In this article, I will introduce three of them.


Altair

Altair is a statistical visualization library for Python. It is not as popular as Seaborn or Matplotlib, but I suggest you give Altair a chance as well.

What I like best about Altair is its filtering and data transformation operations. It provides many options for manipulating data while creating a visualization, so you can filter rows or derive new fields inside the chart definition itself. In this sense, Altair can be considered a more complete exploratory data analysis tool.

We can also create interactive visualizations with Altair. Furthermore, it is possible to add selection objects to the visualizations so that what you select on one chart changes what another chart displays. Cool feature! 🙂

The following interactive visualization was created with Altair. The one on the right is a histogram that shows the price distribution of the data points selected on the left plot.

(GIF by author)

I have written a few articles that explain how to use Altair. They constitute a practical Altair tutorial so I suggest you visit them if you’d like to learn more about Altair.


Sidetable

Sidetable is an add-on to the Pandas library. It was created by Chris Moffitt.

Pandas groups certain families of methods under accessors. For instance, the methods for manipulating strings are accessed through the str accessor. I mention this because Sidetable is used the same way: once imported, it is available as the stb accessor on data frames.
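To make the accessor idea concrete, here is a small sketch that uses the built-in str accessor and then registers a custom data frame accessor through the extension API Pandas provides for this purpose. The accessor name demo and its n_unique method are hypothetical and only serve as an illustration; as far as I know, Sidetable registers its stb accessor through the same mechanism.

```python
import pandas as pd

df = pd.DataFrame({
    "brand": ["ford", "bmw", "ford"],
    "price": [8500, 15400, 17200],
})

# Built-in accessor: string methods live under .str on a Series.
upper = df["brand"].str.upper()


# Custom accessor registered under a hypothetical name, "demo".
@pd.api.extensions.register_dataframe_accessor("demo")
class DemoAccessor:
    def __init__(self, pandas_obj):
        self._df = pandas_obj

    def n_unique(self, column):
        """Return the number of distinct values in a column."""
        return self._df[column].nunique()


# The new methods are now reachable as df.demo.<method>.
unique_brands = df.demo.n_unique("brand")
```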

It can be installed from the terminal or in a Jupyter notebook.

# from the terminal
$ python -m pip install -U sidetable
# in a Jupyter notebook
!pip install sidetable

In order to have fun with Sidetable, we need to import it along with Pandas.

import pandas as pd
import sidetable

What Sidetable does is similar to the value_counts function of Pandas but it provides much more insight.

When applied to a categorical variable, the value_counts function gives us the number of observations or percent share for each category. On the other hand, Sidetable not only gives the number of observations and the percent share together, it also provides cumulative values.

Let’s do a simple example to demonstrate the difference. Consider the following data frame.

The first five rows of df (image by author)

We can find out the number of cars for each brand as follows.

The value_counts function (image by author)
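Since the screenshot is not reproduced here, the following sketch shows the same kind of call on a small made-up data frame. The brand column and its values are assumptions, not the author's data.

```python
import pandas as pd

# Hypothetical stand-in for the article's data frame.
df = pd.DataFrame({"brand": ["Ford", "BMW", "Ford", "Audi", "Ford", "BMW"]})

# Number of observations per brand.
counts = df["brand"].value_counts()

# Share of each brand (fractions that sum to 1).
shares = df["brand"].value_counts(normalize=True)

print(counts)
print(shares)
```

Note that counts and shares come from two separate calls, which is the gap Sidetable fills.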

Sidetable's freq function returns a more informative table.

Sidetable (image by author)

This is a sample data frame with only 25 rows. With the larger data frames you are likely to work with in real life, Sidetable will be much more practical and functional.

In addition to the freq function, Sidetable has counts, missing, and subtotal functions, which are quite useful as well.

If you’d like to learn more about Sidetable, here are two articles with several examples.


Missingno

Missingno, as its name suggests, is a library that helps with handling the missing values in a data frame.

Pandas has functions to find the number of missing values or to replace them with appropriate values. Missingno instead creates visualizations that provide an overview of how the missing values are distributed.
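For reference, the Pandas side of this can be sketched as follows. The data is made up, and filling with the column median is just one possible choice of "appropriate value", not a recommendation from the article.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values.
df = pd.DataFrame({
    "price": [8500, np.nan, 15400, np.nan],
    "year": [2012, 2015, np.nan, 2017],
})

# Number of missing values in each column.
missing_per_column = df.isna().sum()

# Replace missing values, here with each column's median (an assumption).
filled = df.fillna({
    "price": df["price"].median(),
    "year": df["year"].median(),
})

print(missing_per_column)
```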

Such an overview is definitely more informative than just knowing the number of missing values. It is also an important input when deciding how to handle them.

For instance, if most of the missing values are in the same rows, we can choose to drop those rows. However, if the missing values in different columns are spread across different rows, dropping would discard too much data, so we should probably find a better approach.
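Before plotting anything, the same question can be checked numerically with Pandas. This is a sketch on made-up data in which all missing values sit in a single row.

```python
import numpy as np
import pandas as pd

# Hypothetical data: every missing value falls in the second row.
df = pd.DataFrame({
    "brand": ["Ford", None, "BMW", "Audi"],
    "price": [8500, np.nan, 15400, 22500],
    "year": [2012, np.nan, 2015, 2017],
})

# Number of missing values in each row.
missing_per_row = df.isna().sum(axis=1)

# How many rows contain at least one missing value.
rows_with_missing = (missing_per_row > 0).sum()

# Dropping is cheap here because only one row is affected.
cleaned = df.dropna()
```

If rows_with_missing were close to the total number of rows, dropping would remove most of the data, which is the case where another approach is needed.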

Missingno makes it easier for us to explore the distribution of missing values.

Visualize Missing Values with Missingno


Conclusion

Python is a prominent language in data science for a reason. There are numerous libraries that make your life easier. Many thanks to the great Python community for creating such outstanding libraries.


Last but not least, if you are not a Medium member yet and plan to become one, I kindly ask you to do so using the following link. I will receive a portion from your membership fee with no additional cost to you.

Join Medium with my referral link – Soner Yıldırım


Thank you for reading. Please let me know if you have any feedback.

