The world’s leading publication for data science, AI, and ML professionals.

A Practical Guide on Pandas Sidetable

How to use the sidetable efficiently.

Photo by Markus Spiske on Unsplash
Photo by Markus Spiske on Unsplash

Pandas is a very powerful and versatile Python Data Analysis library that expedites the preprocessing steps of data science projects. It provides numerous functions and methods that are quite useful in data analysis.

Although the built-in functions of Pandas are capable of performing efficient data analysis, custom made functions or libraries add value to Pandas. In this post, we will explore one of these add-ons which is sidetable.

Sidetable, created by Chris Moffitt, provides a frequency table based on selected columns. This can also be achieved using value_counts function of Pandas but sidetable is more flexible and comprehensive as we will see in the examples.

We first need to install it with pip or !pip if you use a jupyter notebook:

!pip install sidetable

We can now import sidetable along with pandas.

import pandas as pd
import sidetable

I will use a video game dataset which is available here on kaggle. Let’s read the csv file into a pandas dataframe and start working on examples.

df = pd.read_csv("/content/vgsales.csv")
df.shape
(16598, 11)
df.head()

The dataset contains 16598 video games with sales amount (in millions) and some features such as genre, publisher and year. Let’s see how many different genres exist in the dataset.

df.Genre.nunique()
12

We can see how many games each genre has using value_counts.

df.Genre.value_counts()

Action and Sports genres have the most video games. We may want to see the ratio of each genre rather than the numbers. Normalize parameter of value_counts can be used as follows.

df.Genre.value_counts(normalize=True)

Almost 20% of the games belong to action genre. I think it is much better to see both count and percentage at the same time. Sidetable provides this convenience as well as some other informative statistics.

Sidetable is used ** as an accessor on dataframes with stb** keyword.

df.stb.freq(['Genre'])

As you can see, both count and percent values are displayed. In addition to these, cumulative values are also provided. For instance, the first 4 genres constitute 53% of the entire dataset.

In some cases, there might be so many different categories which makes it hard to visualize all. Thresh parameter of sidetable allows to limit the displayed values based on a threshold value on cumulative percent. For example, we can display the platforms that contain 70% of all video games.

df.stb.freq(['Platform'], thresh=70)

The first 8 platforms in terms of number of video games constitute 69.96% of the entire dataset. The remaining ones are combined under "others" label. Sidetable allows to change this label using other_label parameter.

df.stb.freq(['Platform'], thresh=70, other_label="Other Platforms")

Up until now, we have investigated the distribution of data according to the categories in one column. Sidetable allows to combine multiple columns and check more specific categories. For instance, the most common platform-genre combinations can be checked with sidetable.

df.stb.freq(['Platform', 'Genre'], thresh=15)

The most common platform-genre category is PS2-Sports which form 2.4% of the entire dataset. The first 8 platform-genre pairs constitute almost 15% of all video games in the dataset.

By default, as indicated with "count" column, sidetable returns the number of observations that each category has. Consider the example above. There are 400 video games in the dataset that belong to PS2-Sports category (i.e. platform is PS2 and genre is sports). However, we might be interested in some other measure. For instance, in our case, we may need to sort the platforms according to the global sales amount. It can be easily achieved with sidetable using value parameter. We just need to pass the name of the column to value parameter.

df.stb.freq(['Platform'], value='Global_Sales', thresh=50)

Global sales amount of PS2 games is approximately 1.26 billion USD. It is interesting that the platform "DS" is not in top 4 in terms of global sales although it has the highest number of games in the dataset.

Comparison of platforms in terms of global sales and counts
Comparison of platforms in terms of global sales and counts

DS platform is not generating the highest sales although it is number one in terms of the number of games. The possible reasons:

  • DS games are not very popular so they are not sold as much as the games of other platforms
  • DS games are cheaper than the games of other popular platforms

Contrary to DS, X360 platform has less games than DS, PS3 and Wii but generates more sales. Same possible reasons apply but in opposite way. X360 games might be more inexpensive or sold more than other platforms.


Sidetable can also be used to generate an overview of missing values.

It returns number of missing values in each column along with the ratio of missing values.


Another very useful function of sidetable is subtotal which can be applied to the groupby function to easily calculate subtotals of categories. It will be more clear with examples. Let’s first check the total sales of each genre in North America, Europe, and Japan.

df[['NA_Sales','EU_Sales','JP_Sales', 'Genre']].groupby('Genre').sum()

It returns the total sales of each genre in specified regions but does not calculate the subtotals. We just need to add stb.subtotal() to calculate the subtotals.

df[['NA_Sales','EU_Sales','JP_Sales', 'Genre']].groupby('Genre').sum().stb.subtotal()

There are other ways to calculate subtotals but stb.subtotal makes it simpler. You can feel the convenience more when there are nested categories.


I think the success and prevalence of Pandas come from the versatile, powerful and easy-to-use functions to manipulate and analyze data. There are almost always multiple ways to do a task with Pandas. The custom made utility libraries added on Pandas optimize certain operations and bring more value to Pandas. Thanks again to Chris Moffitt for this valuable tool.

Thank you for reading. Please let me know if you have any feedback.


Related Articles