The world’s leading publication for data science, AI, and ML professionals.

How to Become a Pandas Jedi

A practical guide to boost your data analysis skills

(image by author)
(image by author)

Pandas is a popular data analysis and manipulation library for Python. It provides a wide variety of functions to manipulate, transform, and analyze data.

It is important to know what a function does. But, more importantly, we should know which functions to use for a given task. In a typical case, we are not told to apply a particular function. Instead, we are given a task and expected accomplish it.

To become a master of pandas, we should feel comfortable at selecting and applying appropriate functions to complete a task. It requires a comprehensive understanding of functions and lots of practice.

In this article, we will work on some typical data analysis and manipulation tasks on a supermarket dataset. The focus is not a particular function of Pandas. Instead, we focus on the given task and try to implement an efficient solution.

Here is an overview of the dataset. I have excluded some of the columns in the original version of it on Kaggle.

import numpy as np
import pandas as pd
df = pd.read_csv("/content/supermarket.csv", parse_dates=['date'])
df.head()
(image by author)
(image by author)

Note: Parse_dates parameter stores the given columns with datetime data type. It is important when working with dates and times.


Let’s suppose, for a particular branch, we need to calculate the highest difference in total sales between two consecutive days. It might help us understand the factors that increase the sales. We can also find out if there is an extraordinary situation.

It is better to design our solution before writing the code. We can approach this tasks as follows:

  • Filter the data points (i.e. rows) that belong to the branch of interest
  • Select the date and total sales amount columns
  • Group the data points by date and calculate the sum
  • Sort the results by date
  • Calculate the difference between two consecutive dates
  • Select the maximum

It seems like a highly complicated operation. However, pandas provides versatile and powerful functions that allow us to tackle down such tasks easily.

Here is our solution:

df[df.branch == 'A'][['date','total']]
.groupby('date', as_index=False)
.sum()['total'].diff().max()
2888.53

Let’s elaborate on the code. The first line filters data points that belong to branch A and selects the date and total columns. The second line groups the rows by date. The output of the group by function is automatically sorted by date because we use the date as the grouping column. If not, we could use the sort_values function after the groupby function.

The third line calculates the total amount for each group (i.e. each day). Then, we select the total column and apply the diff function. It calculates the difference between a row and its previous row. Since the rows are consecutive days, we end up getting the difference between the sales amount of two consecutive days. Finally, we use the max function to see the maximum difference.

We know the highest difference but it is not useful without knowing the date. Thus, we should improve our solution a little.

df_sub = df[df.branch == 'A'][['date','total']]
.groupby('date', as_index=False).sum()
df_sub['diff'] = df_sub['total'].diff()
df_sub.sort_values(by='diff', ascending=False, inplace=True)

The first operation creates a dataframe that contains the total daily sales of branch A. The second one creates a column with the difference between the totals of two consecutive days. Finally, we sort the rows based on the difference in descending order.

The first row of df_sub shows the date with the highest difference with its previous day.

(image by author)
(image by author)

We can confirm the result by checking the total amounts on 2019–02–16 and 2019–02–17.

(image by author)
(image by author)

It is important to note that this might not be the only solution for this task. Pandas provides various functions and techniques which make it a versatile and powerful tool for data analysis. Thus, we may come up with multiple solutions for a given task.


The sales are divided into 6 product lines. For a particular branch, we may want to transform the dataframe to a format in which the product lines are represented as columns. The dates will constitute the rows and the values are the total sales amount in each day.

We can approach this tasks as follows:

  • Filter the data points (i.e. rows) that belong to the branch of interest
  • Select the product line, date, and total columns
  • Group the data points by product line and date and calculate the sum of total sales for each group
  • Transform the dataframe from long to wide format

Here is our solution:

df_new = df.query('branch == "A"')[['prod_line','date','total']]
.groupby(['prod_line','date'], as_index=False).sum()
.pivot_table(index='date', columns='prod_line', fill_value=0)

We use the query function just to demonstrate a different way for filtering rows. Then, we select the desired columns and apply the groupby function.

The pivot_table function creates a pivot table like in excel. The index parameter represents the rows which are dates in our case. The values in the pivot table become the total sales amount since it is the only appropriate options. If there are multiple candidates for values, we can specify it by using the values parameter.

If there is no sales in a product line on a particular date, the value becomes "NaN" but we can change it using the fill_value parameter.

Here is an overview of the new dataframe:

df_new.head()
(image by author)
(image by author)

Conclusion

I think this article demonstrate the power of Pandas. It is simple and efficient. The syntax is highly intuitive which makes the code easy to understand. However, mastering Pandas requires practice just like any other tool does.

It is one thing to know what a function does but we should be able to combine multiple functions to accomplish a given task. It requires a comprehensive understanding of functions and how they can be used together.

Thank you for reading. Please let me know if you have any feedback.


Related Articles