The world’s leading publication for data science, AI, and ML professionals.

Pandas Pivot – The Ultimate Guide

Everything you've always wanted to know about pandas Pivot but were too afraid to ask.

Photo by Christian Fregnan on Unsplash
Photo by Christian Fregnan on Unsplash

Pandas pivot is an essential tool of every Data Scientist. Some use it daily and others avoid it because it seems complex. I was in the latter group for quite a while. After I took the time and did some research, I felt like I wasted a lot of time writing unnecessary code. To my surprise, I already knew the main building blocks of pandas. It is all simpler than it may seem.


What is a pivot table?

A pivot table is a table of statistics that summarizes the data of a more extensive table. In practical terms, a pivot table calculates a statistic on a breakdown of values. For the first column, it displays values as rows and for the second column as columns.

Let’s look at the example of a pivot table that calculates sum statistic on a breakdown by fruits and customers:

The left table is the base table for the pivot table on the right.
The left table is the base table for the pivot table on the right.

How can I pivot a table in pandas?

Pandas has a pivot_table function that applies a pivot on a DataFrame. It also supports aggfunc that defines the statistic to calculate when pivoting (aggfunc is np.mean by default, which calculates the average). I use the sum in the example below.

Let’s define a DataFrame and apply the pivot_table function.

df = pd.DataFrame(
    {
        "fruit": ["apple", "orange", "apple", "avocado", "orange"],
        "customer": ["ben", "alice", "ben", "josh", "steve"],
        "quantity": [1, 2, 3, 1, 2],
    }
)

I would like to have a breakdown with fruits in rows (specify index) and customers in columns (specify columns). For each entry, I would like to calculate the summation of quantities. The result is the same as in the table above.

df.pivot_table(index="fruit", columns="customer", values="quantity", aggfunc=np.sum)

Why do I get a value error when pivoting?

Gif from giphy
Gif from giphy

The most likely reason is that you’ve used the pivot function instead of pivot_table. This confused me many times. Pandas pivot function is a less powerful function that does pivot without aggregation that can handle non-numeric data.

The error "The ValueError: Index contains duplicate entries, cannot reshape" occurred because there are duplicates entries in your DataFrame. If I would pivot the DataFrame above, I would get the same error as apple and ben are duplicated and pivot function doesn’t aggregate.

It helped me to understand this problem when I went through pandas source code for pivot function. Basically it does the same as the command below:

df.set_index(["fruit", "customer"])["quantity"].unstack()

What operations do actually happen when pivoting?

Gif from giphy
Gif from giphy

At first, pivoting may seem to you like a hard-to-understand concept. But what if I told you that you were using it all along – at least the commands that are at the core of pivoting in pandas. Let’s look at the command below:

df.groupby(['fruit', 'customer']).quantity.sum().unstack()

Have you ever used groupby function in pandas? What about the sum command? Yes? I thought so. The output of the above command is the same as of pivot_table.

I haven’t use unstack many times but it basically unpacks multi-index to columns like in the image below

Unstack operation
Unstack operation

How can I set missing values to 0?

Say no more! pivot_table has a fill_value argument to replace missing values with. It is None by default. Let’s try it.

df.pivot_table(index="fruit", columns="customer", values="quantity", aggfunc=np.sum, fill_value=0)

Can I calculate multiple statistics at the same time?

The argument aggfunc of pivot_table function takes a list of functions. Let’s try it with sum and mean.

df.pivot_table(index="fruit", columns="customer", values="quantity", aggfunc=[np.sum, np.mean], fill_value=0)

Can I aggregate multiple values at the same time?

Gif from giphy
Gif from giphy

You can! Similar to aggfunc argument, values argument takes a list of column names. Let’s add a price column to the DataFrame.

df['price'] = [0.1, 0.2, 0.1, 0.4, 0.15]

Now that we have two columns with values, let’s apply pivot_table function:

df.pivot_table(index="fruit", columns="customer", values=["quantity", "price"], aggfunc=np.mean, fill_value=0)

Can I do a breakdown of rows/columns even further?

The answer is yes yet again. Arguments index and column both take lists. Let’s add column origin of fruit to the DataFrame.

df['origin'] = ['italy', 'spain', 'spain', 'mexico', 'portugal']

Now, let’s do a breakdown of fruit and origin by rows and customers by column.

df.pivot_table(
    index=["fruit", "origin"],
    columns=["customer"],
    values=["quantity"],
    fill_value=0,
    aggfunc=np.mean,
)

It works like a charm!

Let’s connect

Talk: Book a call Socials: YouTube 🎥 | LinkedIn | Twitter Code: GitHub


Related Articles