
Pandas pivot is an essential tool of every Data Scientist. Some use it daily and others avoid it because it seems complex. I was in the latter group for quite a while. After I took the time and did some research, I felt like I wasted a lot of time writing unnecessary code. To my surprise, I already knew the main building blocks of pandas. It is all simpler than it may seem.
What is a pivot table?
A pivot table is a table of statistics that summarizes the data of a more extensive table. In practical terms, a pivot table calculates a statistic on a breakdown of values. For the first column, it displays values as rows and for the second column as columns.
Let’s look at the example of a pivot table that calculates sum statistic on a breakdown by fruits and customers:

How can I pivot a table in pandas?
Pandas has a pivot_table function that applies a pivot on a DataFrame. It also supports aggfunc that defines the statistic to calculate when pivoting (aggfunc is np.mean by default, which calculates the average). I use the sum in the example below.
Let’s define a DataFrame and apply the pivot_table function.
df = pd.DataFrame(
{
"fruit": ["apple", "orange", "apple", "avocado", "orange"],
"customer": ["ben", "alice", "ben", "josh", "steve"],
"quantity": [1, 2, 3, 1, 2],
}
)
I would like to have a breakdown with fruits in rows (specify index) and customers in columns (specify columns). For each entry, I would like to calculate the summation of quantities. The result is the same as in the table above.
df.pivot_table(index="fruit", columns="customer", values="quantity", aggfunc=np.sum)
Why do I get a value error when pivoting?

The most likely reason is that you’ve used the pivot function instead of pivot_table. This confused me many times. Pandas pivot function is a less powerful function that does pivot without aggregation that can handle non-numeric data.
The error "The ValueError: Index contains duplicate entries, cannot reshape" occurred because there are duplicates entries in your DataFrame. If I would pivot the DataFrame above, I would get the same error as apple and ben are duplicated and pivot function doesn’t aggregate.

It helped me to understand this problem when I went through pandas source code for pivot function. Basically it does the same as the command below:
df.set_index(["fruit", "customer"])["quantity"].unstack()
What operations do actually happen when pivoting?

At first, pivoting may seem to you like a hard-to-understand concept. But what if I told you that you were using it all along – at least the commands that are at the core of pivoting in pandas. Let’s look at the command below:
df.groupby(['fruit', 'customer']).quantity.sum().unstack()
Have you ever used groupby function in pandas? What about the sum command? Yes? I thought so. The output of the above command is the same as of pivot_table.
I haven’t use unstack many times but it basically unpacks multi-index to columns like in the image below

How can I set missing values to 0?
Say no more! pivot_table has a fill_value argument to replace missing values with. It is None by default. Let’s try it.
df.pivot_table(index="fruit", columns="customer", values="quantity", aggfunc=np.sum, fill_value=0)

Can I calculate multiple statistics at the same time?
The argument aggfunc of pivot_table function takes a list of functions. Let’s try it with sum and mean.
df.pivot_table(index="fruit", columns="customer", values="quantity", aggfunc=[np.sum, np.mean], fill_value=0)

Can I aggregate multiple values at the same time?

You can! Similar to aggfunc argument, values argument takes a list of column names. Let’s add a price column to the DataFrame.
df['price'] = [0.1, 0.2, 0.1, 0.4, 0.15]

Now that we have two columns with values, let’s apply pivot_table function:
df.pivot_table(index="fruit", columns="customer", values=["quantity", "price"], aggfunc=np.mean, fill_value=0)

Can I do a breakdown of rows/columns even further?
The answer is yes yet again. Arguments index and column both take lists. Let’s add column origin of fruit to the DataFrame.
df['origin'] = ['italy', 'spain', 'spain', 'mexico', 'portugal']

Now, let’s do a breakdown of fruit and origin by rows and customers by column.
df.pivot_table(
index=["fruit", "origin"],
columns=["customer"],
values=["quantity"],
fill_value=0,
aggfunc=np.mean,
)

It works like a charm!
Let’s connect
Talk: Book a call Socials: YouTube 🎥 | LinkedIn | Twitter Code: GitHub