The world’s leading publication for data science, AI, and ML professionals.

Reshaping Pandas DataFrames

Melt, Stack and Pivot functions

Pandas is a very powerful Python data analysis library that expedites the preprocessing steps of your project. The core data structure of Pandas is DataFrame which represents data in tabular form with labeled rows and columns. In this post, I will try to explain how to reshape a dataframe by modifying row-column structure.

Photo by Paul Skorupskas on Unsplash
Photo by Paul Skorupskas on Unsplash

There are multiple ways to reshape a dataframe. We can choose the one that best fits the task at hand. The functions to reshape a dataframe:

  • Melt
  • Stack and unstack
  • Pivot

As always, we start with importing numpy and pandas:

import pandas as pd
import numpy as np

Melt

Melt is used to convert wide dataframes to narrow ones. What I mean by wide is a dataframe with a high number of columns. Some dataframes are structured in a way that consecutive measurements or variables are represented as columns. In some cases, representing these columns as rows may fit better to our task.

Consider the following dataframe:

df1 = pd.DataFrame({'city':['A','B','C'],
                   'day1':[22,25,28],
                   'day2':[10,14,13],
                   'day3':[25,22,26],
                   'day4':[18,15,17],
                   'day5':[12,14,18]})

We have three different cities and measurements done on different days. We decide to represent these days as rows in a column. There will also be a column to show the measurements. We can easily accomplish this by using melt function:

df1.melt(id_vars=['city'])

Variable and value column names are given by default. We can use var_name and value_name parameters of melt function to assign new column names. It will also look better if we sort the data by city column:

df1.melt(id_vars=['city'], var_name = 'date', value_name = 'temperature').sort_values(by='city').reset_index(drop=True)

Stack and unstack

Stack function kind of increases the index level of the dataframe. What I mean by increasing the level is:

  • If dataframe has a simple column index, stack returns a series whose indices consist of row-column pairs of original dataframe.
  • If dataframe has multi-level index, stack increases the index level.

It is better explained with examples. Consider the following dataframe:

df1 has 3 rows and 6 columns with simple integer column index. If stack function is applied to df1, it will return a series with 3 x 6 = 18 rows. The index of the series will be [(0, ‘city’), (0, ‘day1’), … , (2, ‘day5’)].

Let’s also check the shape and index:

df1.shape
(3,6)
df1.stack().shape
(18,)
df1.stack().index[0] #multilevel index
(0, 'city')

Stack and unstack functions are more commonly used for dataframes with multi-level indices. Let’s create a dataframe with multi-level index:

tuples = [('A',1),('A',2),('A',3),('B',1),('A',2)]
index = pd.MultiIndex.from_tuples(tuples, names=['first','second'])
df2 = pd.DataFrame(np.random.randint(10, size=(5,2)), 
                   index=index, columns=['column_x', 'column_y'])

If we apply stack function on this dataframe, the level of index will be increased:

df_stacked = df2.stack().to_frame()
df_stacked

Now the names of the colums (column_x and column_y) are part of multi-level index. So the resulting dataframe has one column and a 3-level multi-index.

len(df_stacked.index.levels)
3
len(df2.index.levels)
2

Unstack is just the opposite of stack. If we apply unstack to the stacked dataframe, we will get back the original dataframe:

df_stacked.unstack().index
MultiIndex(levels=[['A', 'B'], [1, 2, 3]],
           codes=[[0, 0, 0, 1, 1], [0, 1, 2, 0, 1]],
           names=['first', 'second'])
df2.index
MultiIndex(levels=[['A', 'B'], [1, 2, 3]],
           codes=[[0, 0, 0, 1, 1], [0, 1, 2, 0, 1]],
           names=['first', 'second'])

Pivot

Pivot function can also be considered as a way to look at the dataframe from a different perspective. It is used to explore the relationships among variables by allowing to represent data in different formats.

Consider the following dataframe:

We want to see how values change according to city-name pairs. We can create a new representation of this dataframe with an index of names and columns of cities.

If a city-name pair does not exist, corresponding cell is filled with NaN.

We do not have to see all the values at once. The values to put in the dataframe can be filtered using values parameter:


I think the success and prevalence of Pandas come from the versatile, powerful and easy-to-use functions to manipulate and analyze data. There are almost always multiple ways to do a task with Pandas. Since a big portion of time spent on a Data Science project is spent during data cleaning and preprocessing steps, it is highly encouraged to learn Pandas.


Thanks for reading. Please let me know if you have any feedback.


Related Articles