Data does not come in a usable format by default; a Data Science professional has to spend 70–80% of their time in data cleaning and manipulation to make it ready to use to generate meaningful insights.
In the data manipulation process, a data scientist/analyst might have to do various types of data transformations and sometimes we feel stuck as we don’t know if a direct function exists in python to perform the required transformations.
In this blog post, we will look at 4 advanced python Data Transformation functions that will make your life easier as a data science professional and will be a great addition to your arsenal of data manipulation functions.
1. Pivot
The pivot function in pandas has the same functionality as the pivot operation in excel. We can transform a dataset from a long format to a wide format.
Let’s understand this with an example. Imagine, we have a dataset around Covid-19 cases across countries, as shown below.
We want to convert the dataset into a form such that each country becomes a column and the new confirmed cases as values corresponding to the countries. We can perform this data manipulation using the pivot function.
### Pivot the dataset
pivot_df = pd.pivot(df, index =['Date'], columns ='Country', values =['NewConfirmed'])
## renaming the columns
pivot_df.columns = df['Country'].sort_values().unique()
We can bring the new columns to the same level as the index column Data by resetting the index.
## reset the index to modify the column levels
pivot_df = pivot_df.reset_index()
2. Melt
Melt is the opposite of pivot – it is used to unpivot the dataset. It converts the data from wide format to long format.
Let’s see how we can unpivot the wide format Covid-19 dataset that we created above.
## The dataset is melted by setting the id column - a column that will not change.
## and value column - columns we want to unpivot
melted_df = pivot_df.melt(id_vars = 'Date', value_vars = ['US', 'India', 'China'])
# we can rename the columns too
melted_df.columns = ['Date', 'Country', 'NewConfirmed']
3. Stack
The stack function is used to convert(or unpivot) the multi-level columns to rows.
Let’s look at a few examples!
If we pick the pivoted covid-19 dataset without resetting the index then it would look something like this.
We can stack the country columns back to rows using the stack function as shown below.
## stack the dataset
stack_df = pivot_df.stack()
## reset the index and set column names
stack_df = stack_df.reset_index()
stack_df.columns = ['Date','Country','NewConfirmed']
stack_df
Now, you might be thinking that the same transformation can be done using the melt function too and you are right. But still, there is a difference between the two, the stack function is more advanced – it works on multi-level columns but melt cannot. For example, the stack function can transform the below data having 2 column levels:
The ‘-1’ level denotes the 1st column from the last.
4. Unstack
Unstack is the opposite of stack – it is used to pivot one/multiple levels of a multi-level column dataset.
Let’s look at a few examples to understand it better!
Using unstack, we can pivot the column of the dataset as shown below.
Unstack function can work on multi-level column datasets too while the melt function cannot.
Pivot/Melt functions are subsets of the Stack/Unstack functions. Pivot/Melt does not work with multi-level columns.
9 Python Concepts You Should not Skip for Effective Data Science
Conclusion
In this blog, we looked at the 4 advance data transformation techniques to convert the data format from long to wide format or vice versa.
Pivot/Melt works with single-level column datasets while Stack/Unstack can be applied to any complex multi-level column datasets too.
Thank You
I hope you found the story useful. You can get all my posts in your inbox. Do that here!
If you like to experience Medium yourself, consider supporting me and thousands of other writers by signing up for a membership. It only costs $5 per month, it supports us, writers, greatly, and you get to access all the amazing stories on Medium.