It’s a well-known fact among data scientists: your data will never be exactly the way you want it. You might get a somewhat organized spreadsheet or reasonably sensible tables, but there will always be some cleaning up to do before you’re ready for analysis.
As a result, it’s crucial to be able to transition between different forms of data. Sometimes, it’s simply a matter of readability and ease of interpretation. Other times, you’ll quite literally find that a software package or model you’re trying to use simply won’t work unless your data is in a specific format. Whatever the case may be, this is a good skill to have.
In this article, I’m going to discuss two common forms of data: long-form data and wide-form data. These are widely used paradigms in Data Science, and it is good to be familiar with them. We’ll look at some examples to see what exactly both data formats look like, and then we’ll see how to convert between them using Python (and, more specifically, Pandas).
Let’s get into it.
Long-Form vs. Wide-Form Data
The simplest way to begin is with straightforward definitions [1]:
- Wide-form data has one row for each possible value of your independent variable, with all dependent variables recorded in the column labels. As a result, the label in each row (for the independent variable) will be unique.
- Long-form data has one row for each observation, and each dependent variable is recorded as a new value over multiple rows. Hence, values for the independent variable repeat within the rows.
Okay, cool – but what does that mean? It’ll be easier to understand if we look at an example. Say we have a data set of students, and we are storing their scores on a midterm exam, final exam, and class project. In wide form, our data would look like this:
Here, Student is the independent variable, and each score is a dependent variable (because the score for a particular exam or project depends on the student). We can see that the value of Student is unique for each row, as we would expect for wide-form data.
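As a rough sketch of what such a wide-form table could look like in Pandas, the snippet below builds a small DataFrame by hand. The student names and scores are invented purely for illustration and are not the article's original data; only the column names (Student, Midterm, Final, Project) follow the example described above.

```python
import pandas as pd

# Hypothetical students and scores, made up for illustration only.
student_data = pd.DataFrame({
    "Student": ["Ana", "Ben", "Cal"],
    "Midterm": [88, 75, 92],
    "Final": [91, 80, 85],
    "Project": [95, 70, 89],
})

# In wide form, each student appears in exactly one row,
# so the Student column contains no repeated values.
print(student_data)
print(student_data["Student"].is_unique)
```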
Now let’s look at the exact same data, but in long form:
This time around, we have a row for each observation. In this case, an observation corresponds to a score on a particular assignment. In the wide-form version of this data above, we recorded multiple observations (scores) in a row, whereas here every row has its own score.
Additionally, we can see that the values for our independent variable Student repeat in this data format, which is again what we expected.
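To make the long-form layout concrete, here is a minimal hand-built sketch. As before, the names and scores are hypothetical placeholders rather than the article's original data; the column names Assignment and Score are the ones used later in this article.

```python
import pandas as pd

# Hypothetical long-form data: one row per (student, assignment) score.
student_data_long = pd.DataFrame({
    "Student": ["Ana", "Ben", "Cal"] * 3,
    "Assignment": ["Midterm"] * 3 + ["Final"] * 3 + ["Project"] * 3,
    "Score": [88, 75, 92, 91, 80, 85, 95, 70, 89],
})

# Each student now appears once per assignment, so Student values repeat.
print(student_data_long)
print(student_data_long["Student"].value_counts())
```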
In a moment, we’ll talk about why you should even care about these different formats. But first, let’s take a quick look at how we can use Pandas to convert between these different data formats.
Wide-Form Data to Long-Form Data: The Melt Function
Once again, let’s take a look at the wide-form data from above. This time, we’ll give the DataFrame a name: student_data.
To convert student_data into long form, we use the following line of code:
student_data.melt('Student', var_name='Assignment', value_name='Score')
Here’s a step-by-step explanation:
- The melt function is designed to convert wide-form data into long-form data [2].
- The var_name parameter specifies what we want to name the second column – the one which will contain our respective dependent variables.
- The value_name parameter specifies what we want to name the third column – the one containing the individual values we are observing (in this case, the scores).
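The call above can be tried end to end with a small sketch like the following. The data is hypothetical and stands in for the article's table. The first positional argument in the original call corresponds to melt's id_vars parameter, which is spelled out by keyword here for readability.

```python
import pandas as pd

# Hypothetical wide-form data, made up for illustration only.
student_data = pd.DataFrame({
    "Student": ["Ana", "Ben", "Cal"],
    "Midterm": [88, 75, 92],
    "Final": [91, 80, 85],
    "Project": [95, 70, 89],
})

# id_vars: the column to keep as an identifier (the independent variable).
# var_name: name for the new column holding the old column labels.
# value_name: name for the new column holding the observed values.
student_data_long = student_data.melt(
    id_vars="Student",
    var_name="Assignment",
    value_name="Score",
)

print(student_data_long)
```

Every column not listed in id_vars is unpivoted, so three students with three score columns yield nine rows.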
Okay, so we have our long-form data now. But what if – for whatever reason – we need to go back to a wide format?
Long-Form Data to Wide-Form Data: The Pivot Function
Now, we’re starting with the long-form version of our data above, called student_data_long. The following line of code will convert it back to our original format:
student_data_long.pivot(index='Student', columns='Assignment', values='Score')
Except for the slightly updated labels (pivot shows the overall column label 'Assignment'), this is precisely the data we started with above.
Here’s a step-by-step explanation:
- The pivot function is designed to convert long-form data into wide-form data [3], but can actually accomplish much more than what’s shown here [4].
- The index parameter specifies which column’s values we want to make our unique rows (i.e. the independent variable).
- The columns parameter specifies which column’s unique values (in long form) will become the unique column labels.
- The values parameter specifies which column’s values will make up the actual data entries in our wide format.
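Here is a minimal sketch of the round trip, again on hypothetical data. It also shows one possible way to tidy up the labels that pivot introduces: resetting the index so Student becomes an ordinary column again, and clearing the 'Assignment' name on the column axis. That cleanup step is an optional addition, not part of the article's original one-liner.

```python
import pandas as pd

# Hypothetical long-form data: one row per (student, assignment) score.
student_data_long = pd.DataFrame({
    "Student": ["Ana", "Ben", "Cal"] * 3,
    "Assignment": ["Midterm"] * 3 + ["Final"] * 3 + ["Project"] * 3,
    "Score": [88, 75, 92, 91, 80, 85, 95, 70, 89],
})

# Pivot back to wide form: one row per student, one column per assignment.
wide = student_data_long.pivot(index="Student", columns="Assignment", values="Score")

# pivot makes Student the index and names the column axis "Assignment".
# Resetting the index and clearing that axis name gives a plain table again.
student_data_wide = wide.reset_index().rename_axis(None, axis=1)

print(student_data_wide)
```

Note that the assignment columns come back in sorted order (Final, Midterm, Project) rather than the order they had before melting. Also, pivot raises an error if any (index, columns) pair appears more than once, so it assumes each student has exactly one score per assignment.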
And that’s all there is to it!
Why Does it Matter?
Finally, I want to briefly emphasize that while the above might seem superficial at first glance, it’s actually a very useful skill to have. Many times, you’ll find that having your data in a certain format will make your life much, much easier.
I’ll illustrate with an example from my own work. I often need to make data visualizations in Python, and my module of choice is Altair. This led to an unanticipated issue: most spreadsheets tend to be in wide format, but Altair’s specifications are significantly easier to use in long format.
I struggled for quite a while to develop one particular visualization earlier this year. Upon resigning myself to Stack Overflow, I discovered that all I needed to do was convert my data into a long format. If you’re skeptical, feel free to check out the post yourself.
Now, you may not work in visualization, but if you’re reading this, it’s safe to assume that you do work with data. Therefore, you should know how to manipulate it, and this is just one more useful skill to keep in your toolbox.
Best of luck on your data science endeavors.
References
[1] https://altair-viz.github.io/user_guide/data.ht
[2] https://pandas.pydata.org/docs/reference/api/pandas.melt.html
[3] https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html
[4] https://medium.com/towards-data-science/a-comparison-of-groupby-and-pivot-in-pythons-pandas-module-527909e78d6b