The world’s leading publication for data science, AI, and ML professionals.

Structure Your Data Like a Nobel Prize Winner

To facilitate manipulation, visualization, and modeling

Photo by Samet Kurtkus on Unsplash
Photo by Samet Kurtkus on Unsplash

Do you think that there is such a thing as "the best way" to structure your data?

I’m particularly interested to have your opinion in the comments. Because on my side, I initially thought that there must be many ways to structure a dataset, and the way we store and manipulate our data is specific to the project we are working on.

But in fact, there is a method–a framework–that we can almost systematically use to make data cleansing as simple and efficient as possible.

The first time I heard about this method was in a lecture from Esther Duflo (Nobel Prize in Economic Sciences 2019) given online at MIT in the course "Data Analysis for Social Scientists" where she promotes "tidy data" and a framework called "Tidyverse".

This structure is excellent to facilitate data manipulation, visualization, and modeling. We should probably add it to our best-practice list for our next projects. Don’t you think?

In this blog post, I’d like to give you an overview of:

  • the concept of tidy data,
  • how messy a dataset can be, and
  • the python functions that we can use to transform messy data into tidy.

Let’s dig in.


So, what is tidy data?

The concept of tidy data was first introduced by Wickham Hadley in his paper "Tidy Data"¹ and his framework presented in "Welcome to the Tidyverse"².

Let’s have a look at the definition.

To be considered as "tidy", the data should be arranged in a table that follows the following three rules:

  1. Each variable in its own column
  2. Each observation in its own row
  3. Each value in its own cell

And that’s it. Only three rules. But that’s actually nice because a small set of rules leads to a small set of tools that we would need to rearrange and clean messy datasets.

For illustration, I will define two simple tidy datasets. A first observational unit that looks at the min and max (of some imaginary measures) for two consecutive years (on the left in the picture below). And a second observational unit that looks at the size in km² of different countries (on the right in the picture below).

Those two tables are tidy: each variable has its own column, each observation has its own row, and each value has its own cell.

Two examples of tidy observational units
Two examples of tidy observational units

The structure is quite simple, but you will see in the next part that such an arrangement is not always obvious when looking at disorganized data.


How messy can it be?

Messy datasets can take several forms, but thankfully, Wickham Hadley filtered the noise for us and divided "messy data" into 5 common categories.

  • When column headers have values instead of variables. This would happen, for instance, if we set the years (1992, 1993) in the column headers.
Column headers are values instead of variables
Column headers are values instead of variables
  • When one column contains more than one variable. In our example, this would happen if we append the min and max in one column.
One column contains more than one variable
One column contains more than one variable
  • When variables are stored in columns and rows simultaneously. This example is trickier to spot. That would typically happen if you spot a column that stores variable names (Min, Max in our example) instead of values.
Variables are stored in columns and rows simultaneously
Variables are stored in columns and rows simultaneously
  • When multiple types of observational units are stored in the same table. A lot of undefined cells are great evidence of such a situation. In the previous part, we defined two tables (observational units) that report separate studies. If we merge them together, we would end up with a lot of undefined "N/A" values.
Multiple types of observational units are stored in the same table
Multiple types of observational units are stored in the same table
  • When a single observational unit is stored in multiple tables. In our previous example, the "Min" and "Max" variables were measured in the same study, and we can agree that it’s more convenient to work with a single table.
A single observational unit is stored in multiple tables
A single observational unit is stored in multiple tables

Now that we saw how messy a dataset can be, it’s time to have a look at the Python functions that we can use to tidy them up!


Transform messy data into tidy data

Let’s have a look at the common Python functions that we can use to tidy the five types of messy datasets described in the previous part.

The good news is that tidy data is quite an intuitive way of working with Pandas DataFrame. As an example, Rodrigo Marino analyzed all scenarios above and demonstrated how to tidy each of them in this article. From his study, I came up with the following functions that I found to be the most relevant:

  • pivot_table: create a spreadsheet-style pivot table
  • melt: "unpivot" a DataFrame from wide to long format
  • assign: compute and append one or more new columns
  • factorize: encode a list as a categorical variable
  • concat: concatenate two pandas object together
  • sort_values: sort by the values along column or row

To understand the mechanism of each function, I would like to share with you the code of my Jupyter notebook that practices them with simple examples:


Final Words

We saw together the concept of tidy data, how messy it can become, and what tools we can use to tidy them up.

The first main advantage of such a simple framework is that we can tidy messy datasets with a small number of functions. And the second main advantage in my opinion is that a tidy dataframe works perfectly and very intuitively in Python with the popular Pandas library.

Data wrangling is a big field and "tidy data" only covers part of it, but it would definitely give you a good start!


Reference

[1] Wickham, Hadley. "Tidy Data." Journal of statistical software 59.1 (2014): 1–23.

[2] Wickham, Hadley, et al. "Welcome to the Tidyverse." Journal of open source software 4.43 (2019): 1686.


Related Articles