What’s Tidy Data?

How to organize messy datasets in Python with the melt and pivot_table functions

Rodrigo Marino
Towards Data Science


Image source: R for Data Science (Hadley Wickham & Garrett Grolemund)

Data scientists spend about 80% of their time cleaning and organizing data. Tidy Data is a way of structuring datasets to facilitate analysis.

In 2014, Hadley Wickham published an awesome paper named Tidy Data, which describes the process of tidying a dataset in R. My goal with this article is to summarize those steps and show the code in Python.

In tidy data:

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each type of observational unit forms a table.

Messy data is any other arrangement of the data.
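To make these rules concrete, here is a minimal sketch (with made-up country names and numbers) of the same data in a messy wide layout and in its tidy long layout:

import pandas as pd

# Messy: the column headers "2019" and "2020" are values of a year variable
messy = pd.DataFrame({"country": ["A", "B"],
                      "2019": [10, 30],
                      "2020": [20, 40]})

# Tidy: each variable (country, year, cases) has its own column
tidy = messy.melt(id_vars=["country"], var_name="year", value_name="cases")
print(tidy)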

Messy Data

There are five examples of messy data that we will explore here:

  • Column headers are values, not variable names.
  • Multiple variables are stored in one column.
  • Variables are stored in both rows and columns.
  • Multiple types of observational units are stored in the same table.
  • A single observational unit is stored in multiple tables.

Of course, there are more types of messiness that are not shown above, but they can be tidied in a similar way.

Column headers are values, not variable names.

For this example, we will use the dataset “relinc.csv”, which explores the relationship between income and religion. Note that, despite being messy, this arrangement can be useful in some cases, so we will learn how to tidy and untidy it.

import pandas as pd

df_relinc = pd.read_csv("relinc.csv")
df_relinc.head()

There are three variables in this dataset: religion, income, and frequency. The column headers are values of the income variable, not variable names, so we need to turn those columns into rows. We will use pandas’ melt function.

# Applying melt (to a long format)
df_relinc = df_relinc.melt(id_vars=["religion"], var_name="income", value_name="frequency")
df_relinc.head()

The output above is our tidy version of the dataset.

To return the dataset to a wide format, we will use pandas’ pivot_table function.

# Applying pivot_table (to a wide format)
df_relinc = (df_relinc.pivot_table(index="religion", columns="income", values="frequency")
             .reset_index()
             .rename_axis(None, axis=1))
df_relinc.head()

Multiple variables stored in one column.

Now we will explore the tuberculosis dataset from the World Health Organization. The records show the count of tuberculosis cases by country, year, and demographic group.

The demographic groups are broken down by sex (m, f) and age (0–14, 15–24, 25–34, 35–44, 45–54, 55–64, 65+, unknown).

df_tb=pd.read_csv('tb.csv')
df_tb.columns
df_tb.tail()

The columns contain sex and age values. Note the mixture of 0s and missing values (NaN). This is due to the data collection process and the distinction is important for this dataset.
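If you want to see this distinction yourself, a quick check like the sketch below counts explicit zeros and missing values separately for each column:

# Zeros mean "no cases reported"; NaN means "no data collected"
print((df_tb == 0).sum())
print(df_tb.isna().sum())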

First, we will gather up the non-variable columns by moving the sex and age range into a single column. To do so, we will use melt.

# Applying melt (to a long format)
df_tb = df_tb.melt(id_vars=["iso2", "year"], var_name="demographic", value_name="cases")
df_tb.sample(5)

Now we need to split the demographic column into two columns for the variables sex and age.

# Creating new columns for sex and age
df_tb = (df_tb.assign(sex=lambda x: x.demographic.str[0].astype(str),
                      age=lambda x: x.demographic.str[1:].astype(str))
         .drop("demographic", axis=1))
df_tb.sample(5)
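As a side note, the same split can be written with a regular expression via pandas’ str.extract, where named groups become the new column names. Here is a sketch on made-up WHO-style codes:

import pandas as pd

# Toy codes: one letter for sex, the rest for the age band
demo = pd.DataFrame({"demographic": ["m014", "f1524", "m65", "fu"]})
parts = demo["demographic"].str.extract(r"^(?P<sex>[mf])(?P<age>.+)$")
print(parts)  # columns: sex, age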

Now each observation has its own row and each variable has its own column. We just tidied our dataset! Before going ahead, let’s clean the data.

# Styling the dataset
df_tb.update(pd.DataFrame({"age": [age[:2] + '-' + age[2:] if len(age) == 4 else age
                                   for age in df_tb["age"]]}))
df_tb = (df_tb.replace(to_replace=["m", "f", "014", "65", "u"],
                       value=["Male", "Female", "0-14", "65+", "unknown"])
         .dropna())
df_tb.sample(10)

Variables are stored in both rows and columns.

We will use data from the Global Historical Climatology Network for one weather station (MX17004) in Mexico, covering five months of daily records in 2010.

It has variables in individual columns (id, year, month), variables spread across columns (the days, d1–d31), and variables spread across rows (tmin and tmax, the minimum and maximum temperatures). Months with fewer than 31 days have structural missing values for the last day(s) of the month. The columns d9 to d31 have been omitted for better visualization.

import datetime

df_weather = pd.read_csv('weather-raw.csv')
df_weather.sample(5)

As seen above, the dataset is messy. Variables are stored in both rows (tmin, tmax) and columns (days). Let’s start by working on the d1, d2, d3, … columns.

We will apply melt to turn the day columns into rows, creating one record per day.

# Applying melt (to a long format)
df_weather = df_weather.melt(id_vars=["id", "year", "month", "element"],
                             var_name="day", value_name="temp")
df_weather.update(pd.DataFrame({"day":[day[1:] for day in df_weather["day"]]}))
df_weather.sample(5)

Now we will use the pivot_table function to create new columns for tmin and tmax, since they are variables.

# Applying pivot_table to create columns for tmin and tmax
df_weather = (df_weather.pivot_table(index=["year", "month", "day", "id"],
                                     columns="element", values="temp")
              .reset_index()
              .rename_axis(None, axis=1))
df_weather

The dataset looks better, but we can still improve it. Let’s create a date column and group by it.

# Creating a date column
df_weather = (df_weather.assign(date=lambda x: x.year.astype("str") + "/" +
                                x.month.astype("str").str.zfill(2) + "/" +
                                x.day.astype("str").str.zfill(2))
              .drop(["year", "month", "day"], axis=1))
df_weather['date'] = pd.to_datetime(df_weather['date'], format='%Y/%m/%d')

# Grouping by date
df_weather = (df_weather.filter(["date", "tmax", "tmin"])
              .groupby("date").aggregate("mean").reset_index())
df_weather

We finally tidied our dataset.
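As an aside, pandas can also assemble a date column directly from separate year, month, and day columns, which avoids the manual string formatting above. A sketch with made-up values:

import pandas as pd

# to_datetime accepts a frame with year/month/day columns
df = pd.DataFrame({"year": [2010, 2010], "month": [1, 2], "day": [30, 2]})
df["date"] = pd.to_datetime(df[["year", "month", "day"]])
print(df)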

Multiple types of observational units are stored in the same table.

The dataset shows the Billboard top hits for 2000, recording the date a song first entered the Billboard Top 100. It has variables for artist, track, date entered, date peaked, genre, time, rank, and week.

import pandas as pd
import re
import numpy as np
import datetime
df_bill = pd.read_csv('billboard.csv', header=0, encoding='unicode_escape')
df_bill.head()

The rank in each week after a song enters the Top 100 is recorded in 76 columns, x1st.week to x76th.week. If a song is in the Top 100 for fewer than 76 weeks, the remaining columns are filled with NaN.

df_bill.columns
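Since the week columns share a common naming pattern, a quick sketch like the one below can select them by regex instead of typing all 76 names:

# Picking out the weekly rank columns by their shared "week" suffix
week_cols = df_bill.filter(regex="week").columns
print(len(week_cols))  # should be 76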

This dataset contains observations on two types of observational units: the song and its rank in each week. As a consequence, the artist and time are repeated for every week a song stays on the chart. Before breaking the Billboard dataset into two, we need to tidy it. Let’s start by gathering all the week columns.

# Applying melt (to a long format)
df_bill = df_bill.melt(id_vars=["year", "artist.inverted", "track", "genre",
                                "date.entered", "date.peaked", "time"],
                       var_name="week", value_name="rank")
# Week to number
df_bill.update(pd.DataFrame({"week": np.ravel([list(map(int, re.findall(r'\d+', i)))
                                               for i in df_bill["week"]])}))
df_bill.head()
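An equivalent, more vectorized way to do the week-to-number conversion (a sketch to run in place of the re.findall loop above) is pandas’ str.extract:

# Pull the digits out of labels like "x1st.week" in one vectorized call
df_bill["week"] = df_bill["week"].str.extract(r"(\d+)", expand=False).astype(int)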

It looks better! Now we have a column for the week variable. We can also combine the information from date.entered and week to create a new date column.

# Creating a date column from date.entered and week
df_bill['date.entered'] = pd.to_datetime(df_bill['date.entered'], format='%Y/%m/%d')
df_bill = (df_bill.assign(date=[df_bill['date.entered'][i] + datetime.timedelta(weeks=df_bill["week"][i] - 1)
                                for i in range(len(df_bill["week"]))])
.drop(['date.entered','date.peaked','week'], axis=1)
.sort_values('artist.inverted', ascending=True)
.reset_index(drop=True))
df_bill.head()

Now we will create an id from the track. Each song must have a unique id number. To do so, we will use pandas’ factorize function.

# Creating an id column from track
labels, unique = pd.factorize(list(df_bill["track"]))
df_bill["id"]=labels+1
df_bill = df_bill.reset_index(drop=True)
df_bill.head()
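For reference, here is what factorize does on a tiny made-up list: it returns integer codes in order of first appearance, plus the array of unique values:

import pandas as pd

codes, uniques = pd.factorize(["b", "a", "b", "c"])
print(codes)    # [0 1 0 2]
print(uniques)  # ['b' 'a' 'c']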

Finally, we will break our dataset into two datasets: the song dataset and the rank dataset.

# Creating a new dataframe for rank
df_rank = df_bill.filter(["id", "date", "rank"]).dropna()
df_rank=df_rank.sort_values(by=['id','date']).reset_index(drop=True)
# Creating a new dataframe for song
df_song = df_bill.filter(["id", "artist.inverted", "track", "time"])
df_song=df_song.drop_duplicates('id').reset_index(drop=True)
df_rank.head(10)
df_song.head()

We just tackled the problem of multiple types of observational units stored in the same table!

A single observational unit is stored in multiple tables.

This problem is usually easy to fix. We basically need to read the tables, add a new column that records the original file name, and finally combine all the tables into a single one.

import pandas as pd

df_baby14 = pd.read_csv("2014-baby-names-illinois.csv")
df_baby15 = pd.read_csv("2015-baby-names-illinois.csv")
df_baby14.head()
df_baby15.head()

Let’s create a year column in each dataset according to its file name. Then we will apply pandas’ concat function to concatenate the data frames.

# Creating a column for the year
df_baby14["year"]="2014"
df_baby15["year"]="2015"
# Concatenating the datasets
df_baby = pd.concat([df_baby14, df_baby15]).sort_values(by=['rank'])
df_baby.set_index('rank', inplace=True)
df_baby.head()
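With more than two files, the same pattern generalizes: loop over the file names with glob, record the year from each name, and concatenate. A sketch (assuming the year is always the first token of the file name, and using a hypothetical df_baby_all to hold the result):

import glob
import pandas as pd

frames = []
for path in glob.glob("*-baby-names-illinois.csv"):
    df = pd.read_csv(path)
    df["year"] = path.split("-")[0]  # e.g. "2014" from "2014-baby-names-illinois.csv"
    frames.append(df)

df_baby_all = pd.concat(frames, ignore_index=True)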

Final Comments

The goal of this article was to explain the concept of Tidy Data by covering the five most common types of messy data, as well as how to organize and clean these datasets in Python.

If you find any mistakes, please don’t hesitate to contact me! I started exploring the data science world recently, and although I love it, I am still a beginner.

My code and the datasets are available on Github: https://github.com/rodrigomf5/Tidydata. Also, here is the Hadley Github with much more information about Tidy Data: https://github.com/hadley

