
Long and Wide Formats in Data, Explained

How to deal with them Pandas-style

Photo by Laura The Explaura from Pexels

Tables are a two-dimensional, widely used way of storing and preparing data, and they can be arranged in two ways: the wide format and the long format. Each one has advantages and disadvantages, which I will do my best to describe in this post with some examples. So let us get started.

TL;DR: It is better to use the long format for storing data and to switch to the wide format at the very end of a data analysis process, to reduce the data's dimensionality. Going from one format to the other using Pandas is as easy as writing one line of code.
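As a teaser, here is roughly what those two lines look like. This is a minimal sketch using the column names from the web-analytics example later in this post, and it assumes the long data already has one row per date/type combination.

# long -> wide: one row per date, one column per event type
wide_df = long_df.pivot(index='date', columns='type', values='count').reset_index()

# wide -> long: stack the event-type columns back into rows
long_df_again = wide_df.melt(id_vars=['date'], var_name='type', value_name='count')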

The wide format

You can think of the wide format the "Excel" way. Think about this: you have salespeople and the data about each sale during the last year, grouped by month. How do you represent this in a tabular way? Well, you write the salespeople's names in the rows and the months' names in the columns, then write the amount ($) each salesperson sold at each row/column intersection.

name    jan    feb    mar    apr  ... dec
jim     $100   $200   $200   $300     $300
mike    $200   $250   $50    $100     $170

Since the table is too wide (pun intended), I used an ellipsis (…) instead of writing out the columns for the months between April and December. You may say: "Provided that you have a single piece of data, the amount sold in $ grouped by salesperson and month, a wide-format table handles the whole thing easily." And you would be right; no problem so far. Let's make this a bit harder. Assume you now have not only the sales amount ($) but also the number of items sold per salesperson. You build another table like this one.

name    jan    feb    mar    apr  ... dec
jim     10     10     10     30       30
mike    5      20     7      25       5
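If you wanted to build these wide tables in Pandas, it would look something like this quick sketch, using the values from the tables above (the remaining months are omitted):

import pandas as pd

# wide format: one row per salesperson, one column per month
sales_wide = pd.DataFrame({
    'name': ['jim', 'mike'],
    'jan': [100, 200],
    'feb': [200, 250],
    'mar': [200, 50],
    'apr': [300, 100],
    # ...one more column for each remaining month
})

items_wide = pd.DataFrame({
    'name': ['jim', 'mike'],
    'jan': [10, 5],
    'feb': [10, 20],
    'mar': [10, 7],
    'apr': [30, 25],
    # ...and so on: a whole second table for the same people and months
})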

Now you have two wide-format tables. If you were to also gather the number of hours worked per month per salesperson, you would have three tables. This is starting to feel repetitive, don't you think?

I would have a hard time coming up with a way to store this data in a database that gives me peace of mind. Three tables to store data about the same people and the same months, where the only difference is the type of data stored? What if we now want to store 20 data points (features) per salesperson? I think you get the idea of the problem here. Let's try to solve it.

Can this data be stored in a single table that can generate all these tables? Enter the long format.

But first, just to keep widening (pun intended again) the ugliness of this mess… Let's assume you try to solve this using a wide approach, as in "I add a column for the name, another column for the month, and then one column per feature"; a solution like this one.

name    month    sales   items_count   hours_worked
jim     jan      bla     bla           bla
mike    feb      bla     bla           bla
...

Well… it may work, don't you think? At least you have only one table. Yes, but if for some reason your requirements change and you need to add (or remove) features, you will have to physically modify the table. Another issue arises if a salesperson has only a subset of the features (or just one): you will end up with many empty cells. In an Excel workbook with a few columns and rows this is not a problem; however, if the data lives in, say, a Postgres database, you will have to run an ALTER command every time your data structure changes, and also waste space because of the empty cells. I don't like this idea very much. How about you?
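To make the empty-cell problem concrete, here is a hedged sketch of what such a table could look like in Pandas when each salesperson reports only some of the features (the values are made up for illustration):

import pandas as pd
import numpy as np

# one column per feature: missing features become NaN cells that waste space
sparse_wide = pd.DataFrame({
    'name': ['jim', 'mike'],
    'month': ['jan', 'jan'],
    'sales': [100, 200],
    'items_count': [np.nan, 5],     # jim did not report item counts
    'hours_worked': [160, np.nan],  # mike did not report hours
})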

Now really, enter the long format.

The long format

Without much preamble, let me show you the same data described above, but in the long format.

name    month    kpi     value
jim     jan      amount  100
jim     feb      amount  200
...
jim     jan      items   10
jim     feb      items   10
...
mike    dec      amount  170
...
mike    dec      items   5

Since this table is too long (pun intended again), I again used an ellipsis to indicate that not all the information is shown. Now we do not have a column per month; instead we have a name column, a month column, a kpi column naming the feature, and a value column holding that feature's value. The advantages of this approach are the following.

  • You don't waste space if some data is missing: the corresponding row simply does not exist, and that is all.
  • You do not have to modify the table structure when a new feature is added; you just add rows with the new feature as needed, as sketched below.
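For instance, adding a brand-new feature such as hours worked requires no structural change at all; you only append rows. A sketch, assuming the long table from above lives in a dataframe called sales_long (a hypothetical name):

import pandas as pd

# hypothetical new feature: append rows instead of altering the table structure
new_rows = pd.DataFrame({
    'name': ['jim', 'mike'],
    'month': ['jan', 'jan'],
    'kpi': ['hours_worked', 'hours_worked'],
    'value': [160, 150],
})
sales_long = pd.concat([sales_long, new_rows], ignore_index=True)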

Another long format example

Think about this problem: you are recording interactions on a website, and there are three types: impression, click, and conversion. The task at hand is to generate a table that shows how many impressions, clicks, and conversions occurred every day.

Let us assume that when an interaction occurs you receive two pieces of information: the date and time of the interaction and its type. Knowing this, the long-format table would look like the one shown below.

type         date
impression   2021-12-12 23:00
click        2021-12-12 23:30

For each interaction, the data is stored in a new row, and each row keeps a record of the event's type and the time when it was recorded. Now, let's deal with the report part. To accomplish this, some Python and Pandas magic will come in handy.

The code

Assume that the table was created, the data was recorded day by day, and we can grab it using a function. In this case, the function fakes the data; in the real world it would retrieve it from a database or some other data source.

First of all, we should import Pandas and create a function to fake the data.

import pandas as pd
import random
from datetime import date, timedelta


def build_dataframe(periods=100):
    """
    Creates a long-format dataframe of random event types and dates.
    """
    random.seed(10)

    # pick random event types; impressions are far more frequent than clicks or conversions
    data = {
        'type': random.choices(
            population=['impressions', 'clicks', 'conversions'],
            weights=[100, 10, 1],
            k=periods,
        )
    }

    # create the list of dates covering the test period
    test_date1, test_date2 = date(2020, 1, 1), date(2020, 2, 1)
    res_dates = [test_date1]

    # loop to get each date till the end date
    while test_date1 != test_date2:
        test_date1 += timedelta(days=1)
        res_dates.append(test_date1)

    # pick a random date for each event
    res = random.choices(res_dates, k=periods)

    df = pd.DataFrame(data)
    df['date'] = res

    return df


# retrieve the data
long_df = build_dataframe(periods=500)

I will not go into detail on what is going on here, since it is not very relevant. The retrieved data looks like this.

The build_dataframe function retrieves 500 rows of data described by two columns. Image by the author.

Each data point contains the event's type and date, as expected. We have data to manipulate. Before moving forward, let us think about what the wide tabular format we aim to achieve should look like.

We expect to have one row per date, and three columns: impressions, clicks, and conversions. The intersection of each row/column will contain the count of each event for each date. The tabular output should be something similar to this.

date         impressions     clicks     conversions
2020-01-01   1000            20         1
...
2020-01-15   500             2          n/a
...

Note the n/a value in the conversions column: it means that there is no data in the long-format source for that type/date combination.

The summary of the tasks ahead is as follows.

  1. Create a column to hold the event count
  2. Group the long-format table so that there is one row for each date/type combination present in the data
  3. Use the Pandas pivot function to build the wide-format tabular output
  4. Use the Pandas melt function to reconstruct the long-format tabular input

The code that accomplishes all of the latter is the following.

# Step 1: add a count column to be able to summarize when grouping
long_df['count'] = 1
# Step 2: group by date and type, and sum the counts
grouped_long_df = long_df.groupby(['date', 'type']).sum().reset_index()
# Step 3: build the wide format from the long format
wide_df = grouped_long_df.pivot(index='date', columns='type', values='count').reset_index()
# Step 4: build the long format back from the wide format
new_long_df = pd.melt(wide_df, id_vars=['date'], var_name='type', value_vars=['impressions', 'clicks', 'conversions'], value_name='count')

Wow! Just four lines of code to do the required transformations. Steps 1 and 2 are self-explanatory if you are familiar with Pandas, so let's explain what happened in steps 3 and 4.

Step 3

The Pandas pivot function expects a long-format input in which each index/column (date/type) combination appears at most once. This is why we had to group the original long-format table first, so that we had one row per date/type combination. Pandas creates a column for each distinct value found in the type column and places at each date/type intersection the value from the count column, as stated by the values parameter. Let's take a look at the table contained in the wide_df variable.

The contents of wide_df. Image by the author.

Note that there is one row per date and three data columns: impressions, clicks, and conversions. When Pandas cannot find the corresponding value for a certain type on a given date, a NaN value is inserted. This is great!
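If you would rather show a zero than a NaN in the final report, one extra line takes care of it (the rest of this post keeps the NaN values):

# replace the missing date/type combinations with zero counts
wide_df = wide_df.fillna(0)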

With step 3 completed we could say that our job here is done. We generated a table containing the event count for every date retrieved. Well done!
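As a side note, if you prefer not to group beforehand, Pandas also offers pivot_table, which aggregates duplicate index/column combinations for you. A sketch that should be equivalent to steps 2 and 3 combined:

# pivot_table aggregates duplicates itself, so the ungrouped long_df works as input
wide_df = long_df.pivot_table(index='date', columns='type', values='count', aggfunc='sum').reset_index()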

Perhaps you could place this table in a Streamlit dashboard?

Let us focus now on completing the last step.

Step 4

The Pandas melt function builds a long-format dataframe from a wide-format one. The function expects you to tell it 1) which column will serve as the reference for the new rows (id_vars), in our case the date column, 2) which columns will be stacked into a new column (value_vars), 3) what that new column's name will be (var_name), and finally 4) the name of the column that will contain the stacked columns' values (value_name).
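A small detail worth knowing: when value_vars is omitted, melt stacks every column that is not listed in id_vars, so for this dataframe the step 4 call can be shortened. A sketch that should be equivalent here:

# with value_vars omitted, every column outside id_vars gets stacked
new_long_df = wide_df.melt(id_vars=['date'], var_name='type', value_name='count')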

The result is stored in the new_long_df variable and it looks like this.

new_long_df dataframe contents. Image by the author.

It looks quite similar to the dataframe created by the build_dataframe function! However, quite similar is not enough; we should dig a bit further by comparing the data contained for one date in both new_long_df and grouped_long_df.

What about the date 2020-01-01? Let's try it by querying both data frames.

First new_long_df.

import datetime

new_long_df[new_long_df['date'] == datetime.date(2020, 1, 1)]
The data contained in new_long_df for the date 2020-01-01. Image by the author.

Now the original grouped_long_df.

grouped_long_df[grouped_long_df['date'] == datetime.date(2020, 1, 1)]
grouped_long_df contents for the date 2020-01-01. Image by the author.

This is enough for me: both data frames agree on that day's numbers: 1 click, 18 impressions, and zero conversions!
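Instead of eyeballing single dates, we could also let Pandas check the whole round trip for us. A hedged sketch: drop the NaN rows that melt reintroduced for combinations that never occurred, align dtypes and row order, and compare against the grouped original:

import pandas as pd

# normalize both frames: same row order, same dtypes, fresh index
roundtrip = (
    new_long_df.dropna(subset=['count'])
    .astype({'count': 'int64'})
    .sort_values(['date', 'type'])
    .reset_index(drop=True)
)
original = grouped_long_df.sort_values(['date', 'type']).reset_index(drop=True)

# raises an AssertionError if anything differs
pd.testing.assert_frame_equal(roundtrip, original)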

Conclusions

This post attempted to describe what the tabular long and wide formats are, with some examples, and how to deal with them using Python and Pandas. Each format works best for certain tasks: the long format stores data more densely, while the wide format has more explanatory power when tabular output is required in a report. It's up to you to choose the format that works best for what you expect to accomplish.

You can find a Google Colab notebook here.


If you liked the content, please follow me and share!

