
Tables are a two-dimensional, widely used way of storing and presenting data, and this can be done in two formats: wide and long. Each has advantages and disadvantages, which I will do my best to describe in this post with some examples. So let us get started.
TL;DR: It is better to use the long format for storing data and the wide format at the very end of a data analysis process, to reduce the data's dimensionality. Going from one format to the other with Pandas is as easy as writing one line of code.
The wide format
You can think of the wide format the "Excel" way. Say you have salespersons and data about each sale during the last year, grouped per month. How do you represent this in a tabular way? Well, you write the salespersons’ names in the rows and the months’ names in the columns, then write the amount ($) each salesperson sold at each row/column intersection.
name jan feb mar apr ... dec
jim $100 $200 $200 $300 ... $300
mike $200 $250 $50 $100 ... $170
Since it is too wide (pun intended) I used an ellipsis (…) instead of writing the columns for the months between April and December. You may say: "Provided that you have one piece of data, the amount sold in $ grouped by salesperson and month, a tabular wide format deals with the whole thing easily". And you would be right, no problem so far. Let’s make this a bit harder. Assume you now have not only the sales amount ($) but also the number of items sold per salesperson. You build another table like this one.
name jan feb mar apr ... dec
jim 10 10 10 30 ... 30
mike 5 20 7 25 ... 5
Now you have two wide-format tables. If you were now to gather the number of hours worked per month per salesperson, you would have 3 tables. This is starting to feel repetitive, don’t you think?
I would have a hard time finding a way to store this data in a database that gives me peace of mind. Three tables to store data for the same people over the same months, where the only difference is the type of data stored? What if we now want to store 20 data points (features) per salesperson? I think you get the idea of the problem here. Let’s try to solve it.
Can this data be stored in a single table that can generate all these tables? Enter the long format.
But first, just to keep widening (pun intended again) the ugliness of this mess, let’s assume you try to solve it with a wide approach, as in "I add a column for the name, another column for the month and then one column per feature"; a solution like this one.
name month sales items_count hours_worked
jim jan bla bla bla
mike feb bla bla bla
...
Well… it may work, don’t you think? At least you have only one table. Yes, but if for some reason your requirements change and you need to add (or remove) features, you will have to physically modify the table. Another issue arises if a salesperson has only a subset (or one) of the features: you will have many empty cells. In an Excel workbook with a few columns and rows this is not a problem; however, if this data lives in, say, a Postgres database, you will have to run an ALTER TABLE command each time your data structure changes, and you also waste space on the empty cells. I don’t like this idea very much. How about you?
Now really, enter the long format.
The long format
Without much of a preamble let me show you the same data described above but using the long format.
name month kpi value
jim jan amount 100
jim feb amount 200
...
jim jan items 10
jim feb items 10
...
mike dec amount 170
...
mike dec items 5
Since it is too long (pun intended again) I again used an ellipsis to indicate that not all the information is shown in the table. Now we do not have one column per month or per feature; instead, a kpi column holds each feature’s name and a value column holds its value, alongside the name and month columns. The advantages of this approach are the following.
- You don’t waste space if some data is missing. If this happens then a row will not exist and that is all.
- You do not have to modify the table structure if for some reason a new feature is added, you just modify the data by adding rows with the new feature as needed.
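To make that second advantage concrete, here is a minimal sketch with made-up numbers (the column names follow the table above) where adding a brand-new feature is just appending rows:

```python
import pandas as pd

# Hypothetical long-format sales data: one row per (name, month, kpi).
long_sales = pd.DataFrame({
    'name':  ['jim', 'jim', 'jim', 'mike', 'mike'],
    'month': ['jan', 'feb', 'jan', 'jan', 'jan'],
    'kpi':   ['amount', 'amount', 'items', 'amount', 'items'],
    'value': [100, 200, 10, 200, 5],
})

# A new feature is just more rows -- no ALTER TABLE, no extra table.
extra_rows = pd.DataFrame({
    'name': ['jim'], 'month': ['jan'],
    'kpi': ['hours_worked'], 'value': [160],
})
long_sales = pd.concat([long_sales, extra_rows], ignore_index=True)
```

The table structure never changes: the same four columns absorb any number of features.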
Another long format example
Think about this problem. You are recording interactions on a website and there are three types: impression, click, and conversion. The task at hand is to generate a table that shows how many impressions, clicks, and conversions occurred each day.
Let us assume that when an interaction occurs you receive two pieces of information: the date and time of the interaction and its type. Knowing this the long format table would be as the one shown below.
type date
impression 2021-12-12 23:00
click 2021-12-12 23:30
For each interaction, the data is stored in a new row, and each row keeps a record of the event’s type and the time when it was recorded. Now, let’s deal with the reporting part. Some Python and Pandas magic will come in handy.
The code
Assume that the table was created, the data was recorded per day, and we can grab it using a function. In this case the function fakes the data; in the real world it would retrieve it from a database or some other data source.
First of all, we should import pandas and create a function to fake the data.
import pandas as pd

def build_dataframe(periods=100):
    """
    Builds a long-format dataframe with random event data.
    """
    import random
    from datetime import date, timedelta
    random.seed(10)
    # random event types, weighted so impressions are the most frequent
    data = {
        'type': random.choices(
            population=['impressions', 'clicks', 'conversions'],
            weights=[100, 10, 1],
            k=periods)
    }
    # build the list of dates in the test period
    test_date1, test_date2 = date(2020, 1, 1), date(2020, 2, 1)
    res_dates = [test_date1]
    # loop to get each date till the end date
    while test_date1 != test_date2:
        test_date1 += timedelta(days=1)
        res_dates.append(test_date1)
    # pick a random date for each event
    res = random.choices(res_dates, k=periods)
    df = pd.DataFrame(data)
    df['date'] = res
    return df

# retrieve the data
long_df = build_dataframe(periods=500)
I will not go into detail on what is going on here, since it is not very relevant. The retrieved data looks like this.

Each data point contains the event’s type and date, as expected. We have data to manipulate. Before moving forward, let us think about how the wide tabular format we aim to achieve should look.
We expect to have one row per date, and three columns: impressions, clicks, and conversions. The intersection of each row/column will contain the count of each event for each date. The tabular output should be something similar to this.
date impressions clicks conversions
2020-01-01 1000 20 1
...
2020-01-15 500 2 n/a
...
Note the n/a value in the conversions column, this means that there is no data in the long format source for that type/date combination.
The summary of the tasks ahead is as follows.
- Create a column to hold the event count
- Group the long format table so that there is one row per each possible date/type combination
- Use the Pandas pivot function to build the wide-format tabular output
- Use the Pandas melt function to reconstruct the long-format tabular input
The code that accomplishes all of the latter is the following.
# Step 1: add a count column to be able to summarize when grouping
long_df['count'] = 1
# Step 2: group by date and type and sum
grouped_long_df = long_df.groupby(['date', 'type']).sum().reset_index()
# Step 3: build wide format from long format
wide_df = grouped_long_df.pivot(index='date', columns='type', values='count').reset_index()
# Step 4: build long format from wide format
new_long_df = pd.melt(wide_df, id_vars=['date'], var_name='type', value_vars=['impressions', 'clicks', 'conversions'], value_name='count')
Wow! Just four lines of code to do the required transformations. Steps 1 and 2 are self-explanatory if you are familiar with Pandas, so let’s explain what happened in steps 3 and 4.
Step 3
The Pandas pivot function expects a long-format input where the combination of the index and columns arguments (date/type in our case) is unique per row. This is why we had to group the original long-format table first, so that there is exactly one row per date/type combination. Pandas creates a column for each distinct value found in the type column and places at each date/type intersection the value taken from the count column, as stated by the values parameter. Let’s take a look at the table contained in the wide_df variable.

Note that there is one row per date and 3 data columns: impressions, clicks, and conversions. When Pandas cannot find the corresponding value for a certain type on any given date a NaN value is inserted. This is great!
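To see that NaN behavior in isolation, here is a minimal, self-contained sketch with made-up counts; note that the long input has no ('2020-01-02', 'conversions') row.

```python
import pandas as pd

# Tiny long-format input, already grouped: one row per date/type.
grouped = pd.DataFrame({
    'date':  ['2020-01-01', '2020-01-01', '2020-01-02'],
    'type':  ['impressions', 'conversions', 'impressions'],
    'count': [10, 1, 7],
})

# One row per date, one column per type; missing combinations become NaN.
wide = grouped.pivot(index='date', columns='type', values='count').reset_index()
print(wide)
```

The conversions cell for 2020-01-02 comes out as NaN because that combination simply has no row in the source.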
With step 3 completed we could say that our job here is done. We generated a table containing the event count for every date retrieved. Well done!
You could place this one perhaps in a Streamlit dashboard?
Let us focus now on completing the last step.
Step 4
The Pandas melt function builds a long-format dataframe from a wide-format one. The function expects you to tell it 1) which column will be kept as the row identifier (id_vars), in our case the date column, 2) which columns will be stacked into a new column (value_vars), 3) the name of the new column that will hold the stacked columns’ names (var_name), and finally 4) the name of the column that will hold the stacked columns’ values (value_name).
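Before looking at the result, here is a minimal sketch of melt with made-up numbers, so the four parameters above can be seen in isolation:

```python
import pandas as pd

# Hypothetical wide table: one row per date, one column per event type.
wide = pd.DataFrame({
    'date':        ['2020-01-01', '2020-01-02'],
    'impressions': [10, 7],
    'clicks':      [2, 1],
})

# Stack the feature columns back into (type, count) pairs, one per row.
long_again = pd.melt(
    wide,
    id_vars=['date'],
    value_vars=['impressions', 'clicks'],
    var_name='type',
    value_name='count',
)
print(long_again)
```

Each date/column intersection of the wide table becomes its own row, so 2 dates times 2 feature columns yields 4 long-format rows.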
The result is stored in the new_long_df variable and it looks like this.

It looks quite similar to the dataframe created by the build_dataframe function! However, "quite similar" is not enough; we should dig a bit further by comparing the data contained for one date in both new_long_df and grouped_long_df.
What about the date 2020-01-01? Let’s check by querying both data frames.
First new_long_df.
new_long_df[new_long_df['date'] == datetime.date(2020, 1, 1)]

Now the original grouped_long_df.
grouped_long_df[grouped_long_df['date'] == datetime.date(2020, 1, 1)]

This is enough for me: both data frames agree that there were 1 click and 18 impressions that day, with no conversions!
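If you prefer a programmatic check over eyeballing, one possible sketch (again with made-up counts standing in for the real data) runs the pivot/melt round trip and merges the result back onto the grouped source:

```python
import pandas as pd

# Grouped long-format source: one row per date/type that actually occurred.
grouped = pd.DataFrame({
    'date':  ['2020-01-01', '2020-01-01'],
    'type':  ['clicks', 'impressions'],
    'count': [1, 18],
})

# Round trip: long -> wide -> long.
wide = grouped.pivot(index='date', columns='type', values='count').reset_index()
back = pd.melt(wide, id_vars=['date'], var_name='type', value_name='count')
back = back.dropna(subset=['count'])  # drop combinations absent from the source

# Align both frames on date/type and compare the counts side by side.
merged = grouped.merge(back, on=['date', 'type'], suffixes=('_orig', '_back'))
print((merged['count_orig'] == merged['count_back']).all())
```

If the round trip preserved the data, every count matches its original.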
Conclusions
This post attempted to describe what tabular long and wide formats are with some examples, and how to deal with them using Python and Pandas. Each format works best for certain tasks: the long format allows data to be stored more densely, while the wide format has more explanatory power if tabular formats are required in a report. It’s up to you to choose which format works best depending on what you expect to accomplish.
You can find a Google Colab notebook here.
If you liked the content please follow me and share!