
0 Introduction & Base Case
Imagine you’re working with a massive dataset in Pandas, eager to unlock valuable insights and make data-driven decisions. But as you dive into the analysis, you hit a roadblock: memory usage. The larger the dataset grows, the slower your data transformation operations become, hampering your progress and leaving you scratching your head.
By applying the simple best practices presented in this article, you can cut memory consumption and improve the performance of your data transformations. So let’s dive in and discover how to manage memory efficiently in Pandas, allowing you to work seamlessly with large datasets and achieve faster data processing.

Let’s dive straight into it!
First, to illustrate the point, let’s build a fictitious DataFrame containing data representing 1,000,000 made-up football players:
import pandas as pd
import numpy as np
def create_df(n):
    df = pd.DataFrame()
    df['position'] = np.random.choice(['GK', 'DEF', 'MID', 'ST'], size=n)
    df['height'] = np.round(np.random.normal(loc=180, scale=5, size=n))
    df['is_captain'] = np.random.choice(['YES', 'NO'], size=n)
    df['scoring_probability'] = np.random.uniform(low=0, high=1, size=n)
    df['minutes_played_last_season'] = np.random.randint(low=0, high=3000, size=n)
    return df
df = create_df(1000000)
df.head()
The simplified dataset we just created contains 5 columns:
- position: a categorical variable with 4 possible values
- height: a normally distributed variable centred around 180 cm
- is_captain: a simple ‘YES’ or ‘NO’ attribute
- scoring_probability: a uniformly distributed variable between 0 and 1
- minutes_played_last_season: a random integer between 0 and 3000
This is what it looks like:

And how much memory it uses:
df.info()

Let’s see how the best practices shown below improve the situation.
1 First Tip: Only Keep the Relevant Columns
It does sound obvious, but when memory usage and computational time are at stake, exploring the dataset you’re working with and dropping the columns which aren’t relevant is often a great start.
In this simplified case, let’s say we are not interested in the minutes_played_last_season column and only keep the position, height, is_captain and scoring_probability columns.
# Only keep the relevant columns
df = df[['position', 'height', 'is_captain', 'scoring_probability']]

Even though the column we dropped was just a column of integers, we still saved 7.6MB with this operation.
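To see where the memory actually goes before deciding what to drop, you can ask Pandas for each column’s footprint. A quick sketch (deep=True also counts the Python string objects held by object columns):
# Per-column memory footprint, in bytes
print(df.memory_usage(deep=True))
# Total footprint in megabytes
print(round(df.memory_usage(deep=True).sum() / 1024**2, 1), 'MB')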
Let’s keep going!
2 Second Tip: Use the Correct Data Types
When importing a new dataset, it is common for columns to be read with the object dtype when they contain strings or a mix of numeric and string data, even though Pandas sometimes detects the correct data type on its own.
Let’s take a closer look at the object columns. If they contain arbitrary strings, fair enough, there’s nothing to do. However, some of them can be recast as more efficient types.
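One way to spot the candidates is to list the object columns together with how many distinct values each one holds; columns with few unique values are the ones worth recasting:
# Inspect the object columns and how many unique values each one contains
for col in df.select_dtypes(include='object').columns:
    print(col, '->', df[col].nunique(), 'unique values')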
Object to Category
The category type is a data type used to represent columns with a limited set of unique values: instead of storing the full string on every row, Pandas stores a small integer code per row plus a single lookup table of the unique values.
The "position" column clearly fits the description and can be converted to a category column.
Object to Boolean
The boolean type is a data type used to represent columns that can only take the values True or False.
The "is_captain" column is rightfully an object type as it consists only of "YES" and "NO" strings; however, it might as well be converted to booleans (True or False), which are far more memory efficient since a boolean takes a single byte per row instead of a full Python string.
And this is the result:
# Object to Category and Object to Boolean
df['position'] = df['position'].astype('category')
df['is_captain'] = df['is_captain'].map({'YES':True, 'NO':False})
df.info()

With these two type conversions we saved another 13.3MB, which is almost half of the trimmed dataset!
3 Third Tip: Downcast the Data Types
The last thing left to do to optimize our dataset to the max is to downcast some variables. Downcasting consists of changing the data type of a column to the smallest possible type that can accurately represent the data.
Integers
For integers, the smallest type you can use depends on the min and max values of your column. These are the ranges each type can hold:
- int8: -128 to 127
- int16: -32,768 to 32,767
- int32: -2,147,483,648 to 2,147,483,647
- int64: -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807
- uint8: 0 to 255
- uint16: 0 to 65,535
- uint32: 0 to 4,294,967,295
- uint64: 0 to 18,446,744,073,709,551,615
In my case, the height column is supposed to contain positive integers no greater than 255 (cm), so I can downcast it from float64 to uint8 and save no less than 7 bytes per row! Over 1 million rows, that’s 7MB.
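If you don’t want to pick the target type by hand, Pandas can also choose it for you. A minimal sketch, assuming the rounded heights are whole, non-negative numbers (otherwise the values would not survive the cast to an unsigned integer unchanged):
# Let Pandas pick the smallest unsigned integer type that fits the values
df['height'] = pd.to_numeric(df['height'], downcast='unsigned')
print(df['height'].dtype)  # uint8, since all heights fall in the 0-255 range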
Floats
For floats, the smallest type you can use depends not only on the min and max values of your column but also on the precision required, i.e. how many significant decimal digits you need to keep. Roughly:
- float16: about 3 significant decimal digits
- float32: about 7 significant decimal digits
- float64: about 15-16 significant decimal digits
In my case, I get the required precision with float32 in the scoring_probability column, so let’s downcast it from float64.
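If you want to check that the lower precision is acceptable before committing, you can compare the column against its downcast version. A quick sketch:
# Largest absolute error introduced by the float64 -> float32 conversion
error = (df['scoring_probability'] - df['scoring_probability'].astype('float32')).abs().max()
print(error)  # a few 1e-8 at most for values between 0 and 1, negligible here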
And this is the result:
# Float64 to Uint8 and Float64 to Float32
df['height'] = df['height'].astype('uint8')
df['scoring_probability'] = df['scoring_probability'].astype('float32')
df.info()

4 Speed test
On top of the gains in memory usage, the speed of the computations done on the DataFrame is drastically increased.
Let’s test it out. We create a function that applies a series of data transformations to our DataFrame, then run it both on the initial DataFrame and on the optimized version we ended up with.
# Function that runs a bunch of typical operations on a DataFrame
def my_func(df):
    # numeric_only=True so the object columns of the base case don't raise in recent Pandas versions
    df.groupby('position').mean(numeric_only=True)
    df.sort_values('height', ascending=False)
    filtered_df = df[(df['position'] == 'DEF') & (df['scoring_probability'] > 0.5)]

# df_base_case is a fresh, unoptimized DataFrame; df_optimized is the one we converted above
df_base_case = create_df(1000000)
df_optimized = df

# Time these operations on the base case and on the optimized DataFrame
%timeit -r 7 -n 10 my_func(df_base_case)
%timeit -r 7 -n 10 my_func(df_optimized)

The results are pretty spectacular: these specific operations ran 3 times as fast on the optimized DataFrame as on the original one!
5 Conclusion
By implementing the following best practices in your data workflows, you can make a staggering difference both in terms of memory usage and computational speed:
- Only keep the relevant columns
- Use the correct data types
- Downcast the data types
But don’t just take my word for it: give these tips a try in your next data project!
Thanks for reading all the way to the end of the article. Follow for more! Feel free to leave a message below, or reach out to me through LinkedIn if you have any questions / remarks!