3 Easy Tips to Optimize Pandas DataFrames

Streamline your data workflow with proven techniques for optimizing Pandas DataFrames


Gain space with the right data management techniques – Image by Author

0 Introduction & Base Case

Imagine you’re working with a massive dataset using Pandas, excited to unlock valuable insights and make data-driven decisions. However, as you dive into the analysis, you encounter a roadblock – memory usage. The larger the dataset grows, the slower your data transformation operations become, hampering your progress and leaving you scratching your head.

By applying the easy best practices presented in this article you can optimize memory consumption and improve the performance of your data transformations. So, let’s dive in and discover how to efficiently manage memory usage in Pandas, allowing you to work seamlessly with large datasets and achieve faster data processing.

Lightbulb moment – Image by Undraw

Let’s dive straight into it!

First, to illustrate the point, let’s build a fictitious DataFrame containing data representing 1,000,000 made-up football players:

import pandas as pd
import numpy as np

def create_df(n):
    df = pd.DataFrame()
    df['position'] = np.random.choice(['GK', 'DEF', 'MID', 'ST'], size=n)
    df['height'] = np.round(np.random.normal(loc=180, scale=5, size=n))
    df['is_captain'] = np.random.choice(['YES', 'NO'], size=n)
    df['scoring_probability'] = np.random.uniform(low=0, high=1, size=n)
    df['minutes_played_last_season'] = np.random.randint(low=0, high=3000, size=n)
    return df

df = create_df(1000000)
df.head()

This simplified dataset we created contains 5 columns:

  • position: a categorical variable with 4 possible values
  • height: a normally distributed variable around 1.80m
  • is_captain: a simple ‘YES’ or ‘NO’ attribute
  • scoring_probability: a uniformly distributed variable between 0 and 1
  • minutes_played_last_season: a random integer between 0 and 3000

This is what it looks like:

And how much memory it uses:

df.info()
Memory usage of the original DataFrame – Image by Author
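For a more granular view, Pandas can also break memory usage down per column; passing deep=True accounts for the actual size of the Python strings stored in object columns, which the default df.info() report may understate. A quick sketch:

# Per-column memory usage in bytes; deep=True measures the real size
# of the Python string objects held by the object columns
print(df.memory_usage(deep=True))
print(f"Total: {df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")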

Let’s see how the best practices shown below improve the situation.


1 First Tip: Only Keep the Relevant Columns

It does sound obvious, but when memory usage and computational time are at stake, exploring the dataset you’re working with and dropping the columns which aren’t relevant is often a great start.

In this simplified case, let’s say we are not interested in the minutes_played_last_season column and only keep the position, height, is_captain and scoring_probability columns.

# Only keep the relevant columns
df = df[['position', 'height', 'is_captain', 'scoring_probability']]

Even though the column we dropped was just a column of integers, this operation alone saved 7.6MB of memory.
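A related trick: when the data comes from a file, you can avoid loading the irrelevant columns in the first place. A minimal sketch, assuming a hypothetical players.csv file with the same columns:

# Only read the columns we actually need (players.csv is a hypothetical file)
df = pd.read_csv(
    'players.csv',
    usecols=['position', 'height', 'is_captain', 'scoring_probability'],
)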

Let’s keep going!


2 Second Tip: Use the Correct Data Types

When importing a new dataset, it is not uncommon for many columns to be read in as the object dtype because they contain strings or a mix of numeric and string data, even though Pandas automatically detects the correct data types for the other columns.

Let’s take a closer look at the object columns. If they contain arbitrary strings, fair enough, there’s nothing to do. However, some string columns can be recast to more efficient types.

Object to Category

The category type is a data type that is used to represent columns with a limited set of unique values, such as categorical data.

The "position" column clearly fits the description and can be converted to a category column.

Object to Boolean

The boolean type is used to represent columns whose values can only be True or False.

The "is_captain" column is rightfully an object type as it consists only of "YES" and "NO" strings, however it might has well be converted to booleans (True or False) which are way more memory efficient.

And this is the result:

# Object to Category and Object to Boolean
df['position'] = df['position'].astype('category')
df['is_captain'] = df['is_captain'].map({'YES':True, 'NO':False})
df.info()
Recasting the object types to more memory efficient types – Image by Author
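One thing worth checking after the map call: any value missing from the mapping dictionary silently becomes NaN, so it is worth confirming that nothing was lost and that the dtypes are now what we expect. A quick sanity check:

# map() turns unmapped values into NaN, so make sure nothing was lost
assert df['is_captain'].isna().sum() == 0

# position should now be category and is_captain should be bool
print(df.dtypes)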

With these two type conversions we gained 13.3MB, which is almost half of the initial relevant dataset!


3 Third Tip: Downcast the Data Types

The last thing left to do to optimize our dataset to the max is to downcast some variables. Downcasting consists of changing the data type of a column to the smallest possible type that can accurately represent the data.

Integers

For integers, the smallest type you can use really depends on the min and max values of your column. The table below summarizes everything:

Integers memory usage and values – Image by Author
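If you don’t want to memorize these ranges, NumPy can report the bounds of each integer type directly:

# Bounds of the candidate integer types
for dtype in (np.int8, np.uint8, np.int16, np.int32, np.int64):
    info = np.iinfo(dtype)
    print(f"{info.dtype}: {info.min} to {info.max}")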

In my case, the height column is supposed to contain positive integers no greater than 255 (cm), so I can downcast it from float64 to uint8 and save no less than 7 bytes per value. Over 1 million rows, that’s 7MB.
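If you’d rather not pick the target type by hand, pd.to_numeric can choose the smallest one for you via its downcast argument; since the values here are whole, non-negative numbers below 255, this should come out as uint8 as well:

# Let pandas pick the smallest unsigned integer type that fits
height_downcast = pd.to_numeric(df['height'], downcast='unsigned')
print(height_downcast.dtype)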

Floats

For floats, the smallest type you can use depends not only on the min and max values of your column but also on the precision required – the decimal digits of the number.

Floats memory usage, values and precision – Image by Author
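As with integers, NumPy exposes the characteristics of each float type, including its approximate decimal precision:

# Approximate decimal precision and range of the float types
for dtype in (np.float16, np.float32, np.float64):
    info = np.finfo(dtype)
    print(f"{info.dtype}: ~{info.precision} significant decimal digits, max ~ {info.max:.3g}")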

In my case, I get the required precision with float32 in the scoring_probability column, so let’s downcast it from float64.
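To make sure float32 really is precise enough, one option is to compare the downcast values against the originals before committing to the conversion:

# Largest error introduced by the float64 -> float32 round trip;
# for a probability, an error of this order is negligible
diff = (df['scoring_probability'] - df['scoring_probability'].astype('float32')).abs().max()
print(diff)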

And this is the result:

# Float64 to Uint8 and Float64 to Float32
df['height'] = df['height'].astype('uint8')
df['scoring_probability'] = df['scoring_probability'].astype('float32')
df.info()
Recasting the float64 types to more memory efficient types – Image by Author

4 Speed test

On top of the gains in memory usage, the speed of the computations done on the DataFrame is drastically increased.

Let’s test it out. We create a function which applies a series of data transformations to our DataFrame, and time it on the initial DataFrame (df_base_case) and on the optimized version we ended up with (df_optimized).
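One way to set up the two DataFrames for the comparison is to rebuild the raw data and reapply the three tips to a copy:

# Rebuild the raw DataFrame, then derive the optimized copy from it
df_base_case = create_df(1000000)

df_optimized = df_base_case[['position', 'height', 'is_captain', 'scoring_probability']].copy()
df_optimized['position'] = df_optimized['position'].astype('category')
df_optimized['is_captain'] = df_optimized['is_captain'].map({'YES': True, 'NO': False})
df_optimized['height'] = df_optimized['height'].astype('uint8')
df_optimized['scoring_probability'] = df_optimized['scoring_probability'].astype('float32')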

# Function that applies a few typical operations to a DataFrame
def my_func(df):
    # Aggregate, sort and filter; numeric_only avoids errors on the string columns
    df.groupby('position').mean(numeric_only=True)
    df.sort_values('height', ascending=False)
    filtered_df = df[(df['position'] == 'DEF') & (df['scoring_probability'] > 0.5)]
    return filtered_df

# Time these operations on the base case and on the optimized DataFrame 
%timeit -r 7 -n 10 my_func(df_base_case)  
%timeit -r 7 -n 10 my_func(df_optimized)
The computations are 3 times as fast on the optimized DataFrame – Image by Author

The results are pretty spectacular. These specific operations ran three times faster on the optimized DataFrame than on the original one!


5 Conclusion

By implementing the following best practices in your data workflows, you can make staggering differences both in terms of memory usage and computational speed:

  1. Only keep the relevant columns
  2. Use the correct data types
  3. Downcast the data types

But don’t just take my word for it: give these tips a try in your next data project!


Thanks for reading all the way to the end of the article. Follow for more! Feel free to leave a message below, or reach out to me through LinkedIn if you have any questions / remarks!
