
Pandas – Save Memory with These Simple Tricks

How to use Pandas more efficiently in terms of memory usage

Photo by Daniel Cheung on Unsplash

Memory is not a big concern when dealing with small datasets. However, when it comes to large datasets, it becomes imperative to use memory efficiently. I will cover a few very simple tricks to reduce the size of a Pandas DataFrame, using a relatively large dataset of cryptocurrency market prices available on Kaggle. Let's start by reading the data into a Pandas DataFrame.

import pandas as pd
import numpy as np
df = pd.read_csv("crypto-markets.csv")
df.shape
(942297, 13)

The dataframe has almost 1 million rows and 13 columns. It includes historical prices of cryptocurrencies.

Let’s check the size of this dataframe:

df.memory_usage()
Index               80
slug           7538376
symbol         7538376
name           7538376
date           7538376
ranknow        7538376
open           7538376
high           7538376
low            7538376
close          7538376
volume         7538376
market         7538376
close_ratio    7538376
spread         7538376
dtype: int64

memory_usage() returns how much memory each column uses in bytes. We can check the memory usage of the complete dataframe in megabytes with a couple of math operations:

df.memory_usage().sum() / (1024**2) #converting to megabytes
93.45909881591797

So the total size is 93.46 MB.
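Note that for "object" columns, memory_usage() only counts the memory taken up by the references to the strings, not the strings themselves, so the figures above understate the real footprint of the text columns. Passing deep=True makes Pandas inspect the actual objects:

df.memory_usage(deep=True).sum() / (1024**2) # counts the string objects themselves, so the total comes out larger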

Let's check the data types, because in some cases we can represent the same information with more memory-friendly data types.

df.dtypes
slug            object
symbol          object
name            object
date            object
ranknow          int64
open           float64
high           float64
low            float64
close          float64
volume         float64
market         float64
close_ratio    float64
spread         float64
dtype: object

The first thing that comes to mind is the "object" data type. If we have categorical data, it is better to use the "category" data type instead of "object", especially when the number of categories is very low compared to the number of rows. This is the case for the "slug", "symbol" and "name" columns:

df.slug.value_counts().size
2071

There are 2,071 categories, which is very low compared to almost 1 million rows. Let's convert these columns to the "category" data type and see the reduction in memory usage:

df[['slug','symbol','name']] = df[['slug','symbol', 'name']].astype('category')
df[['slug','symbol','name']].memory_usage()
Index          80
slug      1983082 #previous: 7538376
symbol    1982554
name      1983082
dtype: int64

So the memory usage of each of these columns dropped by roughly 74%. Let's check how much we have saved in total:

df.memory_usage().sum() / (1024**2) #converting to megabytes
77.56477165222168

The total size was reduced from 93.46 MB to 77.56 MB.
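If a dataframe has many object columns, you may not want to check each one by hand. Below is a minimal sketch of how this step could be automated; the helper name and the 0.05 threshold are my own arbitrary choices, and I am not running it on df here, so the numbers below are unaffected.

def to_category_if_sparse(frame, threshold=0.05):
    # convert object columns to category when the ratio of unique values
    # to rows is below the (arbitrary) threshold
    for col in frame.select_dtypes(include=['object']).columns:
        if frame[col].nunique() / len(frame) < threshold:
            frame[col] = frame[col].astype('category')
    return frame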


The "ranknow" column shows the rank among different currency categories. Since there are 2072 categories, the maximum value should be 2072.

df.ranknow.max()
2072

The data type of the "ranknow" column is int64, but we can also represent the range from 1 to 2072 with int16, since int16 can hold values from -32768 to +32767.

df["ranknow"] = df["ranknow"].astype("int16")
df["ranknow"].memory_usage()
1884674 #previous: 7538376

So the memory usage dropped by 75% as expected, because we went down from int64 (8 bytes per value) to int16 (2 bytes per value).
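Instead of picking the integer width by hand, you can let Pandas choose the smallest type that fits with pd.to_numeric and its downcast argument; for this column it should also land on int16, since the values do not fit into int8:

df["ranknow"] = pd.to_numeric(df["ranknow"], downcast="integer")
df["ranknow"].dtype # should be int16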

The floating point numbers in the dataset are represented with "float64", but we can represent them with "float32", which gives us 6 digits of precision. I think 6 digits is enough unless you are making highly sensitive measurements.

  • float32 (equivalent C type: float): 6 digits of precision
  • float64 (equivalent C type: double): 15 digits of precision
floats = df.select_dtypes(include=['float64']).columns.tolist()
df[floats] = df[floats].astype('float32')
df[floats].memory_usage()
Index               80
open           3769188 #previous: 7538376
high           3769188
low            3769188
close          3769188
volume         3769188
market         3769188
close_ratio    3769188
spread         3769188
dtype: int64

The conversion from "float64" to "float32" reduces the memory usage of these columns by 50%, as expected.
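The same shortcut exists for floats: pd.to_numeric with downcast='float' picks the smallest float dtype that fits (float32 at the smallest), so you do not have to build the column list yourself. A minimal sketch:

for col in df.select_dtypes(include=['float']).columns:
    df[col] = pd.to_numeric(df[col], downcast='float')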


In some cases, the dataframe may have redundant columns. Let's take a look at the dataframe we have.
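Printing the first few rows is enough to spot columns that carry the same information:

df.head()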

The columns "slug", "symbol", "name" represent the same thing in different formats. It is enough to only have one of these three columns so I can drop two columns. The dataframe you have may not have columns like this but it is always a good practice to look for redundant or unnecessary columns. For example, the dataframe might include "count", "value" and "sum" columns. We can easily obtain sum by multiplying count and value so sum column is unnecessary. Some columns might be completely unrelated to the task you want to accomplish so just look for these columns. In my case, I will drop "symbol" and "name" columns and use "slug" column:

df.drop(['symbol','name'], axis=1, inplace=True)
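If you know up front which columns you need and which data types they should have, you can apply all of these optimizations while reading the file, so the oversized version never exists in memory. A minimal sketch using the choices made above (the column list and dtypes come from this dataset; adjust them for yours):

df = pd.read_csv(
    "crypto-markets.csv",
    usecols=['slug', 'date', 'ranknow', 'open', 'high', 'low', 'close',
             'volume', 'market', 'close_ratio', 'spread'],
    dtype={'slug': 'category', 'ranknow': 'int16',
           'open': 'float32', 'high': 'float32', 'low': 'float32',
           'close': 'float32', 'volume': 'float32', 'market': 'float32',
           'close_ratio': 'float32', 'spread': 'float32'})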

Let’s check the size of the final dataframe:

df.memory_usage().sum() / (1024*1024)
39.63435745239258

The total size was reduced from 93.46 MB to 39.63 MB, which I think is a great accomplishment. We were able to save 53.83 MB of memory.

Another advantage of reducing the size is that it simplifies and speeds up computations: calculations with float32 generally take less time than with float64.
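If you want to verify this on your own machine, a rough but simple check is to time the same aggregation on a float64 array and on its float32 copy (the array here is synthetic, purely for illustration):

import time
x64 = np.random.rand(10_000_000)  # float64 by default
x32 = x64.astype(np.float32)
start = time.perf_counter(); x64.sum(); print(time.perf_counter() - start)
start = time.perf_counter(); x32.sum(); print(time.perf_counter() - start)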

We should always look for ways to reduce the size when possible.


Thank you for reading. Please let me know if you have any feedback.

