Memory is not a big concern when working with small datasets. However, when it comes to large datasets, it becomes imperative to use memory efficiently. I will cover a few very simple tricks to reduce the size of a Pandas DataFrame. I will use a relatively large dataset of cryptocurrency market prices available on Kaggle. Let’s start by reading the data into a Pandas DataFrame.
import pandas as pd
import numpy as np
df = pd.read_csv("crypto-markets.csv")
df.shape
(942297, 13)
The dataframe has almost 1 million rows and 13 columns. It includes historical prices of cryptocurrencies.

Let’s check the size of this dataframe:
df.memory_usage()
Index 80
slug 7538376
symbol 7538376
name 7538376
date 7538376
ranknow 7538376
open 7538376
high 7538376
low 7538376
close 7538376
volume 7538376
market 7538376
close_ratio 7538376
spread 7538376
dtype: int64
memory_usage() returns how much memory each column uses in bytes. We can check the memory usage of the entire dataframe in megabytes with a simple conversion:
df.memory_usage().sum() / (1024**2) #converting to megabytes
93.45909881591797
So the total size is 93.46 MB.
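Note that, by default, memory_usage() counts only 8 bytes per value for the "object" columns (the size of the reference, not the strings themselves), which is why every column above reports the same number. If you want the true footprint including the string data, you can pass deep=True; it is slower but more accurate:
df.memory_usage(deep=True).sum() / (1024**2) #also counts the actual string objects, so it will be larger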
Let’s check the data types, because in some cases we can represent the same information with more memory-friendly data types:
df.dtypes
slug object
symbol object
name object
date object
ranknow int64
open float64
high float64
low float64
close float64
volume float64
market float64
close_ratio float64
spread float64
dtype: object
The first thing that should come to mind is the "object" data type. If we have categorical data, it is better to use the "category" data type instead of "object", especially when the number of categories is very low compared to the number of rows. This is the case for the "slug", "symbol" and "name" columns:
df.slug.value_counts().size
2071
There are 2071 categories, which is very low compared to almost 1 million rows. Let’s convert these columns to the "category" data type and see the reduction in memory usage:
df[['slug','symbol','name']] = df[['slug','symbol', 'name']].astype('category')
df[['slug','symbol','name']].memory_usage()
Index 80
slug 1983082 #previous: 7538376
symbol 1982554
name 1983082
dtype: int64
So the memory usage of each of these columns was reduced by about 74%. Let’s check how much we have saved in total:
df.memory_usage().sum() / (1024**2) #converting to megabytes
77.56477165222168
The total size was reduced from 93.46 MB to 77.56 MB.
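If there are many object columns, checking each one by hand becomes tedious. Below is a minimal sketch of a helper that converts any object column whose number of unique values is small relative to the number of rows; the 0.5 threshold is an arbitrary choice for illustration, not a rule:
def to_category(frame, max_unique_ratio=0.5):
    #convert object columns to "category" when the ratio of unique values to rows is below the threshold
    for col in frame.select_dtypes(include=['object']).columns:
        if frame[col].nunique() / len(frame) < max_unique_ratio:
            frame[col] = frame[col].astype('category')
    return frame
Keep in mind that datetime-like columns (such as "date" here) are usually better parsed with pd.to_datetime rather than converted to category.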
The "ranknow" column shows the rank among different currency categories. Since there are 2072 categories, the maximum value should be 2072.
df.ranknow.max()
2072
The data type of "ranknow" column is int64 but we can represent the range from 1 to 2072 using int16 as well. The range that can be represented with int16 is -32768 to +32767.
df["ranknow"] = df["ranknow"].astype("int16")
df["ranknow"].memory_usage()
1884674 #previous: 7538376
So the memory usage was reduced by 75%, as expected, since we went from int64 down to int16.
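If you prefer not to pick the integer size yourself, pandas can choose it for you. Here is a small sketch using pd.to_numeric with downcast="integer", which selects the smallest integer type that can hold the values (int16 in this case):
df["ranknow"] = pd.to_numeric(df["ranknow"], downcast="integer") #picks the smallest sufficient integer type
df["ranknow"].dtype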
The floating point numbers in the dataset are represented with "float64", but we can represent them with "float32", which gives us 6 digits of precision. I think 6 digits is enough unless you are making highly sensitive measurements:
- float32 (equivalent C type: float): 6 digits of precision
- float64 (equivalent C type: double): 15 digits of precision
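You can confirm these precision figures with NumPy’s finfo if you want to double-check before converting:
np.finfo(np.float32).precision #6
np.finfo(np.float64).precision #15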
floats = df.select_dtypes(include=['float64']).columns.tolist()
df[floats] = df[floats].astype('float32')
df[floats].memory_usage()
Index 80
open 3769188 #previous: 7538376
high 3769188
low 3769188
close 3769188
volume 3769188
market 3769188
close_ratio 3769188
spread 3769188
dtype: int64
The conversion to "float32" from "float64" reduces the memory usage for these columns by %50 as expected.

In some cases, the dataframe may have redundant columns. Let’s take a look at the dataframe we have:
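df.head() #the first few rows are enough to see how the columns relate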

The columns "slug", "symbol", "name" represent the same thing in different formats. It is enough to only have one of these three columns so I can drop two columns. The dataframe you have may not have columns like this but it is always a good practice to look for redundant or unnecessary columns. For example, the dataframe might include "count", "value" and "sum" columns. We can easily obtain sum by multiplying count and value so sum column is unnecessary. Some columns might be completely unrelated to the task you want to accomplish so just look for these columns. In my case, I will drop "symbol" and "name" columns and use "slug" column:
df.drop(['symbol','name'], axis=1, inplace=True)
Let’s check the size of the final dataframe:
df.memory_usage().sum() / (1024*1024)
39.63435745239258
The total size was reduced to 39.63 MB from 93.46 MB, which I think is a great accomplishment. We were able to save 53.83 MB of memory.
Another advantage of reducing the size is that it simplifies and speeds up computations: calculations with float32 typically take less time than with float64.
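If you want to verify this on your own machine, here is a rough sketch using the standard library’s timeit; the exact numbers will depend on your hardware and the operation, so treat it as a quick sanity check rather than a benchmark:
import timeit
col64 = df["close"].astype("float64")
col32 = df["close"].astype("float32")
print(timeit.timeit(lambda: col64.sum(), number=100)) #time for 100 sums on float64
print(timeit.timeit(lambda: col32.sum(), number=100)) #time for 100 sums on float32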
We should always look for ways to reduce the size when possible.
Thank you for reading. Please let me know if you have any feedback.