
Designing and building real-world applicable machine learning models has always been of great interest to data scientists. This has inevitably led them to leverage optimized, efficient, and accurate methods at scale.
Optimization, both on the level of run-time and memory, plays a foundational role in sustainably delivering real-world and user-facing software solutions.

In one of my earlier posts, *Five Killer Optimization Techniques Every Pandas User Should Know*, I presented some of the top run-time optimization techniques that you can use in your regular Data Science projects.
In this post, we will explore another area of optimization, and I will introduce you to a handful of incredible techniques to optimize the memory usage of your Pandas DataFrame.
These tips will assist you in efficiently conducting your typical tabular data analysis, management, and processing tasks in Pandas.
To get a brief overview, I will discuss the following topics in this post:
- #1 Make Inplace Modifications to DataFrame
- #2 Read only required columns from a CSV
- #3–#5 Alter the datatype of columns
- #6 Specify column datatype while reading a CSV
- #7 Read data in chunks from a CSV
- Conclusion
Let’s begin 🚀 !
#1 Make Inplace Modifications to DataFrame
Once we load a DataFrame into the Python environment, we typically perform a wide range of modifications on the DataFrame, don’t we? These include adding new columns, renaming headers, deleting columns, altering row values, replacing NaN values, and many more.
These operations can be typically performed in two ways, as depicted below:

Standard Assignment intends to create a new copy of the DataFrame after transformation, leaving the original DataFrame untouched.
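A minimal sketch of standard assignment is shown below; the toy data and the fillna() operation are purely illustrative stand-ins for whatever modification you perform:

```python
import numpy as np
import pandas as pd

# A small illustrative DataFrame; the original post used its own dummy data.
df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": ["x", "y", None]})

# Standard assignment: the operation returns a NEW DataFrame,
# which we bind to df_copy; the original df is left untouched.
df_copy = df.fillna(0)

# Both DataFrames now co-exist in memory.
print(df_copy is df)  # False
```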

As a result of the standard assignment, two distinct Pandas DataFrames (original and transformed) co-exist in the environment (df and df_copy above), doubling the memory utilization.
In contrast to the standard assignment operations, inplace assignment operations intend to modify the original DataFrame itself without creating a new Pandas DataFrame object. This is demonstrated below:
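Here is the corresponding sketch of the inplace route, again using fillna() only as an example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": ["x", "y", None]})

# Inplace assignment: the same transformation, applied to df directly.
# No second DataFrame object is created, so memory is not doubled.
df.fillna(0, inplace=True)
```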

Therefore, if the intermediate copy of the DataFrame (df_copy above) holds no utility in your project, taking the route of inplace assignment is the ideal way to proceed in memory-constrained applications.
You can read my detailed post on inplace assignment operations for a deeper dive.
Key Takeaways/Final Thoughts:
- Use standard assignment (or inplace=False) when the intermediate DataFrame is necessary and you don’t want to mutate the input.
- Use inplace assignment (or inplace=True) if you are working with memory constraints and have no particular use for the intermediate DataFrame.
#2 Read only required columns from a CSV

Imagine a scenario where you have hundreds of columns in your CSV file, of which only a subset of columns are of interest to you.
For instance, consider the first five rows of the dummy DataFrame I created with 25 columns and 10⁵ rows using Faker (filename: dummy_dataset.csv):

From these 25 columns, say only five are of utmost interest to you, and you want to load them as a Pandas DataFrame. These columns are Employee_ID, First_Name, Salary, Rating, and Company.
- Loading ALL columns:
Reading the entire CSV file into the Python environment forces Pandas to load and parse columns that are of no use and infer their data types, leading to increased run-time and memory usage.
We can find the memory usage of a Pandas DataFrame using the [info()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html) method as shown below:
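A sketch of that check, assuming the dummy_dataset.csv file introduced above; memory_usage="deep" is passed so that string columns are measured accurately (the exact call in the original screenshots may differ):

```python
import pandas as pd

# Load the full CSV with all 25 columns.
df = pd.read_csv("dummy_dataset.csv")

# Report memory usage; "deep" introspects object (string) columns as well.
df.info(memory_usage="deep")
```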
The DataFrame holds 137 MB of space in memory with all 25 columns loaded. The run-time to load the CSV file is computed below:
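The original measurement was most likely taken with %timeit in a notebook; a plain-Python equivalent would look roughly like this:

```python
import time

import pandas as pd

start = time.perf_counter()
df = pd.read_csv("dummy_dataset.csv")
print(f"Loaded all columns in {time.perf_counter() - start:.2f} seconds")
```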
- Loading required columns:
In contrast to reading all the columns, if only a subset of columns is of interest to you, you can pass them as a list to the usecols argument of the [pd.read_csv()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) method.
The computation of memory utilization is demonstrated below:
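A sketch of the selective load, using the five columns named earlier (again assuming dummy_dataset.csv):

```python
import pandas as pd

# Only the five columns of interest are parsed and held in memory.
cols = ["Employee_ID", "First_Name", "Salary", "Rating", "Company"]
df = pd.read_csv("dummy_dataset.csv", usecols=cols)

df.info(memory_usage="deep")
```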
Loading only the columns of interest reduced the memory utilization by close to 9 times, occupying approximately 15 MB of space instead of the earlier 137 MB.
The loading run-time is also reduced significantly, providing a boost of close to 4 times compared to loading all columns.
Key Takeaways/Final Thoughts:
- Loading only the required columns can significantly improve run-time and memory utilization. Therefore, before loading a large CSV file, load only a few rows (say the first five) and list the columns of interest.
#3–#5 Alter the datatype of columns

By default, Pandas assigns the largest-memory datatype to numeric columns. For instance, if Pandas interprets a column as integer-valued, there are four signed sub-types to choose from:
- int8: 8-bit integer covering integers in [-2⁷, 2⁷ - 1]
- int16: 16-bit integer covering integers in [-2¹⁵, 2¹⁵ - 1]
- int32: 32-bit integer covering integers in [-2³¹, 2³¹ - 1]
- int64: 64-bit integer covering integers in [-2⁶³, 2⁶³ - 1]
However, Pandas will always assign int64 as the datatype of an integer-valued column, irrespective of the range of current values in the column.
Similar sub-types exist for float-valued numbers as well: float16, float32, and float64.
Note: I will refer to the same dummy dataset we discussed in the previous section. Below, I have listed the datatypes again.
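If you want to reproduce that check yourself, inspecting the inferred types is a one-liner (the output naturally depends on your own data):

```python
# Inspect the datatype Pandas inferred for each column.
print(df.dtypes)
```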
The current memory utilization of the DataFrame is 137 MB.
- Altering datatype of integer columns (#3)

Let’s consider the Employee_ID column and find its maximum and minimum values.
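A quick sketch of that min-max check:

```python
# Inspect the value range of the Employee_ID column.
print(df["Employee_ID"].min(), df["Employee_ID"].max())
```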
Notice that even though the column could be interpreted as int32 (2¹⁵ < 10⁵ < 2³¹), Pandas still adopted the int64 type for the column.
Thankfully, Pandas provides the flexibility to change the datatype of a column using the [astype()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html) method.
The conversion of the Employee_ID column, along with the memory usage before and after conversion, is demonstrated below:
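A sketch of the downcast, measuring the column's memory usage before and after (the int32 target follows from the min-max analysis above):

```python
# Memory consumed by the column before conversion (in bytes).
before = df["Employee_ID"].memory_usage(deep=True)

# Downcast from the default int64 to int32.
df["Employee_ID"] = df["Employee_ID"].astype("int32")

after = df["Employee_ID"].memory_usage(deep=True)
print(f"{before:,} bytes -> {after:,} bytes")
```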
The total memory utilized by the Employee_ID column halved with this simple one-line datatype transformation.
With a similar min-max analysis, you can also alter the datatype of other integer and float-valued columns.
- Altering datatype of columns representing categorical data (#4)

As the name suggests, a categorical column is a column that consists of only a few unique values that are repeated over and over in the entire column.
For instance, let’s find the number of unique values in a few columns using the [nunique()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.nunique.html) method as shown below:
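A sketch of that check; which columns you probe is up to you, and the two below are merely plausible candidates from the dummy dataset:

```python
# Count the distinct values in a few candidate categorical columns.
print(df[["Company", "First_Name"]].nunique())
```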
The small number of unique values in these columns relative to the size of the DataFrame indicates that they are categorical columns.
However, by default, Pandas inferred the datatype of all these columns as object, which is essentially a string type.
Using the astype() method, you can change the datatype of a categorical column to category. The reduction in memory utilization is demonstrated below:
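A sketch of the conversion, using Company as an assumed example of a categorical column:

```python
# Memory before: the column holds one Python string object per row.
before = df["Company"].memory_usage(deep=True)

# A category column stores each distinct value once and keeps only integer codes per row.
df["Company"] = df["Company"].astype("category")

after = df["Company"].memory_usage(deep=True)
print(f"{before:,} bytes -> {after:,} bytes")
```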
With the conversion from string to categorical, we notice a decrease in memory utilization by 75%, which is massive.
With a similar unique-element analysis, you can alter the datatype of other potential categorical columns.
- Altering datatype of columns with NaN values (#5)

Missing values are inevitable in real-world datasets, aren’t they? Consider that a column in your DataFrame has a significant proportion of NaN values, as shown below:
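You can check the proportion of missing values directly; Rating is used here because the discussion below treats it as the sparse column:

```python
# Fraction of NaN entries in the Rating column.
print(df["Rating"].isna().mean())
```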
In such a scenario, representing the column as a Sparse Data Structure (more on that in an upcoming post) can provide significant memory efficiency.
Using the astype() method, you can change the datatype of a sparse column to a Sparse[str]/Sparse[float]/Sparse[int] datatype. The datatype conversion and the reduction in memory utilization are demonstrated below:
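A sketch of the sparse conversion, assuming the Rating column was already downcast to float32 as in the article's figures:

```python
# (Assumes a prior downcast: df["Rating"] = df["Rating"].astype("float32").)

# Memory with the dense float32 representation.
before = df["Rating"].memory_usage(deep=True)

# Store only the non-NaN entries; NaN becomes the implicit fill value.
df["Rating"] = df["Rating"].astype("Sparse[float32]")

after = df["Rating"].memory_usage(deep=True)
print(f"{before:,} bytes -> {after:,} bytes")
```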
The conversion from float32 to Sparse[float32] reduces the memory utilization by close to 40 percent, which is roughly the percentage of NaN values in the Rating column.
Key Takeaways/Final Thoughts:
- Pandas always interprets its columns with the largest-memory datatype. If the range of values in your column does not require the full scope of that datatype, consider downgrading the column to a smaller, more suitable type.
You can find reference code for executing these datatype conversions in this StackOverflow answer.
#6 Specify column datatype while reading a CSV
The tips discussed in sections #3–#5 above assume that you already have a Pandas DataFrame loaded in the Python environment. In other words, those were post-input techniques for optimizing memory utilization.
However, in situations where loading the dataset is the primary challenge, you can take control of the datatype interpretation task carried out by Pandas during input and specify the particular datatype you want your columns to be inferred as.

You can achieve this by passing the dtype argument to the pd.read_csv() method as follows:
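A sketch of such a call; the column-to-datatype mapping below is hypothetical and should be replaced with whatever you know about your own columns:

```python
import pandas as pd

# Hypothetical mapping; adapt it to your own data dictionary.
dtype_map = {
    "Employee_ID": "int32",
    "Salary": "float32",
    "Company": "category",
}

df = pd.read_csv("dummy_dataset.csv", dtype=dtype_map)
print(df.dtypes)
```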
As shown above, the dtype argument expects a dictionary mapping column names to datatypes.
Key Takeaways/Final Thoughts:
- If you are aware of the type of data within some (or all) columns of a CSV, either through a data dictionary or some other source, try to infer the most suitable datatype yourself and pass it to the dtype argument of the pd.read_csv() method.
#7 Read data in chunks from a CSV

Lastly, assume that you have done everything you possibly could in tip #6, but the CSV is still impossible to load due to memory constraints.
Although my last technique won’t reduce the net memory utilization, it is a workaround for loading large datasets that you can use in such extreme situations.
Pandas’ input methods read a CSV file sequentially, one row (or line) at a time.

If the number of rows is too large to load into memory at once, you can instead load a segment (or chunk) of rows, process it, and then read the next segment of the CSV file. This is demonstrated below:

You can leverage the above chunk-based input process by passing the chunksize argument to the pd.read_csv() method as follows:
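A sketch of chunked reading; the chunk size of 50,000 rows and the per-chunk processing are placeholders for your own choices:

```python
import pandas as pd

row_counts = []

# Read the CSV 50,000 rows at a time instead of all at once.
for chunk in pd.read_csv("dummy_dataset.csv", chunksize=50_000):
    # Process each chunk independently; here we only record its row count.
    row_counts.append(len(chunk))

print(sum(row_counts))  # total number of rows seen across all chunks
```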
Every chunk object is a Pandas DataFrame, and we can verify this using Python’s built-in type() function as follows:
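For instance, inspecting the first chunk:

```python
import pandas as pd

reader = pd.read_csv("dummy_dataset.csv", chunksize=50_000)
first_chunk = next(reader)

print(type(first_chunk))  # <class 'pandas.core.frame.DataFrame'>
```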
Key Takeaways/Final Thoughts:
- If the CSV file is too large to load and fit in memory, use the chunking method for loading segments of the CSV and processing them one after the other.
- One major drawback of this method is that you cannot perform operations that need the entire DataFrame. For instance, say you want to perform a groupby() operation on a column; rows corresponding to a group may lie in different chunks.
Conclusion
To conclude, in this post, I discussed seven incredible memory optimization techniques in Pandas, which you can directly leverage in your next data science project.
In my opinion, the areas I have discussed in this post are subtle ways to optimize memory utilization that are often overlooked. Nonetheless, I hope this post gave you an insightful understanding of these day-to-day Pandas functions.
Thanks for reading!
🧑💻 Become a Data Science PRO! Get the FREE Data Science Mastery Toolkit with 450+ Pandas, NumPy, and SQL questions.
✉️ Sign up for my email list to never miss another article on data science guides, tricks and tips, machine learning, SQL, Python, and more. Medium will deliver my next articles right to your inbox.