
Seven Killer Memory Optimization Techniques Every Pandas User Should Know

Simple tips to optimize the memory utilization in Pandas

Photo by Denise Jans on Unsplash

Designing and building machine learning models that work in the real world has always been of great interest to data scientists. This has inevitably led them to leverage optimized, efficient, and accurate methods at scale.

Optimization, both on the level of run-time and memory, plays a foundational role in sustainably delivering real-world and user-facing software solutions.

Optimization Categorization (Image by author)

In one of my earlier posts, I presented some of the top run-time optimization techniques that you can use in your regular Data Science projects.

Five Killer Optimization Techniques Every Pandas User Should Know

In this post, we will explore another area of optimization, and I will introduce you to a handful of incredible techniques to optimize the memory usage of your Pandas DataFrame.

These tips will assist you in efficiently conducting your typical tabular data analysis, management, and processing tasks in Pandas.

To get a brief overview, I will discuss the following topics in this post:

#1 Make Inplace Modifications to DataFrame
#2 Read only required columns from a CSV
#3–#5 Alter the datatype of columns
#6 Specify column datatype while reading a CSV
#7 Read data in chunks from a CSV
Conclusion

Let’s begin 🚀 !


#1 Make Inplace Modifications to DataFrame

Once we load a DataFrame into the Python environment, we typically perform a wide range of modifications on the DataFrame, don’t we? These include adding new columns, renaming headers, deleting columns, altering row values, replacing NaN values, and many more.

These operations can typically be performed in two ways, as depicted below:

Categorization of DataFrame Manipulation in Pandas (Image by author).

Standard assignment creates a new copy of the DataFrame after the transformation, leaving the original DataFrame untouched.

Creating a new DataFrame from a given DataFrame (Image by author).
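A minimal sketch of standard assignment, using the dummy dataset introduced later in this post and fillna() as an illustrative transformation:

```python
import pandas as pd

df = pd.read_csv("dummy_dataset.csv")

# Standard assignment: fillna() returns a NEW DataFrame,
# leaving the original df untouched.
df_copy = df.fillna(0)

print(df_copy is df)  # False: two distinct objects in memory
```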

As a result of the standard assignment, two distinct Pandas DataFrames (original and transformed) co-exist in the environment (df and df_copy above), doubling the memory utilization.

In contrast to the standard assignment operations, inplace assignment operations intend to modify the original DataFrame itself without creating a new Pandas DataFrame object. This is demonstrated below:

Performing in-place operation (Gif by author)
Performing in-place operation (Gif by author)
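A sketch of the same transformation performed in place:

```python
# Inplace assignment: the same operation mutates df directly
# and returns None instead of a new DataFrame.
df.fillna(0, inplace=True)
```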

Therefore, if the intermediate copy of the DataFrame (df_copy above) holds no utility in your project, taking the route of inplace assignment is the ideal way to proceed in memory-constrained applications.

You can read my detailed post on inplace assignment operations below:

A Simple Guide to Inplace Operations in Pandas

Key Takeaways/Final Thoughts:

  1. Use standard assignment (or inplace=False) when the intermediate DataFrame is necessary, and you don’t want to mutate the input.
  2. Use inplace assignment (or inplace=True) if you are working with memory constraints and have no particular use of the intermediate DataFrame.

#2 Read only required columns from a CSV

Reading only columns of interest (Image by author). Note: A CSV file is a text file, and the above illustration is not how a CSV looks. This is just to elaborate the point intuitively.

Imagine a scenario where you have hundreds of columns in your CSV file, of which only a subset of columns are of interest to you.

For instance, consider the first five rows of the dummy DataFrame I created with 25 columns and 10⁵ rows using Faker (filename: dummy_dataset.csv):

Dummy Dataset (Gif by author)

From these 25 columns, say only five columns are of utmost interest to you, and you want to load them as a Pandas DataFrame. These columns are Employee_ID, First_Name, Salary, Rating and Company.

  • Loading ALL columns:

If you were to read the entire CSV file into the Python environment, Pandas would be forced to load columns that are of no use and infer their datatypes, leading to increased run-time and memory usage.

We can find the memory usage of a Pandas DataFrame using the [info()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html) method as shown below:
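A minimal sketch (passing memory_usage="deep" is my addition, so that the true size of string columns is reported rather than just their pointer overhead):

```python
import pandas as pd

df = pd.read_csv("dummy_dataset.csv")

# "deep" introspection accounts for the actual contents of
# object (string) columns in the reported memory footprint.
df.info(memory_usage="deep")
```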

The DataFrame occupies 137 MB of memory with all 25 columns loaded. The run-time to load the CSV file is computed below:
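A plain-Python timing sketch (the original demo likely used a notebook timing magic such as %timeit):

```python
import time

start = time.time()
df = pd.read_csv("dummy_dataset.csv")  # load all 25 columns
print(f"Load time: {time.time() - start:.2f} s")
```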

  • Loading required columns:

In contrast to reading all the columns, if there is only a subset of columns that are of interest to you, you can pass them as a list to the usecols argument of the [pd.read_csv()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) method.

The computation of memory utilization is demonstrated below:
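A sketch using the five columns named above:

```python
required_cols = ["Employee_ID", "First_Name", "Salary", "Rating", "Company"]

# usecols makes Pandas parse and store only these columns.
df_subset = pd.read_csv("dummy_dataset.csv", usecols=required_cols)
df_subset.info(memory_usage="deep")  # a fraction of the full footprint
```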

Loading only the columns of interest reduced the memory utilization by close to 9 times, occupying approximately 15 MB of space instead of the earlier 137 MB.

The loading run-time is also reduced significantly, providing a boost of close to 4 times compared to loading all columns.

Key Takeaways/Final Thoughts:

  1. Loading only the required columns can significantly improve run-time and memory utilization. Therefore, before loading a large CSV file, load only a few rows (say the first five) and list the columns of interest.

#3–#5 Alter the datatype of columns

Datatype conversion in Pandas (Image by author)

By default, Pandas always assigns the highest-memory datatype to columns. For instance, if Pandas interprets a column as integer-valued, there are four (signed) sub-categories to choose from:

  • int8: 8-bit integer that covers integers from [-2⁷, 2⁷ − 1].
  • int16: 16-bit integer that covers integers from [-2¹⁵, 2¹⁵ − 1].
  • int32: 32-bit integer that covers integers from [-2³¹, 2³¹ − 1].
  • int64: 64-bit integer that covers integers from [-2⁶³, 2⁶³ − 1].

However, Pandas will always assign int64 as the datatype of the integer-valued column, irrespective of the range of current values in the column.

Similar conventions exist for float-valued columns as well: float16, float32, and float64.

Note: I will refer to the same dummy dataset we discussed in the previous section. Below, I have listed the datatypes again:
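A quick way to inspect them:

```python
# Integer and float columns default to int64/float64;
# string columns are inferred as object.
print(df.dtypes)
```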

The current memory utilization of the DataFrame is 137 MB.

  • Altering datatype of integer columns (#3)
Downgrading integer datatypes (Image by author)

Let’s consider the Employee_ID column, and find its maximum and minimum value.
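For instance:

```python
# With 10^5 rows, the IDs fall well within the int32 range.
print(df["Employee_ID"].min(), df["Employee_ID"].max())
```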

Notice that even though the column can be potentially interpreted as int32 (2¹⁵< 10⁵ < 2³¹), Pandas still adopted int64 type for the column.

Thankfully, Pandas provides the flexibility to change the datatype of a column using the [astype()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html) method.

The conversion of the Employee_ID column, along with the memory usage before and after conversion, is demonstrated below:
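A sketch (exact byte counts will vary with your data):

```python
before = df["Employee_ID"].memory_usage(deep=True)

# Downcast the column from the default int64 to int32.
df["Employee_ID"] = df["Employee_ID"].astype("int32")

after = df["Employee_ID"].memory_usage(deep=True)
print(f"{before:,} -> {after:,} bytes")  # int32 needs half the bytes of int64
```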

The total memory utilized by the Employee_ID column halved with this simple one-line datatype transformation.

With a similar min-max analysis, you can also alter the datatype of other integer- and float-valued columns.

  • Altering datatype of columns representing categorical data (#4)
Converting to categorical column (Image by author)

As the name suggests, a categorical column is a column that consists of only a few unique values that are repeated over and over in the entire column.

For instance, let’s find the number of unique values in a few columns using the [nunique()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.nunique.html) method as shown below:
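A sketch; the choice of columns here is illustrative (Company is one of the five columns loaded earlier):

```python
# Unique-value counts that are tiny relative to the 10^5 rows
# signal categorical data.
print(df[["Company", "First_Name"]].nunique())
```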

The small number of unique values in these columns relative to the size of the DataFrame indicates that they are categorical columns.

However, by default, Pandas inferred the datatype of all these columns as object, which is essentially a string type.

Using the astype() method, you can change the datatype of a categorical column to category. The reduction in memory utilization is demonstrated below:
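A sketch using the Company column (assumed here to be one of the categorical columns):

```python
before = df["Company"].memory_usage(deep=True)

# A category column stores each unique string once, plus
# small integer codes per row.
df["Company"] = df["Company"].astype("category")

after = df["Company"].memory_usage(deep=True)
print(f"{before:,} -> {after:,} bytes")
```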

With the conversion from string to categorical, we notice a decrease in memory utilization by 75%, which is massive.

With a similar unique-element analysis, you can alter the datatype of other potential categorical columns.

  • Altering datatype of columns with NaN values (#5)
Converting various datatypes to Sparse types (Image by author)

Missing values are inevitable in real-world datasets, aren’t they? Consider that a column in your DataFrame has a significant proportion of NaN values, as shown below:
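You can check the fraction of missing entries directly:

```python
# Fraction of NaN entries in the Rating column
# (roughly 40% in this post's demo dataset).
print(df["Rating"].isna().mean())
```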

In such a scenario, representing the column as a Sparse Data Structure (more on that in an upcoming post) can provide significant memory efficiency.

Using the astype() method, you can change the datatype of such a column to Sparse[str], Sparse[float], or Sparse[int]. The reduction in memory utilization and the datatype conversion are demonstrated below:
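A sketch, assuming the Rating column was already downcast to float32 per tip #3:

```python
before = df["Rating"].memory_usage(deep=True)

# A sparse column stores only the non-NaN values plus their positions.
df["Rating"] = df["Rating"].astype("Sparse[float32]")

after = df["Rating"].memory_usage(deep=True)
print(f"{before:,} -> {after:,} bytes")
```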

The conversion from float32 to Sparse[float32] reduces the memory utilization by close to 40 percent, which is roughly the percentage of NaN values in the Rating column.

Key Takeaways/Final Thoughts:

  1. Pandas always assigns the largest-memory datatype to columns. If the range of values in your column does not span the scope of that datatype, consider downgrading the column’s datatype to the most optimal type.

You can find a reference code to execute these datatype conversions in this StackOverflow answer.

#6 Specify column datatype while reading a CSV

The tips discussed in sections #3–#5 above assume that you already have a Pandas DataFrame loaded in the Python environment. In other words, those were post-input techniques for optimizing memory utilization.

However, in situations where loading the dataset is the primary challenge, you can take control of the datatype interpretation task carried out by Pandas during input and specify the particular datatype you want your columns to be inferred as.

Providing datatype instructions to Pandas (Image by author). Note: A CSV file is a text file, and the above illustration is not how a CSV looks. This is just to elaborate the point intuitively.

You can achieve this by passing the dtype argument to the pd.read_csv() method as follows:
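A sketch; the mapping shown is illustrative, so pick the types that match your own data:

```python
# Hypothetical dtype mapping for the dummy dataset's columns.
column_dtypes = {
    "Employee_ID": "int32",
    "Salary": "float32",
    "Company": "category",
}

df = pd.read_csv("dummy_dataset.csv", dtype=column_dtypes)
print(df.dtypes)  # Pandas skips inference for the mapped columns
```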

As shown above, the dtype argument expects a dictionary mapping from column name to datatype.

Key Takeaways/Final Thoughts:

  1. If you are aware of the type of data within some (or all) columns of a CSV, either through a data dictionary or some other source, try to infer the most suitable datatype yourself and pass it to the dtype argument of the pd.read_csv() method.

#7 Read data in chunks from a CSV

Reading file in chunks (Image by author). Note: A CSV file is a text file, and the above illustration is not how a CSV looks. This is just to elaborate the point intuitively.

Lastly, assume that you have done everything you possibly could in tip #6, but the CSV is still impossible to load due to memory constraints.

Although this last technique won’t reduce the net memory utilization, it serves as a workaround for loading large datasets in such extreme situations.

Pandas’ input methods are serialized: they read only one row (or line) of a CSV file at a time.

Reading one row at a time (Gif by author). Note: A CSV file is a text file, and the above illustration is not how a CSV looks. This is just to elaborate the point intuitively.

If the number of rows is too large to load into memory at once, you can instead load a segment (or chunk) of rows, process it, and then read the next segment of the CSV file. This is demonstrated below:

Processing data in chunks in Pandas (Gif by author). Note: A CSV file is a text file, and the above illustration is not how a CSV looks. This is just to elaborate the point intuitively.

You can leverage the above chunk-based input process by passing the chunksize argument to the pd.read_csv() method as follows:
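A sketch; the chunk size and the per-chunk routine are illustrative placeholders:

```python
def process(chunk_df):
    """Hypothetical per-chunk routine, e.g. filtering or aggregating."""
    ...

# chunksize=50_000 returns an iterator of DataFrames holding
# up to 50,000 rows each, instead of one giant DataFrame.
for chunk in pd.read_csv("dummy_dataset.csv", chunksize=50_000):
    process(chunk)
```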

Every chunk object is a Pandas DataFrame, and we can verify this using Python’s built-in type() function as follows:
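```python
# Pull the first chunk from the iterator and inspect its type.
chunk = next(pd.read_csv("dummy_dataset.csv", chunksize=50_000))
print(type(chunk))  # <class 'pandas.core.frame.DataFrame'>
```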

Key Takeaways/Final Thoughts:

  1. If the CSV file is too large to load and fit in memory, use the chunking method for loading segments of the CSV and processing them one after the other.
  2. One major drawback of this method is that you cannot perform operations that need the entire DataFrame. For instance, say you want to perform a groupby() operation on a column. Here, it is possible that rows corresponding to a group may lie in different chunks.

Conclusion

To conclude, in this post, I discussed seven incredible memory optimization techniques in Pandas, which you can directly leverage in your next data science project.

In my opinion, the areas discussed in this post are subtle yet often overlooked ways to optimize memory utilization. Nonetheless, I hope this post gave you an insightful understanding of these day-to-day Pandas functions.

Thanks for reading!

