Using Python’s Garbage Collector with Pandas DataFrames: Higher Efficiency and Performance for Larger Datasets

Apratim Biswas
Towards Data Science
8 min read · Dec 11, 2020


Stacked Shipping Containers: A Simple Analogy for Objects in Memory
Photo credit: Teng Yuhong, via Unsplash (https://unsplash.com/photos/qMehmIyaXvY)

It’s almost 2021. Memory is inexpensive, and it’s easy to access cloud platforms like Amazon Web Services (AWS) or Google Cloud Platform (GCP) and throw vast amounts of resources at a data problem. And so, we usually don’t worry about memory (RAM) these days. But there are at least two problems with this line of thinking:

i) if we use our resources efficiently, we can do more with the same amount of resources (i.e. save money!); and,
ii) “data has mass”, in the sense that large volumes of data move more slowly than smaller volumes. In other words, smaller volumes of data move around faster and, hence, get processed faster (i.e. save time and money!!).

There are several aspects to managing memory usage. To list a few, we have: garbage collection, the option to use certain data types over others, the option to use tries and directed acyclic word graphs, and the option to use probabilistic data structures. Each of these deserves an article (or perhaps several!) of its own. So in this article, I’ll stick with just one of them: garbage collection. Often overlooked, it is one of the primary ways Python manages memory. It happens in the background without us doing anything special, but it is possible to control some aspects of it, and knowing them can be really useful when handling large amounts of data.

Before we dive into the details, it may be useful to provide a little background on variable names in Python. In Python, variable names are simply symbolic names that point to objects. The schematic below illustrates this.

Internal Representation of Objects in Python [1]


Let’s define a and b separately:

a = "banana"
b = "banana"

Now let’s take a look at the location these two variables are referring to:

for name in [a, b]:
    print(object.__repr__(name))
<str object at 0x7f0f901548b0>
<str object at 0x7f0f901548b0>

Notice that internally both a and b are pointing to the same object. When we assign a = "banana", we create a string object with value "banana". And when we assign b = "banana", we are simply creating a new symbolic name b for the same object. Put in the language of computer science, we create a second reference to the object.
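We can watch reference counting at work using sys.getrefcount. Here is a minimal sketch (it uses a list rather than a string to avoid string-interning artifacts; note that getrefcount itself temporarily holds one extra reference to the object it inspects):

import sys

a = [1, 2, 3]
print(sys.getrefcount(a))  #2: the name `a`, plus the temporary
                           #reference held by getrefcount itself

b = a                      #a second reference to the same object
print(sys.getrefcount(a))  #3

del b                      #drop the second reference
print(sys.getrefcount(a))  #back to 2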

Now that we have some background on variable names in Python, we’re ready to get started! We’ll load and process a medium-sized dataset using the Python library pandas. At every step, we’ll carefully monitor memory usage and try to implement garbage collection strategies to make the process more efficient. Please note this is not meant as an analysis of the dataset; any analytical step in this exercise is merely there to demonstrate its effect on memory usage.

#Import necessary libraries

import pandas as pd
import sys #system specific parameters and names
import gc #garbage collector interface

Let’s define the relative path to the dataset. The data is in a csv file that occupies 205.2 MB on my system. I’ll use the usual read_csv function to load it into memory.

file_path = './data/Housing_Maintenance_Code_Complaints.csv'
df = pd.read_csv(file_path, low_memory=False)
df.head(3)

Let’s check the objects occupying space in the memory at this stage. We should see the newly created DataFrame, df, in the list.

memory_usage_by_variable = pd.DataFrame(
    {k: sys.getsizeof(v) for (k, v) in locals().items()}, index=['Size'])
memory_usage_by_variable = memory_usage_by_variable.T
memory_usage_by_variable = memory_usage_by_variable.sort_values(
    by='Size', ascending=False).head(10)
memory_usage_by_variable.head()

df is at the top of the list, occupying a little over 1.2 GB. The sizes also span several orders of magnitude. Let’s define a function that makes it easier to check memory usage and produces more human-readable results.

def obj_size_fmt(num):
    if num < 10**3:
        return "{:.2f}{}".format(num, "B")
    elif (num >= 10**3) & (num < 10**6):
        return "{:.2f}{}".format(num/(1.024*10**3), "KB")
    elif (num >= 10**6) & (num < 10**9):
        return "{:.2f}{}".format(num/(1.024*10**6), "MB")
    else:
        return "{:.2f}{}".format(num/(1.024*10**9), "GB")


def memory_usage():
    memory_usage_by_variable = pd.DataFrame(
        {k: sys.getsizeof(v) for (k, v) in globals().items()},
        index=['Size'])
    memory_usage_by_variable = memory_usage_by_variable.T
    memory_usage_by_variable = memory_usage_by_variable \
        .sort_values(by='Size', ascending=False).head(10)
    memory_usage_by_variable['Size'] = memory_usage_by_variable['Size'] \
        .apply(lambda x: obj_size_fmt(x))
    return memory_usage_by_variable

memory_usage()

Looking at the sizes, there are two things that stand out:

  1. df accounts for almost all of the memory usage; and,
  2. the DataFrame is markedly larger than the csv file. The original csv file is only 205.2 MB. df was created simply by converting the data in the csv file to a pandas DataFrame, yet it occupies over 1.22 GB, about 6 times the size of the csv file (a quick way to see where that overhead lives is sketched below).
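Most of that overhead typically comes from columns stored as Python objects (strings). As a sanity check, pandas can break the usage down per column; here is a minimal sketch using the DataFrame’s memory_usage method (deep=True is needed so pandas measures the actual string payloads rather than just the 8-byte object pointers):

#Per-column usage in bytes, largest first
print(df.memory_usage(deep=True).sort_values(ascending=False).head())

#Total usage, formatted with the helper defined above
print(obj_size_fmt(df.memory_usage(deep=True).sum()))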

It is important to keep these observations in mind while processing large datasets. Working with DataFrames can consume treacherously large amounts of memory, a lot more than one’s intuition would suggest. We have to be judicious about creating slices and intermediate DataFrames, and ensure that we don’t make unnecessary copies; otherwise, the space occupied by copies of slices can add up quickly. Let’s see an example. Say we want to take the first 6 columns and name the result df2.

df2 = df.iloc[:, 0:6]
df2.head(3)
#Let's check the memory usage
memory_usage()

As before, we check the locations that the variable names df and df2 are referring to.

print(object.__repr__(df))
print(object.__repr__(df2))
<pandas.core.frame.DataFrame object at 0x7f0f90136ca0>
<pandas.core.frame.DataFrame object at 0x7f0f58980490>

Clearly, df2 is now pointing to a new object, one that occupies 468.49 MB of memory. One way to deal with this problem is by using the garbage collection module, and that’s what we are going to look into in the next section.

Garbage collection and the gc module in Python

The primary garbage collection algorithm used by Python is reference counting. Python keeps track of all the different places that hold a reference to an object; a “place” can be a variable or another object. When the number of such places drops to zero, the object is deallocated, and the reference count of every object the deallocated object referred to is decreased by 1. For reasons outside the scope of this article, reference counting alone cannot clean up data structures with reference cycles (see the official Python documentation for further details; a minimal example of such a cycle appears below). This is where the Garbage Collector (GC) comes in. The GC focuses exclusively on cleaning up objects with reference cycles and supplements reference counting.
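To make this concrete, here is a minimal sketch (the Node class is purely illustrative): two objects that reference each other remain alive even after every external name for them is deleted, because their reference counts never reach zero. Only the cyclic collector reclaims them.

import gc

class Node:
    def __init__(self):
        self.partner = None

a = Node()
b = Node()
a.partner = b
b.partner = a   #a and b now reference each other: a cycle

del a
del b           #both names are gone, but each object still holds
                #a reference to the other

#The cyclic collector finds and frees them; collect() returns the
#number of unreachable objects it found (>= 2 here on CPython)
print(gc.collect())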

The gc module provides an interface to the garbage collector. We can perform a wide variety of functions through it, including enabling and disabling the collector. I'll touch upon just a few of them.
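For instance, here is a small sketch of toggling the collector on and off; note that disabling it only pauses the cyclic collector, while reference counting always stays active:

import gc

print(gc.isenabled())  #True by default

gc.disable()           #pause automatic cyclic collection
print(gc.isenabled())  #False

gc.enable()            #turn it back on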

Garbage collection is a scheduled activity. Python keeps track of all objects in memory and classifies them into three generations: 0 being the newest and 2 the oldest. Each generation has its own counter and threshold. For generation 0, collection is triggered when the number of allocations minus the number of deallocations exceeds the generation’s threshold; for generations 1 and 2, it is triggered when the number of collections of the next younger generation exceeds the threshold. The default threshold values for generations 0 to 2, in order, are 700, 10 and 10. We can check the threshold values for all three generations using the function get_threshold as shown below:

gc.get_threshold() #Tuple showing thresholds for the 3 generations at which automatic garbage collection is triggered.
(700, 10, 10)

We can also trigger garbage collection at any point using the collect function. But before we do that, let's take a look at the current memory usage and collection counts for the three generations. This will be our baseline.

gc.get_count()
(534, 4, 9)
memory_usage()

It appears that there are close to 600 objects, two of them accounting for almost 100% of memory usage.

Using the collect function

gc.collect()
gc.get_count()
(34, 0, 0)
memory_usage()

The object count decreased from 598 to 21. That’s a drop of more than 96%! But notice we still have df and df2. This is because we still hold references to them. So let’s call the collect function again, but this time we first delete the references to df and df2:

#deleting references
del df
del df2

#triggering collection
gc.collect()

#finally check memory usage
memory_usage()

And they’re gone!!

Setting thresholds for the garbage collector

When working with large datasets, chances are many of the objects in memory are also going to be large, and their sizes can quickly add up to a level where performance becomes a problem. I find the set_threshold function particularly effective at addressing this challenge. Setting lower thresholds for each of the generations makes garbage collection sweeps more frequent and frees memory sooner.
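As a sketch, we could halve the defaults (the values here are purely illustrative; the right numbers depend on your workload and would need tuning):

import gc

print(gc.get_threshold())   #default: (700, 10, 10)

#Illustrative values only: lower thresholds trigger sweeps more
#often, trading CPU time for a smaller memory footprint
gc.set_threshold(350, 5, 5)
print(gc.get_threshold())   #(350, 5, 5)

#Restore the defaults when done experimenting
gc.set_threshold(700, 10, 10)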

There is a big caveat to this, however. Garbage collection is computationally expensive, something that must be weighed before changing the default thresholds. The appropriate threshold values are application-specific and depend on resource constraints. In fact, the standard conservative recommendation in Python is to stick with the default values. That said, if you are memory constrained, and/or you are working on a single machine, it can be a useful tool. As the Python source code describes it, the Python object allocator is designed as “a fast, special-purpose memory allocator for small blocks, to be used on top of a general-purpose malloc”, referring to the malloc function in the C language.[2][3] If you find yourself deallocating large objects, or a lot of smaller objects for that matter, using the garbage collection module may give your process a big boost.

LIMITATION

Once an object is collected by the garbage collector, the “freed” memory can be occupied by a new Python object. But Python does not guarantee that it will return the “freed” memory to the operating system; instead, the object allocator keeps a portion of it, just in case, for future use.
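To observe this behavior, you can watch the process’s resident set size (RSS) around the lifetime of a large object. A sketch, assuming the third-party psutil package is installed; the exact numbers will vary by platform and allocator:

import gc
import psutil  #third-party: pip install psutil

proc = psutil.Process()
print(proc.memory_info().rss)  #RSS before the allocation

big = list(range(10_000_000))  #allocate a large object
print(proc.memory_info().rss)  #noticeably higher

del big
gc.collect()
print(proc.memory_info().rss)  #often lower, but typically not all the
                               #way back down: the allocator keeps some
                               #memory around for future allocations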

References

  1. https://runestone.academy/runestone/books/published/thinkcspy/Lists/ObjectsandReferences.html
  2. https://github.com/python/cpython/blob/7d6ddb96b34b94c1cbdf95baa94492c48426404e/Objects/obmalloc.c (lines 559–560 as of 12/10/2020).
  3. https://realpython.com/python-memory-management/


I’m a data scientist. I love working with data, seeking patterns, building models and translating findings to compelling stories. https://abiswas20.github.io