How to Avoid Memory Errors with Pandas

Some strategies to scale pandas while working with medium and large datasets

Nicolas Bohorquez
Towards Data Science


Photo by Stephanie Klepacki on Unsplash

TL;DR If you often run out of memory with Pandas or struggle with slow code execution, you can spend hours testing manual workarounds, or you can solve it in less than five minutes using Terality. I had to discover this the hard way.

Context: Exploring unknown datasets

Recently, I set out to explore a dataset that contains 720,000 rows and 72 columns from a popular video game: the idea was to discover whether there were any consistent patterns in the players’ strategies. But I spent more time trying to load the data, due to memory errors, than actually exploring it. I’m well aware of the 80/20 rule of data analysis, where most of your time is spent exploring the data, and I am OK with that. I have always thought that a new dataset is like exploring a new country, with its own context and customs that you must decipher in order to explain or discover some pattern; but in this case, I wasn’t even able to start working.

Image by Author

As soon as my data was ready to be processed, I started to run into issues because some Pandas functions needed far more memory than my machine had available. For example, I ran out of memory just trying to sort one column. Since this was not the first time this had happened to me, I applied the usual techniques. Below, I will explain what I tried before I discovered Terality.
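Before applying any fix, it helps to know how big the DataFrame actually is in memory; operations like sorting create intermediate copies, which is usually where these errors surface. Here is a minimal sketch of that check; the file name and column are hypothetical placeholders, not the actual dataset:

    import pandas as pd

    # "game_data.csv" and "match_duration" are hypothetical names used for illustration
    df = pd.read_csv("game_data.csv")

    # Report the real in-memory footprint, including object (string) columns
    df.info(memory_usage="deep")
    print(df.memory_usage(deep=True).sum() / 1024 ** 2, "MB")

    # Sorting builds intermediate copies, so it can fail even when the load succeeded
    df = df.sort_values("match_duration")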

Strategy 1: Load less data (sub-sampling)

One strategy for solving this kind of problem is to decrease the amount of data by either reducing the number of rows or columns in the dataset. In my case, however, I was only loading 20% of the available data, so this wasn’t an option as I would exclude too many important elements in my dataset.
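For cases where sub-sampling is an option, Pandas can restrict the load at read time instead of loading everything and discarding rows afterwards. A minimal sketch, with hypothetical file and column names:

    import pandas as pd

    # Keep only the columns you need and cap the number of rows (hypothetical names)
    cols = ["player_id", "match_id", "score"]
    df = pd.read_csv("game_data.csv", usecols=cols, nrows=200_000)

    # Alternatively, stream the file in chunks and keep a random sample of each chunk
    chunks = pd.read_csv("game_data.csv", usecols=cols, chunksize=100_000)
    df_sample = pd.concat(chunk.sample(frac=0.2, random_state=42) for chunk in chunks)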

Strategy 2: Scaling Vertically

If you can’t or shouldn’t use less data and your problem is a lack of resources, you have two options: scaling vertically, which means adding more physical resources (in this case, more RAM) to your environment (i.e., working on a single, bigger computer), or scaling horizontally, which means distributing the problem across several smaller, coordinated instances. Vertical scaling is easier: you just plug in the new hardware and work the same way as before, only faster.

I have used Google Colab before as my default option to scale my resources vertically. It offers a Jupyter-like environment with 12GB of RAM for free with some limits on time and GPU usage. Since I didn’t need to perform any modeling tasks yet, just a simple Pandas exploration and a couple of transformations, it looked like the perfect solution. But no, again Pandas ran out of memory at the very first operation.

Image by Author

Strategy 3: Modify the Data Types

Given that vertical scaling wasn’t enough, I turned to some complementary techniques. The first was to reduce the size of the dataset by modifying the data types used to map some columns. For a given data type, for example int64, pandas allocates enough memory per value to store any integer from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807, that is, 8 bytes per value. After reading the description of the data, I changed the data types to the smallest ones that fit, thereby reducing the size of the dataset by more than 35%:

Image by Author
Image by Author
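A minimal sketch of this downcasting approach; the column names and target types are hypothetical, since the real choices depend on the value ranges documented for your dataset:

    import pandas as pd

    df = pd.read_csv("game_data.csv")  # hypothetical file name
    before = df.memory_usage(deep=True).sum()

    # Downcast numeric columns to the smallest type that fits their values
    df["score"] = pd.to_numeric(df["score"], downcast="integer")      # e.g. int64 -> int16
    df["accuracy"] = pd.to_numeric(df["accuracy"], downcast="float")  # float64 -> float32

    after = df.memory_usage(deep=True).sum()
    print(f"memory saved: {1 - after / before:.0%}")

    # The same idea at load time: declare explicit dtypes to skip the int64/float64 defaults
    df_small = pd.read_csv("game_data.csv", dtype={"score": "int16", "team": "int8"})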

Be careful with your choices when applying this technique: some conversions will not reduce the size in memory and can even make it worse. A common suggestion is to change object columns to categorical; in my case, however, this canceled out my previous gains:

Image by Author
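Whether the categorical conversion pays off depends on the column’s cardinality: a category column stores each distinct value once plus a small integer code per row, so a column of near-unique strings can actually grow. A minimal sketch of checking before converting, with hypothetical column names:

    import pandas as pd

    df = pd.read_csv("game_data.csv")  # hypothetical file name

    # Low-cardinality column: the category dtype usually shrinks it
    print(df["weapon_name"].memory_usage(deep=True))
    print(df["weapon_name"].astype("category").memory_usage(deep=True))

    # High-cardinality column (e.g. unique IDs): the categories table duplicates
    # nearly every value, so the conversion can make things worse
    print(df["player_id"].nunique() / len(df))  # a ratio close to 1.0 means don't convert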

Strategy 4: Horizontal scaling with Terality

Scaling horizontally, which basically means adding more machines, would require distributing my code across multiple servers. That’s a tough problem, typically solved with MapReduce or Spark, for instance, but I didn’t know of any solution that could easily do that for Pandas code. Some open-source projects can help, but they don’t implement every Pandas method, don’t always provide the same syntax, and are limited in scope.

While I was googling for new tricks to apply, I discovered Terality. They look like the new kid on the block, offering an interesting alternative for this kind of problem: managed horizontal scaling of Pandas. In other words, it seems that Terality spawns a cluster of machines behind the scenes, connects that cluster to your environment, and runs the code on it. This is perfect from the analyst’s point of view: the cluster is hidden, and it doesn’t require upgrading the local environment with new hardware, modifying the data, or changing the code. You can probably find a better description of their offering in this Medium article.

After a simple onboarding process, I was ready to test this approach. It really surprised me that the load and merge operations that had previously failed in my local and Google Colab environments ran fast and without any issues on Terality. Also, since they are currently in a private beta, I didn’t have to pay anything. Some other bold claims from the Terality team about their product include:

  • Execution of Pandas code 100x faster, even on big datasets
  • Full support of the Pandas API (Methods, integrations, errors, etc.)
  • Savings on infrastructure costs

My use case wasn’t big enough to test all of this functionality, nor was it my intention to build a benchmark against other tools in the same space. But, using the private beta account, I was able to work with the dataset flawlessly through Terality. You can check the times recorded for loading and merging in the following screenshot:

Image by Author

The only difference, besides the awesome speed (1 minute 22 seconds for a merge whose output exceeded 100GB), is that the dataframe is a terality.DataFrame. The integration and setup took less than five minutes: I only needed to pip install their client library from PyPI, then replace import pandas as pd with import terality as pd, and the rest of the code didn’t need any change at all (see the sketch after the list below). In the end, using Terality gave me several advantages:

  • Super easy setup
  • No learning curve — no need to change any code
  • Instant and unlimited scalability
  • No infrastructure to manage
  • Faster pandas execution (However, I need to benchmark this)
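A minimal sketch of the import swap described above, assuming the Terality client is installed and the account is configured as in their onboarding guide; the file path and column name are hypothetical:

    # pip install terality   (their client library, from PyPI)
    import terality as pd  # drop-in replacement for "import pandas as pd"

    # The rest of the code keeps plain Pandas syntax; execution happens on
    # Terality's managed cluster instead of the local machine
    df = pd.read_csv("game_data.csv")  # hypothetical path
    df = df.sort_values("match_duration")
    print(type(df))  # terality.DataFrame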

I’m tempted to test the limits of this tool by working with datasets larger than a few GB, such as a large public dataset like the COVID-19 Open Research Dataset. My next step would be to test whether the 100x improvement over Pandas code execution is a legitimate claim. Finally, the idea of running Pandas on a horizontal cluster without the setup and administration burden is appealing not only for independent data scientists but also for more sophisticated or enterprise use cases.

Conclusion

There are many use cases with enough data to break Pandas in local or cloud environments. Like many other data scientists, I tried several coding techniques and tools before looking outside the box and experimenting with an external solution that gave me better results. You should try Terality to see whether it is the right tool to solve your Pandas memory errors too. You just need to contact their team on their website, and they’ll guide you through the process.

