
An App to Optimize a Pandas DataFrame With Streamlit

Reduce Pandas DataFrame memory by 50% or more with this code and app.

Making Sense of Big Data

Photo by Yulia Matvienko on Unsplash

I built a simple app that takes a CSV file and returns a memory-optimized pickle file for use as a Pandas DataFrame. This story shares my experience building with Streamlit, describes the problem to be solved, and presents a prototype of the app.

This article may be for you if you are working with large CSV files, Pandas DataFrames, and Python.


Edit on 7 April 2021: Since writing this article, I have created and deployed a Python package called pd-helper that runs an optimized version of the code discussed in this story. You can search for it as "pd-helper" or run pip install pd-helper. For more, see PyPI.

The Problem

I often spend most of my time cleaning, wrangling, and munging data before getting to the heart of a data project. However, as my datasets grow larger, I find myself focusing more on optimizing for memory. In other words, when working with Pandas, we need to make the DataFrame consume as little memory as possible.


Existing Solutions

There are many articles and gists on how to code a solution, and I have written a few myself. However, despite the many articles, sometimes the code is too much and implementation is overwhelming. So, what if you have a CSV file and want an optimized Pandas DataFrame without any of the overhead?


The Code

I have open-sourced this app with the help of numerous other blogs and resources; however, the orchestration and implementation of the app is my original work. References to other works are provided in-line with code; however, there are two key resources worth mentioning on their own.

First, I adapted a function from this GitHub gist to programmatically reduce the numeric precision of data, e.g., from float64 to float32. My improvements include transforming data types to boolean, string, datetime, and/or categorical where appropriate. The function’s purpose is to change all columns to a dtype other than object, if at all possible or reasonable. The goal is to reduce memory use by 50% or more.

Second, I leveraged Streamlit blogs to understand how to enable data download of a pickled DataFrame. Unlike CSV files, pickled files (among others such as HDF5) are important because they persist data transformations as described above.
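The download step boils down to serializing the DataFrame to bytes and handing those bytes to Streamlit. Here is a minimal sketch of that idea; the helper name `pickle_bytes` is my own, and the `st.download_button` wiring shown in the comments assumes a recent Streamlit release (earlier blogs used base64 download links instead):

```python
import io

import pandas as pd


def pickle_bytes(df: pd.DataFrame) -> bytes:
    """Serialize a DataFrame to pickle bytes for a download link."""
    buffer = io.BytesIO()
    df.to_pickle(buffer)  # preserves dtypes, unlike writing back to CSV
    return buffer.getvalue()


# Hypothetical wiring inside the Streamlit app:
# import streamlit as st
# st.download_button("Download pickle", data=pickle_bytes(df),
#                    file_name="optimized.pickle")
```

Because the pickle preserves dtypes, a user who re-loads the file with `pd.read_pickle` gets the optimized DataFrame back exactly, categories and all.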

Here is how to do this yourself in code. The following gist provides a function called reduce_precision() that consumes a DataFrame and returns a DataFrame with reduced memory. You should be able to copy and paste this into your code and let it do some memory magic.
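In case the gist embed does not render, here is a simplified sketch of what such a function can look like. It is a reconstruction, not the exact gist: the 50% uniqueness threshold and the text-to-boolean mapping are my assumptions, and datetime handling is omitted for brevity:

```python
import pandas as pd


def reduce_precision(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with each column cast to a leaner dtype."""
    df = df.copy()
    for col in df.columns:
        dtype = df[col].dtype
        if pd.api.types.is_integer_dtype(dtype):
            # e.g. int64 -> int8 when values fit
            df[col] = pd.to_numeric(df[col], downcast="integer")
        elif pd.api.types.is_float_dtype(dtype):
            # e.g. float64 -> float32 when values fit
            df[col] = pd.to_numeric(df[col], downcast="float")
        elif dtype == object:
            unique = df[col].dropna().unique()
            if set(unique) <= {"True", "False"}:
                # booleans stored as text (NaN handling simplified here)
                df[col] = (df[col] == "True").astype("boolean")
            elif len(unique) < 0.5 * len(df):
                # few distinct values -> categorical is much smaller
                df[col] = df[col].astype("category")
            else:
                df[col] = df[col].astype("string")
    return df
```

The object-column branch is where most of the savings come from: a repeated string stored as `category` costs one integer code per row instead of a full Python string.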


The App

Two things happened recently. First, I wanted to get experience building an app that has a lot of user interactions such as data upload, data manipulation, and data download. Second, I wanted to make something useful – at the very least, create an app that I would use regularly. As a result, when I found myself writing this optimization code for another project, I thought it might be worth a shot to make this app about data optimization.

App Link: [EDIT – Due to app limitations (limit of three per user), I have replaced the app on Streamlit, but the full code is still available in GitHub to run on your own!]

Data

To test and develop the app, I wrote another function that creates some perfectly imperfect data to the tune of 500,000 rows and 58 MB on disk as a CSV file (available here). I like the 58 MB size because I think it’s right at the threshold of being too much to handle both on disk and in memory.
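A generator along these lines works well for testing; the column names and value choices below are illustrative, not the exact ones from the app:

```python
import numpy as np
import pandas as pd


def make_messy_data(n_rows: int = 500_000, seed: int = 0) -> pd.DataFrame:
    """Generate deliberately imperfect data to exercise the optimizer."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "id": np.arange(n_rows, dtype="int64"),           # downcastable integers
        "price": rng.uniform(0, 100, n_rows),             # float64 that fits in float32
        "item": rng.choice(["Toaster", "toaster", "Kettle"], n_rows),  # inconsistent strings
        "in_stock": rng.choice(["True", "False"], n_rows),             # booleans stored as text
    })
```

Every column starts in a wasteful dtype on purpose, so a successful optimization pass has something to fix in each one.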

UX

Working with Streamlit has been an interesting way to gain front-end experience with pure Python. In this prototype, the core function is to transform the user’s input. The default Streamlit UI is pretty clean and I tried to maintain that aesthetic to focus users on the workflow. However, there are things that users expect to see when making these transformations.

For example, when reducing a DataFrame’s size, I usually filter certain columns and check data types by column name. As a result, I created a responsive feature selector that filters the DataFrame by column – initially it is populated with all columns, which the user can later edit.
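The selector pairs nicely with a small per-column summary so users can see dtypes before pruning. A sketch of that idea (the helper name and the Streamlit wiring in the comments are assumptions, not the app's exact code):

```python
import pandas as pd


def column_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype and memory, shown alongside the column selector."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "memory_bytes": df.memory_usage(deep=True, index=False),
    })


# Hypothetical Streamlit wiring for the responsive selector:
# import streamlit as st
# selected = st.multiselect("Columns", options=list(df.columns),
#                           default=list(df.columns))
# st.dataframe(column_summary(df[selected]))
```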

In addition, I wanted to make sure the user gets feedback on how much memory is saved to show the value of the process. As a result, I ended up creating a side-by-side layout with containers to display the change in memory.
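Computing the figure to display is a one-liner on top of `memory_usage`; the layout comments below sketch one possible two-column arrangement, not the app's exact code (older Streamlit used `st.beta_columns`, newer releases use `st.columns`):

```python
import pandas as pd


def mem_mb(df: pd.DataFrame) -> float:
    """Total DataFrame memory footprint in megabytes."""
    # deep=True counts the actual Python strings in object columns
    return df.memory_usage(deep=True).sum() / 1024 ** 2


# Hypothetical side-by-side Streamlit layout for the before/after display:
# import streamlit as st
# left, right = st.columns(2)
# left.write(f"Before: {mem_mb(raw_df):.1f} MB")
# right.write(f"After: {mem_mb(optimized_df):.1f} MB")
```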


Demo (GIF)

A version 0 prototype of a Data Helper App – From the Author Justin Chae

Results

In memory, the original DataFrame shrinks from 34 MB to about 15 MB, a savings of over 50%. On disk, the file size is reduced from 58 MB to 31 MB – a smaller savings, but one that can be improved with compression such as .bz2 or .zip.
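Pandas makes the compression step nearly free, since it infers the codec from the file extension. A quick sketch of the on-disk savings (file names and sizes here are illustrative):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"text": [f"row-{i}" for i in range(50_000)]})

with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, "df.pkl")
    packed = os.path.join(tmp, "df.pkl.bz2")
    df.to_pickle(plain)
    df.to_pickle(packed)  # bz2 compression inferred from the extension
    plain_size = os.path.getsize(plain)
    packed_size = os.path.getsize(packed)
```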

Improvements

This prototype works exactly as I would hope, but there are a few improvements that I still need to work on. Here are the top three issues so far:

  1. Text Processing. The categorical dtype is applied without first cleaning the unique values. As a result, "Toaster" is treated as a different object than "toaster" and results in two categories that describe the same thing. To be addressed with conditional checks and text processing.
  2. Data Integrity. While working on the app features, I learned from the documentation that there is a security consideration with un-pickling data. As a result, since the app provides pickled data, I need to include some type of file verification with HMAC to mitigate a cybersecurity risk.
  3. Features. I have found apps like the pandasgui to be quite useful in exploring DataFrames and there are some ‘nice to have’ features that convey insights about the table. Although the app provides a very brief overview, I would like to include other useful features. For instance, some high-level histograms and statistics that visualize data might do the trick.
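On item 2, the standard library's `hmac` module covers the verification part. The sketch below shows the general pattern, not the app's actual implementation; the key name is a placeholder and a real key would live outside source control:

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical placeholder key


def sign(data: bytes) -> bytes:
    """Tag pickle bytes so the app can verify them before unpickling."""
    return hmac.new(SECRET_KEY, data, hashlib.sha256).digest()


def verify(data: bytes, tag: bytes) -> bool:
    # compare_digest runs in constant time, avoiding timing side channels
    return hmac.compare_digest(sign(data), tag)
```

The app would ship the tag alongside the pickle and refuse to call `pd.read_pickle` unless `verify` passes.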

Conclusion

In this story, I share a prototype app that solves a data optimization problem that I encounter almost every day. The problem deals with programmatically reducing the precision of columns in a DataFrame so that it consumes less space in memory and on disk.

To build up experience with UX design, I decided to turn this coding problem into a web app development project. Now, users can upload a CSV file and produce a pickled Pandas DataFrame that reduces memory consumption by about half. With some additional features, this app might be useful enough to kick off a new project or get started quickly without writing a ton of code.

Although there are still improvements to be made, I am pleased with the initial product and am happy to have gained some experience at the front-end of the stack.


All the code for this project in my GitHub Repository Here:

https://github.com/justinhchae/app_helper

The App:

[EDIT – Due to app limitations (limit of three per user), I have replaced the app on Streamlit, but the full code is still available in GitHub to run on your own!]

The Python Package (Edited 7 April 2021):

An updated version of this code is now available as a convenient Python package at https://pypi.org/project/pd-helper/ or through pip install pd-helper.


I am very interested to know if this app is helpful to you. Do you like it? Does it work? What features should it have? Open an issue on GitHub or leave a comment here. Thanks!

