
Deploy Code – Crash App – Learn Lessons
What happens when you deploy a data app without coding for memory optimization? In my case, at least, the app crashed and I spent days painfully refactoring code. If you are luckier or smarter (or both), then you have nothing to worry about. Otherwise, consider lessons from my mistakes and some helpful resources to avoid your own special headaches.
How to avoid my optimization mistakes and deploy your app for the win.
Coding for the Web Vs. Coding for Me
I have always acknowledged the importance of writing optimized code but I did not fully appreciate what it meant until deploying a Web app. On my laptop, even the most poorly written code will likely run, albeit slowly. However, the consequences on the Web are far more severe – memory leaks and inefficient code can cripple the experience. If you don’t like waiting on an app to load, neither does the user.
In a previous article (linked above), I share a few issues that I encountered when deploying my app. Although I successfully got the app up and running, it started crashing after adding a larger dataset. As a result, in this article, I share some post-deployment problems and solutions.
Two major issues after deploying the Streamlit app:
- App unavailable because it has gone over its resource limits
- App crashes a few hours after deployment and continues crashing after repeated reboots

Summary of Detailed Problems and Key Resources:
- DataFrame size too large at over 270 megabytes (MB)! By comparison, 100 MB is probably already way too large. Although there is not a perfect size, for a table of 1 million rows, I eventually got the file down to about 50 MB – still hefty but manageable.
Lesson 1: Exhaust all possible options to reduce the DataFrame’s size.
- App holds DataFrame in memory. There is a tradeoff here – hold data in memory for fast access, but hold too much and everything breaks. Without thinking about it, I coded the app to hold the entire DataFrame (the one that was way too large to begin with) in memory – this turned out to be a disaster.
Lesson 2: Be intentional about what is stored in memory.
- Function Caching. After I started searching for advice on a crashing Web app, I discovered that Streamlit has a handy @st.cache function decorator. Then, after implementing caching incorrectly, I learned about TTL. To save you some time, think about caching and TTL from the start. For any function that you call multiple times and that is expensive to run, caching might help by holding the function’s result in memory. Caching can be useful for generating Plotly graphs once and retrieving them from the cache later.
Lesson 3: Cache functions with @st.cache and don’t forget TTL parameters.
- DataFrame data hogs. In a Pandas DataFrame, each column’s size can be reduced by half or more when transforming from an object dtype to categorical dtype or integer dtype. Early on, I transformed most but missed a few columns – this oversight ended up eating my lunch.
Lesson 4: Apply the least precision necessary in the DataFrame; avoid ‘object’ dtypes like the plague.
- Junk Imports. This is a catch-all issue with program design. As one example, at the top of each Python file, I had import statements for libraries and packages needed for data cleaning but not required for the Web app. Although these junk imports did not take up a great amount of memory, they still consumed bits of time and memory that I could not afford to give.
Lesson 5: Import only what you need, get rid of the junk.
A Few Detailed Problems and Solutions
Let’s start with a rough snapshot of the dumpster fire that is the original architecture in my program. In Figure 2, Streamlit calls app.py which, in turn, reads in data as DataFrames and calls all the program’s other functions. At first, this works just fine with a small dataset on my laptop. Unfortunately, things go off the rails after deploying the app with a full dataset. By the end of the day, my tiny app exceeded its resource limits and shut down.

Now, consider a better architecture in Figure 3. Instead of calling in a massive table and bloating the app, everything is slimmed down. The app only calls a small file that is needed for context menus and calls other data via functions when needed. To be sure, this is not perfect, but it is a vast improvement from the original.
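To make the improved layout concrete, here is a minimal sketch of what the slimmed-down app.py might look like. The file names, column names, and function names below are hypothetical placeholders, not the project’s actual structure:
# app.py – minimal sketch of the slimmed-down architecture (names are hypothetical)
import pandas as pd
import streamlit as st

# small file holding only the values needed for context menus and filters
context_df = pd.read_pickle('data/context_menu.pkl')
choice = st.sidebar.selectbox('Category', context_df['category'].unique())

# larger data is loaded on demand through a cached function, not at startup
@st.cache(max_entries=10, ttl=3600)
def load_slice(category):
    # read only the slice of data needed for the selected category
    return pd.read_pickle(f'data/slices/{category}.pkl')

st.dataframe(load_slice(choice))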

Caching
There are two important scenarios with caching: (1) caching with ttl and (2) caching with hashed dictionaries.
First, consider a typical Streamlit call to plot a graph from some other code (stat_charts.py) that makes the actual plot with some function (graph1()). In this case, the program generates the graph anew on each run. In some cases, this can cause your app to crash or use up its allotted resources.
# Streamlit with Plotly, no caching
import streamlit as st
from figs.stat_charts import Achart
st.plotly_chart(Achart().graph1())
According to the docs, you might try to use the @st.cache decorator with a function wrapper. However, after a few hours of repeated calls, memory problems will accumulate and crash the app as described in this blog post.
# with basic caching, eventually crash app
@st.cache()
def make_fig():
    some_fig = Achart().graph1()
    return some_fig

def show_objects():
    st.plotly_chart(make_fig())
Instead, make sure to include the max_entries and ttl parameters to manage the cache size. For my part, I am going forward with these parameters as a default unless there is reason not to.
# with basic cache controls
@st.cache(max_entries=10, ttl=3600)
def make_fig():
    some_fig = Achart().graph1()
    return some_fig

def show_objects():
    st.plotly_chart(make_fig())

Second, depending on your functions, you may run into a CachedObjectMutationWarning, which basically means something inside the function changes between runs. According to the documentation, mutating a cached function’s return value is generally undesirable, but there is a workaround from this blog post.
CachedObjectMutationWarning: Return value of overview_data() was mutated between runs.
By default, Streamlit's cache should be treated as immutable, or it may behave in unexpected ways. You received this warning because Streamlit detected that an object returned by overview_data() was mutated outside of overview_data().
A solution from the Streamlit blog, which is quite clever, returns the plot wrapped in a dictionary whose hash function is overridden, so Streamlit leaves the cached object alone. Subsequent calls simply read the figure back out of the dictionary. This is super cool!
# cache plots in hashed dictionary
@st.cache(hash_funcs={dict: lambda _: None})
def make_fig():
    some_fig = Achart().graph1()
    cached_dict = {'f1': some_fig}
    return cached_dict

def show_objects():
    charts = make_fig()
    st.plotly_chart(charts['f1'])
Optimize Pandas
For Pandas optimization, there are two layers to consider. First, the data types (dtypes) of each column. Second, the overall size and format of the source file and DataFrame. Since both optimization topics are covered well by other TDS articles (see Resources), I will provide one example within the context of this Streamlit deployment.
In an example of a column that should be boolean (True/False), I mistakenly turned it into text at some point. As a result, the column takes up huge amounts of space with two non-boolean values (‘nan’ and 1.). In this case, the ‘nan’ is a literal string instead of the numpy.nan type.
Don’t let ‘nan’ strings take up valuable memory.
# correcting a mistake for a boolean column
# a DataFrame with about 1 million rows
print(df['flag_col'].unique())
# >>> [nan 1.] should be boolean!
print(df['flag_col'].memory_usage())
# >>> 15,274,640 ouch, way too much memory!
To patch this specific case, map a dictionary with new values and apply a new dtype. Notice the memory savings when a boolean column is properly typed as boolean.
# correct a Pandas column that should be boolean
# map the stray values to real booleans (treating 'nan' as False here),
# then apply the nullable boolean dtype
df['flag_col'] = df['flag_col'].map({'nan': False, 1.: True})
df['flag_col'] = df['flag_col'].astype('boolean')
print(df['flag_col'].memory_usage())
# >>> 8,591,985 huge savings in memory!
A checklist of other optimization considerations (apply as appropriate):
- Use the least precision required for analysis for floating point numbers and integers, e.g. float16 instead of float64
- Use category dtype instead of object
- Intentionally read in only the specific columns you need, e.g. df = df[['col1', 'col2', 'col3']]
- Compress large files but pickle smaller files uncompressed (think about the tradeoff between space and decompression time for large files versus speed for smaller files) as described in Resources. A brief sketch of these points follows this list.
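As a rough illustration of the checklist, here is a minimal, self-contained sketch; the columns, values, and file name are made up for demonstration rather than taken from the project’s data:
# hypothetical demo of the checklist: dtypes, selected columns, compression
import numpy as np
import pandas as pd

n = 100_000
df = pd.DataFrame({
    'state': np.random.choice(['CA', 'NY', 'TX'], size=n),  # repetitive text
    'score': np.random.rand(n),                             # float64 by default
    'count': np.random.randint(0, 100, size=n),             # int64 by default
})
print(df.memory_usage(deep=True).sum())        # baseline size in bytes

df['state'] = df['state'].astype('category')   # category instead of object
df['score'] = df['score'].astype('float16')    # least precision that still works
df['count'] = df['count'].astype('int8')       # small integers fit in int8
print(df.memory_usage(deep=True).sum())        # a fraction of the baseline

df = df[['state', 'score']]                       # keep only the columns a view needs
df.to_pickle('table.pkl.gz', compression='gzip')  # compress the larger file on disk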
Updates on 11 and 13 January 2021:
- For Plotly charts in Streamlit, create smaller slices of uncompressed pickle files instead of reading from a large, compressed file. Each file should contain only the columns and data you need for a chart (see the sketch after this list). While working on this article, I found that, despite optimization and compression, displaying charts still ran too slowly when reading from a single, large table.
- Changed the example for the bool transformation from ‘bool’ to ‘boolean’ – this is a major issue! For more, read about it here.
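A minimal sketch of the per-chart slicing idea; the file paths and column names are hypothetical placeholders:
# hypothetical sketch: one small, uncompressed pickle per chart
import pandas as pd

# the large, compressed source table
df = pd.read_pickle('full_table.pkl.gz', compression='gzip')

# keep only what a single chart needs, then write a small uncompressed slice
chart_cols = ['date', 'category', 'score']
df[chart_cols].to_pickle('slices/chart1.pkl')

# in the app, each chart function reads just its own slice
chart1_df = pd.read_pickle('slices/chart1.pkl')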
Resources
Sometimes resource sections are an afterthought; however, I wanted to highlight and share some informative articles about a subject I never thought to research before.
- Data File Management for Persisting Pandas DataFrames
- Managing Memory in Execution for Pandas DataFrames
- Diagnosing and Fixing Pandas DataFrame Memory
- Make working with large DataFrames easier, at least for your memory
- Using Python’s Garbage Collector with Pandas DataFrame for Higher Efficiency and Performance
Conclusion
In this article, I share some post-deployment optimization pitfalls for a simple data dashboard app built with Streamlit and Python. Previously, I shared some pre-deployment gotchas, but afterwards I discovered a number of additional issues with memory and optimization. The issues were not clear at first, but after the app kept crashing, I realized I had serious design flaws. Thanks to a ton of great advice from the Web, I resolved and patched most issues.
The lessons boil down to leveraging caching in the Web app, reducing the data file size as much as possible, using the least precision necessary in Pandas DataFrame dtypes, and simplifying the program’s architecture. As for next steps, I’m considering how to leverage libraries such as Modin and Spark for when I run out of tricks and need more performance.
My project is available here and via GitHub, and it implements all the concepts described in this article. Thanks for reading, and I wish you the best in your next project!