How to Stay Within Free Tier Cloud Database Row Limits Without Sacrificing Data

Save money and improve the user experience by hosting only small roll-up tables in the cloud and keeping granular historical tables on your local hard drive

Meghan Maloy
Towards Data Science


Disclaimer: These suggestions are intended for low-stakes projects involving a couple GB of data. You’re on your own if you are working with “big data.”

If you’re working on a project that requires a hosted database, you may be concerned that you’ll quickly exceed the row limits for the free/low-cost plan. Even though it might only cost $9/month to be able to store 100 times more rows, that may not be worth it to you, especially if your project won’t be generating revenue. You might think it’s time to start deleting the records that bring you the least joy…

But wait! You’ve probably got another place to store data for free, right at your fingertips — your local hard drive.

At this point, you might be concerned that I have no idea what I’m talking about. You’re not about to try to use a local database for a live website. If your cat rubs its head on your laptop and accidentally shuts the lid, or the internet cuts out, your site would be rendered useless.

I’m not suggesting that you totally replace your cloud database; just use a local database to supplement it. Keep the detailed historical data on your computer and host only the smaller roll-up tables on the cloud. This way your users will still be able to log in, see aggregated data, and subscribe to emails (all that good database-y stuff), and you’ll have all of the old, high-granularity data lying around if you need it later.

In my case, I made a web scraper that collects info on performances from a comedy venue’s site and uses the data to show some stats on my own website. I’ve been collecting an average of 300 rows a day (one row per comedian per show listed on the venue’s site), which would exceed Heroku’s 10K-row “hobby-dev” tier limit in a little more than a month if I kept appending the results to my table.

I only need to show users aggregated historical data (How many shows has this comedian done? When was the last day they performed?) and detailed future data (Which specific shows will the comedian be at soon? Which other comedians will be at that show?). However, I don’t want to just throw away the detailed historical data. If there’s a bug in my code that’s creating inaccurate aggregations, I want to have the old data around so I can regenerate them. And the more detailed data could be handy for predictive modeling in the future.

My table structure looks like this (with row counts as of 6/28/20, after about a year and a half of collecting data):

The purple tables live on Heroku; however, I also keep local copies as backups. The full history is stored in the fact_shows table, which is used to generate the smaller tables. fact_shows is clearly way over the row limit, but the Heroku-hosted tables are now only a couple hundred rows each (note: dim_upcoming_shows is empty right now because the venue has been closed due to COVID restrictions). All of these tables in aggregate take up 47MB of space on my hard drive.

My scraper runs locally every night and appends the data to fact_shows like this:
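(A minimal sketch of that step, assuming the scrape returns a pandas DataFrame and the local store is SQLite; scrape_venue_site and the database path are placeholder names, not my exact code.)

import sqlite3
import pandas as pd

from scraper import scrape_venue_site  # placeholder: returns one row per comedian per show

LOCAL_DB = "comedy.db"  # placeholder path to the local SQLite database

def append_to_fact_shows():
    new_rows = scrape_venue_site()  # DataFrame of tonight's scraped listings
    with sqlite3.connect(LOCAL_DB) as conn:
        # Always append: fact_shows keeps the full history and is never overwritten
        new_rows.to_sql("fact_shows", conn, if_exists="append", index=False)

if __name__ == "__main__":
    append_to_fact_shows()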

After fact_shows is updated, I generate the dim tables. I defined this generalized update_dim_table function to avoid repeating code for each dim table I wanted to create. It takes a table name and a query, uses the query to roll up the data from fact_shows, and then writes the rolled-up table to the local database and (optionally) to Heroku.
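Roughly, the function is structured like this minimal sketch (assuming pandas, a local SQLite file, and SQLAlchemy for the Heroku Postgres connection; the connection details are placeholders rather than the exact implementation):

import os
import sqlite3
import pandas as pd
from sqlalchemy import create_engine

LOCAL_DB = "comedy.db"  # placeholder path to the local SQLite database

def update_dim_table(table_name, query, write_to_heroku=False):
    """Roll up fact_shows with `query`, then write the result locally and (optionally) to Heroku."""
    with sqlite3.connect(LOCAL_DB) as conn:
        rolled_up = pd.read_sql(query, conn)
        # Replace the local copy of the dim table with the fresh roll-up
        rolled_up.to_sql(table_name, conn, if_exists="replace", index=False)

    if write_to_heroku:
        # Heroku provides the Postgres URL as DATABASE_URL; newer SQLAlchemy versions
        # expect a postgresql:// prefix instead of Heroku's postgres://
        cloud_url = os.environ["DATABASE_URL"].replace("postgres://", "postgresql://", 1)
        engine = create_engine(cloud_url)
        rolled_up.to_sql(table_name, engine, if_exists="replace", index=False)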

Here’s an example of how I apply that function to create the dim_comedian_stats table, which gets written to Heroku because I need it for visualizations on my site:
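(The query below is a simplified stand-in for the real one in the linked script; the column names comedian and show_date are guesses rather than the actual schema.)

# Simplified roll-up: one row per comedian with total shows and most recent performance date
dim_comedian_stats_query = """
    SELECT
        comedian,
        COUNT(*) AS total_shows,
        MAX(show_date) AS last_performed
    FROM fact_shows
    GROUP BY comedian
"""

# Written to Heroku as well as locally, since the site's visualizations query it
update_dim_table("dim_comedian_stats", dim_comedian_stats_query, write_to_heroku=True)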

For more context, the full scraper/fact_shows file is here, and the dim table script is here.

Depending on your particular use case, you might not be able to aggregate the data. If your users need insight into granular details, consider other approaches to trimming down your hosted data, like only pushing the last 30 days (while keeping the longer history on your laptop); a sketch of that approach follows below. The general principle is to keep your ideal table structure, with full history, stored locally; figure out the bare minimum required to satisfy your users’ needs; and push only that small table to your cloud service. On top of saving you money, this can improve the user experience because website queries run against smaller tables, so everybody wins!
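In the show-tracking example, that last-30-days idea could reuse the update_dim_table sketch from above with a filter instead of a roll-up; dim_recent_shows is an illustrative table name, show_date is again a guessed column, and the date function assumes a SQLite source.

# Push only the trailing 30 days of detail; the full history stays in the local fact table
recent_shows_query = """
    SELECT *
    FROM fact_shows
    WHERE show_date >= DATE('now', '-30 days')
"""

update_dim_table("dim_recent_shows", recent_shows_query, write_to_heroku=True)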
