Data Stacks For Fun & Nonprofit — Part I

Choosing a data warehouse for your side project

Andrew Stewart
Towards Data Science

--

Photo by Sara Kurfeß on Unsplash

The engine that powers nearly any data science team is some form of data warehouse, which serves as a central collection point and repository for an organization’s various disparate data sources. These sources may range from the organization’s primary transactional databases to logs, e-mails, SaaS APIs, CRMs, ERPs, and all manner of random flat files floating around. The purpose of the data warehouse is to collect all of these data into a single source of truth with a single point of access for human consumers. The terminal endpoint in this workflow might be a manual SQL query, a BI platform, or a data scientist’s automated model training workflow.

Image from dbt, which is an amazing open source data warehouse management framework.

Data warehouses are a useful resource when they come with the financial backing of an organization, but how does an individual data practitioner emulate that experience for personal projects or as a training resource? How can one create a personal data warehouse on a personal budget? This is a question I have long sought a satisfactory answer to, and so this is the first in a series of blog posts in which I will explore some options available from the modern ecosystem of analytics platform offerings. This first part focuses on laying out the criteria and imagining what some possible solution stacks might look like; in later installments I will investigate and evaluate the different options.

The obvious initial answer is to set up a local PostgreSQL database and write some Python/R/SQL code, but this doesn’t quite emulate the experience we are after. It’s useful for learning SQL and the well-earned pain of hand-crafted ETL workflows, but what it lacks is working with remote data stores, flexible scalability, CI/CD workflows, and an internet-accessible data store that can serve as a hub for various BI solutions. Students in the data science course I teach often express an interest in having access to large-scale data warehousing platforms as part of the course curriculum, which is obviously prohibitively expensive for that purpose, but I can understand when students feel that creating toy SQLite databases doesn’t quite cut it.

What would the ideal personal data warehouse look like? Well, ideally it would be inexpensive, as most of us would probably like to avoid paying the large monthly fees associated with the common crop of SaaS platforms. Serverless is preferable to continuously running servers wherever possible: most of the time we want our data at rest (incurring minimal charges), incurring compute charges only when loading and transformation operations require them. We would also want to minimize configuration and management burdens as much as possible, since we do not have the benefit of system administrators for support. Open source components are probably also desirable.

So before we explore which options meet these criteria, let’s first briefly review the current ecosystem of platforms we seek to utilize (cheaply) or replace. To store our data, we have several warehouse platforms to choose from, including Amazon Redshift, Google BigQuery, and Snowflake.

ETL (Extract/Transform/Load) has traditionally been among the most laborious chores involved in developing and maintaining a data warehouse, typically involving significant software engineering with SQL, Python, etc. Today we have many commercial platforms available that abstract away much of the repetitive logic inherent in ETL systems. Some of these include Fivetran, Stitch, and Matillion.

On the data visualization side of the stack, we have BI platforms including Tableau, Looker, PowerBI, and many, many more. Some of them have free or public variants, but we also have ample ability to craft our own dashboards via Shiny, Jupyter, and more technologies than I care to enumerate. For the purpose of this series, we’ll focus primarily on the data store and ETL portions of the stack.

One component that I am confident I want to include is the amazing and open source dbt.

dbt is a development environment that speaks the preferred language of data analysts everywhere — SQL. With dbt, analysts take ownership of the entire analytics engineering workflow, from writing data transformation code to deployment and documentation.

With dbt, you write data transformation code following one of the oldest software engineering best practices: modularity. Writing modular SQL means your queries are easier to update and troubleshoot, and will execute dramatically faster.

In a nutshell, dbt is a powerful and convenient framework for organizing all of the transformations and data models within your warehouse. ETL platforms like Fivetran and Stitch are great for loading data into your warehouse, but the resulting data might not be organized the way you want, and you will likely also want to integrate some of these sources with one another. This is where dbt comes in. The tool itself is free, its cloud deployment platform has a free tier that is perfect for personal projects, and there is an excellent user community active on Slack and Discourse.
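
To make the modularity point concrete, here is a rough sketch of what a pair of dbt models might look like. The source and column names (a hypothetical “stripe” source with a payments table, rolled up into a per-customer summary) are placeholders, and the source itself would need to be declared in a .yml file; the point is simply how ref() lets one model build on another.

    -- models/staging/stg_payments.sql
    -- Staging model: light cleanup of a raw table loaded by the ETL tool
    -- (assumes a "stripe" source declared in a sources .yml file).
    select
        id as payment_id,
        customer_id,
        amount / 100.0 as amount_usd,  -- cents to dollars
        created_at
    from {{ source('stripe', 'payments') }}

    -- models/marts/customer_revenue.sql
    -- Mart model: builds on the staging model via ref(), so dbt resolves the
    -- dependency graph and runs the models in the correct order.
    select
        customer_id,
        count(*) as payment_count,
        sum(amount_usd) as lifetime_revenue
    from {{ ref('stg_payments') }}
    group by customer_id

Each model lives in its own file, and dbt materializes them as views or tables in the warehouse according to your configuration.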

(Left) image by Snowflake | (Center) image by Google — BigQuery | (Right) image by Amazon Web Services — Athena

What about data storage? According to our search criteria, traditional server-based platforms like Redshift are not options. As much as I love Snowflake, it doesn’t have any sort of permanent free tier that would suit our needs. It is a pay-as-you-go platform, so it’s possible that costs could be minimized, but I’m not sure exactly how low that cost point could be kept. The fact that there appears to be a $25 monthly minimum is nearly a non-starter, but I would not rule out Snowflake entirely.

AWS Athena is an interesting option. It is a serverless deployment of Presto that sits on top of your own S3 objects, so you only pay for the queries you run plus negligibly low S3 storage costs. There is even dbt support for Athena, though it is currently early in development. Athena certainly looks like it could be a good option.
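
As a rough sketch of what that looks like in practice, you define an external table over a prefix in one of your own buckets and then query the files in place with standard SQL. The bucket, prefix, and columns below are placeholders.

    -- Hypothetical table over CSV files staged in S3; Athena bills only the queries.
    CREATE EXTERNAL TABLE IF NOT EXISTS raw_events (
        event_id   string,
        event_time string,
        payload    string
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3://my-personal-warehouse/raw/events/'
    TBLPROPERTIES ('skip.header.line.count' = '1');

    -- Queries run against the files in place, with no cluster to manage.
    SELECT count(*) FROM raw_events;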

Another option that I hadn’t considered before is Google’s BigQuery. I had always just lumped BigQuery in as the GCP analog of Redshift or Snowflake, but it turns out that Google actually offers a free tier of BigQuery so long as you operate within certain limits. As long as you stay within 10 GB of storage and 1 TB of query processing per month, you can use BigQuery entirely for free!

What about ETL platforms? Services like Fivetran and Stitch are probably too pricey for individual use. Fivetran did recently switch to consumption-based pricing, so it’s possible utilization could be capped at an inexpensive level. Much of the value of these paid platforms, however, lies in the challenge of faithfully mirroring complex internal databases to the warehouse rather than, say, ingesting flat files that all share more or less the same format. I therefore think we can defer the extraction of data from APIs, e-mails, and similar sources to future projects. The basic endpoint for those future extraction services would be to deposit flat files into an object store (S3 or otherwise).

With an Athena-based warehouse, one could simply develop workflows that extract data into S3 for staging. AWS also offers Glue and other orchestration services, which could probably be kept within budget. BigQuery has similar sibling services, including Dataflow and Datastore, as well as the ability to load data directly from Google Cloud Storage and Google Drive; there is also the option to transfer data into BigQuery from S3. I said BI selection is outside the scope of this search, but it’s also worth mentioning that BigQuery integrates directly with DataStudio.
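
As an illustration, something like the following (with a placeholder dataset, table, and bucket) would expose CSV files sitting in Google Cloud Storage to BigQuery as an external table, which dbt models could then build on.

    -- Hypothetical external table over CSV files in a Cloud Storage bucket.
    CREATE EXTERNAL TABLE IF NOT EXISTS raw.events (
        event_id   STRING,
        event_time TIMESTAMP,
        payload    STRING
    )
    OPTIONS (
        format = 'CSV',
        skip_leading_rows = 1,
        uris = ['gs://my-personal-warehouse/raw/events/*.csv']
    );

Queries against an external table count toward the 1 TB monthly processing allowance, so for heavily queried data it may make more sense to load it into native BigQuery storage, since batch loading itself is free.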

At this point, I think we’ve got three reasonable candidate stacks:

  1. BigQuery data warehouse, using data from Google Drive and S3 data transfers, dbt for data modeling, and some dashboards with DataStudio.
  2. Athena data warehouse, using S3 staged data, dbt + dbt-athena plugin for data modeling.
  3. Snowflake data warehouse, using external storage integrations and Snowpipe for ingest (see the sketch just below this list), dbt for data modeling, and Snowsight for dashboards.
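
For the third option, the ingest side might look roughly like the sketch below. The stage, pipe, integration, and table names are placeholders, and the setup of the storage integration itself (which grants Snowflake access to the bucket) is glossed over.

    -- Hypothetical Snowflake ingest path: an external stage over S3 plus a
    -- Snowpipe that auto-ingests new files as they land in the bucket.
    -- Assumes the storage integration and target table already exist.
    create stage events_stage
      url = 's3://my-personal-warehouse/raw/events/'
      storage_integration = my_s3_integration
      file_format = (type = 'csv' skip_header = 1);

    create pipe events_pipe auto_ingest = true as
      copy into raw_events
      from @events_stage;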

I’m going to start playing around with each of these options by configuring accounts on each platform and setting up some simple data models using dbt. In Part II, I will share my thoughts on which architecture seems to be the best fit for low-cost personal projects, and begin to delve deeper into warehouse design and workflow orchestration specific to the selected platform.

What do you think? What would your favorite cloud warehouse stack look like? Feel free to leave your thoughts in the comments!

--

Data scientist with interests in biology, economics, political science, and other complex systems.