Data Stacks For Fun & Nonprofit — Part II

Designing a complete data stack with Meltano, dbt and Superset

Andrew Stewart
Towards Data Science


Photo by Iva Rajović on Unsplash

In my last article, I explored some inexpensive options for assembling your own data warehouse for training exercises and side projects. Since then, I came up with a somewhat more clever title for this series: Data Stacks For Fun & Nonprofit. I’m expanding the scope of this series to cover the entire data stack, not just the warehousing component. Last time, I explained that data stacks tend to consist of three components (ETL, data warehouse, and BI), and that for the ETL and BI components your basic choices are either DIY or commercial platforms priced for enterprise customers.

Well I, of course, spoke too soon, because it turns out that there have been some very interesting open-source developments on both the ETL and BI fronts! I’ll dig into those more in this article, but as a result, I believe we can now assemble a true end-to-end analytics stack from freemium and open-source components. Therefore, this series will be geared towards those looking to set up their own “DS4FNP” stack, whether for personal side projects, as a teaching resource, or for nonprofit organizations that can’t operate on an enterprise budget.

Our complete data stack! (Left) image by Meltano | (Center) image by dbt | (Right) image by Apache Superset.

In the rest of this article, I’ll report some conclusions from the DW options I explored last time, particularly through the lens of dbt. I’ll then introduce some of the open-source ETL and BI options that I think complete the stack. Finally, I’ll wrap up by describing how to set up a completely local development environment, as well as presenting some options for a hosted production environment.

I previously identified three candidate data warehouse platforms that make up the heart of our stack: BigQuery, Snowflake, and Athena. My conclusions on all three platforms for this purpose are as follows…

Snowflake I had somewhat discounted since, while it is perhaps my favorite of the three, it doesn’t offer any kind of freemium pricing model. Despite being “pay as you go,” you still have to pay something to get going. However, it turns out that you can practically keep your costs down to the minimum of $25/month. Now, while that is a fairly low monthly cost, I still don’t know if it’s low enough to justify for personal projects. But if the project were of enough personal importance to me, and I felt that I could maximize the utility of my paid resource usage, I would probably bite the bullet and pay. Snowflake really is that great. With that said, and after an exciting IPO, it would be really great to see Snowflake release some kind of limited freemium model specifically for small projects, training, etc.

BigQuery is the platform I had the most traction with. Setting up an account and a BigQuery instance is a very straightforward process. Configuration in dbt is equally simple, and even the authentication setup (a process I’ve often thought Google Cloud traditionally makes needlessly complicated) was a breeze. I am very likely going to focus on BigQuery as the target warehouse for most of this series, while trying to remain as warehouse-agnostic as possible. With the ease of setup, its excellent pricing model (easily free if desired), and the number of integrations available, you can’t go wrong with BigQuery.
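
To make that concrete, here is a rough sketch of what a BigQuery entry in ~/.dbt/profiles.yml can look like. The profile, project, and dataset names are placeholders, so treat it as a template rather than a copy-paste recipe:

```bash
# Hypothetical BigQuery entry in ~/.dbt/profiles.yml (names are placeholders).
# `method: oauth` reuses credentials from `gcloud auth application-default login`;
# automated jobs would typically switch to `method: service-account` plus a keyfile.
mkdir -p ~/.dbt
cat >> ~/.dbt/profiles.yml <<'EOF'
ds4fnp:
  target: prod
  outputs:
    prod:
      type: bigquery
      method: oauth
      project: my-gcp-project   # your GCP project id
      dataset: analytics        # dataset dbt will build models into
      threads: 4
EOF
```

With something like that in place, `dbt debug` should confirm the connection end to end.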

Athena is a little different from the other two. Its main advantages are that 1) it’s entirely under your control and within your own AWS account, and 2) it is the cheapest option if AWS is your preferred ecosystem. The main problem with Athena is that you will have to do all of the setup yourself. I think this could be made into a more turnkey solution with the help of Terraform or Pulumi, and that is certainly something I’d like to explore later on. There is also a dbt-athena plugin, though I found the documentation still fairly raw and I’m still working through configuration hurdles. With time, I think the maturation of the dbt plugin and an Athena provisioning script will make Athena a very decent warehousing option for DS4FNP. The native interoperability with S3, together with the potential abstraction of Presto, makes Athena a really desirable warehouse target.
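
To give a sense of the setup involved (and of what a Terraform or Pulumi module would eventually encapsulate), the bare minimum is an S3 bucket for query results, a Glue database for table metadata, and an Athena workgroup. A rough AWS CLI sketch, with the bucket, region, and names as placeholders:

```bash
# Hypothetical manual Athena provisioning (bucket, region, and names are placeholders).
aws s3 mb s3://ds4fnp-athena-results --region us-east-1            # bucket for query results
aws glue create-database --database-input '{"Name": "analytics"}'  # catalog database for tables
aws athena create-work-group \
  --name ds4fnp \
  --configuration 'ResultConfiguration={OutputLocation=s3://ds4fnp-athena-results/}'
```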

But we’re actually getting a little ahead of ourselves in choosing cloud warehouse platforms. In developing your analytics project, you can get started with a local PostgreSQL or SQLite database, and of course dbt. In fact, with dbt you can set up separate connection profiles for any number of the warehouses we have discussed, as well as for our local environment. This means that all we need to get started is a laptop, Python, dbt, and our local ‘warehouse’ (Postgres or SQLite). Later in the series (probably Part III), we’ll put together a full tutorial on setting up our project from scratch.
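
As a sketch of the day-to-day workflow (assuming a local `dev` target for Postgres or SQLite has been added alongside the `prod` BigQuery target shown earlier), switching warehouses is just a flag on the dbt CLI:

```bash
# Hypothetical commands; the dev and prod targets are defined in profiles.yml.
dbt init ds4fnp_analytics      # scaffold a new dbt project
cd ds4fnp_analytics
dbt debug --target dev         # verify the connection to the local 'warehouse'
dbt run --target dev           # build models locally
dbt run --target prod          # build the same models in BigQuery
```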

Alright, so what about these open-source ETL and BI platforms I mentioned earlier? I had good reason to think that these components were only really available as either $$$-expensive enterprise services or time-expensive DIY projects. ETL is generally rather compute-expensive, so there’s not really a strong incentive to offer freemium pricing. Your commercial options there include Fivetran, Stitch, Matillion, and several others. Rolling your own ETL from Python scripts might seem straightforward in the abstract, but most real-world ETL tasks end up falling into one of two types of complexity:

  • Structural complexity, meaning loading data from complex database schemas and API specifications, especially when those structures change over time.
  • Volume complexity, meaning loading high-volume data sources that require either stream or batch processing. This mostly pertains to log data. The complexity here stems not so much from the parsing or transformation logic, but from the infrastructure logic involved in moving large amounts of data around.

In either case, I think people tend to underestimate the long-term effort involved in DIY ETL. It’s an easy trap to fall into because from 10,000 feet it can seem like such a simple, script-able task: just fetch data from one source and send it somewhere. There’s some truth to that pattern, and you often find yourself repeating the same basic process with just enough variation to warrant some level of abstraction, though not quite enough to justify building that abstraction yourself; it’s the type of problem that just begs for a standardized framework, but for a long time those frameworks have been locked away behind commercial platforms.

Image by Meltano

Fortunately, just such a framework exists and is available to developers. Singer, developed by Stitch, is a composable framework for developing “taps” and “targets”. Taps extract data from sources, including databases, APIs, logs, emails, etc. Targets are the destinations for the data extracted by taps, typically your common data warehouses (Snowflake, BigQuery, etc.). Every tap emits its data in a common intermediate format (a stream of JSON messages), and each target loads from that same format, allowing one to compose any combination of tap and target to form an ETL task.
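
In practice, that composition is literally a Unix pipe: the tap writes newline-delimited JSON messages (SCHEMA, RECORD, and STATE) to stdout, and the target reads them from stdin. A minimal sketch, with the tap/target pairing and config file names purely illustrative:

```bash
# Hypothetical pairing; any Singer tap can feed any Singer target.
tap-gitlab --config tap_config.json | target-postgres --config target_config.json

# The stream flowing between them looks roughly like:
#   {"type": "SCHEMA", "stream": "users", "schema": {...}, "key_properties": ["id"]}
#   {"type": "RECORD", "stream": "users", "record": {"id": 1, "name": "..."}}
#   {"type": "STATE",  "value": {"bookmarks": {...}}}
```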

Now Singer isn’t necessarily new, but it also doesn’t necessarily present a complete solution for our needs. Singer provides a great starting point for standardizing the components of our ETL workflows, but it still requires significant configuration management and orchestration. However, this obvious gap has been filled by a few new entrants to the field, among them Meltano (from GitLab) and PipelineWise (from TransferWise). Both of these tools build upon Singer by adding a configuration management and orchestration layer. They also allow you to manage your extraction/loading tasks as code, much like how dbt allows you to manage your transformation models as code. I’ve tried them both, I like them both, and I kinda hope the two projects somehow merge or interoperate, because they each have a different set of strengths that are actually quite complementary. TransferWise seems to have focused more on developing a solid set of Singer taps/targets to a specific standard, while Meltano has focused more on filling in all the blanks to put Singer into practical action as a legitimate replacement for platforms like Fivetran and Stitch. Because Meltano is a bit more relevant to the mission of our DS4FNP project, I’m going to mostly focus on it.

The basic elements of a Meltano-managed project are extractors (Singer taps), loaders (Singer targets), transformers (dbt!), and orchestration (Airflow by default). You can pretty much manage your entire analytics project with Meltano alone. So far I’ve worked through its documented tutorials and set up a few projects. It feels a lot like dbt, and while rolling one’s own production deployment seems feasible enough, I’d be very interested to see a hosted Meltano-as-a-service enter the scene. Meltano is well Dockerized, and I could easily see running it via AWS Batch or similar infrastructure.
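
A minimal sketch of what that looks like in practice, with the tap and target choices purely illustrative:

```bash
# Hypothetical Meltano project setup (plugin choices are just examples).
pip install meltano
meltano init ds4fnp && cd ds4fnp
meltano add extractor tap-gitlab        # Singer tap
meltano add loader target-postgres      # Singer target
meltano add transformer dbt             # dbt transformations
meltano add orchestrator airflow        # Airflow for scheduling
meltano elt tap-gitlab target-postgres --transform=run   # extract, load, then run dbt
meltano schedule gitlab-to-postgres tap-gitlab target-postgres @daily
```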

On the BI side of the stack, your commercial platforms include Tableau, Looker, and Power BI, among others. Snowflake now has Snowsight built in, BigQuery pairs well with Data Studio, and Athena’s analogous partner would be QuickSight. The big standalone platforms really just lack any practical DS4FNP-friendly pricing options, and the DW-specific options feel a bit too vendor-locked for the spirit of this project. This is by far the most subjective component in the entire stack, so by all means insert your own preference here. The DIY side of the BI coin also presents some notable options, including Shiny for R and Dash for Python. I am much less wary of crafting one’s own BI interfaces than I am of DIY ETL workflows.

Image by Apache Superset.

However, in the interest of fashioning a complete stack, it turns out that we do have an ideal option for the BI layer. Superset also isn’t necessarily new, but it has certainly been picking up steam, and it’s now an Apache project. Developed by Airflow creator Max Beauchemin, Superset is quickly becoming the open-source alternative to Tableau (imho). Aesthetically it feels like Tableau and Airflow had an open-source love child, but the built-in SQL Lab also feels a bit like Snowsight or Redash. Readily deployable via docker-compose, Superset can be run locally as part of our DS4FNP development environment, as well as deployed as a long-running web service.
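
Getting a local instance running is, per the Superset repo’s Docker instructions at the time of writing, roughly this:

```bash
# Local Superset via Docker Compose (details may drift with new releases).
git clone https://github.com/apache/superset.git
cd superset
docker-compose up          # brings up the web app, workers, and a metadata database
# Then browse to http://localhost:8088; the compose setup seeds an admin/admin login.
```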

So there we have it, our DS4FNP stack consists of the following components:

  • Meltano provides our ETL (ELT) layer via Singer, dbt, and Airflow.
  • Our DW layer will use PostgreSQL or SQLite locally, and BigQuery in the cloud.
  • Apache Superset as our BI layer.

Let’s talk about environments briefly. What’s great about this stack is that we can easily run every component of it locally, but we’ll also need to eventually figure out how we want to deploy in the cloud.

I think we can organize all of our code in a single repository, thanks mostly to Meltano, which manages our dbt code and Airflow DAGs for us. Between Meltano’s and Superset’s Docker deployments, and any warehouse infrastructure provisioning code, we may want to circle back later on a separate infrastructure repository (or not). For the sake of the tutorials, I may use SQLite so that every tangible component of the project can be version controlled.
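
For orientation, that single repository ends up looking roughly like the scaffold `meltano init` generates, abbreviated here, so treat the layout as illustrative rather than exact:

```bash
# Illustrative layout of the project repository (Meltano's default scaffold, abbreviated).
tree -L 1 ds4fnp
# ds4fnp
# ├── meltano.yml     <- plugins, schedules, and connection settings
# ├── extract/        <- extractor (tap) configuration
# ├── load/           <- loader (target) configuration
# ├── transform/      <- the dbt project (models, sources, tests)
# ├── orchestrate/    <- Airflow DAGs
# └── analyze/        <- notebooks and BI assets
```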

I use a Mac, so some of my local setup may be Mac-specific, but I’ll try to stay platform-agnostic. I’m a big fan of VS Code (which now has a vscode-dbt plugin!), though Atom or any other IDE works just as well. Meltano, dbt, Singer, Airflow, and Superset are all mostly Python, so if running each of these natively (non-containerized) we’ll want Python and Anaconda or a similar setup. We’ll also want to make sure we have git and Docker installed. As for local databases, we’ll want SQLite and PostgreSQL installed; for the latter, on a Mac I highly recommend Postgres.app along with either Postico or TablePlus as a GUI client. Much of the work will be done using the various CLI tools, so make sure to have a comfortable terminal setup, or use VS Code’s built-in terminal. I personally use VS Code’s pass-through to iTerm2 with Oh My Zsh.
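
A rough, Mac-flavored sketch of that bootstrap (substitute your own package manager and tools as preferred):

```bash
# Hypothetical local environment setup (Homebrew-flavored; adjust per platform).
brew install git sqlite postgresql        # or use Postgres.app for the latter
brew install --cask docker visual-studio-code

python3 -m venv .venv && source .venv/bin/activate
pip install dbt meltano                   # Superset will run via Docker instead (see above)
# Note: dbt and Meltano occasionally disagree on dependencies; separate
# virtualenvs per tool are a safer bet if pip complains.
```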

Cloud deployment is going to depend on a few factors. I highly recommend managing your project’s code in a git repository, either on GitHub or GitLab. You can then set up CI/CD workflows (via GitHub Actions, Travis CI, etc.), which helps automate all of the deployment management tasks; a sketch of such a workflow follows the list below. I tend to use GitHub, and I find GitHub Actions more practical than a standalone CI/CD service, but other opinions may differ. Now, when we think about cloud deployment of our project, there’s a sequence of events that involves a few different moving pieces:

  • We want to run our extraction and loading tasks on a regular basis, probably on a defined schedule, and doing so likely requires spinning up some scalable execution hosts via AWS EC2, GCP, etc.
  • We also want to run our transformation code on a regular basis, though because transform code is mostly just remotely executed SQL commands, our execution host is more easily generalized to a standard-sized host machine.
  • Our data warehouse obviously needs to live somewhere, but the beauty of the current generation of systems is that they separate compute from storage and manage the orchestration of both. Practically speaking, dbt is our main interface to the warehouse’s actual contents.
  • While our extraction, loading, transformation, and warehousing functions are all likely to be ephemeral batch processes, we probably can’t avoid running our BI layer as a long-running web service. That means paying for a 24/7 service host, although if you’re not interested in paying for one, it is possible to simply serve pre-generated static content as your BI layer.
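
As a sketch of the CI/CD piece mentioned above (the schedule, plugin names, and secrets are all placeholders, and a proper long-running Airflow deployment would eventually replace this), a scheduled GitHub Actions workflow can drive the batch side of the stack:

```bash
# Hypothetical .github/workflows/elt.yml, written here as a heredoc for brevity.
mkdir -p .github/workflows
cat > .github/workflows/elt.yml <<'EOF'
name: nightly-elt
on:
  schedule:
    - cron: "0 6 * * *"          # once a day
jobs:
  elt:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: "3.8"
      - run: pip install meltano
      - run: meltano install                                    # install the project's plugins
      - run: meltano elt tap-gitlab target-postgres --transform=run
        env:
          # Loader credentials injected from repo secrets; the exact variable
          # name depends on the loader plugin's configuration.
          TARGET_POSTGRES_PASSWORD: ${{ secrets.WAREHOUSE_PASSWORD }}
EOF
```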

GitHub or another CI/CD service can host our deployment processes, but for both batch processes and long-running services we will need hosting infrastructure. Because pretty much everything comes containerized, we can deploy any of these processes on our own self-rolled AWS or GCP infrastructure. It is also worth noting that dbt offers dbt Cloud, Astronomer.io offers an excellent Airflow hosting service, and Preset.io is an upcoming Superset hosting service. The only thing missing is Meltano, and it seems conceivable that an analogous hosted service could arise for it. It’s entirely possible that we could eventually host every component of our stack on a freemium pricing model.

So that’s it! We have our data stack architecture. For the next part of this series, I want to walk through a more formal tutorial on how to set up a project with all of these components, culminating in a populated data warehouse along with some live data visualizations. Until then, I definitely recommend reading up on Meltano, dbt, and Superset. I’m also going to need to come up with some datasets for the DS4FNP tutorials, so please feel free to make suggestions or requests in the comments. In the meantime you may also enjoy this tutorial put together by the Preset.io team. Until next time!
