If you could only store one type of analytics data for your business, it should be user events – what pages they view, what buttons they click, what forms they submit, what they purchase, etc. This basic event stream is the bedrock that most data analysis is built upon.
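To make this concrete, here is roughly what a single event might look like. The exact field names below are hypothetical (every tracking tool has its own schema), but most events boil down to an action, a user, a timestamp, and a bag of properties.

```python
# A hypothetical user event record. Field names vary by tracking tool,
# but the shape is almost always: who did what, where, and when.
page_view_event = {
    "event": "page_view",
    "user_id": "u_123",
    "anonymous_id": "a9f1c2d4",          # visitors who haven't signed up yet
    "timestamp": "2021-06-01T14:32:07Z",
    "properties": {
        "path": "/pricing",
        "referrer": "https://www.google.com/",
    },
}
```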
Companies usually start out with an all-in-one platform like Google Analytics, but eventually need something more custom and flexible. It used to require vast data and engineering resources to build a custom data pipeline, but a recent explosion in managed services and open source tooling has greatly lowered the barrier. Now, a custom data pipeline is within reach of most companies.
A user event data pipeline consists of a few parts. There’s the data warehouse, the event tracking system, the data modeling layer, and the reporting/analysis tools. The sections below will go over each of these and link to some popular commercial and open source options.
The Data Warehouse
Typical application databases (MySQL, Postgres, MSSQL) are optimized for a mix of read and write queries that insert/select a small number of rows at a time and can handle up to 1TB of data reasonably well. Non-relational databases like MongoDB can be more scalable, but the lack of SQL support is a deal breaker for most analytics use cases. Data warehouses, on the other hand, get you both massive scalability and full SQL support. They are optimized for read-heavy workloads that scan a small number of columns across a very large number of rows and can easily scale to petabytes of data.
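For a sense of the workload warehouses are built for, here is a sketch of a typical analytics query: it touches only two columns but aggregates across every row of a (hypothetical) events table. The `connect` callable stands in for whichever warehouse driver you use (psycopg2 for Redshift, google-cloud-bigquery, snowflake-connector-python, etc.), and the exact SQL syntax varies slightly by warehouse.

```python
# Scans just two columns (received_at, user_id) across a very large table --
# exactly the access pattern a columnar warehouse is optimized for.
DAILY_ACTIVE_USERS_SQL = """
    SELECT
        DATE_TRUNC('day', received_at) AS day,
        COUNT(DISTINCT user_id)        AS daily_active_users
    FROM events
    GROUP BY 1
    ORDER BY 1
"""

def daily_active_users(connect):
    # `connect` is assumed to return a standard DB-API connection.
    conn = connect()
    cur = conn.cursor()
    cur.execute(DAILY_ACTIVE_USERS_SQL)
    return cur.fetchall()
```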
The main players in this space are Amazon Redshift, Google BigQuery, and Snowflake. Redshift is best when you have data engineers who want control over infrastructure costs and tuning. BigQuery is best when you have very spiky workloads. Snowflake is best when you have a more continuous usage pattern. Elvin Li provides a great in-depth comparison of these that you should check out if you want more details.
If you prefer an open source solution, there is ClickHouse. It was built by Yandex, the Russian search engine, and is quickly gaining in popularity.
The Event Tracking System
You can theoretically insert directly into your data warehouse in real time as users do things on your app or site, but this approach does not scale very well and is not fault tolerant. Instead, you really need to queue up events and periodically batch-insert them. This can be pretty complicated to do yourself, but luckily there are a number of managed services. Segment is the most popular option, but it has a few downsides: it’s very expensive, susceptible to ad blockers, only syncs data once every hour or two, and is missing a few key fields in the schema it generates (specifically, session and page ids). Freshpaint is a newer commercial alternative that aims to solve some of these issues.
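To see what the tracking layer is doing for you, here is a bare-bones sketch of the queue-and-batch-insert pattern. The class and the `insert_batch` callable are made up for illustration; real tracking systems add durable queues, retries, timed flushes, and schema management on top of this basic idea.

```python
import threading

class EventBuffer:
    """Toy sketch of queue-and-batch-insert, not a production tracker."""

    def __init__(self, insert_batch, max_size=500):
        self._insert_batch = insert_batch  # callable that bulk-inserts a list of rows
        self._max_size = max_size
        self._events = []
        self._lock = threading.Lock()

    def track(self, event):
        # Called on every page view / click; just appends to an in-memory queue.
        with self._lock:
            self._events.append(event)
            full = len(self._events) >= self._max_size
        if full:
            self.flush()

    def flush(self):
        # When the buffer fills (or on a timer in a real system), write one big
        # batch to the warehouse instead of thousands of tiny inserts.
        with self._lock:
            batch, self._events = self._events, []
        if batch:
            self._insert_batch(batch)  # e.g. COPY / bulk INSERT into the warehouse
```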
In addition to Segment and Freshpaint, there are a number of open source options (each with a managed offering if you don’t feel like hosting it yourself). Snowplow is the oldest and most popular, but it can take a while to set up and configure. Rudderstack is a full-featured Segment alternative. Jitsu is a pared-down event tracking library that is laser-focused on just getting events into your warehouse as quickly as possible.
The Data Modeling Layer
Raw event stream data can be hard to work with sometimes. Data modeling is all about transforming the raw data into something more user-friendly. This layer is not required when you are first starting out, but it is something you are definitely going to want to add down the road, so it’s nice to know it exists. There are really only two main players in this space, both open source tools with managed offerings.
Airflow is similar to the Unix cron utility – you write scripts in Python and schedule them to run every X minutes. Airflow can be used for any sort of scheduling task, but is often used for data modeling. Astronomer offers a managed Airflow service.
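As a rough sketch (assuming Airflow 2.x, with the actual transformation left as a placeholder), an hourly modeling job might look like this:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def build_sessions_table():
    # Placeholder: run a transformation against your warehouse, e.g. rebuild
    # a derived `sessions` table from the raw `events` table.
    ...

# An hourly DAG with a single modeling task.
with DAG(
    dag_id="hourly_data_models",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="build_sessions_table",
        python_callable=build_sessions_table,
    )
```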
dbt, on the other hand, is purpose-built for data modeling. You write transformations in SQL instead of Python, and there is built-in support for CI/CD pipelines to test your models and stage them before deploying them to production.
The Reporting/Analysis Tools
This is where the big payoff is when moving away from an all-in-one platform. There are hundreds of amazing tools to interact with your data once it’s in a warehouse, and you can use as many or as few as you like with no worry of vendor lock-in.
There are a number of SQL report builders, where you write SQL and build shareable charts and graphs from the results. Looker, Mode, PopSQL, and GoodData are popular commercial options, and Redash is a good open source alternative.
There are also self-serve drag-and-drop tools that don’t require SQL knowledge: Tableau, PowerBI, and Indicative are commercial options, while Apache Superset and Metabase are open source.
And if you have a Data Science team, they will likely interact with the warehouse directly using Jupyter Notebooks. The flexibility to dive into the raw data when needed is absolutely crucial for building a strong data culture at your company.
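In practice that often means a few lines of pandas in a notebook. The connection string, table, and event name below are purely illustrative; swap in whatever driver your warehouse uses.

```python
import pandas as pd
import sqlalchemy

# Illustrative connection string -- replace with your Redshift/Snowflake/BigQuery setup.
engine = sqlalchemy.create_engine("postgresql://analyst@warehouse.internal/analytics")

# Pull weekly signups straight out of the raw events table and plot them.
signups = pd.read_sql(
    """
    SELECT DATE_TRUNC('week', received_at) AS week,
           COUNT(*)                        AS signups
    FROM events
    WHERE event = 'signup'
    GROUP BY 1
    ORDER BY 1
    """,
    engine,
)
signups.plot(x="week", y="signups")
```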
Then there are a whole host of specialized tools. DataFold monitors your warehouse and alerts you if there are any anomalies (e.g. if checkout conversion rate drops suddenly right after a deploy). Hightouch lets you sync data from your warehouse to your marketing and sales platforms. Whale is an open source tool to document and catalog your data. Retool lets you integrate warehouse data into your internal admin tools.
Lastly, it’s time for a shameless plug. One of the most exciting ways to use user event data is for A/B testing. I’ve been working on an open source platform called Growth Book that plugs into your data warehouse and handles all of the complicated querying and statistics required for robust A/B test analysis. Check it out and give it a star on GitHub if you think it’s something you might want to use.
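To give a flavor of the statistics involved, here is a plain two-proportion z-test on conversion rates. This is not Growth Book’s actual implementation, just the textbook version of the kind of math an A/B testing tool automates for you (real tools layer on things like multiple-testing corrections and variance reduction).

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conversions_a, users_a, conversions_b, users_b):
    """Classic two-proportion z-test comparing conversion rates of A vs B."""
    p_a = conversions_a / users_a
    p_b = conversions_b / users_b
    pooled = (conversions_a + conversions_b) / (users_a + users_b)
    se = sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided
    return p_b - p_a, p_value

# e.g. 480 conversions out of 10,000 users in control vs 540 out of 10,000 in the variant
lift, p = two_proportion_z_test(480, 10_000, 540, 10_000)
```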
Conclusion
It’s never been a more exciting time to build a user event data pipeline. The number of managed services and open source tools has exploded in recent years, making it easy for any company to get something running in a single day. There are no longer any excuses for not taking control of your data and using it to drive business decisions.