Building a Data Platform in 2021

How to build a modern, scalable data platform to power your analytics and data science projects.

Dave Melillo
Towards Data Science


Table of Contents:

The Platform

Integration

Data Warehouse

Transformation

Presentation

Transportation

Closing

You know how the saying goes: “There’s more than one way to skin a cat.”

This is a tough metaphor for me to use as a proud cat parent, but the sentiment has never been more accurate when it comes to data in the 21st century.

While it’s true that you can solve most of your data problems with a spreadsheet, a Python script or a terminal command, problems emerge quickly as you start to consider scale, speed and consistency. Furthermore, the range of tools and processes in the data landscape has stifled collaboration and driven specialization in tool sets, rather than fostering a deep understanding of core data science concepts such as statistics, data modeling and effective visualization.

Lucky for us, a consistent framework has begun to emerge. The nouveau approach to building a data platform is part do-it-yourself and part do-it-for-me. It involves quilting together managed services and engineering enough flexibility in your platform to anticipate the unknown. If done correctly, this modern infrastructure allows data professionals to focus on solving complex problems with math and science, rather than facilitating archaic processes that revolve around administration and documentation.

The Platform

One of the key concepts in this approach to building a modern data platform is modularization. No single vendor or technology currently covers the entire data landscape, despite what clever marketing and sales campaigns suggest. Therefore, understanding each component is key to piecing together the right solution for your specific project. The components are as follows:

  • Source
  • Integration
  • Data Warehouse
  • Transformation
  • Presentation
  • Transportation

Integration

Let’s assume that the source component is obvious. Sources of data come in many shapes and sizes, and the integration layer should be pliable enough to account for all of them.

On the DIY end of the spectrum are popular tools such as Airflow, which many companies use to build robust, end-to-end data pipelines. Other Apache offerings such as Kafka offer a more event-based approach to data integration and can be used in combination with Airflow to extend custom data pipelines even further.
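
To make the DIY route concrete, here is a minimal sketch of an extract-and-load pipeline in Airflow 2.x. The DAG, task names and source data are purely illustrative placeholders, not a reference implementation:

```python
# Minimal sketch of a DIY integration pipeline in Airflow 2.x.
# The source, table and scheduling details below are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_orders(**context):
    # In a real pipeline this would call a source API or read from object storage.
    return [{"order_id": 1, "amount": 42.0}]


def load_orders(**context):
    # Pull the extracted records from XCom and write them to the raw zone
    # of the warehouse (insert logic omitted here).
    records = context["ti"].xcom_pull(task_ids="extract_orders")
    print(f"Loaded {len(records)} records")


with DAG(
    dag_id="orders_integration",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_orders", python_callable=extract_orders)
    load = PythonOperator(task_id="load_orders", python_callable=load_orders)
    extract >> load
```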

Managed services have come a long way in the integration space. Aside from enterprise-grade versions of the aforementioned Apache projects, such as Astronomer (Airflow) and Confluent (Kafka), there are several leaders in this space that offer flexibility but are opinionated enough to accelerate development in a meaningful way. From an event-based perspective, Segment is the undeniable leader, while Fivetran has emerged as the de facto choice for more traditional ETL/ELT-based data integration.

Data Warehouse

Possibly the most ambiguous and most critical component of modern data platforms is the data warehouse. This is in part because legacy database technologies such as SQL Server, Postgres and MySQL are still extremely effective. However, the dominance of newcomers like Snowflake has highlighted a clear path for the future. Cloud-based data warehouses such as Snowflake, Redshift and BigQuery offer a multitude of benefits over their predecessors in the way they store, access and manage data.

Regardless of which cloud-based data warehouse you choose, the practice of partitioning it into different functional layers is still evolving. A best practice is starting to emerge that suggests at least two distinct “zones”: one that stores raw/unstructured data and another that stores normalized/transformed data. There is much room for debate on this topic, but the general benefit of having these two distinct zones is the ability to effectively manage the ever-changing rules that turn raw data into digestible information.
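
As a rough illustration of the two-zone idea, the snippet below creates a raw schema and an analytics schema through SQLAlchemy. The connection string and schema names are placeholders, and the exact DDL will vary by warehouse:

```python
# Minimal sketch of the two-zone warehouse layout using SQLAlchemy.
# The connection string and schema names are illustrative placeholders.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:password@host/warehouse")  # placeholder

with engine.begin() as conn:
    # Zone 1: raw/unstructured data, landed as-is by the integration layer
    conn.execute(text("CREATE SCHEMA IF NOT EXISTS raw"))
    # Zone 2: normalized/transformed data, built by the transformation layer
    conn.execute(text("CREATE SCHEMA IF NOT EXISTS analytics"))
```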

Transformation

If the data warehouse component is the most critical piece of the modern data stack, then the transformation component is the most overlooked. Most projects tend to disperse transformations across business tools, visualization platforms and manual artifacts like spreadsheets, but centrally managing data transformations is a clearly identifiable attribute of mature data organizations.

The idea of efficiently managing transformations first entered the mainstream as the battle between ETL and ELT. While it may seem pedantic, the simple reordering of letters in a common acronym ushered in a brand new era that allowed non-data people to participate in building data products. This paradigm shift also gave new life to concepts like data governance and master data management (MDM), which rely heavily on input from business stakeholders.

From a DIY perspective, Python reigns supreme: it can easily manage simple SQL/task-based transformations with modules like SQLAlchemy and Airflow, and it is tailor-made for more complex machine learning transformations powered by TensorFlow, scikit-learn and many more.
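
As a sketch of what that can look like, the snippet below rebuilds a normalized table in the analytics zone from the raw zone using plain SQL issued through SQLAlchemy. The schema, table and column names are assumed for illustration, following the two-zone layout above:

```python
# Minimal sketch of a SQL-based transformation orchestrated from Python.
# Schema, table and column names are illustrative placeholders.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:password@host/warehouse")  # placeholder

TRANSFORM_SQL = """
    CREATE TABLE analytics.daily_orders AS
    SELECT
        CAST(order_ts AS DATE) AS order_date,
        COUNT(*)               AS order_count,
        SUM(amount)            AS total_amount
    FROM raw.orders
    GROUP BY CAST(order_ts AS DATE)
"""

with engine.begin() as conn:
    # Rebuild the normalized table from the raw zone on each run
    conn.execute(text("DROP TABLE IF EXISTS analytics.daily_orders"))
    conn.execute(text(TRANSFORM_SQL))
```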

From a managed service perspective, it’s hard to find a better product than dbt. While it’s true that all of the major cloud providers (AWS, Microsoft, Google) have their own set of tools to manage transformations on their platforms, dbt appears to be ahead of the pack from a platform-agnostic standpoint.

Presentation

Up until now, most of the components we have discussed are pure infrastructure. While it’s true that most data analysts, engineers and scientists will consume content directly from the data warehouse and transformation components, the bulk of end users won’t see anything until it hits a dashboard in the presentation layer.

The presentation component is, candidly, a vast category. Who’s to say that a Jupyter Notebook, which also includes elements of transformation, can’t also be used as a presentation tool? After all, Databricks has been very successful with this strategy as it seems poised to be one of the next big tech IPOs of the roaring 20s.

From a historical perspective, visualization tools have dominated both the transformation and presentation categories, with tools like Looker, Power BI, Qlik, Sisense and Tableau proving that managing transformations and building beautiful visualizations are not mutually exclusive concepts.

As the data stack continues to evolve, I believe the champions in the presentation space will be those that double down on visualization capabilities and rely less on transformative ability. As organizations integrate more sources of data and data volumes increase exponentially, managing transformations at the presentation level will not only present a challenge but will also create ill-defined information and inaccurate analysis.

Transportation

The consideration of a transportation component is what makes this approach uniquely modern. In the past it was acceptable that end users would consume information through dashboards and external analytics tools, but it is becoming increasingly apparent that unless data professionals can get their insights back into systems of record, their work may be all for naught.

Sometimes referred to as “embedded analytics,” the concept of data transportation is simple: it bridges the gap between data tools and systems of record (e.g., customer relationship management, marketing automation, and customer success platforms). However, few managed services have emerged to solve this problem effectively, and even the ones that have are still actively developing. Companies like Hightouch, Census and Syncari seem to be the first ones through the wall and are likely the only option for most projects unless they have copious amounts of developer resources and experience in automating information exchange.
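
For teams with the developer resources to go the DIY route, a bare-bones version of the pattern looks something like the following: read a computed metric from the warehouse and push it onto the matching record in a system of record. The CRM endpoint, auth token, table and payload shape here are hypothetical:

```python
# Minimal sketch of "transportation": syncing a warehouse metric back into a
# system of record. The CRM endpoint, token and payload shape are hypothetical.
import requests
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:password@host/warehouse")  # placeholder
CRM_URL = "https://api.example-crm.com/v1/accounts/{account_id}"     # hypothetical
HEADERS = {"Authorization": "Bearer <token>"}                         # placeholder

with engine.connect() as conn:
    rows = conn.execute(
        text("SELECT account_id, health_score FROM analytics.account_health")
    )
    for account_id, health_score in rows:
        # Write the computed score onto the matching CRM record
        requests.patch(
            CRM_URL.format(account_id=account_id),
            json={"custom_fields": {"health_score": health_score}},
            headers=HEADERS,
            timeout=10,
        )
```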

Closing

Even as I write this, the data landscape is changing. Concepts around data platform observability and security are quickly coming into vogue, and companies are materializing overnight to solve these problems. With that in mind, understand that flexibility and agnosticism are the main takeaways from this message. Although it will happen eventually, I am willing to bet that it will take several years before one vendor distills the entire data stack into one unified platform. So take this framework into the future with the understanding that you will have to change your thinking and accept new ideas on a daily basis.
