Four Software Engineering Best Practices to Improve Your Data Pipelines

From agile to abstraction, thinking about data the way we think about software can save us lots of grief.

Olivia Iannone
Towards Data Science


Photo by Headway on Unsplash

Let’s start by getting something important out of the way: there are some major differences between data engineering and software engineering.

At the same time, they’re similar enough that many of the best practices that originated for software engineering are extremely helpful for data engineering, as long as you frame them correctly.

In this article, I’ll walk through several software engineering best practices, and how they can help you create and maintain better data pipelines. I’ll be focusing on pipelines specifically because that’s what we focus on at Estuary, but these principles apply just as well to your data stack at large.

The discussion will be high-level. I’m not a software engineer myself, and I don’t believe you have to be to gain strategic and leadership value from these principles.

Software engineering vs data engineering: similarities and differences

Data and software products are different, and their stakeholders are different.

Generally speaking, building a software product involves collaboration between highly technical teams. The product can be delivered to a huge variety of user groups, often commercially. For example, a bank might create a mobile application for its clients.

Data products, by contrast, tend to live within the confines of an enterprise. The stakeholders and players involved can range from highly technical engineers to non-technical professionals who need data to do their jobs. For example, that same bank might create financial and demographic data products about its clients to aid in security, sales, and strategy.

If you’re reading this, you hang out in the data space, which means I probably don’t need to belabor these distinctions. You’ve witnessed firsthand how data can be treated as if it’s very different from software — especially from the business perspective.

But the essential practices of data engineering and software engineering are basically the same. You’re writing, maintaining, and deploying code to solve a repeatable problem. Because of this, there are some valuable software engineering best practices that can be converted to data engineering best practices. Lots of the latest data trends — like data mesh and DataOps — apply software engineering practices in a new way, with excellent results.

The history of software engineering vs data engineering

To understand why these best practices come from software and are only recently being applied to data, we need to look at history.

The discipline of software engineering was first recognized in the 1960s. At the time, the idea that the act of creating software was a form of engineering was a provocative notion. In fact, the term “software engineering” was intentionally chosen to give people pause, and to encourage practitioners to apply scientific principles to their work. In the following decades, software engineers tested and refined principles from applied sciences and mechanical engineering.

(Check out this Princeton article for more details on all these bold-sounding claims.)

Then, in the 1990s, the industry fell behind a growing demand for software, leading to what was known as the “application development crisis.” The crisis encouraged software engineers to adopt agile development and related practices. This meant prioritizing a quick lifecycle, iterating, and placing value on the human systems behind the software.

On the other hand, data engineering as we know it is a relatively young field. Sure, data has existed for most of human history and relational databases were created in the 1970s. But until the 2000s, databases were solely under the purview of a small group of managers, typically in IT. Data infrastructure as an enterprise-wide resource with many components is a relatively new development (not to mention one that is still changing rapidly). And the job title “data engineer” originated in the 2010s.

In short, software engineers have had about 60 years of doing work that at least broadly resembles what they still do today. During that time, they’ve worked out a lot of the kinks. The data engineering world can use that to its advantage.

Without further ado, here are some software engineering best practices you can (and should) apply to data pipelines.

1 — Set a (short) lifecycle

The lifecycle of a product — software or data — is the cyclical process that encompasses planning, building, documenting, testing, deployment, and maintenance.

Agile software development puts a twist on this by shortening the development lifecycle, in order to meet demand while continuing to iterate and improve the product.

Likewise, you can — and should — implement a quick lifecycle for your data pipelines.

The need for new data products across your organization will arise quickly and often. Make sure you’re prepared by dialing in your lifecycle workflow.

  • Plan with stakeholders to ensure your pipeline will deliver the required product.
  • Build the pipeline — Depending on the platform and interface, you could be writing a specification or creating a DAG (see the sketch after this list).
  • Document the pipeline — This could include a schema, metadata, or written documentation (dbt docs are an interesting example, though in a different part of the data stack).
  • Test the pipeline before deploying — The pipeline tool may have built-in testing, or you can write your own.
  • Deploy the pipeline.
  • Monitor it — Watch for error alerts and make updates.
  • Iterate quickly as use cases change — Continue to build on previous pipelines and recycle components.
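
To make the “build” step concrete, here’s a minimal sketch of what a pipeline could look like as a DAG, assuming a recent Airflow 2.x release with the TaskFlow API; the DAG name, task names, and data below are hypothetical stand-ins, not a prescription.

```python
# Hypothetical "orders_pipeline" DAG; the source and destination are stubbed out.
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@hourly", start_date=datetime(2023, 1, 1), catchup=False)
def orders_pipeline():
    @task
    def extract():
        # Pull new rows from the source system (stubbed here).
        return [{"order_id": 1, "amount": 42.0}]

    @task
    def transform(rows):
        # Apply a small, testable transformation.
        return [{**r, "amount_cents": int(r["amount"] * 100)} for r in rows]

    @task
    def load(rows):
        # Write to the destination warehouse (stubbed here).
        print(f"Loading {len(rows)} rows")

    load(transform(extract()))

orders_pipeline()
```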

The concept of integrating agile development methods into data is a huge component of the DataOps framework. Check out my full article on the subject.

2 — Pick the right level of abstraction

To keep your data lifecycle tight, it’s important not to get lost in the technical implementation details. This calls for abstraction.

Software engineers are quite comfortable with the concept of abstraction. Abstraction is the simplification of information into more general objects or systems. It can also be thought of as generalization or modeling.

In software engineering, the relevant levels of abstraction typically exist within the code itself. A function or a class, for example, is a useful tool precisely because it hides the fine details of how it is executed.

In data, you’ll need to work with a level of abstraction that’s higher than code. There are two main reasons for this:

  • The immediate connection between data products and the business use cases they serve means you’ll want to talk about data in more “real-world” terms. Getting clear on this level of abstraction means establishing a universal semantic layer — and helps avoid the common problem of multiple, conflicting semantic layers popping up in different BI tools and user groups.
  • The wider variety of technical skill levels among data stakeholders means that talking in terms of something highly technical, like code, isn’t very useful.

For a data pipeline, two pertinent abstractions are the acts of ingesting data from one system and pushing it to another (at Estuary we use the terms capture and materialization, but the semantics will vary).

When we talk about pipelines using terms like “captures” and “materializations,” both engineers and business users are able to unite around the semantic value of the pipeline (it gets data from system X to system Y so that we can do Z).
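
As an illustration only (this is not any vendor’s actual spec format), here’s a sketch of how those two abstractions might be written down so that a pipeline reads in real-world terms; the names and systems are hypothetical.

```python
# A toy model of a pipeline described as a capture plus a materialization.
from dataclasses import dataclass

@dataclass
class Capture:
    name: str         # the dataset in business terms, e.g. "orders"
    source: str       # system X the data comes from

@dataclass
class Materialization:
    name: str
    destination: str  # system Y the data lands in
    purpose: str      # the Z that makes the pipeline worth building

orders = Capture(name="orders", source="postgres: prod orders table")
orders_reporting = Materialization(
    name="orders",
    destination="snowflake: analytics schema",
    purpose="daily revenue reporting",
)
```

Nothing here says how rows move between the two systems; it only states what the pipeline captures, where it materializes, and why.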

3 — Create declarative data products

OK, you caught me: this is really just a continuation of the discussion of abstraction, but it will give that discussion more substance.

Let’s consider the idea of data as a product. This is a central tenet of the popular data mesh framework.

Data-as-a-product is owned by different domains within the company: groups of people with different skills who share an operational use case for data. Data-as-a-product can be quickly transformed into deliverables that take many forms but are always use-case driven. In other words: they are about the what rather than the how.

The software engineering parallel to this is declarative programming. Declarative programming focuses on what the program can do. This is in contrast to imperative programming, which states exactly how tasks should be executed.

Declarative programming is an abstraction on top of imperative programming: at some point, when the code is compiled or executed, the system has to settle on a how. But deferring that decision gives the runtime more flexibility, potentially saving resources. Plus, declarative code is easier to keep a grip on mentally, making it more approachable.
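
Here’s a minimal Python sketch of that contrast, using hypothetical order records:

```python
orders = [
    {"status": "complete", "amount": 42.0},
    {"status": "pending", "amount": 10.0},
]

# Imperative: spell out exactly how to walk the data and accumulate a total.
def total_revenue_imperative(rows):
    total = 0.0
    for row in rows:
        if row["status"] == "complete":
            total += row["amount"]
    return total

# Declarative: state what you want; the runtime decides how to compute it.
def total_revenue_declarative(rows):
    return sum(r["amount"] for r in rows if r["status"] == "complete")

assert total_revenue_imperative(orders) == total_revenue_declarative(orders) == 42.0
```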

By making your pipelines declarative — built based on their functionality first rather than their mechanism — you’ll be able to better support a data-as-a-product culture.

You’ll start with the product the pipeline is intended to deliver (say, a particular materialized view) and design the pipeline around it. A declarative approach to pipelining makes it harder to get lost in the technical details and forget the business value of your data.

4 — Safeguard against failure

Failure is inevitable, both in software development and data pipelines. It’s a lesson many of us have learned the hard way: scrambling to fix a catastrophically broken system, losing progress or data to an outage, or simply allowing a silly mistake to make it to production.

You can — and should — apply very similar preventative and backup measures in both software and data contexts.

Here are a few important considerations. Many of these functions can be fulfilled with a data orchestration tool or through tools provided by your pipeline vendor.

Testing

This should be part of your pipeline’s lifecycle, just as it is in software.

In addition to comprehensive manual testing before deployment, you should write automated unit tests to keep an eye on the pipeline in production.

How you write these depends on your platform and how you must interface with it. If you use Airflow for your pipelines, for example, you would create Python scripts to test them. Alternatively, you might prefer or require a more robust monitoring setup to catch all potential problems.

As a rule of thumb, the more transformations a data pipeline applies, the more testing is required.
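
As a hedged example, if you did use Airflow, a handful of pytest checks against the DagBag can catch import errors and missing tasks before a broken DAG reaches production; the DAG and task names below are the hypothetical ones from the earlier sketch.

```python
# Assumes pytest and Airflow 2.x, run from an environment that can see your dags folder.
import pytest
from airflow.models import DagBag

@pytest.fixture(scope="session")
def dag_bag():
    return DagBag(include_examples=False)

def test_dags_import_cleanly(dag_bag):
    # Any syntax or import failure in a DAG file shows up here.
    assert dag_bag.import_errors == {}

def test_orders_pipeline_structure(dag_bag):
    dag = dag_bag.get_dag("orders_pipeline")  # hypothetical DAG id
    assert dag is not None
    assert {"extract", "transform", "load"} <= set(dag.task_ids)
```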

Version control

Software engineers use version control, usually Git, to collaborate on their work and retain the ability to roll back software to previous versions.

If you use a product from a vendor, it might provide a GitOps workflow, meaning engineers can use Git to collaborate on pipelines in their preferred development environment. Not all do, however.

Even if you can’t use Git for your data infrastructure, your vendor should provide some way to back up your pipeline configurations, so be sure to take full advantage of that capability.

Distributed storage and backfilling ability

The advent of cloud hosting and storage has lessened the danger of outages and data loss, but that danger hasn’t gone away.

Your data infrastructure should be distributed; that is, different components should be spread across different servers, making the system fault tolerant. How much control you have over this depends on your cloud provider and vendors of choice.

Always iterate

One final lesson from software engineering best practices: when something doesn’t work, iterate.

The status quo and best practices are always in flux. This applies to software engineering and it definitely applies to data engineering.

The best approach is always one that’s well thought-out, introduces change safely, and includes buy-in from all stakeholders.

Start with principles like these and adapt them to fit your data team’s systems and culture. Note the positive effects and the areas that need improvement, and go from there.

This article was adapted from the Estuary Blog. You can find our team on LinkedIn and our code on GitHub.
