Data Engineering at Scale

How to speed up building your Big Data ETL pipelines and getting them into production

We have been hearing the slogan "Data is the new Gold" for a couple of years now, and many companies are investing heavily to follow this route. Initially, most companies believed that it was enough to hire a bunch of expensive data scientists to become a leader in the world of data-driven companies. For many reasons, it turned out that becoming a data-centric organization is much more difficult than just hiring top-notch data scientists. This article focuses on one important aspect required for taking off as a data company.

Surprisingly, it wasn’t well understood until a couple of years ago that getting and preparing the data for analytical questions is actually much harder than expected. This first step requires a lot of technical knowledge and skills around file formats, database systems, and APIs – all important topics, but mostly outside the focus of a data scientist, whose main expertise is in applying statistical or machine learning methods to data. At this point, the new role of "data engineer" was invented. Actually, a similar role had existed long before in the world of Business Intelligence (BI) and Data Warehouses (DWH), it just went by a different name.

The Importance of Data Engineering

Since data scientists depend on having access to relevant data to perform their work, Data Engineering is a central piece of the puzzle for becoming a data-driven company. And since agility is at the heart of the trial-and-error approach of modern data science, having an efficient team of strong data engineers makes a huge difference. Being able to quickly build and deploy new data pipelines, or to easily adapt existing ones to new requirements, becomes an important factor for succeeding with a company’s data strategy.


Scaling up the development speed of data engineers provides significant value to any company doing data science. But what do you actually need to do?

1. Choose the Right Approach

Obviously, the choice of tools and/or frameworks used to build data pipelines has a huge impact on overall development speed. There are basically two extreme routes you can choose between, with many variants in between.

You can place your bet on commercial off-the-shelf software like Talend or Informatica, which offer graphical development environments and fully integrated workflows for building ETL pipelines. At first this approach seems very promising, but it often becomes difficult once some important feature is missing and you have no idea how to integrate it. Moreover, as nice as the graphical data flows look, this representation has some significant drawbacks: for example, you can’t easily perform a diff to trace changes in the source code repository.

The other option is to build your own software on top of powerful frameworks like Apache Spark. While this approach seems to imply a much higher up-front effort, it often turns out to be more efficient, since the complexity of the solution can grow together with your experience and your requirements. Plus, you can build exactly the solution that you actually need instead of tailoring your infrastructure and workflow to a closed product. The term "Data Engineer" most often refers to highly skilled experts in this camp of developers, relying on Apache Spark and its friends.
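To make this route a bit more concrete, below is a minimal sketch of what such a hand-coded pipeline could look like, here using PySpark; all paths and column names are made-up placeholders, not taken from a real project.

```python
# Minimal sketch of a hand-coded Spark pipeline (illustrative only).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orders-daily").getOrCreate()

# Extract: read raw events from a landing zone
orders = spark.read.parquet("/data/landing/orders")

# Transform: apply the business rules with plain Spark functions
daily_revenue = (
    orders
    .filter(F.col("status") == "COMPLETED")
    .groupBy(F.to_date("order_ts").alias("order_date"))
    .agg(F.sum("amount").alias("revenue"))
)

# Load: write the result to the warehouse area
daily_revenue.write.mode("overwrite").parquet("/data/warehouse/daily_revenue")
```

Everything here is ordinary code, so it can be versioned, diffed, reviewed and refactored like any other software.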

Personally, I am a strong proponent of open source software and would almost always prefer the second approach, since it gives me more control, albeit at a lower level than a fully integrated but complex ETL software package. And if implemented wisely, "having more control" often directly translates into "faster to adapt".

2. Get Access to Source and Target Systems for Data Engineers

Although it sounds completely obvious that you need a local development environment, actually having a complete one is much more difficult as soon as external systems like databases or other data sources are involved. Data engineers not only need an IDE, they also need access to the source systems containing the data for which they are supposed to build data pipelines. Unfortunately, in most cases there is no silver bullet, so you should consider several different approaches.

One option is to grant data engineers access to source and target systems like databases or file systems in a development or staging area. The advantage is that developers do not need to set up possibly complex software on their own machines, but the downside is that the internet connection required for accessing these systems may become a bottleneck, especially in modern times of home office and underdeveloped countries like Germany (yes, Germany is completely underdeveloped when it comes to fast internet lines for private homes).

A different option is to replicate relevant parts of the overall system architecture on your local machines, i.e. to run databases or S3 servers locally. This is much more complex, but with clever tooling like Vagrant a successful local setup can easily be shared with your colleagues. However, this approach requires ongoing effort to keep the local software stack in sync with the production environment. Plus, you still need to populate the local data services with data.

Eventually, from my experience, there is no way around granting access to external systems. And since many problems in data pipelines are not caused by the application but by the data itself, you also need to think about granting (read-only) access to production systems for troubleshooting. Being able to run a data pipeline inside a debugger on a local machine while connected to external systems can be a life saver in some situations and will tremendously speed up troubleshooting.
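As an illustration, the following sketch shows how a pipeline might switch between a local sample file and a read-only JDBC connection to a production database. The flag, connection details and table names are purely hypothetical; the JDBC read itself uses Spark's standard DataFrame reader options and needs the corresponding JDBC driver on the classpath.

```python
# Sketch: the same pipeline reads either a local CSV sample or the
# (read-only) production database, selected by a simple setting.
# All connection details below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("debug-locally").getOrCreate()

USE_LOCAL_SAMPLE = True  # flip to False to connect to the real source

if USE_LOCAL_SAMPLE:
    customers = spark.read.option("header", True).csv("samples/customers.csv")
else:
    customers = (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://prod-db:5432/crm")
        .option("dbtable", "public.customers")
        .option("user", "readonly_user")
        .option("password", "***")
        .load()
    )

customers.printSchema()
```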

3. Write reusable Boilerplate Code

Most projects do not need just a single data pipeline, but a whole bunch of independent pipelines. Most of these applications require very similar boilerplate code for proper integration into the application landscape. Implement that boilerplate code properly once and reuse it for all your pipelines. This way, all pipelines benefit from any improvement of the shared code.

That doesn’t mean that you should put all pipelines into a single code repository; instead, you should create a common repository for shared code and separate repositories for each data pipeline. Over time, your boilerplate code might grow into a complete framework that also covers many non-functional requirements like logging and monitoring, and that simplifies debugging and testing.
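As a rough illustration, such a shared module could start out as small as the following sketch; the module and function names are made up, and a real framework would grow far beyond this.

```python
# Sketch of a shared "boilerplate" module (e.g. an internal package)
# that every pipeline reuses. Names are illustrative placeholders.
import logging
from contextlib import contextmanager
from pyspark.sql import SparkSession

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipelines")

@contextmanager
def pipeline_session(app_name: str):
    """Create a Spark session, log start and end, and always clean up."""
    spark = SparkSession.builder.appName(app_name).getOrCreate()
    log.info("Starting pipeline %s", app_name)
    try:
        yield spark
        log.info("Pipeline %s finished successfully", app_name)
    finally:
        spark.stop()
```

Each pipeline repository then only contains a thin entry point that uses this shared session handling, plus its own business logic.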

This way you also achieve a clear separation of boilerplate code and business logic, which will help you with the next piece of advice.

4. Focus on the Pipeline Logic

Separating boilerplate code from business logic also helps you focus on the latter. A good implementation of shared code doesn’t change much after some time, since it already addresses all (non-functional) requirements. Most changes will then happen in the code containing the business logic, and this is the place where the actual business value is generated. It requires agility and simplicity to adapt to changes and to implement new features.

The pipeline logic should not contain any boilerplate code any more; ideally, it should not contain much control flow other than what is needed for directing the data itself. It should be clear and concise, without any distractions caused by technical details. This is where data engineers will spend most of their valuable time.
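In a Spark-based setup this could, for example, mean expressing the business logic as pure functions from DataFrame to DataFrame, as in the following sketch (the column names are again just placeholders):

```python
# Sketch: business logic as pure functions, free of session handling,
# logging or I/O. Column names are illustrative.
from pyspark.sql import DataFrame, functions as F

def completed_orders(orders: DataFrame) -> DataFrame:
    """Keep only completed orders."""
    return orders.filter(F.col("status") == "COMPLETED")

def daily_revenue(orders: DataFrame) -> DataFrame:
    """Aggregate revenue per day."""
    return (
        orders
        .groupBy(F.to_date("order_ts").alias("order_date"))
        .agg(F.sum("amount").alias("revenue"))
    )
```

Functions like these are easy to read, easy to review and, as the next section shows, easy to test.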

5. Create meaningful Test Cases

Unit tests and integration tests are the standard mechanisms for quality assurance in software development. This is not much different in data engineering. Frameworks like Apache Spark make it easy to write unit tests for individual transformations.

But you should not stop at unit tests; you should also include automated higher-level tests for the whole data pipeline (at least for all transformations carried out) by providing input data and expected output data. To make this possible, the shared boilerplate code should support replacing the connections to external systems with simple CSV or JSON files containing the test records for the input. The data processing pipeline then performs its magic on this well-controlled data, so that you can finally compare the pipeline’s result with the provided expected results.
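A pipeline-level test along these lines might look like the following sketch, assuming pytest, a local Spark session, and the transformation functions from the previous sketch living in a hypothetical module `my_pipeline.logic`; the CSV file locations are made up as well.

```python
# Sketch of a pipeline-level test: read test input and expected output
# from small CSV files and compare the pipeline result against them.
import pytest
from pyspark.sql import SparkSession, functions as F

# Hypothetical module containing the business logic sketched above
from my_pipeline.logic import completed_orders, daily_revenue

@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[2]").appName("tests").getOrCreate()

def test_daily_revenue(spark):
    orders = (spark.read.option("header", True).option("inferSchema", True)
              .csv("tests/data/orders_input.csv"))
    expected = (spark.read.option("header", True).option("inferSchema", True)
                .csv("tests/data/daily_revenue_expected.csv"))

    # Cast the date to string so it compares cleanly with the CSV values
    result = (daily_revenue(completed_orders(orders))
              .withColumn("order_date", F.col("order_date").cast("string")))

    # Compare as sorted row lists; good enough for small test data sets
    assert sorted(result.collect()) == sorted(expected.collect())
```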

Having this kind of test increases the data engineers’ confidence in their code and at the same time allows the business experts to specify their expectations precisely by providing example inputs and outputs.

6. Support Data Inspection

It is really important that data engineers are able to run any of their applications on their local machine, so they can easily attach a debugger to the application to investigate any issues.

The ability to peek inside the intermediate results of a complex data pipeline is really important for catching errors in the transformation logic. You will gain speed by providing tools or functions for dumping the contents of temporary tables or intermediate results, so that you can follow all transformations step by step.
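Such a tool can be as simple as a small helper that prints a sample of a DataFrame and optionally dumps a full copy to disk; the following sketch is one possible shape, with all names being illustrative.

```python
# Sketch of a small inspection helper for intermediate results:
# print the schema and a sample, optionally persist a copy as Parquet.
from typing import Optional
from pyspark.sql import DataFrame

def inspect_df(df: DataFrame, name: str,
               dump_dir: Optional[str] = None, rows: int = 20) -> DataFrame:
    """Show schema and a sample; optionally dump the full DataFrame."""
    print(f"=== {name} ===")
    df.printSchema()
    df.show(rows, truncate=False)
    if dump_dir is not None:
        df.write.mode("overwrite").parquet(f"{dump_dir}/{name}")
    return df  # return the DataFrame so the helper can be chained
```

Calling `inspect_df(some_intermediate, "joined_orders", dump_dir="/tmp/debug")` between two transformation steps then lets you follow the data through the pipeline.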


Conclusion

I went through all these steps in several projects and found that the initial obstacles to gaining speed were almost always the same: getting access to production data, getting proper test cases, writing similar boilerplate code, and so on. Once all these issues had been addressed, the turnaround times were much lower.

Specifically, I highly recommend separating boilerplate and framework code from the business logic. My personal solution for that is Flowman, a data engineering application built on top of Apache Spark, where the business logic is specified via declarative YAML files. Flowman then takes care of the execution and also provides CLI tools for inspecting intermediate results.
