Notes from Industry

The importance of layered thinking in data engineering

How to use real-world data in machine learning (ML) use cases

Joel Schwarzmann
Towards Data Science
11 min read · Jul 8, 2021


Looks a bit like a data lake, right? (Tangled wires by Cory Doctorow on Flickr (CC BY-SA 2.0))

Who is this for?

Are you a data scientist or data engineer keen to build sustainable and robust data pipelines? Then this article is for you! We’ll walk through a real-world example, and by the end of this article you’ll understand why you need a layered data engineering convention to avoid the mistakes we made in the past 🙈. We are QuantumBlack, and we’ll talk about our open-source Python framework: Kedro.

Why do we need a convention? Remember — data often exists by accident

Experienced data scientists, analysts and engineers know, only too well, that not all data is designed for analytics. Often it exists ‘by accident’, a random byproduct of some other business process.

When this happens, it’s common for the data quality to be poor and the infrastructure to be unreliable.

Here are some common situations found in the world of enterprise organisations which illustrate how one invariably ends up in this position:

  • The shift-manager of a production line still maintains schedules in Excel instead of using the fancy piece of enterprise software they have available because it’s a system that works, can be emailed around and migrating would interrupt business-critical timelines.
  • The inventory system of a multinational pharmaceutical company may quite literally still be a mainframe computer from the 1980s (hey, it does the job!). It has some basic reporting and forecasting functionality, but it was purposely designed to manage stock, not for analytics.

Many organisations, especially large ones, are not ‘internet natives’ and are now retroactively building machine learning into their operations. In this context, we have to take a flexible and iterative approach to building out ML use cases.

One cannot expect data to be neatly structured and ready to go. For example, digital banks like Monzo think very carefully about how they separate PII data from analytics at source. This is something that more traditional institutions have to unpick across disparate systems when trying to do the same sort of analytics.

In the real world, reverse engineering data designed for one purpose into something useful for analytics is a big part of building out ML pipelines.

Acknowledging this situation and using a standardised project template is an effective mechanism for simplifying one’s codebase and working mental model.

One group to turn to for an opinionated set of best practices in this matter is Cookiecutter Data Science. Their mission is to facilitate correctness and reproducibility in data science and, as it happens, they also employ a layered approach to data engineering…

We ❤️ Cookiecutter Data Science

Cookiecutter and the associated Cookiecutter Data Science project are leaders in the field with their rock-solid opinions. If you haven’t had a chance to read their methodology in detail, check them out. We’ll wait, it’s fine 😀 ⏳.

mmm… (“healthy chocolate chip cookies” by hlkljgk is licensed under CC BY-SA 2.0)

In summary, their thinking is underpinned by the following 6 rules:

1. Data is immutable
2. Notebooks are for exploration and communication
3. Analysis is a DAG
4. Build from the environment up
5. Keep secrets and configuration out of version control
6. Be conservative in changing the default folder structure

You can see from the standard Cookiecutter directory structure that a clear and concise form of data engineering convention is enforced:

...
├── data
│   ├── external        <- Data from third party sources.
│   ├── interim         <- Transformed intermediate data.
│   ├── processed       <- The final data sets for modeling.
│   └── raw             <- The original, immutable data dump.
...

Whilst this is a great framework to work with, as our projects grew in size and complexity we felt more nuance was needed in our approach.

Kedro: What it is and how it helps

There was a time when every project QuantumBlack delivered looked different. People started from scratch each time, the same pitfalls were experienced independently, reproducibility was time-consuming and only members of the original project team really understood each codebase.

Enter Kedro, an open-source Python framework for creating reproducible, maintainable and modular data science code. If you’ve never heard of the Kedro framework before, you can learn more here.

We built Kedro from scar tissue.

We needed to enforce consistency and software engineering best practices across our own work. Kedro gave us the super-power to move people from project to project and it was game-changing.

Live footage of engineers joining long-running Kedro projects midway (“180327-N-VN584–3279” by Commander, U.S. 7th Fleet is licensed under CC BY-SA 2.0)

After working with Kedro once, you can land in another project and know how the codebase is structured, where everything is and most importantly how you can help.

Kedro is a framework focused on the development and experimentation phase of ML product development. It is not centred on executing the ‘finished article’; that’s called ‘orchestration’ and is something we view as downstream of a deployed Kedro project. If you’re interested in how to use orchestrators, please read our deployment guide.

The Cookiecutter project’s core opinions are a huge influence on us and something we try to embody in Kedro. The initial premise for Kedro’s project structure extends the Cookiecutter directory structure. In addition, it powers the kedro new command(s) and the Kedro Starters functionality today.

As mentioned earlier, we found that our development process required a slightly more nuanced set of ‘layers’:

...
├── data
│   ├── 01_raw          <-- Raw immutable data
│   ├── 02_intermediate <-- Typed data
│   ├── 03_primary      <-- Domain model data
│   ├── 04_feature      <-- Model features
│   ├── 05_model_input  <-- Often called 'master tables'
│   ├── 06_models       <-- Serialised models
│   ├── 07_model_output <-- Data generated by model runs
│   ├── 08_reporting    <-- Ad hoc descriptive cuts
...

The Kedro Data Layers

The complicated diagram below represents what this thinking looked like before Kedro came to exist. It was (and still is) a playbook for working with data before we had standardised tooling to build out our pipelines.

There is a well-defined set of rules to ensure a clear understanding of which tasks need to be performed at each layer.

Not all of this is relevant to Kedro today, but demonstrates our wider thinking

Today, this has been simplified and translated into Kedro’s working pattern. A table describing how these work at a high level has been included below, but we’ll also take you through an end to end example shortly.

In the Kedro project template we generate a file structure that implements this convention. This is very much intended to nudge users towards this way of thinking — however, in practice we expect users to store their data in the cloud or data lake/warehouse. If you’re looking for an example, this is a good place to start!

One of the other key benefits of using this approach is the ability to visualise the layers in kedro-viz; our documentation on this can be found here.

shuttles:
  type: pandas.ExcelDataSet
  filepath: data/01_raw/shuttles.xlsx
  layer: raw

The layer key can be applied to the first level of any catalog entry and reflects how the dataset will be visualised in kedro-viz.

Key concept — Source versus Domain data models

In the Kedro world we call domain-level data the primary layer… but more on that later.

Let’s take the following example question and discuss the difference between source and domain data models.

Which machine in a factory is going to break down next?

We start with two raw data sources:

  • Inventory - Tracks the equipment available
  • Maintenance schedule - Which mechanics work which shifts

These data sources were not designed for analytics, but drawing a line between the two systems allows us to create a Machine shutdowns dataset relevant to the problem at hand.

This was built using kedro-viz which can visualise any Kedro pipeline

Whereas the two original datasets were received in whatever shape they were originally designed for, the Machine shutdowns dataset reflects the problem being solved. With this derived dataset we can start to evaluate our hypotheses regarding what causes shutdowns.

Comparing the Kedro layers to Cookiecutter Data Science

The most important difference is how we have split the ‘interim’ section into distinct subsections with clear responsibilities. For reference, here is the full Cookiecutter directory structure.

*In Kedro there is no distinction between internal and external data

Applying the Kedro data engineering convention to a realistic example

In this section we will bring it all together. Let’s take our predictive maintenance example from above (in concept, the data is different) and ground it in a realistic version of a machine learning use-case.

Layered thinking in practice (“TACC brain” by Ioannis N. Athanasiadis is licensed under CC BY 2.0)

Two key points to mention before we start:

🧢 This has been written with a data engineering hat on and as such the data science workflow is somewhat simplified. The modelling approach applied is also indicative rather than a robust piece of work.
🤷‍♀️ Ultimately these are all suggestions not rules — this article aims to contextualise our rationale, but ultimately you should feel free to follow this way of thinking, come up with your own layers, or completely disregard it.

The data necessary to build the pipeline and the overall ML use-case currently sits across multiple systems and parts of the business that rarely speak to each other.

If we could control how this data arrived it would be well documented, typed and accessible. In practice, it’s typical for things to arrive in err…how do we say this delicately… less than ideal formats 💩.

🔒 01. The Raw layer

We never mutate the data here, only work on copies

In this example the raw layer is populated with data that comes from a large, distributed organisation. The following data sources are present (a minimal catalog sketch follows the list):

  1. An Excel based maintenance log, which details when machines were serviced etc.
  2. A list of machine operators from an ERP system like SAP, describing which operators use different machines at different times.
  3. A static cut from an unknown equipment inventory SQL database that provides other metadata about the various machines in scope. The export has been provided in multiple parts.
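
For illustration, here is a minimal sketch of how these raw sources might be registered in a Kedro DataCatalog. The dataset names and file paths are hypothetical, and in a real project you would normally declare them in conf/base/catalog.yml rather than in Python; the import paths assume the kedro.extras.datasets layout current at the time of writing.

from kedro.io import DataCatalog
from kedro.extras.datasets.pandas import CSVDataSet, ExcelDataSet

# Hypothetical raw-layer entries; nothing under data/01_raw is ever mutated
catalog = DataCatalog(
    {
        # 1. Excel-based maintenance log
        "maintenance_log": ExcelDataSet(filepath="data/01_raw/maintenance_log.xlsx"),
        # 2. Operator shift export from the ERP system
        "machine_operators": CSVDataSet(filepath="data/01_raw/machine_operators.csv"),
        # 3. Multi-part static cut of the equipment inventory database
        "equipment_part_1": CSVDataSet(filepath="data/01_raw/equipment_part_1.csv"),
        "equipment_part_2": CSVDataSet(filepath="data/01_raw/equipment_part_2.csv"),
    }
)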

Now we’ve set the scene — familiarise yourself with the pipeline below before we walk through how the data flows through the layers.

Play with the tags at the bottom of the left sidebar slide-out to focus on specific layers.

🆕 02. The Intermediate layer

In practice, the intermediate layer only needs to be a typed mirror of the raw layer, still within the ‘source’ data model

  • Once the intermediate layer exists, you never have to touch the raw layer and we eliminate the risks associated with mutating the original data.
  • We permit minor transformations of the data: in this example we have combined the multi-part equipment extract into a single dataset (sketched in code after this list), but have not changed the structure of the data.
  • Cleaning column names, parsing dates and dropping completely null columns are other ‘transformations’ commonly performed at this stage.
  • We use a modern, typed data format like Apache Parquet.
  • If your data is already typed and structured it is okay to start at this point — but treat it as immutable.
  • There is often a performance gain from running your pipelines from here instead of the raw layer: typing and parsing large CSV or Excel files can be computationally non-trivial.
  • Profiling, EDA and any data quality assessments should be performed at this point.
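
As a minimal sketch of what an intermediate-layer node might look like, assuming hypothetical column names, the function below combines the multi-part equipment extract and applies only light-touch typing; the result would typically be saved under data/02_intermediate/ as Parquet via its catalog entry (e.g. a pandas.ParquetDataSet).

import pandas as pd

def create_int_equipment(equipment_part_1: pd.DataFrame, equipment_part_2: pd.DataFrame) -> pd.DataFrame:
    """Combine the multi-part raw equipment extract into a single typed dataset."""
    equipment = pd.concat([equipment_part_1, equipment_part_2], ignore_index=True)
    # Light-touch cleaning only: tidy column names, parse types, no restructuring
    equipment.columns = [col.strip().lower().replace(" ", "_") for col in equipment.columns]
    equipment["commissioned_date"] = pd.to_datetime(equipment["commissioned_date"])
    equipment["machine_id"] = equipment["machine_id"].astype(str)
    # Drop columns that are entirely null
    return equipment.dropna(axis="columns", how="all")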

⚙ 03. The Primary layer

The primary layer contains datasets that have been structured with respect to the problem being solved; a sketch of one such dataset follows the list below.

  • Two domain level datasets have been constructed from the intermediate layer which describe both equipment shutdowns and operator actions.
  • Both of these primary datasets have been built so that each row describes an action/event at a fixed point in time, allowing us to ask questions of the data in an intuitive way.
  • The concept of migrating from the source to your domain model is critical here. This is where data is engineered into a structure fit for the analytical purpose.
  • Additionally redundant source-level datapoints will be discarded as we flow through the layers, simplifying our working mental model.
  • From this we have a platform which we can use to build out our feature layer.
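
A sketch of a primary-layer node for this example, again with hypothetical column names: it reshapes the intermediate sources into a domain-level table where each row is a shutdown event at a fixed point in time.

import pandas as pd

def create_prm_equipment_shutdowns(int_maintenance_log: pd.DataFrame, int_equipment: pd.DataFrame) -> pd.DataFrame:
    """Reshape the source data into a domain-level table: one row per shutdown event."""
    # Keep only the records that actually describe a shutdown
    shutdowns = int_maintenance_log[int_maintenance_log["event_type"] == "shutdown"]
    # Attach the equipment attributes relevant to the problem and drop the rest
    shutdowns = shutdowns.merge(
        int_equipment[["machine_id", "machine_type", "site"]],
        on="machine_id",
        how="left",
    )
    return shutdowns[["machine_id", "machine_type", "site", "event_timestamp", "downtime_hours"]]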

🧩 04. The Feature layer

The feature layer is constructed from inputs which sit in the primary layer.

  • It’s seen as good practice to exclusively build feature tables from the preceding primary layer (and to not jump from the intermediate one). However, as with everything in Kedro this is a suggestion, not a hard rule.
  • In a mature setup, these will be saved in a feature store, which gives users a versioned and centralised location ready for low-latency serving.
  • Features are typically engineered at a consistent level of aggregation (often known as the ‘unit of analysis’ or table ‘grain’). In this example, one could potentially transform the data so that each row corresponds to one unique piece of equipment.
  • Target variable(s) reside within this layer and are treated as generic features.
  • In this example, the three features created represent variables that could be predictors or signals of equipment shutdowns (one is sketched in code after this list):

a) Days between last shutdown and last maintenance
b) Maintenance hours over the last 6 months
c) Days since last shutdown
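
As an illustration, feature (c) might be built from the primary shutdowns table along these lines; the snapshot date and column names are hypothetical, and the output sits at the ‘one row per machine’ grain.

import pandas as pd

def create_ftr_days_since_last_shutdown(prm_equipment_shutdowns: pd.DataFrame, snapshot_date: str) -> pd.DataFrame:
    """Feature (c) at the 'one row per machine' unit of analysis."""
    snapshot = pd.Timestamp(snapshot_date)
    last_shutdown = (
        prm_equipment_shutdowns.groupby("machine_id")["event_timestamp"].max().reset_index()
    )
    last_shutdown["days_since_last_shutdown"] = (snapshot - last_shutdown["event_timestamp"]).dt.days
    return last_shutdown[["machine_id", "days_since_last_shutdown"]]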

⚡️ 05. The Model Input layer

We feel the term ‘master table’ isn’t precise enough and have opted to use this nomenclature instead

  • This is where we join all the features together to create inputs to our models
  • In practice it’s typical to experiment with multiple models and therefore multiple ‘model input’ tables are required.
  • The first example here is a time-series-based table, whereas the other is equipment-centric without a temporal element.
  • In this example we use a simple ‘Spine’ joining table in order to anchor each input table to the correct ‘grain’ / ‘unit of analysis’, as sketched below.
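
A minimal sketch of the spine join, assuming the spine holds one row per machine at the chosen grain (plus the target variable) and that the feature table names are hypothetical:

import pandas as pd

def create_model_input_table(
    spine: pd.DataFrame,
    ftr_days_since_last_shutdown: pd.DataFrame,
    ftr_maintenance_hours_6m: pd.DataFrame,
) -> pd.DataFrame:
    """Join every feature table onto the spine that fixes the unit of analysis."""
    model_input = spine.copy()  # one row per machine, plus the target variable
    for features in (ftr_days_since_last_shutdown, ftr_maintenance_hours_6m):
        model_input = model_input.merge(features, on="machine_id", how="left")
    return model_input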

🧠 06. The Model layer

This is where trained models are serialised with reproducibility in mind

  • In this example, we have two models, which we save as pickles for safekeeping (an indicative training node is sketched after this list).
  • As with the rest of the layers, the ‘Model’ layer is a conceptual box to help organise your team’s (or your own) thinking when building out pipelines.
  • In a modern production environment it is common to see model registries used at this point of the process.
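
An indicative sketch of such a training node, using scikit-learn purely for illustration; the feature and target column names are hypothetical, and in a Kedro project the returned object would simply be declared in the catalog (for example as a pickle dataset under data/06_models/) so each run stays reproducible.

import pandas as pd
from sklearn.linear_model import LogisticRegression

def train_shutdown_model(model_input_table: pd.DataFrame) -> LogisticRegression:
    """Indicative only: fit a simple classifier on the model input table."""
    features = ["days_since_last_shutdown", "maintenance_hours_6m"]
    target = "shutdown_within_30_days"  # hypothetical target variable
    model = LogisticRegression()
    model.fit(model_input_table[features], model_input_table[target])
    # The returned object is persisted by the catalog (e.g. as a pickle under data/06_models/)
    return model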

🎁 07. The Model Output layer

The results of the various model runs live here

  • In this example, the two distinct modeling approaches output recommendations and scored results in different formats which are consumed downstream.

📣 08. The Reporting Layer

In this example, the feature engineering work performed has also made it possible to provide the business with a descriptive, helicopter view of the maintenance activities that was not previously accessible.
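
A reporting-layer cut for this example might be as simple as the following hypothetical aggregation (column names assumed), giving that helicopter view of maintenance hours per site per month.

import pandas as pd

def create_maintenance_overview(prm_maintenance: pd.DataFrame) -> pd.DataFrame:
    """A descriptive 'helicopter view': maintenance hours per site per month."""
    return (
        prm_maintenance
        .assign(month=prm_maintenance["event_timestamp"].dt.to_period("M"))
        .groupby(["site", "month"], as_index=False)["maintenance_hours"]
        .sum()
    )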

Extra credit: In this example we used an advanced modular pipeline pattern in order to re-use the same data science pipeline across both models (hence the mirrored structure). By doing this we can re-use the same code by simply overriding the relevant inputs and outputs for each pipeline; see the Kedro code here.
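
A minimal sketch of that pattern follows. It assumes a recent Kedro release where the modular pipeline helper is importable as kedro.pipeline.pipeline (older releases expose it via kedro.pipeline.modular_pipeline), and the node functions are placeholders for the real training and scoring logic.

from kedro.pipeline import Pipeline, node, pipeline

def train_model(model_input):
    """Placeholder for the real training node."""
    ...

def score_model(model, model_input):
    """Placeholder for the real scoring node."""
    ...

# The data science pipeline is defined once as a template...
ds_template = Pipeline(
    [
        node(train_model, inputs="model_input", outputs="model"),
        node(score_model, inputs=["model", "model_input"], outputs="predictions"),
    ]
)

# ...and instantiated twice, overriding the free inputs/outputs for each model
time_series_ds = pipeline(
    ds_template,
    inputs={"model_input": "time_series_model_input"},
    outputs={"predictions": "time_series_predictions"},
    namespace="time_series",
)
equipment_ds = pipeline(
    ds_template,
    inputs={"model_input": "equipment_model_input"},
    outputs={"predictions": "equipment_predictions"},
    namespace="equipment",
)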

TL;DR

The real world is one where data often hasn’t been designed with analytics in mind. It helps to have a framework for getting your data into a format suitable for analytics and, it just so happens, we’ve developed one which helps us make sense of the complexity and avoid common mistakes.

This article gives an idea of how we developed our thinking and provides a worked example of how Kedro’s data convention is set out.

What do you, the readers, use to guide your data engineering? Let us know in the comments!

📦 GitHub
💬 Discord
🐍 PyPi
🤓 Read The Docs

Are you a software engineer, product manager, data scientist or designer? Are you looking to work as part of a multidisciplinary team on innovative products and technologies? Then check out our QuantumBlack Labs page for more information.

All hail Kedroid
Get involved in the Kedro community, become a Kedroid
