DMOT: A Design Pattern for ETL — Data Model, Orchestrator, Transformer

An OOP framework developed for production ETL projects (MLOps), inspired by the responsibility-segregation principles of MVC-like frameworks.

Corbin Hudson
Towards Data Science

--

Figure 1: DMOT Framework Interactions. Image by Author (draw.io)

TL;DR: A proposed design pattern for writing ETL data pipeline code (MLOps). Use this framework to reduce debugging time, increase testability, and support multi-environment production deployments.

The Data Model communicates with data sources and stores. The Transformer manipulates the data. The Orchestrator directs the ETL job, connecting the data model and transformer classes, and is typically linked to an applicable orchestration service.

This pattern is new at the time of writing (to the author's knowledge) and open to refinement by anyone.

Motivation

Writing pipelines for production machine learning projects doesn't draw on the same wealth of accumulated experience as the software applications they might be integrated into. We can, however, borrow some of the lessons learned in their development.

Segregating your code based on a macro-level responsibility worked well with MVC in iOS development. Side note for iOS devs: if you thought someone could write a massive view controller, just wait until you see how big some Jupyter notebooks are. Data projects are here to stay, so code bases need to be built with testing & validation, developer transferability, and ease of maintainability in mind.

DMOT Overview

DMOT divides the code base into three distinct categories based on the responsibilities of an ETL job. This framework argues an ETL job's core responsibilities are to Extract & Load the data (sometimes two separate responsibilities), Transform the data, and Orchestrate the runtime file and flow of the data.

An ETL job or pipeline can consist of several tasks in a logical sequence. A good software analogy for a pipeline is an onboarding flow, where each view is a task. Each task will have its own orchestrator, transformer, and data model file. Inheritance of utility classes (e.g., an SQL db connector) will be common across task files. Below is a code sample of how this could work.
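
As a rough illustration (the class and method names here are hypothetical, not prescribed by DMOT), one task could be laid out as three small classes, with the data model inheriting a shared SQL connector utility class:

```python
class SqlConnector:
    """Shared utility class (e.g., a SQL db connector) that task files can inherit."""

    def __init__(self, connection_string: str):
        self.connection_string = connection_string


class UserTaskDataModel(SqlConnector):
    """data_model.py: talks to the data sources/stores for this task."""

    def read_users(self) -> list: ...
    def write_users(self, rows: list) -> None: ...


class UserTaskTransformer:
    """transformer.py: manipulates the data for this task."""

    def drop_incomplete_rows(self, rows: list) -> list: ...


class UserTaskOrchestrator:
    """orchestrator.py: wires the data model and transformer together at run time."""

    def run(self) -> None: ...
```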

Orchestrator

The orchestrator is the actual runtime of the task. The fictitious example that follows this list shows how to get the runtime environment and database secrets, then manipulate the data. Most orchestrators should follow the flow of:

  1. Initialization Requirements
  2. Read in data using the data model
  3. Manipulate the data using the transformer
  4. Save the manipulated data using the data model
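
A minimal sketch of such an orchestrator, assuming the hypothetical UserTaskDataModel and UserTaskTransformer classes from the layout above, and using environment variables as a stand-in for a real secrets manager:

```python
import os


class UserTaskOrchestrator:
    """Runs one task end to end, following the four steps listed above."""

    def __init__(self, data_model_cls, transformer_cls):
        # Classes are injected so tests can pass in fakes
        self.data_model_cls = data_model_cls
        self.transformer_cls = transformer_cls

    def run(self) -> None:
        # 1. Initialization requirements: runtime environment and database secret
        environment = os.environ.get("PIPELINE_ENV", "dev")
        connection_string = os.environ[f"DB_CONN_{environment.upper()}"]

        data_model = self.data_model_cls(connection_string)
        transformer = self.transformer_cls()

        # 2. Read in data using the data model
        raw_rows = data_model.read_users()

        # 3. Manipulate the data using the transformer
        clean_rows = transformer.drop_incomplete_rows(raw_rows)

        # 4. Save the manipulated data using the data model
        data_model.write_users(clean_rows)
```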

Data Model

The data model is a class built on a custom SQL class; this SQL class is representative of any custom database class. In the example below, the class has two responsibilities: it both extracts and loads data. The data model should avoid data manipulation libraries such as pandas or numpy.
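
A sketch of what that could look like, with SqlClient standing in for any custom database class (the table and column names are made up for illustration):

```python
class SqlClient:
    """Illustrative stand-in for any custom database utility class."""

    def __init__(self, connection_string: str):
        self.connection_string = connection_string

    def query(self, sql: str) -> list:
        raise NotImplementedError  # would run a SELECT and return rows

    def execute(self, sql: str, rows: list) -> None:
        raise NotImplementedError  # would run an INSERT/UPDATE with bound rows


class UserTaskDataModel(SqlClient):
    """Both extract and load live here; no pandas or numpy in this layer."""

    def read_users(self) -> list:
        return self.query("SELECT id, email, signup_date FROM raw.users")

    def write_users(self, rows: list) -> None:
        self.execute(
            "INSERT INTO clean.users (id, email, signup_date) VALUES (?, ?, ?)",
            rows,
        )
```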

One could make an argument to separate reading from writing in certain scenarios. However, if the data source and destination are the same data store (e.g., tables in a data warehouse or lakehouse), the author has found it practically more convenient to have one class.

Transformer

The transformer example below is simple. It converts an array into a pandas DataFrame, then takes advantage of the built-in .dropna() function to drop nulls. Any time data is changed at all, it should happen in the transformer.
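
A sketch of that transformer (the column names are illustrative); it hands plain Python lists back to the orchestrator so pandas never leaks into the data model:

```python
import pandas as pd


class UserTaskTransformer:
    """All data manipulation for the task happens here."""

    def drop_incomplete_rows(self, rows: list) -> list:
        # Convert the raw array into a DataFrame, then use the built-in .dropna()
        df = pd.DataFrame(rows, columns=["id", "email", "signup_date"])
        return df.dropna().values.tolist()
```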

Machine learning training or inference jobs would also take place in the transformer. It would be common to see a custom sklearn or keras utility class imported into these transformers, as sketched below.
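
For example, a training transformer might wrap a scikit-learn estimator; the ChurnTrainingTransformer name and the "churned" label column below are hypothetical:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression


class ChurnTrainingTransformer:
    """Model training is still data manipulation, so it lives in the transformer."""

    def train(self, df: pd.DataFrame) -> LogisticRegression:
        features = df.drop(columns=["churned"])
        labels = df["churned"]
        model = LogisticRegression(max_iter=1000)
        model.fit(features, labels)
        # The orchestrator would hand the fitted model to the data model for saving
        return model
```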

Data Model

Responsibilities

  • All extracting and loading of data
  • Direct communication with data sources/stores

Examples

  • Database queries
  • DDL statements
  • Raw file copy
  • Model loading & saving

Validation Methods

  • Schema conformity
  • Data validation rules
  • Test connections to data sources/stores
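
A rough sketch of two of these checks; the expected column set, the assumption that rows arrive as dicts, and the SELECT 1 probe are all illustrative choices, not part of DMOT:

```python
EXPECTED_COLUMNS = {"id", "email", "signup_date"}


def validate_schema(rows: list, expected: set = EXPECTED_COLUMNS) -> None:
    """Schema conformity: every row (a dict here) must carry exactly the expected columns."""
    for row in rows:
        if set(row.keys()) != expected:
            raise ValueError(f"Unexpected schema: {sorted(row.keys())}")


def test_connection(sql_client) -> bool:
    """Data store test connection: run the cheapest possible probe query."""
    try:
        sql_client.query("SELECT 1")
        return True
    except Exception:
        return False
```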

Transformer

Responsibilities

  • Data manipulation & generation
  • Accepts and returns data from/to the orchestrator

Examples

  • Cleansing, formatting, datetime conversion
  • Feature Engineering
  • Model Training & Inference

Validation Methods

  • Unit Testing
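
Because the transformer never touches a data store, it can be tested entirely in memory. A pytest-style example against the transformer sketched earlier (the import path is hypothetical):

```python
from user_task.transformer import UserTaskTransformer  # hypothetical module path


def test_drop_incomplete_rows_removes_nulls():
    transformer = UserTaskTransformer()
    rows = [
        [1, "a@example.com", "2024-01-01"],
        [2, None, "2024-01-02"],  # missing email, should be dropped
    ]
    cleaned = transformer.drop_incomplete_rows(rows)
    assert len(cleaned) == 1
    assert cleaned[0][1] == "a@example.com"
```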

Orchestrator

Responsibilities

  • Passing data between the data model and transformer in a sequence
  • Direct communication with data model, transformer and orchestration service

Examples

  • Instantiate data model and transformer classes
  • Call public functions in data model or transformer class in correct sequence
  • Get environment variables

Validation Methods

  • End-to-end check that the pipeline does what it should
  • Connection with the orchestration service

Why DMOT Works in Practice

  • OOP allows for highly reusable code between projects.
  • The separation of data model vs transformer allows proper testing to occur: testing a connection to a DB is different from a unit test.
  • Handles multi-environment projects, since the orchestrator separates out the environment layer; the data model & transformer can simply take it as input when required.
  • Highly transferable code, as devs will know where to look for issues. Pipeline not running correctly? Check the orchestrator. Numbers look wrong? Try the transformer. No data at all? Check the data model.
