
Designing ML Orchestration Systems for Startups

A case study in building a lightweight production-grade ML orchestration system.


Photo by Emile Guillemot on Unsplash

I recently had the chance to build out a machine learning platform at a healthcare startup.

This article covers the journey of architecture design, technical tradeoffs, implementation details, and lessons learned as a case study on designing a machine learning orchestration platform for startups.

As the machine learning tooling ecosystem continues to mature and expand, there is no shortage of options for building out a machine learning orchestration layer for production data science pipelines. The more relevant challenge is determining which tools fit the needs of your organization.

The most valuable lesson was structuring the source control process around tools already familiar to data scientists.

The final toolset we ended up with was:

  • Version controlled SQL scripts underpinning source dataset extraction
  • Preprocessing code abstracted by a Python runner script executed within a Docker container, accessing checkpointed data at rest within the data lake.
  • Model training code abstracted within a Python model class that encapsulates functions for loading data, artifact serialization/deserialization, training, and prediction logic (a rough interface sketch follows the figure below).
  • Boilerplate Flask API endpoint wrappers for performing health checks and serving inference requests.
An end-to-end build & transform model pipeline. [Source: Image By Author]
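
As a rough illustration of that model class pattern, here is a minimal sketch; the class name, label column, file paths, and scikit-learn estimator are stand-ins for illustration, not the actual production implementation:

```python
# Illustrative sketch of the self-contained model class pattern; names and
# the scikit-learn estimator are placeholders, not the production code.
import pickle
from pathlib import Path

import pandas as pd
from sklearn.linear_model import LogisticRegression


class ExampleModel:
    """Bundles data loading, training, artifact (de)serialization, and prediction."""

    def __init__(self, estimator=None):
        self.estimator = estimator or LogisticRegression(max_iter=1000)

    def load_data(self, path: str) -> pd.DataFrame:
        # Read checkpointed, preprocessed features from the data lake or a local mount.
        return pd.read_parquet(path)

    def train(self, df: pd.DataFrame, label_col: str = "label") -> None:
        X, y = df.drop(columns=[label_col]), df[label_col]
        self.estimator.fit(X, y)

    def predict(self, df: pd.DataFrame) -> pd.Series:
        return pd.Series(self.estimator.predict_proba(df)[:, 1], index=df.index)

    def save(self, artifact_dir: str) -> None:
        Path(artifact_dir).mkdir(parents=True, exist_ok=True)
        with open(Path(artifact_dir) / "model.pkl", "wb") as f:
            pickle.dump(self.estimator, f)

    @classmethod
    def load(cls, artifact_dir: str) -> "ExampleModel":
        with open(Path(artifact_dir) / "model.pkl", "rb") as f:
            return cls(estimator=pickle.load(f))
```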

The sections below cover the thought process and implementation details needed to land on the final product. The specific system architecture may not generalize to every organization, but this case study can serve as a useful thought exercise in selecting the right tools and comparing architecture tradeoffs.

The design process for building out this platform started with establishing the right guiding questions that primed the architecture discussion. This helped with framing which tradeoffs to favor during the technical build.

Guiding Questions

  • Who are the intended users of the service? What technical skills do they currently use and what skills are they willing to learn?
  • What data access patterns do users require and what types of inferences will be produced? How does data ETL already happen within the organization?
  • What are the freshness requirements of the data flow?
  • What components require monitoring? What metrics are useful?
  • How often will model versions be upgraded?

After putting out an Architecture Decision Record (ADR) proposal that laid out the problem scope and an initial architecture, the team and I landed on a rough plan for how to approach the POC.


Key Decisions

  • We need a source control process for both code + data to formalize model deployment patterns.
  • Docker images should serve as the common artifact of choice for portability, reproducibility, and scaling model solutions.
  • AWS Sagemaker will be the orchestration backbone for virtualized resource management in deploying APIs, managing batch transformation jobs, authentication, pipeline observability, and load balancing inference traffic.
  • Airflow will serve as the task orchestration system while Sagemaker takes care of resource allocation – scheduling and execution are decoupled.
  • We need a shared-state controller application that coordinates and manages Sagemaker operations.
  • SQLAlchemy ORMs can structure the data model, incorporating a finite state machine for resource state transitions, cleanly separating steps of the machine learning model lifecycle, and preventing duplicative Sagemaker operations (see the sketch after this list).
  • The configuration and execution of pipelines should be as close to 1-click deployment as possible for users.
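
As a loose illustration of the ORM-plus-state-machine idea above, here is a minimal sketch; the table name, states, and transition map are hypothetical, not the production schema:

```python
# Hypothetical sketch: a Sagemaker resource row whose status column is
# constrained by a simple finite state machine of allowed transitions.
import enum

from sqlalchemy import Column, Enum, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class ResourceState(enum.Enum):
    PENDING = "pending"
    CREATING = "creating"
    IN_SERVICE = "in_service"
    DELETING = "deleting"
    FAILED = "failed"


# Allowed transitions; anything else is rejected as a duplicative or invalid operation.
TRANSITIONS = {
    ResourceState.PENDING: {ResourceState.CREATING},
    ResourceState.CREATING: {ResourceState.IN_SERVICE, ResourceState.FAILED},
    ResourceState.IN_SERVICE: {ResourceState.DELETING},
    ResourceState.DELETING: set(),
    ResourceState.FAILED: {ResourceState.CREATING},
}


class SagemakerEndpoint(Base):
    __tablename__ = "sagemaker_endpoints"

    id = Column(Integer, primary_key=True)
    endpoint_name = Column(String, nullable=False, unique=True)
    model_version = Column(String, nullable=False)
    state = Column(Enum(ResourceState), default=ResourceState.PENDING, nullable=False)

    def transition(self, new_state: ResourceState) -> None:
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"Illegal transition: {self.state} -> {new_state}")
        self.state = new_state
```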

With the higher-level questions on tools and key features determined, I spent a few months refining a minimum viable product on this framework. This was my first time building a machine learning platform from scratch, so plenty of unanticipated implementation challenges arose that required some decision-making throughout the coding process.

The architecture design for the Machine Learning Orchestration proof of concept system. [Source: Image By Author]

What does "done" look like?

  • Keep the first model simple and get the infrastructure right.
  • One good option is to aim for a "neutral first launch" where machine learning gains are explicitly de-prioritized.

The first step is to determine how we know when we’re done. For laying down the initial infrastructure and integration testing, we set a goal of deploying a neutral first launch that disregarded machine learning model performance. The idea is to keep the first model simple and get the infrastructure right. This idea was inspired by Google’s Rules of Machine Learning best practices.

The first few models deployed will likely be high-value, low-hanging fruit that don’t need complex and fancy features for training and deployment.

The focus should be on:

  • How are data storage and computation orchestrated?
  • What are the success and failure states for the system?
  • How do models integrate with the framework – how are offline batch computations triggered, and how are live API endpoints deployed and taken down?

The goal is to start with simple features and:

  • Confirm that training data is correctly ingested, transformed, and fit to the model.
  • Verify that the model weight distribution looks reasonable.
  • Verify that feature data correctly reaches both offline and online models for inference.

Testing

  • Test the infrastructure independently from the machine learning.
  • Unit testing machine learning models in CI/CD is difficult and largely wasted effort – instead, write integration tests for Docker images using test data fixtures.
  • Detect problems before exporting models.

Make sure that the infrastructure has unit tests and integration tests, and that the learning parts of the system are encapsulated to allow fixtures and stubs to mock runtime conditions.

Specifically:

  1. Test getting data into the algorithm. Check for missing feature columns, ideally by defining acceptability thresholds around data quality (see the sketch after this list).
  2. Manual inspection of inputs for training algorithms can be a good sanity check if privacy policies allow.
  3. Statistics around distributions in your training data vs. serving data can be a valuable monitor against concept drift.
  4. Test the generation of model artifacts from algorithms. Trained models should be reproducible and have performance parity with serving models on the same data.
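
For the first point, a minimal pytest-style integration test might look like the sketch below; the expected columns, fixture path, and null-rate threshold are illustrative assumptions:

```python
# Hypothetical pytest integration test: validate a training extract against an
# expected schema and simple data-quality thresholds before training runs.
import pandas as pd
import pytest

EXPECTED_COLUMNS = {"patient_id", "age", "visit_count", "label"}  # illustrative schema
MAX_NULL_RATE = 0.05  # acceptability threshold, tuned per pipeline


@pytest.fixture
def training_extract() -> pd.DataFrame:
    # In CI this would load a small, anonymized fixture file baked into the Docker image.
    return pd.read_parquet("tests/fixtures/training_sample.parquet")


def test_no_missing_feature_columns(training_extract):
    missing = EXPECTED_COLUMNS - set(training_extract.columns)
    assert not missing, f"Missing feature columns: {missing}"


def test_null_rates_within_threshold(training_extract):
    null_rates = training_extract[list(EXPECTED_COLUMNS)].isna().mean()
    offenders = null_rates[null_rates > MAX_NULL_RATE]
    assert offenders.empty, f"Columns over null-rate threshold:\n{offenders}"
```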

Configuration

  • Configuration should be centralized as far upstream as possible. Options for sourcing configuration can be through database table configurations or through source-controlled YAML files.
  • Configuration flexibility should be determined by the anticipated frequency of user changes, balanced against pipeline stability needs. Can configuration change outside of source control? Can an ML service tolerate an outage from a bad config? (A minimal config-loading sketch follows.)
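
As a small example of the source-controlled YAML route, the sketch below loads and minimally validates a pipeline config with PyYAML so that a bad file fails fast; the config fields are assumptions for illustration:

```python
# Hypothetical example: load and validate a source-controlled pipeline config
# once, upstream, instead of discovering a bad value mid-pipeline.
from dataclasses import dataclass

import yaml

REQUIRED_KEYS = {"model_name", "training_image", "instance_type", "schedule"}


@dataclass
class PipelineConfig:
    model_name: str
    training_image: str
    instance_type: str
    schedule: str


def load_config(path: str) -> PipelineConfig:
    with open(path) as f:
        raw = yaml.safe_load(f)
    missing = REQUIRED_KEYS - raw.keys()
    if missing:
        raise ValueError(f"Config {path} missing required keys: {missing}")
    return PipelineConfig(**{k: raw[k] for k in REQUIRED_KEYS})
```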

Enabling and managing machine learning initiatives is a probabilistic endeavor. It is very difficult to predict which problems may be easy or difficult to solve and which approaches and model configurations will produce the optimal model. Developing a good model will often necessitate many sub-optimal iterations along the way and the pace of development is non-linear. There may be cultural gaps between machine learning and engineering teams due to different backgrounds, development styles, and values.

A key metric for success on these projects is to focus less on the number of models deployed to production and more on the number of iterations of test models that were executed. Enabling fast and lightweight experimentation is more valuable in a machine learning product space than building out complex, feature-heavy R&D models.

Model portability and easy configuration of simple, lightweight ML models are consequently higher-value initial goals than investing in complex, high-effort R&D algorithm support like TensorFlow, PyTorch, or Spark frameworks.


Task orchestration

  • DAG workflows should instantiate runtime parameters once at the beginning of the workflow and push those parameters to downstream tasks (see the sketch after this list).
  • Storage is cheap – it’s safer to checkpoint and persist data computation whenever possible across data extraction, preprocessing, and training.
  • The best way to make sure that you train like you serve is to save the set of features used at serving time, and then pipe those features to a log to use them at training time.
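
A minimal Airflow sketch of that first point might look like the following, where an initial task resolves all runtime parameters and downstream tasks pull them via XCom; the DAG id, bucket, and task bodies are illustrative placeholders:

```python
# Illustrative Airflow DAG: resolve runtime parameters once in the first task,
# pass them downstream via XCom, and checkpoint data between steps.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def resolve_params(**context):
    # Resolve all runtime parameters up front (run date, S3 prefixes, model version).
    return {
        "run_date": context["ds"],
        "s3_prefix": f"s3://my-ml-bucket/example-model/{context['ds']}",  # bucket illustrative
    }


def extract(ti, **_):
    params = ti.xcom_pull(task_ids="resolve_params")
    # ... run the versioned SQL extract and checkpoint the result under params["s3_prefix"] ...


def preprocess(ti, **_):
    params = ti.xcom_pull(task_ids="resolve_params")
    # ... read the checkpointed extract, write preprocessed features back to the data lake ...


with DAG(
    dag_id="example_model_training_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t0 = PythonOperator(task_id="resolve_params", python_callable=resolve_params)
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="preprocess", python_callable=preprocess)

    t0 >> t1 >> t2
```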

API Endpoints

  • The shared-state controller application correctly versions Sagemaker models across deployments to enable blue-green deployments and canary models.
  • A registry of different Sagemaker API payload dictionaries defines default parameters for operation calls, with custom configuration injected at runtime invocation.
  • S3 was our chosen artifact storage framework. We needed a set of utility functions for standardizing the construction of S3 URI namespaces across inference, training, serialization, and tombstone files (both patterns are sketched below).
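
A rough sketch of both patterns, assuming a payload registry keyed by Sagemaker operation name and a simple URI helper; the defaults and naming scheme here are illustrative, not the exact production code:

```python
# Hypothetical sketch: default Sagemaker API payloads merged with runtime
# overrides, plus a helper that standardizes S3 URI namespaces per model run.
import copy

# Defaults for boto3 sagemaker client calls; keys mirror the API parameter names.
PAYLOAD_REGISTRY = {
    "create_endpoint_config": {
        "ProductionVariants": [
            {
                "VariantName": "primary",
                "InstanceType": "ml.m5.large",
                "InitialInstanceCount": 1,
            }
        ],
    },
    "create_transform_job": {
        "TransformResources": {"InstanceType": "ml.m5.large", "InstanceCount": 1},
        "BatchStrategy": "MultiRecord",
    },
}


def build_payload(operation: str, overrides: dict) -> dict:
    """Deep-copy the registered defaults and overlay runtime configuration."""
    payload = copy.deepcopy(PAYLOAD_REGISTRY[operation])
    payload.update(overrides)  # shallow merge is sufficient for top-level keys
    return payload


def s3_uri(bucket: str, model_name: str, version: str, artifact: str) -> str:
    """Standardize the namespace: s3://<bucket>/<model>/<version>/<artifact>."""
    return f"s3://{bucket}/{model_name}/{version}/{artifact}"
```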

Key steps in deploying a model as an API web service for real-time inference:

  • Invocation model: Write a lightweight API that handles inference requests and returns predictions, input data, and model metadata attributes, attaching unique transaction IDs for building audit trails (a minimal Flask sketch follows this list).
  • Virtualization: Dockerize that API and deploy to a cluster with appropriate network overlays, authentication protocols, and application routes.
  • Horizontal Scalability: Configure autoscaling, load balancing, logging, authentication protocols, and any other infrastructure needed.
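
A minimal sketch of such an invocation wrapper is below, following SageMaker’s bring-your-own-container convention of a /ping health check and an /invocations route; the artifact path, payload shape, and metadata fields are illustrative assumptions:

```python
# Minimal Flask wrapper sketch: a health check plus an inference route that
# attaches a transaction id and model metadata to every response.
import pickle
import uuid

import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)

# /opt/ml/model is where SageMaker mounts the model artifact inside the container.
with open("/opt/ml/model/model.pkl", "rb") as f:
    MODEL = pickle.load(f)
MODEL_VERSION = "v1"  # illustrative metadata


@app.route("/ping", methods=["GET"])
def ping():
    # Health check: return 200 once the model artifact has loaded successfully.
    return "", 200


@app.route("/invocations", methods=["POST"])
def invocations():
    payload = request.get_json()
    features = pd.DataFrame(payload["instances"])
    predictions = MODEL.predict(features).tolist()
    return jsonify(
        {
            "transaction_id": str(uuid.uuid4()),  # unique id for audit trails
            "model_version": MODEL_VERSION,
            "inputs": payload["instances"],
            "predictions": predictions,
        }
    )


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```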

Beyond the work needed to simply stand up an ML API endpoint, there are additional considerations around how to accelerate developer productivity when building out new models.

  • Are API inference access patterns fairly uniform across deployed models? Perhaps most of these lightweight endpoints can be abstracted out by boilerplate code.
  • What types of inference requests are most common? If throughput is more valuable than latency, building out batch prediction capabilities with input files may be a higher priority than enabling real-time API point inferences.
  • Is data ETL for training and serving mainly sourced from a few distinct datastores or is there a highly diverse set of pipeline inputs? How often do these inputs change? Self-service data ETL can be quickly driven by database table configurations or stabilized with less flexibility by configuring YAML files in source control repositories.

Monitoring

  • What is the degradation rate of your model’s performance? What percentage loss in accuracy metrics occurs over a day? A month? A quarter? Calculating this degradation rate allows you to prioritize re-training freshness requirements for your models (a quick calculation sketch follows this list).
  • Invest in testing before persisting artifacts and deploying to production. This may seem self-evident given CI/CD culture and best practices of catching errors through testing rather than monitoring, but machine learning models need regression tests on model weight distributions and inference accuracy in addition to integration testing on data flow across the inference pipeline.
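
As a back-of-the-envelope example, assuming a monitoring extract of daily accuracy keyed by model age, the degradation rate could be estimated roughly like this:

```python
# Hypothetical sketch: estimate how fast accuracy decays as a model ages,
# given a log of daily evaluation metrics against fresh labeled data.
import pandas as pd

# Assumed columns: "days_since_training", "accuracy"; path is illustrative.
metrics = pd.read_parquet("monitoring/daily_accuracy.parquet")

baseline = metrics.loc[metrics["days_since_training"] == 0, "accuracy"].mean()
one_month = metrics.loc[metrics["days_since_training"].between(28, 32), "accuracy"].mean()

degradation_per_month = (baseline - one_month) / baseline
print(f"~{degradation_per_month:.1%} relative accuracy loss after one month")
```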

There are three typical issues that degrade the performance of deployed ML models.

  • Concept drift: as time progresses, the accuracy of production models decreases due to increasing divergence between the training data distribution and the real-world serving data distribution. Existing customer preferences may change. (A simple drift statistic is sketched after this list.)
  • Locality: Models trained with a given context (i.e. geography, demographics, industry) perform worse when extrapolated to new groups. As new users onboard onto the model, new customers may introduce new preferences to datasets.
  • Data quality: ML models are particularly sensitive to silent data failures – i.e. a stale reference table, an outage for a particular column within your event stream, new vocabularies or schemas for upstream datasets, or malformed data entering your dataset.
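
One common statistic for the concept drift case is the population stability index (PSI), which compares binned training and serving distributions of a feature; the sketch below is a simplified illustration, not necessarily what the platform used:

```python
# Illustrative drift check: population stability index (PSI) between the
# training and serving distributions of a single numeric feature.
import numpy as np


def psi(train: np.ndarray, serving: np.ndarray, bins: int = 10) -> float:
    edges = np.histogram_bin_edges(train, bins=bins)
    # Clamp serving values into the training range so every value lands in a bin.
    serving = np.clip(serving, edges[0], edges[-1])
    train_pct = np.histogram(train, bins=edges)[0] / len(train)
    serve_pct = np.histogram(serving, bins=edges)[0] / len(serving)
    # Clip to avoid log(0) on empty bins.
    train_pct = np.clip(train_pct, 1e-6, None)
    serve_pct = np.clip(serve_pct, 1e-6, None)
    return float(np.sum((serve_pct - train_pct) * np.log(serve_pct / train_pct)))


# Rule of thumb: PSI above roughly 0.2 often signals drift worth investigating.
```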

Lessons Learned Along the Way

  • Add process only when appropriate – the success of a machine learning orchestration framework is not how many models you successfully deploy into production, but the number of experiments enabled.
  • Be framework-agnostic. The decision-making process around tools should be driven by the ML service needs, the size and expertise of the Data Science product users, organizational factors like cloud vendors and industry privacy policies, and the maturity of internal data platforms.
  • Build toward abstraction. The more parts of the data science pipeline that can be abstracted out, the easier it is to test each component and restart failed pipelines from checkpoints.
  • Aim for a neutral first launch. This removes the pressure to showcase immediate value from deployed ML models and allows space to just focus on getting the infrastructure, permissions, and glue to get the architecture operational.

During my exploration of ML resources and frameworks, I came across some excellent blogs, articles, and presentations that helped guide my journey. I’ve listed them here for anyone else looking to dive deeper into writing and modeling machine learning pipelines:

Feel free to connect with me on LinkedIn, Twitter, Github, or Medium!

