
MLOps Practices for Data Scientists


Architecting Production Machine Learning Systems

You may have heard this a lot already, but only a small portion of machine learning models make it into production. Deploying and operating a machine learning model has proven challenging for most industries that have started applying ML to their use cases. In this article, I'll share some MLOps best practices and tips that will allow you to take your ML model and operate it properly in production. Before we start, let's talk about the typical ML project lifecycle.

ML Project Lifecycle

A typical ML lifecycle can be summarized by the following diagram, which is composed of three main phases.

ML Project Lifecycle – inspired by Uber Machine Learning, Source: https://www.linkedin.com/pulse/doing-machine-learning-uber-way-five-lessons-from-first-rodriguez/

In the first phase, and before we get elbow-deep in our data, it is important to set ourselves up for success. Therefore, alongside the business experts, we need to carefully define our problem and the business objectives. We need to answer some important questions that allow us to make training and serving decisions about the design of the model and the production pipeline. For example:

  • What is the ideal outcome?
  • What is our evaluation metric? How can we define an ROI?
  • What are the success and failure criteria?
  • What are the latency requirements? And can we get each feature for serving within the latency requirements? …

In the second phase, we prototype our first ML model, or in other words, we perform an ML feasibility study.

The goal here is to prove the ML business value using the metric defined in the first phase. Remember ML engineering best practice number one: "keep the first model simple and get the infrastructure right." The first model provides the biggest boost to our product, so it doesn't need to be the fanciest model at this stage.

In the third phase, we move to production. This is the main topic of this article, so we will cover it in more detail in the coming sections. Once our production pipeline is ready and well designed, we can gather insights and iterate on new ideas much faster and more efficiently.

What are Data Scientists mostly Doing Today?

Today, most journeys to get a machine learning model into production look something like this. As data scientists, we start with an ML use case and a business objective. With the use case at hand, we gather and explore the data that seems relevant from different data sources to understand it and assess its quality.

What are Data Scientists mostly Doing Today? – inspired by Google Cloud Tech, Source: https://cloud.withgoogle.com/next/sf/sessions?session=AI212

Once we get a sense of our data, we start crafting and engineering the features we deem interesting for our problem. We then move into the modeling stage and begin running experiments. At this stage, we manually execute the different experimental steps over and over. For each experiment, we do some data preparation, feature engineering, and testing. Then we do some model training and hyperparameter tuning on any models or model architectures we consider particularly promising.

Last but not least, we evaluate all of the generated models, testing them against a holdout dataset, computing the different metrics, looking at performance, and comparing the models with one another to see which one works best or yields the highest evaluation metric. The whole process is iterative and is manually executed over and over again until we get the model with the best possible performance.

Once we obtain the best-performing model, we usually place it in some storage and throw it over the wall to the IT and operations team, whose job is then to deploy the model to production as a prediction service. And we, unfortunately, consider our job done.

ML Operations Pitfalls – What’s wrong with this approach?

Here’s what’s wrong with the above approach.

Manual: The steps are highly manual and are written from scratch each time. Every time data scientists need to conduct a new experiment, they have to go through their notebooks, update them, and execute them manually. If the model needs to be refreshed with new training data, they have to execute their code again manually.

Time-consuming: This manual process is time-consuming and not efficient.

Not reusable: The custom code written in the notebooks can only be understood by its author and can't be reused or leveraged by other data scientists or across other use cases. Even the authors themselves might find it difficult to understand their own work after a certain period of time.

Irreproducibility: Reproducibility is the ability to recreate an artifact exactly. In machine learning, being able to reproduce the exact model is important. With a manual process like this one, we are unlikely to be able to reproduce an older version of a model: the underlying data might have changed, the code itself might have been overwritten, or dependencies and their exact versions might not have been recorded. Therefore, in case of a problem, any attempt to roll back to an older version of a model might be impossible.

Error-prone: This process can lead to many errors, such as training-serving skew, model performance decay, model bias, and infrastructure failures over time.

  • Training-Serving Skew: When we deploy our model, we will sometimes notice that its online performance is well below the performance we expected and measured on the holdout dataset. This phenomenon is very common for machine learning models in production. Discrepancies between the training and serving pipelines introduce training-serving skew, which can be very hard to detect and can render a model's predictions completely useless. To avoid this problem, we need to ensure that the exact same processing functions are applied to both training and serving data, monitor the distributions of training and serving data, and monitor the real-time performance of the model and compare it to the offline performance.
  • Model Decay: In most use cases, data profiles are dynamic and change with time. When the underlying data changes, model performance decays because the learned patterns are no longer up to date. Static models rarely continue to provide value. We need to make sure the models are updated regularly with new data, and monitor the real-time performance of the served models to detect decay. The following figure shows how a deployed model decays with time, and the constant need to replace it with a fresh one.
Image by author
  • Model Bias: AI systems can be applied in critical settings, such as medical diagnosis or prognosis, matching people's skills with jobs, or deciding a person's eligibility for a loan. As useful as these applications may be, the impact of any bias in such systems can be widely harmful. Hence, an important property of AI systems is fairness and inclusiveness for all. Therefore, for any machine learning model, it's important to measure fairness across sensitive features (gender, race, etc.); which features are sensitive depends on the context. Even when no feature is considered sensitive, it's important to assess the performance of the system over different subgroups, so that we are aware of any underperforming subgroup before a model is deployed (see the sketch after this list).
Image by author – Image generated by Microsoft's Fairlearn tool
  • Scalability: Scalability matters in machine learning because training a model can take a long time; optimizing a model that takes multiple weeks to train is not workable. A model can also be so big that it can't fit into the working memory of the training device. Even if we decide to scale vertically, it is going to be more expensive than scaling horizontally. There are cases where the data volume is small and scalability is not needed at the beginning, but with continuous training, the amount of training data we receive may grow over time and introduce memory problems for the infrastructure we set in place.
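To make the bias point concrete, here is a minimal sketch of a per-subgroup evaluation using Fairlearn's MetricFrame (the same kind of analysis behind the figure above). The model, data, and "group" feature are synthetic placeholders, not part of any real use case.

```python
# Sketch: measuring model performance across (hypothetical) sensitive subgroups.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from fairlearn.metrics import MetricFrame

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                              # toy features
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)       # toy labels
group = rng.choice(["A", "B"], size=1000)                   # hypothetical sensitive feature

model = LogisticRegression().fit(X, y)
y_pred = model.predict(X)

frame = MetricFrame(
    metrics={"accuracy": accuracy_score, "recall": recall_score},
    y_true=y,
    y_pred=y_pred,
    sensitive_features=group,
)
print(frame.by_group)       # the same metrics broken down per subgroup
print(frame.difference())   # largest gap between subgroups, per metric
```

A large gap in `frame.difference()` is the kind of signal that should block a model from being promoted, exactly as described above.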

Principal Components of an ML System

In this section, we’ll describe the principal components of an ML system and the best practices around them that will allow us to avoid the above pitfalls.

The process of delivering an integrated ML system and continuously operating it in production involves the following components:

Principal Components of an ML system – Image by author

Let's go through each of the pipeline components in a little more detail.

Data Ingestion

This component is usually external to the ML pipeline of our use case. In mature data processes, data engineers should be optimizing continuous data ingestion and transformation to continuously deliver up-to-date data to the different data analytics teams within the organization that are looking to uncover data-driven insights and make better-informed decisions.

Data Validation

Image by author

In this component, our focus is on validating the input data fed to our pipeline. The significance of this problem in ML systems cannot be overstated. Regardless of the ML algorithms employed, errors in the data can severely impact the quality of the generated model. As the popular data science saying goes, "garbage in, garbage out." Therefore, it is crucial to spot data errors early.

Error-free data also plays a role in analyzing model output: it allows us to properly understand and debug the output of our ML model. As a result, data must be treated as a first-class citizen in ML systems, just like algorithms and infrastructure. It must be continuously monitored and validated at every execution of the ML pipeline.

This step is also used before model training to decide whether we should retrain the model (in case of data drift) or stop the execution of the pipeline (in case of data anomalies).

Here is what the typical behavior of the data validation component looks like (a minimal code sketch follows the list):

  • It computes and displays descriptive statistics about the data. It can also compare the descriptive statistics of consecutive spans of data (i.e., the current pipeline execution N versus the previous execution N-1) to show how the data distribution has changed.
Source: https://medium.com/tensorflow/introducing-tensorflow-data-validation-data-understanding-validation-and-monitoring-at-scale-d38e3952c2f0
  • It infers the data schema that represents the data in use.
Source: https://medium.com/tensorflow/introducing-tensorflow-data-validation-data-understanding-validation-and-monitoring-at-scale-d38e3952c2f0
  • It detects data anomalies. It checks whether the dataset matches the predefined, validated schema. It detects data drift between consecutive spans of data (i.e., between the current pipeline execution N and the previous execution N-1), such as between different days of training data. It also detects training-serving skew by comparing the training data with the online serving data.
Source: https://github.com/tensorflow/tfx/blob/master/docs/tutorials/data_validation/tfdv_basic.ipynb
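As an illustration, here is a minimal sketch of the behavior described above using TensorFlow Data Validation (TFDV), the library behind the screenshots in this section. The file paths, the watched feature name, and the drift threshold are placeholders.

```python
# Sketch: computing statistics, inferring a schema, and detecting anomalies/drift
# with TensorFlow Data Validation. Paths and the feature name are placeholders.
import tensorflow_data_validation as tfdv

# Descriptive statistics for the previous span (N-1) and the current span (N).
prev_stats = tfdv.generate_statistics_from_csv(data_location="data/span_n_minus_1.csv")
curr_stats = tfdv.generate_statistics_from_csv(data_location="data/span_n.csv")

# Infer the schema from the reference span (in practice it is curated and versioned).
schema = tfdv.infer_schema(statistics=prev_stats)

# Optionally declare a drift comparator on a feature we want to watch.
tfdv.get_feature(schema, "payment_type").drift_comparator.infinity_norm.threshold = 0.01

# Validate the new span against the schema and against the previous statistics.
anomalies = tfdv.validate_statistics(
    statistics=curr_stats,
    schema=schema,
    previous_statistics=prev_stats,
)
tfdv.display_anomalies(anomalies)
```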

In production, with continuous training, here is a schematic view of how statistics are generated for newly arriving data, validated against the schema, and turned into anomaly reports:

Source: https://medium.com/tensorflow/introducing-tensorflow-data-validation-data-understanding-validation-and-monitoring-at-scale-d38e3952c2f0

Data Transform

Image by author

In this step, the data is prepared for the ML task. This involves data cleaning, filtering, transformations, and feature wrangling, including things like generating feature-to-integer mappings. This component also prepares the feature metadata that may be needed by the trainer component (for example, the parameters needed for feature normalization during training, or the dictionaries needed for encoding categorical variables). These are called transformation artifacts; they help with constructing the model inputs.

Critically, whatever mappings are generated must be saved and reused at serving time (when the trained model is used to make predictions). Failing to do this consistently results in the training-serving skew problem we talked about earlier. The sketch below illustrates the idea.

Image by author
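Here is a minimal, framework-agnostic sketch of that idea: the transformation artifact (a fitted scaler in this toy example) is persisted by the training pipeline and reloaded by the serving pipeline so both apply the exact same transformation. In TFX this role is played by the Transform component; the scikit-learn version below is only for illustration.

```python
# Sketch: persisting a transformation artifact at training time and reusing it at
# serving time, so training and serving share the exact same preprocessing.
import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# --- Training pipeline ---
X_train = np.array([[10.0], [20.0], [30.0], [40.0]])     # toy numeric feature
scaler = StandardScaler().fit(X_train)                    # the transformation artifact
joblib.dump(scaler, "scaler.joblib")                      # saved alongside the model

# --- Serving pipeline ---
serving_scaler = joblib.load("scaler.joblib")             # the *same* artifact is reloaded
request_features = np.array([[25.0]])
model_input = serving_scaler.transform(request_features)  # identical preprocessing at serving time
print(model_input)
```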

Model Training

Image by author

The model training component is responsible for training our model. In many use cases, models train for hours, days, or even weeks, and optimizing a model that takes multiple weeks to train is not workable. In other cases, the data used to train the model doesn't even fit in memory.

In those scenarios, the model training component should support data and model parallelism and scale to a large number of workers. It should also be able to handle out-of-memory data.

Ideally, all the components of our ML system should be scalable and running on infrastructure that supports scalability.

The model training component should also automatically monitor and log everything during training. We cannot train a machine learning model over a long period of time without seeing how it's doing and making sure it's correctly configured to minimize the loss function as the iterations progress. Finally, the training component should support hyperparameter tuning. A minimal sketch of such a training setup follows.
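Below is a minimal Keras sketch of such a training component: data is streamed from disk so it never has to fit in memory, training is distributed across the available GPUs, and progress is checkpointed and logged for TensorBoard. The file patterns, feature spec, and model architecture are placeholders.

```python
# Sketch: a training component that streams data, distributes training,
# checkpoints progress, and logs metrics for monitoring. Paths are placeholders.
import tensorflow as tf

FEATURE_SPEC = {                                          # placeholder schema
    "features": tf.io.FixedLenFeature([10], tf.float32),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

def parse_example(serialized):
    parsed = tf.io.parse_single_example(serialized, FEATURE_SPEC)
    return parsed["features"], parsed["label"]

def make_dataset(pattern):
    # Stream TFRecord files so the data never has to fit in memory.
    files = tf.data.Dataset.list_files(pattern)
    return (tf.data.TFRecordDataset(files)
            .map(parse_example)
            .shuffle(10_000)
            .batch(256)
            .prefetch(tf.data.AUTOTUNE))

strategy = tf.distribute.MirroredStrategy()               # data parallelism across local GPUs
with strategy.scope():                                    # variables created under the strategy
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

callbacks = [
    tf.keras.callbacks.ModelCheckpoint("ckpts/model-{epoch:02d}", save_weights_only=True),
    tf.keras.callbacks.TensorBoard(log_dir="logs"),       # real-time monitoring in TensorBoard
]
model.fit(make_dataset("data/train-*.tfrecord"),
          validation_data=make_dataset("data/eval-*.tfrecord"),
          epochs=10,
          callbacks=callbacks)
```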

Model Analysis

Image by author

In the model analysis component, we conduct a deep analysis of the training results and ensure that our exported models are performant enough to be pushed to production.

This step helps us guarantee that the model is promoted for serving only if it satisfies the quality criteria we set during the framing phase. The criteria should include improved performance compared to previously deployed models and fair performance on the various data subsets/slices. In the following figure, we display the performance of our trained model on slices of the feature trip_start_hour.

Source: https://blog.tensorflow.org/2018/03/introducing-tensorflow-model-analysis.html

The output of this step is a set of performance metrics and a decision on whether to promote the model to production. A simplified sketch of such a gate follows.
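In TFX this gate is implemented by the Evaluator component using TensorFlow Model Analysis; the simplified, self-contained sketch below shows the same logic with plain pandas. The evaluation data, slice feature, and thresholds are illustrative placeholders.

```python
# Sketch: slice-based model analysis and a simple "promote or not" decision.
import pandas as pd
from sklearn.metrics import roc_auc_score

eval_df = pd.DataFrame({                       # toy evaluation results
    "trip_start_hour": [7, 7, 18, 18, 23, 23, 7, 18],
    "label":           [0, 1, 1, 0, 1, 0, 0, 1],
    "new_model_score": [0.2, 0.8, 0.7, 0.4, 0.9, 0.6, 0.1, 0.8],
})

overall_auc = roc_auc_score(eval_df["label"], eval_df["new_model_score"])

# Per-slice performance: an overall gain can hide a regression on one slice.
slice_auc = eval_df.groupby("trip_start_hour").apply(
    lambda s: roc_auc_score(s["label"], s["new_model_score"])
)

PRODUCTION_AUC = 0.78        # metric of the currently deployed model (placeholder)
MIN_SLICE_AUC = 0.70         # quality floor required on every slice (placeholder)

promote = overall_auc > PRODUCTION_AUC and (slice_auc > MIN_SLICE_AUC).all()
print(slice_auc)
print("promote to production:", promote)
```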

Model Serving

In contrast to the training component, where we usually care about scaling with data size and model complexity, in the serving component we are interested in responding to variable user demand while minimizing response latency and maximizing throughput.

Image by author

Therefore, the serving component should have low latency to respond quickly to users, be highly efficient so that many instances can run simultaneously if needed, scale horizontally, and be reliable and robust to failures.

We also need the serving component to update easily to new versions of the model. When we get new data, trigger a new pipeline run, or test new model architecture ideas, we'll want to push a new version of the model, and we want the system to transition seamlessly to this new version. A minimal request sketch against such a service follows.
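For example, a model served with TensorFlow Serving exposes a REST endpoint that clients can call with minimal overhead; the host, port, model name, and payload below are placeholders.

```python
# Sketch: querying a model hosted by TensorFlow Serving over its REST API.
# Host, port, model name, and the feature vectors are placeholders.
import requests

# The default route hits the latest model version; a specific version can be pinned
# with ".../models/my_model/versions/2:predict" when rolling out or rolling back.
url = "http://localhost:8501/v1/models/my_model:predict"
payload = {"instances": [[1.0, 2.0, 5.0], [0.5, 1.5, 3.0]]}

response = requests.post(url, json=payload, timeout=1.0)   # tight timeout: latency matters
response.raise_for_status()
print(response.json()["predictions"])
```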

Monitoring

As we mentioned earlier, the performance of our deployed ML models can decay over time due to constantly evolving data profiles, and we need to make sure that our system monitors and responds to this degradation.

Therefore, we need to track summary statistics of our data and monitor the online performance of our model in order to send notifications, roll back when values deviate from our expectations, or potentially invoke a new iteration in the ML process.

Source: https://ml-ops.org/content/mlops-principles

In short, online monitoring is key to detecting performance degradation and model staleness. It acts as a cue for a new experimentation iteration and for retraining the model on new data. A minimal drift-check sketch follows.
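As a small illustration, the sketch below compares the distribution of one feature in recent serving traffic against its training distribution with a two-sample Kolmogorov-Smirnov test. The data, threshold, and retraining hook are placeholders for whatever monitoring stack is actually in place.

```python
# Sketch: monitoring a serving feature against its training distribution.
# Significant drift would trigger an alert or a retraining pipeline run (placeholder hook).
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
train_feature = rng.normal(loc=0.0, scale=1.0, size=10_000)    # distribution seen at training
serving_feature = rng.normal(loc=0.4, scale=1.0, size=2_000)   # recent serving traffic (shifted)

statistic, p_value = ks_2samp(train_feature, serving_feature)

DRIFT_P_VALUE = 0.01          # alerting threshold (placeholder)
if p_value < DRIFT_P_VALUE:
    print(f"Drift detected (KS={statistic:.3f}, p={p_value:.2e}) -> trigger retraining pipeline")
else:
    print("No significant drift detected")
```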

Pipeline Orchestration Component

The automation level of the steps we just described defines the maturity of our ML system. It also determines how quickly new models can be trained when triggered by model decay or the arrival of new data.

A manual process is still very common in many use cases today. It might be sufficient when models rarely change because the data distribution is static. But in practice, this is rarely the case: data is often dynamic, and models frequently break when they are deployed in the real world. Static models will surely fail to adapt to changes in the data that describes the environment.

A manual process can also be dangerous, as it creates a disconnect between ML training and ML serving. It separates the data scientists who create the model from the engineers who operate it as a prediction service, and this separation can lead to the training-serving skew problem.

The goal of the orchestration component is to connect the different components of the system. It runs the pipeline steps in sequence and automatically moves from one step to the next based on defined conditions. This is the first step towards automation, as we can now automatically train new models in production on fresh data based on live pipeline triggers.

Image by author

Note that in production we are not deploying a single trained model as a prediction service; we are deploying a whole training pipeline, which automatically and recurrently runs to serve the trained model as the prediction service. A minimal pipeline-definition sketch follows.
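As an illustration, here is a minimal TFX pipeline sketch that follows the standard component wiring and runs locally with the LocalDagRunner. All paths and the trainer module file are placeholders, and a production deployment would hand the same pipeline to an orchestrator such as Kubeflow Pipelines or Airflow rather than running it locally.

```python
# Sketch: wiring pipeline components so they run as one automated sequence.
# Paths and the trainer module file are placeholders.
from tfx import v1 as tfx

def create_pipeline(pipeline_name, pipeline_root, data_root,
                    module_file, serving_dir, metadata_path):
    example_gen = tfx.components.CsvExampleGen(input_base=data_root)            # data ingestion
    statistics_gen = tfx.components.StatisticsGen(examples=example_gen.outputs["examples"])
    schema_gen = tfx.components.SchemaGen(statistics=statistics_gen.outputs["statistics"])
    trainer = tfx.components.Trainer(
        module_file=module_file,                                                # user-provided run_fn
        examples=example_gen.outputs["examples"],
        train_args=tfx.proto.TrainArgs(num_steps=1000),
        eval_args=tfx.proto.EvalArgs(num_steps=100),
    )
    pusher = tfx.components.Pusher(
        model=trainer.outputs["model"],
        push_destination=tfx.proto.PushDestination(
            filesystem=tfx.proto.PushDestination.Filesystem(base_directory=serving_dir)),
    )
    return tfx.dsl.Pipeline(
        pipeline_name=pipeline_name,
        pipeline_root=pipeline_root,
        components=[example_gen, statistics_gen, schema_gen, trainer, pusher],
        metadata_connection_config=tfx.orchestration.metadata
            .sqlite_metadata_connection_config(metadata_path),
    )

# Local run for illustration only.
tfx.orchestration.LocalDagRunner().run(
    create_pipeline("demo_pipeline", "pipelines/demo", "data/",
                    "trainer_module.py", "serving_model/demo", "metadata/demo/metadata.db"))
```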

Pipeline Metadata Storage

The role of the pipeline metadata store is to record all details about our ML pipeline executions. This is very important for keeping the lineage between components, reproducing deployed models whenever needed, and debugging any errors we encounter (a minimal query sketch follows the list below).

Each time we execute the pipeline, the store records all the details about our pipeline execution such as:

Image by author
  • The versions of the pipeline and component source code that were executed.
  • The input arguments that were passed to the pipeline.
  • The artifacts/outputs produced by each executed component of the pipeline, such as the path to the raw data, transformed datasets, validation statistics and anomalies, trained model…
  • The model evaluation metrics and the model validation decision regarding deployment, which are produced by the model analysis and validation components.
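As a sketch, here is how such a store can be inspected with the ML Metadata (MLMD) library, which TFX uses under the hood to record artifacts and executions; the SQLite path is a placeholder.

```python
# Sketch: reading the pipeline metadata store with ML Metadata (MLMD).
# The SQLite path is a placeholder for the store the pipeline writes to.
from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

config = metadata_store_pb2.ConnectionConfig()
config.sqlite.filename_uri = "metadata/demo/metadata.db"
config.sqlite.connection_mode = 3  # READWRITE_OPENCREATE
store = metadata_store.MetadataStore(config)

# Every artifact produced by every pipeline run: raw data, statistics, models, ...
for artifact in store.get_artifacts():
    print(artifact.type_id, artifact.uri)

# Every component execution, with its recorded state and properties.
for execution in store.get_executions():
    print(execution.type_id, execution.last_known_state)
```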

CI/CD Pipeline Automation

So far, we have only talked about how to automate the continuous execution of the ML pipeline so that it retrains new models based on triggers such as the availability of new data or model decay, in order to capture newly emerging patterns.

But what if we want to test a new feature, a new model architecture, or a new hyperparameter? That's what an automated CI/CD pipeline is for. A CI/CD pipeline allows us to rapidly explore new ideas and experiments. It lets us automatically build, test, and deploy the new pipeline and its components to the intended environment.

Here’s how the CI/CD pipeline automation complements the continuous ML pipeline automation:

  • If given new implementation/code (new model architecture, feature engineering, and hyperparameters …), a successful CI/CD pipeline deploys a new continuous ML pipeline.
  • If given new data (or a model decay trigger), a successful automated continuous pipeline deploys a new prediction service. To train a new ML model with new data, the previously deployed ML pipeline is executed on the fresh data.
Image by author

A complete end-to-end automated pipeline should look like this:

Image by author – inspired by Google Cloud Tech, Source: https://cloud.withgoogle.com/next/sf/sessions?session=AI212
  • We iteratively try out new ML ideas, updating some of the pipeline components as we go (introducing a new feature, for example, requires updating the data transform component). The output of this stage is the source code of the new ML pipeline components, which is then pushed to the source repository of the target environment.
  • The presence of new source code triggers the CI/CD pipeline, which in turn builds the new components and pipeline, runs the corresponding unit and integration tests to make sure everything is correctly coded and configured, and finally deploys the new pipeline to the target environment if all tests pass (a minimal component test sketch follows this list). The unit and integration tests for ML systems deserve an article of their own.
  • The newly deployed pipeline is automatically executed in production based on a schedule, the presence of new training data, or another trigger. The output of this stage is a trained model that is pushed to the model registry and continuously monitored.
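The sketch below shows the flavor of unit test such a CI pipeline would run before deploying a new pipeline version; scale_to_zero_one is a hypothetical helper from the transform component, written here only for illustration.

```python
# Sketch: the kind of unit test a CI pipeline runs (e.g., with pytest) before
# deploying a new ML pipeline. `scale_to_zero_one` is a hypothetical transform helper.
import numpy as np

def scale_to_zero_one(values: np.ndarray) -> np.ndarray:
    """Hypothetical feature transform: min-max scale a numeric feature to [0, 1]."""
    span = values.max() - values.min()
    if span == 0:
        return np.zeros_like(values, dtype=float)
    return (values - values.min()) / span

def test_scaling_stays_in_range():
    scaled = scale_to_zero_one(np.array([3.0, 7.0, 11.0]))
    assert scaled.min() == 0.0 and scaled.max() == 1.0

def test_constant_feature_does_not_divide_by_zero():
    scaled = scale_to_zero_one(np.array([5.0, 5.0, 5.0]))
    assert np.all(scaled == 0.0)
```

Only if tests like these pass does the CI/CD pipeline build and deploy the new continuous training pipeline to the target environment.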

Why TensorFlow?

In this final section, I would like to mention why TensorFlow is my preferred framework for developing an integrated ML system. Of course, TensorFlow might not be suitable for all use cases, and sometimes it might even be overkill, especially when deep learning is not needed. However, I tend to use TensorFlow whenever possible, for the following reasons:

  • TensorFlow comes with TensorFlow Extended (TFX). TFX allows us to focus on optimizing our ML pipeline while paying less attention to the boilerplate code that repeats itself every time. Components like data validation and model analysis can be used without having to develop custom code that reads our data and detects anomalies between two pipeline executions. With TFX, this can be done with very few lines of code, saving us an enormous amount of time when developing our pipeline components. The screenshots in the data validation and model analysis sections above were taken from TFX. I'll try to write an article dedicated to TFX in the future.
  • We can design custom models built from layers using the TF layers and losses APIs. If we're building something fairly standard, TensorFlow has a set of pre-made estimators we can try out, and TensorFlow 2 works well with Keras models.
  • As data and training time grow, our needs will increase. Checkpoints allow us to pause and resume training when needed, or to continue training if the preset number of epochs turns out to be insufficient (see the sketch after this list).
  • TensorFlow's dataset API (tf.data) handles out-of-memory datasets very well.
  • Model training can take hours, sometimes days. We can't train our model for a long period of time without checking that it's operating as intended. TensorBoard is TensorFlow's visualization toolkit: it provides the visualization and tooling needed for ML experimentation, surfacing key metrics generated in real time during training on both the training and validation sets so we can see whether our model is correctly configured to converge, and stop training if it's not.
  • We can distribute TensorFlow on a cluster to make training faster. Going from one machine to many might sound complicated, but with TensorFlow we get distribution out of the box. TF abstracts away the details of distributed execution for training and evaluation, while supporting consistent behavior across local/non-distributed and distributed configurations.
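As a small illustration of the checkpoint point above, here is a sketch of pausing and resuming training with tf.train.Checkpoint and CheckpointManager; the model, optimizer, and directory are placeholders.

```python
# Sketch: pausing and resuming training with checkpoints.
# The model, optimizer, and checkpoint directory are placeholders.
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.Adam()

ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)
manager = tf.train.CheckpointManager(ckpt, directory="ckpts/demo", max_to_keep=3)

# On (re)start: restore the latest checkpoint if one exists, otherwise start fresh.
ckpt.restore(manager.latest_checkpoint)
if manager.latest_checkpoint:
    print("Resumed training from", manager.latest_checkpoint)
else:
    print("Starting training from scratch")

# ... run training steps, then periodically:
manager.save()   # training can be stopped and later resumed from this point
```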

References

TensorFlow Data Validation | TFX

TensorFlow Model Analysis | TFX

Eric Breck, Neoklis Polyzotis, Sudip Roy, Steven Euijong Whang, Martin Zinkevich, Data Validation for Machine Learning (2019), SysML Conference, https://systemsandml.org

Architecture for MLOps using TFX, Kubeflow Pipelines, and Cloud Build

