The Machine Learning Lifecycle in 2021

How do you actually complete a machine learning project and what are some tools that can help each step of the way?

Eric Hofesmann
Towards Data Science


Photo by Tolga Ulkan on Unsplash

Everyone and their mother is getting into machine learning (ML) in this day and age. It seems that every company that is collecting data is trying to figure out some way to use AI and ML to analyze their business and provide automated solutions.

The machine learning market is expected to reach $117 billion by 2027 — Fortune Business Insights

This surge in popularity is bringing a lot of newcomers without a formal background into the space. While it’s great that more people are getting excited and learning about this field, it needs to be clear that taking an ML project to production is not an easy task.

Image from the 2020 State of Enterprise ML by Algorithmia based on 750 businesses

55% of businesses working on ML models have yet to get them into production — Algorithmia

Many people seem to be under the assumption that an ML project is fairly straightforward if you have the data and computing resources necessary to train a model. They could not be more wrong. This assumption seems to lead to significant time and monetary costs without ever deploying a model.

Naive assumption of the ML lifecycle (Image by author)

In this article, we’ll discuss what the lifecycle of an ML project actually looks like and some tools to help tackle it.

The Machine Learning Lifecycle

In reality, machine learning projects are not straightforward: they are a cycle of iterating between improving the data, the model, and the evaluation, and that cycle is never really finished. The cycle is crucial to developing an ML model because it focuses on using model results and evaluation to refine your dataset. A high-quality dataset is the most surefire way to train a high-quality model. The speed at which you iterate through this cycle is what determines your costs; luckily, there are tools that can help speed it up without sacrificing quality.

A realistic example of ML lifecycle (Image by author)

Much like any system, even a deployed ML model requires monitoring, maintenance, and updates. You can’t just deploy an ML model and forget about it, expecting it to work as well as it did on your test set in the real world for the rest of time. ML models deployed in production environments are going to need updates as you find biases in the model, add new sources of data, require additional functionality, etc. This brings you right back into the data, model, and evaluation cycle.

As of 2021, deep learning has been prominent for about a decade and has helped bring ML front and center in the market. The ML industry has undergone a boom, with countless products being developed to aid in the creation of ML models. Every step of the ML lifecycle has tools you can use to expedite the process and avoid ending up as one of the companies with an ML project that never sees the light of day.

The next sections will deep dive into each phase of the ML lifecycle and highlight popular tools.

Phase 1: Data

Data in the ML lifecycle (Image by author)

While the end goal is a high-quality model, the lifeblood of training a good model is the amount and, more importantly, the quality of the data passed into it.

The primary data-related steps in the ML lifecycle are:

Data Collection — Collect as much raw data as possible, regardless of quality. In the end, only a small subset of it will be annotated anyway, which is where most of the cost comes from. It is useful to have a lot of data available to add as needed when problems arise with model performance.

Define your annotation schema — This is one of the most important parts of the data phase of the lifecycle, and it often gets overlooked. A poorly constructed annotation schema will result in ambiguous classes and edge cases that make it much more difficult to train a model.

For example, the performance of object detection models depends heavily on attributes like size, localization, orientation, and truncation. So including attributes like object size, density, and occlusion during annotation can provide critical metadata needed to create high-quality training datasets that models can learn from.
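
To make this concrete, here is a rough sketch of what a written-down annotation schema might look like for a vehicle detection project. The classes, attributes, and rules below are purely illustrative; the point is to pin down this level of detail before any labeling starts.

# Hypothetical annotation schema for a vehicle detection project.
# Agreeing on classes, attributes, and edge-case rules up front avoids
# ambiguous labels later.
ANNOTATION_SCHEMA = {
    "task": "object_detection",
    "classes": ["car", "truck", "wheel"],
    "attributes": {
        "occluded": {"type": "bool"},
        "truncated": {"type": "bool"},
        "size": {"type": "enum", "values": ["small", "medium", "large"]},
    },
    "rules": [
        "Label every visible wheel, even if partially occluded",
        "Skip objects smaller than 10x10 pixels",
    ],
}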

  • Matplotlib, Plotly — Plot properties of your data (see the sketch below)
  • Tableau — Analytics platform to better understand your data
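
For example, a quick Matplotlib bar chart of per-class annotation counts is often enough to reveal class imbalance before any training happens. The label list below is a stand-in for whatever your dataset actually contains.

import matplotlib.pyplot as plt
from collections import Counter

# Stand-in for a flat list of class names, one entry per annotation in your dataset
labels = ["car", "car", "truck", "wheel", "car", "wheel"]

counts = Counter(labels)
plt.bar(list(counts.keys()), list(counts.values()))
plt.xlabel("Class")
plt.ylabel("Number of annotations")
plt.title("Class distribution of the training set")
plt.show()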

Data Annotation — Annotation is a tedious process of performing the same task over and over for hours at a time, which is why annotation services are a booming business. The result is that annotators will likely make numerous mistakes. While most annotation firms guarantee a maximum error percentage (e.g. 2% max error), a larger problem is a poorly defined annotation schema that results in annotators deciding to label samples differently. This is harder for the QA team of an annotation firm to spot and is something that you need to check yourself.

Improve dataset and annotations — You will likely spend the majority of your time here when trying to improve model performance. If your model is learning but not performing well, the culprit is almost always a training dataset containing biases and mistakes that are creating a performance ceiling for your model. Improving your model generally involves things like hard sample mining (adding new training data similar to other samples the model failed on), rebalancing your dataset based on biases your model has learned, and updating your annotations and schema to add new labels and refine existing ones.

  • DAGsHub — Dataset versioning
  • FiftyOne — Visualize datasets and find mistakes
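
As a minimal sketch of the mistake-finding workflow mentioned above, FiftyOne's brain module can rank samples by how likely their annotations are to be wrong once model predictions are loaded. The dataset name and field names here are assumptions.

import fiftyone as fo
import fiftyone.brain as fob

# Assumes a dataset with `ground_truth` labels and model `predictions` already loaded
dataset = fo.load_dataset("my_detection_dataset")  # hypothetical dataset name

# Score how likely each ground truth annotation is to be a mistake
fob.compute_mistakenness(dataset, "predictions", label_field="ground_truth")

# Review the most suspicious samples first
view = dataset.sort_by("mistakenness", reverse=True)
session = fo.launch_app(view)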

Phase 2: Model

Models in the ML lifecycle (Image by author)

Even though the output of this process is a model, you will ideally spend the least amount of time in this loop.

In industry, more time is spent on datasets than models. Credit to Andrej Karpathy (source, original talk)

Explore existing pretrained models — The goal here is to reuse as many available resources as possible to give yourself the best head start on model production. Transfer learning is a core tenet of deep learning in this day and age. You will likely not be creating a model from scratch, but instead fine-tuning an existing model that was pretrained on a related task. For example, if you want to create a mask detection model, you would likely download a pretrained face detection model from GitHub, since that is a more popular topic with more prior work.
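
As a generic sketch of this kind of reuse (using a COCO-pretrained detector from torchvision rather than an actual face detection repo), fine-tuning often amounts to loading pretrained weights and swapping the prediction head for your own classes.

import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Start from a detector pretrained on COCO
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)

# Replace the box predictor head for our labels: background, mask, no-mask
num_classes = 3
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)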

Construct training loop — Your data will likely differ in some way from what was used to pretrain the model. For image datasets, things like input resolution and object sizes need to be taken into account when setting up the training pipeline for your model. You will also need to modify the output structure of the model to match the classes and structure of your labels. PyTorch Lightning provides an easy way to scale up model training with limited code.
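
A minimal PyTorch Lightning wrapper around a detector like the one above might look like the following. It assumes a torchvision-style detection model that returns a dict of losses in training mode; the learning rate and dataloader are placeholders.

import pytorch_lightning as pl
import torch

class DetectionTask(pl.LightningModule):
    def __init__(self, model, lr=1e-4):
        super().__init__()
        self.model = model
        self.lr = lr

    def training_step(self, batch, batch_idx):
        images, targets = batch
        # torchvision detection models return a dict of losses in training mode
        loss_dict = self.model(images, targets)
        loss = sum(loss_dict.values())
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)

# trainer = pl.Trainer(max_epochs=20)
# trainer.fit(DetectionTask(model), train_dataloader)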

Experiment Tracking — This entire cycle will likely require multiple iterations. You will end up training a lot of different models, so being meticulous about tracking each version of a model, along with the hyperparameters and data it was trained on, will go a long way toward keeping things organized.
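
No single tracker is prescribed here, but as one example, MLflow (or any comparable tool) lets you log the hyperparameters, data version, and metrics of each run so that models can be compared later. The run name, parameters, and values below are illustrative.

import mlflow

# One run per training job
with mlflow.start_run(run_name="mask-detector-v3"):
    mlflow.log_param("learning_rate", 1e-4)
    mlflow.log_param("dataset_version", "2021-03-15")
    mlflow.log_param("pretrained_backbone", "fasterrcnn_resnet50_fpn")

    # ... training happens here ...

    mlflow.log_metric("val_mAP", 0.47)
    mlflow.log_artifact("model_checkpoint.pt")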

Side Note: Even if you think your task is completely unique, there are still pretraining strategies worth considering. I would recommend looking into ways to pretrain your model in unsupervised or semi-supervised ways, still using only a small subset of your total raw data for fine-tuning. Depending on your task, you could also look into synthetic data to pretrain your model. The goal is just to get a model that has learned a good representation of your data, so that your fine-tuning dataset only needs to train a few layers' worth of model parameters.

Phase 3: Evaluation

Evaluation in the ML lifecycle (Image by author)

Once you’ve managed to get a model that has learned your training data, it’s time to dig in and see how well it performs on new data.

The key steps for evaluating an ML model:

Visualize model outputs — As soon as you have a trained model, you need to immediately run it on a few samples and look at the output. This is the best way to find out whether there are any bugs in your training/evaluation pipeline before running evaluation on your entire test set. It will also show if there are any glaring errors, like two of your classes being mislabeled.

Choose the right metric — Coming up with one or a few metrics can help in comparing the overall performance of models. In order to make sure you pick the best model for your task, you should develop metrics in line with your end goal. You should also update metrics as you find other important qualities you want to track. For example, if you want to start tracking how well your object detection model performs on small objects, add mAP computed only on objects with a relative bounding box size < 0.05 as one of your metrics.
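
As a sketch of such a targeted metric, FiftyOne can restrict detection evaluation to a filtered view of the dataset. The dataset name and field names are assumptions; bounding boxes are stored in relative [x, y, width, height] format, so width times height gives the relative box area.

import fiftyone as fo
from fiftyone import ViewField as F

dataset = fo.load_dataset("my_detection_dataset")  # hypothetical dataset name

# Relative area of each bounding box
bbox_area = F("bounding_box")[2] * F("bounding_box")[3]

# Keep only small ground truth objects and small predictions
small_view = dataset.filter_labels("ground_truth", bbox_area < 0.05).filter_labels(
    "predictions", bbox_area < 0.05
)

results = small_view.evaluate_detections(
    "predictions", gt_field="ground_truth", compute_mAP=True
)
print("Small-object mAP:", results.mAP())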

While these gross dataset metrics can be useful in comparing the performance of multiple models, they rarely help in understanding how to improve the performance of a model.

Look at failure cases — Everything your model does is based on the data that it was trained on. So assuming that it is able to learn something, if it is performing more poorly than you would expect, you need to take a look at the data. It can be useful to look at cases where your model is doing well, but it is vital to look at false positives and false negatives, where your model predicted something incorrectly. After looking through enough of these samples, you will start to see patterns of failure in your model.

For example, the image below shows a sample from the Open Images dataset in which the back wheel is flagged as a false positive. This false positive turns out to be a missing annotation. Verifying all wheel annotations in the dataset and fixing other similar mistakes can help improve your model’s performance on wheels.

Image credit to Tyler Ganter (source)
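
To drill into failures like this programmatically, one option (sketched here with FiftyOne and assumed dataset and field names) is to run detection evaluation and then view only the false positive predictions.

import fiftyone as fo
from fiftyone import ViewField as F

dataset = fo.load_dataset("my_detection_dataset")  # hypothetical dataset name

# Mark each prediction as a true positive (tp) or false positive (fp);
# unmatched ground truth objects become false negatives (fn)
dataset.evaluate_detections("predictions", gt_field="ground_truth", eval_key="eval")

# View only false positive predictions, e.g. suspicious "wheel" detections
fp_view = dataset.filter_labels(
    "predictions", (F("eval") == "fp") & (F("label") == "wheel")
)
session = fo.launch_app(fp_view)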

Formulate solutions — Identifying failure cases is the first step in figuring out ways to improve your model’s performance. In most cases, it goes back to adding training data similar to the samples your model failed on, but it can also include things like changing pre- or post-processing steps in your pipeline or fixing annotations. No matter what the solution is, you can only fix the problems with your model by understanding where it fails.

Phase 4: Production

Deploying a model (Image by author)

Finally! You’ve got a model that performs well on your evaluation metrics with no major errors on various edge cases.

Now you’ll need to:

Monitor model — Test your deployment to ensure that your model is still performing as expected on test data with respect to your evaluation metrics and things like inference speed.

Evaluate new data — Using a model in production means you will frequently pass brand new data through the model that it has never been tested on. It’s important to perform evaluation and dig into specific samples to see how your model performs on any new data it encounters.

Continue understanding model — Some errors and biases in your model can be deep-seated and take a long time to uncover. You need to continuously test and probe your model for various edge cases and trends that could cause problems if they were to be discovered by clients instead.

Expand capabilities — Even if everything is working perfectly, it’s possible that the model isn’t increasing profits as much as you hoped. From adding new classes to developing new data streams to making the model more efficient, there are countless ways to expand the capabilities of your current model to make it even better. Any time you want to improve your system, you will need to restart the ML lifecycle to update your data, update your model, and evaluate it all to make sure your new features work as expected.

FiftyOne

The above is pretty general and unbiased, but I want to tell you a little bit more about the tool I’ve been working on.

Lots of tools exist for various portions of the ML lifecycle. However, there is a pretty glaring lack of tools that address some of the key points I’ve stressed in this post. Things like visualizing complex data (like images or video) and their labels, or writing queries to find specific cases where your model performs poorly, are generally done through manual scripting.

I have been working at Voxel51 developing FiftyOne, an open-source data visualization tool designed to help debug datasets and models and fill this void. FiftyOne lets you visualize your image and video datasets and model predictions in a GUI either locally or remotely. It also provides powerful capabilities to evaluate models and write advanced queries for any aspects of your dataset or model output.

FiftyOne can run in notebooks, so you can try it out in your browser with this Colab Notebook. Alternatively, you can easily install it with pip.

pip install fiftyone

Sample from object detection model and dataset in FiftyOne (Image by author)
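
Once it is installed, a quick way to see FiftyOne in action is to load the small quickstart dataset from the dataset zoo, which ships with images, ground truth labels, and model predictions, and launch the app.

import fiftyone as fo
import fiftyone.zoo as foz

# Small sample dataset with images, ground truth, and pre-computed predictions
dataset = foz.load_zoo_dataset("quickstart")

# Launch the app to browse samples, labels, and predictions
session = fo.launch_app(dataset)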

Summary

Only a fraction of the companies that try to incorporate machine learning (ML) into their business manage to actually deploy a model to production. The lifecycle of an ML model is not straightforward; it requires continuous iteration between data and annotation improvements, model and training pipeline construction, and sample-level evaluation. If you know what you’re in for, this cycle can eventually lead to a production-ready model, but it will also need to be maintained and updated over time. Luckily, there are countless tools developed to aid in every step of this process.
