Building Better ML Systems — Chapter 3: Modeling. Let the Fun Begin

About baselines, experiment tracking, proper test sets, and metrics. About making the algorithm work.

Olga Chernytska
Towards Data Science

--


Welcome back. I am glad to see you here again. I really appreciate your desire to become a better professional, do better work, and build better ML systems. You are cool, keep it up!

In this series, I do my best to help you master the art, science, and (sometimes) magic of designing and building Machine Learning systems. Here we talk about business value and requirements, data collection and labeling, model development, experiment tracking, online and offline evaluation, deployment, monitoring, retraining, and much much more.

This is the third chapter and it is devoted to model development. An ML algorithm is only a small part of the ML system. A perfectly accurate algorithm without a well-designed system won’t serve your customers and won’t earn money for your company. In this post, instead of giving you an overview of ML algorithms, I will show you a different perspective: how to select, develop, and evaluate algorithms keeping in mind that the algorithm’s primary goal is to bring value to the business. At the end of the day, it does not really matter whether you solved the business problem with linear regression or with the most advanced neural network.

An ML algorithm (black box in the middle) is only a small part of the ML system. Image source

Before we move on, let’s have a quick recap of what we’ve already learned.

The first chapter was about planning. We learned that every project must start with a plan because ML systems are too complex to implement in an ad-hoc manner. We reviewed the ML project lifecycle, discussed why and how to estimate project business value, how to collect the requirements, and then reevaluate with a cold mind whether ML is truly needed. We learned how to start small and fail fast using concepts like “PoC” and “MVP”. And finally, we talked about the importance of design documents during the planning stage.

The second chapter was about the data. We discussed a new trend in the industry — data-centric AI, an approach to building ML systems that considers clean data to be much more important than advanced ML algorithms. We touched on data pipelines, which are designed to organize the flow of chaotic and unstructured data so the data can be used for analytics. We learned that training data should be relevant, uniform, representative, and comprehensive, as models build their understanding of the world based on this data. We reviewed two types of labels — human and natural — and navigated through the complex, slow, and expensive process of obtaining human labels, and discussed best practices to make this process less painful. Lastly, we talked about an alternative to real data and human labeling: synthetic data.

If by some chance you’ve missed the previous posts, I encourage you to read them before we proceed. I’ll wait for you right here.

And now, let the fun begin.

How to select an ML algorithm

There is no algorithm that fits every problem. You need to try several approaches and learn your data and domain really-really well until you come up with something that works.

Think, brainstorm, talk to your colleagues, ask ChatGPT, and then write down three approaches you are going to try: 1) something very simple; 2) something very popular; 3) something new and creative.

  1. Something very simple. Every complexity introduced in the algorithm must be justified. Start with a simple approach (maybe even non-ML), evaluate it, and use it as a baseline to compare all your other models against.
  2. Something very popular. If you see, hear, and read that many-many people are solving the same business task with a specific algorithm — make sure you add it to your experiment list. Utilize collective intelligence! My expectations are always high about popular approaches and in most cases, they work pretty well.
  3. Something new and creative. Just give it a try. Your boss and company will be happy if you build a competitive advantage by beating typical popular approaches.

Word of caution: Do not reinvent the wheel. There are hundreds of open-source libraries and repositories that have implementations for most of the algorithms, data sampling strategies, or training loops you may think of. Do not write your custom K-means clustering — take the one from scikit-learn. Do not write ResNet50 from scratch — take the one from PyTorch. Before implementing a recent paper, check PapersWithCode; I bet someone has already done it.

Doing research and inventing something new is exciting. Implementing algorithms from scratch, where you understand every single line, is tempting. However, research fits well only within universities and Big Tech companies. For startups, every dollar matters, so they simply cannot afford to invest in something that has a low chance of succeeding (and research is literally about 100 trials and 1 success).

Be careful with “state-of-the-art”. Imagine you are doing object detection using YOLOv7 and then you hear that YOLOv8 is released, which is expected to be even better. Does it mean that you need to upgrade all your production pipelines to support YOLOv8? Not necessarily.

In most cases, this “better” means a 1–2% improvement on a static benchmark dataset, such as COCO. The model accuracy on your data may be better, insignificantly better, or even worse, simply because your data and your business problem are different. Also, from Chapter 2 of this series, you should remember: Improving data leads to a more significant increase in model accuracy than improving the algorithm. Come up with ways to clean the training data — and you’ll see a 5–10% accuracy increase.

How to develop an ML algorithm

First, go get a baseline. A baseline is a model that you are going to compete with. There are two logical choices for the baseline:

  1. An existing model from production (if you have one). We want to improve on the existing model, which is why we need to compare against it.
  2. A very simple model that is easy to deploy. If the business task can be solved in an easy way, why bother training complex models? Spend a couple of days searching for an easy solution and implementing it (a minimal sketch follows below).
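
If your data fits into scikit-learn, even a majority-class predictor can serve as that first baseline. Here is a minimal sketch of what “a very simple model” may look like, using toy data in place of a real project:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Toy data standing in for your real features and labels.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# "Very simple" baseline: always predict the most frequent class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
print("Baseline accuracy:", baseline.score(X_test, y_test))
```

Every fancier model you train later has to clearly beat this number to justify its complexity.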

And now comes experimentation. You construct all your experiments in order to improve upon the baseline. Found a promising algorithm? Great, evaluate and compare it to the baseline. Your model is better? Congratulations, now it is your new baseline; think of experiments to improve it even more.

Algorithm development is an iterative process. Image by Author

Algorithm development is an iterative process. You finish either when you find an algorithm that is good enough for production OR when you run out of time. Both scenarios are possible.

Naturally, most of the ideas you try will fail. So do not be upset about it, and do not take it personally. We all work like that: find a good idea, try it, see that the idea is actually bad, and come up with a new, hopefully, good idea this time, try it, see that it also does not work, find a new idea,…

My advice here: time-frame your efforts spent on a single idea. If you cannot make the idea work within N days (choose your N in advance), wrap it up and move to another one. If you really want to succeed, you need to go through many-many different ideas, because, as I said earlier, most of the ideas you try will fail.

Learn your data really-really well. Visualize samples and labels, plot feature distributions, make sure you understand feature meanings, explore samples from each class, understand the data collection strategy, read the data labeling instructions given to annotators, … Train yourself to predict what the model is expected to predict. If you want to create a good algorithm, start thinking like an algorithm (I am not joking here). All this will help you find problems with the data, debug the model, and come up with experiment ideas.

Split the data into training, validation, and test parts. Train on the training set, choose hyperparameters on the validation set, and evaluate on the test set. Make sure there is no overlap or data leakage among the splits. More on that is in the post: Train, Validation, Test Split for Machine Learning by Jacob Solawetz.
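
As a minimal sketch of such a three-way split, assuming scikit-learn and toy data in place of a real project:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy data standing in for your real features and labels.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# First put aside the test set, then split the rest into train and validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42, stratify=y_trainval
)
# Resulting proportions: 60% train / 20% validation / 20% test.
```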

The way to go: take an open-source model, run it with the default parameters, and do hyperparameter tuning. Use algorithms either from ML libraries, such as scikit-learn, PyTorch, and OpenCV, or from a GitHub repository that has a lot of stars, a good readme, and a license that allows commercial use. Train with the default hyperparameters on your data, and evaluate. The default hyperparameters of an algorithm are selected to maximize accuracy on a benchmark dataset (ImageNet, COCO), so in most cases they do not fit your data and your task well. Thoroughly learn what each hyperparameter means and how it affects training/inference, so you can do hyperparameter optimization. Typical approaches for hyperparameter optimization are Grad Student Descent, random/grid/Bayesian searches, and evolutionary algorithms. Never say that the algorithm does not work before you have done hyperparameter optimization. To learn more, check out this post by Pier Paolo Ippolito: Hyperparameters Optimization.
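
Here is a rough sketch of a random search, assuming a scikit-learn model and toy data; the search space and scoring metric are purely illustrative, not a recommendation:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Toy data standing in for your real features and labels.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Which hyperparameters to vary and over what ranges (illustrative only).
param_distributions = {
    "n_estimators": randint(100, 500),
    "max_depth": randint(3, 20),
    "min_samples_leaf": randint(1, 10),
}

# Random search with cross-validation; fixed seed so the search is reproducible.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=25,
    cv=5,
    scoring="f1",
    random_state=42,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```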

Work with your data even more: do feature engineering and data augmentations. Feature engineering refers to transforming existing features and creating new ones. Feature engineering is a crucial skill, so I am referring you to two great posts where you can acquire it (and see the small sketch after the links below):
- Fundamental Techniques of Feature Engineering for Machine Learning by Emre Rençberoğlu
- 4 Tips for Advanced Feature Engineering and Preprocessing by Maarten Grootendorst
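
As a taste of what feature engineering looks like in practice, here is a tiny pandas sketch on a hypothetical orders table; the column names and transforms are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical raw orders table.
orders = pd.DataFrame({
    "price": [20.0, 15.5, 99.9, 5.0],
    "quantity": [1, 3, 1, 10],
    "order_time": pd.to_datetime(
        ["2023-01-02 09:15", "2023-01-02 21:40",
         "2023-01-05 13:05", "2023-01-07 08:30"]
    ),
})

features = orders.copy()
features["total_amount"] = features["price"] * features["quantity"]  # combine two features
features["log_price"] = np.log1p(features["price"])                  # tame a skewed distribution
features["hour"] = features["order_time"].dt.hour                    # extract time components
features["is_weekend"] = features["order_time"].dt.dayofweek >= 5    # binary flag
print(features)
```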

Data augmentation is a technique to create new training samples from the data you have, so during training the model “sees” more samples. Increasing the training set is the easiest way to increase model accuracy, so you should use data augmentations whenever you can. For instance, in the Computer Vision domain, literally no one trains models without basic image augmentations — rotations, scaling, cropping, flips, etc. For more details check out my post: Complete Guide to Data Augmentation for Computer Vision.
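
For illustration, a basic image augmentation pipeline with torchvision might look like the sketch below; the exact transforms and parameters are just an example and depend on your data and task:

```python
from torchvision import transforms

# A basic augmentation pipeline for image classification.
train_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224),                     # random crop + rescale
    transforms.RandomHorizontalFlip(p=0.5),                # random horizontal flip
    transforms.RandomRotation(degrees=15),                 # small random rotation
    transforms.ColorJitter(brightness=0.2, contrast=0.2),  # mild color changes
    transforms.ToTensor(),
])
# Pass train_transforms to your Dataset, so every epoch the model "sees"
# slightly different versions of the same images.
```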

If you are curious about how augmentations are done for Natural Language Processing, read Data Augmentation in NLP: Best Practices From a Kaggle Master by Shahul ES.

Transfer Learning is your friend. Zero-shot learning is your best friend. Transfer Learning is a popular technique to boost model accuracy. Practically, it means that you take a model pre-trained on some dataset and continue training it on your data (“transferring knowledge”). Even weights pre-trained on the COCO or ImageNet datasets can improve your model, even though your data may look very different from COCO/ImageNet images.
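
A minimal transfer learning sketch with torchvision, assuming a recent torchvision version and a hypothetical 5-class task:

```python
import torch.nn as nn
from torchvision import models

# Start from ImageNet-pretrained weights and replace the classification head
# with one matching your task (a hypothetical 5-class problem here).
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, 5)

# Optionally freeze the backbone and fine-tune only the new head at first.
for name, param in model.named_parameters():
    if not name.startswith("fc"):
        param.requires_grad = False
```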

Zero-shot learning is when an algorithm works on your data without any training. How? Usually, it is a model pre-trained on a huge billion-sample dataset. Your data may look like something this model was already trained on, and the model has “seen” so many samples that it can generalize well to new data. Zero-shot learning may sound like a dream; however, there are some super-models out there: Segment Anything, most of the Word Embedding models, ChatGPT.
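
For a taste of zero-shot learning, here is a sketch using the Hugging Face transformers library (assuming it is installed; the model name and candidate labels are just an example):

```python
from transformers import pipeline

# A model pre-trained on a huge corpus classifies arbitrary labels
# without any training on our side.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

result = classifier(
    "The package arrived two weeks late and the box was damaged.",
    candidate_labels=["delivery problem", "product quality", "pricing", "praise"],
)
print(result["labels"][0])  # the highest-scoring label
```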

Model Development Checklist for your convenience. Image by Author

There is much more left to say about model development, but we need to wrap up to reserve some time for the Experiment Tracking and Evaluation topics. In case you still feel hungry for knowledge, check out this great post by Andrej Karpathy: A Recipe for Training Neural Networks.

Experiment Tracking

Experiment tracking is the process of saving information about an experiment to some dashboard or file, so you can review it in the future. It is like logging in software development. Links to the training and test datasets, hyperparameters, the git hash, and metrics on the test data are examples of what you might track.

You must track all the experiments you run. If for some reason your team does not do it, set up a team call right now to discuss the importance of that. You will thank me later :)

So, why do we want to do the experiment tracking?

  • To compare different experiments to each other. When you develop a model, you train and evaluate a lot of different algorithms, try different data preprocessing techniques, use different hyperparameters and come up with various creative tricks. At the end of the day, you want to see what you tried, what worked, and what gave the best accuracy. Maybe later you’ll want to come back to some experiment and review its results with a fresh mind. Model development may last for weeks or even months, so without proper experiment tracking you’ll simply forget what you did, and have to redo the experiments.
  • To reproduce the experiments. If you cannot reproduce it, it does not count. Check yourself: can you come back to your most successful experiment, rerun it, and get the same accuracy? If the answer is “NO”, it is probably because you don’t version control the code and the data, don’t save all the hyperparameters, or don’t set a random seed.
    The importance of setting a random seed is well explained in a post by Cecelia Shao: Properly Setting the Random Seed in ML Experiments. Not as Simple as You Might Imagine.
  • To debug the experiment. Sometimes an experiment does not work: the algorithm does not converge, predictions look strange, accuracy is close to random. It is literally impossible to understand why it failed if no information about the experiment is saved. A saved list of hyperparameters, visualizations of samples and augmentations, loss plots, etc. may give you a clue about where the problem lies.

Now that you are convinced that experiment tracking is important, let’s talk about how to do it in practice.

There are dozens of free and paid experiment tracking tools out there; choose something that fits your requirements and budget. Probably the most popular one is Weights&Biases; I’ve worked with it a lot and it’s nice. For a review of some other tools, check out 15 Best Tools for ML Experiment Tracking and Management by Patrycja Jenkner.

A Machine Learning experiment consists of data, code, and hyperparameters. Make sure you use version control tools for the code, such as GitHub or GitLab, and commit all your changes during development. It is important to be able to revert to older code versions to rerun your older experiments. Version control your data as well. The simplest and most popular way is to create a new folder or a new file on the disk (ideally on cloud storage, such as Amazon S3 or Google Cloud Storage) for each new version of the dataset. Some people use a tool called Data Version Control (DVC).

ML experiment consists of data, code, and hyperparameters. Image by Author

What exactly should you track for the experiment? Well, it is not a bad idea to track everything you can :) Most of the time you won’t use all that information unless an experiment failed and failed really hard.

Here is a list of the things you may want to consider tracking:

  • Git hash of the commit
  • Link to training, validation, and test datasets
  • Hyperparameters and their change over time (model architecture, learning rate, batch size, data augmentations,…)
  • Loss plots on training and validation sets
  • Metric plots on training and validation sets
  • Metrics on the test set
  • Visualization of training samples with labels (with and without augmentations applied)
  • Visualization of errors on the test set
  • Environment (OS, CUDA version, package versions, environment variables)
  • Training speed, memory usage, CPU/GPU utilization
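
As a minimal sketch of what tracking looks like in code, here is a Weights&Biases example (assuming you have the wandb package installed and an account; the project name, hyperparameters, and logged numbers are placeholders):

```python
import random
import wandb

# Hypothetical hyperparameters; in a real run they come from your config.
config = {"learning_rate": 3e-4, "batch_size": 32, "model": "resnet50"}

run = wandb.init(project="my-project", config=config)  # "my-project" is a placeholder
for epoch in range(10):
    # Stand-in numbers; replace with the real losses/metrics from your training loop.
    train_loss, val_loss = random.random(), random.random()
    wandb.log({"epoch": epoch, "train_loss": train_loss, "val_loss": val_loss})
run.finish()
```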

Set up experiment tracking once, and enjoy its benefits forever.

Model Evaluation

Before the model is deployed to production, it must be thoroughly evaluated. This evaluation is referred to as “offline”. “Online” evaluation, in contrast, is about checking the model that is already running in production. Online evaluation will be discussed in the next chapter of this series, and today we are focusing solely on offline evaluation.

To perform an offline evaluation we need a metric and a dataset.

The model is evaluated on the test dataset, the one that you’ve put aside while training and tuning hyperparameters. It is assumed that 1) the test set is large enough and extremely clean; 2) the model has never seen the test data; 3) the test data represents production data. If any of these assumptions is violated, the evaluation is performed incorrectly, and there is a high risk of getting an overly optimistic metric and deploying a bad model.

Evaluation on a small test set could give you a good metric simply by chance. Evaluation on dirty data won’t show the true model performance. While having errors in the training set is more forgivable (you can train on clean labels, dirty labels, or even no labels), having errors in the test set can be detrimental. Important note: a labeled test set is needed for unsupervised models as well. Otherwise, how would you know that your model is good enough?

Make sure your model hasn’t “seen” the test data. Always filter out duplicates, so the same sample won’t end up in both the training and test sets. Do not split the data randomly; use time-based or user-based splitting instead. Time-based splitting means putting older data into the training set and newer data into the test set. User-based splitting means keeping all data from the same user within the same split. And be very careful with data leakage; more details on that are in Data Leakage in Machine Learning: How it can be detected and minimize the risk by Prerna Singh.
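
Here is a small pandas sketch of both splitting strategies on a hypothetical events table; the cutoff date and the set of test users are arbitrary:

```python
import pandas as pd

# Hypothetical events table with a timestamp and a user id.
events = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 4],
    "timestamp": pd.to_datetime(["2023-01-05", "2023-02-10", "2023-03-01",
                                 "2023-03-15", "2023-04-02", "2023-04-20"]),
    "label": [0, 1, 0, 1, 1, 0],
})

# Time-based split: everything before the cutoff is training, the rest is test.
cutoff = pd.Timestamp("2023-03-01")
train_time = events[events["timestamp"] < cutoff]
test_time = events[events["timestamp"] >= cutoff]

# User-based split: all rows of a given user end up in exactly one split.
test_users = {3, 4}
train_user = events[~events["user_id"].isin(test_users)]
test_user = events[events["user_id"].isin(test_users)]
```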

A metric is a number that is assumed to correlate with the model’s true performance: the higher the number — the better the model is. You can choose one or a couple of metrics. For instance, typical metrics for a classification task are accuracy, precision, recall, and F1 score. Choose something simple and, ideally, explainable, so non-technical managers and clients can understand.
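
For a classification task, scikit-learn prints all of these metrics in one go; a minimal sketch with made-up labels and predictions:

```python
from sklearn.metrics import classification_report

# Made-up ground-truth labels and model predictions on the test set.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

# Precision, recall, F1 per class plus overall accuracy in one readable table.
print(classification_report(y_true, y_pred, digits=3))
```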

Below are great posts by Shervin Minaee about metrics for various tasks and domains:
- 20 Popular Machine Learning Metrics. Part 1: Classification & Regression Evaluation Metrics
- 20 Popular Machine Learning Metrics. Part 2: Ranking, & Statistical Metrics

Use slice-based metrics and evaluate your model on each data segment you can think of (unless you want to get into a scandal like “Zoom’s Virtual Background Feature Isn’t Built for Black Faces”). For instance, face detection systems must be evaluated separately for people of various races, genders, and ages. E-commerce models are worth evaluating separately for desktop vs mobile, various countries, and browsers. Double-check whether each segment is well represented in the test set. Slice-based metrics also help with class imbalance: seeing precision and recall for each class separately helps much more than an overall precision/recall.
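
A minimal sketch of slice-based evaluation with pandas and scikit-learn, on a made-up “device type” segment:

```python
import pandas as pd
from sklearn.metrics import recall_score

# Made-up test-set results with a segment column (device type here).
results = pd.DataFrame({
    "segment": ["mobile", "mobile", "desktop", "desktop", "mobile", "desktop"],
    "y_true":  [1, 0, 1, 1, 1, 0],
    "y_pred":  [1, 0, 0, 1, 0, 0],
})

# Compute the metric separately for each slice instead of one overall number.
for segment, group in results.groupby("segment"):
    recall = recall_score(group["y_true"], group["y_pred"], zero_division=0)
    print(f"{segment}: recall={recall:.2f}, n={len(group)}")
```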

One more way to avoid a scandal (this time it’s “Bank ABC’s new credit scoring system discriminates against unmarried women”) is to use behavioral tests. A great paper, Beyond Accuracy: Behavioral Testing of NLP Models with CheckList, suggests using Minimum Functionality, Invariance, and Directional Expectation tests in addition to numerical metrics. Even though the paper focuses on Natural Language Processing, these types of tests can easily be applied to tabular data and images.

In the example of “Bank ABC’s new credit scoring system discriminates against unmarried women,” an invariance behavioral test could help a lot. Keep all features the same but change marital status and gender, and check whether the model prediction changes. If you see a significant difference in the prediction (when it should be “invariant”), your model has probably absorbed bias from the training data; this needs to be fixed, for instance, by completely removing sensitive (discrimination-prone) features from the model inputs.
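
Here is a rough sketch of such an invariance test; the score() function below is a stand-in for your real model (e.g. a predict_proba call on a feature vector), and the tolerance is arbitrary:

```python
# The score() function is a placeholder standing in for a trained credit scoring model.
def score(applicant: dict) -> float:
    # Placeholder logic purely for illustration.
    return 0.7 if applicant["income"] > 50_000 else 0.3

applicant = {"income": 60_000, "age": 31,
             "marital_status": "married", "gender": "female"}

# Invariance test: perturb only the sensitive features and check the score does not move.
variants = [
    {**applicant, "marital_status": "single"},
    {**applicant, "gender": "male"},
]
base = score(applicant)
for variant in variants:
    assert abs(score(variant) - base) < 1e-6, f"Prediction changed for {variant}"
print("Invariance test passed")
```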

And finally, visualize the errors. Find samples in the test set for which the model made an error; visualize them and analyze why this happened. Is it because the test set is still dirty? Are there enough similar samples in the training set? Is there any pattern to the model’s errors? This analysis helps find possible labeling errors in the test set and bugs during training, as well as generate ideas on how to improve the model performance even further.

Model Evaluation Checklist for your convenience. Image by Author

Conclusion

In this chapter, we have learned how to develop models keeping in mind that the ML algorithm is only a PART of the ML system. Model development starts with creating a simple baseline model and continues with iterative improvements over it. We came up with the most efficient way to go: take an open-source model and build experiments around it, instead of reinventing the wheel or falling into the research rabbit hole. We discussed the pitfalls of “state-of-the-art” algorithms and the benefits of data augmentations and transfer learning. We agreed on the importance of experiment tracking and learned how to set it up. And finally, we talked about offline evaluation — metric selection, proper test sets, slice-based evaluation, and behavioral tests.

We are almost there, one more chapter left. In the next (last) post, you will learn about deployment, monitoring, online evaluation, and retraining — the final piece of knowledge that will help you build better Machine Learning systems.

The finale will be available soon. Subscribe to stay tuned.
