Strategic Management of Machine Learning Projects

12 key guidelines to keep in mind before you oversee your next machine learning project

Essam Wisam
Towards Data Science


In this story, we’ll set out on a journey through 12 guidelines that effectively summarize Andrew Ng’s course on machine learning strategy, “Structuring Machine Learning Projects”. The course covers a lot of material that often gets overlooked when studying the topic from other sources, yet is crucial for getting your machine learning systems to work as efficiently and quickly as possible. Here we map that material to a set of guidelines that will help you make the right decisions while managing machine learning projects, and that will answer in advance many questions that would otherwise cross your mind along the way.

Whether you decide to treat this story as a précis of what’s to be learnt from the course or just as lecture notes on it, it should work either way. Get your coffee, notebook, and pencils ready and let’s get to it.

Table of Contents

1. Seek Orthogonalized Controls
2. Set a Single Number Evaluation Metric
3. Modify Your Metrics if Needed
4. Get the Ball Rolling Quickly
5. Ditch the 70/30 and 60/20/20 Rules for Dataset Splitting
6. Assert that Your Dev & Test Sets Come from the Same Distribution Which Comes from the Real World
7. Manage Data Mismatch When Training Set has a Different Distribution
8. Take the Right Steps Towards Improving the Model’s Performance
9. Embrace Manual Error Analysis
10. Consider Transfer Learning First
11. Keep an Eye Out for Multi-task Learning
12. Decide Between Feature Engineering and End-to-end Learning

Photo by Freddy Castro on Unsplash

1. Seek Orthogonalized Controls

Imagine you’re driving a car that features, besides the standard steering wheel and gas/brake pedals, two knobs that also control the wheel’s angle and the car’s speed, such that every unit change in each knob corresponds to a change of “a” units in steering and “b” units in speed, for some “a” and “b” that are fixed for each of the two knobs. Wouldn’t this be a terrible feature to tack onto such a car? The answer is yes, even if algebraically every possible combination of steering and speed could be attained with the two knobs.

But what does this have to do with machine learning? In an ML project, instead of steering and speed, we want to control the model’s ability to fit each of the training set, dev set, and test set (in this order) under the cost function. The problem is that there are many ways to control “fitting the training set” (for instance), and some of these ways affect “fitting the dev set” as well. Such ways are what you want to avoid, or at least handle with care.

Early Stopping
A popular form of regularization (a preventive measure against overfitting) is to watch the model’s performance iteration by iteration on both the train and dev sets, and stop training whenever performance starts to recede on the dev set (even though it’s still improving on the training set). The problem with this control is that it acts both on the model’s ability to not overfit (do well on the dev set) and on its ability to properly minimize the cost function (do well on the training set). This is why it doesn’t represent an orthogonalized form of control, and why you might think twice before choosing it as your first choice for regularization.

Graph by Afshine Amidi and Shervine Amidi on Stanford.edu
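To make the coupling concrete, here’s a minimal early-stopping sketch; `train_one_epoch`, `evaluate`, and `get_weights` are hypothetical helpers standing in for your framework’s training and evaluation calls:

```python
# Minimal early-stopping sketch. train_one_epoch, evaluate, and get_weights
# are hypothetical stand-ins for your framework's training/evaluation calls.
best_dev_error = float("inf")
patience, bad_epochs = 5, 0

for epoch in range(100):
    train_one_epoch(model, train_data)        # pushes training error down
    dev_error = evaluate(model, dev_data)     # checks generalization

    if dev_error < best_dev_error:
        best_dev_error, bad_epochs = dev_error, 0
        best_weights = get_weights(model)     # snapshot the best model so far
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # one knob, two effects: halts optimization AND regularizes
```

Notice how the single `break` simultaneously decides how well the cost function gets minimized and how much the model regularizes, which is exactly the coupling this guideline warns about.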

Orthogonalized Forms of Control
Now let’s take a look at the stages your model will go through in a machine learning project, and at examples of what would typify an orthogonalized form of control (knob) for each.

Sometimes also you just need a significant change in the architecture or choice of hyperparameters — Diagram by author

2. Set a Single Number Evaluation Metric

One of the early steps in your project should be to set a single number that will act as the evaluation metric for your model. It helps you quickly figure out whether the idea you just tried is better or worse than the previous one. In simple cases, that number could simply be the model’s accuracy, but sometimes it might be something else or a combination of things. If you need a refresher, here is a list of various metrics that could end up making it into that single number.

The question is what to do when we have more than one metric that we want the model to do well on. There are a couple of options to consider:

1- See if another metric exists that appropriately combines both metrics. E.g., the F1 score combines both precision and recall. Otherwise, if one model has better precision than another but lower recall, and both matter to you, you may or may not make the right decision when choosing between them.

2- If they are the same metric but yield different values along some other dimension, then an average is a decent choice. For instance, if a cat app achieves a different accuracy in each country it’s used in, then you can let your single-number evaluation metric (SNEM) be their average. If accuracy matters more in some countries than in others, you could consider a weighted average instead.

3- If it’s not easy to mathematically combine the metrics (e.g., time and accuracy, or any two different quantities), then pick one of them as the optimizing metric and let each of the rest be a satisficing metric.
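As a quick sketch of options 1 and 2, here’s how the F1 score and a weighted-average SNEM might be computed (the per-country numbers and weights are invented for illustration):

```python
# Option 1: F1 combines precision and recall into one number.
def f1_score(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(f1_score(precision=0.90, recall=0.75))  # ~0.818

# Option 2: the same metric measured along another dimension (country),
# collapsed into a single number via a (weighted) average. Numbers invented.
accuracies = {"US": 0.95, "China": 0.92, "India": 0.89}
weights    = {"US": 0.50, "China": 0.30, "India": 0.20}  # how much each matters

snem = sum(weights[c] * accuracies[c] for c in accuracies)
print(snem)  # 0.929
```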

Optimizing & Satisficing Metrics
Suppose it’s important for your model to have high accuracy and low runtime. Because these are different quantities, just adding them in the same formula doesn’t make sense. For instance, choosing your SNEM to be accuracy(%) - 0.5*runtime(ms) is a bad idea, because it’s as if you’re saying that an increase in accuracy of 1% is equivalent to a reduction in runtime of 2 ms, so a model with 70% accuracy would be as good as a model with 100% accuracy that’s slower by 60 ms.

What you want to do instead is let your evaluation criterion look something like “maximum accuracy such that runtime is less than 100 ms”. This is equivalent to choosing accuracy as your optimizing metric and runtime as your satisficing metric. In general, given an irreducible set of metrics on which the model needs to do well, you choose one element of the set to be the optimizing metric and each of the rest to be a satisficing metric (described by some inequality or threshold).
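In code, that selection rule is just a filter followed by an argmax; here’s a minimal sketch with invented candidate models:

```python
# Sketch: runtime is the satisficing metric (< 100 ms), accuracy the
# optimizing one. The candidate models and their numbers are invented.
candidates = [
    {"name": "A", "accuracy": 0.92, "runtime_ms": 80},
    {"name": "B", "accuracy": 0.95, "runtime_ms": 150},  # violates the constraint
    {"name": "C", "accuracy": 0.90, "runtime_ms": 30},
]

feasible = [m for m in candidates if m["runtime_ms"] < 100]  # satisfice first
best = max(feasible, key=lambda m: m["accuracy"])            # then optimize
print(best["name"])  # "A": B is more accurate but too slow to qualify
```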

Photo by Alexander London on Unsplash

3. Modify Your Metrics if Needed

Suppose you’re training a model that will make it into an app that gathers images of cats from the internet and shows them to the user; in the process, it classifies each gathered image as cat or non-cat. For the sake of argument, suppose that after training you find out that one of the two models you trained misclassifies as cats a lot of images that should never make it to the user’s screen, yet seems to have a much higher accuracy. What should be done in this case? An option Andrew suggests in the lectures is to go back and modify the metric. For instance, if it’s the average error, consider changing it into a weighted average that more heavily penalizes misclassifying images of the unwanted type. This way you know which of the two models is doing better, through an evaluation metric that captures both having low error and very low tolerance for images of a specific type.
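As a sketch of what such a modified metric might look like (the weight of 10 and the toy arrays are invented, not from the lectures):

```python
import numpy as np

def weighted_error(y_true, y_pred, is_unwanted, w=10.0):
    """Average error where mistakes on 'unwanted' images count w times."""
    weights = np.where(is_unwanted, w, 1.0)
    mistakes = (y_true != y_pred).astype(float)
    return float((weights * mistakes).sum() / weights.sum())

y_true      = np.array([1, 0, 1, 0, 1])                     # 1 = cat, 0 = non-cat
y_pred      = np.array([1, 1, 1, 0, 0])
is_unwanted = np.array([False, True, False, False, False])  # must never be shown

print(weighted_error(y_true, y_pred, is_unwanted))  # ~0.79 vs a plain 0.40
```

The single unwanted mistake now dominates the metric, so a model that avoids it wins even against one with slightly higher plain accuracy.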

Recall as well that when we discussed orthogonalization, we mentioned that if your model performs well on the dev/test set but not in the real world, then a natural next step, depending on the case, is to change the dev/test set or the metric (e.g., a mobile app trained on cat images from the web whose real-world examples come from mobile phones and thus have far lower quality).

Keep in mind as well that defining the metric and doing well on it should be orthogonal to one another. Andrew uses the metaphor that the former corresponds to placing a target and the latter corresponds to figuring out how to shoot at it. You shouldn’t worry much about how you’ll shoot at it before you actually put it in place.

Photo by Ella Christenson on Unsplash

4. Get the Ball Rolling Quickly

Andrew’s advice here is to not spend a lot of time trying to find the perfect metric and/or dev set; setting up something quick and rough-and-ready to drive the team’s progress and give insight into what could be improved next is the right way to go. More formally, you should:

1- Set up dev/test set and metric

2- Build an initial system quickly

3- Study the model’s bias, variance, and its sources of error to prioritize next steps.

It’s often the case that even if you change something later down the line, adapting to the new setting given the experience gained so far will beat starting everything from scratch.

Use this advice wisely. It doesn’t mean that you have to start every project with a rough-and-ready metric and dev set; rather, it urges you not to run for too long with no metric or dev set at all. It also applies less strictly if you have significant experience in the application area or if there’s a significant body of literature to draw on for the problem at hand. In such cases, your chances of success after one or a few iterations are much higher.

Photo by sheri silver on Unsplash

5. Ditch the 70/30 and 60/20/20 Rules for Dataset Splitting

Your dev and test sets should be just as large as they need to be. If you have 1 million examples, then following the 60/20/20 rule and letting each of the dev and test sets include 200K examples might be a big waste of data that could rather have made it into the training set to better learn the function.

The answer to how big the dev and test sets should be lies in their purpose. The dev set’s purpose is to choose between models and set hyperparameters, so it should be big enough to allow this; the test set’s purpose is to evaluate the final performance of the model, so set it big enough to allow that as well. Given a dataset with beyond 1M examples, you might find that a 98:1:1 split satisfies both purposes. On the other hand, if the dataset is much smaller than that, then the 60:20:20 rule should work just fine; just stop applying it blindly.
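A 98:1:1 split is nothing more than shuffling indices and slicing; here’s a minimal sketch with a hypothetical 1M-example dataset:

```python
import numpy as np

n = 1_000_000                              # hypothetical dataset size
indices = np.random.permutation(n)         # shuffle before splitting

n_dev = n_test = n // 100                  # 1% each: 10,000 examples apiece
dev_idx   = indices[:n_dev]
test_idx  = indices[n_dev:n_dev + n_test]
train_idx = indices[n_dev + n_test:]       # the remaining 98%

print(len(train_idx), len(dev_idx), len(test_idx))  # 980000 10000 10000
```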

Note as well that if you don’t need to ship your model with a number that signifies its final performance, you might drop the test set altogether. In such scenarios the dev set is often called the “test set”.

Photo by Nika Benedictova on Unsplash

6. Assert that Your Dev & Test Sets Come from the Same Distribution Which Comes from the Real World

Wouldn’t it be a bummer to spend hours improving your model until it does well on the dev set, only to realize later that it does terribly on the test set, or that the data coming from the real world after deployment is very different from the data in the dev and test sets?

To avoid this, there are two scenarios to watch out for:

1- The data in the real world comes from a mixture distribution. E.g., a cat classifier that will be operated in different countries (and thus on different breeds of cats). In this case, you can randomly shuffle data from all such countries and then split into dev and test sets so that they, too, come from that mixture distribution.

2- You have a small fraction of data that tallies with the examples expected in the real world and another large fraction that isn’t as representative but might help. In this case, keep that large fraction away from the dev and test sets, because it doesn’t match the distribution of data expected from the real world, and split the small fraction between them (the large fraction should make it into the training set).

7. Manage Data Mismatch When Training Set has a Different Distribution

We argued earlier that the dev and test sets should have the same distribution and that it should match the examples expected from the real world. Ideally, the same should apply to the training set, but it’s much less severe if it doesn’t. In fact, Andrew mentions that deep learning models are generally robust to the training set’s distribution, but that a differing one might still cause what’s known as a data mismatch.

Suppose you know that your training set follows a different distribution than your dev and test sets. After measuring your train, dev, and test errors, you might end up with something like this:

Table by author

You might be quick to conclude that there is a variance problem, but the fact that the training set has a different distribution implies that some of the error might just be due to data mismatch: you’re training the model to do well under one distribution and then testing it on another.

What you should rather do to figure out whether or not this is the case is to also include a train-dev set that’s just as big as the dev set. This set is formed by randomly sampling data from the training set, and it gives you a way to tell whether the model is suffering from a data mismatch problem or not. For instance, if you observe

Table by author

then this is a variance problem (it’s now the difference between the train-dev and train error). Meanwhile, if it turns out that

Table by author

then it’s a data mismatch problem (the difference between the dev and train-dev errors is large) and you have to deal with it. One way, if feasible, is to simply collect more data from the real world for the training set so that it, too, matches the right distribution. Otherwise:

1- Perform manual error analysis to understand the differences between the train and dev sets. You might learn from this, for instance, that the images in the dev set are foggier.

2- Try to make the training set more similar to the dev set. One way to go about this is artificial data synthesis: you would programmatically (not manually) go over the images in the training data and add fog to them.

Note of caution
When doing artificial data synthesis, take care that you don’t make the model overfit to the way you’re synthesizing the new data. For instance, it would be a bad idea to combine each image with one of only 3 examples of fog to make the new training set. You should rather try to generate as many variations of fog as possible; otherwise, you end up with your synthesized images only covering a small subset of the space formed by the real-world data.
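Here’s a sketch of the randomized kind of synthesis this suggests; the alpha-blend fog model is a deliberately crude stand-in for a real augmentation:

```python
import numpy as np

def add_random_fog(image):
    """Blend a fog layer with randomized strength into an HxWx3 image in [0, 1]."""
    strength = np.random.uniform(0.2, 0.8)   # vary how thick the fog is
    tint = np.random.uniform(0.7, 1.0)       # vary how bright the fog is
    fog = np.full_like(image, tint)
    # Randomizing per image keeps the synthesized set from collapsing
    # onto a handful of fog "templates" the model could overfit to.
    return (1 - strength) * image + strength * fog

image = np.random.rand(64, 64, 3)            # stand-in for a training image
foggy = add_random_fog(image)
```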

Photo by Markus Spiske on Unsplash

8. Take the Right Steps Towards Improving the Model’s Performance

Let’s start by laying down two important definitions:

1- Bayes Optimal Error
This signifies the theoretically optimal performance: there is no mapping from X to Y that would yield an error below this figure. It isn’t zero, because it’s common to have some inputs that can’t be predicted with certainty by any mapping from X to Y, whether given by a model or a human. If this is, for instance, a cat classifier, then you can think of cat images so dim or so blurry that there is really no way to guarantee the image is a cat. Because such inputs represent a small percentage of all possible ones, the Bayes optimal error will often be quite small, but most likely not zero. Most of the time, there is also no way to arrive at its exact value.

2- Human-level Error
This is considered a proxy or an approximation of the Bayes optimal error. It will never surpass it, but a model operating at human-level performance is probably very close to performing at Bayes optimal performance. Suppose we could find the best domain experts out there relative to our machine learning task; if their error on the data is a%, we call this the human-level error.

Some data scientists like to think of human-level error as the average human’s error on the data; that is, you can estimate it just by running the data through some average humans. In this case, human-level error is no longer a proxy or approximation of the Bayes optimal error, which is why Andrew prefers the first definition.

Furthermore, for some computer vision applications, even average human-level performance seems to sufficiently approximate Bayes optimal performance.

Now note that the model’s performance:

1- Will start to improve more slowly once it surpasses human level; at this point, humans may no longer be able to help the model as effectively, for instance by providing labeled data (the model now does better in this respect) or by performing error analysis to gain insight into the problem.

2- Should be sufficiently lower than Bayes optimal performance before you conclude a bias problem. Andrew calls the difference between them “avoidable bias”, and oftentimes you’ll only be able to estimate it using its approximation, the human-level error.

3- May surpass human-level performance when there’s a lot of data and/or it’s not a natural perception problem (e.g., product recommendations, online ads, loan approvals).

4- Indicates a variance problem if the difference between the training and dev set errors is large enough; the variance problem should be prioritized over the bias problem if it’s sufficiently more extreme, and vice versa.
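Putting the last two guidelines’ arithmetic together, here’s a sketch of the gaps you’d compute (all error values invented; human-level error stands in for Bayes optimal error):

```python
# All numbers invented for illustration.
human_error     = 0.01   # proxy for Bayes optimal error
train_error     = 0.07
train_dev_error = 0.08   # same distribution as training, held out of training
dev_error       = 0.15

gaps = {
    "avoidable bias": train_error - human_error,        # 0.06
    "variance":       train_dev_error - train_error,    # 0.01
    "data mismatch":  dev_error - train_dev_error,      # 0.07
}
print(max(gaps, key=gaps.get))  # "data mismatch" -> tackle that gap first
```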

At this point we can also recap the different types of problems our model might face.

Diagram by author

We have discussed earlier how to deal with each.

Photo by Oli Dale on Unsplash

9. Embrace Manual Error Analysis

Misclassifications in the dev set by your model might be mostly due to a specific type of input. Going over the dev set manually to find out whether or not that’s the case can be worth your time.

For instance, if you’re working on a cat classifier you might randomly sample a significant part of the dev set before preparing a table like the following:

Table by author

which would give you a sense of the best option to pursue next for improving the error on the dev set.

For instance, in this example, you might consider designating a team to work on the blurry images and another team to work on dealing with great cats. It’s also apparent that some examples are incorrectly labeled, but correcting these would only further reduce the error by 6% of its value. In general, deep learning models are quite robust to random labeling errors in the training set. As long as it’s not the case that a specific type of cat image is incorrectly labeled every time (a systematic error), you shouldn’t expect the incorrect labels to contribute much to the error. Not to mention that some of the examples the model thinks it correctly classified will also be mislabeled, and you’d be introducing bias if you decided to only correct the ones that are misclassified in the dev set.
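The bookkeeping behind such a table is trivial; here’s a sketch with made-up inspection notes, where each percentage is a ceiling on how much fixing that category can help:

```python
from collections import Counter

# One (made-up) category label per manually inspected misclassified image.
inspected = ["blurry", "great cat", "blurry", "mislabeled", "dog",
             "blurry", "great cat", "blurry", "dog", "blurry"]

tally, total = Counter(inspected), len(inspected)
for category, count in tally.most_common():
    print(f"{category}: {count}/{total} ({100 * count / total:.0f}% of errors)")
# blurry: 5/10 (50% of errors) ... so blurry images are the place to start
```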

Note of caution
Any process you apply to the dev set to improve the error should also be applied to the test set so that the two keep having the same distribution.

Photo by Lexi Anderson on Unsplash

10. Consider Transfer Learning First

Assume you want to train model A to do task A and that there exists a model B that does task B, where:

1- Model B takes the same type of input (audio, image, ...)

2- Model B has been trained before and works well for task B

3- Deep learning frameworks allow you to import model B with its pre-trained weights, or you can do that manually

then a good start for model A is model B itself, after modifying the output layer so that the output is that of task A. Dropping the last layer means that the new output layer will be associated with a new weight matrix. Keep all the layers except that one frozen (no backprop through them) and retrain the model so that the new layer converges to its new weights. This will tailor the whole model to your task A.

Now you might be thinking:

1- Do we have to freeze all the layers except the top/last one that was replaced?
No. In general, you keep only the first L-T layers frozen, where L is the total number of layers and T ≥ 1 depends on the amount of data you have. If you have a lot of data, then you can set T = L, which corresponds to retraining the whole network (starting with the pre-trained weights as the initialization); this is known as “fine-tuning”. On the other hand, if you have a small amount of data, you might keep T = 1 (only the output layer) or set it to a larger value if you think the amount of data at hand isn’t that small.

2- Does this always work?
The reason it works more often than not is that early layers tend to learn low-level features (such as edges in images), and because the input is of the same type, model A would probably need to learn such features as well. With transfer learning, the model simply makes use of the fact that these have been learnt already and focuses on learning the high-level features, which are bound to the later layers that are usually left unfrozen.
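As a concrete sketch, here’s what this looks like in PyTorch with a torchvision ResNet-18 standing in for “model B” (the 2-class head is an assumption for a cat/non-cat task A):

```python
import torch.nn as nn
import torchvision.models as models

# "Model B": a network pre-trained on ImageNet (same input type: images).
model = models.resnet18(weights="IMAGENET1K_V1")

# Freeze everything (no backprop through the pre-trained layers)...
for param in model.parameters():
    param.requires_grad = False

# ...then swap the output layer for one that matches task A (here, 2 classes).
# Its weights are fresh, so it is the one layer that gets trained.
model.fc = nn.Linear(model.fc.in_features, 2)

# With more data, unfreeze more of the top (T > 1), or everything (fine-tuning):
# for param in model.layer4.parameters():
#     param.requires_grad = True
```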

Photo by Felipe Dias on Unsplash

11. Keep an Eye Out for Multi-task Learning

Assume you want to train models A, B, C to do tasks A, B, C and that

1- The three models take the same input

2- The amount of data you have for each is quite similar

3- The three tasks could benefit from having shared low-level features (i.e., they are somewhat related)

then it can save time and resources to instead build a model X that’s bigger than any of them and that has an output node for each task, and to perceive the problem as (generally) multi-label classification. Such a model would be trained on the combined data of A, B, and C.
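A sketch of such a “model X” in PyTorch: a shared trunk plus one output node per task, trained with a multi-label loss (all sizes invented):

```python
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    """Shared trunk learning common low-level features, one logit per task."""
    def __init__(self, in_features=128, n_tasks=3):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(in_features, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
        )
        self.heads = nn.Linear(32, n_tasks)   # tasks A, B, C share the trunk

    def forward(self, x):
        return self.heads(self.trunk(x))

model = MultiTaskNet()
x = torch.randn(8, 128)                        # batch of same-type inputs
y = torch.randint(0, 2, (8, 3)).float()        # one 0/1 label per task
loss = nn.BCEWithLogitsLoss()(model(x), y)     # multi-label objective
```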

Photo by Jason Dent on Unsplash

12. Decide Between Feature Engineering and End-to-end Learning

Before CNNs were around, tasks like image classification required a lot of image processing to extract features that summarize the image effectively, so that when such features were fed into a feed-forward neural network (FFNN) the model would be able to learn how to classify between the classes.

But now that we have CNNs, they do the whole job of feature extraction and classification end-to-end. The same applies to other applications such as NLP and speech recognition, and this has been the trend in deep learning for the last few years.

The problem with end-to-end learning is that it requires a lot of data (in order to gain sufficient insight from it), but in return it doesn’t necessitate domain knowledge or hand-engineering of features. So in summary:

Table by author

There’s also a middle ground. You can sometimes break an end-to-end model in two and introduce a hand-designed component in the middle that extracts some features or does some processing to make the whole system much better. For instance, you might find that a system with a hand-designed component that crops to a person’s face (whenever a person is found in an image) before the recognition model runs makes a better face recognition system than one that’s completely end-to-end.

Hooray! You’ve made it to the end of the story. To wrap everything up, here is a diagram that combines all the topics we went over in this story and that you will probably need to think about next time you work on a machine learning project. Take a few seconds to confirm that you now know more about all of these!

Diagram by author

This brings our story to an end. If you enjoyed it and would like to see more stories like this then don’t forget to give it a few claps :). Till next time, au revoir.
