Since I was in high school, I’ve had this weird obsession with squeezing the key concepts of everything I learn onto one page. Looking back, that was probably my lazy mind’s way of getting away with the least amount of work required to pass an exam…but interestingly, that abstraction effort also helped a lot in learning those concepts at a deeper level and remembering them longer. Nowadays when I teach Machine Learning, I try to teach it in two parallel tracks: a) main concepts and b) methods and theoretical details, and make sure my students can look at each new method through the lens of the same concepts. Recently I got a chance to read "Machine Learning Yearning" by Andrew Ng, which seemed to be his version of abstracting some of the practical ML concepts without getting into any formulas or implementation details. While these tips can seem simple and obvious, as an ML engineer I can attest that losing sight of them is among the most common causes for an ML project to fail in production, and being mindful of them is what distinguishes good data science work from mediocre work. Here I wanted to summarize my takeaways from Andrew’s book on one page, in 5 important tips, so without further ado let’s get to it:
1- How you split your available data matters…a lot!

Even if you are not a data scientist, you probably already know that to measure the generalization power of your algorithm, you should split your available data into train, dev (validation) and test sets, and you are not supposed to use the test set for any model optimization/tuning. As obvious as it sounds, I have seen a lot of practitioners who used their test set to manually optimize the hyper-parameters and then reported the best results on that same set as the generalization performance…you guessed it right…that’s cheating! And they still get surprised when the model performance drops after deployment. Make sure not to use the test set for any kind of optimization, including manual hyper-parameter tuning; that’s what your dev/validation set is for.
Also be aware of any data leakage from your train set to your dev or test set. A silly cause of data leakage can be the existence of duplicate instances in your data, which can end up in both the train and test sets. If you are doing joins to compile your final dataset, be particularly wary of this.
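As a minimal sketch (the toy DataFrame and the `id` column are made up for illustration), a deduplication pass plus a quick overlap check can catch this kind of leakage before it happens:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy dataset with an intentional duplicate row (hypothetical data).
df = pd.DataFrame({
    "id":      [1, 2, 3, 3, 4, 5],
    "feature": [0.2, 0.5, 0.9, 0.9, 0.1, 0.7],
    "label":   [0, 1, 1, 1, 0, 1],
})

# Drop exact duplicates *before* splitting so the same instance
# cannot end up in both the train and test sets.
df = df.drop_duplicates()

train_df, test_df = train_test_split(df, test_size=0.3, random_state=42)

# Sanity check: no instance id should appear in more than one split.
leaked = set(train_df["id"]) & set(test_df["id"])
assert not leaked, f"{len(leaked)} instances leaked between train and test"
```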
In classical ML, when datasets were small, we often used a 60%-20%-20% train-dev-test split, either in a hold-out or cross-validation setting. Andrew Ng argues that if the dataset is large, you can use much smaller fractions for dev and test. However, your dev set should be big enough to detect meaningful changes in the performance of your model; for example, with 100 examples in your dev set each example accounts for 1% of the error, so you can only detect changes of 1% or larger.
Your test set should have a similar distribution to the actual population your model will encounter at deployment time, and should be big enough to give you confidence in your estimate of the model’s performance.
Last but not least, make sure to take advantage of stratified sampling, which can be done easily using built-in functions in scikit-learn or other libraries. Assuming that your available data is an unbiased sample of the population, stratified sampling makes sure that this unbiasedness is preserved across your train/dev/test splits.
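For example, a stratified 60/20/20 split can be done with two calls to scikit-learn’s `train_test_split`; this is just a sketch, assuming a classification dataset held in arrays `X` and `y`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced classification data (hypothetical).
rng = np.random.default_rng(0)
X = rng.random((1000, 5))
y = (rng.random(1000) < 0.2).astype(int)  # ~20% positive class

# First carve out the 20% test set, preserving the class proportions...
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# ...then split the remaining 80% into 60% train / 20% dev, again stratified.
X_train, X_dev, y_train, y_dev = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42)

print(y_train.mean(), y_dev.mean(), y_test.mean())  # class balance is ~equal across splits
```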
So to recap, for splitting your data: a) make sure your data splits are fully isolated, b) make sure your test (and dev) sets have a similar distribution to your target population, and c) use the train set to optimize model parameters, the dev set to optimize hyper-parameters, and the test set to measure final model performance.
2- Estimate each of the Bias, Variance and Data Mismatch error components in order to debug your model effectively
Generalization Error = Bias + Variance + Data Mismatch
You have probably already heard of the first two components: Bias error is the lack of a classifier’s ability to learn the underlying function that generated the data; for example, a linear regression model would have a high Bias error when trying to learn a relationship that was generated by a polynomial function. Variance error, analogous to sampling error in statistics, is caused by a limited sample size, and happens when the classifier learns a specific training sample too well and as a result loses its generalizability to unseen data. There are precise statistical definitions for each of these errors, but roughly speaking, Bias error is the error you get on your training set and Variance error is the increase in error from your training set to your dev (validation) set.

The 3rd error, the data mismatch error, comes from the fact that the distribution of the test/inference data is different from that of the training and validation data. For instance, suppose you train a cat classifier using data collected from the internet and deploy it in a mobile app; the difference in classification performance between the internet cat images you used to validate your model prior to deployment and the actual mobile app images seen in deployment is the data mismatch error, i.e. the increase in error from your dev set to your inference/test data. Data mismatch also shows up under different names in different contexts, for example covariate shift or model drift; while there are slight differences in how each of those terms is defined, the general idea is the same, and if your test set and target task have a similar distribution, your test set error can give you an accurate estimate of the data mismatch error before the actual model deployment.
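To make this decomposition concrete, here is a minimal sketch with made-up error numbers, reading the three components off the train/dev/test errors as roughly defined above:

```python
# Hypothetical error rates measured on each split (as fractions, not %).
train_error = 0.02   # error on the training set
dev_error   = 0.06   # error on the dev (validation) set, same distribution as train
test_error  = 0.12   # error on data drawn from the deployment distribution

bias          = train_error               # inability to fit even the training data
variance      = dev_error - train_error   # performance drop from train to dev
data_mismatch = test_error - dev_error    # drop from dev to the deployment distribution

print(f"Bias: {bias:.1%}, Variance: {variance:.1%}, Data mismatch: {data_mismatch:.1%}")
# -> Bias: 2.0%, Variance: 4.0%, Data mismatch: 6.0%
```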
3- Narrow down your debugging options based on the source of your error

If your training data (bias) error is high, you can try increasing your model size; for neural nets that translates to adding layers or neurons. If the difference between the dev set error and the train set error is high (variance error), you can try adding more data to the train set. To do this effectively, use error analysis to understand which areas of the feature space cause the most failures and target your data collection accordingly. While collecting more data is the best strategy to reduce variance error, since it does not impact Bias error, it is not always possible. The next best way to reduce variance error is regularization (e.g. adding dropout or early stopping in neural nets). Similarly, dimensionality reduction or feature selection can lead to a smaller model, which in turn reduces the variance error. Keep in mind that both regularization and dimensionality reduction can also lead to an increase in Bias error.
Adding more relevant variables or features based on insights from error analysis can improve both the Bias and Variance errors. The same goes for improving the architecture of the model.
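For instance, if the diagnosis points to variance error and collecting more data is not an option, a hedged sketch of the regularization route in Keras might look like the following (the architecture, dropout rates and patience are illustrative assumptions, not recommendations from the book):

```python
import tensorflow as tf

# A small feed-forward classifier with dropout as a regularizer.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),   # randomly drops units during training to reduce variance
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Early stopping halts training once the dev-set loss stops improving.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)

# X_train, y_train, X_dev, y_dev are assumed to come from the splits discussed in tip 1:
# model.fit(X_train, y_train, validation_data=(X_dev, y_dev),
#           epochs=100, callbacks=[early_stop])
```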
4- The key factor in deciding whether to insert hand-engineered features/components into your ML system is the amount of available data

Having more hand-engineered features or components generally allows an ML system to learn from less data: the domain knowledge supplements the knowledge our algorithm acquires from the data. When we don’t have much data, this hand-engineered knowledge becomes more essential. On the other hand, if there is no data shortage, an end-to-end design can be applied, which bypasses the need for manual feature engineering.
5- Whether to use supplementary/augmented data in your training depends on your model capacity

Supplementary data is any data consistent with your target task but not drawn from exactly the same distribution. Suppose for the cat classifier mobile app you have 10k user-uploaded images and 20k images downloaded from the internet. Here the 20k internet images would be your supplementary data, and if your model is small, including them might use up the capacity of your model and limit how well it learns from the user-uploaded images. However, if the model is big enough, it can learn both distributions and take advantage of the commonalities between the two sets to solidify the learning and generalize better (this is the main idea behind meta learning). While this was not mentioned in Andrew Ng’s book, I assume the same reasoning holds for including data produced by augmentation techniques that generate similar text, image or tabular data (e.g. adversarial data augmentation). Including that augmented data can’t do miracles if your classifier’s capacity is small, but it can act as a regularizer when you are using a high-capacity deep learning model.
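As an illustration of the augmentation point (not from the book; the layer choices and image size are my own assumptions), Keras preprocessing layers can generate label-preserving variants of the images on the fly, which mainly acts as a regularizer when the model has enough capacity:

```python
import tensorflow as tf

# Augmentation layers produce label-preserving variants of the inputs;
# they are only active during training, not at inference time.
augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),
    tf.keras.layers.RandomZoom(0.1),
])

# A small convolutional classifier (e.g. cat vs. not-cat) with augmentation built in.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(128, 128, 3)),
    augmentation,
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```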
References
Machine Learning Yearning by Andrew Ng (http://www.mlyearning.org/)