Overfitting and Underfitting Principles

Understand basic principles of underfitting and overfitting and why you should use particular techniques to deal with them

Dmytro Nikolaiev (Dimid)
Towards Data Science


Underfitting and overfitting principles. Image by Author

A lot of articles have been written about overfitting, but almost all of them are simply a list of tools: “How to handle overfitting — top 10 tools” or “Best techniques to prevent overfitting”. It’s like being handed nails without being shown how to hammer them. This can be very confusing for people who are trying to figure out how overfitting works. Also, these articles often do not consider underfitting at all, as if it does not exist.

In this article, I would like to lay out the basic principles (principles, not just tools) for improving the quality of your model and, accordingly, preventing underfitting and overfitting, using a concrete example. This is a very general issue that applies to all algorithms and models, so it is difficult to cover it completely. But I want to try to give you an understanding of why underfitting and overfitting occur and why one particular technique or another should be used.

This article explains the basics of underfitting and overfitting in the context of classical machine learning. However, for large neural networks, and especially for very large ones, these rules apply only partially.

If you are interested, please explore the idea of double descent with the MLU Explain posts after checking out this article!

Underfitting and Overfitting and Bias/Variance Trade-off

Although I’m not describing all the concepts you need to know here (for example, quality metrics or cross-validation), I think it’s important to explain to you (or just remind you) what underfitting and overfitting are.

To figure this out, let’s create a dataset, split it into train and test sets, and then train three models on it — a simple, a good, and a complex one (I will not use a validation set in this example to keep it simple, but I will talk about it later). All the code is available in this GitHub repo.
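The exact code lives in the repo above; the snippet below is only a minimal sketch of what such a setup might look like, assuming noisy cubic data and polynomial models of degree 1, 3, and 13 (all names and parameter values here are illustrative, not the author's original code).

```python
# Minimal sketch: noisy cubic data, a train/test split, and three models
# of increasing complexity (simple, good, complex).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(42)
X = np.sort(rng.uniform(-3, 3, size=(20, 1)), axis=0)
y = 0.5 * X[:, 0] ** 3 - X[:, 0] + rng.normal(scale=2.0, size=20)  # cubic + noise

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

for degree in (1, 3, 13):  # simple, good, complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(
        f"degree={degree:2d}  "
        f"train MSE={mean_squared_error(y_train, model.predict(X_train)):.2f}  "
        f"test MSE={mean_squared_error(y_test, model.predict(X_test)):.2f}"
    )
```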

Generated dataset. Image by Author

Underfitting is a situation when your model is too simple for your data. More formally, your hypothesis about the data distribution is wrong and too simple — for example, your data is quadratic and your model is linear. This situation is also called high bias. This means that your algorithm makes stable, consistent predictions, but the initial assumption about the data is incorrect, so those predictions systematically miss the target.

Underfitting. The linear model trained on cubic data. Image by Author

Conversely, overfitting is a situation when your model is too complex for your data. More formally, your hypothesis about the data distribution is wrong and too complex — for example, your data is linear and your model is a high-degree polynomial. This situation is also called high variance. This means that your algorithm cannot make reliable predictions — changing the input data only a little changes the model output a lot.

Overfitting. The 13-degree polynomial model trained on cubic data. Image by Author

These are two extremes of the same problem and the optimal solution always lies somewhere in the middle.

Good model. The cubic model trained on cubic data. Image by Author

I will not talk much about bias/variance trade-off (you can read the basics in this article), but let me briefly mention possible options:

  • low bias, low variance — a good result, just right.
  • low bias, high variance — overfitting — the algorithm outputs very different predictions for similar data.
  • high bias, low variance — underfitting — the algorithm outputs similar predictions for similar data, but the predictions are wrong (the algorithm “misses”).
  • high bias, high variance — a very bad algorithm. You will most likely never see this.

Bias and Variance options on four plots. Image by Author

All these cases can be placed on the same plot. It is a bit less clear than the previous one but more compact.

Bias and Variance options on one plot. Image by Author

How to Detect Underfitting and Overfitting

Before we move on to the tools, let’s understand how to “diagnose” underfitting and overfitting.

Train/test error and underfitting/overfitting. Image by Author

Underfitting means that your model makes consistent but systematically incorrect predictions. In this case, train error is large and val/test error is large too.

Overfitting means that your model fits the training data too closely and fails to generalize. In this case, train error is very small and val/test error is large.

When you find a good model, train error is small (but larger than in the case of overfitting), and val/test error is small too.

In the case above, the test error and validation error are approximately the same. This happens when everything is fine, and your train, validation, and test data have the same distributions. If validation and test error are very different, then you need to get more data similar to test data and make sure that you split the data correctly.

How to detect underfitting and overfitting. Image by Author
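As a crude illustration only (the thresholds below are invented for this example and are not universal rules), this diagnostic logic can be written down as a tiny helper:

```python
# A toy "diagnosis" heuristic -- thresholds are purely illustrative.
def diagnose(train_error, val_error, target=1.0, tolerance=1.5):
    if train_error > target:
        # both errors are large
        return "underfitting: complicate the model"
    if val_error > tolerance * train_error:
        # small train error, much larger val error
        return "overfitting: simplify the model (or get more data)"
    return "looks fine: both errors are small and close to each other"

print(diagnose(train_error=5.0, val_error=5.5))  # underfitting
print(diagnose(train_error=0.1, val_error=4.0))  # overfitting
print(diagnose(train_error=0.6, val_error=0.8))  # good model
```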

Tools and Techniques

Now let’s look at techniques to prevent underfitting and overfitting, considering exactly why we should use them.

General Intuition You Should Remember

As we remember:

  • underfitting occurs when your model is too simple for your data.
  • overfitting occurs when your model is too complex for your data.

Based on this, the simple intuition you should keep in mind is:

  • to fix underfitting, you should complicate the model.
  • to fix overfitting, you should simplify the model.

In fact, everything listed below is just a consequence of this simple rule. I will try to show why certain actions complicate or simplify the model.

Simpler / More Complex Model

The easiest approach that follows from the intuition above is to try a simpler or a more complex algorithm (model).

To complicate the model, you need to add more parameters (degrees of freedom). Sometimes this means directly trying a more powerful model — one that is a priori capable of recovering more complex dependencies (an SVM with non-linear kernels instead of logistic regression). If the algorithm is already quite complex (a neural network or some ensemble model), you need to add more parameters to it, for example, by increasing the number of models in boosting. In the context of neural networks, this means adding more layers / more neurons in each layer / more connections between layers / more filters for CNNs, and so on.

To simplify the model, you conversely need to reduce the number of parameters. Either change the algorithm completely (try a random forest instead of a deep neural network), or reduce the number of degrees of freedom: fewer layers, fewer neurons, and so on.
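As a rough sketch (the dataset, the models, and the parameter values below are all arbitrary, chosen just to illustrate the idea), “complicating” a model can take either of the two forms mentioned above:

```python
# Two ways to add capacity: 1) a more powerful algorithm, 2) more degrees
# of freedom within the same family (more trees in boosting).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 1) A more powerful model: kernel SVM instead of logistic regression.
for model in (LogisticRegression(max_iter=1000), SVC(kernel="rbf")):
    model.fit(X_tr, y_tr)
    print(type(model).__name__, round(model.score(X_te, y_te), 3))

# 2) More parameters in the same family: more boosting stages.
for n in (10, 100, 500):
    gbm = GradientBoostingClassifier(n_estimators=n, random_state=0).fit(X_tr, y_tr)
    print(f"boosting with {n:3d} trees: test accuracy = {gbm.score(X_te, y_te):.3f}")
```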

More Regularization / Less Regularization

This point is very closely related to the previous one. In fact, regularization is an indirect and forced simplification of the model. The regularization term requires the model to keep its parameter values as small as possible, and therefore forces the model to be as simple as possible. Complex models with strong regularization often perform better than models that are simple to begin with, so this is a very powerful tool.

Good model and complex model with regularization. Image by Author

More regularization (simplifying the model) means increasing the impact of the regularization term. This process is highly algorithm-specific — the regularization parameters differ from algorithm to algorithm (for example, to reduce regularization, the alpha for Ridge regression should be decreased, while C for SVM should be increased). So you should study the parameters of the algorithm and pay attention to whether they should be increased or decreased in a particular situation. There are a lot of such parameters — L1/L2 coefficients for linear regression, C and gamma for SVM, maximum tree depth for decision trees, and so on. In the context of neural networks, the main regularization methods are:

  • Early stopping,
  • Dropout,
  • L1 and L2 Regularization.

You can read about them in this article.
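For a classical model, here is a minimal sketch of turning this regularization “knob” for Ridge regression (the synthetic data and alpha values are arbitrary); remember that for Ridge a larger alpha means more regularization, while for an SVM a larger C means less:

```python
# Sweeping the regularization strength of Ridge regression.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=20, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for alpha in (0.01, 1.0, 100.0):  # larger alpha -> stronger regularization
    model = Ridge(alpha=alpha).fit(X_tr, y_tr)
    print(f"alpha={alpha:6.2f}  train R2={model.score(X_tr, y_tr):.3f}  "
          f"test R2={model.score(X_te, y_te):.3f}")
```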

Conversely, when the model needs to be complicated, you should reduce the influence of the regularization terms, or abandon regularization altogether and see what happens.

More Features / Fewer Features

This may not be so obvious, but adding new features also complicates the model. Think about it in the context of a polynomial regression — adding quadratic features to a dataset allows a linear model to recover quadratic data.

Adding new “natural” features (if you can call them that) — that is, collecting genuinely new measurements for existing data — is done infrequently, mainly because it is expensive and time-consuming. But keep in mind that sometimes it can help.

But artificially deriving additional features from existing ones (so-called feature engineering) is used quite often for classical machine learning models. There are as many examples of such transformations as you can imagine, but here are the main ones (a short sketch follows the list):

  • polynomial features — from x₁, x₂ to x₁, x₂, x₁x₂, x₁², x₂², … (the sklearn.preprocessing.PolynomialFeatures class)
  • log(x) for data with a non-normal distribution
  • ln(|x| + 1) for data with a heavy right tail
  • transformation of categorical features
  • other non-linear data transformations (for example, from length and width to area = length × width), and so on.
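Here is a small sketch of some of these transformations (the toy data and column names are hypothetical):

```python
# Feature engineering sketch: polynomial features, a log-style transform,
# and a simple non-linear combination of existing columns.
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

df = pd.DataFrame({"length": [2.0, 3.0, 5.0], "width": [1.0, 4.0, 2.0]})

# polynomial features: x1, x2 -> x1, x2, x1^2, x1*x2, x2^2
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(df[["length", "width"]])
print(poly.get_feature_names_out())  # e.g. ['length' 'width' 'length^2' 'length width' 'width^2']

# ln(|x| + 1) for a heavy right tail
df["log_length"] = np.log1p(np.abs(df["length"]))

# non-linear combination of existing features
df["area"] = df["length"] * df["width"]
print(df)
```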

If you need to simplify the model, then you should use fewer features. First of all, remove any additional features that you added earlier, if you did so. But it may also turn out that the original dataset contains features that carry no useful information and sometimes cause problems. Linear models often work worse if some features are dependent — highly correlated. In this case, you need feature selection approaches to keep only those features that carry the maximum amount of useful information.
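A simple feature selection sketch (again, the synthetic data and the choice of k are arbitrary):

```python
# Keep only the k features most related to the target.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# 30 features, only 5 of which are actually informative
X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=5.0, random_state=0)

selector = SelectKBest(score_func=f_regression, k=5).fit(X, y)
X_selected = selector.transform(X)

print("kept feature indices:", selector.get_support(indices=True))
print("shape before/after:", X.shape, "->", X_selected.shape)
```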

It is worth saying that in the context of neural networks, feature engineering and feature selection make almost no sense, because the network finds the dependencies in the data by itself. This is actually why deep neural networks can recover such complex dependencies.

Why Getting More Data Sometimes Can’t Help

One of the techniques to combat overfitting is to get more data. However, surprisingly, this may not always help. Let’s generate a similar dataset 10 times larger and train the same models on it.
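Sticking with the earlier sketch, this experiment might look like the snippet below: the same three polynomial degrees, but 200 generated points instead of 20 (the numbers are illustrative, not the repo's exact code).

```python
# Same setup as before, but with 10x more data points.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(42)
X = np.sort(rng.uniform(-3, 3, size=(200, 1)), axis=0)  # 200 points instead of 20
y = 0.5 * X[:, 0] ** 3 - X[:, 0] + rng.normal(scale=2.0, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

for degree in (1, 3, 13):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
    print(f"degree={degree:2d}  "
          f"test MSE={mean_squared_error(y_te, model.predict(X_te)):.2f}")
```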

Why getting more data sometimes can’t help. Image by Author

A very simple model (degree 1) has remained simple; almost nothing has changed. So getting more data will not help in the case of underfitting.

But the complex model (degree 13) has changed for the better. It is still worse than the initially good model (degree 3), but much better than the original one. Why did this happen?

Last time (for the initial dataset), the model was trained on 14 data points (20 (initial dataset size) * 0.7 (train ratio) = 14). A 13-degree polynomial can perfectly match these data (by analogy, we can draw an ideal straight line (degree=1) through 2 points, an ideal parabola (degree=2) through 3 points, and so on). By getting 10 times more data, the size of our train set is now 140 data points. To perfectly match these data, we need a 139-degree polynomial!

Note that if we had initially trained a VERY complex model (for example, a 150-degree polynomial), such an increase in data would not have helped. So getting more data is a good way to improve the quality of the model, but it may not help if the model is far too complex.

So, the conclusion is — getting more data can help only with overfitting (not underfitting) and if your model is not TOO complex.

In the context of computer vision, getting more data can also mean data augmentation.
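For example, a minimal augmentation pipeline might look like the sketch below (this uses torchvision, which the article itself does not show; the transforms and parameter values are just an illustration):

```python
# Random transforms applied at training time act like "more data".
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])
# Each epoch, the model sees slightly different versions of the same images,
# which helps against overfitting.
```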

Summary

Let’s summarize everything in one table.

Techniques to fight underfitting and overfitting. Image by Author

Well, better in two.

Techniques to fight underfitting and overfitting (extended). Image by Author

Some tools and techniques have not been covered in this article. For example, I consider data cleaning and cross-validation or hold-out validation to be common practices in any machine learning project, but they can also be considered as tools to combat overfitting.

You may notice that eliminating underfitting and eliminating overfitting require diametrically opposite actions. So if you initially “misdiagnosed” your model, you can spend a lot of time and money on useless work (for example, getting new data when in fact you need to complicate the model). That’s why correct diagnosis is so important — hours of analysis can save you days and weeks of work.

In addition to the usual analysis of the model quality (train/test errors), there are many techniques for understanding exactly what needs to be done to improve the model (error analysis, ceiling analysis, etc.). Unfortunately, these topics are beyond the scope of this article.

However, all these procedures have the purpose of understanding where to move and what to pay attention to. I hope this article helps you to understand the basic principles of underfitting and overfitting and motivates you to learn more about them.

As I said earlier, all the code used in this tutorial is available on GitHub.

Thank you for reading!

  • I hope these materials were useful to you. Follow me on Medium to get more articles like this.
  • If you have any questions or comments, I will be glad to get any feedback. Ask me in the comments, or connect via LinkedIn or Twitter.
  • To support me as a writer and to get access to thousands of other Medium articles, get Medium membership using my referral link (no extra charge for you).
