
Dealing With High Bias and Variance

Regularization Explained Through Equations

Contents

In this post, we’ll be going through:

(i) The methods to evaluate a machine learning model’s performance

(ii) The problem of underfitting and overfitting

(iii) The Bias-Variance Trade-off

(iv) Addressing High Bias and High Variance

In the previous post, we looked at logistic regression and data pre-processing, and also went hands-on with the Titanic dataset on Kaggle, on which we obtained decent results. In both posts so far, you may have noticed that I used the term overfitting here and there and mentioned that it can lead to poor performance of a machine learning model. Now we'll take a detailed look at the problems machine learning models can run into and at their possible solutions. We won't be going hands-on in this post, but in the upcoming posts we'll apply the concepts that we learn here.

Evaluating a Model's Performance

Before diving into the problems that occur in machine learning models, how do we know that there is an issue with our model in the first place? For this, we need an evaluation metric. We've already seen a couple of them: we used the coefficient of determination (r² score) to evaluate the linear regression model and accuracy to evaluate the logistic regression model. Both of these metrics are derived from the underlying error computed through their respective cost functions, and it is this error that we will use to diagnose the problems a model faces.

Errors are calculated in different ways in linear and logistic regression. To make things easy to understand and to develop intuition, we'll look at things from a linear regression perspective, but the terms we define and the roles they play apply equally to any other machine learning model.

An error is a 'mistake', and the extent of this mistake can be quantified as the absolute difference between the true value and the estimated value. For 'n' quantities, it can be represented as:

Error = (1/n) Σ |y(i) - ŷ(i)|, where the sum runs over i = 1, …, n, y(i) is the true value and ŷ(i) is the estimated value for the i-th quantity.
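
As a quick illustration, here is a minimal NumPy sketch of this mean absolute error; the values in y_true and y_pred are made up for the example:

import numpy as np

# Made-up true and estimated values for n = 5 quantities.
y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.2])
y_pred = np.array([2.8, 5.5, 2.0, 8.1, 4.0])

# Error: average absolute difference between true and estimated values.
error = np.mean(np.abs(y_true - y_pred))
print(error)  # 0.5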

Computing error for the model’s performance on the training and test set helps us to identify the problems faced by the model. Consider the following scenarios:

(i) Low training set error, low test set error

(ii) Low training set error, high test set error

(iii) High training set error, high test set error

(iv) High training set error, low test set error

Let’s look at these scenarios one by one.

If both the training set error and the test set error are low, it means the model learnt the input-output mappings well on the training set and was also able to generalize them to the test set. This is the desired behaviour of a good machine learning model, and it was the case with the logistic regression model we trained in the previous post.

When the training set error is low but the test set error is high, we say that the model is overfitting the training set. This means that, after training, the model learnt the input-output mappings exceptionally well for the training set but couldn't generalize these mappings to the test set. Let's try to understand why this happens. Data collected for any task can never be error free, no matter how careful the process is. A good machine learning model should filter out noise and learn only those input-output mappings that exclude the noisy data points, i.e. a good machine learning model is robust to noise. Consider a music concert where there is both the melody of the artists and the noise of the crowd, but we pay attention only to the artists' melody (input) as it makes us cheerful (output), ignoring all the noise of the crowd. If we paid heed to all the voices (artists + crowd), we probably wouldn't be as happy. Overfitting is the case where the machine learning model pays heed to each and every voice (artists + noise) when in reality it just needs to focus on the melody. A machine learning model that overfits the training data is said to suffer from high variance. Later in the post we'll see how to deal with overfitting.

If both the training and test set errors are high, the machine learning model has not properly learnt the input-output mapping on the training set and is also unable to generalize to the test set. In other words, our machine learning model is still in a raw state. Such a model is said to underfit both the training and test datasets and suffers from high bias.
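
To make the four scenarios concrete, here is a small, purely illustrative Python helper that turns a (training error, test error) pair into a diagnosis; the 0.1 threshold and the gap check are arbitrary choices for the sketch, not standard values:

# Illustrative rule of thumb for reading the two errors together.
def diagnose(train_error, test_error, threshold=0.1):
    if train_error > threshold and test_error > threshold:
        return "high bias (underfitting)"
    if train_error <= threshold and test_error > train_error + threshold:
        return "high variance (overfitting)"
    if train_error <= threshold and test_error <= threshold:
        return "good fit"
    return "unusual case: check how the data was sampled"

print(diagnose(0.02, 0.25))   # high variance (overfitting)
print(diagnose(0.30, 0.32))   # high bias (underfitting)
print(diagnose(0.02, 0.05))   # good fit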

A machine learning model with high training set error and low test set error is a rare occurrence, but it can happen when the training and test data are not sampled consistently (e.g. a training set deliberately built with an almost equal number of samples for each class in a classification problem), which causes a substantial difference in the statistical properties of the test set. Consider a binary classification problem where predictions for class A give 10% training error and predictions for class B give 40% training error. Averaging the per-class errors gives 25% error on the overall training set, but this is not a good measure: it may very well be the case that occurrences of class A outnumber class B in a real-life test set, in which case the actual prediction error will be significantly less than 25%. Another case in which this scenario occurs is when the test set is significantly smaller than the training set; although similar to the training set, it yields a lower error simply because it does not contain as much noise. Nothing much can be done to avoid these types of scenarios other than studying and sampling the data before applying any machine learning algorithm to it.
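
The numbers from the example above can be checked directly; the 90% class-A proportion assumed for the real-life data is hypothetical:

# Per-class training errors from the example above.
error_a, error_b = 0.10, 0.40

# Plain average, which implicitly assumes both classes are equally frequent.
print((error_a + error_b) / 2)               # 0.25

# If class A makes up, say, 90% of real-life data, the expected error is lower.
p_a = 0.9
print(p_a * error_a + (1 - p_a) * error_b)   # 0.13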

Visual representation of high bias, a perfect fit and high variance (Source)

Definitions of Bias and Variance

The terms bias and variance will not sound new to readers who are familiar with statistics. Standard deviation measures how close to or far from a central position the data points lie, and variance is simply the square of the standard deviation. So, variance measures how far a set of data is spread out. Data used for machine learning tasks doesn't come with a predefined input-output mapping, and the task of these models is to find a good enough mapping that generalizes well. A machine learning model which (over)fits all the data points, including the noisy ones (in other words, fits all the data points no matter how widely they are spread), is said to suffer from high variance.
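
A two-line NumPy check of that relationship between standard deviation and variance (the data points are arbitrary):

import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Variance is the squared standard deviation; both measure spread around the mean.
print(np.std(x) ** 2)   # 4.0
print(np.var(x))        # 4.0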

In statistics, the bias (or bias function) of an estimator (here, the machine learning model) is the difference between the estimator's expected value and the true value for a given input. An estimator or decision rule with zero bias is called unbiased. A machine learning model has high bias when its output is quite far off from the actual output, which happens when the model is too simple. We saw earlier that a model with high bias has high error on both the training set and the test set.

The Bias-Variance Tradeoff

The bias-variance tradeoff is a property that lies at the heart of supervised machine learning algorithms. Ideally, we want a machine learning model that captures the underlying patterns in the training data (without memorizing the outliers) and generalizes them to the test (unseen, real-world) data, achieving very small error and very high accuracy. We saw earlier that high variance models are complex, fit the training set extremely well and so achieve minimal error on it, but fail to generalize to unseen data. In contrast, high bias models represent extremely simple mappings; they can carry some features over to unseen data, but their simplicity leads to underfitting on the training set and to predictions with low variance but high bias on data outside the training set. The ideal amount of bias and variance for a particular machine learning model is the one that minimizes the total error (which includes the bias error, the variance error and the irreducible noise).
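
One way to see the tradeoff in code is to fit polynomial models of increasing degree to the same noisy data and compare training and test errors. This sketch assumes scikit-learn is available and uses an arbitrary sine-plus-noise dataset; degree 1 underfits (high bias, both errors high) while degree 15 overfits (high variance, training error far below test error):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 60)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 60)   # true pattern + noise
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (1, 4, 15):   # too simple, about right, too flexible
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    train_mse = mean_squared_error(y_tr, model.predict(X_tr))
    test_mse = mean_squared_error(y_te, model.predict(X_te))
    print(degree, train_mse, test_mse)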

Bias-Variance Tradeoff (Source)

Addressing Issues In Machine Learning

Building a machine learning model is an iterative process. After taking a look at the dataset, we should always start with simple models and then keep increasing their complexity until we get the desired results on unseen data. An extremely simple machine learning model suffers from high bias and an extremely complex one suffers from high variance. Since we move from simple to complex models step by step, the problem of high bias gets resolved along the way, but getting rid of high variance isn't that easy: with a given set of parameters and methods we may not end up with an optimum model, and a model with high variance then needs to be treated explicitly. So let's discuss a few ways to solve the problem of high variance first.

Addressing High Variance

Consider the example of a logistic regression classifier. If we say that the classifier overfits the training data, this means that the output of the equation y = sigmoid(Wx + b) is very close to the actual training data values. So, what is the root cause of overfitting? Clearly, it is the parameter values learnt while training the classifier that are responsible for the high variance (overfitting) of the machine learning model. Regulating these parameter values helps get rid of overfitting, and this process is called regularization. In formal terms, regularization is the process of adding information in order to solve an ill-posed problem or to prevent overfitting. Regularization keeps the parameter values small, and this prevents overfitting. Later in the post we'll see why this works.

Train and Test Error for a High Variance Model

Regularization

Since we are modifying the parameter values, we need to update our cost function in order to see the effects of regularization. L2 regularization is the most commonly used and yields considerably good results. The cost function after adding L2 regularization is:

J(W, b) = -(1/n) Σ [ y(i) log(Φ(z(i))) + (1 - y(i)) log(1 - Φ(z(i))) ] + (λ/2) ||W||²

where the sum runs over the n training examples, y(i) represents the actual output for training example i, Φ(z(i)) is the value predicted by logistic regression for training example i, λ is the regularization parameter and ||W||² is the squared L2 norm of the weight vector W. The weight vector is nothing but all the Wi values stacked together in a single vector; Python's machine learning libraries use these vectorized parameter equations to speed up the calculations.

Suppose the vector W has 3 values W1, W2 and W3. Then the L2 norm of W is:

||W|| = sqrt(W1² + W2² + W3²)

and the squared L2 norm used in the regularization term is ||W||² = W1² + W2² + W3².

Notice that we didn't regularize the parameter b. In a typical logistic regression classifier, each feature has a corresponding W value, so there are many W values but just a single bias value b, and regularizing the bias parameter makes practically no difference. Just for the sake of completeness, we could regularize b by adding (λ/2)b² at the end of the cost function, using a separate λ value for it.
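
Putting the pieces together, here is a small NumPy sketch of this L2-regularized cost; the toy data, weights and λ value are made up for illustration:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def l2_regularized_cost(W, b, X, y, lam):
    phi = sigmoid(X @ W + b)                  # predicted values Φ(z(i))
    # Average cross-entropy cost of logistic regression.
    cost = -np.mean(y * np.log(phi) + (1 - y) * np.log(1 - phi))
    # L2 penalty: (λ/2) * ||W||², with the bias b left unregularized.
    penalty = (lam / 2) * np.sum(W ** 2)
    return cost + penalty

# Toy example: 4 training examples, 3 features, arbitrary parameter values.
X = np.array([[0.5, 1.2, -0.3],
              [1.5, -0.7, 0.8],
              [-0.2, 0.4, 1.1],
              [0.9, 0.1, -1.4]])
y = np.array([1, 0, 1, 0])
W = np.array([0.3, -0.2, 0.5])
print(l2_regularized_cost(W, b=0.1, X=X, y=y, lam=0.01))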

Why and how does regularization work?

Now, while performing gradient descent to update the parameters, the updates will look as follows:

W = W - alpha * dJ/dW

b = b - alpha * dJ/db

With regularization, the dJ/dW term gains an additional λW component, so at each update we not only subtract the usual gradient but also shrink W itself by a small fraction (an effect often called weight decay). Repeating this over a number of iterations, the final value of W ends up smaller in magnitude than the one we would have obtained without regularization. The dJ/db term, on the other hand, stays the same, since we chose not to regularize b.
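
The effect is easiest to see with one explicit gradient step; in this sketch the data, learning rate and λ are arbitrary, and the gradient expressions follow the averaged cost defined above:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(W, b, X, y, alpha, lam):
    n = len(y)
    phi = sigmoid(X @ W + b)
    dW = (X.T @ (phi - y)) / n + lam * W   # usual gradient plus the λW regularization term
    db = np.mean(phi - y)                  # b is not regularized
    return W - alpha * dW, b - alpha * db

X = np.array([[0.5, 1.2], [1.5, -0.7], [-0.2, 0.4]])
y = np.array([1, 0, 1])
W, b = np.array([2.0, -3.0]), 0.0

# With λ > 0 the updated weights are pulled closer to zero than without it.
print(gradient_step(W, b, X, y, alpha=0.1, lam=0.0)[0])
print(gradient_step(W, b, X, y, alpha=0.1, lam=1.0)[0])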

So regularization leads to smaller parameter values, but why does that reduce overfitting? We'll use logistic regression with 'n' features to see why. We know that y = sigmoid(W1x1 + W2x2 + W3x3 + ... + Wnxn + b). Suppose regularization is so strong that the parameter values Wi become very small and very close to 0. The output of the logistic regression classifier then becomes simply y = sigmoid(b), which is a very simple and very poor estimate of the output; in other words, this output is so extremely simple that it has extremely high bias. From the bias-variance tradeoff, we know that high bias and high variance are two opposite ends of the spectrum for a machine learning model, and ideally we want to land somewhere in between. To achieve this, we have to choose the regularization parameter λ very wisely. λ is set using a development set: we try a variety of λ values, choose the one which yields the best performance for our machine learning model, and then use this model to make predictions on the test set.
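
A typical way to do this in practice is sketched below using scikit-learn, whose LogisticRegression takes C, the inverse of λ (so a larger λ means a smaller C); the dataset and the grid of λ values are arbitrary:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
# Split into train / development / test sets.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_dev, X_te, y_dev, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

best_lam, best_acc = None, -1.0
for lam in (0.001, 0.01, 0.1, 1.0, 10.0):
    clf = LogisticRegression(C=1.0 / lam, max_iter=1000).fit(X_tr, y_tr)
    acc = accuracy_score(y_dev, clf.predict(X_dev))
    if acc > best_acc:
        best_lam, best_acc = lam, acc

# Retrain with the chosen λ and report performance on the untouched test set.
final = LogisticRegression(C=1.0 / best_lam, max_iter=1000).fit(X_tr, y_tr)
print(best_lam, accuracy_score(y_te, final.predict(X_te)))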

Like L2 regularization, there is another type of regularization term we can add to the cost function. The cost function after adding this L1 regularization term is:

J(W, b) = -(1/n) Σ [ y(i) log(Φ(z(i))) + (1 - y(i)) log(1 - Φ(z(i))) ] + λ ||W||

where ||W|| is the L1 norm of the vector W. The L1 norm of a vector W with 3 elements is computed as:

||W|| = |W1| + |W2| + |W3|, where |Wi| represents the absolute value of Wi.
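
A quick NumPy check of the two norms for an arbitrary 3-element vector:

import numpy as np

W = np.array([3.0, -4.0, 1.0])

print(np.sum(np.abs(W)))        # L1 norm: |3| + |-4| + |1| = 8.0
print(np.sqrt(np.sum(W ** 2)))  # L2 norm: sqrt(9 + 16 + 1) ≈ 5.1

In scikit-learn, this penalty can be used via LogisticRegression(penalty='l1', solver='liblinear'); a well-known side effect of L1 regularization is that it tends to drive some weights exactly to zero.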

Adding More Training Data

Another way of dealing with an overfitting machine learning model is to add more data to the training set. The reasoning is simple: if a model is overfitting the training set, it is learning the noisy inputs in the data as well. With more data, accounting for every individual noisy point becomes so difficult that the model gives up on the noise and focuses on the general pattern of the input-output pairs. This leads to models which don't overfit the training data and are free from the problem of high variance. The amount of data that needs to be added depends on the extent to which the model is overfitting.
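
One way to check whether more data would help is to plot a learning curve. The sketch below uses scikit-learn's learning_curve on an arbitrary synthetic dataset with a weakly regularized logistic regression (a high-variance setup); the gap between training and validation error should shrink as the training size grows:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, n_features=50, n_informative=10,
                           random_state=0)

# A very large C means a very small λ, i.e. almost no regularization.
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(C=1e6, max_iter=5000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(n, 1 - tr, 1 - va)   # training error and validation error per training size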

Addressing High Bias

Bias in a machine learning model is not as big an issue as variance, as we saw earlier, and is easier to remove. Since high bias leads to an extremely simple model which does not capture the features necessary to make accurate predictions, we can do the following things to remove high bias:

(i) Use a more complex machine learning model than the existing one (for example, by introducing polynomial features instead of linear ones like y = Wx + b), as it may capture the important features and patterns in the training data much better.

(ii) We saw that regularization shrinks the parameter values drastically, which can itself lead to high bias. So, decreasing the regularization parameter helps in getting rid of high bias. Both remedies are sketched below.
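
Both remedies can be seen on a toy dataset that a plain linear decision boundary cannot handle; the sketch below uses scikit-learn's make_circles, and the degree and C values are arbitrary choices (recall that a larger C means a smaller λ):

from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split

X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# (i) Plain linear features: too simple for circular data, so it underfits (high bias).
linear = LogisticRegression().fit(X_tr, y_tr)

# (ii) Polynomial features plus a weaker penalty (larger C = smaller λ): more capacity.
poly = make_pipeline(PolynomialFeatures(degree=2),
                     LogisticRegression(C=10.0)).fit(X_tr, y_tr)

print(linear.score(X_te, y_te), poly.score(X_te, y_te))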

Train and Test Error for a High Bias Model

In this post, we first developed intuition about high bias and high variance, then understood the problems a machine learning model can run into when either or both of them creep in. Later, we looked at ways to remove high bias and high variance from a machine learning model and also developed insight into how regularization works.

In the next post, we’ll have a look at another supervised machine learning method called Support Vector Machines and will also solve a dataset from Kaggle using it.

