Bias-Variance tradeoff in Machine Learning models: A practical example

Understanding model error and how to improve it.

Carolina Bento
Towards Data Science


In supervised machine learning, the goal is to build a high-performing model that is good at predicting the targets of the problem at hand and does so with a low bias and low variance.

But, if you reduce bias you can end up increasing variance and vice-versa. That’s where the bias-variance tradeoff comes into play.

In this article, we’re going to look into what bias and variance mean in the context of machine learning models, and what you can do to minimize them.

To build a supervised machine learning model you take a dataset that looks somewhat like this.

Structure of a dataset used in supervised learning.

It’s a series of data records, each one with several features and a target, the thing you want to learn to predict. But before getting started and building the model, you split the dataset into two distinct subsets:

  • Training set
  • Testing set

You typically select 20% of the data records at random and set them aside as the testing set, leaving the remaining 80% of the dataset to train the model. This is commonly referred to as the 80/20 split, but it’s only a rule of thumb.
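As a quick sketch (not the approach used later in this article, which simply slices the arrays), a random 80/20 split with scikit-learn could look like this, where features and targets are placeholders for your dataset’s columns:

from sklearn.model_selection import train_test_split

# Randomly hold out 20% of the records as the testing set
x_train, x_test, y_train, y_test = train_test_split(features, targets, test_size=0.2, random_state=42)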

The need for training and testing sets

Training and testing sets have different purposes.

The training set teaches the model how to predict the target values. As for the testing set, the name gives it away: it’s used to test the quality of the learning, that is, whether the model is good at predicting data beyond what was used in the learning process.

With the testing set you’ll see if the model generalizes its predictions beyond the training data.

We can measure the quality of the two phases in this process, learning and prediction, with the:

  • Training error,
  • Test error, also known as generalization error.

A good model has low training error. But you have to be careful not to drive the training error so low that the model overfits the training data. When the model overfits the data, it captures the patterns of the training set so perfectly that it becomes an expert at predicting only the results of the training set.

At first glance, this sounds great, but it has a downside. If the model is an expert at predicting the targets in the training set it will not be so good at predicting other data.

Bias-variance tradeoff

To understand this tradeoff first we need to look at the error of a model. In practice, we can split model error into three different components.

Model Error = Irreducible error + Bias + Variance

The irreducible error is independent of bias and variance. But the last two are inversely related: whenever you lower bias, variance tends to increase, and vice-versa, just like the training and test errors.
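As a side note, a sketch of the standard textbook form of this decomposition, for a single data point and with ŷ denoting the model’s prediction, makes the relationship explicit:

Expected Squared Error = Irreducible Error + (E[ŷ] − true value)² + E[(ŷ − E[ŷ])²]

Here E[ŷ] is the prediction averaged over models trained on different samples of the data. The middle term is the bias as defined below (already a squared quantity) and the last term is the variance.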

Irreducible error

This error is out of the control of the machine learning engineer who is building the model. It’s an error caused by noise in the data, random variations that don’t represent a real pattern in the data, or the influence of variables that are not yet captured as features.

One way to reduce this type of error is to identify the variables that have an impact on the problem we’re modeling and turn them into features.

Bias

Bias is about the ability to capture the true patterns in the dataset.

Bias of a simplistic (left) vs a complex model (right).

It is mathematically expressed as the squared difference between the expected predicted target values and the true target values.

So, when you have an unbiased model, you know the difference between the average predictions, that is, the expected predicted target values, and the true values is zero. And the difference is squared to penalize more heavily the predictions that are farther from the true value of the target.
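As a minimal sketch of how this could be computed, suppose we had trained several models of the same type on different samples of the data and collected each model’s test set predictions as a row of a hypothetical predictions_per_model array (the array and names here are just for illustration):

import numpy as np

# predictions_per_model: hypothetical array of shape (n_models, n_test_points)
# true_values: the true targets for the same n_test_points
expected_predictions = np.mean(predictions_per_model, axis=0)

# Bias: squared difference between the expected (average) predictions and the true values
bias = np.mean((expected_predictions - true_values) ** 2)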

A model with high bias will underfit the data. In other words, it will take a simplistic approach to model the true patterns in the data.

But a model with very low bias is usually more complex than it needs to be. It will overfit the data it used to learn, because it captures as much detail as it can. As a consequence, it will do a poor job of generalizing beyond the training data.

You can spot bias by looking at the training error. When the model has a high training error, it’s a sign of high bias.

To control bias you can add more features and build a more complex model, always finding a balance between underfitting and overfitting the data.

Variance

Variance captures the range of predictions for each data record.

Range of predictions in a model with high (left) and low variance (right).

It’s a measure of how far off each prediction is from the average of all predictions for that testing set record. And it is also squared to penalize predictions that are farther from the average prediction of the target.

Even though each model you build outputs a slightly different prediction value, you don’t want those predictions to span a big range of values.
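Continuing the hypothetical setup from the bias sketch above, where predictions_per_model holds one row of test set predictions per trained model, a minimal sketch of the variance would be:

# Variance: for each test set record, how far each model's prediction falls from the
# average prediction for that record, then averaged over all records
variance = np.mean(np.var(predictions_per_model, axis=0))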

You can also spot variance in your model by looking at the test error. When the model has high test error, it’s a sign of high variance.

One way to reduce variance is to build the model with more training data. The model will have more examples to learn from and improve its ability to generalize its predictions.

If building a model with more training data is not possible you can, for instance, build a model that incorporates bootstrap aggregating, usually called bagging.
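As a sketch, assuming a scikit-learn workflow where x_train, y_train and x_test are NumPy arrays like the ones used later in this article, a bagged model could look like this (the number of estimators is arbitrary):

from sklearn.ensemble import BaggingRegressor

# Bagging: train several models on bootstrap samples of the training data
# and average their predictions, which tends to reduce variance
bagging_model = BaggingRegressor(n_estimators=50, random_state=42)
bagging_model.fit(x_train.reshape(-1, 1), y_train.ravel())
bagging_predictions = bagging_model.predict(x_test.reshape(-1, 1))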

Other methods to lower variance include reducing the number of features, using feature selection techniques, and reducing the dimensionality of the dataset with techniques like Principal Component Analysis.
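For instance, assuming a feature matrix X with many columns, a sketch of reducing its dimensionality with scikit-learn’s PCA could look like this (the number of components is arbitrary):

from sklearn.decomposition import PCA

# Project the original features onto the 2 directions that retain the most variance in X
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)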

Now let’s see this in action

I’ve created a random dataset that follows a fourth degree polynomial with coefficients -5, -3, 10 and 2.5, from highest to lowest order, plus an intercept of 0.5.

Since we're going to fit models to this data, I split it into training and test sets. The training data looks like this.

Training set for 4th degree polynomial generated from random data.
import numpy as np
import matplotlib.pyplot as plt

dataset_size = 5000

# Generate a random dataset that follows a fourth degree polynomial
random_x = np.random.randn(dataset_size)
random_y = ((-5 * random_x ** 4) + (-3 * random_x ** 3) + 10 * random_x ** 2 + 2.5 * random_x + 0.5).reshape(dataset_size, 1)
# Hold out 20% of the dataset for testing
test_size = int(np.round(dataset_size * 0.2, 0))
# Split dataset into training and testing sets
x_train = random_x[:-test_size]
y_train = random_y[:-test_size]
x_test = random_x[-test_size:]
y_test = random_y[-test_size:]
# Plot the training set data
fig, ax = plt.subplots(figsize=(12, 7))
# removing top and right border
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
# adding major gridlines
ax.grid(color='grey', linestyle='-', linewidth=0.25, alpha=0.5)
ax.scatter(x_train, y_train, color='#021E73')
plt.show()

We can start by checking how the complexity of the model impacts bias.

We’ll be fitting gradually more complex models to this data, starting with a simple linear regression.

Simple linear regression model fit to the training data.
# Fit model
# A first degree polynomial is the same as a simple regression line
linear_regression_model = np.polyfit(x_train, y_train, deg=1)
# Predicting values for the test set
linear_model_predictions = np.polyval(linear_regression_model, x_test)
# Plot linear regression line
fig, ax = plt.subplots(figsize=(12, 7))
# removing top and right border
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
# adding major gridlines
ax.grid(color='grey', linestyle='-', linewidth=0.25, alpha=0.5)
ax.scatter(random_x, random_y, color='#021E73')
plt.plot(x_test, linear_model_predictions, color='#F2B950', linewidth=3)
plt.show()

This type of model is definitely too simple; it doesn’t follow the patterns of the data at all.

We’ll quantify the fit of this model with training and test errors, calculated using mean squared error, and looking at bias and variance.

Metrics for the simple linear regression model.
from sklearn.metrics import mean_squared_error

# A few auxiliary methods
def get_bias(predicted_values, true_values):
    """ Calculates model bias
    :param predicted_values: values predicted by the model
    :param true_values: true target values for the data
    :return: the bias of the model
    """
    # Flatten both arrays so the difference is taken element-wise
    return np.round(np.mean((np.ravel(predicted_values) - np.ravel(true_values)) ** 2), 0)

def get_variance(values):
    """ Calculates variance of an array of values
    :param values: array of values
    :return: the variance of the values
    """
    return np.round(np.var(values), 0)

def get_metrics(target_train, target_test, model_train_predictions, model_test_predictions):
    """ Calculates
    1. Training set MSE
    2. Test set MSE
    3. Bias
    4. Variance
    :param target_train: target values of the training set
    :param target_test: target values of the test set
    :param model_train_predictions: predictions from running the training set through the model
    :param model_test_predictions: predictions from running the test set through the model
    :return: array with Training set MSE, Test set MSE, Bias and Variance
    """
    training_mse = mean_squared_error(target_train, model_train_predictions)
    test_mse = mean_squared_error(target_test, model_test_predictions)
    bias = get_bias(model_test_predictions, target_test)
    variance = get_variance(model_test_predictions)
    return [training_mse, test_mse, bias, variance]

# Fit simple linear regression model
# A first degree polynomial is the same as a simple regression line
linear_regression_model = np.polyfit(x_train, y_train, deg=1)
# Predicting values for the test set
linear_model_predictions = np.polyval(linear_regression_model, x_test)
# Predicting values for the training set
training_linear_model_predictions = np.polyval(linear_regression_model, x_train)
# Calculate for simple linear model
# 1. Training set MSE
# 2. Test set MSE
# 3. Bias
# 4. Variance
linear_training_mse, linear_test_mse, linear_bias, linear_variance = get_metrics(y_train, y_test, training_linear_model_predictions, linear_model_predictions)
print('Simple linear model')
print('Training MSE %0.f' % linear_training_mse)
print('Test MSE %0.f' % linear_test_mse)
print('Bias %0.f' % linear_bias)
print('Variance %0.f' % linear_variance)

Let’s see if using a more complex model really helps drive down bias. We’re going to fit a second degree polynomial to this data.

2nd degree polynomial model fit to the training data.

It makes sense that a second degree polynomial reduces bias, because it is getting closer to the true patterns of the data.

And we also see the inverse relationship between bias and variance.

Metrics for the 2nd degree polynomial model.
#############################
# Fit 2nd degree polynomial #
#############################
# Fit model
polynomial_2nd_model = np.polyfit(x_train, y_train, deg=2)
# Used to plot the predictions of the polynomial model and inspect coefficients
p_2nd = np.poly1d(polynomial_2nd_model.reshape(1, 3)[0])
print('Coefficients %s\n' % p_2nd)
# Predicting values for the test set
polynomial_2nd_predictions = np.polyval(polynomial_2nd_model, x_test)
# Predicting values for the training set
training_polynomial_2nd_predictions = np.polyval(polynomial_2nd_model, x_train)
# Calculate for 2nd degree polynomial model
# 1. Training set MSE
# 2. Test set MSE
# 3. Bias
# 4. Variance
polynomial_2nd_training_mse, polynomial_2nd_test_mse, polynomial_2nd_bias, polynomial_2nd_variance = get_metrics(y_train, y_test, training_polynomial_2nd_predictions, polynomial_2nd_predictions)
print('2nd degree polynomial')
print('Training MSE %0.f' % polynomial_2nd_training_mse)
print('Test MSE %0.f' % polynomial_2nd_test_mse)
print('Bias %0.f' % polynomial_2nd_bias)
print('Variance %0.f' % polynomial_2nd_variance)
# Plot 2nd degree polynomial
fig, ax = plt.subplots(figsize=(12, 7))
# removing top and right border
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
# Adding major gridlines
ax.grid(color='grey', linestyle='-', linewidth=0.25, alpha=0.5)
x_linspace = np.linspace(min(random_x), max(random_x), num=len(polynomial_2nd_predictions))
plt.scatter(random_x, random_y, color='#021E73')
plt.plot(x_linspace, p_2nd(x_linspace), '-', color='#F2B950', linewidth=3)
plt.show()

As we increase the complexity of the model again, to a third degree polynomial, we see a slight improvement in bias. But variance increases again.

3rd degree polynomial model fit to the training data.

The plot didn’t change much, but the difference is clear when we look at the metrics.

Metrics for the 3rd degree polynomial model.
#############################
# Fit 3rd degree polynomial #
#############################
print('3rd degree polynomial')
# Fit model
polynomial_3rd_model = np.polyfit(x_train, y_train, deg=3)
# Used to plot the predictions of the polynomial model and inspect coefficients
p_3rd = np.poly1d(polynomial_3rd_model.reshape(1, 4)[0])
print('Coefficients %s' % p_3rd)
# Predict values for the test set
polynomial_3rd_predictions = np.polyval(polynomial_3rd_model, x_test)
# Predict values for the training set
training_polynomial_3rd_predictions = np.polyval(polynomial_3rd_model, x_train)
# Calculate for 3rd degree polynomial model
# 1. Training set MSE
# 2. Test set MSE
# 3. Bias
# 4. Variance
polynomial_3rd_training_mse, polynomial_3rd_test_mse, polynomial_3rd_bias, polynomial_3rd_variance = get_metrics(y_train, y_test, training_polynomial_3rd_predictions, polynomial_3rd_predictions)
print('\nTraining MSE %0.f' % polynomial_3rd_training_mse)
print('Test MSE %0.f' % polynomial_3rd_test_mse)
print('Bias %0.f' % polynomial_3rd_bias)
print('Variance %0.f' % polynomial_3rd_variance)

# Plot 3rd degree polynomial
fig, ax = plt.subplots(figsize=(12, 7))
# removing top and right border
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
# Adding major gridlines
ax.grid(color='grey', linestyle='-', linewidth=0.25, alpha=0.5)
x_linspace = np.linspace(min(random_x), max(random_x), num=len(polynomial_3rd_predictions))
plt.scatter(random_x, random_y, color='#021E73')
plt.plot(x_linspace, p_3rd(x_linspace), '-', color='#F2B950', linewidth=3)
plt.show()

To summarize this experiment, we could really see the bias-variance tradeoff in action. As we increased the complexity of the model, bias kept decreasing while variance increased.

Summary of the experimental results.
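If you want to assemble that summary yourself from the variables computed above, a sketch like this would do it (assuming pandas is available):

import pandas as pd

# Collect the metrics computed for each model into a single summary table
summary = pd.DataFrame({
    'Model': ['Simple linear', '2nd degree polynomial', '3rd degree polynomial'],
    'Training MSE': [linear_training_mse, polynomial_2nd_training_mse, polynomial_3rd_training_mse],
    'Test MSE': [linear_test_mse, polynomial_2nd_test_mse, polynomial_3rd_test_mse],
    'Bias': [linear_bias, polynomial_2nd_bias, polynomial_3rd_bias],
    'Variance': [linear_variance, polynomial_2nd_variance, polynomial_3rd_variance]
})
print(summary)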

Hope you got a better idea of the role bias and variance play when building a machine learning model.

Thanks for reading!
