
As far as data scientist interviews go, the bias-variance tradeoff is one of the most common topics I have encountered, first as the person being interviewed and, more recently, as the person interviewing candidates or sitting in on such interviews. Later in the post, we will discuss what the bias-variance tradeoff is and why it works differently in deep learning, but let me first explain why I think this topic keeps coming up as a way to gauge the breadth of machine learning knowledge of both entry-level and experienced data scientist candidates.
As machine learning scientists, we spend a great deal of time, energy, care and computational resources training machine learning models, yet we know that our models will always have some level of error when they generalize, which is also known as test error. Less experienced data scientists tend to focus on learning new modeling methodologies and algorithms, which I do believe is a healthy exercise. However, the more experienced data scientists are the ones who have learned over time how to better understand and handle the test error that inevitably exists in those trained models.
The bias-variance tradeoff is fundamental knowledge that guides us toward improving our trained models by learning from their errors. This is why I borrowed the commonly-used saying that "it is not how many times you get knocked down that counts, it's how many times you get back up". In this analogy, the inevitable test errors are the times we get knocked down, and how we use that knowledge to improve the model is how we get back up. This is exactly why experienced data scientists like to discuss the classical bias-variance tradeoff with interviewees: they prefer to work with scientists who have a plan for when they inevitably come across test errors. That explains why the question shows up in entry-level interviews. But why do we keep seeing this topic in interviews for experienced scientist roles? That gets us to the double descent phenomenon, which is the main topic of this post.
We will start by briefly going over the classical bias-variance tradeoff to ensure we use the same terminology, and then move on to the discussion of the double descent phenomenon.
Let’s get started!
1. Establishing Terminology
Before getting to the bias-variance tradeoff and double descent, there are a few concepts we need to define to stay consistent for the remainder of the post (if you are familiar with these topics, feel free to skip to section 2). These are fundamental topics that we expect all our scientists to be familiar with. I will define them only briefly here, but if you are looking for a more detailed introduction, please refer to the post below:
Machine Learning Basics I Look for in Data Scientist Interviews
- Machine Learning Lifecycle: Each supervised machine learning lifecycle includes a training process where the model learns the underlying patterns of the training data. The error measured during training is called the "train error". We expect the train error to decrease the longer we train the model, which indicates the model is learning the training data. The trained model is then used to make predictions on a test set, which the model has not seen before. These predictions are not going to be perfect, so there is always some error when testing a trained model, which is called the "test error". We measure the test error and then use various tools to improve the model and thereby lower it. In other words, the lower the test error, the better the trained model. We can almost say that the goal of a machine learning exercise is to lower the test error.
- Underfit and Overfit Models: The models we train are not going to be perfect. When a model is too simple to capture the underlying details of the training data, it is called an "underfit" model. In this case the model has a high train error, since it did not manage to learn the training data well, and understandably the test error will also be high. "Overfit" models are the opposite: when a model learns too much of the detail of the training data, such as the noise present in it, it is called an "overfit" model. Here we usually see a low train error, since the model learned the training data very well, but a surprisingly high test error, which implies that the model did not generalize well and is not useful for making actual predictions. Remember that the goal of the machine learning exercise is to lower the test error. In overfit scenarios, even though the train error is low, the test error is high, which is not a satisfactory outcome. Now that we understand test error is the important piece of the puzzle and are familiar with overfit and underfit scenarios, let's dive deeper into the components of test error.
- Bias vs. Variance: Test error can be broken down into two components: bias and variance. Bias is the error due to overly simplistic models, so high bias is indicative of underfitting. Variance, on the other hand, is the error due to overly complex models; in other words, high variance indicates overfitting. So here's the mapping you want to remember: bias is related to underfitting and variance is related to overfitting.
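If you would like to see these two components in action, below is a small sketch of how bias and variance can be estimated empirically: we refit the same model class on many independently drawn training sets, then measure how far the average prediction is from the truth (squared bias) and how much the predictions fluctuate across training sets (variance). The data-generating function, noise level and polynomial degrees are arbitrary choices I made for illustration and are not tied to the rest of the post.
# a small illustrative sketch (not part of the main examples): estimate bias^2 and variance
# by refitting the same model class on many independently drawn training sets
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
rng = np.random.default_rng(1234)
true_fn = np.sin                    # assumed "ground truth" function, chosen arbitrarily
x_test = np.linspace(-3, 3, 50)     # fixed test inputs
n_repeats, n_train, noise_sd = 200, 30, 0.5
def fit_and_predict(degree):
    # fit the same polynomial model on many noisy training sets and collect test predictions
    preds = np.empty((n_repeats, x_test.size))
    for r in range(n_repeats):
        x_tr = rng.uniform(-3, 3, n_train)
        y_tr = true_fn(x_tr) + rng.normal(0, noise_sd, n_train)
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model.fit(x_tr.reshape(-1, 1), y_tr)
        preds[r] = model.predict(x_test.reshape(-1, 1))
    return preds
for degree in (1, 3, 12):
    preds = fit_and_predict(degree)
    bias_sq = np.mean((preds.mean(axis=0) - true_fn(x_test)) ** 2)  # how far the average fit is from the truth
    variance = np.mean(preds.var(axis=0))                           # how much the fits fluctuate across training sets
    print(f"degree {degree:>2}: bias^2 = {bias_sq:.3f}, variance = {variance:.3f}")
The simplest model (degree 1) should show higher bias and lower variance, while the most flexible model (degree 12) should show the opposite, which is exactly the mapping described above.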
Here is a tabular summary of what we discussed above:
| Scenario | Train Error | Test Error | Dominant Component of Test Error |
|---|---|---|---|
| Underfit (model too simple) | High | High | Bias |
| Overfit (model too complex) | Low | High | Variance |
Now that we are familiar with the terminology, let’s define bias-variance tradeoff and then we can move on to the double descent side of it.
2. Bias-Variance Tradeoff
As we discussed in the previous section, the goal of a machine learning algorithm is to learn from the training data and then generalize that learning at test time. In other words, using a finite amount of training data, the model learns to make predictions for data it has not encountered before (i.e. unseen data). If we define "error" as the distance between what the model is expected to predict and what it actually predicts, then we can think of the training process as follows: during training, the model's objective is to minimize the "train" error (i.e. learn to make accurate predictions for the training data), while the ultimate goal of the machine learning model is to minimize the "test" error (i.e. make predictions at test time that are as close as possible to the actual outcomes). As we discussed in the previous section, test error includes two components, bias and variance, and the trade-off between the two is the topic of this section.
In classical machine learning, it has been empirically observed that as model complexity increases, the "train" error decreases, since the model fits the training data better and better. The "test" error, on the other hand, initially decreases but eventually starts increasing, due to overfitting.
The best way to demonstrate this is to look at a plot visualizing the components of the "test" error. Let's create some synthetic data in the shape of the bias-variance tradeoff and then discuss further. For the purposes of this post, you do not need to understand the code below, since that is not the point here, but I decided to include it as a fun exercise.
# import libraries
import numpy as np
import matplotlib.pyplot as plt
# generate values for model complexity
model_complexity = np.linspace(0, 10, 100)
# define functions for bias, variance, and test error
bias_squared = (10 - model_complexity) ** 2 / 20
variance = model_complexity ** 2 / 30
# test error is bias and variance together
test_error = bias_squared + variance
# plot
plt.figure(figsize=(10, 6))
plt.plot(model_complexity, bias_squared, label=r'Bias', color='blue')
plt.plot(model_complexity, variance, label='Variance', color='red')
plt.plot(model_complexity, test_error, label='Test Error (= Bias + Variance)', color='black')
# labels, title and legend
plt.xlabel('Model Complexity', fontsize=14)
plt.ylabel('Error', fontsize=14)
plt.title('Bias-Variance Tradeoff', fontsize=16)
# mark the minimum of the test-error curve (the optimal trade-off point)
optimal_complexity = model_complexity[np.argmin(test_error)]
plt.axvline(x=optimal_complexity, color='gray', linestyle='--', label='Optimal Trade-Off')
plt.legend()
plt.grid(True)
plt.show()
Results:

The plot above is what you would usually see in ML courses teaching the bias-variance tradeoff. As we explained earlier, the bias (blue line) starts quite high in the top-left corner for a simplistic model, and as model complexity on the x-axis increases, the bias (blue line) decreases while the variance (red line) increases. The black line is the test error, which is the sum of bias and variance. Remember that what we really care about is the "test error", the black line. The term "tradeoff" refers to the fact that the overall test error (black line) starts high in the top-left corner, improves as model complexity increases up to an optimal point, and then starts to go up again. Since the goal of a machine learning exercise is to find the lowest point of test error, we want to find that sweet spot, which implies a tradeoff between bias and variance.
In reality, you would not see such clean plots of bias vs. variance from which to pick the optimal point, so I decided to include a more realistic example of what the plot would look like in practice. In the code block below, we train polynomial regression models and increase complexity by increasing the polynomial degree. Then we measure the errors and plot them. I have added comments to make the code easier to follow.
# import libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
# generate synthetic data
np.random.seed(1234)
X = np.linspace(-3, 3, 100)
y = X**3 - 2*X**2 + X + np.random.normal(0, 3, X.shape[0])
# split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
X_train = X_train.reshape(-1, 1)
X_test = X_test.reshape(-1, 1)
# variables to store error
train_errors = []
test_errors = []
degrees = range(1, 15)
# loop over degrees to fit polynomial models of increasing complexity
for degree in degrees:
    poly_features = PolynomialFeatures(degree=degree)
    X_train_poly = poly_features.fit_transform(X_train)
    X_test_poly = poly_features.transform(X_test)
    model = LinearRegression()
    model.fit(X_train_poly, y_train)
    # calculate errors
    y_train_pred = model.predict(X_train_poly)
    y_test_pred = model.predict(X_test_poly)
    train_errors.append(mean_squared_error(y_train, y_train_pred))
    test_errors.append(mean_squared_error(y_test, y_test_pred))
# optimal degree where test error is minimized
optimal_degree = degrees[np.argmin(test_errors)]
# plot
plt.figure(figsize=(12, 8))
plt.plot(degrees, train_errors, label="Train Error", marker='o', linestyle='-', color='b')
plt.plot(degrees, test_errors, label="Test Error", marker='o', linestyle='-', color='r')
plt.scatter(optimal_degree, min(test_errors), color="purple", s=100, label=f"Optimal Degree ({optimal_degree})", zorder=5)
plt.axvspan(degrees[0], optimal_degree - 1, color="blue", alpha=0.1, label="High Bias (Underfitting)")
plt.axvspan(optimal_degree + 1, degrees[-1], color="red", alpha=0.1, label="High Variance (Overfitting)")
plt.xlabel("Model Complexity (Polynomial Degree)")
plt.ylabel("Mean Squared Error")
plt.title("Bias-Variance Tradeoff")
plt.legend()
plt.grid(True)
plt.show()
Results:

As we can see in the plot, the x-axis depicts model complexity, measured by the polynomial degree of the regression model, and the y-axis depicts the error. Note that we cannot directly measure bias and variance; instead we plot train and test error to observe the bias-variance tradeoff. As model complexity increases, both train and test errors decrease in the area identified as "high bias" (the light blue shade). This matches our earlier definition of underfitting, where both train and test errors are high. We then see an optimal area between model complexities of roughly 2 to 4, after which the train and test errors start to diverge. The train error keeps decreasing as the model gets more complex, but the test error starts increasing slowly. This is the area shaded in light red, which depicts overfitting or high variance: train error is low but test error is high, so the model does not generalize well. The takeaway from such an exercise is that we would pick an optimal level of complexity, such as a 3rd degree polynomial regression, where both train and test errors are low.
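As a side note, in practice we would rarely pick the optimal degree from a single train/test split; a more robust habit is k-fold cross-validation. The snippet below is a quick sketch of that idea on the same synthetic data; the choice of 5 folds is just a common default, not something dictated by the example above.
# a quick sketch: choose the polynomial degree with 5-fold cross-validation
# instead of a single train/test split (5 folds is an arbitrary but common default)
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
np.random.seed(1234)
X = np.linspace(-3, 3, 100)
y = X**3 - 2*X**2 + X + np.random.normal(0, 3, X.shape[0])
degrees = range(1, 15)
cv_mse = []
for degree in degrees:
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    # cross_val_score returns negative MSE by convention, so flip the sign
    scores = cross_val_score(model, X.reshape(-1, 1), y, cv=5, scoring='neg_mean_squared_error')
    cv_mse.append(-scores.mean())
best_degree = list(degrees)[int(np.argmin(cv_mse))]
print(f"cross-validated best degree: {best_degree}")
On this data, cross-validation should also point to a low-degree polynomial (around 3), matching the single-split result above.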
Let's also look at how some of these models fit the actual train and test data, to visually see how the fits improve as the model increases in complexity, up to a certain point. Note that even before going through this exercise, we know the 3rd degree polynomial should be the best fit, since we used a third degree polynomial formula to generate our synthetic data, but the exercise is still valuable for demonstrating how the fit improves up to a sweet spot and then degrades as the model overfits the training data beyond the third degree.
I have added comments to the code below to make it easy to follow. In general, we first create the same train and test sets used in the previous example and then create subplots for varying polynomial degrees to show how the model (red line) fits the train data (blue dots) and test data (green dots). Finally, I have added a table of errors, measured by mean squared error (MSE) as in the previous example, to also observe quantitatively that the lowest test error occurs at the third polynomial degree.
Let’s implement the code and then look at the results.
# import libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import pandas as pd
# generate synthetic data
np.random.seed(1234)
X = np.linspace(-3, 3, 100)
y = X**3 - 2*X**2 + X + np.random.normal(0, 3, X.shape[0])
# split data into training and test sets, using what we had in the previous section
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
X_train = X_train.reshape(-1, 1)
X_test = X_test.reshape(-1, 1)
# lists to store errors (mean squared error or mse)
train_mse_list = []
test_mse_list = []
# create figure for subplots
plt.figure(figsize=(20, 12))
# plot polynomial regression fits for degrees 1 to 9
degrees = range(1, 10)
for i, degree in enumerate(degrees, 1):
    poly_features = PolynomialFeatures(degree=degree)
    X_train_poly = poly_features.fit_transform(X_train)
    X_test_poly = poly_features.transform(X_test)
    model = LinearRegression()
    model.fit(X_train_poly, y_train)
    # predictions
    y_train_pred = model.predict(X_train_poly)
    y_test_pred = model.predict(X_test_poly)
    X_range = np.linspace(-3, 3, 500).reshape(-1, 1)
    X_range_poly = poly_features.transform(X_range)
    y_range_pred = model.predict(X_range_poly)
    # calculate errors and store
    train_mse = mean_squared_error(y_train, y_train_pred)
    test_mse = mean_squared_error(y_test, y_test_pred)
    train_mse_list.append(train_mse)
    test_mse_list.append(test_mse)
    # subplot for this degree
    plt.subplot(3, 3, i)
    plt.scatter(X_train, y_train, color="blue", label="Training Data", alpha=0.6)
    plt.scatter(X_test, y_test, color="green", label="Test Data", alpha=0.6)
    plt.plot(X_range, y_range_pred, color="red", label=f"Degree {degree} Fit")
    plt.title(f"Polynomial Regression (Degree {degree})", fontsize=14)
    plt.xlabel("x", fontsize=12)
    plt.ylabel("y", fontsize=12)
    plt.legend(fontsize=10)
    plt.grid(True)
# adjust layout and show
plt.tight_layout()
plt.show()
# dataframe for errors
mse_df = pd.DataFrame({
    "Degree": [f"{d}" for d in degrees],
    "Train MSE": [round(mse, 2) for mse in train_mse_list],
    "Test MSE": [round(mse, 2) for mse in test_mse_list]
})
# display errors
print("MSE Values across Polynomial Degrees")
print(mse_df)
Results:


As we can see in the subplots, starting from the third degree polynomial there is a relatively good fit between the scatterplots of the train/test sets and the model (red line). Looking more closely at the MSE table after the subplots, we can see that the test error decreases up to the third degree polynomial and then, as expected, starts increasing beyond that point, implying that the sweet spot for the bias-variance tradeoff is the third degree polynomial.
At this point, you may be wondering what happened to the whole idea of "getting back up" when we face errors in our training; this exercise is a good example of that. If we train a model where the train error is low but the test error is high, the model is overfitting, and one solution is to use a less complex model. On the flip side, if both train and test errors are high, we can conclude that a more complex model may improve the results. Said differently, we want to learn from our errors and find a way to get back up and improve model performance, which is why understanding the bias-variance tradeoff matters.
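To make this "learn from the error and get back up" loop a bit more concrete, here is a deliberately simplistic helper that turns the heuristics above into code. The thresholds and the example error values are arbitrary assumptions I made for illustration; in a real project you would judge "high" and "low" relative to your data scale and baseline.
# a toy diagnostic (my own illustration, not a standard API): map train/test error
# to the rough heuristics discussed above; thresholds are arbitrary assumptions
def diagnose_fit(train_error, test_error, high_error=10.0, gap_ratio=1.5):
    if test_error > gap_ratio * train_error:
        return "likely overfitting: try a simpler model, regularization, or more training data"
    if train_error > high_error and test_error > high_error:
        return "likely underfitting: try a more complex model or better features"
    return "reasonable fit: fine-tune further or stop here"
# hypothetical error values, just to show the three outcomes
print(diagnose_fit(train_error=30.0, test_error=35.0))  # both high -> underfitting
print(diagnose_fit(train_error=2.0, test_error=12.0))   # large gap -> overfitting
print(diagnose_fit(train_error=5.0, test_error=6.0))    # both low and close -> reasonable
This is obviously a caricature, but it captures the decision logic that the bias-variance framing gives us.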
So far, we have only covered the classical bias-variance tradeoff, which you might encounter in entry-level data scientist interviews. What makes this a topic for experienced data scientists' interviews? That is the double descent phenomenon, which we will cover next.
3. Double Descent Phenomenon
As we explored the bias-variance tradeoff earlier, we observed how model complexity can impact generalization of a trained model, as follows:
- First Descent: As model complexity increases from small to moderate sizes (the area shaded in blue in the previous section’s graph), test error decreases, reaching the sweet spot with the lowest amount of test error (the area between the blue and red shades in the previous section’s graph).
- Overfitting Ascent: As complexity increases past that sweet spot, the test error starts to increase again (the area shaded in red in the previous graph), until we reach an overfitting peak.
But what happens if we just ignore this increase in test error and continue increasing the complexity even further? That gets us to the "second descent". It has been empirically observed that for sufficiently complex models, the test error can actually start to decrease again. Because of this second lowering of the test error, the phenomenon is called double descent, and it is mainly observed in highly complex deep learning models. But what causes this phenomenon? Why did we not observe it in traditional ML modeling? And why is it important? We will explore these questions in the remainder of the post, followed by an implementation to visualize it.
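Before we reproduce this behavior with real training runs in section 4, it may help to see the shape we are describing, in the same spirit as the synthetic bias-variance plot from section 2. The formula below is invented purely to draw the characteristic curve and carries no statistical meaning; the interpolation threshold is placed at a complexity of 1 by construction.
# a stylized sketch of the double descent curve; the three terms below are hand-tuned
# purely to produce the characteristic shape and are not derived from any model or data
import numpy as np
import matplotlib.pyplot as plt
# model complexity relative to the interpolation threshold (placed at 1.0)
complexity = np.linspace(0.1, 3, 300)
underfit_term = 0.5 * np.exp(-3 * (complexity - 0.1))            # shrinks as the model grows
threshold_spike = 0.9 * np.exp(-((complexity - 1) / 0.12) ** 2)  # bump centered at the interpolation threshold
decay_term = 0.25 / complexity                                   # large for tiny models, slowly shrinking for huge ones
test_error = underfit_term + threshold_spike + decay_term
# plot
plt.figure(figsize=(10, 5))
plt.plot(complexity, test_error, color='black', label='Test Error (stylized)')
plt.axvline(x=1.0, color='gray', linestyle='--', label='Interpolation Threshold')
plt.xlabel('Model Complexity (relative to interpolation threshold)')
plt.ylabel('Test Error')
plt.title('Stylized Double Descent Curve')
plt.legend()
plt.grid(True)
plt.show()
The empirical version of this curve, produced by actually training networks of increasing size, is what we will build in section 4.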
Tip: Given the impact and importance of deep learning in creating large language models (and similarly large and complex models), we expect our experienced scientists to be familiar with discussions around deep learning. One such topic is the double descent phenomenon, which is the main focus of this section.
3.1. Why Did We Not Foresee the Second Descent?
In classical ML modeling, the assumption had been that once a model starts overfitting, it will continue doing so as we increase its size, and since overfitting results in increased test error, that path was rarely pursued. In classical small-scale scenarios, such as linear regression with a limited number of parameters, or even in smaller neural networks, it is uncommon to observe a second dip. As a result, this assumption went unchallenged until much more complex models started to be used in deep learning. In large-scale deep learning models, we deal with extremely high-dimensional parameter spaces, and the relationship between model complexity, optimization and generalization behaves differently, as observed empirically in the second descent. But why would we care about this phenomenon? Let's explore that next.
3.2. Why Is Double Descent Important?
The importance of the double descent phenomenon can be broken down into three main categories:
- Emergence of New Paradigms. As we discussed in the previous question, in classical ML we did not even expect to be able to improve our models by increasing complexity, so this was a path less explored by practitioners. Once the double descent phenomenon was observed, those artificial barriers were broken, and more and more researchers started exploring highly-parameterized systems, which resulted in more attention and findings in the space.
- Neural Network Architecture. Modern neural network models often come with a very large number of parameters (e.g. Meta's Llama 3.1 has up to 405B parameters) and achieve high performance levels thanks to such complexity. If we did not know about double descent, we might have stopped at much smaller neural networks, or stopped the training process much earlier, to avoid overfitting. Now we know that increasing model size and training can result in better performance.
- Deep Learning Implications. Unlike what we believed in classical ML modeling, we are now learning that in deep learning, bigger seems to be better. Larger models, such as the various GPTs, can generalize very well, even beyond the regime we used to consider overfitted in the classical sense. This is somewhat similar to Richard Sutton's "The Bitter Lesson", where he argues that the most significant advancements in Artificial Intelligence (driven primarily by deep learning) have come from general methods that leverage large amounts of computation, rather than from hand-crafted features.
Next, let's try to develop some intuition for double descent.
3.3. Why Does Double Descent Occur?
This is still an open research question, but here are the leading explanations behind double descent:
- Over-parameterization: As the name suggests, this refers to models that have a very large number of parameters. Such over-parameterized models can have more parameters than training data points, which means multiple solutions can fit the training data exactly. Not all of these solutions generalize equally well, and at this point optimization algorithms, such as gradient descent, guide us towards the solutions that are more generalizable. Therefore, although the model is technically "overfitting" the training data, it can still generalize well (see the short sketch after this list).
- Data Quality and Size: These highly-parameterized models are trained on large amounts of data, increasing the presence of noise or mislabeled data points, which can contribute to the double descent phenomenon. In classical models, the reason overfitted models show high test errors is that the model learns the noise in the data rather than the underlying patterns that are useful for generalization. In an over-parameterized setting, on the other hand, the model learns to distinguish between actual patterns and noise, which results in better generalization.
- Feature Learning: This theory argues that models learn features at different scales and paces. Initially, the model may overfit to the fast-learning features, which is the classical overfitting portion of the learning. But there are also features that take longer for the model to learn with larger amounts of data; these slower-learning features are learned during the second descent, resulting in the lowering of the test error.
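Here is the short sketch promised in the over-parameterization bullet. It is a toy example of my own making (not taken from the double descent literature): with more random features than training points, infinitely many weight vectors fit the training data exactly, and the minimum-norm interpolator, which is what gradient descent started from zero converges to in this linear setting, tends to behave far better on unseen data than an arbitrary interpolator.
# a toy illustration: with more parameters than training points, many weight vectors fit the
# training data exactly; the minimum-norm one tends to generalize far better than an arbitrary one
import numpy as np
rng = np.random.default_rng(1234)
n_train, n_test, n_features = 20, 200, 100  # over-parameterized: 100 weights, 20 training points
def features(x):
    # Gaussian bump features (an arbitrary choice made for this illustration)
    centers = np.linspace(-2, 2, n_features)
    return np.exp(-(x[:, None] - centers[None, :]) ** 2)
x_train = rng.uniform(-2, 2, n_train)
x_test = rng.uniform(-2, 2, n_test)
y_train = np.sin(2 * x_train) + rng.normal(0, 0.1, n_train)
y_test = np.sin(2 * x_test)
Phi_train, Phi_test = features(x_train), features(x_test)
# minimum-norm interpolating solution (what gradient descent initialized at zero finds here)
w_min_norm = np.linalg.pinv(Phi_train) @ y_train
# another interpolating solution: add a random component from the null space of Phi_train
null_projector = np.eye(n_features) - np.linalg.pinv(Phi_train) @ Phi_train
w_other = w_min_norm + null_projector @ rng.normal(0, 5, n_features)
for name, w in [("min-norm interpolator", w_min_norm), ("arbitrary interpolator", w_other)]:
    train_mse = np.mean((Phi_train @ w - y_train) ** 2)
    test_mse = np.mean((Phi_test @ w - y_test) ** 2)
    print(f"{name:>22}: train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}")
Both solutions fit the training data almost perfectly, but only the minimum-norm one should have a reasonable test error; this implicit preference of the optimizer is what the first bullet above refers to.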
Now that we have some intuition for the double descent phenomenon, let's try to reproduce it in an example.
4. Double Descent Phenomenon – Implementation
So far we discussed how, in classical machine learning, the test error initially decreases as the complexity of the model increases, reaching a sweet spot, and then starts increasing, which implies entering the overfitting zone. Then we talked about how, in deep learning scenarios with highly-parameterized models, a second lowering of the test error has been observed, which is the second descent portion of the double descent phenomenon. In this part, we will go through a training process and create increasingly complex neural networks to observe how the test error changes as the complexity increases. I have added comments in the code to make it easier to follow.
# import libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.keras import layers, models
# set random seeds for numpy and tensorflow
np.random.seed(1234)
tf.random.set_seed(1234)
# generate synthetic data
n_samples = 500
X = np.random.uniform(-2, 2, (n_samples, 1))
y = np.sin(5 * X) / (5 * X) + np.random.normal(0, 0.05, (n_samples, 1)) # Non-linear function with noise
# split data into training and test sets
test_size = 0.3
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=test_size, random_state=1234
)
n_train = X_train.shape[0]
# lists to store errors and model complexities
train_errors = []
test_errors = []
model_complexities = []
# range of model complexities (number of hidden units)
hidden_units_list = list(range(1, 20)) + list(range(20, 100, 5)) + list(range(100, 500, 20))
for hidden_units in hidden_units_list:
    # build neural network model with a single hidden layer
    model = models.Sequential()
    model.add(layers.Dense(hidden_units, activation='relu', input_shape=(1,)))
    model.add(layers.Dense(1))
    # compile with adam optimizer and mse for loss
    model.compile(optimizer='adam', loss='mean_squared_error')
    # train
    history = model.fit(
        X_train, y_train,
        epochs=200,
        batch_size=n_train,  # batch gradient descent
        verbose=0,
        validation_data=(X_test, y_test)
    )
    # evaluate train & test error from the final epoch
    train_error = history.history['loss'][-1]
    test_error = history.history['val_loss'][-1]
    # record errors and model complexity
    train_errors.append(train_error)
    test_errors.append(test_error)
    # total number of parameters (weights and biases of both layers)
    total_params = hidden_units * (1 + 1) + hidden_units + 1
    model_complexities.append(total_params)
# convert lists to arrays
train_errors = np.array(train_errors)
test_errors = np.array(test_errors)
model_complexities = np.array(model_complexities)
# plotting errors
plt.figure(figsize=(12, 6))
# plot
plt.plot(model_complexities, train_errors, label='Training Error', marker='o')
plt.plot(model_complexities, test_errors, label='Test Error', marker='o')
# highlight interpolation threshold
plt.axvline(x=n_train, color='k', linestyle='--', label='Interpolation Threshold (n = {})'.format(n_train))
# log scale for errors
plt.yscale('log')
plt.xlabel('Model Complexity (Number of Parameters)')
plt.ylabel('Mean Squared Error (Log Scale)')
plt.title('Double Descent Phenomenon in Neural Networks')
plt.legend()
plt.grid(True)
# adjust x-axis limits to focus on key regions
plt.xlim([0, max(model_complexities) + 10])
# show plot
plt.show()
Results:

As we can see above, the test error keeps decreasing as model complexity grows up to around 300 parameters. Then an increase is observed, which indicates entering the overfitting region. While in classical machine learning we might have stopped at this point, now that we expect the second descent, we continue increasing the complexity of the underlying neural network and observe how the test error starts to decrease again beyond the interpolation threshold of 350 parameters. This trend continues, and would at least theoretically continue, as we keep feeding more data and computation to the architecture.
Thanks For Reading!
If you found this post helpful, please follow me on Medium and subscribe to receive my latest posts!
(All images, unless otherwise noted, are by the author.)