After years of playing with the Python scientific stack (Numpy, Matplotlib, SciPy, Pandas, and Seaborn), it became obvious to me that the next step was scikit-learn, or "sklearn".
This second module focuses on the concept of model scores, including the test score and train score. Those scores are then used to define overfitting and underfitting, as well as the concepts of bias and variance.
We'll also see how to inspect a model's performance with respect to its complexity and the number of input samples.
All images by author.
If you didn’t catch it, I strongly recommend my first post of this series – it’ll be way easier to follow along:
Score: train score and test score
The first concepts I want to talk about are the train score and test score. A score is a way to numerically express the performance of a model. To compute such a performance, we use a score function that aggregates the "distance" or "error" between what the model predicted and the ground truth. For example:
model = LinearRegression()
model.fit(X_train, y_train)
y_predicted = model.predict(X_test)
test_score = some_score_function(y_predicted, y_test)
In sklearn, all models (also called estimators) provide an even quicker way to compute a score, directly from the model:
# the model will compute the predicted y-values from X_test,
# and compare it to y_test with a score function
test_score = model.score(X_test, y_test)
train_score = model.score(X_train, y_train)
The actual score function depends on the model and the kind of problem it is designed to solve. For example, a linear regressor (regression) uses the R² coefficient, while a support-vector classifier (classification) uses the accuracy, which is basically the fraction of correct class predictions.
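As a quick illustration (a minimal sketch on a toy classification dataset, not part of the original example), the same .score call returns the accuracy when the model is a classifier:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
# toy classification data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = SVC()
clf.fit(X_train, y_train)
# for a classifier, .score returns the mean accuracy on the given data
print("Accuracy on the test set:", clf.score(X_test, y_test))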
If the default score of the model doesn't fit your needs, you can also use score functions from sklearn's metrics. Numerous score functions exist, each with their pros and cons, and they are available in the sklearn.metrics module:
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import numpy as np
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
# See my previous post for why we split the input data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# create and train a linear regression
lr = LinearRegression()
lr.fit(X_train, y_train)
# Compute predicted values
y_pred = lr.predict(X_test)
print("Model Score (R-squared):",
lr.score(X_test, y_test)) # use the .score method of the model
print("R-squared Score:",
r2_score(y_test, y_pred)) # use the the same function, but from sklearn.metrics
# use other score functions
print("Mean Absolute Error:",
mean_absolute_error(y_test, y_pred))
print("Mean Squared Error:",
mean_squared_error(y_test, y_pred))
Model Score (R-squared): 0.8072059636181392
R-squared Score: 0.8072059636181392
Mean Absolute Error: 0.5913425779189776
Mean Squared Error: 0.6536995137170021
So remember for what follows: from a dataset, we create a train set and a test set. After training the model, we can compute a score on both the train set and the test set to estimate the performance of the fitted model.
Given a fixed input dataset, these scores depend on the choice of model, the parameters of that model (like the degree of a polynomial fit), the way we split that dataset (which sample goes into which set), and the choice of the score function.
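To make the dependence on the split concrete, here is a small sketch (my own illustration, reusing the toy linear data above) that re-splits the same dataset with different random_state values and prints the resulting test scores:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
# same data, same model: only the split changes
for seed in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    lr = LinearRegression().fit(X_train, y_train)
    print(f"random_state={seed}: test score = {lr.score(X_test, y_test):.3f}")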
It was important to introduce the test and train scores, because those concepts are useful to inspect the "fitting state" (over- or under-fitting) of a model.
Relation between over/under-fitting and train/test-score
Remember from the previous post the rationale behind splitting and cross-validation:
- splitting: allows us to estimate the generalization performance
- cross-validation: estimates the robustness of that generalization and evens out the luck/no-luck of a single split
Also, remember that in cross-validation different splits are used, but the rest of the process stays the same: once split, the model is trained on the train set, and we can then compute the scores of that model (train score and test score).
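As a side note, if you want both scores for every split of a cross-validation in one call, sklearn's cross_validate with return_train_score=True does exactly that (a minimal sketch on the same toy data):
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
# 5 different splits -> 5 train scores and 5 test scores
cv_results = cross_validate(LinearRegression(), X, y, cv=5,
                            return_train_score=True)
print("train scores:", cv_results["train_score"])
print("test scores: ", cv_results["test_score"])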
That being said, let's define what over-fitting and under-fitting mean. As their names suggest, they correspond to opposite states of your model relative to the dataset at hand.
We say a trained model is over-fitted if it learned the dataset it was trained on too closely, and hence lacks generalization abilities. This shows up when the train score is very good (the model makes very little error on the data it was trained on) but the test score is bad, meaning it does not generalize well. This can happen if the model is too complex/flexible (a very high-degree polynomial for example), or if the train set is too small or very noisy. In this case, small changes in the train set imply big changes in the test predictions.
On the other hand, a trained model is under-fitted if it only captures a very general global trend and not enough of the details. This shows up when the train score itself is not good enough, meaning that the model didn't have the flexibility to learn the complexity of the data. This happens when the model is not flexible enough (we also say the model is too "constrained"), which can be a consequence of the choice of model or of its parameters (like using a degree-1 polynomial to fit a degree-10 problem).
Our job is to find the sweet spot, where there's the best trade-off between over- and under-fitting, by tuning the model in the most general sense: the choice of model, the choice of preprocessing, and all associated parameters.
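In practice, this tuning is often automated. Here is a minimal sketch (my own illustration, not the original example) that uses GridSearchCV to pick the polynomial degree with the best cross-validated test score:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
np.random.seed(42)
X = 6 * np.random.rand(100, 1) - 3
y = (0.5 * X**2 + X + 2 + np.random.randn(100, 1)).ravel()
# search over the degree of the polynomial features
pipe = make_pipeline(PolynomialFeatures(), LinearRegression())
grid = GridSearchCV(pipe,
                    param_grid={"polynomialfeatures__degree": list(range(0, 11))},
                    cv=5)
grid.fit(X, y)
print("best degree:", grid.best_params_)
print("best cross-validated score:", grid.best_score_)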
So to summarize, here is the relation between train error, test error, and model complexity (with a fixed input dataset):
- Underfit: at very low complexity, the model underfits the train set (it does not have the flexibility to capture the actual complexity of the data), which leads to large errors on both the train set and the test set (both sets should have more or less the same complexity/noise, since they are drawn from the same population/dataset)
- Sweet spot: as the model complexity increases away from heavy underfitting, both the train error and the test error decrease.
- Overfit: if the complexity increases too much, the train error keeps decreasing (the model has more and more flexibility to learn the train set), but the test error starts increasing again (having "memorized" the train set, the model performs worse on the new data of the test set)
Let's see a quick example: we will modify the model complexity by changing the degree of a polynomial fit. The true model is 0.5 * X**2 + X + 2, and we try various degrees: 0, 1, 2, 10 and 25. Since the true model only has 0th-, 1st- and 2nd-order terms, we know that degree 0 will probably underfit, and degree 25 will probably overfit.
%matplotlib qt
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
def truth(X):
    return 0.5 * X**2 + X + 2
X = 6 * np.random.rand(100, 1) - 3
y = truth(X) + np.random.randn(100, 1)*2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)
def fitted_model(degree):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    return model
fig, axes = plt.subplots(1, 5, sharex=True, sharey=True)
xs = np.linspace(X.min(), X.max())
degrees = [0, 1, 2, 10, 25]
for deg, ax in zip(degrees, axes):
    model = fitted_model(deg)
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    y_train_score = model.score(X_train, y_train)
    y_test_score = model.score(X_test, y_test)
    ax.plot(xs, truth(xs), '--', alpha=0.2)
    ax.scatter(X, y, alpha=0.5, color="gray")
    ax.scatter(X_train, y_train_pred, label="train set", alpha=0.5)
    ax.scatter(X_test, y_test_pred, label="test set", alpha=0.5)
    ax.set_xlabel(f"train={y_train_score:.2f}/test={y_test_score:.2f}")
    ax.legend()
    ax.set_title(f"degree={deg}")
fig.tight_layout()
The result looks like this:
At the lowest complexity, degree 0, the model underfits. Both the train and test scores are pretty low (for a linear regression, the optimal score is 1).
Increasing the degree to 1 brings a significant improvement to both scores, but we can still see visually that the model is too simple to fit the trend of the data.
With a degree of 2, the model seems to approach an optimum. The scores are again much better, and there is good visual agreement with the data. Compared to degree 10, for which the scores are a bit better, we can see a misfit bump (actually a first sign of overfitting). It is likely that for another split (as done in cross-validation), the degree-10 fit would look quite different.
With a degree of 25, we see a clear decrease of the test score while the train score keeps improving: this is a clear sign of over-fitting. At this point, our model is memorizing the train set and cannot generalize to new data.
This dependence of the train/test scores on model complexity shows how over-fitting and under-fitting occur. We will inspect this further with the validation curve below.
Note another important term, 'inductive bias': this is the bias introduced by the choice/kind of model. It is built into the model itself, as opposed to the bias introduced by hyperparameters (like the degree of a polynomial regression) or by the number of samples. Remember that a model's complexity depends both on the kind of model and on its parameters.
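To make the idea of inductive bias more concrete, here is a small sketch (my own illustration) comparing two model families with very different built-in assumptions on the same data:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
np.random.seed(42)
X = 6 * np.random.rand(200, 1) - 3
y = (0.5 * X**2 + X + 2 + np.random.randn(200, 1)).ravel()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# a plain linear model assumes a straight-line relationship (strong inductive bias),
# a decision tree assumes a piecewise-constant relationship (a different inductive bias)
for model in [LinearRegression(), DecisionTreeRegressor(max_depth=3, random_state=0)]:
    model.fit(X_train, y_train)
    print(type(model).__name__,
          "train:", round(model.score(X_train, y_train), 2),
          "test:", round(model.score(X_test, y_test), 2))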
Model performance with number of samples
While in most cases we have to work with a fixed-size input dataset, another way to look at model performance, and at the overall ML exercise, is to inspect how the scores change with the number of samples.
Again, we have more or less 3 zones:
- with few samples, both the train and test errors are large (there is not enough data for the model to understand what is happening, whatever its flexibility)
- as the number of samples increases, the train error increases (since the model complexity is fixed, it can no longer fit every training point closely), but the test error decreases (more samples allow the model to learn better)
- if the number of samples increases a lot, the train and test errors almost converge: the model has reached its potential. The train error stops increasing because what the model learns is no longer changed by any new data point, and the test error is limited by the model complexity and cannot decrease anymore.
In this last regime of a very large number of samples, we say the model approaches the Bayes error rate: the error of the best possible model trained on unlimited data, when predictions are only limited by the noise in the data.
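Before using the dedicated sklearn tool shown below, here is a small hand-rolled sketch of this idea (my own illustration): the same model is trained on growing subsets of the train set, while the test set stays fixed:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
np.random.seed(42)
X = 6 * np.random.rand(1000, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(1000, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# train on a growing portion of the train set, keep the test set fixed
for n in [10, 30, 100, 300, len(X_train)]:
    model = make_pipeline(PolynomialFeatures(2), LinearRegression())
    model.fit(X_train[:n], y_train[:n])
    print(f"n={n:4d}  train={model.score(X_train[:n], y_train[:n]):.3f}"
          f"  test={model.score(X_test, y_test):.3f}")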
Visualizing scores as function of complexity and number of samples
Finally, it can be a good practice to visualize those concepts using what are called "Validation curve" and "Learning curve":
- validation curve: plots the test score and train score as a function of model complexity (like the degree of a polynomial fit):
score=f(complexity)
- learning curve: plots the test score and train score as a function of the input size (for a given input data matrix, we can use just a portion of the total available data):
score=f(#samples)
Both curves can be generated pretty simply with sklearn:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, validation_curve
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import ValidationCurveDisplay
def truth(X):
    return 0.5 * X**2 + X + 2
X = 6 * np.random.rand(100, 1) - 3
y = truth(X) + np.random.randn(100, 1) * 2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)
degrees = np.arange(0, 12, 1)
# Plotting the validation curve
ValidationCurveDisplay.from_estimator(
make_pipeline(PolynomialFeatures(), LinearRegression()),
X, y,
param_name='polynomialfeatures__degree',
param_range=degrees,
)
plt.xlabel('Degree of Polynomial Features')
plt.ylabel('Score')
plt.title('Validation Curve for polynomial fit')
As you can see, both the train score and the test score increase quickly in the [0–2] degree range. For higher degrees, the test score starts decreasing, indicating a loss of generalization of the model. Obviously, the train score keeps increasing when the model complexity increases.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import ValidationCurveDisplay, LearningCurveDisplay
def truth(X):
    return 0.5 * X**2 + X + 2
N = 1000
X = 6 * np.random.rand(N, 1) - 3
y = truth(X) + np.random.randn(N, 1) * 2
LearningCurveDisplay.from_estimator(
make_pipeline(PolynomialFeatures(2), LinearRegression()),
X, y,
train_sizes=np.logspace(-2, 0, 20),
)
plt.xlabel('# of samples')
plt.ylabel('Score')
plt.title('Learning Curve for polynomial fit')
As you can see, with a fixed complexity of degree=2, the train and test scores converge toward the same value for a very large number of samples. This kind of curve can help you assess whether the amount of input data you're working with is enough for your model to approach its Bayes error rate.
The bias-variance tradeoff
The concepts of bias, variance, and the bias-variance tradeoff are strongly related to the concepts of over-fitting and under-fitting. We already covered overfitting and underfitting, so I'll make this quick:
- variance refers to how much a model's response changes with the train set. In other words, a model can exhibit strong variance when the train set is very small and/or the model complexity is very high. Overfitting is often associated with high variance because the model is sensitive to the specifics of the training data.
- bias refers to how the fitted model is "biased" compared to the perfect model, staying pretty much the same whatever input it is fed. This happens especially if the model complexity/flexibility is very low compared to what we want it to learn. In other words, the model is biased towards its own assumptions and might not adapt well to the complexities of the data. Underfitting is associated with high bias because the model is not flexible enough to adapt to the complexities of the data.
The bias-variance tradeoff is then a key concept in ML: increasing model complexity tends to decrease bias but increase variance, and vice versa. The goal is to find the right balance that minimizes both, leading to a model that generalizes well to new, unseen data. So our job as data scientists is to tune the model and find this sweet spot.
The example below shows how a simple polynomial fit of a single variable, y = p(x), can have either high bias or high variance, depending on the degree allowed for the model (the code is available further down, below the figure).
The idea is the following: we create a toy dataset y from a known polynomial function p(x) = 0.5X³ + X + 2 + noise. So the true coefficients we would like the polynomial fit to learn are 2 for the constant term, 1 for the X coefficient, and 0.5 for the X³ coefficient. We test two models: a low-degree polynomial (degree_low = 2 in the code below) and a high-degree polynomial (degree_high = 15).
Once the data is generated, we split it 50–50 into a first train/test set. Both the low- and high-degree models are fitted on this split, and their learned coefficients, predictions and scores are computed. Then we simply switch the train set and test set, so for the second split we train on what was the test set and test on what was the train set. This is not common practice but rather a pedagogical trick. Again, both models are fitted, and again their learned coefficients, predictions and scores are computed.
The results are the following:
As you can see in the low-degree plots (middle column), both splits lead to pretty much the same coefficients. The predictions are "biased" away from the true data: there is a kind of constant "offset" in the learned coefficients.
On the other hand, in the high-degree plots (right column), the two splits lead to very different coefficients: there is a high variance in the model's response.
Here is the code used:
%matplotlib qt
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.pipeline import make_pipeline  # used below to chain PolynomialFeatures and LinearRegression
degree_low = 2
degree_high = 15
true_coefs = np.zeros(degree_high)
true_coefs[3] = 0.5  # coefficient of the X**3 term
true_coefs[1] = 1
true_coefs[0] = 2
N = 100
X = 6 * np.random.rand(N, 1) - 3
y = 0.5 * X**3 + X + 2 + np.random.randn(N, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
def compute_results(X_train, X_test, y_train):
    model_underfit = make_pipeline(PolynomialFeatures(degree=degree_low), LinearRegression())
    model_underfit.fit(X_train, y_train)
    y_pred_train_underfit = model_underfit.predict(X_train)
    y_pred_test_underfit = model_underfit.predict(X_test)
    model_overfit = make_pipeline(PolynomialFeatures(degree=degree_high), LinearRegression())
    model_overfit.fit(X_train, y_train)
    y_pred_train_overfit = model_overfit.predict(X_train)
    y_pred_test_overfit = model_overfit.predict(X_test)
    return model_underfit, model_overfit, y_pred_train_underfit, y_pred_test_underfit, y_pred_train_overfit, y_pred_test_overfit
# Plot the results
fig, axes = plt.subplots(3, 3, figsize=(12, 8))
axes[0,0].set_title("RAW input data")
axes[0,0].set_ylabel('First split')
axes[2,0].set_ylabel('Second split')
axes[1,0].set_ylabel('Poly. coefs.')
axes[0,0].set_title("Input data")
axes[0,1].set_title("Low-degree, biased model (degree=1)")
axes[0,2].set_title("High-degree, high variance (degree=15)")
axes[1,0].bar(np.arange(len(true_coefs)), true_coefs, alpha=0.3, label="True coefs")
model_underfit, model_overfit, y_pred_train_underfit, y_pred_test_underfit, y_pred_train_overfit, y_pred_test_overfit = compute_results(X_train, X_test, y_train)
axes[0,0].scatter(X_train, y_train, label='Training Data', color='red', alpha=0.7)
axes[0,0].scatter(X_test, y_test, label='Test Data', color='blue', alpha=0.7)
axes[0,1].scatter(X, y, color="gray", alpha=0.2)
axes[0,1].scatter(X_train, y_pred_train_underfit, color='red', label='Train', alpha=0.7)
axes[0,1].scatter(X_test, y_pred_test_underfit, color='blue', label='Test', alpha=0.7)
axes[0,2].scatter(X, y, color="gray", alpha=0.2)
axes[0,2].scatter(X_train, y_pred_train_overfit, color='red', label='Train', alpha=0.7)
axes[0,2].scatter(X_test, y_pred_test_overfit, color='blue', label='Test', alpha=0.7)
axes[1,1].bar(np.arange(degree_low+1), model_underfit.named_steps['linearregression'].coef_.flatten(), label="1st split", alpha=0.3)
axes[1,2].bar(np.arange(degree_high+1), model_overfit.named_steps['linearregression'].coef_.flatten(), label="1st split", alpha=0.3)
axes[1,0].set_xticks(np.arange(len(true_coefs))); axes[1,0].set_xlim(-1, 15); axes[1,0].set_ylim(-5, 5)
axes[1,1].set_xticks(np.arange(len(true_coefs))); axes[1,1].set_xlim(-1, 15); axes[1,1].set_ylim(-5, 5)
axes[1,2].set_xticks(np.arange(len(true_coefs))); axes[1,2].set_xlim(-1, 15); axes[1,2].set_ylim(-5, 5)
axes[0,1].set_xlabel(f'Train score={model_underfit.score(X_train, y_train):.2f} / Test score={model_underfit.score(X_test, y_test):.2f}')
axes[0,2].set_xlabel(f'Train score={model_overfit.score(X_train, y_train):.2f} / Test score={model_overfit.score(X_test, y_test):.2f}')
# Switch train and test sets...
X_train, X_test, y_train, y_test = X_test, X_train, y_test, y_train
# ... and start over
model_underfit, model_overfit, y_pred_train_underfit, y_pred_test_underfit, y_pred_train_overfit, y_pred_test_overfit = compute_results(X_train, X_test, y_train)
axes[2,0].scatter(X_train, y_train, label='Training Data', color='red', alpha=0.7)
axes[2,0].scatter(X_test, y_test, label='Test Data', color='blue', alpha=0.7)
axes[2,1].scatter(X, y, color="gray", alpha=0.2)
axes[2,1].scatter(X_train, y_pred_train_underfit, color='red', label='Train', alpha=0.7)
axes[2,1].scatter(X_test, y_pred_test_underfit, color='blue', label='Test', alpha=0.7)
axes[2,2].scatter(X, y, color="gray", alpha=0.2)
axes[2,2].scatter(X_train, y_pred_train_overfit, color='red', label='Train', alpha=0.7)
axes[2,2].scatter(X_test, y_pred_test_overfit, color='blue', label='Test', alpha=0.7)
axes[1,1].bar(np.arange(degree_low+1), model_underfit.named_steps['linearregression'].coef_.flatten(), label="2nd split", alpha=0.3)
axes[1,2].bar(np.arange(degree_high+1), model_overfit.named_steps['linearregression'].coef_.flatten(), label="2nd split", alpha=0.3)
axes[1,0].set_xticks(np.arange(len(true_coefs))); axes[1,0].set_xlim(-1, 15); axes[1,0].set_ylim(-5, 5)
axes[1,1].set_xticks(np.arange(len(true_coefs))); axes[1,1].set_xlim(-1, 15); axes[1,1].set_ylim(-5, 5)
axes[1,2].set_xticks(np.arange(len(true_coefs))); axes[1,2].set_xlim(-1, 15); axes[1,2].set_ylim(-5, 5)
axes[2,1].set_xlabel(f'Train score={model_underfit.score(X_train, y_train):.2f} / Test score={model_underfit.score(X_test, y_test):.2f}')
axes[2,2].set_xlabel(f'Train score={model_overfit.score(X_train, y_train):.2f} / Test score={model_overfit.score(X_test, y_test):.2f}')
fig.suptitle('Comparison for 2 splits of low/high degree polynomial models')
for ax in axes.flatten(): ax.legend()
plt.tight_layout()
Wrapup
To summarize, remember these important concepts:
- Score/Train Score/Test Score: Scores quantify the performance of a model; Train Score reflects its accuracy on the training data, Test Score on unseen data. Balancing high train score with a comparable test score is crucial for a model that generalizes well.
- Underfitting/Overfitting: Underfitting occurs when a model is too simplistic, missing data complexities; Overfitting arises when a model overly tailors itself to training data, hindering generalization. Finding balance in model complexity is essential to avoid both underfitting’s simplicity and overfitting’s memorization.
- Model Performance as a Function of Complexity and Number of Samples: Examining how a model’s scores change with complexity (e.g., polynomial degree) or dataset size provides insights into its behavior. Assessing performance across varied complexities and sample sizes aids in identifying the optimal model characteristics.
- Bias and Variance in Model Evaluation: Bias refers to a model's tendency to deviate from the true data pattern; low flexibility yields high bias. Variance captures a model's sensitivity to changes in the dataset; high complexity leads to high variance.
Finally, achieving the bias-variance balance is often the key point of a good ML exercise: it is what tuning a model to its best possible performance comes down to.
You might like some of my other posts, make sure to check them out: