
Machine Learning Basics I Look for in Data Scientist Interviews

Let's build our breadth of science together.

Photo by Matt Wang on Unsplash

I have been increasingly reviewing resumes and conducting phone screens and interviews to hire our next scientist, which has made me think more than ever about the detailed expectations I have of our data and applied scientists. I recently wrote a post about the mathematics that I look for in our data scientist interviews, which was very well received by readers (linked below), so I decided to continue that discussion in this post by going much deeper into the machine learning knowledge that we expect our data scientists to have.

I recognize there are more than enough Data Science interview tutorials out there but I still decided to write this post because I wasn’t able to refer candidates to one comprehensive source to prepare for their interviews or day-to-day ML tasks as a data scientist. If you are preparing for a data scientist interview or just plan to refresh your memory of the related topics, I believe you will find this post helpful.

At the very least, I strongly encourage you to browse through, take a look at the tables, which I am personally very proud of making, and if you find them helpful, mark the post for your future reference and reading.

Here is a list of topics that we will cover today:

  1. Basic Concepts: We will discuss Machine Learning types (supervised, unsupervised, etc.) and then define train, validation and test sets and discuss their roles.
  2. Model Selection: We will introduce model families and how to choose the right model for a given task.
  3. Generalization: How trained models perform in practice, measuring error, the bias-variance tradeoff and how to improve a model based on the error types observed.
  4. Evaluation: How to select the right evaluation metric based on the problem being solved and how to interpret the metrics.

Here is the post about the mathematics requirements for a data scientist that I mentioned earlier; after that, we will go into the machine learning topics.

Mathematics I Look for in Data Scientist Interviews

Let’s get started!


1. Learning The Concepts

Let’s start by describing what machine learning (ML) is. ML generally involves having some data that we would like to analyze in order to either make predictions or infer knowledge from it. For example, can we look at all the historical housing prices and come up with a good estimate for our own house or for a house that we plan to buy? Yes, and that is the prediction application. Or can we use the historical housing prices to determine which factor is the most important one for the price of a house (e.g. location or size)? Yes, and that is an example of inferring knowledge from the ML analysis. As you can imagine, there are many use cases where businesses try to better understand their existing data and use it to make better predictions, which hopefully results in improved business performance. That is why ML has become an essential tool for almost every business that deals with data these days. Next, let’s talk about why this process is called "learning".

In the context of machine learning, "learning" refers to the process where a model or an algorithm discovers patterns and relationships in the provided data, which is similar to how humans learn, and hence the choice of the word. The process through which ML models learn from the data is called "training" and therefore the data that the model learns from is sometimes called "training data". Now that we know what ML and training are, let’s talk about different types of learning/training in ML.


1.1. Supervised vs. Unsupervised

Machine learning approaches can be broadly grouped into two categories, supervised and unsupervised – let’s define each.

Supervised Learning is where the model learns from "labeled" training data, so let’s define what labeled data is. Data is considered labeled when each training example is paired with the correct output. For example, let’s assume we would like to train an ML model to determine whether an email is spam or not. In order to teach the model which emails should be considered spam, we provide it with labeled training data that includes a set of emails, each already "labeled" as spam or not. The model then learns from this training data (emails + labels) and hopefully will be able to make predictions about other emails that did not exist in the training data, based on what it learned during training. This is essentially what decides which email goes to the spam folder in our inboxes in Gmail and others! As you can imagine, creating labeled data requires time and effort, but there is a lot of unlabeled data readily available around us, such as published articles, wiki pages and books. It would be great if we could take advantage of these large amounts of "unlabeled" data. That is where unsupervised learning comes into the picture. Let’s talk about that.

Unsupervised Learning is where the model learns from unlabeled training data, as you might have guessed by now. Since there are no labels involved, the goal of unsupervised learning is usually to identify patterns, structures or groupings in the data. Going back to our email example, let’s say we would like to go one step further: once an email is determined to be non-spam, we would like another model to categorize it. Assume we do not have labels, so we just have the model look at all the existing unlabeled emails and group them together. The model would likely put them into different unnamed categories, and once we review those categories we might realize that political emails are grouped together, sports emails are grouped into another cluster, and so forth, because the content within each group looked similar to the ML model, without the model knowing what the groups actually represent (since there are no labels such as "politics" or "sports" in our example).

Understanding the distinction between supervised and unsupervised learning will suffice for the purposes of this post but for the sake of completeness I want to cover a third category called semi-supervised learning, which combines elements of supervised and unsupervised learning. In such a semi-supervised setting, the model is first trained on a smaller set of labeled data (i.e. supervised) and then the training continues on a larger amount of unlabeled data (i.e. unsupervised). The underlying idea is that the model can learn the basic patterns from the labeled data and then improve its performance with the unlabeled data.

As a summary, I created the table below that specifies data requirements of each approach, along with some example algorithms and use cases for each category.

Table 1 – Learning Types Overview

1.2. Train, Validation and Test Sets

As we recall from the previous section, one of the goals of ML models is to make predictions, which requires the model to be trained on data. During this process, the model learns the underlying pattern and structure of the data, and then we use it to make predictions on data the model has not seen before. In other words, we want the model to forecast well, which in the literature is referred to as a model that generalizes well.

In the ML literature, data sets are commonly divided into three distinct sets to ensure the ML models generalize well to new data, as follows:

  • Training Set is the portion of the data used to train the model. This is the portion of the data set that the ML model uses to learn the patterns and relationships of the data.
  • Validation Set is used to tune model hyperparameters and evaluate the model’s performance during the training process. The idea here is that the model has already seen the training set, so we cannot use that to judge how the model is performing during training; we need a data set that the model has not seen, and that is the validation set. Also, ML models usually come with various knobs and levers that influence their performance. These are called hyperparameters, and the process through which we test various hyperparameter combinations is called hyperparameter optimization. We can use the validation set to find the best combination of hyperparameters for a given problem. We will not cover hyperparameter optimization in this post, but I have other posts on this topic and will link one of them below for those who are interested in learning more.

Hyperparameter Optimization – Intro and Implementation of Grid Search, Random Search and Bayesian…

  • Test Set is the third and last portion of the data set that the model has not seen during the training time. The test set is used to evaluate the final performance of the model after training and hyperparameter optimization. The idea here is that the test set will provide a good proxy for how the model performs on data that it has never seen before.

Important Note: I have seen the application of validation and test sets being confused with each other (and admittedly definitions can be confusing). The distinction is that validation set is used during iterative training and hyperparameter optimization of the model, while test set is used for the final evaluation of the model.

Breaking down a data set into train, validation and test set can be done as follows:

  1. Import libraries used for splitting the data
  2. The train_test_split function splits the data into two parts, so we first split off the test set and keep the rest as a temporary train set. This temporary train set is a combination of the validation and train sets. Note that we are using 20% of the data for the test set – this is something you can change, but 10%-20% is what I have seen in practice.
  3. Finally, we split the remaining data into train and validation sets. We usually aim for 60%-70% of the original data for training and 10%-20% for validation. These numbers can change based on personal preference and project requirements. Note that in the code we are using 25% of the remaining data (which was 80% of the original data) for the validation set – 25% of 80% equals 20% of the original data, and the remainder is the training data. In other words, our train, validation and test sets for this example are 60%, 20% and 20% of the original data, respectively.

I have added comments in the code below to make it easier to follow the above steps.

# 1 - import libraries
from sklearn.model_selection import train_test_split

# X and y are the features and labels (assumed to be defined already)
# 2 - split original data set into temp_train and test
temp_train_X, test_X, temp_train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=1234)  # 20% for test set

# 3 - split temp_train into train and val
train_X, val_X, train_y, val_y = train_test_split(temp_train_X, temp_train_y, test_size=0.25, random_state=1234)  # 25% of the remaining 80% for validation set

Let’s summarize this section into a table before moving on to the next topic.

Table 2 – Data Sets Definition, Purpose and Typical Size

This covers our overview of the basic concepts. Next, we move on to the model selection, based on project requirements and available data.


2. Model Selection

One of the areas I see new data scientists struggle with is using the model with the highest likelihood of being effective for the problem they are trying to solve, so I wanted to make sure we have a section dedicated to model selection in this post. I break the model selection process down into two stages – both are essential, and they need to happen in sequential order. The first stage is to select the family of ML models to use (let’s call this "model family selection") and the second stage is to find the right model within that family for the task (let’s call this "final model selection"). Let’s talk about each stage in detail.

2.1. Model Family Selection

In order to choose the right family of models to use for a given task, we need to understand the problem type and the data that is available to us. Each of these two factors can result in a different selection. Since we broke down the learning process into supervised and unsupervised, we can start with the same breakdown for model selection as well. In other words, we will consider two scenarios, one where labeled data is available, and therefore supervised learning is a possibility and the other scenario where labeled data is not available and unsupervised approaches should be considered. We will explore each scenario further next.

  • Scenario 1 (supervised) – Let’s assume we have access to labeled data. As the next step, we would want to understand what problem we are solving. If the target variable is continuous, then we will use regression models. Some examples of regression models are linear regression, ridge regression and gradient boosting regressors for more complex use cases. On the other hand, if the target variable is categorical, then we will use classification models. Some common classification models include logistic regression (despite having "regression" in the name), support vector machines (SVMs), random forest classifiers and K-nearest neighbor (KNN).

For a more detailed comparison of regression and classification with hands-on examples, please visit this post:

Classification vs. Regression in Machine Learning – Which One Should I Use?

  • Scenario 2 (unsupervised) – In this scenario we assume that we do not have access to labeled data. In such a case, we will rely on unsupervised learning models. Some of the more common unsupervised models are K-means clustering, hierarchical clustering and principal component analysis (PCA).

For a deep dive into principal component analysis, take a look at this post:

Principal Component Analysis – Hands-On Tutorial

Now that we have selected the right family of models, based on our understanding of the problem and availability of data, we can focus on finding the right model within that family of models in the next stage.


2.2. Final Model Selection

Let’s summarize where we are so far. We took a deeper look at the problem we are trying to solve and at the data available to us, and based on those we decided what family of models to use for the problem at hand. Let’s walk through an example. Assuming we have labeled data and we are trying to forecast the price of a house, since the target variable (i.e. the price of a house) is continuous, we will choose regression as the family of models to use. But there are various regression models available, so which one should we use? For example, we could use linear regression, ridge regression, polynomial regression, decision trees or random forest regressors, gradient boosting machines (XGBoost), or even neural networks. So how do we make a decision? The best approach at this point is to first narrow down the model selection based on the requirements of the problem statement and then implement the remaining approaches to find what works best for the given problem. A few of the usual considerations are:

  • Budget: Sometimes existing business requirements limit the amount of computational resources assigned to a project. Or, even if budget is not an issue, we may not have enough time to run a long training job. In such cases, we may choose a lighter model to make sure we stay within the boundaries of our budget.
  • Interpretability: Some high-performing models are too complex for human interpretability and therefore are not suitable in sensitive use cases. For example, in disease diagnosis, we may want to be able to interpret the results and follow the logic of a recommendation being made by an ML model, to ensure the right decision is being made. Therefore, we may choose to limit ourselves to only those models that indicate what symptoms led to the diagnosis and therefore we may choose not to use black box models such as neural networks.
  • Complexity: Complex models, such as neural networks, can perform very well but they tend to require larger amounts of data, compared to simpler models. If we do not have enough training data to support a complex model, we may be better off using simpler models.

Once we go through the above considerations and narrow down the list of models, we are ready to start training using the remaining models and then we will evaluate the performance of each model to find the best model for our use case.

I know we covered a lot of ground in this section on model selection, especially because I personally feel passionate about this area. I have tried to distill the information into a decision tree table below to summarize and facilitate the decision-making process. Hope you like it as much as I do!

Table 3 – Model Selection Decision Tree

2.3. Model Selection Reference

Now that we understand the model selection process, I would like to also include a more comprehensive overview of models that may be useful to data scientists, along with their associated complexity levels, data requirements and levels of interpretability. As we observed earlier, this information can come in handy when selecting the right model. We will also see later in the post that, based on the errors we observe during model selection, we may want to increase or decrease the complexity of the models we are testing, so it will be helpful to have a relatively comprehensive list of the models available to us.

Since there are quite a number of models available for each of the supervised and unsupervised learning approaches, I am going to present them in two separate tables. The first table will cover supervised approaches and the second table will present the unsupervised algorithms. It is important to note that data scientists are not required to know all of these models in depth, since we each specialize in different areas of data science, but having an overall understanding and appreciation of these models can help increase our breadth of knowledge.

Let’s start with a summary of supervised learning models as follows:

Table 4 – Supervised Models Overview

Then, let’s look at a tabular summary of unsupervised learning models as follows:

Table 5 – Unsupervised Models Overview

That concludes our model selection section. Now we understand what kind of model we are looking for and what we can experiment with. In the next section, we will look at what errors can happen during the training and testing of these ML models and how we can use the knowledge gained from these errors to further improve our modeling.

One of the models that has been very popular among data scientists is XGBoost, which I have also mentioned in the earlier tables. It is a very strong candidate for both regression and classification tasks and I have seen it used in data scientist interviews. If you are interested in learning more about it, refer to the following post, which includes an introduction, a step-by-step implementation, followed by performance comparison against other popular models, such as random forest, gradient boosting, adaboost, KNN, SVM, etc.

XGBoost: Intro, Step-by-Step Implementation, and Performance Comparison

This concludes our discussion around model selection. Once we have selected the model that fits our problem, we want to look at how well it generalizes and performs after training. The next section takes a closer look at this topic.


3. Generalization – Overfit and Underfit

For almost every section of this post, I initially wrote "this section is the most important one" but I kept changing my mind and reserved that label for this section! So let me say that the section we are about to go through is, in my estimation, the most important part of this post, and yet this is where I see the majority of our interview candidates struggle with depth of understanding. Let’s start from what the goal of an ML exercise is and build from there.

Recall that the goal of (supervised) ML training is to learn from the training data. The training process is iterative, so the model keeps looking for ways to improve its learning during training time. In order for the model to learn the data as well as possible, we provide it with an objective function, which is also called a loss or cost function. During training, the model measures its own performance by comparing its predictions to the actual data using the loss function, and it tries to minimize that loss. Loss in this case can be thought of as the distance between the model’s predictions and the actuals during training time. The model tries to minimize this loss, and once the loss is small enough, or once we have spent our resources (such as computational budget or time), the training stops. The loss that remains at the end of training is called the training loss or training error. In short, the model tried to minimize its error during training time and whatever was left is called the training loss.

Now we understand what the training loss is, but the goal of training an ML model is to create a model that predicts the future well. Therefore, once the model has been trained, we want to measure how well it "generalizes" to unseen data (the test set), which reflects the model’s actual performance, and we want to minimize that error. In other words, the goal is to minimize the test error and not necessarily the training error. The problem is that minimizing the training error does not always lead to the model generalizing well (i.e. minimizing the test error). So let’s try to better understand what test error is, and then we can find ways to improve the model further.


3.1. Test Error Categories

We break down test errors into two categories:

  1. Overfit: When the trained model predicts accurately on the training data set (i.e. training error is small) but does not generalize well to unseen data (i.e. test error is large), the model is said to have overfitted the data.
  2. Underfit: The model underfits the data if the training error is relatively large, which typically ends up having a relatively large test error as well.

This can be confusing so let’s put it in a table that we can use going forward as a reference. This table can be used as a decision matrix – for example, if you see a case with small train error but large test error, then that is an overfitted model.

Table 6 – Error Matrix

We now understand whether the model overfits or underfits but how do we use this information to make our models better? In order to be able to improve the models, we need to first understand what is causing the errors. We will break down the test error into two components next and then talk about each one in more detail.


3.2. Test Error Components – Bias vs. Variance

Test error is usually decomposed into two components: bias and variance. Let’s define each and understand them with examples (a compact formula for this decomposition follows the definitions below).

  • Bias is the type of error that happens because the model is too simplistic. Simplistic models fail to capture the underlying complexities that exist in the training data, which results in underfitting. As we learned before, in these cases we expect both the train and test errors to be high. In short, high bias models tend to underfit.
  • Variance, unlike bias, is the type of error that happens when the model is too complex for the training data. The result of using an overly complex model for a simpler training data is that the model learns to react to even the noise in the training data and therefore overfits the training data. As a result, train error ends up being small but the overfitted model does not generalize well and the test error ends up being large, which is the definition that we provided earlier for a case of an overfitted model. In short, high variance models tend to overfit.
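For reference, this decomposition of the expected test error (under squared loss) is commonly written as:

$$\text{Expected Test Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$$

where the irreducible error comes from noise in the data itself and cannot be removed by any model; diagrams of the trade-off (including the one below) often omit this constant term for simplicity.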

Let’s expand our previous table to include bias and variance in there for our future reference.

Table 7 – Expanded Error Matrix

Now that we understand these two error types, let’s use them to improve our model’s performance.


3.2.1. Bias-Variance Trade-Off

As we saw in the previous section, the test error can be decomposed into bias and variance, and there is often a trade-off between these two components. If a model is too simplistic (e.g. has few parameters) for a given training data set, then it may have large bias and small variance and usually suffers from underfitting. When the model is too complex (e.g. has many parameters) for the training data, then it may suffer from high variance and low bias and thus overfit. This is demonstrated in the figure below.

Note that I am creating this graph in Python just to demonstrate what the trade-off looks like. Technically, this can be created for a given training data set by using various ML models with varying complexities and measuring the test error; for simplicity, I just create the diagram below. I have also added comments to make the code easy to follow, but the code itself is not important for this section.

# import libraries
import numpy as np
import matplotlib.pyplot as plt

# generate values for model complexity
model_complexity = np.linspace(0, 10, 100)

# define functions for bias, variance, and test error
bias_squared = (10 - model_complexity) ** 2 / 20
variance = model_complexity ** 2 / 30

# test error is bias and variance together
test_error = bias_squared + variance

# plot
plt.figure(figsize=(10, 6))
plt.plot(model_complexity, bias_squared, label=r'Bias', color='blue')
plt.plot(model_complexity, variance, label='Variance', color='red')
plt.plot(model_complexity, test_error, label='Test Error (= Bias + Variance)', color='black')

# labels, title and legend
plt.xlabel('Model Complexity', fontsize=14)
plt.ylabel('Error', fontsize=14)
plt.title('Bias-Variance Tradeoff', fontsize=16)
plt.axvline(x=5, color='gray', linestyle='--', label='Optimal Trade-Off')
plt.legend()
plt.grid(True)
plt.show()

Results:

Figure 1 – Bias-Variance Tradeoff

In the diagram above, bias is in blue, variance is in red and the test error is in solid black. The X axis depicts model complexity and the Y axis depicts error. As model complexity increases, bias decreases and variance increases. The important part here is how the "test error" in solid black changes with model complexity. Note that the test error starts high, decreases to a point and then goes up again – this is what we mean by the bias-variance trade-off. The minimum of the test error curve is where we want to be, which is marked as the "optimal trade-off" in the plot.

Let’s walk through an example to see this bias-variance trade-off in action.


3.2.2. Bias and Variance – Examples

We are going to improve our understanding of the bias and variance concepts through an example. We will start by generating some synthetic data. In order to demonstrate the underfit and overfit concepts, I will generate the synthetic data from a quadratic equation, and therefore we expect a quadratic model to fit the data best. Then we will fit linear, quadratic and higher-degree ML models to the synthetic data to demonstrate overfitting and underfitting.

I strongly encourage you to read the discussion that comes after the plots. I will explain in detail how to identify a high bias and/or high variance system, both visually and quantitatively. This is one of those concepts that, once you have thought it through a few times, you can use as a strong analytical tool for your day-to-day analysis of modeling results.

3.2.2.1. Implementation

Let’s start by creating the synthetic data as follows:

# import libraries
import numpy as np
import matplotlib.pyplot as plt

# generate synthetic data
np.random.seed(1234)
X = np.sort(np.random.rand(100, 1) * 10, axis=0)
y = 2 * X + X ** 2 + 5 + np.random.randn(100, 1) * 4

# plot
plt.figure(figsize=(6, 4))
plt.scatter(X, y, color="blue", alpha=0.5)
plt.xlabel("X")
plt.ylabel("y")
plt.title("Synthetic Data")
plt.show()

Results:

Figure 2 – Synthetic Data Scatterplot

So far, we have only created some synthetic data through a quadratic equation and plotted it above. The next step is to fit a few different ML models to the data and measure how well they fit. Before we do that, let’s take care of a couple of things. First, we will break our data set down into train and test sets. Note that since the data is randomly generated to begin with, we are not going to randomize the selection. We will simply pick the first 80% as the training set and leave the rest as the test set.

# split data into training and test sets
X_train, X_test = X[:80], X[80:]
y_train, y_test = y[:80], y[80:]

And then we will define a function that we will use to plot the results of the models we train.

# function to plot model
def plot_model(models, titles, poly_features):
    plt.figure(figsize=(15, 6))
    colors = ['red', 'orange']

    for i, (model, title) in enumerate(zip(models, titles)):
        plt.subplot(1, len(models), i + 1)
        plt.scatter(X_train, y_train, label='Training Data', color='blue', alpha=0.5)
        plt.scatter(X_test, y_test, label='Test Data', color='green', alpha=0.5)

        X_range = np.linspace(0, 10, 100).reshape(-1, 1)
        if poly_features[i] is not None:
            X_poly_range = poly_features[i].transform(X_range)
            plt.plot(X_range, model.predict(X_poly_range), color=colors[i], linewidth=2, label=title)
        else:
            plt.plot(X_range, model.predict(X_range), color=colors[i], linewidth=2, label=title)

        plt.xlabel('Feature')
        plt.ylabel('Target')
        plt.legend()
        plt.grid(True)
        plt.title(title)

    plt.tight_layout()
    plt.show()

At this point, we are ready to get to modeling. I will explain the process here at a high level and then will add comments in the code to make it easier to follow.

We will start by importing the libraries that we will be using for this exercise. Then I want to make sure we have a model that underfits (i.e. high bias) and another model that overfits (i.e. high variance). I also want to compare those to a model that fits the data better. Normally we would need to experiment with various ML models to see when these behaviors happen, but we have an advantage here – we generated the data ourselves and therefore know that it follows a quadratic equation. Therefore, I would expect a linear regression model to be too simplistic to fit well, so we can use that as our underfitting model. Additionally, anything beyond quadratic would be overly complex for the training data, so we can use a 5th degree model as our overfitting choice. And finally, since the data is quadratic, we will use a quadratic model as the one that we expect to fit the data best.

To summarize, we will use the training data to train our models (linear and polynomial regression models). Then we will use the trained models to make predictions on the test set. We will measure the prediction error on the test set using mean squared error (i.e. the average of the squared errors) and finally plot the results to visually inspect the fit (over, under and a good one).

In this post, we will not be able to go deep into the details of linear regression, but if you are interested in learning more about it, the following post is for you.

Linear Regression – Occam’s Razor of Predictive Machine Learning Modeling

Let’s implement what we discussed so far below and then we will further discuss the findings:

# import libraries
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

# high bias (underfitting) - linear regression
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

# better fit model - polynomial regression (degree 2)
poly_2 = PolynomialFeatures(degree=2)
X_poly_2_train = poly_2.fit_transform(X_train)
poly_model_2 = LinearRegression()
poly_model_2.fit(X_poly_2_train, y_train)

plot_model(
    models=[linear_model, poly_model_2],
    titles=["(A1) High Bias (Underfitting) - Linear Regression", "(A2) Better Fit - Polynomial Regression (Degree 2)"],
    poly_features=[None, poly_2]
)

# high variance (overfitting) - polynomial regression (degree 5)
poly_5 = PolynomialFeatures(degree=5)
X_poly_5_train = poly_5.fit_transform(X_train)
poly_model_5 = LinearRegression()
poly_model_5.fit(X_poly_5_train, y_train)

plot_model(
    models=[poly_model_5, poly_model_2],
    titles=["(B1) High Variance (Overfitting) - Polynomial Regression (Degree 5)", "(B2) Better Fit - Polynomial Regression (Degree 2)"],
    poly_features=[poly_5, poly_2]
)

# calculate mean squared errors (MSE) for each model
def calculate_errors(model, X_train, y_train, X_test, y_test, poly=None):
    if poly:
        X_train = poly.transform(X_train)
        X_test = poly.transform(X_test)
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    mse_train = mean_squared_error(y_train, y_train_pred)
    mse_test = mean_squared_error(y_test, y_test_pred)
    return mse_train, mse_test

# calculate errors
mse_linear_train, mse_linear_test = calculate_errors(linear_model, X_train, y_train, X_test, y_test)
mse_poly_2_train, mse_poly_2_test = calculate_errors(poly_model_2, X_train, y_train, X_test, y_test, poly_2)
mse_poly_5_train, mse_poly_5_test = calculate_errors(poly_model_5, X_train, y_train, X_test, y_test, poly_5)

# Print errors
print("Mean Squared Errors (MSE):n")
print(f"Linear Model (High Bias - Underfitting): Train MSE = {mse_linear_train:.2f}, Test MSE = {mse_linear_test:.2f}")
print(f"Polynomial Model Degree 2 (Better Fit): Train MSE = {mse_poly_2_train:.2f}, Test MSE = {mse_poly_2_test:.2f}")
print(f"Polynomial Model Degree 5 (High Variance - Overfitting): Train MSE = {mse_poly_5_train:.2f}, Test MSE = {mse_poly_5_test:.2f}")

Results:

Figure 3 – Comparison of Overfit and Underfit Models to an Optimal Fit

Next comes the discussion of the above plots, which I strongly encourage you to read; then try to reproduce the same reasoning yourself as an exercise.

3.2.2.2. Analysis and Discussion

Let’s discuss the results in more detail. As we discussed earlier, our data set is synthetically generated from a quadratic equation, which can be seen in any of the above plots as the blue and green dots in the scatter plots. We trained three models, so let’s discuss each individually:

  • Linear regression (plot A1 in the top left corner): We can see that the linear model cannot capture the underlying quadratic pattern of the training data very well, due to its simplistic linear nature, and therefore it demonstrates a high bias or underfitting scenario.
  • 5th degree polynomial regression (plot B1 in the bottom left corner): Note that we moved to plot B1. This is a higher-degree polynomial regression model relative to the underlying data (i.e. 5th vs. 2nd degree). As we can see, the 5th degree model fits the train set very well and follows the data closely, but it does not generalize well to the test set – the predicted outcomes (red line) start to move away from the actuals (green dots). This is because the 5th degree model is too complex for the underlying data, resulting in a high variance system that overfits the training data and generalizes poorly.
  • 2nd degree polynomial regression (plots A2 and B2 in the right half): And finally, we get to plots A2 and B2, which are identical but included separately to make the visual side-by-side comparisons more convenient. We can observe in the plots how beautifully the model captures the underlying pattern of the training data and generalizes well to the test set. This is of course not by accident, since the underlying train and test sets follow a quadratic equation and we use a 2nd degree polynomial regression, which is expected to fit very well.

Conclusion here is that if we were given this data set and we tested these three different models, we would pick the 2nd degree polynomial regression model as our model of choice, given the results above.

But what if we did not have the visualization and/or wanted to rely on quantitative measures of error? Let me tabularize the mean squared results that we have above into a small table and then we can discuss.

Table 8 – Errors Summary

When we look at rows 1 and 3, we can see that the train errors are much smaller than the test errors, which indicates that those models did not generalize well. The interesting part is that the errors are of different types: the model in row 1 suffers from a high bias error, while the one in row 3 suffers from a high variance error. Just by looking at these numbers, we cannot tell which one is high bias and/or high variance, but given our knowledge of the models, we can come to a conclusion. Row 1 is a linear regression model, which is much simpler than row 3, a 5th degree polynomial model, and therefore we can conclude two important points: (1) row 1 suffers from high bias, due to the simplicity of the linear regression model, and therefore underfits the data, while row 3 suffers from high variance, due to the complexity of the 5th degree polynomial regression model, and therefore overfits the data. (2) The optimum point that we discussed during the bias-variance tradeoff lies somewhere between linear and 5th degree polynomial regression. Knowing this, we could test various degrees of polynomial regression models and eventually approach the right solution, which is row 2, the 2nd degree polynomial regression model, where train and test errors are roughly of the same magnitude. This is an indication that our model generalized well.

In the example above, we talked about one scenario where we identified cases of underfitting and overfitting models and then found the optimized model that generalized well. We simply used the model with the right level of complexity but there are other ways that can help us with high bias and/or variance scenarios. Let’s discuss those options next.


3.2.3. Bias and Variance – Improvement Paths

If you have been to a data scientist interview, you have probably been asked about the bias-variance tradeoff – it is a very common question. It helps the interviewer gauge the interviewee’s depth of knowledge of both types of error in ML modeling and how the right balance can be achieved. The next logical question in such scenarios is: once we realize our model overfits or underfits, what can be done about it? In this section, we are going to talk through the various tools available to help us improve underfit or overfit models.

Improving overfitting (high variance and low bias) can be achieved through the following:

  • Reduce model complexity. Recall that overfitting models are usually too complex for the training data and therefore end up capturing the noise that exists in the data set, resulting in poor generalization and high variance. Therefore, using simpler models is one of the obvious remedies in such scenarios. In the example we walked through, we saw that the 5th degree polynomial was too complex for the data, resulting in a high variance system, which was improved by moving to the lower-complexity 2nd degree polynomial regression. In general, the more features a model has, the more complex it is considered, and therefore reducing features is another way of lowering complexity. If neural networks are being used, which are considered complex models, reducing layers or nodes can help lower the complexity.
  • Use regularization: Regularization is a set of methods specifically designed to help with overfitting. Techniques such as L1 (Lasso) and L2 (Ridge) regularization essentially penalize large model weights. Going into the depths of regularization is beyond the scope of this post; it suffices to know here that regularization techniques can be used to help with overfitting.
  • Increase training data. When the model is complex, it generally requires more data compared to simpler models – otherwise the complex model overfits. Therefore, the same logic follows that if we observe an overfitting model and assuming more data is available to us, adding to the training data could help with overfitting.
  • Use cross-validation and early stopping. Overfitting can happen when we continue training for too long, since the training loss keeps decreasing; then we go to the test set and realize that the test error is quite high and the model has already overfitted. One solution is simply to stop the training early so the model does not get the opportunity to overfit. Another solution is cross-validation, a technique used to evaluate the generalizability of the model by testing it on various subsets of the data (instead of one fixed set). There are various ways to implement this, but the most common one is k-fold cross-validation, where the training data is divided into "k" subsets and the model is trained "k" times; each time, one of the "k" parts is held out as the cross-validation set and the remaining "k-1" subsets are used for training. The held-out part changes each round of training, so the model has less of an opportunity to overfit the training data (a minimal code sketch follows this list).
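To make the cross-validation idea concrete, here is a minimal sketch (not part of the original example) of 5-fold cross-validation using scikit-learn's cross_val_score, applied to a 2nd degree polynomial regression pipeline and reusing X_train and y_train from the earlier synthetic-data example; the fold count and scoring choice are illustrative.

# minimal sketch: 5-fold cross-validation for a 2nd degree polynomial regression
# (illustrative only; reuses X_train and y_train from the earlier synthetic-data example)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# wrap the polynomial features and the regression into one pipeline
poly_pipeline = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())

# k-fold cross-validation (k=5): each fold is held out once while the model trains on the rest
cv_scores = cross_val_score(poly_pipeline, X_train, y_train.ravel(), cv=5, scoring='neg_mean_squared_error')

# average MSE across the 5 folds (scores are negated by scikit-learn convention, so flip the sign)
print(f"Cross-validated MSE: {-cv_scores.mean():.2f} (+/- {cv_scores.std():.2f})")

If the cross-validated error is much larger than the training error, that is a sign the model is starting to overfit.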

Improving underfitting (high bias and low variance) can be done through the following methods, which are mainly the reverse of some of the solutions we used to improve overfitting.

  • Increase model complexity. Recall that underfitting models are usually too simple to capture the underlying data, so changing to a more complex model can help overcome underfitting. One example is what we observed earlier, where moving from linear to 2nd degree polynomial regression did the trick. In general, models with more features are considered more complex, so adding features, adding higher-degree polynomial terms or using neural networks are among the solutions here.
  • Decrease regularization. As we discussed earlier, regularization is a method to reduce overfitting, so if the model we are using comes with a regularization term and the system is already underfitting, we can lower the amount of regularization to improve the model’s performance.

I know this was a lot of information to go over for one topic and it can get confusing. Similar to earlier sections, I have distilled this knowledge into a simple table that you can use and refer to in the future.

Table 9 – Bias and Variance Improvement Paths

This concludes our deep dive into error types. In the next section, we will review the metrics that are usually used for model evaluation, depending on the model type.


4. Evaluation

So far we have learned about the various learning types (supervised vs. unsupervised), their corresponding data requirements and complexity, how to choose a model based on the requirements of a given project, and how to improve our machine learning modeling exercise by leveraging the types of errors observed during training and test time. One last, but still very important, part of any ML modeling exercise is measuring the performance of the model, which is the topic we will cover in this section of the post.

To measure the performance of our ML models in the bias-variance tradeoff example, we simply relied on mean squared error (MSE). But how did we decide that MSE was the right evaluation metric for that exercise? Let’s think through our model selection process again to see how important the evaluation metric is. In the first stage, we determine which model family to use, and within that family there will be various models to try, which is the final model selection stage that we covered earlier. Let’s say we have a continuous target and we choose regression as the model type. Then we will probably explore linear and polynomial regression models, among others. This means that we should be able to somehow measure how each one performs in order to pick the best model among them. Let’s assume we select the model that performs best among the ones we tried. Then, as an extra step, we will iteratively tune various parameters of that model to make sure we use the best set of parameters to get the best performance out of the model (this was the reason behind the validation set that we discussed earlier, and the process is called hyperparameter optimization). For this iterative process, we again need an evaluation metric to be able to pick the best set of hyperparameters for our selected model. As you can see from these examples, having the right metric is an essential part of a modeling exercise.

In the ML literature, there are various evaluation metrics that can be used, depending on the type of problem being solved. Let’s break these down and look at some of the most common ones. In this section, I am going to focus mainly on evaluation metrics for supervised learning. The reason for this focus is that, by definition, measuring error generally means measuring the distance between actuals and predicted outcomes, and actuals are only available in labeled data, which limits us to supervised learning scenarios. This does not mean that unsupervised learning methods cannot be evaluated, but those metrics can be more specialized, which I consider out of scope for this post.

Let’s start with regression first.


4.1. Regression Evaluation Metrics

1. Mean Absolute Error (MAE)

Note this is different from the mean squared error (MSE) that we used in our earlier example. MAE calculates the average absolute difference between the predictions and the actuals. It represents the average magnitude of errors without considering their direction (positive or negative), since it takes the absolute value of each difference. It can be formulated as follows:
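$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert$$

where $n$ is the number of observations, $y_i$ is the actual value and $\hat{y}_i$ is the predicted value.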

This is a measure of error, so lower values of MAE indicate better model accuracy. Since it does not square the errors, MAE is not as sensitive to outliers as MSE. MAE is ideal when we want a straightforward average error measure and want to treat all errors equally.

2. Mean Squared Error (MSE)

MSE, which is what we used earlier in our example, computes the average of the squared differences between predicted and actual values. By squaring the errors, MSE gives more weight to larger errors, making it sensitive to outliers and is calculated as follows:
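$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$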

Lower MSE values indicate better performance. Since errors are squared, large errors have a disproportionately larger impact on the metric. MSE is useful when larger errors are particularly important to us or when we want to penalize outliers more.

3. Root Mean Squared Error (RMSE)

RMSE is the square root of MSE, which brings the error metric back to the same unit scale as the target variable, making it more interpretable than MSE. Given the definition above, here is how RMSE is calculated:
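$$\text{RMSE} = \sqrt{\text{MSE}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$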

Lower RMSE values indicate better model performance, with errors represented in the same units as the target. RMSE is commonly used when the scale of errors needs to be in the original units for easier interpretability, as opposed to MSE.

4. R-Squared (R²)

R-squared measures the proportion of variance in the dependent variable that is predictable from the independent variables. This one can be a bit unintuitive, so let me try to explain it further. Let’s say our independent variable is X and we are trying to measure the performance of the model in predicting the target (dependent) variable Y. It is important to understand how much the dependent variable Y is spread around its own mean, which is a loose definition of the variance of Y around its mean – if Y were just close to its mean, predicting the mean would be easy, but if there is larger variance, prediction will be harder, or it would require more work to make sure our independent variable (X) can actually explain the variance in our dependent variable (Y). R-squared tries to quantify that. It compares the model with a baseline (mean) model that predicts the mean value for all observations, and is calculated as follows:
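$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

where $\bar{y}$ is the mean of the actual values (i.e. the prediction of the baseline mean model).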

R-squared values typically range from 0 to 1, where 1 indicates a perfect fit and 0 indicates that the model explains none of the variance (it can even be negative if the model performs worse than simply predicting the mean). R-squared is helpful in understanding the explanatory power of a model and is often used for comparing models with similar variables. For example, an R-squared of 0.85 means the model explains 85% of the variability in the data.

5. Adjusted R-Squared

Adjusted R-squared adjusts the R-squared value based on the number of predictors in the model, penalizing models with more predictors to avoid overfitting.
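A common formulation is:

$$\text{Adjusted } R^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}$$

where $n$ is the number of observations and $p$ is the number of predictors.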

Higher values of adjusted R-squared indicate a better fit, but it doesn’t automatically increase with more predictors like R-squared. This makes it a better metric when comparing models with different numbers of variables. Adjusted R-squared is essential when adding or removing predictors to assess their impact on the model, during hyperparameter optimization or feature engineering. For example, if a model’s adjusted R-squared decreases when adding a new predictor, that predictor may not contribute meaningfully to the model.
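To tie these metrics together, below is a minimal sketch of computing MAE, MSE, RMSE, R² and adjusted R² with scikit-learn, reusing poly_model_2, poly_2, X_test and y_test from the bias-variance example earlier; adjusted R² is computed manually, since scikit-learn does not provide it directly.

# minimal sketch: regression evaluation metrics with scikit-learn
# (reuses poly_model_2, poly_2, X_test and y_test from the earlier bias-variance example)
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# predictions of the 2nd degree polynomial model on the test set
X_test_poly = poly_2.transform(X_test)
y_pred = poly_model_2.predict(X_test_poly)

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

# adjusted R-squared computed manually: n = observations, p = predictors (excluding the bias column)
n, p = len(y_test), X_test_poly.shape[1] - 1
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(f"MAE: {mae:.2f}, MSE: {mse:.2f}, RMSE: {rmse:.2f}, R²: {r2:.2f}, Adjusted R²: {adj_r2:.2f}")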

This concludes the most common evaluation metrics we use for regression algorithms. Next, we will walk through classification evaluation metrics.


4.2. Classification Evaluation Metrics

I personally think the classification metrics are more intuitive to understand than the regression metrics, but there is some groundwork we need to do first, so let’s define some terminology. In a binary classification problem, where the target variable can only be positive (or 1) or negative (or 0), there are four possible outcomes for a prediction:

  • True Positive: A positive event was correctly predicted.
  • False Positive: A negative event was incorrectly predicted as positive (a.k.a. Type I error).
  • True Negative: A negative event was correctly predicted.
  • False Negative: A positive event was incorrectly predicted as negative (a.k.a. Type II error).

Given the above four possible outcomes, we can define more nuanced metrics that we will cover below.

1. Accuracy

Accuracy measures the proportion of correct predictions made by the model out of all predictions. It is calculated as follows:
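$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where $TP$, $TN$, $FP$ and $FN$ are the counts of true positives, true negatives, false positives and false negatives defined above.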

Accuracy is a straightforward measure of model performance, with higher values indicating better accuracy. However, it is only suitable for balanced datasets where all classes are equally important. In cases of class imbalance, accuracy can be misleading, as it may ignore minority classes. For example, an accuracy of 90% might sound impressive, but if the dataset is 90% one class, the model could achieve this by always predicting the majority class. In short, we want to make sure the classes are balanced when using this metric.

2. Precision

Precision, or positive predictive value, is the ratio of true positives to the sum of true positives and false positives. It focuses on the accuracy of positive predictions and is calculated as follows:
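$$\text{Precision} = \frac{TP}{TP + FP}$$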

High precision means that the model makes positive predictions carefully, with fewer false positives. Precision is especially important in scenarios where false positives are costly, such as in medical diagnostics, where a false positive might lead to unnecessary treatments. For instance, if a cancer diagnostic tool has a precision of 0.8, then 80% of its positive predictions are correct – a level that would be concerning in such a sensitive application.

3. Recall

Recall (a.k.a. sensitivity or the true positive rate), is the ratio of true positives to the sum of true positives and false negatives. This metric tells us how effectively the model captures all relevant positive cases and is calculated as follows:
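$$\text{Recall} = \frac{TP}{TP + FN}$$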

High recall means that the model successfully identifies most positive cases, with fewer false negatives. Note that precision focuses on false positives, while recall focuses on false negatives. Recall is crucial in applications where missing positive cases can have serious consequences, such as fraud detection or disease diagnosis. For example, a recall of 0.9 means that the model correctly identifies 90% of all actual positive cases, which is critical when catching every positive instance is essential.

Precision and recall are probably the two most commonly used evaluation metrics in classification, but as the definitions above show, they focus on different aspects (false positives vs. false negatives, respectively), so it would be nice to combine these two metrics into one, which is what we will cover next.

4. F1 Score

The F1 Score is the harmonic mean of precision and recall, providing a balanced metric that takes both into account. This score is particularly useful in situations where there is a trade-off between precision and recall. The F1 Score is calculated as follows:
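$$F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$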

The F1 Score is most effective in imbalanced datasets where focusing on both false positives and false negatives is important. Recall that accuracy worked well in balanced data sets and here we see that F1 can be effective in imbalanced data sets. Since F1 combines precision and recall, it provides a single metric that reflects the model’s performance in capturing positive cases while minimizing false positives. For example, an F1 Score of 0.75 means the model strikes a balance, capturing relevant positives while avoiding excessive false positives.

5. Area Under the Receiver Operating Characteristic Curve (AUC-ROC)

AUC-ROC measures the model’s ability to distinguish between positive and negative classes at various classification thresholds. The ROC curve plots the true positive rate (recall) against the false positive rate, and the AUC (area under the curve) represents the model’s overall performance.

Values for AUC range from 0 to 1, with values closer to 1 indicating better performance. An AUC of 0.5 would suggest a random model with no classification power, while values above 0.7 generally indicate a useful model. AUC-ROC is especially valuable in binary classification tasks (i.e. when there are only two possible outcomes), particularly for imbalanced datasets, as it provides insight into the model’s ability to separate classes across multiple thresholds. For example, an AUC-ROC of 0.9 means the model has a 90% chance of ranking a randomly chosen positive instance higher than a randomly chosen negative one. We will cover an example later in the post.

6. Area Under the Precision-Recall Curve (AUC-PR)

AUC-PR measures the trade-off between precision and recall across different classification thresholds and is particularly useful for imbalanced datasets. The PR curve plots precision against recall, and the area under the curve provides a single metric to assess model performance.

Higher AUC-PR values indicate a stronger ability to capture true positives while avoiding false positives. AUC-PR is preferred over AUC-ROC when the dataset is heavily imbalanced, since it emphasizes the model’s performance in predicting positive classes rather than negative ones. For example, an AUC-PR of 0.85 indicates that the model maintains high precision and recall across various thresholds in a setting where positive cases are rare.

7. Logarithmic Loss (Log Loss)

Log Loss measures the accuracy of probabilistic predictions, penalizing confident but incorrect predictions more heavily. It is calculated as follows:
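$$\text{Log Loss} = -\frac{1}{n}\sum_{i=1}^{n} \left[\, y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \,\right]$$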

Here, $y_i$ is the actual label (1 or 0), and $\hat{y}_i$ is the predicted probability for the positive class. Log Loss considers both the correctness of the prediction and the confidence in that prediction, rewarding correct, confident predictions and penalizing wrong, confident predictions.

Lower Log Loss values indicate better model performance, as they imply that the model is making accurate, confident predictions. Log Loss is especially useful in probabilistic models like logistic regression, where calibrated probability estimates are needed. For example, a Log Loss of 0.3 indicates that the model makes accurate predictions with high confidence.

I have good examples of how to calculate some of these classification metrics in a separate post about XGBoost, so I won’t go through examples here, but I will include an example to demonstrate how the AUC metrics work, since those are a bit different and less intuitive. Here is the post that you can look at for implementations of some of these metrics:

XGBoost: Intro, Step-by-Step Implementation, and Performance Comparison

Let’s look at an example of implementing AUC-ROC.

# import libraries
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

# generate a synthetic binary classification dataset
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, flip_y=0.3, random_state=1234)

# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)

# initialize and fit a random forest classifier
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

# predict probabilities for the test set
y_proba = clf.predict_proba(X_test)[:, 1]

# calculate fpr, tpr, and thresholds for roc curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)

# calculate the auc score
roc_auc = auc(fpr, tpr)

# plot
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (Recall)')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.grid()
plt.show()

Results:

Figure 4 – ROC Curve Example

A curve that’s closer to the top left corner indicates better model performance, as it achieves a high TPR with a low FPR, so this seems very good. The AUC score is a single value summarizing the ROC curve. An AUC of 1 indicates a perfect model, 0.5 indicates random performance, and values between 0.5 and 1 show varying levels of performance.
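For completeness, here is a minimal sketch of the precision-recall counterpart (AUC-PR) for the same classifier, reusing y_test and y_proba from the ROC example above; average_precision_score is shown as well, since it is a commonly used summary of the PR curve.

# minimal sketch: precision-recall curve and AUC-PR for the same classifier
# (reuses y_test and y_proba from the ROC example above)
from sklearn.metrics import precision_recall_curve, auc, average_precision_score

# precision and recall at various classification thresholds
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)

# area under the precision-recall curve and the average precision summary
pr_auc = auc(recall, precision)
avg_precision = average_precision_score(y_test, y_proba)

print(f"AUC-PR: {pr_auc:.2f}, Average Precision: {avg_precision:.2f}")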

Now that we have covered both regression and classification metrics, let’s summarize them in a nice table for our future use.

Table 10 – Evaluation Metrics Summary

Final Thoughts

Whether you made it this far down the post or not, I hope you enjoyed this comprehensive overview of machine learning basics, which I personally think every data scientist needs for their day-to-day model use or as a guide for their next interview preparation.

Remember that we use these data sets because we cannot possibly run actual experiments for every possible scenario, and therefore we rely heavily on our training and test sets, which are sampled from the population’s distribution. Given this inherent limitation, the learned relationship holds well for the distribution the data was sampled from, assuming we did a good job and collected a representative sample. The model’s performance on the test set is only an estimate of its performance on the full population distribution and in practice. We use all these tools to increase the likelihood of our model generalizing well, while recognizing these limitations.


Thanks For Reading!

If you found this post helpful, please follow me on Medium and subscribe to receive my latest posts!

(All images, unless otherwise noted, are by the author.)

