Machine Learning — How to Evaluate your Model?

Basic evaluation metrics and methods for Machine Learning algorithms

Karlo Leskovar
Towards Data Science


Just recently I covered some basic Machine Learning algorithms, namely K Nearest Neighbours, Linear and Polynomial Regression, and Logistic Regression. In all these articles we used the popular car fuel economy dataset from Udacity and conducted some kind of classification of cars, e.g. into vehicle “size” classes or according to the driven wheels. Also, every time we calculated the basic model accuracy on the training and test data and tried to fit a car of our interest into the model to check its capabilities.

This is all fine, but sometimes one could face a dataset where the classes are not as well distributed as in this example of 3920 different cars. Could this cause a problem? Is a basic accuracy evaluation enough on such datasets?

A prime example of such a dataset can be found on Kaggle: the popular credit card fraud dataset. What makes this dataset different? Well, the number of credit card frauds is very small compared to the number of valid transactions. More on that later. 🙂

Photo by bruce mars on Unsplash

In this article, I will use this dataset to explain some of the most popular evaluation metrics for machine learning models.

Structure of the article:

  • Introduction
  • Confusion Matrix
  • Dataset loading and description
  • Model description
  • Model evaluation
  • Conclusion

Enjoy the reading!

Introduction

Class imbalance is a very common phenomenon in classification data. What does it actually mean? Often we get some binary or multiclass data where one class is in the minority compared to the other class or classes. Besides the mentioned credit card fraud example, popular examples are health issues, traffic accidents, aircraft failures or any other events where a very negative outcome happens very rarely, as opposed to a large number of positive outcomes (the reverse is also possible).

What happens if we measure the accuracy of a model which is meant to classify such data?

We get very high accuracy values, even if the model is pretty bad. Why is that so?

Let’s take a simple example. We want a model that will predict whether we get a flat tyre on our journey. Remember when you last had a flat tyre? Probably you don’t, or you never had one. This event happens very rarely. So if the model has to predict that we will NOT have a flat tyre on our journey, even without knowing any factors or features that can cause a flat tyre, it can get the majority of outcomes right if we set it to always predict the “most frequent” outcome, i.e. that we will never have a flat tyre.

Such models are called dummy classifiers. They are completely independent of the training data features. Scikit-Learn has a DummyClassifier class (which we will use later).

Confusion Matrix

Before we continue with the examples, I wanted to mention an important term in the model evaluation process: the Confusion Matrix. It greatly helps us visualize the outcome of a model. Let’s take a look at the confusion matrix for the flat tyre problem. Think of the situation where having a flat tyre is the positive outcome (even though it is not 🙂). I’m aware it is a bit confusing, but in terms of machine learning algorithms the zero (0) label represents a negative outcome, while the one (1) label represents a positive outcome. In our case, the majority outcome (0) is when we don’t have a flat tyre on our journey, while very rarely we do have a flat tyre, and that’s the 1 label outcome.

Probably that’s why it is called a “Confusion Matrix” 🙂. All kidding aside, the explanation follows.

Simple visualization of a binary Confusion Matrix

The green squares represent the cases where our model predicts the correct outcome: true positive (TP), where the model predicted we would get a flat tyre and we actually had one, and true negative (TN), where the model predicted no flat tyre and we actually had no flat tyre.

The red squares represent the cases where the model prediction is wrong: either false negative (FN), where the model predicted no flat tyre but we actually had one, or false positive (FP), where the model predicted a flat tyre (positive outcome) but we had no flat tyre.

Dataset loading and description

As always, first we import the dependencies.
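For the snippets below, a minimal set of imports could look like this (a sketch; the exact set depends on which parts you want to reproduce):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt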

Next, we load our dataset. This time, we use an imbalanced dataset from Kaggle. The dataset contains data about credit card transactions, where the majority are legit (284 315), while very few are fraud (492). The columns V1, V2, V3, …, V28 are principal components obtained with PCA (description from the source site). The principal components will be used as features for our linear model.
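Loading could look roughly like this, assuming the CSV from Kaggle is saved locally as creditcard.csv and that the Class column (0 = VALID, 1 = FRAUD) is the target:

# Load the Kaggle credit card fraud dataset (the local file name is an assumption)
df = pd.read_csv('creditcard.csv')

# Check the class balance: 0 = VALID (284 315), 1 = FRAUD (492)
print(df['Class'].value_counts())

# A bar plot similar to the one below
df['Class'].value_counts().plot(kind='bar')
plt.show()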

Bar plot of the credit card transaction dataset (by author)

Also, we will split the data into training and test samples.
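A sketch of the split, assuming the V1–V28 principal components are used as features and scikit-learn’s default 75/25 split (the random_state is an arbitrary choice):

from sklearn.model_selection import train_test_split

# Features: the PCA components V1..V28; target: the Class column
X = df[[f'V{i}' for i in range(1, 29)]]
y = df['Class']

# The default test_size=0.25 gives a test set of 71 202 transactions
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)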

Model description

We will use a Logistic Regression model. To read more about logistic regression, check this article.

To create the model, we need to import the LogisticRegression class from sklearn.linear_model.
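A minimal sketch, reusing the train/test split from above (the max_iter setting is just an assumption to help the solver converge):

from sklearn.linear_model import LogisticRegression

# Fit the model on the training data
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)

# Predict the test set and check which classes actually appear in the predictions
lr_pred = lr.predict(X_test)
np.unique(lr_pred)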

The model predicts both positive and negative outcomes.

array([0, 1], dtype=int64)

Let’s create the Dummy Classifier by setting the strategy to “most_frequent”. This means that the classifier will predict the outcome based on the most frequently occurring outcome in the training data. We use the DummyClassifier class from sklearn.dummy.
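A sketch along these lines:

from sklearn.dummy import DummyClassifier

# Always predicts the class most frequently seen during training (here: 0, a VALID transaction)
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)

dummy_pred = dummy.predict(X_test)
np.unique(dummy_pred)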

array([0], dtype=int64)

As expected, the dummy classifier predicts the outcome according to the most frequent occurrence; in this case it’s zero, meaning a VALID transaction.

Model evaluation

Confusion Matrix

We already defined what a Confusion Matrix is, now let’s calculate it for our two models. First we need to import the confusion_matrix function from sklearn.metrics. We will start with the matrix for our Logistic Regression model.
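Assuming the test labels and the Logistic Regression predictions from above:

from sklearn.metrics import confusion_matrix

# Rows are the actual class (0 = VALID, 1 = FRAUD), columns the predicted class
confusion_matrix(y_test, lr_pred)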

array([[71072,    10],
       [   40,    80]], dtype=int64)

According to the results the Logistic Regression model predicts:

True Negatives (TN) = 71 072

False Positives (FP) = 10

False Negatives (FN) = 40

True Positives (TP) = 80

Let’s see what the confusion matrix of the Dummy Classifier looks like.
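The same call, this time with the Dummy Classifier’s predictions:

confusion_matrix(y_test, dummy_pred)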

array([[71082,     0],
       [  120,     0]], dtype=int64)

It’s important to notice here that the diagonal of the matrix shows the successful predictions (“when the model predicts correctly”), i.e. the True Negatives and the True Positives. The Dummy Classifier predicts all cases as negative, as zero; that’s why its confusion matrix shows 71 082 True Negatives and 120 False Negatives, meaning it predicted all transactions as VALID.

On the other hand, the Logistic Regression classifier predicted 71 072 True Negatives (correctly predicted VALID transactions), 10 False Positives (FRAUD predicted, but actually VALID), 40 False Negatives (VALID predicted, but actually FRAUD) and 80 True Positives (correctly predicted FRAUD).

Now that we have covered the possible outcomes of our model, we can start calculating the metrics.

Accuracy

First on the list is model accuracy, probably the most basic and most used metric when measuring the performance of predictions.

Accuracy represents the ratio between the correctly (TRUE) predicted values and all outcomes, or in other words, the ratio between the sum of True Negatives and True Positives and all outcomes. That means we can sum the diagonal of the matrix and divide it by the sum of all four outcomes.
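A quick sketch with scikit-learn’s accuracy_score, reusing the predictions from above (the commented values follow from the confusion matrices):

from sklearn.metrics import accuracy_score

# Accuracy = (TP + TN) / (TP + TN + FP + FN)
print('LR accuracy:   ', accuracy_score(y_test, lr_pred))     # (80 + 71072) / 71202 ≈ 0.9993
print('Dummy accuracy:', accuracy_score(y_test, dummy_pred))  # (0 + 71082) / 71202  ≈ 0.9983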

Here we can clearly see why accuracy can give false confidence when evaluating an imbalanced dataset. With our Logistic Regression model we get only slightly better accuracy than with a Dummy Classifier. This is a clear indication that other evaluation metrics are necessary, especially when dealing with data where we have great discrepancies between classes.

Classification Error

Classification Error can be seen as the opposite of Accuracy. It is calculated as the sum of the counts off the diagonal (False Negatives and False Positives), i.e. the number of outcomes where the model predicted WRONG, divided by the total number of outcomes.

Let’s calculate it for our models.
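Since the classification error is simply one minus the accuracy, a sketch could be:

# Classification error = (FP + FN) / all outcomes = 1 - accuracy
print('LR error:   ', 1 - accuracy_score(y_test, lr_pred))     # (10 + 40) / 71202 ≈ 0.0007
print('Dummy error:', 1 - accuracy_score(y_test, dummy_pred))  # (0 + 120) / 71202 ≈ 0.0017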

Again, both metrics look fine, even for our Dummy Classifier. But things are about to change.

Recall

Recall is also called True Positive Rate, Sensitivity or Probability of Detection, since it calculates the ratio between the True Positive outcomes (when the model predicted a FRAUD transaction) and the sum of True Positives and False Negatives (a false negative outcome is when the model predicted no FRAUD, but the actual transaction was FRAUD).

This metric matters when it is very important to have a high number of True Positives and to avoid False Negatives; a prime example is an algorithm which detects some medical problem. Even our credit card example is good, if we, say, wanted to be sure that our model correctly predicts all the FRAUD transactions and avoids classifying FRAUD as a VALID transaction. Basically, Recall gives us an idea of what proportion of actual Positives was predicted correctly.

Let’s calculate Recall for our two models.
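A sketch with recall_score (the commented values follow from the confusion matrices above):

from sklearn.metrics import recall_score

# Recall = TP / (TP + FN)
print('LR recall:   ', recall_score(y_test, lr_pred))     # 80 / (80 + 40) ≈ 0.667
print('Dummy recall:', recall_score(y_test, dummy_pred))  # 0 / (0 + 120)  = 0.0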

Precision

Precision calculates the ratio between True Positives (when the model predicted a FRAUD transaction) and the sum of True Positives and False Positives (the case when the model predicts FRAUD for a transaction which actually was VALID). Simply put, it says how “precise” our model is in predicting the positive class (a FRAUD bank transaction, or a flat tyre).

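A sketch with precision_score:

from sklearn.metrics import precision_score

# Precision = TP / (TP + FP)
print('LR precision:   ', precision_score(y_test, lr_pred))  # 80 / (80 + 10) ≈ 0.8889
# The Dummy model never predicts the positive class, so TP + FP = 0 and precision
# is undefined; scikit-learn warns and returns 0 unless zero_division is set
print('Dummy precision:', precision_score(y_test, dummy_pred, zero_division=0))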

First, we need to address that in the case of the Dummy Classifier we get a “zero_division” warning, since for the Dummy model the sum of the True Positives and False Positives (the right column of the Confusion Matrix) is zero.

Secondly, our Logistic Regression model gives us a Precision value of 88.89 %, which is very good, as it means that 88.89 % of the predicted FRAUD transactions were really FRAUD.

False positive rate (FPR)

False Positive Rate, equal to 1 − Specificity, represents the ratio between the False Positives (the case when the model predicts FRAUD for a transaction which actually was VALID) and the sum of all True Negatives (the case when the model predicted a VALID transaction and it was indeed a valid transaction) and False Positives. This metric helps us identify the fraction of incorrectly classified negative instances, in other words, how many of the VALID transactions are classified as FRAUD.
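One way to get the FPR is straight from the confusion matrix, reusing the Logistic Regression predictions from above:

# FPR = FP / (FP + TN)
tn, fp, fn, tp = confusion_matrix(y_test, lr_pred).ravel()
print('LR FPR:', fp / (fp + tn))  # 10 / (10 + 71072) ≈ 0.00014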

Again, our Dummy model, due to zero positive predictions, yields a zero False Positive Rate, while our Logistic Regression model has an FPR of 0.00014, which means that the model classified only 0.014 % of the VALID transactions as FRAUD.

F1-score (F-score)

Another very popular metric when evaluating Machine Learning algorithms is the F1-score. It’s a metric calculated from Recall and Precision. It’s calculated as the Harmonic Mean of the two values. For the special case of two numbers the Harmonic Mean is calculated as follows:
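H(x, y) = \frac{2xy}{x + y}, \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}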

Also, there’s a more general form of the equation, which allows the user to modify the weight of Precision or Recall:
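F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}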

The β parameter allows the user to emphasize the influence of Recall or Precision as follows:

Give Precision more weight: β = 0.5 (the model performance is more affected by False Positives — VALID transactions predicted as FRAUD)

Give Recall more weight: β = 2 (the model performance is more affected by False Negatives — FRAUD transactions predicted as VALID)

Let’s calculate the F1-score for our two models.
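A sketch with f1_score:

from sklearn.metrics import f1_score

# F1 = 2 * (Precision * Recall) / (Precision + Recall)
print('LR F1:   ', f1_score(y_test, lr_pred))     # ≈ 0.76
print('Dummy F1:', f1_score(y_test, dummy_pred))  # 0.0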

Again, we can observe that a proper model like the Logistic Regression one yields a much better result than the Dummy model.

Bonus content: sklearn — classification_report

At the end of the article I want to point out that sklearn provides a classification_report function which calculates the Recall, Precision and F1-score metrics at once. Also, it gives us the option to label the outcome classes. Let’s see how we can use it in our case.
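A sketch, using target_names to label class 0 as VALID and class 1 as FRAUD:

from sklearn.metrics import classification_report

print(classification_report(y_test, lr_pred, target_names=['VALID', 'FRAUD']))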

The “support” column shows us the number of outcomes for each label. In our test data we have 71 082 VALID and 120 FRAUD transactions.

Conclusion

In this article I’ve covered some of the basic evaluation metrics and methods for Machine Learning algorithms. We also saw how the Accuracy metric can sometimes be very misleading when we have an imbalanced dataset. In such cases, I would suggest using the other metrics presented here.

As a bonus, I’ve added the sklearn classification_report function, which provides a fast and simple way to evaluate our algorithm.

Feel free to check the Jupyter Notebook on my github page.

I hope the article was clear and helpful. For any questions or suggestions regarding this article or my work feel free to contact me via LinkedIn.

Check my other articles on Medium.

Thank you for taking the time to read my work, Cheers! 🙂
