
How to Evaluate a Classification Machine Learning Model

An Introduction to Accuracy, Precision, ROC/AUC and Logistic Loss

Photo by Bru-nO on Pixabay

Evaluating a machine learning model is critical. It is the process of measuring how effective the model is in terms of accuracy, precision, recall, performance, and so on.

In one of my previous articles:

Machine Learning in Academic Research v.s. Practical https://towardsdatascience.com/machine-learning-in-academic-research-v-s-practical-5e7b3642fc06

I proposed that a typical industrial machine learning project consists of the following phases:

As shown in the above picture, selecting appropriate Evaluation Metrics is the first and crucial step of the Problem Modelling phase.

When choosing appropriate metrics to evaluate a machine learning model, the most significant factor to consider is the type of the model's output values. That is, whether the output values are

  • Discrete classifications
  • Continuous values, or
  • Ranking

In this article, I will introduce three of the most commonly used types of evaluation metrics for classification machine learning models, i.e. models with discrete outputs.

Precision and Recall

Photo by geralt on Pixabay

Precision and recall are commonly used for boolean (binary) outputs. Before we define these two concepts, let's have a look at the famous "Confusion Matrix":

Image Courtesy: https://towardsdatascience.com/understanding-confusion-matrix-a9ad42dcfd62

In the above confusion matrix,

  • TP (True Positive) indicates that both the actual and predicted results are True
  • FP (False Positive) indicates that the actual result is False but the predicted result is True, so this is an error
  • TN (True Negative) indicates that both the actual and predicted results are False
  • FN (False Negative) indicates that the actual result is True but the predicted result is False, which is also an error

Therefore, precision and recall can be derived directly from these four quantities, and we can define our Precision (P) and Recall (R) accordingly.
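In terms of the confusion-matrix entries above, the standard definitions are:

    P = TP / (TP + FP)
    R = TP / (TP + FN)

In words, precision is the fraction of predicted positives that are actually positive, and recall is the fraction of actual positives that the model manages to find.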

Ideally, we would like both of these metrics to be as high as possible. However, most of the time they pull in opposite directions: when you try to improve precision, recall tends to get worse, and vice versa.

Let's have an extreme example here. Suppose you are using a search engine to search web pages by some keywords. If the search engine returns only one web page, which is the most relevant one, we consider the precision to be 100%. However, the recall will be very low, because there will be a considerable number of web pages that are relevant (positive) but were ignored (False Negative, FN). Going back to the definition of recall, TP is only 1 but FN is very large, so the denominator is extremely large whereas the numerator is very small.

On the other hand, if the search engine returns all the web pages (imagine that we don't do any "retrieving" here and simply return every web page on the internet), the recall will be 100%, but the precision will be close to 0%. This is because the number of False Positives (FP) is extremely large.

Therefore, most of the time we need to balance these two metrics. In different scenarios, we may try to improve one of them and accept a trade-off on the other.

Average Precision Score

Sometimes we may use "Average Precision Score" (AP) to measure a model.

Image Courtesy: https://stats.stackexchange.com/questions/345204/iso-f1-curve-for-precision-recall-curve

As shown in the plot, we use the x-axis for recall and the y-axis for precision, and the curve encloses an area (shown in blue). Generally, we can generate such a precision-recall curve for every model, and the larger the area under the curve, the better the model's performance.
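As a quick illustration, scikit-learn provides precision_recall_curve and average_precision_score for exactly this; the following is a minimal sketch with made-up labels and scores:

    from sklearn.metrics import average_precision_score, precision_recall_curve

    # Made-up ground-truth labels and the model's predicted scores
    y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
    y_score = [0.9, 0.4, 0.65, 0.8, 0.3, 0.55, 0.2, 0.45, 0.7, 0.1]

    # Points of the precision-recall curve (these could be plotted as above)
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)

    # Average Precision summarises the curve as a single number in [0, 1]
    print(average_precision_score(y_true, y_score))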

However, AP has some drawbacks. For example, it is not the most convenient metric to use, as we need to calculate the area under the curve. Therefore, we usually use the other metrics introduced below.

Other metrics derived from precision and recall

One of the most commonly used is probably the F-measure. F1 is defined as follows and considers precision and recall to be equally important to the model.
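The standard form is:

    F1 = 2 · P · R / (P + R)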

In practice, we may want to give different weights to precision and recall. For example, if we consider recall to be 𝛼 times as important as precision, we can use the following equation to calculate the F-measure.
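Assuming the usual weighted F-measure (often written with β instead of 𝛼), this is:

    F_𝛼 = (1 + 𝛼²) · P · R / (𝛼² · P + R)

With 𝛼 = 1 this reduces to the F1 score above.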

Additionally, we can also measure Accuracy Rate and Error Rate:
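In terms of the confusion matrix, these are simply:

    Accuracy = (TP + TN) / (TP + TN + FP + FN)
    Error Rate = (FP + FN) / (TP + TN + FP + FN) = 1 - Accuracy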

In science and engineering, accuracy and precision usually refer to different concepts. I will not discuss this in detail here, as it is not the focus of this article. Basically, the most significant difference is that precision refers to boolean outputs (true or false), whereas accuracy can also be applied to multi-class problems, such as:
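One way to write such a multi-class accuracy, consistent with the notation explained below (a plausible form rather than a canonical one), is to average the per-class precision:

    Accuracy = (1/n) · Σ_i P(class_i)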

In the above formula,

  • n is the number of classifications in the sample space
  • P is the function to calculate the precision for the specific classification, which can be defined independently

ROC and AUC

Photo by Erik Mclean on Unsplash

In practice, sometimes we may not output boolean values but a probability. For example, the patient has a 99.3% probability of having a particular disease. In this case, if we insist on using precision and recall, we will have to define a threshold. If we define 0.8 as the threshold, then every prediction greater than 0.8 will be considered True. Then, if the actual result is also True, this prediction is a true positive.
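For instance, a minimal sketch of this thresholding step in Python (NumPy assumed, with made-up probabilities):

    import numpy as np

    # Made-up predicted probabilities from some classifier
    probabilities = np.array([0.993, 0.12, 0.81, 0.47, 0.65])

    # Every prediction greater than the 0.8 threshold is treated as True
    predicted_labels = probabilities > 0.8   # [True, False, True, False, False]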

This is somewhat inconvenient and inappropriate, because the threshold will

  • Significantly impact the evaluation of the models
  • Introduce another artificial parameter
  • Reduce the model's performance on more general problems

ROC (Receiver Operating Characteristic)

In this case, the ROC (Receiver Operating Characteristic) curve will be more effective, as it does not need such a threshold.

In a ROC curve, the x-axis is False Positive Rate (FPR), and the y-axis is True Positive Rate (TPR), which are calculated as follows:
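Concretely:

    TPR = TP / (TP + FN)
    FPR = FP / (FP + TN)

That is, TPR (the same as recall) is the proportion of actual positives that are correctly identified, and FPR is the proportion of actual negatives that are wrongly flagged as positive.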

And a ROC curve looks as follows:

Image Courtesy: https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5

As shown in the figure, the closer the ROC curve is to the top-left corner, the better the model's performance. The top-left corner is the point (0, 1), which means TPR = 1 and FPR = 0; from the formulas above, this implies that both FN and FP are equal to 0. Therefore, the ideal case is that all the test samples are correctly classified.

AUC (Area Under ROC Curve)

AUC (Area Under ROC Curve) is also shown in the above graph. It is simply a single value that can be compared across different models to evaluate their performance.

The area under the ROC curve is calculated exactly as its definition suggests.
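That is, it is the integral of the TPR as a function of the FPR:

    AUC = ∫₀¹ TPR(FPR) d(FPR)

In practice it is computed numerically, for example with the trapezoidal rule over the points of the ROC curve.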

Therefore, AUC provides a single number that can be used to compare the performance of multiple models.
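As a quick illustration, scikit-learn's roc_curve and roc_auc_score do this for us; the following is a minimal sketch with made-up labels and probabilities:

    from sklearn.metrics import roc_curve, roc_auc_score

    # Made-up ground-truth labels and predicted probabilities of the positive class
    y_true = [1, 0, 1, 1, 0, 0, 1, 0]
    y_prob = [0.95, 0.30, 0.75, 0.60, 0.40, 0.20, 0.85, 0.55]

    # roc_curve sweeps over all candidate thresholds internally,
    # so no threshold has to be chosen by hand
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)

    # Area under that curve as a single evaluation number
    print(roc_auc_score(y_true, y_prob))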

Logistic Loss

Photo by geralt at Pixabay

Logistic Loss (logloss) is another evaluation method that is commonly used in classification problems. The basic idea is to measure how likely the actual values are under the predicted values (probabilities). The original form of logloss is:
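One common way to express this general idea is as the negative average log probability that the model assigns to the actual outcomes:

    logloss = - (1/N) · Σ_i log p(y_i)

Here p(y_i) denotes the probability the model assigns to the actual outcome of the i-th sample; the closer these probabilities are to 1, the smaller the loss.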

In our case, we want to use the logloss function to "maximise" the probability that the distribution of predicted values matches the original distribution in the testing dataset. Suppose our model predicts a boolean output; the logloss will be as follows:
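Written out, the standard binary form is:

    logloss = - (1/N) · Σ_i [ y_i · log(p_i) + (1 - y_i) · log(1 - p_i) ]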

where

  • N is the number of samples
  • y_i ∈ {0, 1} is the actual value (false or true) of the i-th sample in the original dataset
  • p_i is the predicted probability that the i-th sample's output equals 1

Logloss can also be utilised in multi-category classification problems, as follows:
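The usual multi-class form is:

    logloss = - (1/N) · Σ_i Σ_c y_{i,c} · log(p_{i,c})

where y_{i,c} is 1 if the i-th sample belongs to category c (and 0 otherwise), and p_{i,c} is the predicted probability that it does.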

In this equation, C is the number of categories.
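As a final illustration, here is a minimal sketch using scikit-learn's log_loss with made-up labels and predicted class probabilities for three categories:

    from sklearn.metrics import log_loss

    # Made-up actual labels and predicted probabilities over three categories
    y_true = [0, 2, 1, 2]
    y_prob = [[0.8, 0.1, 0.1],
              [0.2, 0.2, 0.6],
              [0.1, 0.7, 0.2],
              [0.3, 0.3, 0.4]]

    # Lower logloss means the predicted distribution is closer to the actual labels
    print(log_loss(y_true, y_prob))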

Summary & Tips

Photo by Aaron Burden on Unsplash

So far, we have introduced three different types of evaluation metrics specifically for classification machine learning models:

  • Precision and Recall (Average Precision Score)
  • ROC and AUC
  • Logistic Loss Function

This article is meant to introduce these evaluation metrics, not to compare them, because that could easily fill a journal paper. However, if you wish to know how to choose an appropriate metric for your classification model among these three, I can provide a rough guideline as follows.

  • If you’re dealing with a multi-classification problem with extremely imbalanced categories, i.e. some categories have overwhelmingly more samples than the others, you may use the Average Precision Score.
  • If you believe the rank of your predictions matters more than the magnitude of the errors they produce, you might want to use ROC and AUC.
  • If you do care about how large the errors of your predictions are, rather than how they are distributed, you should use Logistic Loss to increase your model's sensitivity to error.

Join Medium with my referral link – Christopher Tao

If you feel my articles are helpful, please consider joining Medium Membership to support me and thousands of other writers! (Click the link above)

