This is part 1 of a 2-article series in which we discuss evaluation metrics for Machine Learning (ML) problems. Evaluating an algorithm's output is as important as modeling the algorithm itself: evaluation tells us how well a model performs and how it could be improved. In this article, we review evaluation metrics for Classification. So, let's begin.
Confusion Matrix
A Confusion Matrix is an N x N matrix, where N is the number of categories in the target variable (for example, 1 and 0 are the two classes/categories in the Survived column of the Titanic dataset).

We first note whether the prediction is Positive (P) or Negative (N), and then, based on the actual value, decide whether it is True (T) or False (F).
╔═══════════╦════╦════╦════╦════╦════╦════╦════╦════╗
║ Actual ║ 1 ║ 1 ║ 1 ║ 0 ║ 1 ║ 1 ║ 0 ║ 0 ║
╠═══════════╬════╬════╬════╬════╬════╬════╬════╬════╣
║ Predicted ║ 0 ║ 1 ║ 0 ║ 0 ║ 1 ║ 1 ║ 1 ║ 0 ║
╠═══════════╬════╬════╬════╬════╬════╬════╬════╬════╣
║ ║ FN ║ TP ║ FN ║ TN ║ TP ║ TP ║ FP ║ TN ║
╚═══════════╩════╩════╩════╩════╩════╩════╩════╩════╝
Table 1. Sample Data
Consider the table above. For the first case, the predicted value is 0, so we write Negative; since the actual value contradicts the prediction, we write False, making it a False Negative (FN). Similarly, we evaluate every case, count the occurrences of each outcome, and fill out the confusion matrix.
# Confusion matrix of sample data (Table 1); labels=[1, 0] puts the positive class first
from sklearn.metrics import confusion_matrix
actual    = [1, 1, 1, 0, 1, 1, 0, 0]   # from Table 1
predicted = [0, 1, 0, 0, 1, 1, 1, 0]
confusion_matrix(actual, predicted, labels=[1, 0])
╔═══╦═══╗
║ 3 ║ 2 ║
╠═══╬═══╣
║ 1 ║ 2 ║
╚═══╩═══╝
Cost of Classification
Cost of Classification (CoC) measures the total cost incurred by a classification model's predictions. We compute it by assigning weights to the cells of the confusion matrix. In simple terms, we grant a reward when the classification is correct and impose a penalty when it is wrong.
╔════════╦═══════════════════╗
║ ║ PREDICTED CLASS ║
╠════════╬════════╦════╦═════╣
║ ║ C(i|j) ║ + ║ - ║
║ ACTUAL ╠════════╬════╬═════╣
║ ║ + ║ -1 ║ 100 ║
║ CLASS ╠════════╬════╬═════╣
║ ║ - ║ 1 ║ 0 ║ C(i|j): Cost of misclassifying class
╚════════╩════════╩════╩═════╝ j example as class i
Table 2
Here, -1 is the reward for a correct positive prediction, 100 is the penalty for a missed positive (FN), and 1 is the penalty for a false alarm (FP). Hence, for the above confusion matrix, the CoC is:
# Cost of sample data (Table 1)
(3 x (-1)) + (2 x 100) + (1 x 1) + (2 x 0) = 198
The lower the cost, the better the model.
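As a minimal sketch, the same cost can be computed with NumPy by taking the element-wise product of the confusion matrix (Table 1) and the cost matrix (Table 2) and summing the result:
# Cost of sample data via element-wise product of confusion and cost matrices
import numpy as np
# rows = actual class (+ then -), columns = predicted class (+ then -)
conf = np.array([[3, 2],
                 [1, 2]])
cost = np.array([[-1, 100],
                 [ 1,   0]])
print(np.sum(conf * cost))   # 198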
Accuracy
The most commonly used evaluation metric is Accuracy. It is the ratio of correct predictions to the total number of predictions.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
# Accuracy of sample data (Table 1)
from sklearn.metrics import accuracy_score
accuracy_score(actual, predicted)
0.625
Accuracy v. CoC
Choosing between Accuracy and Cost depends on the domain for which we are modeling the ML solution. Consider two models, M₁ and M₂, with the following confusion matrices, accuracies, and costs (using the cost matrix from Table 2).
╔════════╦══════════════════╗ ╔════════╦══════════════════╗
║ M₁ ║ PREDICTED CLASS ║ ║ M₂ ║ PREDICTED CLASS ║
╠════════╬══════╦═════╦═════╣ ╠════════╬══════╦═════╦═════╣
║ ║ ║ + ║ - ║ ║ ║ ║ + ║ - ║
║ ACTUAL ╠══════╬═════╬═════╣ ║ ACTUAL ╠══════╬═════╬═════╣
║ ║ + ║ 150 ║ 40 ║ ║ ║ + ║ 250 ║ 45 ║
║ CLASS ╠══════╬═════╬═════╣ ║ CLASS ╠══════╬═════╬═════╣
║ ║ - ║ 60 ║ 250 ║ ║ ║ - ║ 5 ║ 200 ║
╚════════╩══════╩═════╩═════╝ ╚════════╩══════╩═════╩═════╝
Table 3 Table 4
Accuracy = 80% Accuracy = 90%
Cost = 3910 Cost = 4255
Choosing between the two models depends on how flexible a company is towards accuracy or cost. Consider a scenario where you are developing a model to classify patients with liver disorders. Here, we can tolerate a few False Positives (Type I errors), but we cannot afford False Negatives (Type II errors). As a result, we would want to pick the model with higher accuracy. Alternatively, in a manufacturing company, where a few errors are affordable but cost is of the essence, the model with the lower cost would be our pick.
Cost is proportional to accuracy when:
C(+|-) = C(-|+) = q and C(+|+) = C(-|-) = p
╔════════╦═════════════════╗
║ ║ PREDICTED CLASS ║ N = a + b + c + d
╠════════╬═══════╦════╦════╣
║ ║ ║ + ║ - ║
║ ACTUAL ╠═══════╬════╬════╣ Accuracy = (a + d)/N
║ ║ + ║ a ║ b ║
║ CLASS ╠═══════╬════╬════╣
║ ║ - ║ c ║ d ║
╚════════╩═══════╩════╩════╝
Confusion Matrix
╔════════╦═════════════════╗
║ ║ PREDICTED CLASS ║ Cost = p(a+d) + q(b+c)
╠════════╬═════════╦═══╦═══╣ = p(a+d) + q(N-a-d)
║ ║ C(i|j) ║ + ║ - ║ = qN - (q - p)(a + d)
║ ACTUAL ╠═════════╬═══╬═══╣ = N[q-(q-p)x Accuracy]
║ ║ + ║ p ║ q ║
║ CLASS ╠═════════╬═══╬═══╣
║ ║ - ║ q ║ p ║
╚════════╩═════════╩═══╩═══╝
Cost Matrix
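To see the identity in action, here is a quick numeric check; the values of p, q and the counts a, b, c, d below are purely illustrative:
# Numeric check: cost computed directly vs. via the accuracy identity
a, b, c, d = 150, 40, 60, 250    # illustrative confusion-matrix counts
p, q = -1, 1                     # reward for correct, penalty for wrong predictions
N = a + b + c + d

accuracy = (a + d) / N
cost_direct = p * (a + d) + q * (b + c)
cost_via_accuracy = N * (q - (q - p) * accuracy)
print(cost_direct, cost_via_accuracy)   # both print -300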
Rate Comparisons
We cannot rely solely on Accuracy for evaluation. Consider the following case where the data is imbalanced.
╔════════╦═════════════════╗
║ ║ PREDICTED CLASS ║
╠════════╬═══════╦═══╦═════╣
║ ║ ║ + ║ - ║
║ ACTUAL ╠═══════╬═══╬═════╣
║ ║ + ║ 4 ║ 2 ║
║ CLASS ╠═══════╬═══╬═════╣
║ ║ - ║ 8 ║ 486 ║
╚════════╩═══════╩═══╩═════╝
Table 5
The above model performs poorly on the rare positive class, and yet its accuracy is (486 + 4)/(4 + 2 + 8 + 486) = 98%. Accuracy is not sufficient for this type of classification problem; it is simply not robust enough. To overcome the problem of imbalanced data, the following metrics can be used.
True Positive Rate
TPR = TP / (TP + FN)
Its value lies between 0 and 1; the higher the TPR, the better the model. It represents the proportion of actual positives that the model correctly identifies (it is also known as Sensitivity).
False Negative Rate
FNR = FN / (TP + FN)
Just like TPR, its range is 0 to 1. It represents the proportion of actual positives that are incorrectly predicted as negative. The lower, the better.
Similarly, we have TNR and FPR.
True Negative Rate
TNR = TN / (TN + FP)
It represents the proportion of actual negatives that are correctly identified; the higher, the better.
False Positive Rate
FPR = FP / (TN + FP)
It represents the proportion of actual negatives that are incorrectly predicted as positive; the lower, the better.
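As a small sketch, the four rates for the imbalanced example in Table 5 can be computed directly from its counts:
# Rates for the imbalanced data in Table 5
TP, FN, FP, TN = 4, 2, 8, 486

TPR = TP / (TP + FN)   # 0.67 - proportion of actual positives caught
FNR = FN / (TP + FN)   # 0.33 - proportion of actual positives missed
TNR = TN / (TN + FP)   # 0.98 - proportion of actual negatives caught
FPR = FP / (TN + FP)   # 0.02 - proportion of actual negatives misclassified
print(TPR, FNR, TNR, FPR)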
Precision
Precision is an evaluation metric that tells us, out of all positive predictions, how many are actually positive. It is used when we cannot afford to have False Positives (FP).
Precision = TP / (TP + FP)
Recall
Recall tells us, out of all actual positives, how many are predicted as positive. It is used when we cannot afford to have False Negatives (FN). A low recall means the model is missing many actual positives.
Recall = TP / (TP + FN)
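Precision and recall for the sample data in Table 1 can also be computed with scikit-learn; the lists from Table 1 are restated here so the snippet runs on its own:
# Precision and recall of sample data (Table 1), with class 1 as the positive class
from sklearn.metrics import precision_score, recall_score
actual    = [1, 1, 1, 0, 1, 1, 0, 0]
predicted = [0, 1, 0, 0, 1, 1, 1, 0]
print(precision_score(actual, predicted))   # 0.75 -> TP / (TP + FP) = 3 / 4
print(recall_score(actual, predicted))      # 0.60 -> TP / (TP + FN) = 3 / 5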
Precision v. Recall Curve
A precision-recall curve plots precision against recall as the classification threshold is varied, making the trade-off between the two visible.
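Such a curve is drawn from predicted probabilities rather than hard labels. Below is a minimal sketch using scikit-learn and Matplotlib; the probability scores are made up purely for illustration:
# Sketch: precision-recall curve from (hypothetical) predicted probabilities
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

y_true  = [1, 1, 1, 0, 1, 1, 0, 0]                    # actual classes from Table 1
y_score = [0.45, 0.9, 0.3, 0.2, 0.8, 0.7, 0.6, 0.1]   # hypothetical probabilities of class 1

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
plt.plot(recall, precision, marker="o")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve")
plt.show()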
F1-Score
Sometimes it is not clear whether to prioritize precision or recall. In that case, the two can be combined into a single evaluation metric, the F1-score, which is the harmonic mean of precision and recall.
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Because it is a harmonic mean, the F1-score is high only when both precision and recall are high; for a fixed sum of the two, it is maximized when precision equals recall.
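As a quick check, the F1-score of class 1 in the sample data follows directly from its precision (0.75) and recall (0.60), matching the classification report below:
# F1-score for class 1 of the sample data (Table 1)
precision, recall = 0.75, 0.60
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))   # 0.67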
# Classification report of sample data (Table 1)
from sklearn.metrics import classification_report
print(classification_report(actual, predicted))
╔══════════════════╦═══════════╦════════╦══════════╦═════════╗
║ ║ precision ║ recall ║ f1-score ║ support ║
╠══════════════════╬═══════════╬════════╬══════════╬═════════╣
║ 0 ║ 0.50 ║ 0.67 ║ 0.57 ║ 3 ║
╠══════════════════╬═══════════╬════════╬══════════╬═════════╣
║ 1 ║ 0.75 ║ 0.60 ║ 0.67 ║ 5 ║
╠══════════════════╬═══════════╬════════╬══════════╬═════════╣
║ ║ ║ ║ ║ ║
╠══════════════════╬═══════════╬════════╬══════════╬═════════╣
║ accuracy ║ ║ ║ 0.62 ║ 8 ║
╠══════════════════╬═══════════╬════════╬══════════╬═════════╣
║ macro average ║ 0.62 ║ 0.63 ║ 0.62 ║ 8 ║
╠══════════════════╬═══════════╬════════╬══════════╬═════════╣
║ weighted average ║ 0.66 ║ 0.62 ║ 0.63 ║ 8 ║
╚══════════════════╩═══════════╩════════╩══════════╩═════════╝
Classification Report
NOTE
- Precision, Recall, and F1-score are cost-sensitive, so you do not have to reason about accuracy and cost separately.
- Precision is biased towards C(+|+) and C(+|-).
- Recall is biased towards C(+|+) and C(-|+).
- F1-score is biased towards all except C(-|-).
- The performance of even a well-built algorithm also depends on the class distribution of the target variable, the cost of misclassification, and the sizes of the training and test sets.
- F1-score lacks interpretability on its own, so it should be used in combination with other evaluation metrics. Usually, a combination of two metrics is enough, depending on the use case.
Thresholding
Some models predict class probabilities rather than the classes (0/1) themselves. In that case, a threshold value can be set to convert the predicted probabilities into class labels, as in the sketch below. Let us look at this in more detail.
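A minimal sketch of thresholding, assuming a hypothetical array of predicted probabilities and the commonly used default threshold of 0.5:
# Converting predicted probabilities into class labels with a threshold
import numpy as np
probabilities = np.array([0.95, 0.40, 0.62, 0.10, 0.51])   # hypothetical model outputs
threshold = 0.5
predicted_classes = (probabilities >= threshold).astype(int)
print(predicted_classes)   # [1 0 1 0 1]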
AUC-ROC
AUC-ROC is the abbreviation of Area Under the Curve – Receiver Operating Characteristic. The name comes from signal detection theory, where it was originally used to distinguish signal from noise. It is an evaluation metric for binary classification that gives the trade-off between the False Positive Rate and the True Positive Rate.
Each point on the ROC curve represents the classification performance (FPR, TPR) of the model at a particular operating point. Changing the algorithm's threshold, the sample distribution, or the cost matrix changes the location of that point. Consider a 1D data set containing 2 classes – 0 and 1 – separated by a threshold t: any instance with a value greater than t is classified as class 1. Sweeping t and plotting the resulting (FPR, TPR) pairs traces out the ROC curve, and the area under that curve is the AUC-ROC value. The higher the AUC-ROC, the better the model.
A random classifier corresponds to the diagonal of the ROC plot, which encloses a triangle of area 0.5 (half of a square with side 1). Hence, any model that does better than random guessing has an AUC-ROC greater than 0.5.
How to plot an ROC Curve
╔══════════╦════════╦════════════╗
║ Instance ║ P(+|A) ║ True Class ║ Steps:
╠══════════╬════════╬════════════╣ 1. Calculate P(+|A) for each
║ 1 ║ 0.95 ║ + ║ instance and sort them in
╠══════════╬════════╬════════════╣ descending order.
║ 2 ║ 0.93 ║ + ║
╠══════════╬════════╬════════════╣ 2. Take first probability as
║ 3 ║ 0.87 ║ - ║ threshold and calculate TPR
╠══════════╬════════╬════════════╣ and FPR.
║ 4 ║ 0.85 ║ - ║
╠══════════╬════════╬════════════╣ 3. Repeat calculation of TPR and
║ 5 ║ 0.85 ║ - ║ FPR with every value of
╠══════════╬════════╬════════════╣ P(+|A) as threshold.
║ 6 ║ 0.85 ║ + ║
╠══════════╬════════╬════════════╣ 4. Plot FPR vs TPR.
║ 7 ║ 0.76 ║ - ║
╠══════════╬════════╬════════════╣
║ 8 ║ 0.53 ║ + ║
╠══════════╬════════╬════════════╣
║ 9 ║ 0.43 ║ - ║
╠══════════╬════════╬════════════╣
║ 10 ║ 0.25 ║ + ║
╚══════════╩════════╩════════════╝
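The same steps can be carried out with scikit-learn and Matplotlib. Here is a minimal sketch using the ten instances from the table above (+ encoded as 1, - as 0):
# ROC curve for the instances in the table above
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [1, 1, 0, 0, 0, 1, 0, 1, 0, 1]
y_score = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(roc_auc_score(y_true, y_score))   # 0.56 for this toy data

plt.plot(fpr, tpr, marker="o")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.show()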
Now, can we use this method to compare two models? One caveat is that AUC-ROC only considers the order of the predicted probabilities, not their actual values. If two models rank the instances identically, their TPR and FPR values are the same at every threshold, so their ROC curves and AUC values are identical even though the models are not estimating the same probabilities. For example, Model 1 may predict class 1 for an instance with a probability of 96%, while Model 2 predicts class 1 for the same instance with a probability of 89%; as long as the ordering of the instances is unchanged, both models receive the same AUC-ROC. Hence, only the order of the probabilities matters, and AUC-ROC alone cannot tell such models apart, as the sketch below illustrates.
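A small illustration with made-up probabilities: the two models below rank the instances identically and therefore receive the same AUC-ROC, even though their probability estimates differ.
# Two models with different probabilities but identical ranking of instances
from sklearn.metrics import roc_auc_score

y_true  = [1, 0, 1, 0, 1]
model_1 = [0.96, 0.40, 0.85, 0.30, 0.70]   # hypothetical probabilities of class 1
model_2 = [0.89, 0.25, 0.60, 0.10, 0.55]   # different values, same ordering

print(roc_auc_score(y_true, model_1))   # 1.0
print(roc_auc_score(y_true, model_2))   # 1.0 - identical AUC despite different probabilities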
Log Loss
Log Loss is the negative average of the logarithm of the corrected predicted probabilities of the instances. The predicted probability is the probability of class 1; when the actual class is 0, the corrected probability is obtained by subtracting the predicted value from 1, and when the actual class is 1, the predicted probability is used as-is.
╔════════╦═════════════╦════════════╦═════════╗
║ Actual ║ Predicted ║ Corrected ║ log ║
║ ║ Probability ║ Predicted ║ ║
╠════════╬═════════════╬════════════╬═════════╣
║ 1 ║ 0.94 ║ 0.94 ║ -0.0268 ║
╠════════╬═════════════╬════════════╬═════════╣
║ 0 ║ 0.56 ║ 1 - 0.56 = ║ -0.3565 ║
║ ║ ║ 0.44 ║ ║
╠════════╬═════════════╬════════════╬═════════╣
║ 0 ║ 0.1 ║ 1 - 0.1 = ║ -0.0457 ║
║ ║ ║ 0.9 ║ ║
╚════════╩═════════════╩════════════╩═════════╝
Table 6
Log Loss = -(1/N) × Σᵢ [ yᵢ × log(pᵢ) + (1 - yᵢ) × log(1 - pᵢ) ]
Here, yᵢ is the actual class, pᵢ is the predicted probability of class 1, and 1 - pᵢ is the corrected predicted probability when the actual class is 0. The lower the Log Loss, the better, which makes it a robust metric for comparing two models.
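A minimal sketch reproducing the numbers in Table 6; note that the table uses base-10 logarithms, whereas scikit-learn's log_loss uses the natural logarithm:
# Log Loss for the data in Table 6 (base-10 logs, to match the table)
import numpy as np

y_true = np.array([1, 0, 0])
y_prob = np.array([0.94, 0.56, 0.10])                  # predicted probability of class 1

corrected = np.where(y_true == 1, y_prob, 1 - y_prob)  # corrected predicted probabilities
print(round(-np.mean(np.log10(corrected)), 3))         # 0.143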
In the next part of this series, we will discuss evaluation metrics for Regression problems.