Common Classification Model Evaluation Metrics

Oduor George
Towards Data Science
5 min read · May 8, 2019


"All models are wrong, but some are useful." - George E. P. Box

Introduction.

How accurate is a classification model? Is the model reliable?

These two questions are answered by evaluating how well a model performs when subjected to unseen observations. This post illustrates some of the most common ways classification models can be evaluated.

What you will learn from this post:

  1. Jaccard index.
  2. Confusion Matrix
  3. F-1 Score
  4. Log loss

Sample model.

First, I will fit a simple model and use it to illustrate how these methods are applied in model performance evaluation. The model predicts whether a breast tumor is malignant or benign.

# quick model fit
import numpy as np
import pandas
import warnings
warnings.filterwarnings("ignore")  # not recommended in general; included here for convenience
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X = pandas.DataFrame(data=data.data, columns=data.feature_names)
y = data.target  # 0 = malignant, 1 = benign
# train/test split
from sklearn import model_selection
np.random.seed(2)  # so you can replicate the results shown here
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.30)
# I will use logistic regression
from sklearn.linear_model import LogisticRegression
reg = LogisticRegression()
reg.fit(X_train, y_train)
preds = reg.predict(X_test)            # predicted class labels
predsprob = reg.predict_proba(X_test)  # predicted class probabilities (used later for log loss)

Jaccard Index

Given predicted values (y hat) and actual values y, the Jaccard index is defined as the size of the intersection of the two sets divided by the size of their union:

J(y, ŷ) = |y ∩ ŷ| / (|y| + |ŷ| − |y ∩ ŷ|)

So let us say you have the following actual and predicted values (illustrative numbers):

y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
ŷ = [1, 1, 0, 0, 0, 1, 1, 1, 1, 1]

Eight of the ten labels match, so the Jaccard index will be:

J = 8 / (10 + 10 − 8) = 8 / 12 ≈ 0.67

The idea behind this index is that the higher the similarity of the two sets, the higher the index.
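As a quick sanity check, here is that same computation in Python (a minimal sketch using the illustrative vectors above):

import numpy as np
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
y_pred = np.array([1, 1, 0, 0, 0, 1, 1, 1, 1, 1])
matches = np.sum(y_true == y_pred)  # size of the intersection = 8
j = matches / (len(y_true) + len(y_pred) - matches)
print(round(j, 2))  # 0.67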

Applying this to the model above.

from sklearn.metrics import jaccard_similarity_score
j_index = jaccard_similarity_score(y_true=y_test,y_pred=preds)
round(j_index,2)
0.94
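A version note: from scikit-learn 0.21 onward, jaccard_similarity_score was deprecated (and later removed) in favour of jaccard_score. Be aware the two are not interchangeable: for binary labels the old function's default behaviour coincided with plain accuracy, while jaccard_score computes the per-class ratio TP / (TP + FP + FN):

from sklearn.metrics import jaccard_score
# Jaccard index of the positive class (label 1) on newer scikit-learn versions
j_index = jaccard_score(y_true=y_test, y_pred=preds, pos_label=1)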

Confusion matrix

The confusion matrix is used to describe the performance of a classification model on a set of test data for which true values are known.

                      Predicted Positive    Predicted Negative
  Actual Positive            TP                    FN
  Actual Negative            FP                    TN

From the confusion matrix the following information can be extracted :

  1. True Positive (TP): the model correctly predicted a Positive case as Positive, e.g. an illness is diagnosed as present and truly is present.
  2. False Positive (FP): the model incorrectly predicted a Negative case as Positive, e.g. an illness is diagnosed as present but is actually absent (Type I error).
  3. False Negative (FN): the model incorrectly predicted a Positive case as Negative, e.g. an illness is diagnosed as absent but is actually present (Type II error).
  4. True Negative (TN): the model correctly predicted a Negative case as Negative, e.g. an illness is diagnosed as absent and truly is absent. The sketch after this list shows how to extract these four counts in code.
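These four counts can be read directly off scikit-learn's confusion matrix. A minimal sketch, using y_test and preds from the sample model (scikit-learn sorts the labels, so for binary data the matrix is laid out as [[TN, FP], [FN, TP]]):

from sklearn.metrics import confusion_matrix
# ravel() flattens the 2x2 matrix row by row: TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
print("TP =", tp, "FP =", fp, "FN =", fn, "TN =", tn)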

Applying this to the model above.

from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, preds, labels=[1, 0]))  # printed with label order [1, 0]
import seaborn as sns
import matplotlib.pyplot as plt
# the heatmap below uses the default sorted label order [0, 1]
sns.heatmap(confusion_matrix(y_test, preds), annot=True, fmt="d", lw=2, cbar=False)
plt.ylabel("True Values")
plt.xlabel("Predicted Values")
plt.title("CONFUSION MATRIX VISUALIZATION")
plt.show()

In this case, for the breast cancer data, the model correctly predicts 62 of the malignant cases (label 0) and 98 of the benign cases (label 1). In contrast, it mispredicts a total of 11 cases.

F1-Score.

This comes from the confusion matrix. Based on the confusion matrix above, we can calculate the precision and recall scores.

Precision score: this measures the accuracy of the model's positive predictions. Simply put, it answers the following question: of all the cases predicted to belong to a class, how many actually belong to it? The answer to this question should be as high as possible.

It can be calculated as follows:

Precision = TP / (TP + FP)

Recall score (Sensitivity): this is the true positive rate; of all the cases that are actually positive, how many did the model identify as positive? It is calculated as:

Recall = TP / (TP + FN)
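With the tn, fp, fn, tp counts unpacked in the confusion-matrix sketch above, both scores are one-liners (a quick illustration of the two formulas):

precision = tp / (tp + fp)  # of all cases predicted positive, how many truly are
recall = tp / (tp + fn)     # of all truly positive cases, how many were caught
print(round(precision, 2), round(recall, 2))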

The F1 score is calculated from the precision and recall of each class. It is the harmonic mean of the precision and recall scores, reaching its best value at 1 and its worst at 0. It is a very good way to show that a classifier has both good precision and good recall.

We can calculate it using this formula:

F1 = 2 × (Precision × Recall) / (Precision + Recall)
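Continuing the sketch above, the F1 score then follows directly from those two values, and it reproduces the f1_score result shown below:

f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
print(round(f1, 4))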

Applying to the model above.

from sklearn.metrics import f1_score
f1_score(y_test,preds)
0.9468599033816425

The F1 score can be calculated for each class, so an average of the per-class scores can also be reported, as shown in the classification report below.

from sklearn.metrics import classification_report
print(classification_report(y_test,preds))
              precision    recall  f1-score   support

           0       0.91      0.93      0.92        67
           1       0.95      0.94      0.95       104

   micro avg       0.94      0.94      0.94       171
   macro avg       0.93      0.93      0.93       171
weighted avg       0.94      0.94      0.94       171

Log loss.

We can use log loss in cases where the output of the classifier is a class probability rather than a class label, as with logistic regression models.

Log loss measures the performance of a model where the predicted outcome is a probability value between 0 and 1.

In practice, predicting a probability of 0.101 when the true label is 1 would result in a high log loss. Log loss can be calculated for each row in the data set using the log loss equation, where ŷ is the predicted probability that the label is 1:

LogLoss_row = −( y · log(ŷ) + (1 − y) · log(1 − ŷ) )

The equation simply measures how far each predicted probability is from the actual label. Averaging the log loss over all N rows gives the overall log loss for the model:

LogLoss = −(1/N) Σᵢ ( yᵢ · log(ŷᵢ) + (1 − yᵢ) · log(1 − ŷᵢ) )
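For intuition, here is a hand-rolled version of this averaged equation for binary labels (a sketch only; scikit-learn's log_loss used below handles probability clipping and multiclass cases for you):

import numpy as np

def binary_log_loss(y_true, p_hat):
    # p_hat holds the predicted probability of class 1 for each row
    p_hat = np.clip(p_hat, 1e-15, 1 - 1e-15)  # guard against log(0)
    return -np.mean(y_true * np.log(p_hat) + (1 - y_true) * np.log(1 - p_hat))

# e.g. binary_log_loss(y_test, predsprob[:, 1]) should closely match the
# sklearn value reported below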

A good model should have a small log loss value.

Applying this to the model above.

from sklearn.metrics import log_loss
log_loss(y_test,predsprob)
0.13710589473837184

And there we have a log loss of about 0.14, which is pretty good!

Conclusions.

  1. The choice of evaluation metric should be based on a sound understanding of the model being applied and the problem at hand.
  2. Feature engineering and parameter tuning are recommended in order to get excellent results on these evaluation metrics.

Thanks for reading, any comments and/or additions are welcome.

Cheers!


George holds a BSc in Statistics and Programming from Kenyatta University and has an interest in machine learning, data science and AI.