
Analyzing the Performance of Classification Models in Machine Learning

Probing into the basics of Confusion Matrix, ROC-AUC curve, and Cost Functions for Classification in Machine Learning.

Photo by Safar Safarov on Unsplash

Confusion Matrix

The confusion matrix (also called the error matrix) is used to analyze how well classification models (like Logistic Regression, Decision Tree Classifier, etc.) perform. Why do we analyze the performance of a model? It helps us find and eliminate bias and variance problems, if they exist, and it also helps us fine-tune the model so that it produces more accurate results. The Confusion Matrix is usually applied to binary classification problems but can be extended to multi-class classification problems as well.

Terminologies of Confusion Matrix

Confusion Matrix for Binary Classification. Image Source

Concepts are comprehended better when illustrated with examples, so let us consider one. Assume that a family went to get tested for COVID-19.

True Positive (TP): True Positives are the cases that have been predicted as positive and indeed have the disease.

False Positive (FP): False Positives are the cases that have been predicted as positive but do not have the disease.

True Negative (TN): True Negatives are the cases that have been predicted as negative and indeed do not have the disease.

False Negative (FN): False Negatives are the cases that have been predicted as negative but do have the disease.

Sensitivity: Sensitivity is also called Recall or the True Positive Rate. Sensitivity is the proportion of actual positives that are correctly predicted as positive. In other words, Sensitivity is the ratio of True Positives to the sum of True Positives and False Negatives.

Sensitivity = TP / (TP + FN)

Specificity: Specificity is also called True Negative Rate. Specificity is the proportion of actual negatives that are correctly predicted as negatives. In other words, Specificity is the ratio of the True Negatives to the Sum of True Negatives and False Positives.

Specificity = TN / (TN + FP)

Precision: Precision is the proportion of predicted positives that are actually positive. In other words, Precision is the ratio of True Positives to the sum of True Positives and False Positives.

Precision = TP / (TP + FP)

F1 Score: F1 Score is defined as the harmonic mean of Precision and Recall. It ranges from 0 (worst) to 1 (best). F1 Score can be used when the data suffers from class imbalance, since it considers both False Positives and False Negatives.

F1 Score = 2 × (Precision × Recall) / (Precision + Recall)

Accuracy: Accuracy of a model is defined as the ratio of the sum of True Positives and True Negatives to the total number of predictions. It ranges from 0 to 1 (or 0 to 100 when expressed as a percentage). Accuracy can be used when True Positives and True Negatives are equally important, which is typically the case when the classes are roughly balanced.

Accuracy = (TP + TN) / (TP + TN + FP + FN)
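To make these formulas concrete, here is a minimal sketch in Python using scikit-learn; the labels are hypothetical test results (1 = positive, 0 = negative). Specificity has no dedicated helper here, so it is computed from the confusion matrix directly.

```python
# Minimal sketch of the metrics above (hypothetical labels).
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # actual outcomes (hypothetical)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # model predictions (hypothetical)

# For binary labels {0, 1}, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)  # Recall / True Positive Rate
specificity = tn / (tn + fp)  # True Negative Rate
print(sensitivity, specificity)            # 0.75 0.75

print(recall_score(y_true, y_pred))        # Sensitivity, same as above
print(precision_score(y_true, y_pred))     # Precision: TP / (TP + FP)
print(f1_score(y_true, y_pred))            # harmonic mean of the two
print(accuracy_score(y_true, y_pred))      # (TP + TN) / total predictions
```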

ROC-AUC Curve

The ROC-AUC curve (Receiver Operating Characteristic / Area Under the Curve) helps analyze the performance of a classifier at various threshold settings. A high True Positive Rate (TPR / Sensitivity) for a class indicates that the model classifies that class well. ROC curves can be compared across models, and the model with the higher AUC (Area Under the Curve) is considered to have performed better. In other words, that model produces a high TPR at various threshold settings.

ROC curve at various threshold settings. PC: Author
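As a quick sketch, the curve and its AUC can be computed with scikit-learn's roc_curve and roc_auc_score; the probabilities below are hypothetical stand-ins for real model outputs (e.g. from model.predict_proba).

```python
# Minimal sketch: ROC curve and AUC from hypothetical predicted probabilities.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 0, 1, 0, 1]                      # actual labels
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.45, 0.9]  # P(class = 1)

fpr, tpr, thresholds = roc_curve(y_true, y_scores)  # one point per threshold
auc = roc_auc_score(y_true, y_scores)

plt.plot(fpr, tpr, label=f"model (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="random classifier")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate (Sensitivity)")
plt.legend()
plt.show()
```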

Cost functions for Classification

Cost functions help measure how well a model performs by comparing its predicted values with the actual values.

Cross-Entropy loss

Cross-Entropy loss is also called Log loss. Log loss can be applied to binary classification problems, where the targets are binary, as well as to multi-class classification problems. Let C denote the number of classes in the target variable.

If C = 2 (binary classification), the log loss or binary cross-entropy loss is calculated as follows:

  • When the actual value y = 0, only the term −(1 − y) · log(1 − 𝑦̂) remains, where 𝑦̂ is the prediction of y.
  • When the actual value y = 1, only the term −y · log(𝑦̂) remains.
Binary Cross-Entropy Loss = −[y · log(𝑦̂) + (1 − y) · log(1 − 𝑦̂)]

**Graph for −y · log(𝑦̂) when y = 1 (y is the actual value)**

PC: Author. Done using Desmos

**Graph for −(1 − y) · log(1 − 𝑦̂) when y = 0**

PC: Author. Done using Desmos
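Putting both terms together, here is a minimal NumPy sketch of binary cross-entropy; the labels and predictions are hypothetical, and the small eps clipping avoids taking log(0).

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Mean binary cross-entropy (log loss) over all instances."""
    y = np.asarray(y, dtype=float)
    y_hat = np.clip(y_hat, eps, 1 - eps)  # keep predictions away from 0 and 1
    # -[y*log(y_hat) + (1 - y)*log(1 - y_hat)], averaged over the data
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(binary_cross_entropy([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.3]))  # mostly right -> low loss
print(binary_cross_entropy([1, 0, 1, 0], [0.2, 0.9, 0.1, 0.7]))  # confidently wrong -> high loss
```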

If C > 2 (multi-class classification), the log loss or multi-class cross-entropy loss is calculated as follows:

Multi-class Cross-Entropy Loss = −Σ (c = 1 to C) y_c · log(𝑦̂_c)

Multi-class Cross-Entropy Loss is defined for a single instance of data, while Multi-class Cross-Entropy Error averages the loss over the entire set of instances.
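A minimal NumPy sketch of both quantities, assuming one-hot true labels and softmax-style predicted probabilities (both hypothetical):

```python
import numpy as np

def multi_class_cross_entropy(y_true, y_pred, eps=1e-12):
    """y_true: one-hot labels, shape (n, C); y_pred: probabilities, shape (n, C)."""
    y_pred = np.clip(y_pred, eps, 1.0)                  # avoid log(0)
    losses = -np.sum(y_true * np.log(y_pred), axis=1)   # loss per instance
    return np.mean(losses)                              # error over all instances

y_true = np.array([[0, 1, 0], [1, 0, 0]])               # one-hot labels, C = 3
y_pred = np.array([[0.1, 0.8, 0.1], [0.7, 0.2, 0.1]])   # predicted probabilities
print(multi_class_cross_entropy(y_true, y_pred))        # ~0.29
```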

Sparse Multi-class Cross-Entropy Loss

Sparse Multi-class Cross-Entropy Loss is very similar to Multi-class Cross-Entropy Loss; the only difference is the representation of the true labels.

Sparse Multi-class Cross-Entropy Loss = −log(𝑦̂_c), where c is the integer index of the true class

In Multi-class Cross-Entropy loss, the true labels are one-hot encoded, whereas in Sparse Multi-class Cross-Entropy loss, the true labels are kept as integer class indices, thereby reducing the computation time.

Representation of the true labels y in Multi-class Cross-Entropy loss (one-hot vectors, e.g. with C = 3):

y ∈ {[1, 0, 0], [0, 1, 0], [0, 0, 1]}

Representation of the true labels y in Sparse Multi-class Cross-Entropy loss (integer class indices, e.g. with C = 3):

y ∈ {0, 1, 2}
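A minimal NumPy sketch of the sparse variant, reusing the same hypothetical predictions as above; since the labels are integer indices, the loss reduces to looking up the predicted probability of the true class.

```python
import numpy as np

def sparse_multi_class_cross_entropy(y_true, y_pred, eps=1e-12):
    """y_true: integer class indices, shape (n,); y_pred: probabilities, shape (n, C)."""
    y_pred = np.clip(y_pred, eps, 1.0)                 # avoid log(0)
    picked = y_pred[np.arange(len(y_true)), y_true]    # probability of the true class
    return -np.mean(np.log(picked))

y_true = np.array([1, 0])                              # integer labels instead of one-hot
y_pred = np.array([[0.1, 0.8, 0.1], [0.7, 0.2, 0.1]])
print(sparse_multi_class_cross_entropy(y_true, y_pred))  # same value as the one-hot version
```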

Summary

  • F1 Score can be used when the dataset is imbalanced. A dataset is said to be imbalanced when one class has significantly more samples than another.
  • ROC-AUC curve is plotted using True Positive Rates and False Positive Rates at different threshold settings. ROC-AUC curve helps to find the optimal threshold for classification.
  • Cross-Entropy loss can be applied to both binary and multi-class classification problems.
  • Sparse Multi-class Cross-Entropy Loss is computationally faster than Multi-class Cross-Entropy Loss.



Connect with me on LinkedIn, Twitter!

Happy Machine Learning!

Thank you!

