
Introduction
Machine learning requires a robust training dataset, since the training data is the foundation for everything the model learns and for its subsequent evaluation. If the training data is corrupted, the model will perform badly and its accuracy will drop. Image classification is a thriving area of machine learning where a balanced and correct training dataset is extremely important. AI managers as well as data scientists need to ensure that balanced and clean data is loaded into the pipeline.
Classification Metrics
There are several metrics to evaluate a model. For image classification, we can run the trained model's predictions on the test directory. Scikit-Learn provides simple functions to evaluate a model's performance in terms of accuracy, precision, recall and F1 score, and a classification report combines all of these metrics. I will go through these metrics briefly.
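To make this concrete, here is a minimal sketch of how such a report could be produced with Scikit-Learn; the label arrays are made up purely for illustration.

```python
from sklearn.metrics import classification_report

# Hypothetical ground-truth labels and model predictions for a two-class problem
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]

# The report combines precision, recall, F1 score and support for each class
print(classification_report(y_true, y_pred, target_names=["class 0", "class 1"]))
```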
Precision
When we blindly guess between two options, there is a 50% chance of being wrong; how often our guesses turn out to be right is, loosely speaking, our precision. The same goes for machines. A model is trained on a dataset and, based on that training, it may identify all of the test data correctly or get some of it wrong. Let's say we want the model to distinguish images of cats from images of dogs, and we have already trained it on images of both. The model may identify every image correctly, or some images may be labeled wrongly. The related terminology is listed below:
- True-positive (TP): a cat identified as a cat
- False-positive (FP): a dog identified as a cat
- True-negative (TN): a dog identified as a dog
- False-negative (FN): a cat identified as a dog
It is evident that when a model fails to label something correctly, it increases the number of false positives or false negatives. We want the model to have the highest possible number of true positives and true negatives. Precision and recall come in handy to evaluate a model in terms of these outcomes. In short, precision measures how often the elements a model flags as positive really are positive. When a cat-and-dog model's precision for the cat class is high, it indicates that most of the images it labels as cats really are cats and very few dogs are labeled as cats. The model may also wrongly label some cat images as dogs, but that information is not captured by precision. We need recall for that.
Precision = TP/(TP+FP)
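As a small worked example of this formula (the counts are illustrative only), suppose the model sees four cats and four dogs:

```python
from sklearn.metrics import precision_score

# 1 = cat, 0 = dog
y_true = [1, 1, 1, 1, 0, 0, 0, 0]   # four cats, four dogs
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]   # three cats found, one dog mistaken for a cat

# TP = 3 (cats identified as cats), FP = 1 (a dog identified as a cat)
# Precision = TP / (TP + FP) = 3 / 4 = 0.75
print(precision_score(y_true, y_pred, pos_label=1))  # 0.75
```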

Recall
Recall is another metric for model evaluation. It measures how well a model detects all occurrences of a class.
Recall = TP/(TP+FN)
When the above-mentioned model's recall for the cat class is high, it indicates that it is able to correctly identify most of the cat images as cats and very few cats are identified as dogs.
We have two classes in this model, namely cats and dogs. For a class, if recall is high but precision is low, it means the model is biased towards that particular class. For example, if the cat class has high recall but low precision, the model correctly identifies most of the cat images as cats, but it also wrongly labels many dog images as cats. On the other hand, if the cat class has low recall but high precision, the model rarely labels a dog image as a cat, but it also wrongly labels many cat images as dogs.
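To make this bias concrete, here is a small sketch in which the predictions are fabricated so that the cat class gets high recall but low precision:

```python
from sklearn.metrics import precision_score, recall_score

# 1 = cat, 0 = dog
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
# The model labels almost everything as a cat: every cat is found (high recall)
# but three dogs are also called cats (low precision for the cat class)
y_pred = [1, 1, 1, 1, 1, 1, 1, 0]

print(recall_score(y_true, y_pred, pos_label=1))     # 1.00
print(precision_score(y_true, y_pred, pos_label=1))  # 0.57 (4 true cats out of 7 predicted cats)
```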

X-ray image dataset
I have written the blog post below on X-ray image classification between the normal and pneumonia classes. There I have demonstrated the model training and evaluation steps.
In this article, I will use the same dataset, but with some changes to reflect real-world issues. I have created three separate dataset variants, each with its own training, validation and test split; a short training sketch follows the dataset list below.
- First dataset: a balanced dataset with 500 normal images and 500 pneumonia images in the training set
- Second dataset: an unbalanced dataset with 500 normal images and 250 pneumonia images in the training set
- Third dataset: a mixed but balanced dataset, in which 250 normal images are intentionally labeled as pneumonia and 250 pneumonia images are intentionally labeled as normal
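Before looking at the results, the sketch below shows roughly how such a dataset can be loaded from class subfolders and a binary classifier trained with Keras. The paths, layer sizes and directory names are illustrative assumptions, not the exact code from the GitHub page referenced next.

```python
import tensorflow as tf

IMG_SIZE = (150, 150)

# Assumed layout: train/NORMAL, train/PNEUMONIA, val/NORMAL, val/PNEUMONIA
train_ds = tf.keras.utils.image_dataset_from_directory(
    "chest_xray/train", image_size=IMG_SIZE, batch_size=32, label_mode="binary")
val_ds = tf.keras.utils.image_dataset_from_directory(
    "chest_xray/val", image_size=IMG_SIZE, batch_size=32, label_mode="binary")

# A small CNN with a sigmoid output for the binary normal/pneumonia decision
model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 255, input_shape=IMG_SIZE + (3,)),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
history = model.fit(train_ds, validation_data=val_ds, epochs=50)
```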
This GitHub page provides the code blocks for training and evaluating the model. When the first dataset is used, we get the following loss and accuracy after 50 epochs.


Clearly, we have overfitted the model to the training data. We could improve the model with other techniques to reduce overfitting, but here our concern is the comparison between the balanced and unbalanced data reports.

The loss and accuracy for the second and third datasets are shown below.




The mixed dataset, which is really dirty, ends up with very low accuracy. Once the code is executed, each test image is displayed with its ground-truth label along with the predicted probabilities of being normal and of being pneumonia. We need to convert these predicted probabilities to binary labels to get a classification report.
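A minimal sketch of this conversion and of requesting per-class versus weighted scores is shown below; the probability values and names are illustrative only.

```python
import numpy as np
from sklearn.metrics import classification_report, precision_score

# Hypothetical sigmoid outputs from model.predict() and ground-truth labels
# (0 = normal, 1 = pneumonia)
probs = np.array([0.10, 0.80, 0.45, 0.95, 0.30, 0.60])
y_true = np.array([0, 1, 1, 1, 0, 0])

# Convert probabilities to binary labels with a 0.5 threshold
y_pred = (probs >= 0.5).astype(int)

print(precision_score(y_true, y_pred, average=None))        # one score per class
print(precision_score(y_true, y_pred, average="weighted"))  # single weighted score
print(classification_report(y_true, y_pred, target_names=["normal", "pneumonia"]))
```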
In Scikit-Learn's scoring functions, the average parameter controls how per-class scores are combined: with a weighted average, the scores are collapsed into a single metric, and with average set to None, the scores are reported separately for each class. The same code blocks are executed for all three datasets mentioned before. The classification report summary for the datasets is below.



Discussion
On the first dataset, the precision for class 0 is high whereas recall is low, indicating a substantial number of false negatives. Overall accuracy is 58%. In the second dataset, there are more samples of class 0 in the training data, making the recall for class 0 very high while its precision drops. Essentially, the model became biased towards that class. The third dataset is simply mixed-up data and ends up with a very weak model. We therefore need both precision and recall to be high for better model performance. Some prefer to evaluate a model by the F1 score, which combines both of these metrics.
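For reference, the F1 score is the harmonic mean of the two metrics:
F1 = 2 × (Precision × Recall) / (Precision + Recall)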

The confusion matrices show some high values in off-diagonal positions for both the balanced and the unbalanced datasets, but for the mixed dataset, it is quite a mess.
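For completeness, here is a minimal sketch of how such a confusion matrix could be computed and plotted with Scikit-Learn; the labels are made up for illustration.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

# Rows are the true classes, columns the predicted classes;
# the off-diagonal entries are the misclassified images
cm = confusion_matrix(y_true, y_pred)
ConfusionMatrixDisplay(cm, display_labels=["normal", "pneumonia"]).plot()
plt.show()
```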

Conclusion
In this article, I have described model evaluation metrics for image classification and their implementation in Python. Three different datasets are used to reflect real-world issues like unbalanced and dirty datasets. Data scientists and AI managers need to ensure that the training data is sufficiently clean and balanced for each class. Mixed-up data can end up producing a weak model after all the lengthy model training.