
Evaluation of Machine Learning Classifiers

Explanation of Bias-Variance Analysis, Regularization, Performance Metrics, and an Implementation of Harmonic Classifier

Figure 1: A depiction of Results from a Bias-Variance Analysis (Source: Author)

In the previous articles, we discussed various Machine Learning methods for classification tasks and repeatedly used terms like Regularization, Overfitting, and Underfitting. In this article, we go through these terms in detail and show how you can circumvent the problems they describe. Furthermore, we also discuss various metrics for measuring the performance of a classifier.

1. Bias-Variance Analysis

Bias-Variance Analysis is a process for evaluating a Machine Learning Classifier. Every classifier can suffer from either a High Bias or a High Variance problem, depending on how it is trained. Knowing about these common problems and how to prevent them helps one build better, more general, high-performance models.

· High Bias (Underfitting)

When a classifier is highly biased towards a certain type of prediction (e.g., a certain class) regardless of changes in the input data, we say that the model suffers from a High Bias problem. For example, if we train a linear model on a training-set which is not linearly separable, such a model will perform badly even on the training-set. We call such a model an underfit model because it does not fully capture the structure of the dataset. An example of such an underfitting problem is shown in figure 2.

Figure 2: An Example of a model suffering from an Underfitting Problem (Source: Author)

As you may have noticed, such a model works badly on both the training-set and the test-set. This is because the model is not equipped with sufficient parameters to deal with the non-linearity in the data. It could also mean that there isn’t enough data: when the sample size is too small, training will not converge to the optimum and the resulting model will suffer from an underfitting problem.
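To make this concrete, here is a minimal sketch of the underfitting scenario, assuming a scikit-learn workflow and using the make_moons dataset as a stand-in for the non-linearly-separable data in figure 2 (the article’s actual data and code are in the repository linked at the end):

```python
# A minimal underfitting sketch: a purely linear classifier fitted to data that is
# not linearly separable. make_moons is an assumed stand-in for the article's data.
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_clf = LogisticRegression().fit(X_train, y_train)
print("train accuracy:", linear_clf.score(X_train, y_train))  # mediocre even on the training-set
print("test accuracy: ", linear_clf.score(X_test, y_test))    # mediocre on unseen data as well
```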

· High Variance (Overfitting)

Another common problem while training a Machine Learning Classifier occurs when the model performs well on the training-set, but its performance deteriorates significantly when it is tested on examples different from the ones it has already been trained on. This is called a High Variance or an Overfitting problem. It is named this way because the model’s prediction outputs have a high variance: the model cannot handle the variance in the input samples for a given class and, instead of producing a steady prediction for that class, produces varied predictions for the same class. An example of such an Overfitting problem is shown in figure 3.

Figure 3: An Example of a model suffering from an Overfitting Problem (Source: Author)

· Learning Curves (Diagnostics)

Now that you know what the High Bias and High Variance problems are, you may be wondering how to find out which problem a model is suffering from. For two-dimensional data such as that in the examples, we can easily see it with the naked eye by plotting the data and the decision boundary of the classifier; however, when the data is multidimensional, this is not feasible. Therefore, one needs a general diagnostic method for finding the exact problem. Learning Curves are such a diagnostic. More specifically, we plot the objective function value for a sample set of data instances drawn randomly from the training-set, then iteratively increase the number of samples and keep plotting the objective function value for each new set of samples. This gives us a curve of how the model behaves on the training-set.

In addition to the training-set, we also use a cross-validation set to verify the performance: we create a separate data split, in addition to the train-test split, and call it the Cross-Validation set. All parameter tuning and evaluation during the training phase is then performed against this Cross-Validation set. We repeat the aforementioned process of plotting the learning curve for the cross-validation set as well. An example result of such learning curves is shown in figure 4.

Figure 4: An Example of Learning Curves for Underfitting, Overfitting and Best Fitting respectively (Source: Author)

If we observe the Learning Curves, we see that when neither curve (i.e., Jₜ and Jcᵥ) converges, we have an underfitting problem, which means that the model has not learned anything and will not classify accurately on either the training or the test data. However, when the curve Jₜ on the training data shows clear convergence while the curve for the cross-validation set (Jcᵥ) diverges, we have an overfitting problem. The larger the gap between the final cost values on the training and cross-validation sets, the greater the probability that the model is suffering from overfitting and will not work on examples other than the ones it has already been trained on. In the case of a best fit, both curves converge, indicating that the model is balanced and will work well for both the training and the test examples. Note that on complex datasets, both curves might not converge to absolute zero and there might be a small gap between the two curves; however, the gap should not be significant and the distance from the convergence point should not be too large.
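As a rough sketch of how such curves can be produced in practice, the snippet below uses scikit-learn’s learning_curve helper; the estimator, the dataset, and the use of log-loss as the cost J are assumptions for illustration, not the article’s exact setup.

```python
# Sketch: plot J_train and J_cv against the number of training samples.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_moons(n_samples=600, noise=0.25, random_state=0)

# Train on increasingly larger subsets; score each on the training subset and
# on held-out cross-validation folds (5-fold here).
train_sizes, train_scores, cv_scores = learning_curve(
    LogisticRegression(), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 8), scoring="neg_log_loss")

# Flip the sign so the curves read as a cost J, as in figure 4.
J_train = -train_scores.mean(axis=1)
J_cv = -cv_scores.mean(axis=1)

plt.plot(train_sizes, J_train, label="J_train")
plt.plot(train_sizes, J_cv, label="J_cv")
plt.xlabel("number of training samples")
plt.ylabel("cost J")
plt.legend()
plt.show()
```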

2. Solutions

In the previous section, we discussed multiple problems that a Machine Learning Classifier can suffer from. Here we list some mitigation strategies for overcoming these problems.

· Using Better Data

Most of the problems encountered while building a Machine Learning model stem from bad data. If the dataset is too small, has too much noise, and/or contains contradictory or misleading data points, then the best course of action is to find better data. A dataset which is unbalanced, with only a few instances of a certain class compared to the other classes, can also cause an overfitting problem. In the case of an overfitting problem, increasing the size of the dataset can solve it if the problem originated from a lack of data. Normalizing and stratifying the data can also help if the problem arises from unbalanced data, as in the short sketch below.
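For instance, a stratified split keeps the class ratio consistent across splits and a scaler normalizes the features; the following is a small sketch of this, assuming a scikit-learn workflow rather than the article’s own preprocessing code.

```python
# Sketch: stratified splitting plus feature normalization for an unbalanced dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# stratify=y preserves the 90/10 class ratio in both the train and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Fit the scaler on the training data only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)
```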

· Regularization

One way to handle the overfitting problem is to introduce a Regularization term in the objective function, which relaxes the constraints a bit and accommodates variability in the data. More specifically, we can add a term to the objective function (e.g., the L1/L2 norm of the weights) which results in less penalty for slightly deviant points in the training-set that occur due to noise. An example of such regularization can be seen in figure 5.

Figure 5: An Example of Effect of using Regularization for solving Overfitting problem (Source: Author)

If we observe the effect of Regularization on the classification boundary, we notice that it widens the boundary and thus accommodates noisy points which would otherwise have been wrongly classified due to a tight decision boundary. However, this may come at the cost of decreased performance on the training-set, as can be observed in figure 5. So, it is a matter of choice, controlled by a parameter Lambda which determines how much weight is given to the Regularization term while constructing the classifier. The optimum value of Lambda can also be learned by plotting the Learning Curves on the training and cross-validation sets against variations in the Lambda parameter while keeping the data size the same, as sketched below.
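As a sketch of that tuning procedure, the snippet below sweeps the regularization strength with scikit-learn’s validation_curve; note that LogisticRegression exposes C = 1/Lambda, so a small C means strong regularization. The 3rd degree polynomial pipeline and the dataset are assumptions mirroring the earlier overfitting example.

```python
# Sketch: compare training and cross-validation scores over a range of
# regularization strengths (C = 1/Lambda in scikit-learn's LogisticRegression).
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_moons(n_samples=400, noise=0.25, random_state=0)
model = make_pipeline(PolynomialFeatures(degree=3), LogisticRegression(max_iter=5000))

C_range = np.logspace(-3, 3, 7)
train_scores, cv_scores = validation_curve(
    model, X, y, param_name="logisticregression__C", param_range=C_range, cv=5)

for C, tr, cv in zip(C_range, train_scores.mean(axis=1), cv_scores.mean(axis=1)):
    print(f"C={C:8.3f} (Lambda={1/C:8.3f})  train={tr:.3f}  cv={cv:.3f}")
```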

· Constructing Better Feature Space

There are cases when the problem arises from a bad or insufficient feature space. In such cases, one must construct a better feature space. For example, in the case of an underfitting problem such as the one we depicted earlier (i.e., a linear classifier on a non-linear dataset), it is necessary to increase the number and type of features. We can see the results of adding more features (i.e., a 3rd degree polynomial model) in figure 3. The model becomes good at classifying the training-set and therefore solves the underfitting problem. However, we also saw that it started to suffer from an overfitting problem and became too restrictive for certain data samples. We have already discussed how to mitigate the overfitting problem in the previous sections. However, in some cases, simply changing the feature space to a more suitable one can improve the performance drastically and provide a more general model which works robustly even on unseen data. If we observe the data closely, we see that it is a repetitive pattern, and we can model it using a harmonic classifier. In fact, if we model it with a harmonic classifier (e.g., a sine wave), we obtain a model which works for any new sample and thus provides a general solution. An example of such a harmonic classifier fitted on the data is shown in figure 6. As we can see, it not only learns to classify the training-set from just one cycle but is also robust for any number of cycles.

Figure 6: Results of a Harmonic Classifier on Non-Linear Data (Source: Author)
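The repository linked at the end contains the actual implementation; purely as an illustration of the idea, here is a hedged sketch in which the raw feature is replaced by a sinusoidal basis so that a plain linear model can separate the periodic pattern. The synthetic data generator (points above or below a sine wave) is an assumption standing in for the article’s dataset.

```python
# Sketch of the harmonic-classifier idea: map x1 through sin() so the decision
# boundary becomes linear in the transformed space. The data generator is assumed.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x1 = rng.uniform(0, 2 * np.pi, 500)              # training data: a single cycle
x2 = rng.uniform(-2, 2, 500)
y = (x2 > np.sin(x1)).astype(int)                # label: above or below the sine wave

X_harmonic = np.column_stack([np.sin(x1), x2])   # harmonic feature map
clf = LogisticRegression().fit(X_harmonic, y)
print("training accuracy:", clf.score(X_harmonic, y))

# Because sin() is periodic, the same model generalizes to cycles it never saw.
x1_new = rng.uniform(10 * np.pi, 14 * np.pi, 500)
x2_new = rng.uniform(-2, 2, 500)
X_new = np.column_stack([np.sin(x1_new), x2_new])
print("accuracy on unseen cycles:", clf.score(X_new, (x2_new > np.sin(x1_new)).astype(int)))
```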

3. Performance Metrics

Figure 7: Performance Metrics for Binary Classification Problem (Source: Author)

Once a classifier has been trained, one needs a measure to find out how good the classification performance is. A multitude of metrics can be used for such performance evaluation. Accuracy is a commonly used metric for many tasks; however, it is a bad metric when the dataset is unbalanced: if the dataset has only a few instances of a certain class, accuracy will be highly biased towards the majority class. In such cases, a different metric must be used. As in figure 7, we construct a set of metrics, e.g., True Positive Rate (TPR) and False Positive Rate (FPR), by counting the frequency of the classified labels. TPR (recall/sensitivity) measures the prediction accuracy of the classifier on the positive class (i.e., class 1), while FPR measures how often the classifier wrongly predicts the positive class for samples of the negative class (i.e., class 0). There is always a tradeoff between a high recall/TPR and a low FPR. For example, in the case of an object detector/classifier, TPR tells us how often the detector finds the object correctly among the ones it was shown, and FPR measures how often it wrongly detects an object when there wasn’t one in the scene.
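As a small worked example (with made-up labels, purely for illustration), TPR and FPR can be read directly off a confusion matrix:

```python
# Sketch: compute TPR and FPR from a binary confusion matrix (class 1 = positive).
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])   # hypothetical ground truth
y_pred = np.array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])   # hypothetical predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tpr = tp / (tp + fn)   # recall/sensitivity: correct detections among actual positives
fpr = fp / (fp + tn)   # false alarms among actual negatives
print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")
```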

Furthermore, the performance of a classifier depends heavily on the threshold applied to the probability used to classify a point into a class. For example, in the case of a binary classifier, we may use a threshold (e.g., 0.5) on the output probability and classify the point into either class 0 or class 1 depending on the value of that probability. However, 0.5 is an arbitrary value and differs across classification scenarios. So, in order to find the optimum threshold, at which the performance of the classifier is maximal, one can vary the threshold from 0 to 1 and observe the performance metrics (e.g., TPR, FPR, etc.). This process is called Receiver Operating Characteristic (ROC) analysis and is performed by plotting ROC curves. A ROC curve is obtained by plotting the TPR against the FPR while varying the threshold of the classifier. The area under such a curve provides a measure of performance (i.e., the greater the area, the better the classification performance). A No-Skill classifier (i.e., one which returns either a random class or a constant class as its prediction output) is plotted as a diagonal line. Any curve close to that of a No-Skill classifier indicates a bad classifier with no classification ability. A perfect classifier is represented by a curve that reaches the upper left corner. An example output of such ROC curves can be seen in figure 8.
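A rough sketch of this analysis with scikit-learn is shown below; the synthetic dataset and logistic model are assumptions used only to produce predicted probabilities over which the threshold is swept.

```python
# Sketch: sweep the decision threshold over predicted probabilities and plot
# the ROC curve (TPR vs. FPR), reporting the area under the curve.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

probs = LogisticRegression().fit(X_train, y_train).predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, probs)

plt.plot(fpr, tpr, label=f"classifier (AUC = {roc_auc_score(y_test, probs):.2f})")
plt.plot([0, 1], [0, 1], "--", label="no-skill")   # diagonal baseline
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```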

Figure 8: ROC Curves and Precision-Recall Curves (Source: Author)

In addition to ROC curves, there is another diagnostic called the Precision-Recall curve. It is obtained by plotting Precision/PPV (Positive Predictive Value) against the TPR. Precision measures the correctness of the classification: taking our earlier object detection analogy, precision measures how often the classifier correctly classifies the object among the detections it makes. As with TPR vs. FPR, there is a tradeoff between TPR and Precision; a high TPR may come at the cost of a loss of Precision. Precision-Recall curves are particularly useful for unbalanced data. The interpretation of Precision-Recall curves is similar to that of ROC curves; however, the perfect classifier lies in the upper right corner, and the No-Skill classifier (e.g., a majority or random classifier) appears as a horizontal line at the proportion of positive samples in the dataset (0.5 for a balanced set).
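A companion sketch for the Precision-Recall curve follows; again the data and model are assumptions, and the no-skill baseline is drawn at the positive-class prevalence rather than at the ROC diagonal.

```python
# Sketch: Precision-Recall curve with the no-skill baseline at the positive-class prevalence.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

probs = LogisticRegression().fit(X_train, y_train).predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, probs)

baseline = y_test.mean()   # proportion of positives = precision of a no-skill classifier
plt.plot(recall, precision, label=f"classifier (AP = {average_precision_score(y_test, probs):.2f})")
plt.hlines(baseline, 0, 1, linestyles="--", label="no-skill")
plt.xlabel("Recall (TPR)")
plt.ylabel("Precision")
plt.legend()
plt.show()
```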

4. Conclusions

In this article, we discussed the problems which occur while training a Machine Learning Classifier and explained how to diagnose them. We then covered various mitigation strategies for dealing with these problems, as well as various performance metrics for evaluating a Machine Learning Classifier. Evaluation methods are an important aspect of the Machine Learning modelling paradigm and are useful for robust training of models and hyperparameter optimization. You can practice further with these methods and concepts using the code in the link below.

Code:

https://www.github.com/azad-academy/MLBasics-Evaluation

Become a Patreon Supporter:

https://www.patreon.com/azadacademy

Find me on Substack:

https://azadwolf.substack.com

Follow Twitter for Updates:

https://twitter.com/azaditech

