4 Reasons Your Machine Learning Model Is Underperforming

A schematic approach to building better ML models

SudoPurge
Towards Data Science


As a Data Scientist, creating impact with data is what you are paid for, but for a novice, grasping everything that goes into building an impactful Machine Learning model can feel daunting. Classifying the common problems under a few overarching umbrellas helps contextualize the bits and pieces of model optimization and shows where the real bottlenecks occur. Divided into four ideas, this schematic approach should give a clearer picture of the steps required to get to an impactful model.

Quality of Training Data

Most ML engineers are familiar with the quote, “Garbage in, garbage out”. Your model can only perform so well when the data it is trained on poorly represents the actual scenario. What do I mean by ‘representative’? It refers to how well the training data mimics the target population: the proportions of the different classes, the point estimates (like the mean or median), and the variability (like the variance, standard deviation, or interquartile range) of the training and target populations should match.

Generally, the larger the dataset, the more likely it is to be representative of the target population to which you want to generalize. But this is not always the case, especially if the sampling method is flawed. For instance, say you want to generalize to the population of a whole school of students, ranging from the 1st standard to the 10th, but 80% of your training data consists of students from the 2nd standard. If the school’s actual student distribution does not correspond to 80% of them being in the 2nd standard, and the quantity you want to predict is genuinely affected by natural differences between the standards, your model will be biased towards the 2nd standard.

It is crucial to have a good understanding of the distribution of your target population in order to devise the right data collection techniques. Once you have the data, study it (the exploratory data analysis phase) to determine its distribution and representativeness.
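As a rough illustration, here is a minimal sketch of that check, assuming a hypothetical pandas DataFrame with a `standard` column and made-up school-wide proportions; it simply puts the training sample's class composition next to the target population's:

```python
import pandas as pd

# Hypothetical training sample: the "standard" (grade) of each student
train = pd.DataFrame({"standard": [2, 2, 2, 2, 1, 3, 2, 2, 10, 2]})

# Proportion of each standard in the training data
train_props = train["standard"].value_counts(normalize=True).sort_index()

# Assumed school-wide proportions (made up here; e.g. from enrollment records)
school_props = pd.Series({s: 0.1 for s in range(1, 11)})

# Side-by-side comparison makes over-represented classes obvious
comparison = pd.concat(
    [train_props.rename("train"), school_props.rename("school")], axis=1
).fillna(0)
print(comparison)
```

If the two columns diverge badly, that is a signal to revisit the sampling strategy before blaming the model.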

Outliers, missing values, and outright wrong or false data are some of the other considerations you might have. Should you cap outliers at a certain value? Or remove them entirely? How about normalizing the values? Should you include records with missing values, or impute them with the mean or median instead? Does the data collection method support the integrity of the data? These are some of the questions you must evaluate before thinking about the model. Data cleaning is probably the most important step after data collection.
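As a sketch of what a couple of those decisions can look like in practice (the column name and thresholds below are hypothetical, not a prescription):

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with a missing value and an implausible outlier
df = pd.DataFrame({"height_cm": [120, 125, 130, np.nan, 128, 320]})

# Option 1: cap (winsorize) outliers at the 1st/99th percentiles
lower, upper = df["height_cm"].quantile([0.01, 0.99])
df["height_capped"] = df["height_cm"].clip(lower, upper)

# Option 2: impute missing values with the median (robust to outliers)
df["height_imputed"] = df["height_capped"].fillna(df["height_capped"].median())

print(df)
```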

Irrelevant Features

The quote “Garbage in, garbage out” also applies to feature engineering. Some features will have a greater weight (impact) on the prediction than others.

Measures like correlation coefficients, variance, and dispersion ratios are widely used to rank the importance of each feature. One common mistake novice Data Scientists make is using Principal Component Analysis to reduce dimensions that are not inherently continuous. Technically you can, but ideally you should not. Doing so amounts to assuming that the features with the highest variability are the ones with the highest impact, which is not necessarily true. Categorical features that have been artificially encoded generally do not turn out to be as variable as the continuous ones, and so get undervalued in terms of their relevance.
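A minimal sketch of the problem, using made-up data with one continuous and one binary encoded feature: PCA's first component is dominated by the raw variance of the continuous column, regardless of which feature actually drives the target.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical mixed data: one continuous feature and one binary (encoded) feature
X = pd.DataFrame({
    "income": rng.normal(50_000, 15_000, 500),   # large variance
    "is_member": rng.integers(0, 2, 500),        # variance capped at 0.25
})

pca = PCA(n_components=2).fit(X)

# The first component is driven almost entirely by the high-variance
# continuous feature, saying nothing about predictive relevance.
print(pca.explained_variance_ratio_)
print(pd.DataFrame(pca.components_, columns=X.columns))
```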

Sometimes, creating new features using other known features can have a greater impact than keeping them separate. Oftentimes, having too many features with low relevance can lead to overfitting, while having too few can lead to underfitting. Finding the best combination of features comes with experience and knowledge of the domain. It could be the difference between an okay model and a near-perfect model, and by extension, an okay ML engineer and a pretty darn good one.
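As a toy illustration of combining known features into a single, more informative one (the BMI example and column names below are hypothetical):

```python
import pandas as pd

# Deriving BMI from weight and height often carries more signal for a
# health-related target than either column does on its own.
df = pd.DataFrame({"weight_kg": [70, 85, 60], "height_m": [1.75, 1.80, 1.65]})
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2
print(df)
```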

Overfitting and Underfitting

Unlike the previous sections, where the data was our focus, this one comes down to the algorithm used for the model, although the effects can still be alleviated to some extent by addressing the issues discussed above.

Overfitting is when the model fits the training data too closely and fails to generalize to the target population. Generally, the more complex a model is, the better it is at detecting subtle patterns in the training dataset. Since the collected data is rarely a perfect representation of the target population, a complex algorithm like a deep neural net can end up learning those quirks as if they were real signal, and swapping it for a simpler one, say a lower-order polynomial, can make the difference. But use a model that is too simple for the problem and it will not be able to learn the underlying patterns well enough. This, of course, is called underfitting.

One way to compensate for overfitting is to impose a penalty based on how far the weight the model gives to a feature strays from a reference value set before training (which could just as well be zero, if we want the model to shrink the feature's influence away entirely). This effectively lets us control the complexity of the algorithm at a finer scale and helps us find the sweet spot between overfitting and underfitting. This is what we call regularization, and the penalty strength is a hyperparameter: it is not part of the learned model, but it affects the model’s ability to generalize and is set before training. There are other methods of finding the sweet spot, like bagging (the basis of Random Forests) and boosting.
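As a rough sketch with scikit-learn's Ridge and Lasso on synthetic data (the alpha values below are arbitrary, chosen only to show the effect of the penalty strength):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 20))
# Only the first three features actually matter; the rest are noise
y = X[:, 0] * 3 + X[:, 1] * 2 - X[:, 2] + rng.normal(scale=0.5, size=200)

# alpha is the penalty strength (a hyperparameter set before training):
# larger alpha shrinks the weights harder toward zero.
for alpha in [0.01, 1.0, 100.0]:
    score = cross_val_score(Ridge(alpha=alpha), X, y, cv=5).mean()
    print(f"alpha={alpha:>6}: mean R^2 = {score:.3f}")

# Lasso goes a step further and can push irrelevant weights exactly to zero
lasso = Lasso(alpha=0.1).fit(X, y)
print("non-zero weights:", np.sum(lasso.coef_ != 0))
```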

But it doesn’t end here. After extensive tuning of the hyperparameters, you may find that your model predicts with an accuracy of 95% on the test dataset. Now you run the risk of having overfit to that particular test set, and the model may not generalize to real-world data once it is deployed. The common solution is to carve out another set from the training data as a validation set: try the different hyperparameter settings against the validation set, and keep the test set for a single, final evaluation. This three-way split generally yields a model that works well, but that ultimately depends on the size and quality of the data you have and the complexity of the problem at hand.
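A minimal sketch of such a three-way split, using scikit-learn and a synthetic dataset (the model and the depths tried are arbitrary placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)

# First split off a test set that is touched only once, at the very end
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Then carve a validation set out of the remaining data for hyperparameter tuning
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

best_depth, best_score = None, 0.0
for depth in [2, 5, 10, None]:
    model = RandomForestClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    score = model.score(X_val, y_val)          # tune against the validation set
    if score > best_score:
        best_depth, best_score = depth, score

final = RandomForestClassifier(max_depth=best_depth, random_state=0).fit(X_train, y_train)
print("held-out test accuracy:", final.score(X_test, y_test))  # reported only once
```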

Lack of Enough Data

Most ML models require a sh*t ton of data. And unless you have a pre-trained model that only requires some fine-tuning, you are going to have to find a way to feed your model enough data. Even for simple tasks like recognizing oranges and bananas, the model needs at least a few thousand example images to learn from. This is a massive bottleneck in the pipeline. More than any other factor, the efficiency of today’s ML models and the efficacy of their applications are choked by a lack of sufficient data.

This is why companies like Facebook, Google, and Apple are so keen on collecting as much data as possible from their users (I’m not going to debate the ethical concerns of that practice here). Data augmentation techniques like cropping, padding, and horizontal flipping have been critical in squeezing as much training potential as possible out of the available dataset, but they can only do so much (a minimal augmentation sketch follows the figure below). This study from Microsoft illustrates how very different ML models performed similarly, with performance strongly and positively correlated with the size of the training data (number of words).

Figure 1 from Banko, M., & Brill, E., Scaling to Very Very Large Corpora for Natural Language Disambiguation. Retrieved February 16, 2021.
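Coming back to augmentation: here is a minimal sketch of a cropping/padding/flipping pipeline using torchvision's transforms, assuming a PyTorch image workflow (the sizes and dataset path are placeholders):

```python
from torchvision import transforms

# Each epoch sees a slightly different crop/flip of every image, effectively
# multiplying the variety available in a small dataset.
augment = transforms.Compose([
    transforms.RandomCrop(224, padding=16),   # random cropping with padding
    transforms.RandomHorizontalFlip(p=0.5),   # horizontal flipping
    transforms.ToTensor(),
])

# Typically passed to a dataset, e.g.:
# dataset = torchvision.datasets.ImageFolder("data/fruit", transform=augment)
```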

Of course, the data-size correlation does not hold in every case, but companies should reconsider the trade-off between spending millions on algorithm development and spending it on collecting more and better data.

The ultimate objectives are to be able to:

i) have data that is as perfectly representative of the target population as possible (aka. bigger data size and higher data quality)

ii) use features that actually affect the prediction in reality

iii) use a model with the appropriate level of complexity (aka. the level of detail at which it is able to learn)

iv) fine-tune the model to match or reduce the gap between the actual complexity of the problem and the complexity of the model

P.S. For more short, to-the-point articles on Data Science, programming, and how a biologist navigates his way through the data revolution, consider following my blog.

With thousands of videos being uploaded every minute, it’s important to have them filtered so that you consume only good-quality content. I will email you hand-picked educational videos on the topics you are interested in learning. Sign up here.

Thank you for reading!

