Introduction to Data Mismatch, Overfitting and Underfitting in Building Machine Learning Systems

In the article "Primer on Most Important Machine Learning Methods" I give an overview of common Machine Learning approaches that are frequently used in the field and of the dimensions along which they are differentiated. Three main dimensions are considered here:
- Human Supervision: Explains how supervised, unsupervised, semi-supervised and reinforcement learning work, specifically with regard to how far the shape of the outcomes is predefined.
- Online vs. Batch Learning: Discusses the difference between incremental learning "on-the-fly" and training a model based on a static data set.
- Instance-based vs. Model-based Learning: Highlights the difference between explicitly comparing new cases with previously seen values and building a model whose logic generalizes from the data.
When building a Machine Learning system, several steps are performed to yield a robust solution that not only makes accurate predictions but also gives proper answers to the underlying questions. A typical Machine Learning workflow contains the following steps:
- Problem Definition
- Data Collection
- Exploratory Data Analysis (EDA)
- Feature Engineering
- Train/Test Split
- Model Selection and Training
- Cross Validation
- Model Accuracy Evaluation
- Hyperparameter Tuning/Model Improvement
- Model Deployment
- Model Retraining
This workflow can lead to several challenges. Especially during the steps from EDA to model improvement, some key questions need to be answered:
- Is there enough data?
- How much of the data is needed for the model?
- Does the data represent the real world?
- Which features are suitable?
- Are data records complete?
- Which Machine Learning method is appropriate?
- How complex should the model be?
- Does the model make accurate predictions?
Responding to these questions, the main tasks are the selection and enhancement of the data as well as the selection, training and improvement of a model. These challenges can be put into two groups: Data-related Challenges and Model-related Challenges.
Although there are dependencies between the two groups, this article discusses the main data and model challenges and explains how to identify and address them, with the ultimate goal of improving the performance of the Machine Learning system.
1) Data Challenges
The saying that data scientists spend 80% of their time finding, cleaning and organizing data shows how important the input data is for a Machine Learning model to produce meaningful outcomes¹.
Data Quantity
Compared to human learning, which usually works quite well with few training examples, current Machine Learning algorithms need lots of data to perform well. A general rule is: the more complex the problem and the applied method, the more data is needed as input for training. When a Machine Learning system lacks data, overfitting can occur (explained under model challenges), as the algorithm picks up patterns that are really nothing more than noise. With a lack of data, the real patterns can stay hidden.
Representative Data
A key aspect related to training data is that it needs to be representative of the production data that occurs in real-world applications where the Machine Learning solution is deployed. If the production data is completely different, the results are likely to be useless. This applies to basically all types of models, whether they are based on supervised or unsupervised learning, batch or online learning, or instance- or model-based learning.
In practice, collecting a data set that represents the real world is challenging. Issues like sampling noise (=randomness in the data) can have a big impact, especially on small samples that are formed by pure chance and therefore in a non-representative way. Even if data collection is easy and the training data set is sufficiently large, sampling bias can still occur; it is caused by the use of flawed sampling methods.
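To make sampling noise tangible, the following minimal sketch (purely synthetic data, illustrative numbers only) draws repeated small and large samples from the same population and compares their means:

```python
import numpy as np

# Synthetic "population" standing in for the real world
rng = np.random.default_rng(seed=42)
population = rng.normal(loc=50, scale=10, size=100_000)

for n in (10, 10_000):
    # Draw five random samples of size n and compare their means
    means = [rng.choice(population, size=n).mean() for _ in range(5)]
    print(f"sample size {n:>6}: means {np.round(means, 2)}")

# Small samples scatter widely around the true mean (~50), large samples
# cluster tightly: this scatter is sampling noise.
```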
Data Quality
"Garbage in – Garbage out" is a frequently used proverb, but remains true for data quality in the context of Machine Learning solutions. The quality of the training data highly impacts the quality of the model results, when operating it in the context of real-world products and services. Hard work with lots of modifications result in high quality data:
- Error Correction: With sufficient insights into the data acquisition process, errors can be understood and manually fixed.
- Outlier Removal: Strongly deviating values, i.e. as a result of poor data acquisition methods, can be removed from the data set.
- Noise Balancing: More data can be added to account for noise and make it largely irrelevant.
- Missing Data Imputation: Incomplete data records can be imputed by filling in the median of the remaining series of values or, in time series analyses, the average of the surrounding values (both variants are sketched below).
If data cannot be fixed, the affected data records or features should be dropped entirely to protect the accuracy of the model, as missing values typically have an adverse effect. If the model is fed low quality training data, it will be almost impossible for it to perform well with regard to detecting the targeted patterns.
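As a minimal sketch of the imputation variants mentioned above, assuming a hypothetical pandas DataFrame with illustrative column names:

```python
import numpy as np
import pandas as pd

# Hypothetical data set with gaps (column names are illustrative only)
df = pd.DataFrame({
    "price": [10.0, np.nan, 12.0, 11.0, np.nan],
    "temperature": [20.1, np.nan, 19.8, np.nan, 20.5],
})

# Median imputation: fill gaps with the median of the remaining values
df["price"] = df["price"].fillna(df["price"].median())

# Time-series style imputation: interpolate from the surrounding values
df["temperature"] = df["temperature"].interpolate(method="linear")

print(df)
```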
Relevant Features
Analyzing relevant features goes beyond the raw shape of the data and more into its meaning. This is addressed during the feature engineering phase of the Machine Learning workflow.
To achieve the best possible predictions for a specific Machine Learning problem, the model has to be trained on the most relevant features. Irrelevant features should be excluded, as they do not help but rather distort the process of making accurate predictions.
As part of feature engineering two major steps help to build the proper collection of features:
- Feature selection: Analysis of features and selection of the most useful ones for making accurate predictions, e.g. by looking at a correlation matrix (a short sketch follows this list).
- Feature extraction: Uncovering interrelations among features and developing new features by combining or enhancing existing ones (e.g. during dimensionality reduction).
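The correlation matrix mentioned under feature selection can be computed with a few lines of pandas; the following sketch uses a purely hypothetical feature table (all names and relationships are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical feature table with a synthetic target variable "price"
rng = np.random.default_rng(0)
df = pd.DataFrame({"sqm": rng.normal(100, 20, 200),
                   "rooms": rng.integers(1, 6, 200).astype(float)})
df["price"] = 3_000 * df["sqm"] + rng.normal(0, 10_000, 200)

# Correlation of each feature with the target as a first filter for
# feature selection: high absolute correlation marks a candidate
print(df.corr()["price"].sort_values(ascending=False))
```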
Machine Readability
Besides the above-mentioned challenges, the machine readability of data is also key to maximizing a Machine Learning model's accuracy. With different feature data types, including numerical, categorical and ordinal, date & time, as well as text & image data, there are specific criteria for machine readability. With the feature preprocessing executed during the feature engineering step of the Machine Learning pipeline, data formats are adjusted to ensure machine readability.
Some feature preprocessing methods are specific to certain data types, some are independent of them. The following list gives an overview of important methods to be familiar with (a short code sketch follows the list):
- Feature scaling or normalization: Rescaling (e.g. Min-Max scaling) distributes values on a scale with a minimum of 0 and a maximum of 1. Standardization (e.g. z-score normalization) rearranges a feature's values around a mean of 0 with a variance of 1. Feature scaling mostly improves the results of non-tree models.
- Outlier removal: Whereas outlier removal is discussed under data quality as a remedy for faulty data acquisition, it is also considered here as a measure to improve machine readability. Outliers can drastically reduce a Machine Learning model's ability to process the rest of the data accurately.
- Rank transformation/integer encoding: Through rank transformation categorical values within a series of ordinal data points can be transformed into numerical ranks that have the machine-readable integer data type.
- Log transformation: Log transformation is used to improve Machine Learning model performance for highly variable data. It takes each data value’s logarithm to make the distribution more bell shaped and remove skewness. Log transformation can be applied if the distribution is roughly log-normal.
- One-hot-encoding: For each occurring value of a categorical feature, a new variable is created that can take on the values 0 and 1. One-hot-encoding avoids poor performance or unexpected results when a categorical feature is stored as integers but its values are not actually ranked.
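A minimal sketch of these preprocessing methods, using scikit-learn and pandas on a purely hypothetical feature table (all names and values are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler, OrdinalEncoder

# Hypothetical raw features (names and values are illustrative only)
df = pd.DataFrame({
    "income": [30_000.0, 45_000.0, 120_000.0, 60_000.0],  # skewed numerical
    "size": ["S", "M", "L", "M"],                         # ordinal
    "city": ["Berlin", "Paris", "Berlin", "Madrid"],      # categorical, unranked
})

# Feature scaling: min-max rescaling to [0, 1] and z-score standardization
df["income_minmax"] = MinMaxScaler().fit_transform(df[["income"]]).ravel()
df["income_zscore"] = StandardScaler().fit_transform(df[["income"]]).ravel()

# Log transformation to reduce the skewness of highly variable data
df["income_log"] = np.log(df["income"])

# Rank transformation / integer encoding for ordinal data
encoder = OrdinalEncoder(categories=[["S", "M", "L"]])
df["size_rank"] = encoder.fit_transform(df[["size"]]).ravel()

# One-hot encoding for unranked categorical data
print(pd.concat([df, pd.get_dummies(df["city"], prefix="city")], axis=1))
```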
2) Model Challenges
A Machine Learning model has to be tested and optimized to ensure good quality of the predictions it produces. Immature models that are put into production can lead to insufficient results and unhappy users of a product or service.
One way to improve model quality is splitting the data into different subsets to build a testing pipeline in which each step uses different data. The common approach is to split the data used for model development into a training set and a test set. A typical ratio is 80/20, with 80% being the training set and 20% the test set. The split depends on the size of the total data set; when the data set is very large, it is sufficient to reduce the test set share to well below 20%. With the production data included, the splitting process leads to three data sets (a basic split is sketched after the list):
- Training data
- Test data
- Production data
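A typical 80/20 split can be produced with scikit-learn's train_test_split; the following sketch uses purely synthetic data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix X and labels y (purely synthetic)
rng = np.random.default_rng(1)
X = rng.normal(size=(1_000, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Typical 80/20 split into training set and test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)  # (800, 5) (200, 5)
```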
Once the model is built on the training set, the training error can be measured, i.e. the error rate the model achieves on the training set itself; this is also referred to as bias. When the model already performs poorly on the training data, this is typically caused by underfitting. In that case the complexity of the model is too low to capture the logic in the data set at hand.
When the model is applied to the test data after initial training, the generalization error (=error rate on new cases the model hasn't seen) can be assessed. This is done by comparing the model's predictions for the test set with the test set's actual label values. When the model performs well on the training data but poorly on the test data, this is described as variance and is typically caused by overfitting.
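As a minimal sketch of this diagnosis, the following example (synthetic data, an intentionally unconstrained decision tree) contrasts training and test accuracy:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same kind of synthetic data as in the split sketch above
rng = np.random.default_rng(1)
X = rng.normal(size=(1_000, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# An unconstrained decision tree tends to memorize the training set
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(f"train accuracy: {model.score(X_train, y_train):.2f}")  # close to 1.00
print(f"test accuracy:  {model.score(X_test, y_test):.2f}")    # typically lower

# High train accuracy with clearly lower test accuracy points to overfitting
# (variance); poor accuracy on both sets points to underfitting (bias).
```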
A third pattern, data mismatch, corresponds to the issue of non-representative data described in the data challenges part and is quite often the reason for the error that occurs when the tested model is applied to the production data in the real world.
Overfitting
Overfitting is a frequent cause of a high generalization error. It implies that the model has over-interpreted patterns in the training data and adjusted to them too closely.
As collected data frequently includes a decent amount of chance and noise, it is just not appropriate for a model to pick up every little pattern. A model doing just that cannot distinguish which part of the data belongs to the real pattern to be detected and which is just noise. Overfitting typically occurs when the model is too complex or the data set is too noisy or too small. It can be addressed by the following measures:
Reduce the complexity of the Machine Learning model…
- By choosing a simpler model type (e.g. a lower-degree polynomial model instead of a complex deep neural network) or a different model architecture (e.g. another neural network type)
- By reducing the number of parameters of the model (e.g. a lower-degree polynomial or linear model instead of a high-degree polynomial model)
- By constraining the model through L2 or L1 regularization or dropout (see the sketch after this list)
- By adding early stopping so that learning is halted at a targeted error rate (e.g. a rate derived from production targets/KPIs)
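The following sketch illustrates the regularization lever with scikit-learn, fitting an intentionally over-complex polynomial model with and without an L2 constraint (all data is synthetic):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noisy synthetic data following a simple linear trend
rng = np.random.default_rng(7)
X = np.sort(rng.uniform(-3, 3, size=(30, 1)), axis=0)
y = 0.5 * X.ravel() + rng.normal(0, 0.5, size=30)

# High-degree polynomial without constraints: prone to overfitting
unconstrained = make_pipeline(PolynomialFeatures(degree=15),
                              LinearRegression()).fit(X, y)

# The same model constrained with L2 regularization (alpha sets the strength)
regularized = make_pipeline(PolynomialFeatures(degree=15),
                            Ridge(alpha=1.0)).fit(X, y)

print("unconstrained train R^2:", round(unconstrained.score(X, y), 3))
print("regularized train R^2:  ", round(regularized.score(X, y), 3))
# The regularized model fits the training data slightly less perfectly but
# generalizes better, because extreme coefficients are penalized.
```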
Change/fine-tune the data input fed into the model…
- By reducing the number of features or selecting more relevant features for training
- By simply collecting more data as long as it is available and computational power is not an issue
- By reducing noise (=fix/balance errors and get rid of outliers)
Underfitting
Underfitting relates to cases where the model is too simple for the data set it has been trained on and therefore produces poor results. The levers are similar to those for overfitting but are mostly pulled in the opposite direction.
Increase the complexity of the Machine Learning model…
- By choosing a more complex model type (e.g. a polynomial model instead of a linear model, or more layers/neurons in a neural network) or a different model architecture (e.g. another ANN type); a sketch follows these lists
- By reducing constraints that have potentially been applied before (e.g. L2 or L1 regularization, dropout)
Change/fine-tune the data input fed into the model…
- By choosing features that are more relevant based on error analysis
- By augmenting features to make them more relevant, using feature engineering methods (note that features can also be regularized)
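To illustrate the complexity lever, the following sketch fits an intentionally too-simple linear model and a polynomial model to the same synthetic, quadratic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic quadratic relationship that a straight line cannot capture
rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(200, 1))
y = X.ravel() ** 2 + rng.normal(0, 0.3, size=200)

linear = LinearRegression().fit(X, y)  # too simple: underfits
quadratic = make_pipeline(PolynomialFeatures(degree=2),
                          LinearRegression()).fit(X, y)  # added complexity

print("linear R^2:   ", round(linear.score(X, y), 2))     # close to 0
print("quadratic R^2:", round(quadratic.score(X, y), 2))  # close to 1
```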
Summary
In a Machine Learning workflow, several steps are taken to train a model that is fueled by a learning algorithm to eventually make predictions about the future. This includes steps from problem formulation and data collection to, ultimately, model building and improvement. The model, as the final product, should be capable of generalizing to new cases and making beneficial predictions in the production environment of a real-world application.
The results of a Machine Learning model are affected by many factors. Two main groups are data and model challenges. In the context of data challenges, the key things to look at are:
- Data Quantity
- Representative Data
- Data Quality
- Feature Relevance
- Machine Readability
Two of the key challenges related to the model building process are:
- Overfitting
- Underfitting
While these factors are frequently cited as important, this list is not exhaustive, and further avenues should be considered in addition to this article. A good source is the book "Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems", which provides more depth on the content discussed as well as practical exercises.