
The Danger of Overfitting a Model

An Explanation for Splitting Data into Training and Testing Sets

Photo by Isaac Smith on Unsplash

In my first job out of college, I was tasked with streamlining how a company made purchases. While it was a big project that encompassed many factors, such as lead times and order quantities, the most challenging part was determining how many units would sell over time.

With lead times in excess of 2 or 3 months, I needed to accurately predict whether a certain product would run out before a new shipment arrived. It was a balancing act: if I underpredicted how much product would sell, the company would run out and lose sales. On the other hand, if I overpredicted, the company would have too many of its assets tied up in inventory.

Given free rein to approach the problem however I saw fit, I queried some data from the last couple of years and began to crunch the numbers.

In just a few days of work, I had a model that was absurdly accurate given the variation I saw in the data. Overconfident, I applied the model to the next 3 months. When evaluating it in real time, however, I watched huge, glaring inaccuracies quickly accumulate.

Inexperienced and confused, I tried to understand what went wrong. Having gained some knowledge since then, I now understand that I overfitted the model.

Underfitting vs Overfitting

Suppose an analyst was put in my position and needed to predict how much product would sell each month. Reviewing the data, they found a strong seasonal pattern. For the sake of example, say the company sold about 2,000 units of product per month during the first half of the year and about 4,000 units per month during the second half.

Figuring they could simply take an average, they make a formal prediction of 3,000 units of product sold per month throughout the entire year.

The analyst underfits the model, because their prediction didn’t adequately capture the form or structure of the data. If purchasing decisions were made based on their prediction, the company would consistently overbuy product during the first half of the year and underbuy during the second half.

The former case ties up company assets and incurs unnecessary storage costs. The latter case misses out on sales because the company doesn't have enough product.

Correcting course, the analyst reviews the data again and creates a simple model that uses the sales of the same month from the previous year to predict the sales of the current month. For example, if the company sold 2,000 units of product in January 2020, the analyst would predict 2,000 units of product sold in January 2021.

The analyst overfits the model, because instead of taking the general form or structure of the data, they simply model the noise. While this can sometimes work if the data is remarkably consistent, it causes problems when there are fluctuations that can’t be adequately predicted or explained.

For example, if in May 2020 there was an unusual dip in sales so that only 900 units of product were sold, the analyst would underbuy for May 2021 when sales resume their normal volume.
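
As a rough illustration, here is a minimal Python sketch of that "same month last year" rule. The monthly figures are made up for illustration; only the 900-unit May dip comes from the example above.

```python
# Hypothetical 2020 monthly sales (units); only the May dip mirrors the example above.
sales_2020 = {
    "Jan": 2000, "Feb": 2000, "Mar": 2000, "Apr": 2000, "May": 900, "Jun": 2000,
    "Jul": 4000, "Aug": 4000, "Sep": 4000, "Oct": 4000, "Nov": 4000, "Dec": 4000,
}

def predict_2021(month):
    """Predict a 2021 month's sales as the sales of the same month in 2020."""
    return sales_2020[month]

# The one-off May 2020 dip propagates straight into the May 2021 forecast,
# so the analyst would underbuy when sales return to their normal volume.
print(predict_2021("May"))  # 900, far below the typical ~2,000 units
```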

The Temptation of Overfitting

Photo by Stephen Phillips – Hostreviews.co.uk on Unsplash

Overfitting a model presents a big temptation, because an inexperienced analyst can easily commit this simple mistake in pursuit of better accuracy. Going back to the story of my first job, I fell for exactly that. I created a model (a regression with higher-order polynomial terms, if I recall correctly) which fit the previous year's data fairly well.

Because I saw the error rate continually fall, I tweaked and optimized the model until it followed the data very closely, noise and all. By the time I was finished, I was proud of my work and wanted to see it excel.
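
To make that failure mode concrete, here is a hedged sketch, using synthetic data rather than my actual sales figures, of how raising a polynomial's degree keeps pushing the training error down even as the fit starts chasing noise:

```python
# Synthetic seasonal sales with random noise; higher-degree polynomials drive
# the error on the fitted data toward zero while modeling the noise itself.
import numpy as np

rng = np.random.default_rng(0)
months = np.arange(24)
sales = 3000 + 1000 * np.sin(2 * np.pi * months / 12) + rng.normal(0, 200, size=24)

for degree in (1, 3, 10):
    coeffs = np.polyfit(months, sales, degree)      # fit a polynomial of this degree
    fitted = np.polyval(coeffs, months)
    rmse = np.sqrt(np.mean((sales - fitted) ** 2))  # error on the data used for fitting
    print(f"degree {degree:>2}: training RMSE = {rmse:7.1f}")

# The training error keeps shrinking as the degree grows, but the wiggly
# high-degree fit follows the noise and generalizes poorly to new months.
```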

The problem, of course, was that once the model was applied to newer data, the accuracy plunged. My model had captured the noise and produced some very poor predictions. Luckily, no purchasing decisions had yet been made based on my model, but I knew I had to resolve the issue.

Training and Testing Data

The best way to avoid the problem of overfitting a model is to split the dataset into training and testing data.

Training data is a subsample of the dataset used to fit the model. Testing data (also sometimes called validation or holdout data), however, is a subset of the dataset used to evaluate the finished model.

In theory, both the training and testing data subsamples should be representative of the entire dataset. The training data provides the model with all the initial information needed to create meaningful predictions and the testing data supplies an unbiased set of observations to evaluate the accuracy of the model.

In cross-sectional studies, the data is usually split between training and testing sets by random selection; however, other methods may be used depending on the characteristics of the data. Typically, the training data represents 60–80% of the dataset and the testing set comprises the remainder.
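
As a minimal sketch, a random split like this can be done with scikit-learn's train_test_split; the column names and the tiny synthetic table below are purely illustrative:

```python
# Random train/test split for cross-sectional data (illustrative data and columns).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "price":      [9.99, 14.99, 4.99, 19.99, 7.49, 12.99, 24.99, 9.49],
    "promotion":  [0, 1, 0, 0, 1, 0, 1, 0],
    "units_sold": [120, 340, 80, 60, 410, 150, 290, 95],
})

X = df[["price", "promotion"]]
y = df["units_sold"]

# Hold out 25% of the rows for testing; the remaining 75% is used to fit the model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(len(X_train), "training rows,", len(X_test), "testing rows")
```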

In time series data, the oldest data is usually chosen as the training data and the most recent observations are designated as the testing data. For example, an analyst may use 18 months of sales data to train their model and evaluate it on the last 3 months' worth of sales.
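
A minimal sketch of such a chronological split, with made-up monthly figures standing in for the real sales data, might look like this:

```python
# Chronological train/test split for time series: oldest 18 months train the
# model, most recent 3 months evaluate it. The dates and values are illustrative.
import numpy as np
import pandas as pd

months = pd.date_range("2019-07-01", periods=21, freq="MS")
sales = pd.Series(np.where(months.month <= 6, 2000, 4000), index=months)

train = sales.iloc[:18]   # first 18 months, used to fit the model
test = sales.iloc[18:]    # last 3 months, held out for evaluation

print(train.index.min().date(), "to", train.index.max().date(), "-> training")
print(test.index.min().date(), "to", test.index.max().date(), "-> testing")
```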

Conclusions

The ultimate mistake I made was forgetting to split my data into training and testing sets. Between data cleaning and thinking about the business logic, this simple step is easily overlooked, but it can wreak havoc on any meaningful data analysis. Overfitting in particular poses a danger, because an analyst may see their accuracy continually increase without realizing they're simply mimicking random noise.

To bring things back to my first job: once I realized my mistake, I eventually did split the sales data into training and testing sets. After a few more days of trying different models, I found a good fit I could present to my manager and eventually used it to make purchasing decisions.

