
Sweatpants Unleashed – the Data Science of an Ignominious Garment

A Time Series Analysis of How Covid-19 Unraveled a Loose Garment

We will run a SARIMA model and its diagnostics, with KPSS, ADF, OCSB, CH, and normality tests. Step by step, with Python, at the frayed seams of civilization.

"Sweatpants are a sign of defeat. You lost control of your life so you bought some sweatpants." Karl Lagerfeld

Mike Von, unsplash.com

We are going to analyze the monthly Google Trends data for the keyword "sweatpants" from 2010 through September 2021; and will then investigate which new insights into the course of history and human civilization we can extract.

(PSA: Doom is nigh!)

sweatpants – Explore – Google Trends

0. Dependencies
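The import cell is not reproduced in this excerpt. A minimal set of dependencies for the steps below might look like this (assuming pandas, matplotlib, seaborn, scipy, and pmdarima are installed):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import pmdarima as pm
from scipy import stats
```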

1. Data Processing

1.1 Reading and Wrangling

After downloading the pants.csv file from Google Trends, we read it into the Jupyter notebook.

The "Month" column has been imported as an object/string, therefore we convert it to datetime and then define an index.

We limit our analysis to the ups and downs of the sweatpants vogue since 2010.
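A sketch of these wrangling steps, assuming the standard Google Trends CSV export (a two-line category header before the data; the column name "Pants" is our own choice):

```python
# Read the Google Trends export; adjust skiprows if your file's header differs.
df = pd.read_csv("pants.csv", skiprows=2)
df.columns = ["Month", "Pants"]              # rename the keyword column
df["Month"] = pd.to_datetime(df["Month"])    # object/string -> datetime
df = df.set_index("Month")                   # use the dates as the index
df = df.loc["2010":]                         # restrict the analysis to 2010 onwards
```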

Now we are ready to sit down and start some serious data wrangling to learn what the sweatpants fashion cycle is made of.

Mike Von, unsplash.com

1.2 Visual Analysis

1.2a Chart the Raw Data
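A minimal plotting sketch:

```python
# Line chart of the raw monthly observations.
ax = df["Pants"].plot(figsize=(12, 5), title="Google Trends: 'sweatpants', monthly")
ax.set_ylabel("relative search interest")
plt.show()
```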

We observe a rising long-term trend. Civilization has undeniably been drifting towards a cliff edge for the past decade.

We see stable seasonal fluctuations – stable until recently. The pattern indicates a single peak and trough in every year.

We also perceive a funnel-shaped pattern of fluctuations. The peaks and dips exhibit higher amplitudes towards the end of the timeline. This suggests heteroskedasticity: the variance is not time-invariant.

1.2b Outliers, Heatmap, Trend, and Seasonality

We will create a pivot table, from which we will derive a couple of charts that will enable us to focus on different aspects of the time series.

The pivot table provides us with a nifty way to aggregate, sort and filter the source data.
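One way to build such a pivot table (a sketch; the original notebook may differ):

```python
# Year x month grid: one row per year, one column per calendar month.
pivot = pd.pivot_table(
    df,
    values="Pants",
    index=df.index.year,      # rows: years
    columns=df.index.month,   # columns: months 1..12
    aggfunc="mean",
)
```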

The pivot table indicates that the popularity of sweatpants tends to soar near year-ends. Let’s visualize the hot zones for loose pants in a heatmap.
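A seaborn sketch (the colormap is an arbitrary choice):

```python
# Heatmap of the year x month pivot table: dark cells mark the hot zones.
plt.figure(figsize=(12, 6))
sns.heatmap(pivot, cmap="YlOrRd", annot=True, fmt=".0f")
plt.xlabel("month")
plt.ylabel("year")
plt.show()
```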

We observe seasonal peaks in every December, with search activity beginning to increase in October and November.

  • Either people tend to mentally prepare for the cold season by googling for snug, warm, but not too tight-fitting comfort garments, three months before the cold mid-winter months of January and February;
  • or piles of gift-wrapped sweatpants – sad to say – really do clutter the floor beneath the x-mas trees.

In the 2020 row of the heatmap, we can discern the moments when sweatpants went off the rails: in April/May 2020 and in November/December 2020. These are the spikes that will become the focus of our analysis.

Let’s visualize the long-term trend. We take our pivot table, apply the aggregation function ‘mean’, and choose years as the ‘index’ in the pivot table; and then plot the year-over-year trend of the annual interest.
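In code, this boils down to a row-wise mean over our pivot table (a sketch):

```python
# Average search interest per year: the long-term trend.
yearly = pivot.mean(axis=1)
yearly.plot(figsize=(10, 4), title="Average 'sweatpants' interest per year")
plt.show()
```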

Karl Lagerfeld, the Parisian fashion tsar, might as well have exclaimed "Doom is nigh!" If Karl was right, our civilization has been on a trajectory to collectively lose control over our lives for ten years or more.

We reached peak pants bagginess, at least temporarily, in 2020, after an inexorable growth path since 2010.

Next, let’s investigate more closely the seasonality in our source data. We select month as the index in our pivot table and have the table compute the average value of each month across all years in our time series; and then plot the resulting 12-month curve.
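Analogously, a column-wise mean over the pivot table (a sketch):

```python
# Average search interest per calendar month across all years: the seasonal profile.
monthly = pivot.mean(axis=0)
monthly.plot(figsize=(10, 4), title="Average 'sweatpants' interest per month")
plt.show()
```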

We observe a seasonal pants bagginess spike late in each fourth quarter. The chart indicates that a normal year tends to have a single trough in June/July, followed by a single peak in December.

To summarize, let’s decompose the time series into trend, seasonality, and residuals, and throw in the correlograms (ACF and PACF) as well, all in one function.

The pmdarima package offers the tsdisplay function to visualize a time series. In addition to the charts we have created above, tsdisplay also shows a histogram of the observations (lower-right corner). Clearly, the observations are not normally distributed; they are left-skewed. The non-normality does not invalidate the SARIMA forecast model we want to develop, but normality would provide us with more reliable forecasts and confidence intervals.
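A call along these lines produces that figure (the exact keyword arguments may vary across pmdarima versions):

```python
# tsdisplay combines the line chart, the correlogram, and a histogram
# of the observations in a single figure.
from pmdarima.utils import tsdisplay

tsdisplay(df["Pants"], lag_max=36)
```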

Even more concerning is the funnel shape of the observations curve at the top. The widening funnel indicates that the variance is not time-invariant, whereas time-invariant variance is a prerequisite for reliable time series forecasts.

2. Diagnostics and Transformations

2.1 Homoskedasticity and Normality

The histogram shows observations that do not appear to be normally distributed. Let’s run the normaltest of scipy.stats.
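A sketch of the test call:

```python
# D'Agostino-Pearson test: the null hypothesis is that the data come
# from a normal distribution.
stat, p = stats.normaltest(df["Pants"])
print(f"statistic = {stat:.2f}, p-value = {p:.2e}")
```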

The test confirms, with its vanishingly small p-value, that the original time series is not normally distributed.

So we have to sit down, in our fashionable black sweatpants, and ponder our next steps.

Mike Von, unsplash.com

We decide to try out two transformations to solve both our issues, if possible: non-normality and – more urgently – heteroskedasticity (inconstant variance).

2.1a Log-Transformation

First, we log-transform the observations.

The purpose of the transformation is to reshape the curve of observations in a way that narrows its expanding funnel (reduces its heteroskedasticity); and ideally also bends the curve so that it aligns more closely with normally distributed data, by adjusting the skewness and kurtosis of the distribution.

After fitting the forecast model, we will be able to inverse-transform the prediction values to make them comparable with the original, untransformed time series data.

The pmdarima package offers the LogEndogTransformer class.
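A minimal usage sketch; the small lmbda is our own guard against zero values in the series:

```python
from pmdarima.preprocessing import LogEndogTransformer

# log(y + lmbda); fit_transform returns the transformed y and the (unused) X.
y_log, _ = LogEndogTransformer(lmbda=1e-6).fit_transform(df["Pants"])

tsdisplay(y_log, lag_max=36)      # re-inspect line chart, ACF, and histogram
print(stats.normaltest(y_log))    # p-value now well above 0.05
```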

The log-transformed line chart has lost most of its concerning funnel-shape. The variance seems to be time-invariant, too. The histogram resembles a normal distribution, without a significant skew to the left or right. The high p-value of the normality test confirms this.

2.1b Box-Cox Transformation

As an alternative to log-transformations, we run the Box-Cox transformer, which can often be more effective than log-transformations.
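Again a sketch; when lmbda is not specified, pmdarima estimates the Box-Cox lambda from the data, and lmbda2 shifts the series away from zero:

```python
from pmdarima.preprocessing import BoxCoxEndogTransformer

bc = BoxCoxEndogTransformer(lmbda2=1e-6)
y_bc, _ = bc.fit_transform(df["Pants"])

tsdisplay(y_bc, lag_max=36)
print(stats.normaltest(y_bc))     # even more favorable p-value
```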

The charts look similar to the log-transformed series. The normality test returns an even more favorable p-value.

We will continue with the Box-Cox-transformed series.

2.2 Autocorrelation Structure

Next, we investigate the correlogram of the Box-Cox-transformed series.
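A sketch using the statsmodels plotting helpers:

```python
# Correlograms (ACF and PACF) of the Box-Cox-transformed series.
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

plot_acf(y_bc, lags=36)
plot_pacf(y_bc, lags=36)
plt.show()
```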

We see a seasonal pattern in the ACF. Thus, the time series is not stationary – yet.

2.3 Stationarity

A time series is stationary if its mean, variance, and autocorrelation structure do not change over time. If they are not time-invariant, the properties we use today to prepare a forecast would be different from the properties we would observe tomorrow. A process that is not stationary would elude our methods for using past observations to predict the future development. The time series itself does not need to remain a flat, constant line in past and future periods to be deemed stationary – but the patterns that determine its changes over time need to be stationary to make its future behavior predictable.

The time series needs to exhibit:

  • time-invariant mean
  • time-invariant variance
  • time-invariant autocorrelations

Inconstant Mean

A series that shows a robust upward or downward trend does not have a constant mean. But if its data points tend to revert to the trendline after disturbances, the time series is trend-stationary. By differencing the time series – taking the difference between an observation y(t) and an earlier observation y(t-n) – we could obtain a stationary (mean-reverting) series of the changes.

A time series with seasonality will exhibit patterns that repeat after a constant number of periods: temperatures in January differ from those in July, but January temperatures will be at a similar level between years. Seasonal differencing takes the difference between an observation and its predecessor that is S lags removed, with S being the number of periods in a full season, like 12 months in a year or 7 days in a week.

If both the trend and the seasonal pattern are relatively time-invariant, the differenced time series (first-differenced with respect to the trend; and seasonally-differenced with respect to the seasonality) will have an approximately constant mean.

The required order of differencing is a parameter that should be determined in advance, before fitting a forecast model to the data. "It is important to note that these information criteria tend not to be good guides to selecting the appropriate order of differencing (d) of a model, but only for selecting the values of p and q. This is because the differencing changes the data on which the likelihood is computed, making the AIC values between models with different orders of differencing not comparable. So we need to use some other approach to choose d, and then we can use the AICc to select p and q." (Hyndman, 8.6 Estimation and order selection | Forecasting: Principles and Practice (2nd ed) (otexts.com)).

To find out if differencing is required, we can run four tests:

  • Augmented Dickey-Fuller ADF for first-differencing
  • Kwiatkowski-Phillips-Schmidt-Shin KPSS for first-differencing
  • Osborn-Chui-Smith-Birchenhall OCSB for seasonal differencing
  • Canova-Hansen CH for seasonal differencing

2.3a First-Differencing

We use pmdarima’s ADF and KPSS tests to obtain the recommended order of first-differencing.
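A sketch of the test calls:

```python
# Recommended order of first-differencing according to each test.
from pmdarima.arima import ndiffs

n_adf = ndiffs(y_bc, test="adf")     # ADF recommends 0 in our case
n_kpss = ndiffs(y_bc, test="kpss")   # KPSS recommends 1 in our case
n_diff = max(n_adf, n_kpss)          # resolve disagreements conservatively
print(f"ADF: {n_adf}, KPSS: {n_kpss}, chosen d: {n_diff}")
```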

We get a contradiction. KPSS asks for 1 order of differencing, whereas ADF asks for 0.

  • If the ADF test does not find a unit root, but the KPSS test rejects stationarity, the series is difference-stationary: it still requires differencing.
  • Conversely, if the KPSS test had not rejected stationarity, but the ADF test had found a unit root, the series would have been deemed trend-stationary: de-trending would be the natural remedy (though differencing can also remove a trend).
  • In general: if the tests disagree, we take the higher of the two test results (variable n_diff) as the appropriate order of differencing. Only if both tests agree that the series is stationary can we refrain from differencing.

2.3b Seasonal Differencing

We run the OCSB and CH tests for the Box-Cox-transformed series. Both tests agree that no seasonal differencing is required.
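In code (m=12 periods per season for monthly data):

```python
# Recommended order of seasonal differencing according to each test.
from pmdarima.arima import nsdiffs

print("OCSB:", nsdiffs(y_bc, m=12, test="ocsb"))   # 0: no seasonal differencing
print("CH:  ", nsdiffs(y_bc, m=12, test="ch"))     # 0: no seasonal differencing
```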

2.3c Check After Transformations

We combine the test results for stationarity and compute a differenced time series (see the sketch after the column list below).

  • Column "Pants" in our dataframe below contains the original Google Trends data on sweatpants popularity.
  • Column "y_bc" contains the Box-Cox-transformed values.
  • Column "y_bc_diff" shows the differenced and transformed values.

Final round of diagnostics, now for the differenced data:

  • The differenced data still pass the test for normality, with a p-value higher than 0.05.
  • The stationarity tests do not insist on additional rounds of differencing.

2.4 Split of Training and Test Dataset

We split the Google Trends observations into a training dataset and a test dataset. In the constant TESTP, we had reserved the final 24 months for testing.
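A sketch using pmdarima's train_test_split helper:

```python
# Hold out the final 24 months for testing.
from pmdarima.model_selection import train_test_split

TESTP = 24
y_train, y_test = train_test_split(df["Pants"], test_size=TESTP)
```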

3. SARIMA

3.1 Training: Auto-ARIMA Search for Hyperparameters

The AutoARIMA class of the pmdarima package runs a hyperparameter search for the autoregressive (AR) terms and moving-average (MA) terms in a SARIMA model.

We choose stepwise search, which uses an algorithm developed by Hyndman and Khandakar in 2008 (Automatic Time Series Forecasting: the forecast Package for R (r-project.org)). This algorithm is much faster than a full-grid search because it avoids processing nonsensical parameter tuples in the search space.

We formulate the input for the AutoARIMA search as a so-called pipeline. The pipeline consists of our Box-Cox transformation followed by the AutoARIMA step (and could comprise additional transformations if we wanted to include them). The pipeline syntax will enable us to inverse-transform the model results when we compute predictions.
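A sketch of such a pipeline; the AutoARIMA arguments shown here reflect the test results above (d=1, D=0, 12-month seasonality), but the original notebook's settings may differ:

```python
from pmdarima import AutoARIMA
from pmdarima.pipeline import Pipeline
from pmdarima.preprocessing import BoxCoxEndogTransformer

pipe = Pipeline([
    ("boxcox", BoxCoxEndogTransformer(lmbda2=1e-6)),   # transform first
    ("arima", AutoARIMA(m=12,                          # 12 months per season
                        d=1, D=0,                      # orders from the tests above
                        stepwise=True,                 # Hyndman-Khandakar search
                        trace=True,                    # print the candidate models
                        suppress_warnings=True)),
])
pipe.fit(y_train)
```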

The model summary shows excellent diagnostic results:

  • The Ljung-Box test returns a very high p-value of 0.91. So the residuals represent white noise. They do not contain a signal that the SARIMA model failed to find and could have used to improve its forecast accuracy.
  • The Jarque-Bera test for normality returns a very high p-value of 0.72. The skewness is close to 0 and the kurtosis is close to 3, as they should be for a normal distribution. Therefore, we conclude that the residuals are normally distributed. Forecast values and confidence intervals will be more reliable than in a case of residuals with dubious normality.
  • The Heteroskedasticity test returns a very high p-value of 0.78. We conclude that the residuals have constant variance.
  • Most of the SARIMA parameters have coefficients far from zero, so their influence is not negligible. The exception is the second seasonal AR term, ar.S.L24: its coefficient is close to 0, its confidence interval is wide and straddles zero, and its p-value is very high. We conclude that this second SAR term should probably be dropped; it does not contribute much to the model.

3.2 Prediction Accuracy Metrics

We define some prediction accuracy metrics, which will enable us to compare the SARIMA model for the training dataset we have just prepared with the upcoming out-of-sample predictions.

We fill a dictionary with formulas for the following metrics (see the sketch after the list):

  • mean absolute error (MAE),
  • the mean absolute percentage error (MAPE),
  • the root mean square error (RMSE) and
  • the correlation between the predicted values and observations
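A sketch of such a dictionary, with NumPy doing the arithmetic:

```python
# Accuracy metrics as a dictionary of functions: actuals y, predictions yhat.
metrics = {
    "MAE":  lambda y, yhat: np.mean(np.abs(y - yhat)),
    "MAPE": lambda y, yhat: np.mean(np.abs((y - yhat) / y)) * 100,
    "RMSE": lambda y, yhat: np.sqrt(np.mean((y - yhat) ** 2)),
    "corr": lambda y, yhat: np.corrcoef(y, yhat)[0, 1],
}
```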

3.3 Training: In-Sample Predictions

pmdarima's predict_in_sample method applies the fitted SARIMA(0,1,1)x(3,0,0)[12] model to the months in our training period.

The parameter inverse_transform tells the method to provide the predicted values and their confidence intervals in the same structure as the original time series: the predictions will not be reported as Box-Cox-transformed values.
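A sketch of the call:

```python
# In-sample predictions, reported on the original (inverse-transformed) scale.
preds_in, conf_in = pipe.predict_in_sample(
    return_conf_int=True,
    inverse_transform=True,
)

for name, fn in metrics.items():
    print(name, round(fn(np.asarray(y_train), np.asarray(preds_in)), 3))
```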

The prediction accuracy for the training dataset is not too bad.

The auto-fitted SARIMA model has captured ca. 91.5% of the ups and downs of historical sweatpants popularity: the mean absolute percentage error (MAPE) amounts to an 8.5% discrepancy between predictions and actual values. These 8.5% consist of random fluctuations from which the model could not extract a stable pattern.

The root mean squared error (RMSE) is the standard deviation of the residuals. RMSE tells us how closely the predictions hug the curve of actual observations.

The (linear) correlation between predicted and actual values is 97%.

3.4 Testing: Predictions for Test Dataset

Next, we run the same SARIMA model for the 24 months in the test dataset.
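A sketch of the out-of-sample forecast and its evaluation:

```python
# Forecast the 24-month test horizon.
preds_test, conf_test = pipe.predict(
    n_periods=TESTP,
    return_conf_int=True,
    inverse_transform=True,
)

for name, fn in metrics.items():
    print(name, round(fn(np.asarray(y_test), np.asarray(preds_test)), 3))
```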

We get a much worse prediction accuracy. MAPE now amounts to 23%. The correlation between predictions and actual observations has fallen from 97% to 78%.

So what is afflicting our SARIMA model in 2020–2021?

Large disparities in prediction accuracy between training and test dataset usually imply overfitting.

  • A model that is underfit will exhibit low accuracy both in the training dataset to which it was fitted, and the test dataset.
  • A model that is overfit will present high accuracy in the training dataset, but deals poorly with new data such as in the test dataset.

In our case of Sweatpants Unleashed, the patterns the time series had followed over the preceding 10 years were no longer applicable once the Work-from-Home months started in April 2020. This implies an overfit model, but the overfit was inevitable because the 2020 disruptions could not be isolated as a predictable pattern from the historical 2010–2019 data in the training dataset.

3.5 The Patterns of the Past No Longer Hold in 2020

Let’s revisit the heatmap we had created in our source data investigation. We observe that 2020 was an exceptional year,

  • with an unusual, "unseasonal" spike in Googling for sweatpants in April/May 2020;
  • and another peak in November/December 2020, which was more similar to the seasonal pattern of previous years, but surged to a much higher amplitude than in previous winter seasons.

We can’t distill the reasons from the Google Trends data alone. But this pattern in 2020 suggests that interest in sweatpants shot up in April when many people had to huddle around the warming glow of their notebook screens at home rather than continuing with their morning and evening commutes to their workplaces. Evidently, most of us chose not to wear formal office attire while typing away on our keyboards at the kitchen table.

In November/December 2020, we observe a higher-than-usual amplitude of sweatpants acclaim, presumably because Covid-19 cases were rising fast in the months before vaccines became available in early 2021. In the fall of 2020, people might have expected to have to withdraw into their home caves for a long, dire winter.

Out of our data science toolbox, we could pull the test kit for cointegration and Granger causality (Granger Causality Tyrion Cersei | Towards Data Science). If we compared the intensity of work-from-home with sweatpants popularity in 2020, I'd expect to see considerable Granger causality.

What’s our next step?

Let’s investigate how badly the sudden rise of work-from-home has derailed the time series pattern that had been valid throughout the preceding decade.

4. Predictable Trend versus Trend-Break in 2020

4.1 Training SARIMA Model Up To The Break Point

From our Google Trends raw data 2010–2021, we slice off a training dataset that ends in March 2020, just before lockdowns and work-from-home routines established a new reality.

We fit a new SARIMA model to this training dataset that represents the old normal.

Then we run predictions for the months since April 2020.
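A sketch of this pre-break model; the variable names are our own:

```python
# Train on the "old normal" up to March 2020, then forecast the Covid months.
y_pre = df.loc[:"2020-03", "Pants"]
n_post = len(df.loc["2020-04":, "Pants"])

pipe_pre = Pipeline([
    ("boxcox", BoxCoxEndogTransformer(lmbda2=1e-6)),
    ("arima", AutoARIMA(m=12, stepwise=True, suppress_warnings=True)),
])
pipe_pre.fit(y_pre)

preds_pre_in = pipe_pre.predict_in_sample(inverse_transform=True)     # until March 2020
preds_post, conf_post = pipe_pre.predict(
    n_periods=n_post, return_conf_int=True, inverse_transform=True)   # from April 2020
```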

The new training model returns the predictions for sweatpants popularity until March 2020.

4.2 Testing the SARIMA Model After the Break Point of April 2020

Let’s see how sweatpants would have fared in an alternative universe that was not afflicted by Covid-19 after March 2020.

As expected, the prediction accuracy is quite bad, illustrating that the Covid disruption put the baggy pants trend on a very different, unforeseen path from April 2020.

4.3 The Unforeseen "Work-From-Home" Impact Since April 2020

To discern the magnitude of the Covid-19 impact on the sweatpants fashion corner, let’s compare the predictions for the months after April 2020 in that alternative universe with what happened in our universe.

  • We see how closely the SARIMA predictions (orange) are hugging the actual observations (blue) between 2011 and the first quarter of 2020.
  • Early in the second quarter of 2020, sweatpants show a sudden additional blue spike that had never before occurred in any seasonal pattern. The unanticipated April/May spike causes most of the inaccuracy that our prediction metrics report.
  • The SARIMA model does anticipate the peak in November/December 2020.
  • Yet the model did not foresee that sweatpants would experience a sudden decline in popularity in 2021, immediately after the turn of the year; and even before many work-from-home arrangements were reversed.
  • We can surmise that households had stocked up their strategic sweatpants reserves in the course of 2020. By January 2021, they were well provisioned with loose legwear for the next 2 or 3 years. So the search for sweatpants lost its urgency in 2021.
  • The sweatpants boom of 2020 was followed by a bust in early 2021.

A next step could be an intervention analysis that isolates the work-from-home impact, for instance by defining it as an exogenous variable.

But we won’t do that today. Rather, we will put our loosely-clothed legs up and take a nap.

Mike Von, unsplash.com
