Prophet vs DeepAR: Forecasting Food Demand

Sam Mourad
Towards Data Science
8 min read · Oct 20, 2019


In collaboration with Christian Aalby Svalesen

The global food industry faces significant sustainability challenges, and the UN estimates that “approximately one-third of the global food production is wasted or lost annually”. About 66% of the losses are in food groups where freshness is an important criterion for consumption. In addition, spoilage often occurs due to over-ordering and overstocking, which in most cases is a result of forecasting difficulties. The problem is amplified as one moves upstream in the supply chain and away from consumer purchasing behavior, a phenomenon known as the bullwhip effect, which results in higher inaccuracies and exacerbates the waste problem.

Until now, most, if not all, forecasting solutions on the market have been based on traditional, decades-old methodologies such as ARIMA and exponential smoothing. While these methodologies generate stable forecasts, they struggle to capture quick changes in a time series, especially when those changes are driven by multiple seasonalities or moving holidays.

However, with the advances of machine learning in recent years, several techniques have been developed that make these challenges easier to address.

Crisp is offering a comprehensive and fully automatic forecasting solution to the food industry by combining cutting-edge machine learning techniques along with a novel methodology to model economic losses and freshness of inventory.

In this post, we examine two forecasting models: Facebook Prophet and Amazon Forecast's DeepAR. We found these promising: Amazon's research claims DeepAR yields accuracy improvements of around 15% over state-of-the-art methods, while Prophet allows for a quick and tremendously flexible modeling experience.

Facebook Prophet

The Facebook Prophet model is a combination of multiple functions of time, such as trend, multiple seasonalities, holidays, and user-injected regressors, and is therefore conceptually similar to a Generalized Additive Model (GAM). Its parameters are estimated in Stan, by default via MAP optimization, with full Bayesian sampling available as an option.

Here is the general formula for Prophet, showing all of its components:

y(t) = g(t) + s(t) + h(t) + ε_t

where g(t) is the trend, s(t) the seasonality, h(t) the holiday effects, and ε_t a normally distributed error term.

Trend:

The trend component g(t) can be a linear or logistic function of time. To model non-periodic changes in the trend, changepoints are introduced: an indicator vector a(t) selects which changepoints are active at time t, and a corresponding vector of rate adjustments δ allows the growth rate to change at each of them. The logistic variant introduces a carrying capacity C(t) to prevent unbounded or negative growth in the trend.
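
From the Prophet paper, the two trend forms are:

Piecewise linear: g(t) = (k + a(t)^T δ)·t + (m + a(t)^T γ)

Logistic growth: g(t) = C(t) / (1 + exp(−(k + a(t)^T δ)·(t − (m + a(t)^T γ))))

Here k is the base growth rate, δ the rate adjustments at the changepoints, m the offset, and γ offset adjustments chosen to keep the function continuous at each changepoint.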

Seasonality:

Multiple seasonalities can be modeled with Fourier series, where a smoothing prior on the coefficients together with the truncation of the series acts as a low-pass filter. In Prophet, a seasonality with period P is represented by:

s(t) = Σ_{n=1}^{N} ( a_n cos(2πnt/P) + b_n sin(2πnt/P) )

The higher the number of terms N, the more flexible the seasonal curve and the higher the risk of overfitting.

Holidays and Events:

Holidays are modeled as independent dummy external regressors, and a window of days around each holiday can be included to model the effects leading up to and following the holiday itself.

The priors on the seasonality and holiday parameters are normal distributions whose standard deviations are exposed as hyperparameters, letting the user control the flexibility of each component; the changepoint magnitudes receive a sparsity-inducing Laplace prior that is tuned the same way.
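
Putting the pieces together, here is a minimal sketch of fitting such a model with the Prophet Python API; the file name, horizon, and specific prior scales are illustrative assumptions:

    import pandas as pd
    from prophet import Prophet  # formerly distributed as fbprophet

    # Hypothetical daily demand data with Prophet's expected columns:
    # ds (date) and y (units sold).
    df = pd.read_csv("daily_demand.csv", parse_dates=["ds"])

    m = Prophet(
        growth="linear",               # piecewise linear trend with changepoints
        yearly_seasonality=10,         # Fourier order N for the yearly terms
        weekly_seasonality=True,
        seasonality_prior_scale=10.0,  # std. dev. of the normal prior on seasonality
        changepoint_prior_scale=0.05,  # scale of the Laplace prior on changepoints
    )
    m.add_country_holidays(country_name="US")  # holiday dummy regressors

    m.fit(df)
    future = m.make_future_dataframe(periods=90)  # 90-day forecast horizon
    forecast = m.predict(future)
    print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())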

Pros

  • Ability to easily model any number of seasonalities
  • Ability to work with missing dates in time-series
  • Easily integrates holidays in the model
  • Built-in uncertainty modeling with full Bayesian sampling allows for model transparency
  • Allows flexible piecewise linear or logistic trend modeling with user-specified changepoints

Cons

  • Can be volatile with a low number of observations
  • Long-horizon forecasting can be volatile with automatic changepoint selection

Amazon’s DeepAR

The DeepAR algorithm (or its evolved version, DeepAR+, referred to as DeepAR in this article for simplicity), developed by Amazon Research, applies an LSTM-based recurrent neural network architecture to the probabilistic forecasting problem. Long Short-Term Memory (LSTM) is known for its ability to learn dependencies between items in a sequence, which makes it suitable for time-series forecasting. It implements a series of gates through which information is either passed on or forgotten. Instead of just passing on a hidden state h, it also passes on a long-term cell state c. This allows it to learn both short- and long-term time-series patterns.
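
For reference, the standard LSTM cell updates are:

f_t = σ(W_f x_t + U_f h_{t−1} + b_f)   (forget gate)
i_t = σ(W_i x_t + U_i h_{t−1} + b_i)   (input gate)
o_t = σ(W_o x_t + U_o h_{t−1} + b_o)   (output gate)
c̃_t = tanh(W_c x_t + U_c h_{t−1} + b_c)   (candidate state)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t   (long-term state update)
h_t = o_t ⊙ tanh(c_t)   (hidden state)

where σ is the sigmoid function and ⊙ is element-wise multiplication; the forget and input gates control what is dropped from and added to the long-term state c.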

The DeepAR model implements such LSTM cells in an architecture that allows for simultaneous training of many related time-series and implements an encoder-decoder setup common in sequence-to-sequence models. This entails first training an encoder network on the whole conditioning data range, then outputting an initial state h. This state is then used to transfer information about the conditioning range to the prediction range through a decoder network. In the DeepAR model, the encoder and decoder networks have the same architecture.

The DeepAR model also takes a set of covariates X, which are time series related to the target time series. In addition, a categorical variable can be supplied to encode the grouping of related time series; the model trains an embedding vector that learns the common properties of all the time series in a group. Another feature is that DeepAR automatically creates additional feature time series depending on the granularity of the target time series, such as "day of the month" or "day of the year", allowing the model to learn time-dependent patterns.
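
As a concrete illustration, a single training record in the JSON Lines format that SageMaker's DeepAR expects might look like this (all values are made up):

    import json

    # One training record in SageMaker DeepAR's JSON Lines format.
    # "cat" encodes the product's grouping; "dynamic_feat" carries covariates
    # aligned with the target. All values here are illustrative.
    record = {
        "start": "2019-01-01",                    # timestamp of the first target value
        "target": [12, 15, 9, 14, 22, 30, 11],    # daily units sold
        "cat": [3],                               # e.g. retail category id
        "dynamic_feat": [[0, 0, 0, 0, 0, 1, 0]],  # e.g. a holiday indicator series
    }
    print(json.dumps(record))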

The DeepAR interface expects a fair number of hyperparameters to be set, including the context length used for training, the number of epochs, the dropout rate, and the learning rate, among others.
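
A minimal sketch of setting these via the SageMaker Python SDK; the role, instance type, S3 path, and likelihood choice are assumptions for illustration:

    import sagemaker
    from sagemaker import image_uris
    from sagemaker.estimator import Estimator

    session = sagemaker.Session()
    role = "arn:aws:iam::123456789012:role/SageMakerRole"  # placeholder role
    image = image_uris.retrieve("forecasting-deepar", session.boto_region_name)

    estimator = Estimator(
        image_uri=image,
        role=role,
        instance_count=1,
        instance_type="ml.c5.2xlarge",
        sagemaker_session=session,
    )
    estimator.set_hyperparameters(
        time_freq="D",          # daily data
        context_length=90,      # conditioning range, in time steps
        prediction_length=90,   # forecast horizon, in time steps
        epochs=500,
        dropout_rate=0.1,
        learning_rate=1e-3,
        likelihood="negative-binomial",  # suits non-negative count data
    )
    estimator.fit({"train": "s3://my-bucket/deepar/train/"})  # placeholder path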

Pros and Cons

DeepAR has the advantage of training on hundreds or thousands of related time series simultaneously, potentially offering significant scalability. It also has the following technical benefits:

  • Minimal Feature Engineering: The model requires minimal feature engineering, as it learns seasonal behaviour on given covariates across time series.
  • Monte Carlo Sampling: It is also possible to compute consistent quantile estimates for the sub-ranges of the function, as DeepAR implements Monte Carlo sampling. This could, for instance, be useful when deciding on safety stock.
  • Built-in item supersession: It can forecast items with little history by learning from similar items
  • Variety of likelihood functions: DeepAR does not assume Gaussian noise, and likelihood functions can be adapted to the statistical properties of the data allowing for data flexibility.

Model Evaluation

Common practice in data science is to use statistical measures to facilitate model selection and tuning toward desirable prediction accuracy. However, these measures fail to capture the actual business impact of forecasting errors. We developed a custom measure that quantifies these errors, differentiating between errors due to over-forecasting and errors due to under-forecasting. Our approach to developing this evaluation will be explained in a future article, but the measure allows us to evaluate whether the two models caught daily trends where this matters (e.g. fresh products) or simply achieved good average accuracy for products where that is sufficient (e.g. dried food).
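
To make the idea concrete, here is a hypothetical asymmetric cost function in the same spirit; it is not Crisp's actual measure, and the unit costs are made up:

    import numpy as np

    def asymmetric_cost(actual, forecast, over_cost=1.0, under_cost=2.0):
        """Penalize over-forecasts (spoilage/waste) and under-forecasts
        (stock-outs/lost sales) at different, hypothetical unit costs."""
        actual = np.asarray(actual, dtype=float)
        forecast = np.asarray(forecast, dtype=float)
        over = np.clip(forecast - actual, 0.0, None)   # units at risk of spoiling
        under = np.clip(actual - forecast, 0.0, None)  # unmet demand
        return float((over_cost * over + under_cost * under).sum())

    # Example: over-forecasting by 10 units costs less here than
    # under-forecasting by 10 units.
    print(asymmetric_cost([100, 100], [110, 90]))  # 10*1.0 + 10*2.0 = 30.0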

For the model comparison, we used the time series of 1,000 food retail products, ranging anywhere between 3 months and 3 years of data. For DeepAR, the data included holiday information as well as a product grouping by retail category. We set the number of epochs to 500 and both the context length and the prediction length to 90 days, leaving all other hyperparameters at their default values. For Prophet, a separate model was trained for each time series, including holidays. The hyperparameters were tuned on the most popular food item and included a multiplicative weekly seasonality, an additive annual seasonality, and a high variance for the prior distribution of both seasonalities.
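
A sketch of how such a per-product configuration might look in Prophet; the Fourier orders and prior scales are assumptions, and holidays_df / train_df are hypothetical inputs:

    from prophet import Prophet

    m = Prophet(
        yearly_seasonality=False,  # replaced below by an explicit additive term
        weekly_seasonality=False,  # replaced below by an explicit multiplicative term
        holidays=holidays_df,      # hypothetical DataFrame of holiday dates
    )
    m.add_seasonality(name="weekly", period=7, fourier_order=3,
                      prior_scale=25.0, mode="multiplicative")
    m.add_seasonality(name="yearly", period=365.25, fourier_order=10,
                      prior_scale=25.0, mode="additive")
    m.fit(train_df)                # one model per product time series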

Using an expanding-window cross-validation methodology with a 7-day horizon, 12 forecast origins, and a 7-day step size, the MAPE for the majority of products was significantly higher for DeepAR than for Prophet. However, MAPE is not a stable metric here, especially since several of the products had zero or near-zero entries, so we also visually inspected the time plots to compare the two forecasts; a sketch of this cross-validation scheme, using Prophet's built-in diagnostics, follows the loss totals below. Our custom metric likewise showed that DeepAR's forecasting errors would cause considerably higher business costs than Prophet's:

  • Total losses with a Prophet forecast: 890,013.29 units
  • Total losses with a DeepAR forecast: 1,502,708.56 units
  • Prophet's losses are about 41% lower than DeepAR's.
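
For reference, the expanding-window scheme described above maps naturally onto Prophet's built-in diagnostics; a minimal sketch, assuming a fitted model m (the initial training window is an assumption):

    from prophet.diagnostics import cross_validation, performance_metrics

    # 7-day horizon with a new forecast origin every 7 days; the training
    # window expands forward from an assumed 180-day initial period.
    df_cv = cross_validation(m, initial="180 days", period="7 days",
                             horizon="7 days")
    metrics = performance_metrics(df_cv)  # includes MAPE (when defined), RMSE, etc.
    print(metrics[["horizon", "mape"]])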

Below are two time plots comparing the actual and forecast signals for the product on which DeepAR achieved its best performance:

On visual analysis of the results, it is clear that DeepAR had trouble picking up the intensity of a few highs and lows and missed some of the drops in the signal. Prophet was generally more flexible, performed better, and was able to match the magnitude of the variation in actual demand. A possible explanation for DeepAR's performance is the need for better groupings: the category groupings included products with very different demand characteristics, and the model might therefore benefit from groupings where product demand behavior is analytically inferred rather than set by retail category (e.g. dairy). Although beyond the scope of our current investigation, future experiments should examine whether DeepAR's performance improves when products are grouped by clustering methods or spectral analysis.

We conclude that Prophet wins in this forecasting exercise for the following reasons:

  • Better forecasting performance with the food demand patterns
  • The flexibility of the model and ease of setup
  • No need for a large number of products to reach good performance
  • No need for product grouping
  • Allows deployment to any cloud platform


Senior Data Scientist at Preston Venture — interested in Machine Learning, Forecasting, and random topics in statistics and computer science.