Need for speed: optimizing Facebook-Prophet fit method to run 20X faster

Simplifying the Bayesian model

Oren Matar
Towards Data Science


This is the second part of a series about optimizing the internal mechanism of Prophet. Part one is not a prerequisite to understand this part, but it is recommended.

Since I released part one, the number of downloads of the Prophet package has reached 30 million. It is one of the most widely used time series forecasting packages out there. Despite not being the most accurate algorithm, it works relatively well out of the box, while still allowing you to fine-tune its hyperparameters and add extra regressors (i.e., any data that may be correlated with your target, e.g., oil prices or weather), which makes it a good baseline algorithm for many use cases.

But its speed remains an issue. In part one we saw that prediction can take a full second, but we managed to reduce that to ~15ms. Still, fitting a single item takes 50–100ms, depending on the number of changepoints and whether seasonality is used. These runtimes limit Prophet’s scalability to relatively small datasets, since each item must be fitted separately. But it need not be so: just like prediction, Prophet’s fit method can be optimized dramatically with a couple of tricks.

But before we continue, a disclaimer: at the end of part one I provided code that can replace Prophet’s uncertainty prediction; in this post I will show how the fitting can be optimized, but writing fully functional code, including all the arguments Prophet can take, requires a bit more work. Along the way, though, we will explore new ideas in time series modeling, get a better understanding of Prophet under the hood, and even learn a bit about Bayesian statistics.

A primer on Prophet

The Prophet model is an additive regression model, with three main components:

Trend, seasonality, and holidays (and we can add extra regressors, whose functioning is identical)

The Prophet model components (Image by author)
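In formula form, the model is additive; roughly, following the Prophet paper’s notation:

y(t) = g(t) + s(t) + h(t) + \epsilon_t

where g(t) is the trend, s(t) the seasonality, h(t) the holiday (and extra-regressor) effects, and ε_t the noise.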

We will start by focusing on the trend factor, and as we optimize it, we will see that adding the other terms is not a challenge. We will limit ourselves to the case where the trend is linear.

Prophet fitting the linear trend with change-points (Image by author)

As seen above, Prophet fits a linear slope to the data, but creates changepoints for the slope. When forecasting into the future, only the final slope is used, but the number and size of the changepoints influence the uncertainty interval (see part one).

The number of changepoints is set by the user (the default is 25) and those are distributed uniformly over the data, not including the last 20%. Prophet does not “find” changepoints; it creates many changepoints and lets the fitting process use the right ones.
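As an illustration (a minimal sketch, not Prophet’s exact code), uniformly spacing the potential changepoints over the first 80% of the history could look like this:

import numpy as np

def potential_changepoint_indices(n_points, n_changepoints=25):
    # spread the changepoints evenly over the first 80% of the series,
    # skipping index 0 (the start of the series is not a changepoint)
    cutoff = int(np.floor(0.8 * n_points))
    return np.linspace(0, cutoff, n_changepoints + 1).round().astype(int)[1:]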

Placement of potential changepoints by Prophet (Image by author)

In the figure above, the vertical lines represent the changepoints created by Prophet, but as you can see, most of them do not cause a meaningful change in the slope. Prophet uses a Bayesian model to find the best parameters for the data: the intercept, the initial slope, and the deltas between consecutive slopes. Most of the slope deltas end up at 0 due to the way the Bayesian model works.

Bayesian Priors and regularization

Without delving too deep into the beauty of Bayesian modeling, here are the basics: when estimating parameters with a Bayesian method you must supply a “prior”, which represents where you believe your parameters lie before seeing the data. The Bayesian model takes this prior and the data and returns the posterior: the updated belief of where the values of these parameters most likely lie.

In the case of the slope-deltas in Prophet, the prior used is Laplace, which looks like this:

Laplace distribution with mean 0 and different standard deviations (Image by author)
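For reference, the Laplace density with location 0 and scale b (the parameter the figure varies) is:

p(\delta \mid b) = \frac{1}{2b}\, e^{-|\delta| / b}

The smaller b is, the more sharply the distribution concentrates around zero.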

In other words, Prophet tells the model to assume that the delta between two consecutive slopes is probably around zero, and that there is no delta unless the data strongly suggests a slope change. Prophet creates lots of potential changepoints, but when fitting their parameters it uses this zero-centered prior to make sure most of them are effectively ignored. The idea is like adding a regularization term to the loss function: it pushes the optimizer to prefer small values.

Modifying the scale of the Laplace changes its effect on the results: a smaller standard deviation means we are confident most deltas must be close to zero, while a larger standard deviation means that non-zero values are likely too, so the model will not push the deltas towards zero as much. Again, this is like adjusting the strength of the regularization in linear regression. In Prophet you can adjust the Laplace prior’s standard deviation with the changepoint_prior_scale argument. As seen below, it can dramatically affect the number of changepoints actually used by the model.

The impact of the Laplace prior on the changepoints fit (Image by author)

Prophet uses pystan, a library for Bayesian modeling, to find the optimal deltas. But Bayesian modeling takes a very long time to fit (tens of milliseconds!). If only we could substitute it with a simple linear regression, we could reduce the runtime without sacrificing accuracy, improving the scalability of the algorithm.

Modeling slopes

To replace pystan with a linear regression model, we need to find a way to represent slopes and slope-changes using features.

To model a single slope we can create a linear feature with the values 1, 2, 3, 4, 5, …, n, where n is the length of the time series at hand. If we use this feature in a linear regression, it is easy to imagine what it can accomplish.

a simple linear trend as a feature for linear regression (Images by author)
Time series fitted only using a linear trend (Images by author)

The linear regression finds the best coefficient, representing how steep this slope should be; with a negative coefficient the regression can also create a downward slope. We can easily add this feature to other features that represent seasonality and extra regressors.
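As a minimal sketch (illustrative names, assuming y is a numpy array holding the observed series), fitting this single-slope trend looks like:

import numpy as np
from sklearn.linear_model import LinearRegression

n = len(y)                         # length of the time series
trend = np.arange(1, n + 1)        # the feature 1, 2, 3, ..., n
X = trend.reshape(-1, 1)
lr = LinearRegression().fit(X, y)  # lr.coef_[0] is the slope, lr.intercept_ the offset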

But this is a single slope. Can we create a feature to represent a slope-change? How about this:

Linear trend changepoint feature (Image by author)

The new feature values are 0, 0, 0, 0, …, 0, 1, 2, 3, 4, …, which may remind you of the ReLU activation function, but here it is used as a feature, not as an activation function.

Its coefficient represents the delta between the slopes before and after the changepoint. Sum it together with the initial slope feature and you can create this shape:

A linear regression finding the coefficients for a linear trend and changepoint features. Each feature multiplied by its coef (left). Their sum (right). Image by author.

We can create any number of these changepoint features, allowing the regression to adopt a new slope at each changepoint.

“The comb” features — changepoint features in fixed intervals and a fit based on the features (Image by author)
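A sketch of building these changepoint features (assuming n_points observations and a list of changepoint indices, e.g., from the placement sketch earlier):

import numpy as np

def changepoint_features(n_points, changepoint_indices):
    # one column per changepoint: zeros up to the changepoint,
    # then 1, 2, 3, ... afterwards (a ReLU-shaped ramp)
    t = np.arange(1, n_points + 1)
    cols = [np.clip(t - cp, 0, None) for cp in changepoint_indices]
    return np.column_stack(cols)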

We’ll call these “the comb” features. We can confirm that they represent the same deltas as Prophet by training Prophet and using its trained deltas as coefficients for these features.

Giving us:

Using Prophet’s delta coefficients with the comb features, getting the same result as Prophet (Image by author)

The comb features are multiplied by the deltas found by Prophet, and summed together to get the exact same prediction. We’ve shown that we can translate Prophet into a linear model using the “comb”.

But that’s not what we want: we want to train those delta coefficients ourselves, without pystan. Can we emulate the Laplace prior using linear regression?

Training Prophet without Prophet

Recall that the whole point of the prior was to push the coefficients towards 0. This may remind you of L1 regularization (penalizing the absolute values of the coefficients), and not without reason: it can be proved that the maximum a posteriori (MAP) estimate of a Bayesian model with a Laplace prior is equivalent to a linear regression with L1 regularization (aka Lasso regression). The Laplace prior is the Bayesian interpretation of the same concept: instead of thinking in terms of a cost function, Bayesians think in terms of priors.
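Sketching the argument: with Gaussian noise of standard deviation σ and an independent Laplace prior with scale b on each coefficient, the negative log-posterior is, up to additive constants,

-\log p(\beta \mid y) \;\propto\; \frac{1}{2\sigma^2}\sum_t \big(y_t - X_t\beta\big)^2 + \frac{1}{b}\sum_j |\beta_j|

Minimizing this is exactly a least-squares fit with an L1 penalty, i.e., Lasso.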

Therefore, we can train a Lasso regression on the comb features and get the exact same result as Prophet. As I mentioned, the standard deviation determines how heavily the model will “push” coefficients towards zero, just like the L1 alpha, only inverted: a larger standard deviation means less regularization, whereas a larger alpha means more regularization (more “cost” or “loss” for large coefficients). All we need is the formula to convert the standard deviation of the Laplace prior (given in changepoint_prior_scale) to the equivalent L1 alpha.

conversion formula from b (the std of Laplace) to the L1 regularization parameter (Image by author)

Where b is the standard deviation of the Laplace, and sigma is the standard deviation of errors of our model’s prediction.

Wait, what? How can we use the standard deviation of the errors before we have trained a model? We can’t, but we can estimate it. Let’s assume it is 0.3 and run a Lasso with this conversion:

Comparison of Prophet and comb-Lasso training with estimated sigma (Image by author)

Not bad, but not quite there; our alpha didn’t match the Laplace prior because our sigma estimate was off. But now that we have a prediction, we can better estimate the errors and re-compute the correct alpha. We can repeat this iteratively: create a model, estimate the standard deviation of errors and re-compute alpha.
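Here is a simplified sketch of such an iterative procedure, in the spirit of the lasso_by_laplace_prior function timed below (assuming sklearn’s Lasso objective, which divides the squared error by 2n, under which the conversion works out to alpha = σ² / (n · b)):

import numpy as np
from sklearn.linear_model import Lasso

def lasso_by_laplace_prior(X, y, changepoint_prior_scale, n_iters=5, sigma_init=0.3):
    # iteratively refit: convert the Laplace scale b into an L1 alpha using the
    # current estimate of the residual std, fit, then re-estimate the std
    sigma = sigma_init
    n = len(y)
    for _ in range(n_iters):
        alpha = sigma ** 2 / (n * changepoint_prior_scale)   # assumed conversion
        model = Lasso(alpha=alpha, fit_intercept=True, max_iter=10_000).fit(X, y)
        sigma = np.std(y - model.predict(X))
    return model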

Playing with the changepoint_prior_scale, we get these comparisons:

Comparison of Prophet and Lasso results, given changes in the prior_scale (Images by author)

As you can see, we can create an almost identical fit to Prophet, using the comb features and this conversion from the Laplace std. You can experiment with different values for the prior std and see that both Prophet and Lasso will change their regularization accordingly and remain similar. But isn’t training a few Lassos slower than training a single Prophet? Nope.

%%timeit
m = Prophet(n_changepoints=15, changepoint_prior_scale=changepoint_prior_scale, growth='linear', uncertainty_samples=None, yearly_seasonality=False)
m.fit(ds)

output:

60.2 ms ± 658 µs per loop

Prophet takes 60ms to fit the time series. What about our iterative Lasso?

%%timeit
lr = lasso_by_laplace_prior(feat, ds.y / scale, changepoint_prior_scale)

output:

11.2 ms ± 239 µs per loop

Even 5–10 Lassos are significantly faster than pystan’s optimizer. In fact, if you profile the Lasso fit, you’ll find that most of the time in each fit is spent preprocessing X and y (e.g., normalizing them). At the moment this preprocessing occurs in every iteration, but it can be moved outside the loop and run only once, which would cut the Lasso training time by at least 50%.

If we don’t particularly care about emulating Prophet and simply want to model a time series, we can use the comb features, add seasonality features and other regressors, and set a single alpha. After all, both the alpha and the standard deviation of the prior are arbitrary values (which may be optimized using cross-validation), so why would we care about finding an exact match? No reason that I can think of, except that it gives us a learning opportunity about Prophet, regularization, and time series modeling. So, let’s say we want to emulate Prophet precisely, but faster: what else is missing?

Feature-specific regularization

If you look closely at the examples above, you’ll notice that the initial slope is less steep for the Lasso than for Prophet. The reason is that Prophet only places the Laplace prior (or L1 regularization) on the deltas, the changes between slopes, and not on the initial slope. Our Lasso model regularizes all features, including the initial slope, so it does not emulate Prophet perfectly.

Before we continue, I’d like to point out that Prophet’s modeling is a little strange in a sense. Take the following example:

Completely random data fitted with a linear trend (Image by author)

The values of this time series were produced completely at random, and yet you can find some slope purely due to chance. But if we wanted to predict the future of this random series, using the slope would cause overfitting, and the best prediction would simply be the mean of the values. It makes sense to place a prior belief on having no slope at all, and to be convinced otherwise only if there is a very clear slope that is unlikely to occur randomly.

Prophet does in fact place a prior on the initial slope (you technically have to in any Bayesian model) — the prior is a Gaussian distribution, instead of Laplace.

Laplace and Gaussian distributions with std=1 (Image by author)

The Gaussian curve is flatter than the Laplace, so it is less restrictive towards zero, and the standard deviation Prophet uses for the Gaussian prior is 5. Since the data is scaled to have a maximum value of 1, the initial slope usually has values well below 5 anyway. In other words, Prophet technically places a prior on the initial slope, but it is an “uninformative prior”: it hardly limits the value of the coefficient. As we’ve just seen, not regularizing the initial slope can be problematic.

But let’s continue emulating Prophet: if L1 regularization is the equivalent of a Laplace prior, what is the equivalent of the Gaussian prior? You guessed it: you can prove that L2 regularization (penalizing squared coefficients) yields the same MAP as a Gaussian prior. Therefore, to emulate Prophet we need to use different regularization parameters for different features. Sadly, sklearn does not allow this; the same regularization must apply to all features. We’re going to solve this, but before we do, let’s talk about seasonality and the other components of Prophet.

Seasonality and extra regressors

Seasonality in Prophet is modeled using sine wave features. Prophet automatically creates a set of sine waves with different frequencies to represent weekly and yearly seasonality. Pystan finds the best coefficients for the sine waves so that their sum best fits the data. As a Fourier series, these waves are flexible enough to fit most shapes of seasonality.

Training a time series using sine wave features (Image by author)
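As a sketch, here is the standard Fourier-feature construction (sine/cosine pairs at increasing frequencies; t is the time index and period is, e.g., 365.25 for yearly seasonality on daily data):

import numpy as np

def seasonality_features(t, period, order):
    # one sine and one cosine column per harmonic; the regression learns a
    # coefficient per column, and their weighted sum forms the seasonal curve
    cols = []
    for k in range(1, order + 1):
        cols.append(np.sin(2 * np.pi * k * t / period))
        cols.append(np.cos(2 * np.pi * k * t / period))
    return np.column_stack(cols)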

What about their prior/regularization? Prophet uses a Gaussian prior for the seasonality features, and you can set its scale using the seasonality_prior_scale argument. The default scale is 10, which again means hardly any regularization. Holidays and extra regressors are treated the same way: a wide Gaussian prior by default.

Putting it all together

We’re basically done; all we need is a linear regression with a custom loss function that can take different regularization parameters for different features: L2 for the seasonality and the initial slope, L1 for the comb features. The regularization part of the loss is:

Regularization part of the cost function (Image by author)

Where:

  • σ is the standard deviation of errors
  • b is the changepoint_prior_scale (std of Laplace prior for changepoints)
  • s is the seasonality_prior_scale (std of Gaussian prior for seasonality)
  • β_fs is the coefficient of the first slope (the linearly increasing feature)
  • β_cp are the coefficients of the comb features
  • β_s are the coefficients of the seasonality features

The regularization loss is added to the squared error loss to get the cost function.
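Written out explicitly (a reconstruction under the same scaling as the Lasso conversion above, i.e., the negative log-posterior multiplied by 2σ², with the std-5 Gaussian prior on the first slope), one consistent form of the full cost is:

\text{cost} = \sum_t (y_t - \hat{y}_t)^2 + \frac{2\sigma^2}{b}\sum|\beta_{cp}| + \frac{\sigma^2}{s^2}\sum\beta_{s}^2 + \frac{\sigma^2}{5^2}\,\beta_{fs}^2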

We then use a gradient descent optimizer such as Adam to minimize the cost. Instead of iteratively re-running the model as we did with the Lasso, we can incorporate the sigma estimation into the process: re-estimate sigma every k iterations of the optimizer.
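A minimal single-series sketch of this idea, using PyTorch’s Adam optimizer (illustrative names; per-feature penalty weights such as 1/b for the comb features and 1/(2s²) for the seasonality are passed in, matching the cost written above):

import torch

def fit_per_feature_regularized(X, y, l1_weight, l2_weight,
                                n_steps=2000, lr=0.05, sigma_every=100):
    # X: (n, p) float tensor of features, y: (n,) float tensor of targets
    # l1_weight / l2_weight: (p,) per-feature penalty multipliers
    # (zeros switch a penalty off for that feature)
    beta = torch.zeros(X.shape[1], requires_grad=True)
    intercept = torch.zeros(1, requires_grad=True)
    sigma2 = torch.tensor(0.3 ** 2)           # initial guess for the error variance
    opt = torch.optim.Adam([beta, intercept], lr=lr)
    for step in range(n_steps):
        opt.zero_grad()
        pred = X @ beta + intercept
        sq_err = ((y - pred) ** 2).sum()
        # per-feature regularization, scaled by 2*sigma^2 as in the cost above
        reg = 2 * sigma2 * ((l1_weight * beta.abs()).sum() + (l2_weight * beta ** 2).sum())
        loss = sq_err + reg
        loss.backward()
        opt.step()
        if (step + 1) % sigma_every == 0:     # re-estimate sigma every k steps
            with torch.no_grad():
                sigma2 = ((y - (X @ beta + intercept)) ** 2).mean()
    return beta.detach(), intercept.detach()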

Note that both the comb features and the seasonality sine features depend only on the length of the time series. This means that, unless extra regressors are added, none of the features in this model are target-specific: we can create X once for an entire dataset.

I created a PyTorch implementation which can take L1 and L2 regularization parameters or Laplace and Gaussian priors’ standard deviation per feature. Since optimizing only 20–30 parameters is not PyTorch’s forte, this implementation takes an entire dataset, with multiple items, and estimates the coefficients for each item simultaneously but separately, which is much faster per item. The code and an example script used to compare the results of this model with the original Prophet can be found here.

And here are a few examples of the fit:

Comparison of Prophet prediction and PyTorch-implemented feature-specific regularization for Laplace emulation (Images by author)

The mean Pearson correlation between this implementation and the original Prophet is 0.995. It is still slightly off due to tiny differences in the implementation that I won’t get into (they can be reduced with a little more work) and the stochastic nature of gradient descent optimizers. In any case, there is no reason to believe that the new implementation is less accurate.

Our model emulates the Bayesian model using a simple linear regression with per-feature regularization. But the runtime for Prophet is ~60ms per item, while for the PyTorch implementation it is ~3ms per item. More can be done to reduce the runtime, such as early stopping when the loss converges and learning rate schedulers, but that is beyond our scope.

Conclusion and takeaways

In the first part of this series, optimizing the prediction code, I provided code that can easily substitute Prophet’s original prediction method. For the fit it is more complicated: a full implementation requires the ability to add extra regressors and holidays, and to auto-detect whether to use weekly and yearly seasonalities, which can be copied from Prophet. There’s more work needed before productization, but it’s all technical. If you use Prophet very frequently, taking on this project can be worth your while.

So, what was the point of this exercise?

Firstly, “the comb” features are a creative way to model slope changepoints and can be borrowed for other time series models.

Secondly, I hope that this form of problem solving, together with studying the specific mechanisms of Prophet, Bayesian regression, and the Bayesian interpretation of regularization, is inspiring, and that the lessons learned can assist with other problems.

Finally, it serves to prove my motto: any model that can be a linear regression should be a linear regression. Linear regressions are elegant, incredibly fast, and as explainable as models get. Many data scientists rush to more sophisticated models, neglecting the power of linear regression, which can often do the trick just as well. In this case, all it took was a bit of creative feature engineering to reduce Prophet to a linear regression model and significantly improve its runtime.

Where do we go from here?

One area for further research is the auto-detection of slope changepoints. Instead of creating many changepoints and trusting regularization to avoid overfitting, a good algorithm for slope-change detection could improve both accuracy and runtime.

In the first post of this series, we found that the fit takes 60–100ms and the predict can take a full second; we then reduced the predict to ~10ms. At that point there was not much point in further optimizing the prediction method, since most of the time was spent on the fit.

Now that the fit is reduced to ~3ms, it is again the predict method that consumes most of the runtime. In part 3 of this series, we will return to the vectorized predict method we developed in part 1 and see how we can achieve similar results in microseconds, completing an overall 1000X runtime improvement for fit & predict.


I am a data scientist with experience in AutoML systems, NLP, Bayesian statistics, and time series forecasting.