Using Open Source Prophet Package to Make Future Predictions in R

Harel Rechavia
Towards Data Science
6 min read · May 17, 2017


Almost every company wants to know where it will be one week, one month, or one year from now.
The answers to these questions are valuable when planning the company's infrastructure, KPIs (key performance indicators), and employee goals.
Hence, forecasting is one of the common tasks data professionals are asked to take on.

One tool which was recently released as open source is Facebook's time series forecasting package, Prophet. Available for both R and Python, it is a relatively easy model to implement, with some much-needed customization options.
In this post I'll review Prophet and follow up with a simple R code example. The code flow is heavily inspired by the official package user's guide.

We will use an open data set extracted from wikishark, holding the daily number of visits to the LeBron James Wikipedia article page, and build daily predictions based on the historical data.
* wikishark was shut down after this article was released; you can use another site to get the data.

Phase 1 — Install and import prophet

install.packages('prophet')
library(prophet)
library(dplyr)

Phase 2 — Loading the data set

stats <- read.csv("lebron.csv", header = FALSE, sep = ",")
colnames(stats) <- c("ds", "y")
head(stats)
# daily data points starting from 2014-01-01
stats$y <- log10(stats$y)

For our code example we will transform the data and use the log of the entrances. This will make the prediction visualizations easier to interpret.
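
Since the model now trains on log10 values, its predictions will come out in log space too. A minimal sketch of inverting the transform once the forecast object from Phase 4 exists (yhat_views is an illustrative name of my own, not a Prophet column):

# invert the log10 transform to read predictions in raw page views
# yhat_views is an illustrative column name, not created by Prophet
forecast$yhat_views <- 10 ^ forecast$yhat
tail(forecast[c('ds', 'yhat_views')])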

Phase 3 — Exploring the data

View(summary(stats))
plot(y ~ ds, stats, type = "l")

We can see the data runs from 2014-01-01 to 2016-12-31, with yearly seasonal peaks from April through June.

Phase 4 — Basic Predictions

m <- prophet(stats)
future <- make_future_dataframe(m, periods = 365)
forecast <- predict(m, future)

As with machine learning models, the first command fits a model on the data frame; the predict command then applies the fitted model to produce predictions for the requested number of days.
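
If you want to verify the horizon before predicting, the future data frame is easy to inspect; by default it holds the historical dates plus the requested future ones:

nrow(future) - nrow(stats) # should equal 365
tail(future) # the last rows are the appended future dates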

plot(m, forecast)

The out-of-the-box visualizations of the Prophet package are quite nice, with predefined tick marks, data points, and uncertainty intervals. This is one of the advantages of this open-source package: no extra customization is needed, and the first result is fast and good enough for most needs.

Predicting one year of future data points

Using this graph we can spot the yearly trend and seasonality much more clearly, and see how they are used for making predictions.

tail(forecast[c('ds', 'yhat', 'yhat_lower', 'yhat_upper')])

The forecast object holds the raw data with the predicted value for each day and its uncertainty interval. It is also possible to access the prediction's trend and seasonality components with:

tail(forecast)
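
For example, with the default weekly and yearly seasonalities, each component's contribution lives in its own column of the forecast data frame:

# each component's daily contribution to the prediction
tail(forecast[c('ds', 'trend', 'weekly', 'yearly')])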

Phase 5 — Inspecting Model Components

prophet_plot_components(m, forecast)

Keeping up with the simplicity, it's easy to look into the components of the model: the overall trend, the weekly seasonality, and the yearly seasonality.

Model components

Phase 6 — Customizing holidays and events

The last components graph shows the rising interest in LeBron James during the NBA playoffs and the NBA finals. At this point the model recognizes this as yearly seasonality that returns every year. On a side note, LeBron James currently holds a streak of six consecutive NBA Finals appearances, starting with Miami and continuing with the Cavaliers, so we should expect the same seasonality year after year.

Adding holidays and events is a major advantage of the package: it makes the predictions more accurate and lets the user take known future events into consideration. The developers have made this customization much easier than in prior time series packages, where events and holidays had to be manually adjusted or ignored in order to make predictions. Think of an e-commerce website that can add all of its recurring campaigns and promotions and set revenue goals based on known future campaign dates.
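
As a rough sketch of that e-commerce scenario (the campaign name and dates below are invented for illustration), the exact same mechanism applies:

# hypothetical recurring promotion for an e-commerce site
black_friday <- data_frame(
  holiday = 'black_friday',
  ds = as.Date(c('2014-11-28', '2015-11-27', '2016-11-25')),
  lower_window = 0,
  upper_window = 3 # the effect lingers a few days after the event
)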

Adding events and holidays is done by creating a new data frame holding the dates on which the events begin and how many days they span. In this example we will add the NBA playoffs and the NBA finals as events.

playoff_brackets <- data_frame(
  holiday = 'playoffs',
  ds = as.Date(c('2017-04-16', '2016-04-17', '2015-04-19', '2014-04-19')),
  lower_window = 0,
  upper_window = 45
)
playoff_finals <- data_frame(
  holiday = 'playoff_finals',
  ds = as.Date(c('2016-06-02', '2015-06-04', '2014-06-05')),
  lower_window = 0,
  upper_window = 20
)

Using the lower_window and upper_window parameters, the user sets how many days before and after each date the event covers; here the playoff effect extends 45 days past each start date. The two mappings are row-bound into a single object and passed via the holidays parameter.

holidays <- bind_rows(playoff_brackets, playoff_finals)
m <- prophet(stats, holidays = holidays)
forecast <- predict(m, future)
plot(m, forecast)

Notice the model now predicts the values during the peaks much better. Printing out the components again will show an added panel for the holidays' effect on the prediction, as shown below. In theory you can map every event that is critical to the business and get better predictions.
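
Re-running the components plot from Phase 5 makes the new holidays component visible:

# the components plot now includes a holidays panel
prophet_plot_components(m, forecast)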

Phase 7 — Removing Outliers

When building predictions it is important to remove outliers from the historical data. The model consumes these data points and folds their effect into the predictions, even when they are one-time events or simply false event logs. Unlike other packages that break down when passed an NA value in the historical data, Prophet simply ignores those dates.

In this example we will remove a one-time event: the days around the NBA player's announcement that he was leaving Miami in favor of Cleveland, which drew extra attention to his Wikipedia page.

outliers <- (as.Date(stats$ds) > as.Date('2014-07-09')
             & as.Date(stats$ds) < as.Date('2014-07-12'))
stats$y[outliers] <- NA
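
Setting the values to NA only cleans the history; to benefit from it, the model should be refit on the cleaned data. A short sketch, reusing the holidays object from Phase 6:

# refit so the outlier no longer influences trend and seasonality
m <- prophet(stats, holidays = holidays)
forecast <- predict(m, future)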

Phase 8 — More Functionality

At this point we can see the simplicity and robustness the developers had in mind when creating this package. Some extra functionality I didn't show but that is worth using (a sketch follows the list):

  • Changing the seasonality and holidays effect scales
  • Mapping critical trend change points
  • Editing uncertainty intervals
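
A hedged sketch of what those knobs look like in the R API; the values below are arbitrary, so check ?prophet for the defaults:

m <- prophet(
  stats,
  holidays = holidays,
  holidays.prior.scale = 20, # strength of the holidays effect
  seasonality.prior.scale = 10, # strength of the seasonality effect
  changepoint.prior.scale = 0.1, # trend flexibility at change points
  # changepoints = as.Date(...) would pin specific trend change dates
  interval.width = 0.95 # width of the uncertainty intervals
)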

Using Prophet for Anomaly Detection

The next step I hope the developers will take is to leverage this package for anomaly detection on time series data. Prior packages offer such functionality but depend heavily on the data structure and on strict seasonality.

Let's use the existing model to map anomalies in the data. We will compare the original values (y) with the model's predicted values (yhat) and store the gap in a new column called diff_values.

combined_data <- cbind(head(forecast, nrow(stats)), stats[order(stats$ds), ])
combined_data$diff_values <- (combined_data$y - combined_data$yhat)
summary(combined_data$diff_values)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
# 0.05269 0.07326 0.10780 0.13490 0.16520 0.29480

To better generalize the detection of anomalies, we will also add normalized diff values representing the percent difference from the actual values.

combined_data$diff_values_normalized <-
  (combined_data$y - combined_data$yhat) / combined_data$y

Let's go ahead and visualize the normalized diff values over time.

plot(diff_values_normalized ~ ds, combined_data, type = "l")
Normalized difference between predicted and actual values

Most predictions are quite close to the actual values, as the graph tends to move around zero. We can also ask what percent of the data points are anomalies by filtering on the absolute value of diff_values_normalized. Setting a threshold above which a data point is considered an anomaly is one way to look at it; in this example it's 10%.

nrow(combined_data[abs(combined_data$diff_values_normalized) > 0.1
                   & !is.na(combined_data$y), ]) / nrow(combined_data)

We receive 0.02, which indicates that 2% of the data points are anomalies under the given threshold.
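
To act on these, you would typically pull out the anomalous dates themselves; a short sketch using the same 10% threshold:

# dates where the prediction missed the actual value by more than 10%
anomalies <- combined_data[!is.na(combined_data$y) &
  abs(combined_data$diff_values_normalized) > 0.1, ]
anomalies$ds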

Closing Words

Making predictions is an important skill for data professionals, and thanks to open-source projects like Prophet it does not need to be too difficult. This package balances simplicity, computation speed, and the right amount of customization, so both beginners and advanced users can make use of it.

Thanks for reading,
Harel Rechavia,
Data Analyst at Viber
