Using Bayesian Statistics to Predict a Cafe’s Popularity with Geodata

Make predictions with pure statistics

Danil Vityazev
Towards Data Science


Photo by Nika Benedictova on Unsplash

This article is, in some sense, a sequel to my previous article on Bayesian statistics. Long story short, the company I work for needed a model that not only predicts a target variable as a number but also estimates its probability density, i.e., how likely the true value is to fall within a given range of the prediction. As a result, a model based on the naive Bayes approach was created that not only suited the initial task but also managed to outperform a random forest model.

Intro

The model I mentioned is a slightly modified naive Bayes classifier. Instead of classifying points into classes, the model returns a posterior distribution, which is then used to calculate the expected value of the target. The model itself is designed to strike a balance between accuracy and interpretability: end users are often suspicious of complicated models like random forests if it’s unclear to them how exactly a decision is made.

But the approach had several disadvantages. First, the final distribution does not necessarily belong to any particular family. This means the user often gets a curve with two or even more local maxima and a mean somewhere between them. Such a curve may indicate that there are two likely outcomes, while the final prediction (the mean) corresponds to an improbable value.

An example of a distribution with its mean located at an unlikely spot. (Image by author)

The second major flaw is a lack of accuracy. There are cases in which the Bayesian model performs better than more traditional solutions, but usually it is not the most accurate. Sure, it isn’t designed to be extremely accurate but rather to solve one specific task; nevertheless, improving the accuracy would still be welcome.

And finally, the model is relatively slow. That is not usually an issue for the end user, but hyperparameter tuning and feature selection can take painfully long.

Bayesian inference and conjugate priors

Instead of figuring out the distribution of the target itself, sometimes it’s worth trying to estimate the distribution’s parameters.

Let ρ(σ) and ρ(μ) be the distributions of the target’s standard deviation and mean. Then, for the final distribution of the target, we get:

Image by author

This formula makes intuitive sense if you keep in mind that, to find the expected value of some function f(x) of a random variable with a given distribution, one needs to integrate the product of that function and the distribution over the whole domain. In this formula, we are essentially finding the expected distribution of the target.
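Spelled out, the formula in the figure is presumably the standard marginalization over the unknown parameters, something along the lines of:

```latex
\rho(x) \;=\; \iint \rho(x \mid \mu, \sigma)\, \rho(\mu)\, \rho(\sigma)\, d\mu\, d\sigma
```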

Now let’s take into account the features we put into our model. Let’s name them f1…fn. In terms of Bayesian inference, these features are observations that change the likelihoods of certain values of σ and μ.

Image by author

Now we need to substitute this into the first equation to get the final distribution given the observations. But which distributions should we use as priors for σ and μ, and which distribution should serve as the likelihood of these variables given a feature?
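Written out (my reconstruction, as the exact notation lives in the figure above), this is just Bayes’ theorem with the naive independence assumption applied to the features:

```latex
\rho(\sigma \mid f_1, \dots, f_n) \;\propto\; \rho(\sigma) \prod_{i=1}^{n} \rho(f_i \mid \sigma),
\qquad
\rho(\mu \mid f_1, \dots, f_n) \;\propto\; \rho(\mu) \prod_{i=1}^{n} \rho(f_i \mid \mu)
```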

For ρ(f|σ) and ρ(f|μ), it’s reasonable to take a distribution from the same family as ρ(x). This follows from the fact that the information carried by a particular feature is equivalent to the information that the sample originated from the distribution of samples sharing the same value of that feature.

Regarding the priors of σ and μ, it’s not obvious what they should be. But it’s reasonable to require that multiplying the likelihood by the prior yields a distribution from the same family. In other words, if your target is normally distributed after taking some features into account, it would be strange if applying one more feature suddenly produced a gamma distribution.

Such “paired” distributions are called conjugate, and a prior chosen so that it does not change the family of the likelihood is called a conjugate prior. Conjugate priors have been worked out for all the most frequently occurring distributions [1]. For example, for the normal distribution, the conjugate prior is the normal-inverse-gamma distribution.

Moreover, if you choose conjugate priors, you don’t need to actually multiply the functions. Since the multiplication doesn’t change the family of the distribution, you can simply calculate the new hyperparameters of the initial distribution using known formulas, which is much quicker.

In most cases, you don’t even need to integrate the resulting expression. Like the conjugate priors, the final results are known for the most frequently occurring distributions and are called posterior predictive distributions. For example, in the case of the normal distribution, the posterior predictive is Student’s t-distribution.
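As a concrete illustration, here is a minimal Python sketch of those known formulas for a normal-gamma prior. The formulas are the standard textbook ones; the hyperparameter names and prior values are illustrative and not necessarily the ones used in the production model.

```python
import numpy as np
from scipy import stats

def update_normal_gamma(mu0, nu0, alpha0, beta0, x):
    """Standard conjugate update of a normal-gamma prior given observations x."""
    x = np.asarray(x, dtype=float)
    n, xbar = len(x), x.mean()
    mu_n = (nu0 * mu0 + n * xbar) / (nu0 + n)
    nu_n = nu0 + n
    alpha_n = alpha0 + n / 2
    beta_n = (beta0
              + 0.5 * np.sum((x - xbar) ** 2)
              + nu0 * n * (xbar - mu0) ** 2 / (2 * (nu0 + n)))
    return mu_n, nu_n, alpha_n, beta_n

def posterior_predictive(mu_n, nu_n, alpha_n, beta_n):
    """Posterior predictive of a new observation: a Student's t-distribution."""
    scale = np.sqrt(beta_n * (nu_n + 1) / (alpha_n * nu_n))
    return stats.t(df=2 * alpha_n, loc=mu_n, scale=scale)
```

Updating the hyperparameters this way replaces numerical integration entirely, which is where the speed-up comes from.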

Predicting a real target

To begin predicting a real target, the first thing we have to do is determine how the data is distributed so we can choose the distribution families. In my specific case, the target was the number of clients visiting cafes. Let’s look at the histogram (the original data is linearly transformed).

Image by author

The data appears to follow a lognormal distribution: the Kolmogorov–Smirnov test gives a p-value of 0.12, which indicates that our hypothesis is not far from the truth. This also allows us to take the logarithm of the target variable and get normally distributed data. In this case, the conjugate prior is the normal-gamma distribution and the posterior predictive is Student’s t-distribution.
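A quick way to run such a check (a sketch with synthetic data standing in for the real visitor counts, which I can’t share) might look like this:

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the real target: number of clients per cafe.
rng = np.random.default_rng(0)
visitors = rng.lognormal(mean=3.0, sigma=0.5, size=400)

# If the target is lognormal, its logarithm should be normal.
log_visitors = np.log(visitors)
mu, sigma = log_visitors.mean(), log_visitors.std(ddof=1)

# Kolmogorov-Smirnov test of the log-target against the fitted normal;
# a large p-value means we cannot reject the (log)normality hypothesis.
statistic, p_value = stats.kstest(log_visitors, "norm", args=(mu, sigma))
print(f"KS statistic = {statistic:.3f}, p-value = {p_value:.3f}")
```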

For each feature of the point to be predicted, we take a subset of the training set containing the points with the same value of that feature. The resulting subset is then used to update the hyperparameters of the posterior distribution: all we need to do is substitute our observations into the formulas given in [1].
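Putting it together, the per-feature update might look roughly like this, building on the update_normal_gamma and posterior_predictive helpers sketched above (the toy data, feature names, and prior values are purely illustrative):

```python
import numpy as np
import pandas as pd

# Toy training set: categorical geo-features plus the log of the visitor count.
train = pd.DataFrame({
    "district":   ["center", "center", "suburb", "suburb", "center"],
    "near_metro": [True, True, False, False, True],
    "log_target": np.log([120, 150, 40, 55, 135]),
})
point = {"district": "center", "near_metro": True}  # the cafe we want to predict

mu, nu, alpha, beta = 0.0, 1.0, 1.0, 1.0  # weakly informative prior (illustrative values)

for feature, value in point.items():
    # Training cafes sharing this feature value act as the "observations" for the update.
    subset = train.loc[train[feature] == value, "log_target"].to_numpy()
    if len(subset) == 0:
        continue  # nothing to learn from this feature
    mu, nu, alpha, beta = update_normal_gamma(mu, nu, alpha, beta, subset)

predictive = posterior_predictive(mu, nu, alpha, beta)  # Student's t over the log-target
prediction = np.exp(predictive.mean())                  # point prediction on the original scale
```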

As a result, we get the model’s estimate of the logarithm of the target.

Image by author

Where 2α is the total number of observations, and t_α is a Student’s t-distribution with α degrees of freedom.
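For completeness, the textbook posterior predictive for the normal-gamma model, which I believe is what the formula above expresses up to notation, is

```latex
\log x \mid f_1, \dots, f_n \;\sim\; \mu_n + \sqrt{\frac{\beta_n (\nu_n + 1)}{\alpha_n \nu_n}}\; t_{2\alpha_n}
```

where μ_n, ν_n, α_n, and β_n are the hyperparameters after all the updates; with a weak prior, 2α_n is roughly the total number of observations.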

Image by author

As you can see, the estimated distribution is now always a log-t distribution, so the situation described at the beginning of the article, where we got multiple humps, can no longer occur.

Accuracy assessment

To assess the model’s accuracy, I predicted 400 points from the training set using the leave-one-out technique. For starters, let’s check the old model to set a baseline to compare the new model against.
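For reference, the leave-one-out loop can be as simple as the following sketch, where predict_one is a hypothetical wrapper around the model described above and data is a DataFrame with a "target" column:

```python
from sklearn.metrics import r2_score

def leave_one_out_r2(data, predict_one):
    """Predict each point with a model fitted on all the other points and report R^2."""
    actuals, predictions = [], []
    for i in data.index:
        train = data.drop(index=i)   # everything except the point being predicted
        point = data.loc[i]
        predictions.append(predict_one(train, point))
        actuals.append(point["target"])
    return r2_score(actuals, predictions)
```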

Image by author

The best R² achieved is 0.34. Our team also tried various random forest models but struggled to achieve anything better than 0.4.

Now let’s see what the new model has to offer.

Image by author

Now we’re talking! All the metrics are significantly better. The prediction now looks discrete; that is due to some computational optimizations we performed to make the model faster.

Let’s look at the histogram of the errors.

In this image, blue bars are the old errors, and yellow bars are the new errors [Image by author]

It’s clear that the new model has significantly fewer points with errors greater than 1.

In conclusion, it’s safe to say that even though more advanced math was used, it pays off in greater accuracy, and the model is still interpretable, which is the main advantage of such statistical methods.

References

[1] Fink, Daniel (1997). “A Compendium of Conjugate Priors” (PDF). CiteSeerX 10.1.1.157.5540
