
Pen & Paper Bayes

How to enjoy simpler calculations thanks to elegant results

This story assumes you have a basic knowledge of the Bayesian framework – as opposed to the classic frequentist one. But that is pretty much it 🙂 We will see how the learning steps of a model need not be numerically heavy, as some fortunate results can speed things up greatly. Such shortcuts make this kind of approach viable whenever resources are tight, for example on reduced hardware or in an online learning setting.

"Before, when I looked at a piece of blank paper my mind was filled with ideas. Now all I see is a blank piece of paper" Paul Erdős | Photo by Aaron Burden on Unsplash

Off the beaten Bayesian path

The main idea of Bayesian learning is to iteratively apply Bayes theorem to find the distributions describing the behavior of some data as well as possible. The description takes the form of a machinery to generate new data points and predict which kind of data points will come next, and it rests on two ingredients. The first is a more or less informed guesstimate on what your data looks like. The second is the data we keep observing, which refines our initial hypothesis as time goes by.

To begin with, a well-deserved mention to Bayes theorem itself:

Neon Bayes! | Photo by mattbuck on Wikimedia Commons
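In symbols, with the notation introduced just below, the theorem reads:

```latex
p(\bar{\theta} \mid x) = \frac{p(x \mid \bar{\theta}) \, p(\bar{\theta})}{p(x)}
```

the prior times the likelihood, normalised by the evidence, gives the posterior.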

Some very basic notation (and apologies for the formatting, see closing rant):

  • _x_ᵢ is the i-th data point we see, coming from a random variable we are trying to learn about
  • θ̅ is the vector of the model parameters, i.e. the parameters of the data distribution D in X̅ ~ D(θ̅) – our goal is to discover reasonable values for it; if X̅ ~ N(μ, σ²) then θ̅ = (μ, σ²)
  • α̅ is the vector of parameters for the model H ruling θ̅: since we see θ̅ as a random vector, it follows a distribution whose parameters are collected in α̅, so θ̅ ~ H(α̅); these are the hyperparameters

Note: one of the hyperparameters in our example below is also called α, but this is just a coincidence of naming.

Learn something – refreshing what you have learned so far every time you see something new. | Photo by Kelly Sikkema on Unsplash

The machinery relies on the definition of a few items:

  1. prior distribution: the initial hypothesis H, defined as a distribution on the parameters of the data distribution, so p(θ̅|α̅); its parameters are the hyperparameters of the model
  2. posterior distribution: the update of the initial hypothesis in the light of new data points coming in, so _p(θ̅|x₁,x₂,…,xₙ,α̅)_; its hyperparameters are a function of the original hyperparameters we initialised the prior distribution with, and of the data points of course
  3. sampling distribution: what the actual data distribution D looks like, known as the likelihood; it is _p(x|θ̅)_
  4. marginal likelihood: the sampling distribution marginalised over the model parameters, known as the evidence, _p(x|α̅)_; we are asking ourselves how compatible the data is with the prior hypothesis as a whole
  5. posterior predictive distribution: the learning engine, as it defines the probability of an event under any likelihood assumption – not just the current one, i.e. the latest set of posterior hyperparameters – it is _p(xᵢ|x₁,x₂,…,xᵢ₋₁,α̅)_; this is the main conceptual difference from a frequentist approach based on a simple maximum likelihood.
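The marginalisation in point 5. can be written explicitly: the posterior predictive averages the likelihood over the current posterior,

```latex
p(x_i \mid x_1, \dots, x_{i-1}, \bar{\alpha}) = \int p(x_i \mid \bar{\theta}) \, p(\bar{\theta} \mid x_1, \dots, x_{i-1}, \bar{\alpha}) \, d\bar{\theta}
```

so every possible parameter value gets a say, weighted by how plausible it currently is.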

You can sample data points from either the current likelihood or the posterior predictive, but there is a fundamental difference: the former accounts only for the specific assumption of the parameters as updated so far, whereas the latter accounts for the fact that a point could still come from any other likelihood – we do not know the truth, even if one option emerges as more data points get seen.

Pay attention to the difference between the posterior distribution and the posterior predictive distribution: as you see above, they are not synonyms. In particular:

  • the posterior is the repeated update of the prior as new data points come in; it is of the form _p(θ̅|xᵢ)_, saying something about the possible model parameters θ̅ conditioned on seeing a point _x_ᵢ
  • the posterior predictive says something about the actual point _x_ᵢ, considering all of the possible model hypotheses θ̅ together (marginalisation)

The posterior distribution is one on the parameters of the data distribution: this is why we say that in a Bayesian framework we do not try to predict exactly what the next point will be – coming up with a single best candidate is the frequentist, maximum-likelihood-based approach – but rather to describe what it should look like, i.e. its distribution.

On our definitions 4. and 5. above: you may see these as the main tools at our disposal, sharpened as we see more and more points. It is not important to know the definitions by heart, but do make sure you understand what they mean. We will see the posterior predictive in action later!

In general, updating the posterior is computationally challenging, as you basically have to solve integrals. For arbitrary choices of priors and likelihoods you may have to get some help from libraries like PyMC3. But… for a number of known combinations, closed forms for the posterior do exist: this is a big advantage, because it makes online learning viable – when it is just sums and products you can hope to keep up with the data stream, instead of getting stranded by expensive numerical routines.

A clarifying example

Such magic pairs are called conjugate priors, and understanding a few examples on either table is an excellent way to make sure you understand what’s going on. What is a conjugate prior exactly?

The likelihood function (see above) is our hypothesis on the process generating the data: if we are studying a coin flip, the process will be a Bernoulli one – each variable representing a specific coin flip is an indicator, 1 for heads and 0 for tails. The parameter of a Bernoulli distribution is the probability p of heads. The prior distribution sets an initial, hopefully informed, hypothesis on how this p should look, in the form of a probability distribution. Every time a coin is flipped, so as new data observations come in, this prior is updated, the resulting distribution being called the posterior. If the newly updated posterior is from the same family as the prior, then we say that the prior distribution is a conjugate prior for the chosen likelihood distribution.

What a family of distributions actually is might sound like a purely formal and rather elusive concept. It is a set of distributions that can be described all at once with a finite number of parameters. For example, the normal distributions may be seen as a family – specifically, an exponential one: if you can obtain two distributions from the same analytical form by setting their parameters properly, then they are in the same family.
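The coin-flip story can be written down in a handful of lines – this is exactly the sums-and-products speedup mentioned above. A minimal sketch using the textbook Beta–Bernoulli conjugate pair (a Beta(α, β) prior on p, updated by counting; the true flip probability 0.7 is illustrative):

```python
import random

# Beta(alpha, beta) prior on the probability p of heads;
# alpha counts "pseudo-heads", beta counts "pseudo-tails".
alpha, beta = 1.0, 1.0  # flat prior: any p in [0, 1] equally likely

random.seed(42)
flips = [1 if random.random() < 0.7 else 0 for _ in range(1000)]  # a biased coin

# Conjugate update: each observation is just a counter increment.
for x in flips:
    alpha += x       # heads
    beta += 1 - x    # tails

posterior_mean = alpha / (alpha + beta)  # should approach the true p
print(round(posterior_mean, 2))
```

No integrals anywhere: the posterior after each flip is again a Beta, so the whole learning loop is two additions per data point.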

For example, let us say that we think our data points come from a Pareto distribution, so this will be our likelihood. This fat-tailed distribution is defined by a scale xm – telling us what the minimum is – and a shape k – the higher it is, the closer the density hugs the axes (note: Wikipedia uses xm – which does not render in Unicode unfortunately – for the scale but α for the shape, for which we use k instead). It is useful to model any event whose samples cannot be negative and for which we expect that large values are not that unlikely (as in: more likely than what we could model with a half-normal, informally).

Pareto coffee: 80% water 20% caffeine. | Photo by Austin Distel on Unsplash

Let us fix the scale; in most applications this is a reasonable thing to do. Say we are interested in modeling the expected cost of a claim for an insurer: the minimum can be taken as the processing and office costs of a rejected claim (the insurance does not pay the client, but even saying no has some cost). Usually we study a conjugate only after fixing a subset of the parameters; in general it is possible to work out a form to learn all of them, and we will briefly touch upon this. Say we fix the scale to some value xm=xm*. Please note that all calculations in our examples are rounded to two decimals (which introduces some little rounding errors here and there).

So Pareto is our likelihood and the scale is fixed; what is our initial guess for the shape? How should the distribution for the shape value – our prior – look? From the table we see that the Gamma is the one; this distribution has two parameters, α and β, our hyperparameters for the learning model. A good choice for our Pareto is k=2.1, with 2 being the pathological value where the variance is undefined, so 2+0.1 to stay sufficiently close. To center the Gamma on 2.1: we know the mean of a Gamma is μ=α/β, so we set α/β=2.1 and fix α=2 (we have no good guess here, so we pick a nice α), which gives β = 2/2.1 ≈ 0.95. So the result is Gamma(2, 0.95): good, we have decided what our prior looks like!

In the table you also see how we can update these two hyperparameters as we read in more xᵢ data points, where n is the number of observations so far. How does our posterior, on top of the updated α and β, look? By the beauty of conjugation, updating a Gamma after seeing data from the Pareto it inspires still yields a Gamma, with the updated parameters.

Notice how the posterior predictive is missing from the table. That means that we can keep generating data from the likelihood process, but we do not really have a way (yet!) to say something about the probability of a specific data point coming up next, without restricting ourselves to a specific hypothesis on the likelihood.
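In code, the table’s update rule boils down to a counter and a running sum of logs. A minimal sketch, assuming the standard update for a Gamma(α, β) prior (rate parameterisation) on the Pareto shape with known scale – α ← α + n and β ← β + Σᵢ ln(xᵢ/xm); the prior values and data points below are illustrative:

```python
import math

def update_gamma_for_pareto(alpha, beta, data, xm):
    """Conjugate update of a Gamma(alpha, beta) prior on the Pareto
    shape k, with the scale fixed at xm and beta a rate parameter."""
    n = len(data)
    alpha_post = alpha + n
    beta_post = beta + sum(math.log(x / xm) for x in data)
    return alpha_post, beta_post

# Illustrative prior and a few observed points (scale fixed at xm = 1.0 here)
alpha, beta = 2.0, 0.95
data = [1.5, 3.0, 1.1]
alpha, beta = update_gamma_for_pareto(alpha, beta, data, xm=1.0)
print(alpha, round(beta, 2))  # the posterior is again a Gamma, with mean alpha/beta
```

Two additions per batch and a logarithm per point: cheap enough to run inside a data stream.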

Chaining parameters

For likelihoods with multiple parameters, as in the example above, the default setting is to fix all of them except the one to learn, which becomes the model parameter: these are the one-parameter conjugates. But nothing prevents us from trying to get a sense of all the parameters at once. Reasoning about one single parameter covers most examples and applications, but it is not a requirement or anything. You can think of it as hypotheses on a hypothesis, so to say.

A quite detailed discussion can be found [here](https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/lectures/lecture5.pdf). Take for example a Gaussian conjugate, where both the prior and the likelihood are normal. We can have a first guess for the mean, θ ~ N(μ₀, τ₀), and plug this into the actual likelihood X̅ ~ N(θ, σ²₀), thus starting with three hyperparameters: two to get the mean (a mean and a variance, as we have a normal prior) and one to get the point after plugging in this mean (a variance, under the normal likelihood). See here for the worked-out example (par. 3). Pay attention: the precision is the inverse of the variance, τ=1/σ².
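Written out, after n points the known-variance Gaussian update is (with τ = 1/σ²₀ the likelihood precision and τ₀ the prior precision):

```latex
\tau_n = \tau_0 + n\tau, \qquad \mu_n = \frac{\tau_0 \mu_0 + \tau \sum_{i=1}^{n} x_i}{\tau_0 + n\tau}
```

so precisions add up, and the posterior mean is a precision-weighted average of the prior mean and the data.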

Nothing stops you from making an hypothesis on your hypothesis – logic unchained! | Photo by Karine Avetisyan on Unsplash

Show me the data

We will now bring the Pareto example to life, showing both the analytical likelihood updates and the numerical posterior predictive, in the absence of a safe closed form for the latter.

And that will make your idea a great idea. | Photo by Franki Chamaki on Unsplash

Conjugate case

You may want to keep the example notebook open.

For the Pareto example, on the Wikipedia page we find a very interesting (and unclaimed, at least at the moment of writing) reference: the amount of time a user on Steam will spend playing different games is Pareto distributed. According to the famous principle, this basically means that the average user will spend 80% of the time playing on 20% of the games he or she has – some games get played a lot but most not really. The basically here really has a meaning, since only for a very specific choice of scale and shape we get the classic 80–20 behavior for a Pareto distribution – and we do not know if this is going to be the case for our dataset. We want to study the probability that a game gets played x hours.

It turns out (it is really a coincidence) that on Kaggle we can find the perfect dataset (which has been uploaded, together with its original license, on the repo): the user behavior of hundreds of Steam users.

Do they have Pacman on Steam? If yes, I am pretty sure its game time is the outlier in the set. | Photo by Kirill Sharkovski on Unsplash

Peeping at the data, we see that the minimum playtime found is 0.1 hours (if you are fed up after five minutes already, that was a very bad game indeed); this will become our fixed scale, xm=0.1. In real life you must just guess, as you do not have all of the data beforehand, so you cannot take the minimum of course! Or you have to rely on the laws or hard limits of whatever you are modeling. Chasing the shape parameter k will craft our Pareto.

What is our initial guess for the Gamma shaping our k? We can start with an educated guess based on the average game time we expect in general; from the metadata of this dataset on the website it is unclear which period it refers to, or whether it was ever updated, so we can just choose in absolute terms, from 2003 (Steam launch) to 2017 (dataset creation). Shall we say 7 hours?

As an aside: is it really a problem if our guess turns out to be inaccurate? If we have enough data and a number of boring conditions are checked, it is not: any bias coming from an initially bad choice of prior will be washed away as we read more data and adjust the posterior; this is thanks to this theorem, which tells us that the prior becomes irrelevant, asymptotically. One could even talk about the convergence rate of a Bayesian model, but that is another story. We may not have enough data, but the regularity checks are there.

So how do we get the desired shape? First we must ask that k > 1, otherwise μ=∞, which is not true in our case: the average game tends to be played for a finite number of hours, and a small one too, which we chose to be 7. Plugging xm=0.1 into the mean for this case, we solve for k in the mean of a Pareto, which is:

Mean of a Pareto distribution, with our k and xm as above. Can you spot the typo?
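Written out (valid for k > 1), the mean is:

```latex
\mathbb{E}[X] = \frac{k \, x_m}{k - 1}
```

setting it equal to 7 hours with xm = 0.1 yields k = 7/6.9,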

and this gives us k=1.01. To get started we expect our data process to be a Pareto(0.1, 1.01).

Created with essycode; no rescaling.

Each data point we take from the dataset is the registered playtime, for some user and some game; as we are interested in the global behavior we discard the user and game information, keeping the hours record only.

Now the question: what are good values for our prior? Our Pareto parameters are xm=0.1 and k=1.01. For the Gamma we set α=1, as it is the smallest nice shape (for α ≥ 1 the distribution is well defined and has a mode too); observing that our target Pareto shape is k=1.01, we center the Gamma around k by choosing β=0.99, since μ=α/β. So again: our Pareto scale xm is fixed, and the shape, which is the unknown model parameter, is assumed to behave as a Gamma(1, 0.99) (careful: many libraries parameterise the Gamma by the scale θ, which is the reciprocal of our rate β). Sampling from this distribution gives a single instance of our likelihood, in the form of a Pareto distribution.
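Sampling such an instance is a two-step affair, and it fits in a few stdlib lines. A minimal sketch (note that `random.gammavariate` expects the scale 1/β rather than the rate, and `random.paretovariate` samples a Pareto with xm = 1, so we rescale):

```python
import random

random.seed(0)
alpha, beta = 1.0, 0.99   # Gamma hyperparameters (rate parameterisation)
xm = 0.1                  # fixed Pareto scale

# Step 1: draw a shape k from the Gamma prior (gammavariate wants scale = 1/rate)
k = random.gammavariate(alpha, 1 / beta)

# Step 2: draw playtimes from the Pareto(xm, k) likelihood this k defines
playtimes = [xm * random.paretovariate(k) for _ in range(5)]
print(k, playtimes)  # every sample is at least xm
```

Each run of step 1 picks a different candidate likelihood; step 2 then generates data under that single hypothesis.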

And now, time to get to the data. The set is sorted by user id, and it is just a small sample of the users. We are interested in the play events only. As we want the feeling of random sampling from the unknown, Pareto-assumed distribution, we will shuffle the rows a little.

Here is the evolution of the Gamma posterior mean for the first few iterations, so the plot of μ=α/β as the hyperparameters get updated:

Gamma posterior mean plot; you can see it stabilizes after about n=50 iterations.

You can see it stabilizes pretty fast around 0.25, the expected value for the shape parameter k of our Pareto likelihood. In general, this can be seen as a good sign. For further confidence, one could play around a little with different Gamma prior choices and see that the convergence rate stays roughly constant – or at least does not blow up. On convergence, some results do hold, as we have mentioned above, but that discussion gets too advanced for now and becomes good material for some other blog experiment 🙂

Here are the final results for our posterior and likelihood:

So what we got eventually is a Pareto distribution with scale xm=0.1 and shape k=0.25. This is not great after all: if the shape is less than 1, both the mean and the variance are infinite, and these are two important features to get a feeling for any distribution. If this were an insurance risk model we should really back off, as it would mean: the expected claim cost is gigantic!

In our example we do see that a legit Pareto emerges, albeit one which is not very informative nor nice to look at. It could be that with more data the result shifts to a nicer model, namely one where the shape is greater than 1, so that the main moments are finite. But as we converged pretty quickly on this dataset, we would really like to see more data from the middle of the distribution, and not from the fat tails – if things really stay this spread out we cannot say anything interesting; an obese tail is a flat line, and we could land anywhere.

What shall we conclude from this? The question coming to mind is of course: is our hypothesis a legit one? In general, whenever the resulting distributions do not look nice we should be cautious – theory aside, a distribution which does not look nice, one missing a few moments for example, is not nice to work with and is of little informative value too.

If we do think a Pareto is what our process looks like, could it be a data thing then? For example:

  • not enough data from the whole Steam population: this hypothesis may not hold with so few samples – we might need more data
  • badly sampled data: if this data comes from some sampling carrying inherent bias, or in general represents distinct sub-populations (e.g. top players, a single country etc.), we might have to make sure that this is not the case, or seek a better-fitting hypothesis for the data at hand

To confirm whether our choice is a good one or a not-so-great one, we could also look outside the Bayesian scope and consider more classic goodness-of-fit tests.
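One classic choice is a Kolmogorov–Smirnov-style comparison of the empirical CDF against the fitted Pareto CDF, F(x) = 1 − (xm/x)^k. A self-contained sketch (here we test synthetic data against the fitted values, since the real dataset lives in the notebook):

```python
import random

def pareto_cdf(x, xm, k):
    """CDF of a Pareto(xm, k): F(x) = 1 - (xm / x)^k for x >= xm."""
    return 1 - (xm / x) ** k

def ks_statistic(sample, xm, k):
    """Largest gap between the empirical CDF and the Pareto(xm, k) CDF."""
    xs = sorted(sample)
    n = len(xs)
    # compare the theoretical CDF with the ECDF just before and after each jump
    return max(
        max(abs(pareto_cdf(x, xm, k) - i / n),
            abs(pareto_cdf(x, xm, k) - (i + 1) / n))
        for i, x in enumerate(xs)
    )

random.seed(1)
xm, k = 0.1, 0.25  # the fitted values from our run
sample = [xm * random.paretovariate(k) for _ in range(2000)]
stat = ks_statistic(sample, xm, k)
print(stat)  # small when the data really is Pareto(xm, k)
```

A large statistic on the real playtimes would be a red flag for the Pareto hypothesis itself, independently of any prior choice.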

For the purpose of this blog, we keep the faith that a Pareto is still a good description. In general, be wary of carrying on with a hypothesis you judge too odd according to numbers and experience 😉

Let’s see how keeping the Pareto hypothesis and trying it out numerically leads us to the very same numbers.

Numerical case

See the example notebook here.

Another close-up of a chain. This time it is a Markov one. | Photo by JJ Ying on Unsplash

With the conjugate approach we can discover what the likelihood should look like, and we can take samples from specific realizations of such best likelihoods too. But as we have no analytical form for the posterior predictive, we cannot take a glance at how some anonymous data point would look – anonymous in the sense of originating from our newly found specific likelihood or any other one like it, i.e. with the shape still drawn from the updated Gamma.

To get some help here we will turn to numerical sampling. Detailing how this kind of approach works falls out of scope for our chat. We will use the famous PyMC3 library.
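The idea behind such sampling, for our simple model, can already be sketched by brute force: draw a shape from the updated Gamma posterior, then a point from the Pareto that shape defines, and repeat. A stdlib sketch with illustrative posterior hyperparameters (the real ones come out of the update loop):

```python
import random

random.seed(7)
xm = 0.1
alpha_post, beta_post = 50.0, 200.0  # illustrative updated hyperparameters, mean 0.25

def posterior_predictive_sample():
    # marginalise over the shape: each draw may come from a different likelihood
    k = random.gammavariate(alpha_post, 1 / beta_post)  # gammavariate takes the scale
    return xm * random.paretovariate(k)                 # rescale from xm = 1

draws = [posterior_predictive_sample() for _ in range(10_000)]
print(min(draws))  # never below the fixed scale xm
```

A library like PyMC3 does the same marginalisation properly, and for models where no such direct sampling recipe exists.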

The definition of our model is pretty simple: https://gist.github.com/ad1a76de0cd45db8a1e580cb6033b1af

We extract different kinds of samples from our Bayesian building blocks, the last is of course what we were missing: https://gist.github.com/8f48b5e26ed1bb6bd1d83e643e0424e1

Plotting the posterior we see something we knew already: the shape of the Pareto comes to sit around k=0.25:

Nice to see that we get to the same results!

A number of samples are offered for the posterior predictive: https://gist.github.com/7159b872941b272307ea4f272e59bebb

You may have noticed how the numerical sampling takes much more time than the pen & paper recurrence, of course.

Recap

We have had a look at an example of Bayesian learning on real data, taking a chance to see an analytical approach next to the numerical one. Are you happy with the model we have come to? If not, why? 🙂

And finally, about the pen & paper thing: carrying out just sums and products to get a result, in this world of supa-dupa Machine Learning, really feels like that – but having the computer do the math is definitely a better idea.

See you the t+1 time!

Arrivederci. | Photo by Jonathan Kemper on Unsplash

Find the Markdown version [here](https://github.com/rvvincelli/bayes-penandpaper/blob/main/blogpost.html) and its HTML rendering here. As a closing rant: it is really a pity that Medium does not support LaTeX formulas, or allow custom fonts like KaTeXMath. The .md renders properly with an extension like this one.

Let me get down to this LaTeX formula… | Photo by Clay Banks on Unsplash

Author: Riccardo Vincelli

The author wants to thank his employer KPMG in the Netherlands for supporting the development of this blog story, as well as colleagues Ewine Smits, Redmer Bertens and Amber Plaatsman for reviewing its content.

