Beta Distribution — Intuition, Examples, and Derivation
When to use Beta distribution
The Beta distribution is a probability distribution on probabilities. For example, we can use it to model probabilities such as the click-through rate of your advertisement, the conversion rate of customers actually purchasing on your website, how likely readers are to clap for your blog, how likely it is that Trump will win a second term, the 5-year survival chance for women with breast cancer, and so on.
Because the Beta distribution models a probability, its domain is bounded between 0 and 1.
1. Why does the PDF of Beta distribution look the way it does?

What’s the intuition?
Let’s ignore the coefficient 1/B(α,β) for a moment and only look at the numerator x^(α-1) * (1-x)^(β-1), because 1/B(α,β) is just a normalizing constant to make the function integrate to 1.
Then, the terms in the numerator, x to the power of something multiplied by 1-x to the power of something, look familiar.
Have we seen this before?
👉 Yes. In the binomial distribution.
The intuition for the beta distribution comes into play when we look at it from the lens of the binomial distribution.

The difference between the binomial and the beta is that the former models the number of successes (x), while the latter models the probability (p) of success.
In other words, the probability is a parameter in the binomial; in the Beta, the probability is a random variable.
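You can check this relationship numerically. Below is a minimal sketch (the helper names `binom_pmf` and `beta_pdf` are mine): with α = x+1 and β = n-x+1, the Beta density of p is exactly (n+1) times the binomial PMF of x, because the two share the kernel p^x * (1-p)^(n-x) and differ only in which variable is held fixed.

```python
import math

def binom_pmf(x, n, p):
    """P(X = x) for Binomial(n, p): here p is a fixed parameter and x varies."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)

def beta_pdf(p, a, b):
    """Beta(a, b) density: here the probability p itself is the random variable."""
    coef = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return coef * p ** (a - 1) * (1 - p) ** (b - 1)

# Both share the kernel p^x * (1-p)^(n-x). With a = x+1 and b = n-x+1 they
# agree up to the constant (n+1), which just re-normalizes over p:
n, x, p = 10, 3, 0.37
print(beta_pdf(p, x + 1, n - x + 1))   # density of p, given 3 successes in 10
print((n + 1) * binom_pmf(x, n, p))    # same value
```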
Interpretation of α, β
You can think of α-1 as the number of successes and β-1 as the number of failures, just like the x and n-x terms in the binomial.
You can choose the α and β parameters to reflect what you believe. If you think the probability of success is very high, say 90%, set 90 for α and 10 for β. If you think otherwise, set 90 for β and 10 for α.
As α becomes larger (more successful events), the bulk of the probability distribution will shift towards the right, whereas an increase in β moves the distribution towards the left (more failures).
Also, the distribution will narrow if both α and β increase, for we are more certain.
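Both effects follow from the closed-form mean α/(α+β) and variance αβ/((α+β)²(α+β+1)). A small sketch (function names are mine):

```python
def beta_mean(a, b):
    """Mean of Beta(a, b): a / (a + b)."""
    return a / (a + b)

def beta_var(a, b):
    """Variance of Beta(a, b)."""
    return a * b / ((a + b) ** 2 * (a + b + 1))

# Larger alpha (more successes) pushes the mass to the right:
print(beta_mean(2, 8), beta_mean(8, 2))    # 0.2 vs 0.8
# Scaling both parameters up narrows the distribution (more certainty):
print(beta_var(9, 1), beta_var(90, 10))    # the second is much smaller
```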
2. Example: Probability of Probability
Let’s say the probability that someone agrees to go on a date with you follows a Beta distribution with α = 2 and β = 8. What is the probability that your success rate is greater than 50%?
P(X>0.5) = 1- CDF(0.5) = 0.01953
I’m sorry, it’s very low. 😢
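For integer parameters you can verify this number without a statistics library, using the identity that the Beta CDF I_p(α, β) equals the probability that a Binomial(α+β-1, p) has at least α successes. A sketch (`beta_sf` is my own helper name):

```python
import math

def beta_sf(p, a, b):
    """P(X > p) for Beta(a, b) with integer a, b, via the identity
    I_p(a, b) = P(Binomial(a + b - 1, p) >= a)."""
    n = a + b - 1
    # 1 - CDF = P(Binomial(n, p) <= a - 1)
    return sum(math.comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(a))

print(round(beta_sf(0.5, 2, 8), 5))   # 0.01953
```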

Dr. Bognar at the University of Iowa built a calculator for the Beta distribution, which I found useful and beautiful. You can experiment with different values of α and β and visualize how the shape changes.
3. Why do we use the Beta distribution?
If we just want a probability distribution to model a probability, any arbitrary distribution over (0,1) would work, and creating one is easy. Take any function that stays positive and doesn’t blow up anywhere between 0 and 1, integrate it from 0 to 1, and divide the function by that result. You now have a probability distribution that can model a probability. In that case, why do we insist on using the Beta distribution over an arbitrary probability distribution?
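To see how easy that recipe is, here is a sketch of it (the function name and the example function `g` are mine, chosen arbitrarily):

```python
import math

def make_pdf_on_unit_interval(f, n=100_000):
    """Normalize any positive, bounded f on (0, 1) into a density by
    dividing by its integral (midpoint-rule approximation)."""
    h = 1.0 / n
    area = sum(f((i + 0.5) * h) for i in range(n)) * h
    return lambda t: f(t) / area

def g(t):
    # An arbitrary positive function that is not a Beta kernel at all.
    return 2 + math.sin(6 * t)

pdf = make_pdf_on_unit_interval(g)

# The normalized version integrates to ~1, so it is a legitimate
# probability distribution over a probability:
h = 1.0 / 100_000
print(sum(pdf((i + 0.5) * h) for i in range(100_000)) * h)   # ~1.0
```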
What is so special about the Beta distribution?
The Beta distribution is the conjugate prior for the Bernoulli, binomial, negative binomial and geometric distributions (seems like those are the distributions that involve success & failure) in Bayesian inference.
Computing a posterior using a conjugate prior is very convenient, because you can avoid expensive numerical computation involved in Bayesian Inference.
If you don’t know what a conjugate prior or Bayesian inference is, it’s worth reading up on those first.
As a data/ML scientist, your model is never complete. You have to update your model as more data come in (and that’s why we use Bayesian Inference).
The computation in Bayesian Inference can be very heavy or sometimes even intractable. But if we could use the closed-form formula with the conjugate prior, the computation becomes a piece of cake.
In our date acceptance/rejection example, the beta distribution is a conjugate prior to the binomial likelihood. If we choose to use the beta distribution as a prior, during the modeling phase, we already know the posterior will also be a beta distribution. Therefore, after carrying out more experiments (asking more people to go on a date with you), you can compute the posterior simply by adding the number of acceptances and rejections to the existing parameters α, β respectively, instead of multiplying the likelihood with the prior distribution.
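As a sketch (the function name is mine), the entire “posterior computation” amounts to two additions:

```python
def update_beta(alpha, beta, accepted, rejected):
    """Conjugate update for a binomial likelihood with a Beta prior:
    add successes to alpha and failures to beta."""
    return alpha + accepted, beta + rejected

# Prior from the dating example: Beta(2, 8).
# Suppose you ask 20 more people and 5 say yes:
alpha, beta = update_beta(2, 8, accepted=5, rejected=15)
print(alpha, beta)                 # posterior is Beta(7, 23)
print(alpha / (alpha + beta))      # posterior mean of the success rate
```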
4. The Beta distribution is very flexible.
The PDF of Beta distribution can be U-shaped with asymptotic ends, bell-shaped, strictly increasing/decreasing or even straight lines. As you change α or β, the shape of the distribution changes.
a. Bell-shape

Notice that the graph of the PDF with α = 8 and β = 2 is in blue, not in red. The x-axis is the probability of success.
The PDF of a beta distribution is approximately normal if α +β is large enough and α & β are approximately equal.
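You can check the normal approximation numerically. The sketch below (helper names are mine) compares Beta(50, 50) with a normal density matched to its mean and standard deviation, using `lgamma` so the large Gamma values don’t overflow:

```python
import math

def beta_logpdf(x, a, b):
    """Log of the Beta(a, b) density, via lgamma to avoid overflow."""
    return (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
            + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def normal_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

a = b = 50                                               # large and equal
mu = a / (a + b)                                         # matching mean: 0.5
sigma = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))  # matching std dev
for x in (0.4, 0.5, 0.6):
    # The two densities are nearly identical near the center:
    print(math.exp(beta_logpdf(x, a, b)), normal_pdf(x, mu, sigma))
```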
b. Straight Lines

The beta PDF can be a straight line too!
c. U-shape

When α < 1 and β < 1, the PDF of the Beta is U-shaped.
The Intuition behind the shapes
Why would Beta(2,2) be bell-shaped?
If you think of α-1 as the number of successes and β-1 as the number of failures, Beta(2,2) means you got 1 success and 1 failure. So it makes sense that the density of the success probability is highest at 0.5.
Also, Beta(1,1) would mean zero heads and zero tails. Then your guess about the probability of success should be the same throughout [0,1], and the horizontal straight line confirms it.
What’s the intuition for Beta(0.5, 0.5)?
Why is it U-shaped? What does it mean to have negative (-0.5) heads and tails?
I don’t have an answer for this one yet. I even asked this on StackExchange but haven’t gotten a response yet. If you have a good intuition for the U-shaped Beta, please let me know!
Below is the code to produce the beautiful graphs above.
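(The original code block isn’t reproduced here; the following is a minimal pure-Python sketch along the same lines, with my own `beta_pdf` helper. Swapping in matplotlib, as noted in the comments, draws the actual curves.)

```python
import math

def beta_pdf(x, a, b):
    """Beta(a, b) density, written with Gamma functions."""
    coef = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return coef * x ** (a - 1) * (1 - x) ** (b - 1)

xs = [i / 200 for i in range(1, 200)]   # open grid on (0, 1)
shapes = {                              # the shapes discussed above
    "bell":     (2, 2),
    "straight": (2, 1),                 # pdf = 2x, a straight line
    "U":        (0.5, 0.5),
}
curves = {name: [beta_pdf(x, a, b) for x in xs] for name, (a, b) in shapes.items()}
# To draw them: import matplotlib.pyplot as plt, then for each shape call
# plt.plot(xs, curves[name], label=name), and finish with plt.legend(); plt.show().
```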
5. Classical Derivation: Order Statistic
When I learned about the Beta distribution at school, I derived it from order statistics. The order statistic isn’t the most widely used application of the Beta distribution, but it helped me think about the distribution more deeply and understand it better.
Let X_1, X_2, . . . , X_n be iid random variables with PDF f and CDF F.
Re-arrange them in increasing order, so that X_(k) is the k-th smallest value; X_(k) is called the k-th order statistic.
a. What’s the density of the maximum X?
(Not familiar with the term “Density”? Read “PDF is NOT a probability”)

The maximum is at most x exactly when all n samples are at most x, so its CDF is F(x)^n. Differentiating gives the density:

f_max(x) = n * F(x)^(n-1) * f(x)
b. What’s the density of the k-th order statistic?

For X_(k) to be at x, we need k-1 samples below x, n-k samples above x, and one sample at x. Counting the ways to split n samples into these three groups gives:

f_(k)(x) = n! / ((k-1)! * (n-k)!) * F(x)^(k-1) * (1 - F(x))^(n-k) * f(x)
c. How can we derive the Beta distribution using the k-th order statistic?
What happens if we set X_1, X_2, . . . , X_n as iid Uniform(0,1) random variables?
Why Uniform(0,1)? Because the domain of the Beta is [0,1].

With f(x) = 1 and F(x) = x, the density of the k-th order statistic becomes:

f_(k)(x) = n! / ((k-1)! * (n-k)!) * x^(k-1) * (1-x)^(n-k)

which is exactly the PDF of a Beta distribution with α = k and β = n-k+1.
Here, we have the Beta!
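A quick Monte Carlo sanity check of this result (pure Python, seed fixed for reproducibility): the 2nd smallest of 5 uniforms should follow Beta(2, 4), whose mean is k/(n+1) = 1/3.

```python
import random

random.seed(0)

# The k-th smallest of n iid Uniform(0,1) draws should follow Beta(k, n-k+1).
n, k, trials = 5, 2, 200_000
samples = [sorted(random.random() for _ in range(n))[k - 1] for _ in range(trials)]

sample_mean = sum(samples) / trials
print(sample_mean)   # close to the Beta(2, 4) mean, k/(n+1) = 1/3
```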
6. Beta Function as a normalizing constant
I proposed earlier:
Let’s ignore the coefficient 1/B(α,β) … because 1/B(α,β) is just a normalizing constant to make the function integrate to 1.
To make the PDF of Beta integrate to 1, what should be the value of B(α,β)?

Since the PDF is x^(α-1) * (1-x)^(β-1) / B(α,β), the normalizing constant must equal the integral of the numerator:

B(α,β) = ∫₀¹ x^(α-1) * (1-x)^(β-1) dx
B(α,β) is the area under the graph of x^(α-1) * (1-x)^(β-1) from 0 to 1.
7. Simplify the Beta function with the Gamma Function!
This section is for the proof addict like me.
You might have seen the PDF of Beta written in terms of the Gamma function. The Beta function is the product of the Gamma function of each parameter divided by the Gamma function of the sum of the parameters:

f(x) = x^(α-1) * (1-x)^(β-1) / B(α,β), where B(α,β) = Γ(α) * Γ(β) / Γ(α+β)
How can we prove B(α,β) = Γ(α) * Γ(β) / Γ(α+β) ?
Let’s take the special case where α and β are integers and start with what we’ve derived above:

B(α,β) = ∫₀¹ x^(α-1) * (1-x)^(β-1) dx

Integrating by parts with u = x^(α-1) and dv = (1-x)^(β-1) dx (the boundary term vanishes for α > 1):

B(α,β) = (α-1)/β * ∫₀¹ x^(α-2) * (1-x)^β dx = (α-1)/β * B(α-1, β+1)
We got a recursive relationship B(α,β) = (α-1) * B(α-1,β+1) / β.
How should we exploit this relationship?
We can try to get to the base case B(1, *).

Applying the recursion repeatedly:

B(α,β) = (α-1)/β * (α-2)/(β+1) * … * 1/(β+α-2) * B(1, β+α-1)

and since B(1, c) = ∫₀¹ (1-x)^(c-1) dx = 1/c, we get:

B(α,β) = (α-1)! * (β-1)! / (α+β-1)! = Γ(α) * Γ(β) / Γ(α+β)
Beautifully proved!
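The proof above assumed integer α and β, but the identity holds for all α, β > 0, which a quick numerical check confirms (the helper names are mine):

```python
import math

def beta_fn_numeric(a, b, n=200_000):
    """Midpoint-rule approximation of B(a, b), the integral of
    x^(a-1) * (1-x)^(b-1) over (0, 1)."""
    h = 1.0 / n
    return sum(((i + 0.5) * h) ** (a - 1) * (1 - (i + 0.5) * h) ** (b - 1)
               for i in range(n)) * h

def beta_fn_gamma(a, b):
    """The closed form B(a, b) = Γ(a)Γ(b)/Γ(a+b)."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

# Works for integer and non-integer parameters alike:
for a, b in [(2, 8), (3.5, 1.5), (5, 5)]:
    print(beta_fn_numeric(a, b), beta_fn_gamma(a, b))
```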

