Towards rigorous probability
There is a very interesting thing about probability: everything seems obvious, but when we investigate it in more depth, it suddenly turns out that we don’t actually understand it.

This post is a continuation of the previous post on measure theory, but I think there’s no problem reading this post on its own, apart from occasionally having to refer back to the other post.
The probability distribution usually encountered at an early stage of learning probability is the uniform distribution. Uniform means every outcome has the same probability of happening. When we throw a six-sided die, the probability of each number showing up is 1/6, and the probabilities sum to one, as expected.
However, there are several counter-intuitive situations with the uniform distribution. The first weird thing: when we throw a tiny ball onto an interval of the real line with length 6, what is the probability of the ball hitting any particular point – is it 1/6? The answer is no; in fact, it is zero. Note that this does not mean it is impossible for the ball to hit any particular point: the ball always lands somewhere, yet each individual point has probability zero. The second weird thing: if the probability of the ball hitting any particular point is zero, all those probabilities do not sum to one. The third weird thing: consider an interval with length 1/6; the probability density at each point is 1 / (1/6) = 6 (by the definition of the probability density function of the uniform distribution), which is more than one –
probability density is not a probability.
This shows the need for introducing some mathematical machinery, since counter-intuitive phenomena like the ones discussed above can’t be studied merely intuitively. In this article we will be looking at that machinery – measure theory.
Intuition of measure
A measure is an extension of concepts that have been familiar to us since primary school: length, area, and volume in Euclidean space. These ideas are simple and intuitive; they quantify the magnitude of a shape.
They also demonstrate a fundamental property of measure theory: additivity. This refers to the fact that, e.g. in Figure 1.1, the length of the whole segment equals the sum of the two subsegments (a + b), and the area of the whole rectangle equals the sum of the two subrectangles (S₁ + S₂).
[Figure 1.1: a segment of length a + b split into two subsegments, and a rectangle of area S₁ + S₂ split into two subrectangles]
Why is it so? Because the subsegments or subareas are disjoint; if they overlapped, the conclusion above would clearly no longer hold. When this additivity is extended to countably infinitely many such geometrical objects, we call it sigma-additivity. Before defining measure formally, there is one important thing to notice:
A measure is a function, but we sometimes also use the word to refer to the output of this function.
When the input is a segment, it outputs a length; when the input is a shape, it outputs an area; etc.
Axioms of probability
Measure theory extends and formalizes our intuitive knowledge of the area of a region. Integrating measure theory into probability theory axiomatizes the intuitive idea of a degree of uncertainty – it uses the power of measure theory to measure uncertainty.
Before introducing the Kolmogorov probability axioms, some concepts need to be clarified. The first and most fundamental is the σ-field (or σ-algebra). It is an algebraic structure – roughly speaking, a non-empty set with operations and axioms defined on it. Let 𝔉 be a set of subsets of another set Ω, i.e. 𝔉 is a subset of 𝓟(Ω) (the power set of Ω). We call 𝔉 an algebra if it is closed under finite union and complementation (by De Morgan’s law, 𝔉 is then also closed under finite intersection). If we replace finite union with countable union (a stricter requirement, so every σ-algebra is an algebra, but not vice versa), 𝔉 is a σ-algebra.
The smallest σ-algebra that contains 𝓐 is denoted by σ(𝓐) (the σ-algebra generated by 𝓐), where 𝓐 is a collection of subsets of Ω. We extend 𝓐 to σ(𝓐) by repeatedly applying countable union and complementation: complements of unions, unions of complements, and so on.
A trivial σ-algebra is the set that contains only the empty set and the whole sample space: {∅, Ω}. Another example of a σ-algebra (on Ω = {a, b, c, d}) is {∅, {a, b}, {c, d}, {a, b, c, d}}. 𝓟(Ω) is also a σ-algebra. As a real-world example, we put the σ-algebra in the context of coin tossing: the possible outcomes are head (H) and tail (T), thus the sample space is Ω = {H, T}. Let 𝓐 = {{T}}. Then σ(𝓐) = {∅, {T}ᶜ, {T}, Ω} = {∅, {H}, {T}, {H, T}}, where every element of this set is an event. If we have two coins, Ω consists of all the possible outcomes, which equals {HH, HT, TH, TT}.
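To make "the σ-algebra generated by 𝓐" concrete, here is a minimal Python sketch (my own illustration, not taken from any reference) that closes a collection of subsets of a finite Ω under complementation and pairwise union; for a finite Ω, finite unions are all we need, so the closure terminates.

```python
from itertools import combinations

def generate_sigma_algebra(omega, collection):
    """Close `collection` (subsets of the finite set `omega`) under
    complementation and union, yielding the generated sigma-algebra."""
    omega = frozenset(omega)
    sigma = {frozenset(), omega} | {frozenset(s) for s in collection}
    changed = True
    while changed:
        changed = False
        # Take a snapshot so we can mutate `sigma` while iterating.
        for a, b in combinations(list(sigma), 2):
            for new in (omega - a, omega - b, a | b):
                if new not in sigma:
                    sigma.add(new)
                    changed = True
    return sigma

# The coin-tossing example: A = {{T}} generates {∅, {H}, {T}, {H, T}}.
for event in sorted(generate_sigma_algebra({"H", "T"}, [{"T"}]),
                    key=lambda s: (len(s), sorted(s))):
    print(set(event) or "∅")
```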
Now we have the sample space Ω and a σ-algebra 𝔉. The only building block still lacking for the definition of a measure space is the measure itself. As mentioned before, a measure μ is a function. It is a non-negative, extended real-valued function defined on 𝔉 such that
$$\mu\left(\bigcup_{n=1}^{\infty} A_n\right) = \sum_{n=1}^{\infty} \mu(A_n) \quad \text{for pairwise disjoint } A_1, A_2, \ldots \in \mathfrak{F} \tag{2.1}$$
The triple (Ω, 𝔉, μ) is called a measure space (note that "metric space" is a completely different concept, though the names look alike: a metric space is a set with a metric on it). μ is a measure on 𝔉 if μ is countably additive and non-negative for all the elements of 𝔉. If μ is a probability measure, which means μ(Ω) = 1, then (Ω, 𝔉, μ) is called a probability space. A canonical example of a probability space is the Lebesgue measure space on [0, 1] (refer to the Lebesgue measure). A quick check of knowledge: is the Lebesgue measure on ℝ a probability measure? It is not, because the measure of the whole space ℝ is ∞ instead of one.
Equipped with this knowledge, the Kolmogorov probability axioms become very straightforward. The axioms are defined over a probability space, with the probability measure being the probability function. We slightly modify the notation of the probability space and it becomes (Ω, 𝔉, P):
- The probability of an event E, P(E) is never negative. This corresponds to the non-negativity of the measure.
- The probability of the entire probability space P(Ω) = 1. This is specifically defined for the probability measure.
- The countable additivity of disjoint events. This is described in Equation 2.1.
Let’s go back to the example of coin tossing. What are the events, and what are their probabilities? In our example, {H} and {T} stand for the events of seeing a head or a tail respectively. We assign the probabilities P({H}) = P({T}) = 1/2 to them, in the case of an unbiased coin. ∅ represents the event that "nothing happens". According to the axioms of probability (the properties of a measure), P({H, T}) = P({H}) + P({T}) = 1. This is an intuitive result – {H, T} stands for the event of seeing either a head or a tail, and of course one of these will happen.
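As a sanity check, the three axioms can be verified mechanically on this tiny probability space; a sketch, assuming the fair-coin probabilities above:

```python
# The probability space (Ω, 𝔉, P) of a single fair coin toss.
omega = frozenset({"H", "T"})
P = {
    frozenset(): 0.0,        # ∅: "nothing happens"
    frozenset({"H"}): 0.5,   # the event of seeing a head
    frozenset({"T"}): 0.5,   # the event of seeing a tail
    omega: 1.0,              # seeing either a head or a tail
}

assert all(p >= 0 for p in P.values())                        # axiom 1
assert P[omega] == 1.0                                        # axiom 2
assert P[frozenset({"H"})] + P[frozenset({"T"})] == P[omega]  # axiom 3
```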
Probability distribution
The concept of a distribution is closely related to random variables. A loose definition of a random variable is the numerical outcome of an experiment. More formally, random variables are measurable functions from the sample space (the set of possible outcomes) to the set of real numbers. Consider the coin-tossing example: the possible outcomes are head or tail, which form the sample space Ω = {H, T}. The random variable can be defined as follows: X(H) = 1 and X(T) = 0, which means the numerical outcome of seeing a head is 1 and of seeing a tail is 0. In this case, the random variable also works as an indicator function.
Why do we learn this? The answer is that if we don’t convert the outcomes to numerical values, we can’t do any arithmetic on them, e.g. calculate expectations and variances. We also need this mapping to enable the study of distributions, since talking about a distribution without random variables is a bit odd.
Random variables in probability theory correspond to the measurable functions of Lebesgue integration. In terms of measure theory, we define a distribution as the probability measure μ on (R¹, ℛ¹), where R¹ denotes the real line and ℛ¹ is the Borel σ-algebra generated by its intervals (refer to the definition of measure space we discussed before):
$$\mu(A) = P(X \in A) = P\big(X^{-1}(A)\big), \quad A \in \mathcal{R}^1 \tag{3.1}$$
And the distribution function is defined as
$$F(x) = \mu\big((-\infty, x]\big) = P(X \le x), \qquad \mu(\{x\}) = F(x) - F(x-) \tag{3.2}$$
where F(x−) denotes the left-hand limit (since the function F is non-decreasing, hence not oscillating, the left-hand limit always exists). In fact, F is a right-continuous function with left limits. An example of such a function is given below: when you approach any number from the right, the limit equals the function value.
[Figure: a right-continuous step function with left limits – approaching any point from the right, the limit equals the function value]
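To see right-continuity and left limits in action, here is a small sketch using the distribution function of our coin-tossing random variable X (the Bernoulli(1/2) case; the tiny epsilon used as a numerical stand-in for the left limit is my own choice):

```python
def F(x, p=0.5):
    """Distribution function of a Bernoulli(p) variable: jumps at 0 and 1."""
    if x < 0:
        return 0.0
    if x < 1:
        return 1.0 - p
    return 1.0

def left_limit(F, x, eps=1e-9):
    """Numerical stand-in for the left-hand limit F(x-)."""
    return F(x - eps)

# Right-continuity: approaching 0 from the right reproduces F(0)...
print(F(0.0), F(1e-9))              # 0.5 0.5
# ...while the left limit differs; the jump equals the point mass.
print(F(0.0) - left_limit(F, 0.0))  # 0.5 == P(X = 0)
```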
This gives us the cumulative distribution function, which is usually taught in fundamental probability. A quick recap: (1) a probability distribution is a function; in terms of measure theory, it is a measure; (2) F is the distribution function, which is defined using that measure.
To grasp this definition better, we need to connect it with some concrete distributions; here the Bernoulli and binomial distributions will be used as examples. Let’s look at the coin-tossing example again, with the random variable X defined as X(H) = 1 and X(T) = 0. If we repeat the experiment n times, the number of heads follows a binomial distribution. Every single trial follows the Bernoulli distribution, which is a special case of the binomial distribution (n = 1).
The Bernoulli distribution is very simple: it is a discrete distribution concentrated on only two points. The random variable has only two values to take on, namely 1 and 0. For a fair coin, the probability of each of them is μ({1}) = 0.5 and μ({0}) = 0.5. "μ" and "P" will sometimes be used interchangeably, since here they refer to the same thing.
[Figure: the Bernoulli(0.5) distribution – probability mass 0.5 at each of the two points 0 and 1]
How about the binomial distribution? What is the random variable, and what does the graph look like? The random variable X is defined as the number of successes (k) in a sequence of n trials, with the probability of success being p. This means k is the concrete value X takes, while n and p are the parameters of the distribution of X. The probability measure at each point is given by
$$\mu(\{k\}) = \binom{n}{k} p^k (1 - p)^{n - k}, \quad k = 0, 1, \ldots, n$$
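For concreteness, a minimal sketch that evaluates this measure point by point and checks that the whole sample space gets measure one (the values n = 10 and p = 0.5 are arbitrary choices):

```python
from math import comb

def binomial_pmf(k, n, p):
    """The measure μ({k}) = C(n, k) p^k (1 - p)^(n - k) of the point k."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.5
measure = {k: binomial_pmf(k, n, p) for k in range(n + 1)}
print(sum(measure.values()))  # 1.0 (up to floating-point error): μ(Ω) = 1

# n = 1 recovers the Bernoulli distribution of the previous paragraph.
print(binomial_pmf(0, 1, 0.5), binomial_pmf(1, 1, 0.5))  # 0.5 0.5
```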
Now we are ready to explain what is strange about continuous distributions. Why is the probability measure of every single point zero? The full story is quite involved, but in short, no suitable probability measure can be defined on the σ-algebra of all subsets of the real numbers. This is proved by partitioning an interval of real numbers (which is a set of real numbers) and constructing a Vitali set; for a good explanation, see here. The conclusion is that when the sample space is an interval, i.e. a uniform distribution, it only makes sense to measure the lengths of subintervals. Therefore, in this case,
the σ-algebra is the Borel σ-algebra – the σ-algebra generated by all the open intervals (or, equivalently, by the closed sets).
This is also mentioned in this post. Connecting this to our intuitive knowledge of measure, we can see that the probability measure on the real line indeed gives us a length, and single points have no length. And we have
$$\mu(\{x\}) = 0 \quad \text{for every single point } x \in \mathbb{R}$$
We can also see why countable additivity, rather than additivity over arbitrary collections, is what we demand: if the third axiom applied to uncountable unions as well, then adding up the probability measures of all the points on the interval would give zero, since μ({x}) = 0 for every x on the interval. However, all those points together form the whole interval, which is the whole sample space Ω, so we would conclude μ(Ω) = 0, contradicting the second axiom.
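Written out in one line – if additivity were allowed over the uncountable union of singletons that makes up Ω, we would get

$$1 = \mu(\Omega) = \mu\Big(\bigcup_{x \in \Omega} \{x\}\Big) \stackrel{?}{=} \sum_{x \in \Omega} \mu(\{x\}) = 0$$

The questionable step is exactly the one the axiom does not grant: additivity over an uncountable family.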
Probability mass, density, and cumulative function
With the knowledge we are now equipped with, some confusing concepts can be easily clarified.
Probability mass: This is given by the probability mass function, which is a probability measure on a probability space where the sample space is discrete.
(Think about this concept a bit more deeply: what is "discrete" actually about? It is usually connected with the set of natural numbers, and discrete random variables are typically indexed by natural numbers rather than, say, rational numbers. More precisely, the values we are interested in are taken from a finite or countably infinite set. Why is it so? This can be answered with some basic set theory. Both the set of natural numbers and the set of rational numbers are countably infinite; this means they have the same cardinality, or equivalently, there is a bijection between those two sets.)
This means the probability mass function gives the probability that the random variable takes a particular value.
Probability density: A seemingly surprising fact – the probability density is not a probability (measure). It is rather a relative quantity, which tells us how likely, relatively speaking, the random variable is to take a value near some point. In the case of the uniform distribution, the probability density is the same everywhere, which means each value is equally likely to be taken by the random variable. The relevant function is the probability density function; note that it outputs the density of a continuous random variable, not a probability.
To relate this to what we discussed before: the probability of each single point in a continuous distribution equals zero. This is not the case for the probability density. Therefore, obviously, the probability density is not a probability.
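The interval of length 1/6 from the introduction makes this concrete; a minimal sketch (the endpoints 0 and 1/6 are chosen to match that example):

```python
def uniform_density(x, a=0.0, b=1/6):
    """Density of the uniform distribution on (a, b): constant 1/(b - a)."""
    return 1.0 / (b - a) if a < x < b else 0.0

def uniform_prob(c, d, a=0.0, b=1/6):
    """P((c, d)): probabilities come from integrating the density."""
    lo, hi = max(a, c), min(b, d)
    return max(0.0, (hi - lo) / (b - a))

print(uniform_density(0.1))     # 6.0 -- a density larger than one
print(uniform_prob(0.0, 1/6))   # 1.0 -- the whole sample space
print(uniform_prob(0.0, 1/12))  # 0.5
print(uniform_prob(0.1, 0.1))   # 0.0 -- a single point has probability zero
```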
Cumulative function: The cumulative function is already defined in Eq 3.2. As the name indicates, it "accumulates" the probability of the random variable taking on all the values smaller than or equal to a certain value. Why do we need this concept? Because we sometimes encounter probability distributions that are neither purely discrete nor absolutely continuous – they can be a combination of both. In this case, we can’t describe the distribution with only a probability mass function or only a probability density function. However, a single cumulative function is enough to handle this situation. Consider the cumulative function shown in the following graph; it is neither discrete nor continuous.
[Figure 4.1: a cumulative function that jumps from 0 to 1/2 at x = 1 and then rises linearly to 1 on the interval (1, 2)]
What is the relationship between the cumulative function and the density? The probability density function is the derivative of the cumulative function, wherever the cumulative function is differentiable. Let’s look at the graph in Fig 4.1 again; the cumulative function is defined as
$$F(x) = \begin{cases} 0 & x < 1 \\ \dfrac{1}{2} + \dfrac{x - 1}{2} & 1 \le x < 2 \\ 1 & x \ge 2 \end{cases}$$
which can be read directly from the graph. How about the probability density function? The cumulative function is differentiable everywhere apart from the points 1 and 2. Therefore, the probability density function is f(x) = 1/2 for x in (1, 2) and 0 everywhere else.
The probability measure of single points is also interesting here. It is non-zero only at the discrete part, which is x = 1; we thus have P({1}) = F(1) − F(1−) = 1/2 (refer to Eq 3.2 for the case of a jump).
[Figure 4.3: the distribution of Fig 4.1 illustrated as a point mass of 1/2 at x = 1 together with the constant density 1/2 on (1, 2)]
The probability distribution can be illustrated as in Figure 4.3. Let’s check the additivity: P({1}) + P((1, 2)) = P({1}) + P((1, 2]) = 1/2 + 1/2 = 1 (the second 1/2 is the area under the density on the interval (1, 2)). Everywhere else the distribution is continuous; therefore, the probability mass is zero.
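Putting this mixed distribution into code makes the bookkeeping explicit; a sketch of the example above, recovering the point mass from the jump of F (the epsilon is again a numerical stand-in for the left limit):

```python
def F(x):
    """Cumulative function of Fig 4.1: jump of 1/2 at x = 1,
    then density 1/2 on the interval (1, 2)."""
    if x < 1:
        return 0.0
    if x < 2:
        return 0.5 + (x - 1) / 2
    return 1.0

eps = 1e-9
point_mass = F(1) - F(1 - eps)  # P({1}) = F(1) - F(1-), the jump
continuous = F(2) - F(1)        # P((1, 2]), the area under the density
print(point_mass, continuous, point_mass + continuous)  # 0.5 0.5 1.0
```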
The expected value
In the language of measure theory, the expected value of a random variable X on a probability space (Ω, 𝔉, P) is defined as the integral of X with respect to the measure P:
$$E[X] = \int_{\Omega} X \, dP \tag{5.1}$$
The expected value is nothing more than the weighted average of the outputs of a random variable. The definition in Eq 5.1 may seem strange at first sight. Recall that the more familiar definition of expectation is as follows:
$$E[X] = \sum_{i} x_i \, P(X = x_i) \ \text{(discrete)}, \qquad E[X] = \int_{-\infty}^{\infty} x f(x) \, dx \ \text{(continuous)}$$
It is not difficult to see that the definition in Eq 5.1 merges the two separate definitions into one.
Without measure theory, it is necessary to define the expected value separately, because the sample space of a discrete distribution is countable, while a continuous distribution has an uncountable sample space.
Therefore, the probability measures for those two kinds of distributions are handled differently – in the case of a discrete distribution, the probability of every single point is given by the probability mass, and summation is used; in the case of a continuous distribution, the probability of each point is zero, and we instead use the probability density to calculate the probability of an interval, which means we need to integrate.
Why do we need this merge? This question is equivalent to "what is good about the generalization?" The answer is simple. The advantage of the definition in Eq 5.1 over the separate definitions is that it brings the summation over simple random variables (those whose range is finite; refer to the concepts of simple function and range) and the Riemann integral of continuous random variables together, and this allows us to apply the general theory of integration to study them. Some other important concepts, such as variance, moments, etc., are defined using expected values. If we have a deep understanding of the probability space and the distribution, these other quantities will also be easy to grasp.
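As an illustration, the mixed distribution of Fig 4.1 is exactly a case where neither a plain sum nor a plain Riemann integral works on its own, while splitting E[X] = ∫ X dP into a discrete part and a continuous part does; a sketch (the midpoint Riemann sum is just one convenient way to approximate the integral):

```python
def f(x):
    """Density of the continuous part of the Fig 4.1 distribution."""
    return 0.5 if 1 < x < 2 else 0.0

# Discrete part: a weighted sum over the point masses.
discrete_part = 1 * 0.5  # the value 1 carries probability 1/2

# Continuous part: midpoint Riemann sum approximating ∫ x f(x) dx on (1, 2).
n = 100_000
dx = 1.0 / n
continuous_part = sum((1 + (i + 0.5) * dx) * f(1 + (i + 0.5) * dx) * dx
                      for i in range(n))

# E[X] = 1 * 1/2 + ∫_1^2 x/2 dx = 1/2 + 3/4 = 1.25
print(discrete_part + continuous_part)
```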
Summary:
In this post we first had a quick look at the concept of measure, then used some elementary examples to introduce the probability axioms. Then we tried to clear up some common confusions in probability. At last, we showed the general definition of the expected value. More about probability is upcoming.