1. Introduction
In this post, we will walk through the building blocks of probability theory and use them to motivate fundamental ideas in machine learning. We will first talk about random variables and how they help quantify real-world experiments. The following section will segue into probability distribution functions. The final section will talk about how these mathematical concepts are used together to solve machine learning problems.
2. Random Variables
Let’s begin our journey with a fun experiment. Take a pen and paper; go outside to the main street in front of your house. Look at every person that walks past you and take note of their hair color, an approximation of their height in centimeters, and any other detail you find interesting. Do this for about 10 minutes.
Congratulations! You conducted your first experiment! With this experiment, you can now answer some questions: How many people walked past you? How many people who walked past you had blue hair? What was the average height of the people who walked past you? Maybe in this experiment, 10 people walked past you; 3 of these people had blue hair; and the average of their approximated heights was 165.32 cm. To each of these questions, we tied a number, a measurable quantity.
Random Variables are functions that map the outcome of an experiment to a measurable quantity.
We can now represent each of the 3 questions with a random variable. For example, let X₁ be the random variable that represents the number of people who walked past you. Note from the definition that random variables are functions. And so, we can write the following in functional notation.

X₁(number of people who walked past you) = 10
This means that the random variable X₁ is a function that maps "the number of people who walked past you" (an outcome of the experiment) to the non-negative integer 10 (a measurable quantity). Similarly, let X₂ be the random variable that represents the number of people who walked past you and had blue hair. Then we can write the following.

X₂(number of people who walked past you and had blue hair) = 3
This means X₂ is a function that maps "the number of people who walked past you and had blue hair" (an outcome of the experiment) to the non-negative integer 3 (a measurable quantity). In a similar way, let X₃ be the random variable that represents the average height of the people who walked past you.

X₃(average height of the people who walked past you) = 165.32
This means X₃ is a function that maps the "average height of people who walked past you" (an outcome of the experiment) to the non-negative real number 165.32 (a measurable quantity). What makes these random variables so useful is their ability to turn a fun experiment of watching human beings into numbers that we can perform mathematics with. In the following section, we will see how these random variables form the core of probability distribution functions.
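To make the "random variables are functions" idea concrete, here is a minimal Python sketch. The list of observed people and the three functions are purely illustrative stand-ins for the experiment above.

```python
# A hypothetical record of the people observed during the 10-minute experiment.
people = [
    {"hair": "blue", "height_cm": 160.0},
    {"hair": "brown", "height_cm": 172.5},
    {"hair": "blue", "height_cm": 158.0},
    # ... one entry per person who walked past you
]

# Each "random variable" is just a function: outcome of the experiment -> number.
def X1(outcome):
    """Number of people who walked past you."""
    return len(outcome)

def X2(outcome):
    """Number of blue-haired people who walked past you."""
    return sum(1 for person in outcome if person["hair"] == "blue")

def X3(outcome):
    """Average height (cm) of the people who walked past you."""
    return sum(person["height_cm"] for person in outcome) / len(outcome)

print(X1(people), X2(people), X3(people))
```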
3. Probability Distribution Functions
Random variables provide us a way to quantify the outcomes of an experiment. But how exactly do these outcomes (and thus random variables) behave? We can understand this with probability distribution functions.
During our experiment, we saw 10 people pass us. "10 people" is an example of an "outcome". However, we could have had 0 people pass us, or 1 person pass us, or 2 people pass us, and so on. Hence, the set of all possible outcomes S of the experiment is the following.

S = {0, 1, 2, 3, …}
This set of all possible outcomes S is called the sample space; any subset of it, including S itself, is called an "event". Let’s now write this set in formal math notation. Let the outcome of a single experiment be ω₁ and the corresponding random variable be X₁. We can then write the set of all such possible outcomes as the following.

S = {ω₁ : X₁(ω₁) ∈ ℤ⁺}
The curly braces {.} represent a set; the colon ":" translates to the English phrase "such that"; and ℤ⁺ represents the set of non-negative integer values that X₁ can take. Given this context, the expression above translates to the following English statement.
The event S is a set of outcomes ω₁ such that the random variable X₁ can take on the value of any non-negative integer.
The probability of event S happening is 100%, which makes it less interesting. Typically, we are interested in an event that is a subset of S. For example, we might be interested in answering the question "What is the probability of exactly 2 people passing us?". We can define another event A as the set of outcomes we are interested in; in our case, we are interested in the outcome "2 people" only.

A = {2 people}
We can write this out more generally in the following mathematical notation.

A = {ω₁ : X₁(ω₁) = 2}
In English, we can translate this notation to the following statement.
The event A is a set of outcomes ω₁ such that the random variable X₁ can take on the value 2.
Now that we have an understanding of an "event", let’s define a probability distribution function.
Probability distribution functions are functions that map an event to the probability of occurrence of that event.
Let P be a probability distribution function. According to the definition, P is a function. Hence P takes an input and returns an output. The input to this probability distribution function is an event, while the output is some probability value. In mathematical notation, we can write P in the following way.

P(A) = probability of occurrence of event A
Using the definition of event A in this expression for P, we get the following notation.

P({ω₁ : X₁(ω₁) = 2})
This expression looks cluttered. But using random variables, we can rewrite it in a more concise and intuitive way.

P(X₁ = 2)
Using random variables, we can thus make use of the concise mathematical notation shown above, rather than the long-worded English statements or the long-winded set notation we started with. Let us now talk about the types of probability distribution functions; hopefully the utility of random variables will become even clearer as we continue our discussion.
3.1 Probability Mass Function
This is the probability distribution function of a discrete random variable. It takes in a value of the random variable and maps it to a probability. For example, the probability mass function p_X can be written in the following mathematical notation.

p_X(x) = P(X = x)
X is a discrete random variable; x is a sample value it can take on. From this definition, the probability mass function at X = x is the probability that the random variable X takes on the value x. You can think of this output as the probability mass of the random variable at that value, analogous to mass in physics.

A random variable is discrete if it can take on a finite number of values or a countably infinite number of values.
In our example experiment, we discussed a random variable X₁ that represents the number of people who walked past you. This could, in theory, be as low as 0 if you are in a quiet rural area and as high as the population of the world (~8 billion) if you are in the middle of a hypothetical world metropolis. Thus X₁ can take one of roughly 8 billion distinct values: 0, 1, 2, 3, …, 8 × 10⁹; this is a finite number of values. Hence, X₁ is a discrete random variable. The probability distribution function for the random variable X₁ may look something like the following graph.

The x-axis shows the distinct integer values X₁ can take. The y-axis shows the probability associated with the corresponding value of X₁. From the graph, we can infer that the probability that 10 people passed you during the experiment was around 0.03. In other words, the probability mass associated with X₁ = 10 is 0.03.

Given this definition of the probability mass function, let’s discuss two properties it exhibits.
Property 1: For every value the random variable can take, the value of the probability mass function is a probability greater than or equal to 0. In math language, we can write the following for the random variable X₁.

p_X₁(x) ≥ 0 for every x ∈ ℤ⁺
ℤ⁺ represents the set of non-negative integer values that X₁ can take.
Property 2: If we take every possible value the random variable X₁ can assume and evaluate the probability mass function at each of these values, the total is 1.

Σ p_X₁(x) = 1, where the sum runs over every x ∈ ℤ⁺
I hope the concepts of discrete random variables and probability mass functions make sense.
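As a quick sanity check, here is a small Python sketch of these two properties. The Poisson model for the number of passers-by is an arbitrary illustrative assumption, not something derived from the experiment.

```python
from scipy.stats import poisson

# Illustrative assumption: the number of people who walk past you in 10 minutes
# follows a Poisson distribution with mean 10. This is made up for the example.
mean_passersby = 10
support = range(0, 1000)  # wide enough to hold essentially all of the probability mass

pmf = {x: poisson.pmf(x, mean_passersby) for x in support}

# Probability mass of the event A = {2 people} under this assumed model.
print(f"P(X1 = 2) = {pmf[2]:.4f}")

# Property 1: every probability mass is greater than or equal to 0.
assert all(p >= 0 for p in pmf.values())

# Property 2: the masses over all values sum to 1 (up to truncating the support).
print(f"Total probability mass = {sum(pmf.values()):.6f}")
```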
3.2. Probability Density Function
This is the probability distribution function of a continuous random variable. It takes in a value of a continuous random variable and maps it to a probability density. The probability density function f_X can be written in the following mathematical notation.

f_X(x) = lim(Δx → 0⁺) P(x ≤ X ≤ x + Δx) / Δx
From this equation, the probability density function is the probability (mass) that the continuous random variable X falls in an infinitesimally small interval, divided by the length of that interval (the volume). Hence the density seen here is analogous to density in physics.

For clarity, we require a limit in the definition above because X is a continuous random variable and can take on real values. The limit indicates the value to which this ratio converges as Δx approaches 0 from the positive direction. In this section, we will derive some properties of probability density functions so it is clear why we need them. But first, let’s begin with a definition of continuous random variables.
A random variable is continuous if it can take on an uncountable number of values.
Consider the random variable X₃, which represents the average height of the people who walked past you during the experiment. This value could be a number like 165 cm or 170 cm. Here is our first attempt at graphing this data.

This graph gives a good overall representation of the data. However, it implies the average height of people measured in the experiment can only take integer values separated by 5 cm. This isn’t necessarily true; after all, we can see cases where the average height might be close to 166 cm, for instance. So maybe we can break this down such that every centimeter is denoted on the x-axis.

But Figure 3 still isn’t accurate, since the average height can be 165.5 cm or 165.25 cm or 165.25495824 cm; there are an uncountable number of values that X₃ can take. Hence, X₃ is a continuous random variable. To represent the distribution of a continuous random variable on a graph, the width of each vertical rectangular bar needs to become infinitesimally small; this leads to a smooth curve.
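To see this bar-narrowing idea numerically, here is a small sketch using heights simulated from a normal distribution (purely illustrative, not the data from the experiment): as the bin width shrinks, the normalized histogram value near 165 cm settles toward a smooth density value.

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative stand-in for the measured average heights (cm); NOT real data.
heights = rng.normal(loc=165, scale=5, size=100_000)

for bin_width in [5.0, 1.0, 0.1]:
    edges = np.arange(140, 190 + bin_width, bin_width)
    counts, edges = np.histogram(heights, bins=edges)
    # Dividing counts by (total count * bin width) turns the histogram into a density.
    density = counts / (counts.sum() * bin_width)
    idx = int(np.digitize(165.0, edges)) - 1  # index of the bin containing 165 cm
    print(f"bin width {bin_width:>4} cm -> density near 165 cm ≈ {density[idx]:.4f}")
```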

Property 1: A core property of continuous random variables is that their probability mass at any single value is 0. Mathematically, this can be represented with the following equation.

P(X₃ = x) = 0 for any value x
We will show why this is the case with some calculus, using the cumulative distribution function. The cumulative distribution function is the probability of the random variable X₃ taking on any value less than or equal to some value x. If we represent the cumulative distribution function by F, we can write the following notation.

F(x) = P(X₃ ≤ x)
To compute the probability mass associated with a continuous random variable at a point, let’s take the difference between the cumulative distribution function of X₃ evaluated at x and at a slightly smaller value. Let’s write this using only the notation for the probability distribution function P to see what is going on.

P(x − Δx < X₃ ≤ x) = P(X₃ ≤ x) − P(X₃ ≤ x − Δx) = F(x) − F(x − Δx)
With continuous variables, Δx is an infinitesimally small value that converges to 0. So we can write this using limits in math notation.

P(X₃ = x) = lim(Δx → 0⁺) [F(x) − F(x − Δx)]
Solving this limit, we see that the probability mass of the continuous random variable X₃ taking on any particular value x is 0.

P(X₃ = x) = F(x) − F(x) = 0
Intuitively, this equation makes sense. The probability that the average height of people in an experiment is exactly 165 cm, and not 165 + 10⁻¹⁰⁰ cm or 165 − 10⁻¹⁰⁰ cm, converges to 0. This means there effectively is no probability "mass" at a specific point when dealing with continuous random variables. So instead of measuring mass, let us use the concept of density. Density is mass per unit volume.

density ≈ P(x ≤ X₃ ≤ x + Δx) / Δx = [F(x + Δx) − F(x)] / Δx
With continuous variables like X₃, Δx is an infinitesimally small value that converges to 0. Mathematically, we represent this using limits.

f_X₃(x) = lim(Δx → 0⁺) [F(x + Δx) − F(x)] / Δx
The right-hand side is the formal definition of the derivative of the cumulative distribution function.

f_X₃(x) = dF(x)/dx
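A quick numerical sketch of this relationship (assuming, for illustration only, that X₃ follows a normal distribution; the experiment does not tell us this): the finite-difference ratio [F(x + Δx) − F(x)] / Δx approaches the density as Δx shrinks.

```python
from scipy.stats import norm

# Illustrative assumption: X3 ~ Normal(mean=165, std=5); not derived from the post.
X3 = norm(loc=165, scale=5)
x = 165.0

for dx in [1.0, 0.1, 0.001]:
    finite_difference = (X3.cdf(x + dx) - X3.cdf(x)) / dx
    print(f"dx = {dx:>6}: [F(x+dx) - F(x)] / dx = {finite_difference:.5f}")

# The limit of the ratio is the probability density function evaluated at x.
print(f"f(x) = {X3.pdf(x):.5f}")
```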

Property 2: Because the cumulative distribution function never decreases (probability only accumulates as x grows), its derivative can never be negative. Hence another important property of the probability density function is that it is greater than or equal to 0.

f_X₃(x) ≥ 0 for every value x
We can integrate both sides over all possible values of x to get rid of the derivative.

∫ f_X₃(x) dx = F(∞) − F(−∞) = 1 − 0, integrating from −∞ to ∞
Property 3: This shows another important property: the probability density function, integrated over all values of the continuous random variable, equals 1.

∫ f_X₃(x) dx = 1, integrating from −∞ to ∞
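Here is a small numeric check of these last two properties, again under the illustrative normal assumption for X₃: the density values are never negative, and a Riemann sum of the density over a wide grid is approximately 1.

```python
import numpy as np
from scipy.stats import norm

# Same illustrative assumption as before: X3 ~ Normal(165, 5).
X3 = norm(loc=165, scale=5)

# Evaluate the density on a fine grid that covers essentially all of the mass.
x = np.linspace(100, 230, 200_001)
f = X3.pdf(x)

# Property 2: the density is never negative.
assert np.all(f >= 0)

# Property 3: the density integrates to (approximately) 1; here, a Riemann sum.
dx = x[1] - x[0]
print(f"Approximate integral of f over the grid: {np.sum(f) * dx:.6f}")
```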

To learn more tidbits about these concepts in probability, check out the accompanying video on YouTube (links in Section 6).
3.3. Joint Probability Distribution
Let’s now talk about joint probability distributions in the context of discrete and continuous random variables. To motivate a discussion that makes sense in a machine learning / realistic context, let’s conduct another experiment. Search for a random house on zillow.com and take note of the house price and the number of bedrooms it contains; do this for 10 houses. Let’s now define some random variables.

Xᵢ = number of bedrooms in the i-th house, for i = 1, 2, …, 10
We defined 10 discrete random variables X₁ through X₁₀, one for each house. Each of these random variables is discrete, as the values it can take are countable. Now that we have random variables to map the outcome of events to numbers, we can do some analysis. For example, the probability mass that the first house has 3 bedrooms can be represented mathematically in the following equation.

p_X₁(3) = P(X₁ = 3)
Remember, this is a probability mass, as we are dealing with a discrete random variable. Similarly, the probability mass that the number of bedrooms in the 3rd house is 9 can be represented with this equation.

p_X₃(9) = P(X₃ = 9)
The joint probability distribution indicates the probability mass that both of these random variables take on some values simultaneously. The joint probability that the first house has 3 bedrooms and the third house has 9 bedrooms can be represented mathematically as follows.

P(X₁ = 3, X₃ = 9)
In this experiment, we observed 10 houses; if we write out the joint probability of observing these 10 values, we end up with the following notation, with one sampled value for each random variable.

P(X₁ = x₁, X₂ = x₂, …, X₁₀ = x₁₀)
This is the discrete case, where the properties of a probability mass function still hold true. First, the joint probability mass is greater than or equal to 0. In math notation, it looks like the following.

P(X₁ = x₁, X₂ = x₂, …, X₁₀ = x₁₀) ≥ 0
The second property states that this joint probability mass, summed over all possible values of each random variable, should equal 1.

Σ P(X₁ = x₁, X₂ = x₂, …, X₁₀ = x₁₀) = 1, summing over all possible values x₁, x₂, …, x₁₀
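As a tiny sketch of these two properties, consider a made-up distribution over bedroom counts and, to keep the enumeration small, just 3 houses assumed independent (both the distribution and the independence are illustrative assumptions).

```python
from itertools import product

# Toy, made-up distribution over bedroom counts for a single house (illustration only).
bedroom_pmf = {1: 0.1, 2: 0.3, 3: 0.35, 4: 0.2, 5: 0.05}

# Assuming (for this sketch) that the houses are independent, the joint probability
# mass of observing bedroom counts (x1, x2, x3) is the product of the individual masses.
def joint_pmf(x1, x2, x3):
    return bedroom_pmf[x1] * bedroom_pmf[x2] * bedroom_pmf[x3]

# Property 1: every joint mass is >= 0 (a product of non-negative numbers).
# Property 2: summing over every combination of values gives 1.
total = sum(joint_pmf(x1, x2, x3)
            for x1, x2, x3 in product(bedroom_pmf, bedroom_pmf, bedroom_pmf))
print(f"Total joint probability mass: {total:.6f}")  # ~1.0
```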

Let’s now extend the example to the continuous random variable case. Remember, when we documented the 10 houses, we took note of the number of bedrooms and the price. Let’s define another set of 10 random variables.

Yᵢ = price of the i-th house, for i = 1, 2, …, 10
We defined 10 continuous random variables Y₁ through Y₁₀, one for each house. Each of these random variables is continuous, as the values it can take are uncountable. Now that we have random variables to map the outcome of events to numbers, we can do some analysis. For example, the probability that the first house is priced under $300,000 is found by integrating the probability density, and can be represented with the following equation.

P(Y₁ < 300,000) = ∫ f_Y₁(y) dy, integrating from −∞ to 300,000
Similarly, the probability that the third house is priced under $700,000 can be represented with the following equation.

P(Y₃ < 700,000) = ∫ f_Y₃(y) dy, integrating from −∞ to 700,000
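For a concrete feel, here is a short sketch that computes such probabilities under an assumed price distribution (a normal distribution with made-up parameters; none of this comes from Zillow). The integral of the density up to a threshold is exactly the cumulative distribution function at that threshold.

```python
from scipy.stats import norm

# Illustrative assumption: house prices (in dollars) follow a normal distribution
# with mean $500,000 and standard deviation $150,000.
price = norm(loc=500_000, scale=150_000)

# P(Y1 < 300,000): the integral of the density from -inf to 300,000 is the CDF.
print(f"P(Y1 < 300,000) ≈ {price.cdf(300_000):.3f}")
print(f"P(Y3 < 700,000) ≈ {price.cdf(700_000):.3f}")
```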

The properties of joint probability density functions are similar to those of density functions for single random variables.
Property 1: For instance, the probability mass of a continuous random variable is 0 at any single point, as we discussed previously.

P(Yᵢ = y) = 0 for any house i and any value y
Property 2: The joint probability density is greater than or equal to 0 for any values of the random variables.

f_Y₁,…,Y₁₀(y₁, y₂, …, y₁₀) ≥ 0
Property 3: Also, the total joint probability density across all values of each continuous random variable should integrate to 1.

∫ ⋯ ∫ f_Y₁,…,Y₁₀(y₁, …, y₁₀) dy₁ ⋯ dy₁₀ = 1
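A small numeric check of properties 2 and 3, using just two houses and the same made-up normal price model as above, with independence assumed so the joint density is the product of the marginal densities.

```python
import numpy as np
from scipy.stats import norm

# Illustrative assumption: two independent house prices, each Normal(500k, 150k).
price = norm(loc=500_000, scale=150_000)

# On a 2D grid, the joint density of independent variables is the product of densities.
y = np.linspace(0, 1_500_000, 2_001)
dy = y[1] - y[0]
f1 = price.pdf(y)          # marginal density of Y1 on the grid
f2 = price.pdf(y)          # marginal density of Y2 on the grid
joint = np.outer(f1, f2)   # joint density f(y1, y2) = f(y1) * f(y2)

# Property 2: never negative. Property 3: the double integral is approximately 1.
assert np.all(joint >= 0)
print(f"Approximate double integral: {joint.sum() * dy * dy:.4f}")
```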

Given this context on probability distributions, let us now tie it all together with an application in machine learning.
4. Machine Learning Application
Perhaps the most fundamental application of probability in machine learning is the estimation of the parameters of a statistical model. Let’s continue to use the experiment of looking at zillow.com for information on houses that are listed for sale. Here are some example listings you might see.



Given these listings, let’s say we want to build a statistical model to predict the price at which a house sells, given information about the house such as its size in square feet, its age, and the number of bedrooms.

During our zillow.com experiment, let us look at 10,000 houses and collect this information for each house. We can eventually build a table that looks like the following.

Now, let’s talk about how the math fits in. In this experiment, we can consider an event to be the act of looking at a house and collecting its information. And so, we can come up with some random variables as follows.

Xᵢ = number of bedrooms in the i-th house
Wᵢ = size of the i-th house in sqft
Vᵢ = age of the i-th house
Yᵢ = price of the i-th house
In other words, for every house we see on Zillow, we can create 4 random variables. Let us say the 5th house we observe is the first house in Figure 5. We can define 4 random variables as follows.

So for 10,000 houses, we can create 40,000 random variables in the same way. And so, we were able to transform the event of looking at a house listing on zillow.com into numbers on which we can perform mathematics.
Now that we have defined our random variables, which of these are discrete random variables, and which are continuous? The number of bedrooms is countable; hence all 10,000 of the Xᵢ random variables are analyzed as discrete random variables. On the other hand, the size of the house, its age, and its price are measurements that can each take on an uncountable number of values. Hence, the 30,000 other random variables Wᵢ, Vᵢ, and Yᵢ are analyzed as continuous random variables.
Let’s now use the data we collected and the concept of random variables to estimate the parameters of our statistical model. But before doing so, let’s add some formal math to the model blueprint. We will use simple linear regression as the statistical model.

y = θ₀ + θ₁x + θ₂v + θ₃w
This is the linear regression hypothesis equation. Let’s write this equation a little more formally, with an explicit error term.

y = θ₀ + θ₁x + θ₂v + θ₃w + ε
Note that the variables x, y, v, w in this equation are not random variables; they are specific values that the outputs of the random variables can take. y is the house price we want to predict; x, v, w are the pieces of housing information we have; the 𝜃 terms are the model parameters, which we need to estimate; and ε is an irreducible error. The 𝜃 terms in this equation are typically calculated with a technique called Maximum Likelihood Estimation.
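As a small illustration, the hypothesis above is just a weighted sum. Here is a sketch in Python, where the 𝜃 values and the assignment of x, v, w to bedrooms, age, and size are made up for the example.

```python
# Hypothetical linear regression hypothesis: price predicted from bedrooms (x),
# age (v), and size in sqft (w). The theta values below are invented for illustration.
theta = {"theta0": 50_000.0, "theta1": 20_000.0, "theta2": -1_000.0, "theta3": 150.0}

def predict_price(x: float, v: float, w: float) -> float:
    """Evaluate y_hat = theta0 + theta1 * x + theta2 * v + theta3 * w."""
    return (theta["theta0"]
            + theta["theta1"] * x
            + theta["theta2"] * v
            + theta["theta3"] * w)

# Example: a 3-bedroom, 20-year-old, 1,500 sqft house.
print(f"Predicted price: ${predict_price(x=3, v=20, w=1_500):,.0f}")
```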
4.1 Maximum Likelihood Estimation
Intuitively, we want to determine the values of the parameters in the equation above that best fit the 10,000 house prices we have observed; that should give the best model. Mathematically, this is equivalent to finding the values of the parameters that maximize the joint probability density of observing the first house’s price being y₁, the second house’s price being y₂, and so on. In math notation, we represent this with the following expression.

arg max_𝜃 f_Y₁,…,Y₁₀₀₀₀(y₁, y₂, …, y₁₀₀₀₀)
Note the use of the word density, since we are dealing with continuous random variables. The arg max signifies "the value of the parameters that maximizes this function"; the function here is the joint probability density function. An assumption we make in machine learning is that the house prices are independently and identically distributed; let’s break this down. "Independently distributed" means that the price of house 1 neither affects nor is affected by the price of any other house in our 10,000-house dataset; this is a reasonable assumption. Mathematically, this means the joint probability density is the same as the product of its constituent parts.

P(Y₁ = y₁, Y₂ = y₂, …, Y₁₀₀₀₀ = y₁₀₀₀₀) = P(Y₁ = y₁) · P(Y₂ = y₂) ⋯ P(Y₁₀₀₀₀ = y₁₀₀₀₀)
Since we know that every Yᵢ is a continuous random variable, we can use the notation we learned for probability density functions. The result is a product of probability density functions.

f_Y₁,…,Y₁₀₀₀₀(y₁, …, y₁₀₀₀₀) = f_Y₁(y₁) · f_Y₂(y₂) ⋯ f_Y₁₀₀₀₀(y₁₀₀₀₀)
And so, we can replace the joint density in the arg max expression above with this product of probability density functions.

arg max_𝜃 f_Y₁(y₁) · f_Y₂(y₂) ⋯ f_Y₁₀₀₀₀(y₁₀₀₀₀)
Very nice! The second part of being independently and identically distributed is "identically distributed". As we have seen in the section on continuous random variables, each random variable can behave differently; hence each has its own distribution function. However, we are assuming the distribution of the potential price of the first house is the same as that of the other 9,999 houses. This means the probability density at the same point is the same for any of the Yᵢ random variables.

f_Y₁(y) = f_Y₂(y) = ⋯ = f_Y₁₀₀₀₀(y) = f_Y(y) for every value y
This means we can rewrite the expression using a single random variable Y in place of the 10,000 different Yᵢ.

arg max_𝜃 f_Y(y₁) · f_Y(y₂) ⋯ f_Y(y₁₀₀₀₀)
Let us compress this notation with the product symbol as follows.

arg max_𝜃 ∏ f_Y(yᵢ), with the product taken over i = 1, 2, …, 10,000
Nice again! We have a compact notation. What we do from here depends on the type of statistical model we are building. In our example, we are building a linear regression model, so the probability density function is assumed to follow a normal distribution. An interesting realization after working through this math is that, in the linear regression case, the optimal values of 𝜃 are exactly the values that minimize the residual sum of squares, a fundamental quantity in machine learning. If you’re curious about this derivation, the resources are linked below. Furthermore, if you are interested in an extended mathematical discussion on likelihood and its link to probability, check out my other Medium blog post and this video on my YouTube channel "Code Emporium".
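Here is a small numerical sketch of that equivalence, using synthetic data and a Gaussian noise assumption (everything below is invented for illustration): maximizing the Gaussian log-likelihood of the linear model recovers essentially the same parameters as ordinary least squares.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(42)

# Synthetic "houses" (all made up): bedrooms (x), age (v), size in sqft (w).
n = 10_000
X = np.column_stack([
    np.ones(n),                       # column of 1s for the intercept theta0
    rng.integers(1, 6, size=n),       # number of bedrooms
    rng.uniform(0, 50, size=n),       # age in years
    rng.uniform(500, 4_000, size=n),  # size in sqft
])
true_theta = np.array([50_000.0, 20_000.0, -1_000.0, 150.0])
sigma = 30_000.0
y = X @ true_theta + rng.normal(scale=sigma, size=n)  # Gaussian noise

# Negative Gaussian log-likelihood of the linear model (dropping constant terms):
# it is just the residual sum of squares divided by 2 * sigma^2.
def neg_log_likelihood(theta):
    residuals = y - X @ theta
    return residuals @ residuals / (2 * sigma ** 2)

def neg_log_likelihood_grad(theta):
    return -(X.T @ (y - X @ theta)) / sigma ** 2

theta_mle = minimize(neg_log_likelihood, x0=np.zeros(4),
                     jac=neg_log_likelihood_grad, method="BFGS").x
theta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares solution

print("Maximum likelihood estimate:", np.round(theta_mle, 1))
print("Least squares estimate:     ", np.round(theta_ols, 1))  # should agree closely
```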
5. Conclusion
In this post, we talked about how random variables allow us to quantify the outcome of an experiment. We then understood the behavior of these random variables using probability distribution functions. Depending on whether the random variable measured is discrete or continuous, we use different types of probability distribution functions: probability mass functions for discrete random variables, and probability density functions for continuous random variables. We finally tied all of these concepts together using machine learning, by understanding how one would estimate the parameters of a statistical model with Maximum Likelihood Estimation.
Thank you for reading until the end! For an extended discussion on mathematical concepts and how they tie into machine learning, please check out my YouTube channel "Code Emporium" and the other resources below.
All images without a source credit were created by the author
6. Resources
[1] Code Emporium, Probability Theory for Machine Learning (2022), YouTube.
[2] Imperial College London, Mathematics for Machine Learning, Coursera
[3] Johns Hopkins University, Advanced Statistics for Data Science, Coursera
[4] University of Sydney, Introduction to Calculus, Coursera
[5] Ajay Halthor, Likelihood, Probability and the math you should know (2022), Towards Data Science
[6] Dennis Sun, Introduction to Probability (2020)
[7] Joram Soch, Maximum Likelihood Estimation for Simple Linear Regression (2021), The Book of Statistical Proofs