Math for Machine Learning: Probability and Statistics Prerequisite
Going over some problem sets that cover basic Probability and Statistics (ProbStat) for Machine Learning
Hi everyone, welcome to my second post! This is a continuation of my first post. We’re going to be technical here, or should I say somewhat "Mathy". So if you have been away from hands-on math for a while, I suggest you review the material around Probability, Statistics, and Calculus. You can do that as you read this story so that you can follow my explanations comfortably.
This story is inspired by the course that I took at Columbia. The syllabus and the high-level detail of the course could be accessed through this link: COMS W4771 Machine Learning Fall 2019 by Daniel Hsu.
As you might realize, you can’t access the assignment files (which is intentional on the professor’s part). Thus, the problem sets that I’m going to cover will also be somewhat different; I would say they are variations of what I did in the course. They will not be entirely different, and I will try my best to keep them at around the same level.
I’ll give you three different problems in this story that I feel cover all the basic concepts you need to know around ProbStat to start learning Machine Learning (particularly Bayesian ML). Since most of the solutions are not trivial, I would warn you that this story will be lengthy, so bear with me!
Lastly, before we start, I would like to let you know that this post contains LaTeX formulas. To render them correctly, install the Math Anywhere add-on and activate it in your browser (I suggest Google Chrome) while reading this post.
Problem 1: Description
First things first: master your basic probability!
We’re going to use the k-Nearest-Neighbor (kNN) algorithm as our background story here. If you’re not familiar with it or if you need a refresher, I recommend you take a look at this amazing story from a fellow Towards Data Science (TDS) author.
Let’s limit our scope to $k=1$, i.e. we will only find the single nearest neighbor. Suppose we have $n$ training examples; then the complexity of this algorithm is $O(n)$, because we simply iterate over all $n$ examples, compute the $L_2$ distance to each, and finally pick the neighbor with the smallest $L_2$ distance.
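To make the setup concrete, here is a minimal brute-force 1-NN sketch in Python (my own illustration, not from the course; the function and variable names are assumptions):

```python
import numpy as np

def nearest_neighbor(x, X_train):
    """Brute-force 1-NN: compute the L2 distance to all n training points, O(n)."""
    dists = np.linalg.norm(X_train - x, axis=1)  # L2 distance to every training example
    return int(np.argmin(dists))                 # index of the closest training example

# Example: 1,000 training points in 5 dimensions
X_train = np.random.randn(1000, 5)
x = np.random.randn(5)
print(nearest_neighbor(x, X_train))
```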
Let’s say that you work at a company and your boss asks you to speed up this 1-NN computation. He doesn’t care if the result is not inch-perfect, as long as the returned neighbor falls within the 10% of training points with the smallest distances. For example, if there are 1 million data points, your picked neighbor must be within the 100k points with the smallest $L_2$ distance.
Let’s call this problem "approximate-Nearest-Neighbor" and we should have the following mathematical definition for it:
Given a test point $x \in \mathbb{R}^d$, we say that $x_{i(x)}$ is a 10%-approximate nearest neighbor if $\|x - x_{i(x)}\|_2$ is among the smallest 10% of the distances to the training points, $\|x - x_1\|_2, \ldots, \|x - x_n\|_2$, where $n$ is the number of training examples.
Of course, the above requirement can trivially be met by sampling $\lceil 0.9n\rceil + 1$ of the training examples: at least one of them must then be in the top 10%. In other words, if you have one million data points, just sample 900k + 1 of them, without replacement, and you’re good to go. However, this doesn’t speed up the computation by much, so it is rather useless.
So, let me propose something rather unexpected, at least that’s what I felt when I worked on a similar problem. What if I told you that by sampling exactly 50 examples, call this $T=50$, we could make sure that, mathematically, the probability of getting a neighbor outside the 10% of examples with the smallest $L_2$ distances is very small, say at most 0.5%? This holds regardless of the number of data points ($n$). Could you prove that?
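Before we get to the proof, here is roughly what that sampled search might look like in code, as a hypothetical sketch that scans only $T=50$ indices drawn without replacement:

```python
import numpy as np

def approx_nearest_neighbor(x, X_train, T=50, rng=None):
    """Approximate 1-NN: scan only T training points sampled without replacement."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(X_train)
    idx = rng.choice(n, size=min(T, n), replace=False)  # sample T indices without replacement
    dists = np.linalg.norm(X_train[idx] - x, axis=1)    # L2 distances on the sample only
    return int(idx[np.argmin(dists)])                   # index into the full training set
```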
Problem 1: Solution
The solution to Problem 1 is actually pretty straightforward, even though the problem description feels a little bit overwhelming. Notice that this is basically just a sampling-without-replacement problem. The way I approach it: since we’re looking for the probability that none of the sampled points is in the top 10%, it’s the same as the probability that all 50 sampled points are in the remaining 90% of the data.
Thus, we can write the probability, say $p$, as:
$$p = \frac{\text{number of combinations of picking 50 examples from } \lceil 0.9n\rceil \text{ data points}}{\text{number of combinations of picking 50 examples from } n \text{ data points}}$$
$$p=\frac{\binom{\lceil 0.9n\rceil}{50}}{\binom{n}{50}}=\frac{\frac{\lceil 0.9n\rceil!}{(\lceil 0.9n\rceil-50)!\,50!}}{\frac{n!}{(n-50)!\,50!}}=\frac{\lceil 0.9n\rceil!\,(n-50)!}{n!\,(\lceil 0.9n\rceil-50)!}=\left(\frac{(n-50)!}{n!}\right)\left(\frac{\lceil 0.9n\rceil!}{(\lceil 0.9n\rceil-50)!}\right)$$ $$p=\frac{\lceil 0.9n\rceil(\lceil 0.9n\rceil-1)(\lceil 0.9n\rceil-2)\cdots(\lceil 0.9n\rceil-49)}{n(n-1)(n-2)\cdots(n-49)}=\prod_{i=0}^{49}\frac{\lceil 0.9n\rceil-i}{n-i}$$
We’re interested in the value of this probability regardless of the size of the data. Hence, my approach is to take $n \to \infty$: the probability is highest in that limit (try to reason for yourselves why this is the case). Thus, if the limit is at most 0.5%, the chance is at most 0.5% for any $n$.
$$p_{n\to\infty}=\lim_{n\to\infty}\prod_{i=0}^{49}\frac{\lceil 0.9n\rceil-i}{n-i}$$
Using the product rule for limits:
$$p_{n\to\infty}=\prod_{i=0}^{49}\lim_{n\to\infty}\frac{\lceil 0.9n\rceil-i}{n-i}$$
This would be pretty easy to solve if not for the ceiling function. I’ll leave the case where $0.9n$ is not an integer to the reader as an exercise; here I’ll just cover the simpler case, i.e. when $0.9n$ is an integer:
$$p_{n\to\infty}=\prod_{i=0}^{49}\lim_{n\to\infty}\frac{\lceil 0.9n\rceil-i}{n-i}=\prod_{i=0}^{49}\lim_{n\to\infty}\frac{0.9n-i}{n-i}=\prod_{i=0}^{49}\lim_{n\to\infty}\frac{0.9-i/n}{1-i/n}$$ $$p_{n\to\infty}=\prod_{i=0}^{49}\frac{0.9-0}{1-0}=\prod_{i=0}^{49}0.9={0.9}^{50}\approx 0.00515377520732012$$
We can see above that the chance of not getting even one sample from the top 10% of examples (regardless of data size) is at most around 0.5%.
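If you want to sanity-check this numerically, here is a small sketch of mine that evaluates the exact product formula for a few finite $n$ and compares it against the $0.9^{50}$ limit:

```python
import math

def miss_probability(n, t=50):
    """P(all t sampled points fall outside the top 10% closest), sampling without replacement."""
    outside = math.ceil(0.9 * n)            # number of points NOT in the top 10%
    p = 1.0
    for i in range(t):
        p *= (outside - i) / (n - i)
    return p

for n in (1_000, 10_000, 1_000_000):
    print(n, miss_probability(n))
print("limit:", 0.9 ** 50)                  # ~0.00515
```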
This is rather remarkable: we could reduce the complexity from $O(n)$ to $O(50)$, which is constant! However, in reality, this is not always true. In the comment section below, let me know if you have any idea why that is the case!
PS: There is another catch that you might find: notice that a sampling-without-replacement problem with $n \to \infty$ could be seen as (or be approached as) a sampling-with-replacement problem!
Problem 2: Description
This is a standard textbook problem that you might find all over the place when you learn statistics: deriving a Maximum Likelihood Estimation (MLE) formula. But knowing how to do this means you have a good foundation to dive into some subfields of ML, particularly Bayesian inference.
If you’re not familiar with the topic, consider checking this cool article, also by a fellow TDS author. You would want to also familiarize yourself with the terms that I’ll use below (i.i.d, random variable, probability density function, likelihood, etc.). So, here we go!
Consider a statistical model for i.i.d. random variables $X_1, X_2, \ldots, X_n$, parameterized by $\theta \in \Theta$. The parameter space is the positive reals, $\Theta = \{\theta \in \mathbb{R} : \theta > 0\}$.
The distribution $P_\theta$ of $X_1$ is as follows: $X_1 \sim P_\theta$ has a probability density function (p.d.f.) $f_\theta$ satisfying:
$$f_\theta(x) \propto 1 \text{ for all } 0 \leq x \leq \theta$$
$$f_\theta(x) = 0 \text{ for } x < 0 \text{ or } x > \theta$$
Derive a simple formula for the MLE of $\theta$, given data $(X_1, X_2, \ldots, X_n) = (x_1, x_2, \ldots, x_n)$!
Problem 2: Solution
We will consider the two cases in the definition of the p.d.f. separately:
- For $x < 0$ or $x > \theta$, $f_{\theta}(x)=0$. Thus, if any data point falls outside $[0, \theta]$, the likelihood is zero, so such a $\theta$ can never maximize the likelihood.
- For $0 \leq x \leq \theta$, $f_{\theta}(x) \propto 1$. Remember that $\propto$ means the p.d.f. is proportional to 1; it is not exactly 1, but some constant. This means that $f_{\theta}(x)=c$ for some constant $c$ with $0<c<\infty$. Using the property that a p.d.f. must integrate to one, we can determine $c$: $$c\int_{0}^{\theta}dx=1$$ $$cx\Big|_0^{\theta}=1$$ $$c[\theta-0]=1$$ $$c=1/\theta$$ Hence, the likelihood for $\{x_1,x_2,x_3,\ldots,x_n\}$ is: $$L(\theta|x)=\prod_{i=1}^{n} f_{\theta}(x_i) = c^n = (1/\theta)^n = 1/\theta^n$$ To maximize this likelihood, we want to choose the smallest $\theta$ possible. Since we need $0\leq x_i\leq\theta$ for every data point, the smallest feasible $\theta$ is the largest $x_i$ in the set. Thus, the MLE formula is: $$\boldsymbol{\hat{\theta}_{MLE}=\max(x_1, x_2, x_3, \ldots, x_n)}$$ Since the likelihood here is positive while the likelihood in the first case is zero, this is the value we pick as our MLE.
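As a quick illustration of this result, here is a tiny simulation of mine (the true $\theta$ and the sample size are arbitrary choices) showing that the maximum of the data behaves as expected:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 2.5
x = rng.uniform(0.0, theta_true, size=100)  # i.i.d. samples from Uniform(0, theta)

theta_mle = x.max()                         # the MLE derived above: max of the data
print(theta_mle)                            # close to, but never above, theta_true
```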
Problem 3: Description
In this problem, you will take a look at some models for collecting data that ensure some degree of privacy guarantee, a topic that some of you might be interested in these days.
Since election days are coming up in some countries, I think it would be interesting to use that as the background story here. Suppose that there are two presidential candidates in one of those countries, call it country $U$, namely Mr. Murpt and Mr. Nebid. Refer to Mr. Murpt as $M$ and to Mr. Nebid as $N$.
Imagine that you are an independent data scientist and some parties hire you to estimate the proportion of people in that country who vote for one of the candidates. You can of course easily ask people directly who they will vote for, but some people might find it uncomfortable to share that information with you outright, since you’re basically nobody to them.
Here, I will introduce you to a concept called "Randomized Response". Basically, it tries to address the privacy concern by not asking directly what their choice is. Instead, you do the following for each surveyed individual:
Ask him/her to toss a coin twice and not to let you know the outcomes
If at least one of the tosses is heads, tell him/her to respond truthfully (e.g. if he/she votes for $M$, then he/she tells you $M$)
If both tosses are tails, tell him/her to give the opposite response (e.g. if he/she votes for $M$, then he/she tells you $N$)
I would argue that the scheme above assures a high degree of privacy guarantee, unless of course the coin is rigged.
Now we come to the main questions.
Consider a statistical model for $n$ responses collected using the scheme above, where the responses are regarded as i.i.d. $\{0, 1\}$-valued random variables $Y_1, Y_2, \ldots, Y_n$ (with $Y_i = 1$ meaning the response is $M$), and all coin tosses are independent. Let $\theta \in [0, 1]$ denote the parameter of the model that equals the proportion of individuals in the population who vote for $M$.
Then, answer these three problems:
What is the probability that $Y_1 = 1$? The answer should be given in terms of $\theta$.
What is the log-likelihood of $\theta$ given data $y_1, y_2, \ldots, y_n \in \{0,1\}$? The answer should be given in terms of $\theta$ and $y_1, y_2, \ldots, y_n$.
Suppose $n = 100$, i.e. we only sample a few people from each state of the country, and the number of $y_i$ that are equal to 1 is 40, i.e. $\sum_{i=1}^{n}y_i = 40$. Plot the log-likelihood as a function of $\theta \in [0,1]$. What appears to be the $\theta$ with the highest likelihood?
Problem 3.1: Solution
Let $X$ be the random variable that denotes the number of heads in the two coin tosses ($X \in \{0, 1, 2\}$). Then, we can define the two probabilities below: $$P(X=0)=P(\text{tails})\cdot P(\text{tails})=(1/2)(1/2)=1/4$$ $$P(X\geq1)=1-P(X=0)=1-(1/2)(1/2)=3/4$$
Using those probabilities, $Y_1 = 1$ occurs either when a person who chooses $M$ gets at least one head, with probability $P(X\geq1)$, or when a person who chooses $N$ gets no heads, with probability $P(X=0)$. Also, remember that choosing $M$ has probability $\theta$ and choosing $N$ has probability $1-\theta$. Thus, $P(Y_1=1)$ is:
$$P(Y_1=1)=P(X=0)(1-\theta) + P(X\geq1)\theta$$ $$P(Y_1=1)=(1/4)(1-\theta) + (3/4)\theta$$ $$\boldsymbol{P(Y_1=1)=(1/2)\theta + (1/4)}$$
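To see this formula in action, here is a small Monte Carlo sketch of the randomized-response scheme (the value of $\theta$ and the number of respondents are arbitrary choices on my part):

```python
import numpy as np

rng = np.random.default_rng(42)
theta = 0.3                                  # assumed true proportion voting for M
n = 200_000

votes_m = rng.random(n) < theta              # True if the individual actually votes for M
both_tails = rng.random(n) < 0.25            # both tosses are tails, probability 1/4
responses = np.where(both_tails, ~votes_m, votes_m)  # lie only when both tosses are tails

print(responses.mean())                      # empirical P(Y = 1)
print(0.5 * theta + 0.25)                    # derived P(Y_1 = 1)
```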
Problem 3.2: Solution
Because this problem involves discrete random variables, we can model the likelihood with the same function that gives the probability. To compute that probability, there are two components we need to define:
- How many orderings of the responses (values of $y_i$) give the observed number of ones; this is counted by $\binom{n}{\sum_{i=1}^{n}y_i}$. For example, if $n=2$ and $\sum_{i=1}^{n}y_i=1$, there are $\binom{2}{1}=2$ such orderings: $[0,1]$ and $[1,0]$.
- For each individual response, if $y_i=1$ the probability is $P(y_i=1)$, and if $y_i=0$ it is $P(y_i=0)$. This can be written compactly as $P(y_i=1)^{y_i}P(y_i=0)^{1-y_i}$, which is basically a Bernoulli probability; combined with the coefficient above, this gives a binomial distribution.
Thus, the likelihood is:
$$L(\theta\,|\,n,\textstyle\sum_{i=1}^{n}y_i)=P(\textstyle\sum_{i=1}^{n}y_i\,|\,n,\theta)$$
$$L(\theta\,|\,n,\textstyle\sum_{i=1}^{n}y_i)=\binom{n}{\sum_{i=1}^{n}y_i}\prod_{i=1}^{n}P(y_i=1)^{y_i}\,P(y_i=0)^{1-y_i}$$
$$L(\theta\,|\,n,\textstyle\sum_{i=1}^{n}y_i)=\binom{n}{\sum_{i=1}^{n}y_i}\prod_{i=1}^{n}P(y_i=1)^{y_i}\,(1-P(y_i=1))^{1-y_i}$$
$$L(\theta\,|\,\textstyle\sum_{i=1}^{n}y_i)=\binom{n}{\sum_{i=1}^{n}y_i}\prod_{i=1}^{n}\left(\tfrac{1}{2}\theta+\tfrac{1}{4}\right)^{y_i}\left(-\tfrac{1}{2}\theta+\tfrac{3}{4}\right)^{1-y_i}$$
Then, the log-likelihood is:
$$\ln L(\theta\,|\,\textstyle\sum_{i=1}^{n}y_i)=\ln\left[\binom{n}{\sum_{i=1}^{n}y_i}\prod_{i=1}^{n}\left(\tfrac{1}{2}\theta+\tfrac{1}{4}\right)^{y_i}\left(-\tfrac{1}{2}\theta+\tfrac{3}{4}\right)^{1-y_i}\right]$$
$$\ln L(\theta\,|\,\textstyle\sum_{i=1}^{n}y_i)=\ln\binom{n}{\sum_{i=1}^{n}y_i}+\ln\prod_{i=1}^{n}\left(\tfrac{1}{2}\theta+\tfrac{1}{4}\right)^{y_i}\left(-\tfrac{1}{2}\theta+\tfrac{3}{4}\right)^{1-y_i}$$
$$\ln L(\theta\,|\,\textstyle\sum_{i=1}^{n}y_i)=\ln\binom{n}{\sum_{i=1}^{n}y_i}+\sum_{i=1}^{n}\ln\left(\tfrac{1}{2}\theta+\tfrac{1}{4}\right)^{y_i}+\sum_{i=1}^{n}\ln\left(-\tfrac{1}{2}\theta+\tfrac{3}{4}\right)^{1-y_i}$$
$$\ln L(\theta\,|\,\textstyle\sum_{i=1}^{n}y_i)=\ln\binom{n}{\sum_{i=1}^{n}y_i}+\sum_{i=1}^{n}y_i\ln\left(\tfrac{1}{2}\theta+\tfrac{1}{4}\right)+\sum_{i=1}^{n}(1-y_i)\ln\left(-\tfrac{1}{2}\theta+\tfrac{3}{4}\right)$$
$$\boldsymbol{\ln L(\theta\,|\,\textstyle\sum_{i=1}^{n}y_i)=\ln\binom{n}{\sum_{i=1}^{n}y_i}+\sum_{i=1}^{n}y_i\ln\left(\tfrac{1}{2}\theta+\tfrac{1}{4}\right)+\left(n-\sum_{i=1}^{n}y_i\right)\ln\left(-\tfrac{1}{2}\theta+\tfrac{3}{4}\right)}$$
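In code, the final expression can be written as a short function. Here is a minimal sketch, assuming SciPy’s gammaln for the log of the binomial coefficient (the function name is my own):

```python
import numpy as np
from scipy.special import gammaln

def log_likelihood(theta, n, s):
    """Log-likelihood derived above, where s = sum of the y_i (number of 1-responses)."""
    log_binom = gammaln(n + 1) - gammaln(s + 1) - gammaln(n - s + 1)  # ln C(n, s)
    return (log_binom
            + s * np.log(0.5 * theta + 0.25)
            + (n - s) * np.log(-0.5 * theta + 0.75))

print(log_likelihood(0.3, n=100, s=40))
```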
Problem 3.3: Solution

Above is the plot of the log-likelihood as a function of $\theta\in[0,1]$ given $n=100$ and $\sum_{i=1}^{n}y_i=40$. The snippet of code I used to generate this plot is in this GitHub Gist: log likelihood plot.ipynb.
We can see that the $\theta$ with the highest likelihood appears to be somewhere between $0.2$ and $0.4$. Let’s verify that by deriving the optimal $\theta$ of the log-likelihood below, i.e. $\hat{\theta}_{MLE}$.
The function that corresponds to the plot is derived below by substituting the given values:
$$\ln L(\theta\,|\,n=100,\textstyle\sum_{i=1}^{n}y_i=40)=\ln\binom{100}{\sum_{i=1}^{100}y_i}+\sum_{i=1}^{100}y_i\ln\left(\tfrac{1}{2}\theta+\tfrac{1}{4}\right)+\left(100-\sum_{i=1}^{100}y_i\right)\ln\left(-\tfrac{1}{2}\theta+\tfrac{3}{4}\right)$$
$$\ln L(\theta\,|\,n=100,\textstyle\sum_{i=1}^{n}y_i=40)=\ln\binom{100}{40}+40\ln\left(\tfrac{1}{2}\theta+\tfrac{1}{4}\right)+(100-40)\ln\left(-\tfrac{1}{2}\theta+\tfrac{3}{4}\right)$$
$$\ln L(\theta\,|\,n=100,\textstyle\sum_{i=1}^{n}y_i=40)=\ln\left(\tfrac{100!}{60!\,40!}\right)+40\ln\left(\tfrac{1}{2}\theta+\tfrac{1}{4}\right)+60\ln\left(-\tfrac{1}{2}\theta+\tfrac{3}{4}\right)$$
$$\boldsymbol{\ln L(\theta\,|\,n=100,\textstyle\sum_{i=1}^{n}y_i=40)=64.7906+40\ln\left(\tfrac{1}{2}\theta+\tfrac{1}{4}\right)+60\ln\left(-\tfrac{1}{2}\theta+\tfrac{3}{4}\right)}$$
To determine the $\theta$ with the highest likelihood, we take the first derivative of the log-likelihood with respect to $\theta$ and set it equal to zero.
$$\frac{d}{d\theta}\ln L(\theta)=0$$
$$\frac{d}{d\theta}\ln\left(\tfrac{100!}{60!\,40!}\right)+\frac{d}{d\theta}40\ln\left(\tfrac{1}{2}\theta+\tfrac{1}{4}\right)+\frac{d}{d\theta}60\ln\left(-\tfrac{1}{2}\theta+\tfrac{3}{4}\right)=0$$
$$\frac{(40)(\tfrac{1}{2})}{\tfrac{1}{2}\theta+\tfrac{1}{4}}+\frac{(60)(-\tfrac{1}{2})}{-\tfrac{1}{2}\theta+\tfrac{3}{4}}=0$$
$$\frac{20}{\tfrac{1}{2}\theta+\tfrac{1}{4}}+\frac{-30}{-\tfrac{1}{2}\theta+\tfrac{3}{4}}=0$$
$$\frac{-10\theta+15-15\theta-7.5}{\left(\tfrac{1}{2}\theta+\tfrac{1}{4}\right)\left(-\tfrac{1}{2}\theta+\tfrac{3}{4}\right)}=0$$
$$\frac{-25\theta+7.5}{\left(\tfrac{1}{2}\theta+\tfrac{1}{4}\right)\left(-\tfrac{1}{2}\theta+\tfrac{3}{4}\right)}=0$$
$$\theta=\frac{7.5}{25}=\frac{3}{10}=0.3$$
$$\boldsymbol{\hat{\theta}_{MLE}=0.3}$$
We can see that this is consistent with what we saw in the plot.
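As an extra check, here is a quick numerical maximization sketch (using SciPy; the constant $\ln\binom{100}{40}$ term is dropped since it does not affect the maximizer), which lands on the same value:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def neg_log_likelihood(theta, n=100, s=40):
    # ln C(n, s) is omitted: it is constant in theta and does not change the maximizer
    return -(s * np.log(0.5 * theta + 0.25) + (n - s) * np.log(-0.5 * theta + 0.75))

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x)  # approximately 0.3, matching the closed-form derivation above
```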
Closing Words
I think that would be all for this story. My apologies for requiring you to install the Math Anywhere add-on to read it; I couldn’t find a more convenient way to write LaTeX on Medium.
I prefer not to upload images for mathematical notation, and Tex to Unicode doesn’t work well for me. In the comment section below, please suggest any better way you know of to present mathematical notation on Medium; I would really appreciate it.
Also, please let me know whether the topic is of interest to you and whether the explanation is clear and concise, because I feel like I’m a little bit verbose here.
Cheers,
Geral