Essential probNstats 4 ML

Beginner level probability and statistics

Introduction to expected value, variance, standard deviation, covariance, correlation, covariance matrix and correlation matrix

Christos Mousmoulas
Towards Data Science
7 min read · Dec 11, 2019


A Data Scientist or ML Engineer must analyze data and apply the right model to it in order to get useful results. Probability and Statistics are two fields of study that sharpen exactly those skills. That is why I decided to start there and write my first article of the Essential probNstats 4 ML project. It covers basic probabilistic and statistical concepts that every Data Scientist and ML Engineer should know.

Expected value

The Expected value of a random variable is the long-run average of that variable. For example, given the variable X = [1, 2, 3, 4, 5, 6], which represents a six-sided die where every side has exactly the same probability of occurrence, its Expected value is E[X] = 1·⅙ + 2·⅙ + 3·⅙ + 4·⅙ + 5·⅙ + 6·⅙ = 3.5, the average roll as the number of rolls approaches infinity. The equation of the Expected value is:

E[X] = x₁p₁ + x₂p₂ + … + xₖpₖ = Σᵢ xᵢpᵢ

Where X is a random variable with a finite number of finite outcomes (x₁, x₂, …, xₖ) occurring with probabilities (p₁, p₂, …, pₖ) respectively. If all probabilities are equal (p₁ = p₂ = … = pₖ), as in the case of a six-sided die, then the Expected value is equal to the simple average (or arithmetic mean). Furthermore, because the probabilities sum to 1, the Expected value can be seen as a weighted average, with the probabilities being the weights.
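As a quick sketch in plain Python (using the die example above; the outcomes and probabilities are the ones already given, nothing else is assumed):

```python
# Expected value of a discrete random variable: E[X] = sum of x_i * p_i.
outcomes = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6  # a fair die: every side equally likely

expected = sum(x * p for x, p in zip(outcomes, probs))

# With equal probabilities this reduces to the simple arithmetic mean.
mean = sum(outcomes) / len(outcomes)

print(expected, mean)  # both ≈ 3.5
```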

Variance and Standard Deviation

Variance, given a set of (random) numbers, measures how far those numbers are spread out from their average value. More formally, it is the expectation of the squared deviation of a random variable (X in our case) from its Expected value, and it is often represented as σ² or Var(X):

Var(X) = E[(X − E[X])²]

By “unpacking” the outer Expected value, and with X = xᵢ for i = 1,…,n, where n is the size of the population of X, we get:

Var(X) = Σᵢ pᵢ(xᵢ − μ)², where μ = E[X] = Σᵢ pᵢxᵢ

If all probabilities are equal (p₁ = p₂ = … = pₙ = 1/n), then the variance can be written as:

Var(X) = (1/n) Σᵢ (xᵢ − μ)²

This is also called the Population variance because we use all the numbers(population) of X for the calculation.

In contrast, we can use a sample of the whole population of a variable for the calculation of the variance. This is called Sample variance, and it is used in situations where the population is so big that the full calculation is impractical. If we choose this technique, the formula is different:

s² = (1/(n−1)) Σᵢ (yᵢ − Ȳ)²

Ȳ is the average (Expected value) of the sample Y and n is the size of that sample. Notice the n − 1 in the denominator; this makes the variance estimator unbiased. I won’t get into details about the bias, but you can read about its source here.

After finding the variance, we can calculate the Standard deviation of that variable, which is the square root of its variance:

σ = √Var(X)

It is represented by the σ symbol. The Standard deviation measures the amount of variation of a set of numbers: a high value indicates that those numbers spread far from their average.
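The population and sample formulas can be sketched in plain Python. The data set here is hypothetical, chosen only because its sums come out to round numbers:

```python
import math

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data set
mean = sum(data) / len(data)     # 40 / 8 = 5.0
sq_dev = sum((x - mean) ** 2 for x in data)  # sum of squared deviations = 32

# Population variance: divide by n.
pop_var = sq_dev / len(data)

# Sample (unbiased) variance: divide by n - 1.
samp_var = sq_dev / (len(data) - 1)

# Standard deviation is the square root of the variance.
pop_std = math.sqrt(pop_var)

print(pop_var, samp_var, pop_std)  # 4.0, ~4.57, 2.0
```

Python’s `statistics` module provides the same calculations as `pvariance`, `variance`, and `pstdev`, if you prefer not to write them by hand.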

But why to calculate the squared and not just the simple deviation of a random variable from its arithmetic mean?

Suppose that the numbers in the following two examples are the deviations of some numbers from their mean:

Example 1: 4, 4, −4, −4
Example 2: 1, −6, 7, −2

In the first example the mean of the deviations is (4+4−4−4)/4 = 0. This is obviously not an indicator of the amount of variation of those numbers. If we instead average the absolute values, (|4|+|4|+|−4|+|−4|)/4 = 4, we get a number other than 0. This looks like a good indicator, but what happens in the second example? (|1|+|−6|+|7|+|−2|)/4 = 4. The value is the same as in the first one, even though the numbers are spread out further. By squaring the deviations, averaging, and taking the square root, the outputs for the two cases are √((4²+4²+4²+4²)/4) = 4 and √((1²+6²+7²+2²)/4) ≈ 4.74 respectively. This looks right, since the second case has the greater standard deviation.
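The comparison above can be reproduced directly; the two lists are the deviation examples already given (they sum to 0 by construction, as deviations from a mean always do):

```python
import math

examples = [[4, 4, -4, -4], [1, -6, 7, -2]]

# Signed deviations cancel out, absolute deviations tie the two cases,
# and only the squared version tells them apart.
simple_means = [sum(d) / len(d) for d in examples]
abs_means = [sum(abs(x) for x in d) / len(d) for d in examples]
stds = [math.sqrt(sum(x * x for x in d) / len(d)) for d in examples]

print(simple_means)  # [0.0, 0.0]
print(abs_means)     # [4.0, 4.0]
print(stds)          # [4.0, ~4.74]
```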

Covariance and Correlation coefficient

Covariance is a measure of the joint variability of two random variables. It shows the similarity of those variables: if the greater and lesser values of one variable mainly correspond to those of the second variable, the covariance is positive. If the opposite happens, the covariance is negative. If it is approximately or exactly zero, the variables are linearly uncorrelated, which does not necessarily mean they are independent. It is often represented as cov(X, Y), σxʏ or σ(X, Y) for two variables X and Y, and its formal definition is the expected value of the product of their deviations from their individual expected values (arithmetic means):

cov(X, Y) = E[(X − E[X])(Y − E[Y])]

By “unpacking” the outer Expected value, but with equal probabilities between X = xᵢ and Y = yᵢ, for i = 1,…,n, we get:

cov(X, Y) = (1/n) Σᵢ (xᵢ − E[X])(yᵢ − E[Y])

and, more generally, with probabilities pᵢ:

cov(X, Y) = Σᵢ pᵢ(xᵢ − E[X])(yᵢ − E[Y])

The above is the Population covariance. We can calculate the Sample covariance with the same rules that apply to the Sample variance: take the same number of observations (n) from each variable, calculate their Expected values, and replace 1/n with 1/(n−1) to get the unbiased version.

A special case of covariance is when the two variables are identical (the covariance of a variable with itself). In that case, it is equal to the variance of that variable: cov(X, X) = Var(X).

Now, if we divide the covariance of two variables by the product of their standard deviations, we obtain Pearson’s correlation coefficient:

ρ(X, Y) = cov(X, Y) / (σx · σʏ)

It is a normalization of the covariance so that it takes values between −1 and +1, which makes its magnitude interpretable.
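A minimal sketch of both quantities, using a small hypothetical pair of variables and the population (1/n) convention from above:

```python
import math

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]  # hypothetical paired observations

n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Population covariance: mean of the products of the deviations.
cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n

# Pearson correlation: covariance normalized by the standard deviations.
sx = math.sqrt(sum((xi - mx) ** 2 for xi in x) / n)
sy = math.sqrt(sum((yi - my) ** 2 for yi in y) / n)
corr = cov / (sx * sy)

print(cov, corr)  # positive covariance, correlation ≈ 0.77
```

Since x and y mostly rise together, both values come out positive; swapping y for its negation would flip both signs without changing their magnitudes.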

Covariance and Correlation Matrix

A Covariance matrix is a square matrix that describes the covariances between two or more variables. The Covariance matrix of a random vector X is typically denoted by Kxx or Σ. For example, if we want to describe the covariances between three variables (X, Y, Z), we construct the matrix as follows:

⎡ cov(X, X)  cov(X, Y)  cov(X, Z) ⎤
⎢ cov(Y, X)  cov(Y, Y)  cov(Y, Z) ⎥
⎣ cov(Z, X)  cov(Z, Y)  cov(Z, Z) ⎦

Every cell is the covariance between a row variable and its corresponding column variable. As you may have noticed, the diagonal of the matrix contains the special case of the covariance (a variable with itself) and thus holds the variance of each variable. Another thing you may have observed is that the matrix is symmetric: the covariance values below the diagonal are the same as those above it.

With the same logic, one can construct Pearson’s correlation coefficient matrix, in which every covariance is divided by the product of the standard deviations of its corresponding variables. In that case, every diagonal entry equals 1, which denotes the total positive linear correlation of a variable with itself.
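Both matrices can be built by hand from the covariance formula alone. The three variables below are hypothetical; in practice `numpy.cov` and `numpy.corrcoef` do the same job:

```python
import math

# Three hypothetical variables, each observed n = 4 times.
data = {
    "X": [1.0, 2.0, 3.0, 4.0],
    "Y": [2.0, 1.0, 4.0, 3.0],
    "Z": [4.0, 3.0, 2.0, 1.0],
}

def pcov(a, b):
    """Population covariance of two equally long lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / n

names = list(data)

# Covariance matrix: one pcov per (row, column) pair of variables.
cov_matrix = [[pcov(data[r], data[c]) for c in names] for r in names]

# Correlation matrix: each entry divided by the two standard deviations.
std = {v: math.sqrt(pcov(data[v], data[v])) for v in names}
corr_matrix = [[cov_matrix[i][j] / (std[r] * std[c])
                for j, c in enumerate(names)]
               for i, r in enumerate(names)]
```

The diagonal of `cov_matrix` holds the variances, both matrices are symmetric, and the diagonal of `corr_matrix` is 1; Z is the exact mirror of X here, so their correlation is −1.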

Summary

We saw some basic concepts: expected value, variance, standard deviation, covariance, correlation, and the covariance and correlation matrices. These form a foundation for studying more complex probabilistic and statistical concepts, and they are the building blocks of many significant algorithms such as PCA.

References

Brownlee, J. (2019). Basics of Linear Algebra for Machine Learning: Discover the Mathematical Language of Data in Python. (v1.7).

Pierce, Rod. (10 Oct 2019). “Standard Deviation and Variance”. Math Is Fun. Retrieved 10 Dec 2019 from http://www.mathsisfun.com/data/standard-deviation.html

Wikipedia contributors. (2019, December 10). Expected value. In Wikipedia, The Free Encyclopedia. Retrieved 12:08, December 11, 2019, from https://en.wikipedia.org/w/index.php?title=Expected_value&oldid=930177129

Wikipedia contributors. (2019, December 9). Variance. In Wikipedia, The Free Encyclopedia. Retrieved 12:09, December 11, 2019, from https://en.wikipedia.org/w/index.php?title=Variance&oldid=929926047

Wikipedia contributors. (2019, December 2). Standard deviation. In Wikipedia, The Free Encyclopedia. Retrieved 12:09, December 11, 2019, from https://en.wikipedia.org/w/index.php?title=Standard_deviation&oldid=928934524

Wikipedia contributors. (2019, December 7). Covariance. In Wikipedia, The Free Encyclopedia. Retrieved 12:10, December 11, 2019, from https://en.wikipedia.org/w/index.php?title=Covariance&oldid=929639677

Wikipedia contributors. (2019, October 21). Pearson correlation coefficient. In Wikipedia, The Free Encyclopedia. Retrieved 12:11, December 11, 2019, from https://en.wikipedia.org/w/index.php?title=Pearson_correlation_coefficient&oldid=922293481
