
A Checklist of Basic Statistics

Statistical inference, experimental design, tests of significance, regression.

I recently took a free introductory course on statistics offered by Stanford on Coursera. With a background in the physical sciences, my knowledge is heavily skewed towards things like the Boltzmann distribution and Bose–Einstein statistics. I therefore wanted to supplement my knowledge base with something more traditional and relevant.

Overall, I enjoyed the class quite a bit. I think it's worthwhile even if you already know some of the material, because I was particularly impressed by the examples used to highlight why certain concepts in statistics matter. Below, I give a summary of a few topics covered in the course.

Photo by Edge2Edge Media on Unsplash

Statistical Inference

Statistical inference seeks to gain information about a population from a sample of that population. It becomes relevant when the population we are interested in is so large that we don't have the resources to collect data from it exhaustively. For example, suppose we wanted to know what percentage of voters approve of the way the U.S. president is handling his job. That's a population of about 250 million people! A more manageable way to get this information is to estimate it from a random sample of a few thousand voters using statistics. Our estimate can be expressed as

Estimate = Parameter + Bias + Chance Error

Here, the parameter is the quantity of interest in the population (the approval percentage among all U.S. voters, in our example). The chance error arises because we are drawing at random. Luckily, it can be roughly calculated and is reduced by increasing our sample size.

Bias, on the other hand, is not reduced by increasing the sample size and is hard to identify and quantify. The course talks about 2 types of bias: selection and participation. In this example, selection bias occurs when we don't have a representative sample of the population. If you were to go out on the streets of Chicago and ask voters what they think of the President, you're excluding opinions from other cities and introducing selection bias. Participation bias has 2 sub-categories: voluntary and non-response bias. Suppose you put out an online survey. Voluntary bias happens when voters with strong opinions are more likely to respond, so you end up getting the extremes of the spectrum. And because it's an online survey, you may be excluding seniors who tend to shy away from technology, or people from low-income families who don't have access to it; hence, non-response bias.


Randomized Controlled Experiments Vs. Observational Studies

An observational study measures the outcomes of interest and can be used to establish an association. But to establish causation, you need randomized controlled experiments.

Consider the question: "Does eating organic food make you healthier?" (Is this a weird topic of contention to anyone else, or is it just me?) One thing you could do is go to Whole Foods and measure customers' weights and other health indicators. Then you go to, for lack of a less political example, McDonald's and take the same measurements. Then you crunch the numbers and find that people who shop/eat at Whole Foods are (hypothetically speaking) healthier than those who frequent McDonald's.

But this doesn't mean that it's the food that's causing these numbers. It could just mean that those who shop at Whole Foods pay close attention to their health, and hence exercise more and try harder to keep in shape (hypothetically speaking). So, to really try to establish causation, you could run a randomized controlled experiment where you have 2 groups of people, called the 'control' and 'treatment' groups. (Pretty sure I've heard other names for 'treatment,' though.) The randomness comes from assigning people at random to either group. Then, you give the treatment group only organic food and the control group only non-organic food (what's non-organic food, you ask? No comment). After an appropriate amount of time, you take the same measurements and draw a conclusion.

Here, it's probably hard to introduce a placebo, which resembles the treatment but is neutral. Placebos come up so often in pop culture that I don't think more explanation is needed. The experiment should also be double-blind: the person making the evaluations doesn't know which group is which, so their conclusions can't be influenced.


The Normal and Binomial Distributions

You can approximate the binomial distribution with the normal distribution! In fact, as the number of experiments you perform grows, the corresponding histogram looks more and more like the normal curve. We can standardize the binomial probabilities by

z = (x − μ) / σ,  where μ = n·p and σ = √(n·p·(1 − p))

where x is the number of successes, μ is the population mean, σ is the population standard deviation, n is the number of times we perform the experiment (flip the coin), and p is the probability of success. This formula is handy when computing the probability directly from the binomial distribution is tedious. For example, if we flip a coin 100 times and ask for the probability of getting at most 45 heads, we would have to compute the probability of getting 0 heads, 1 head, … 45 heads, and then add them all up. By approximating the binomial distribution with the normal distribution, we can instead get the standardized value for 45 like so:

z = (45 − 100 × 0.5) / √(100 × 0.5 × 0.5) = (45 − 50) / 5 = −1

Then, we can use software to compute the area under the normal curve to the left of −1, which gives the probability of getting at most 45 heads.
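To see how good the approximation is, here's a quick Python sketch (the course itself doesn't use code; scipy here is my own choice) comparing the exact binomial answer with the normal approximation for the 100-flip example:

```python
from scipy import stats

n, p = 100, 0.5                      # 100 fair-coin flips
mu = n * p                           # mean number of heads = 50
sigma = (n * p * (1 - p)) ** 0.5     # standard deviation = 5

# Exact answer: add up the binomial probabilities of 0, 1, ..., 45 heads
exact = stats.binom.cdf(45, n, p)

# Normal approximation: area under the standard normal curve left of z = -1
z = (45 - mu) / sigma
approx = stats.norm.cdf(z)

print(f"exact = {exact:.4f}, normal approximation = {approx:.4f}")
```

A continuity correction (standardizing 45.5 instead of 45) brings the two numbers even closer.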


The Expected Value, Standard Error, Confidence Interval & Sampling Distribution of a Statistic

Suppose we knew the average weight (μ) and standard deviation (σ) of the male population in the United States. We then measure 100 males chosen at random and find that their average differs from the population average. We do this a second time and find yet another average. If we keep doing this many times, the central limit theorem tells us that when we plot the histogram of our computed averages, it will look like a bell curve with mean μ and spread

SE(x̄) = σ / √n

where x̄ is the average of a sample of size n (100, in this case) and SE stands for standard error. This formula tells us that the error in our computed averages decreases as we increase the sample size. A rather counter-intuitive point is that the error doesn't depend on the population size! The course also goes through the standard errors of sums and percentages, and the central limit theorem applies to these quantities as well.

For the central limit theorem to apply, though, there are a few caveats. First, we need to sample with replacement so that the draws are independent of one another. Second, the statistic of interest has to be a sum (averages and percentages are sums in disguise). And last, the sample size needs to be large enough: the more skewed the population histogram is, the larger the sample you need. If there is no strong skewness, a sample size of around 15 is about enough.
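Here's a small simulation sketch (with a made-up skewed population, not anything from the course) showing the spread of sample averages shrinking like σ/√n:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical skewed population: exponential with mean 10 and SD 10
population_sd = 10.0

for n in (25, 100, 400):
    # Draw 10,000 samples of size n and record each sample's average
    averages = rng.exponential(scale=10.0, size=(10_000, n)).mean(axis=1)
    print(f"n = {n:3d}: SD of sample averages = {averages.std():.3f}, "
          f"sigma/sqrt(n) = {population_sd / n ** 0.5:.3f}")
```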

Oftentimes, an estimate (or a statistic) is reported with a confidence interval, like so:

x̄ ± z · SE

where z is the z-score corresponding to the desired confidence level. For instance, a 95% confidence interval corresponds to z = 1.96. In our example, the estimate is the average weight of an American adult male. We first measure 100 random males, then compute the corresponding average and SE. Plugging in z = 1.96, we say 'we are 95% confident that the average male weight is between x̄ − z·SE and x̄ + z·SE.' But what does this mean? Well, if we take another random sample of 100 males, we'll get a slightly different confidence interval. If we keep doing this, about 95% of the intervals we construct will contain the population average.

What happens if we don't know σ? By the bootstrap principle, we can estimate σ with our sample's standard deviation (s)!
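Putting the pieces together, here's a minimal sketch (using simulated weights, purely for illustration) of a 95% confidence interval that plugs in s for σ:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical sample: 100 simulated male weights in kg
sample = rng.normal(loc=88, scale=15, size=100)

x_bar = sample.mean()
s = sample.std(ddof=1)            # sample standard deviation estimates sigma
se = s / np.sqrt(len(sample))     # standard error of the average
z = 1.96                          # z-score for 95% confidence

print(f"95% CI: {x_bar - z * se:.1f} kg to {x_bar + z * se:.1f} kg")
```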

Hypothesis Testing

To test whether our hypothesis is true, we need to compare it against the scenario where it is not true (aka the null hypothesis). Suppose you bought a coin but suspect that it is not fair. In this case, the null hypothesis is that the coin is fair: we expect the probability of getting heads (or tails) to be 0.5. But then you flip the coin a number of times and observe that the proportion of heads is 0.7! Is this enough to demand your money back?

To answer this, we first set up a test statistic that measures how far the data are from what we would expect if the null hypothesis were true. A common choice is the z-statistic:

z = (observed − expected) / SE

Here, the expected value and SE are computed under the assumption that the null hypothesis is true. The larger |z| is, the stronger the evidence against the null hypothesis. However, if we flip the coin another 10 times, we will probably observe a different statistic. It turns out that if the null hypothesis is true, z follows the standard normal curve. You can then compute the area under the curve to the left of z if it's negative and/or to the right if it's positive to get your p-value. The convention is that if the p-value is smaller than 5%, you can reject the null hypothesis.
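For the coin example, a short sketch of the z-test might look like this (the 70-heads-in-100-flips count is hypothetical, chosen to match the 0.7 proportion above):

```python
from scipy import stats

n, heads = 100, 70                       # hypothetical: 70 heads in 100 flips
p0 = 0.5                                 # null hypothesis: the coin is fair

observed = heads / n
expected = p0
se = (p0 * (1 - p0) / n) ** 0.5          # SE of the proportion under the null

z = (observed - expected) / se
p_value = 2 * stats.norm.cdf(-abs(z))    # two-sided: area in both tails

print(f"z = {z:.2f}, p-value = {p_value:.5f}")
```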

When the sample size is small (smaller than ~20), you should instead use Student's t-distribution with n − 1 degrees of freedom, estimating σ with

s = √( Σ (xᵢ − x̄)² / (n − 1) )

Here, different values of n give slightly different t-distributions. So, when we report the average, we replace the usual confidence interval expression with

x̄ ± t · s / √n,  where t comes from the t-distribution with n − 1 degrees of freedom
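Here's a small sketch of that t-based interval (again with simulated data), using scipy to look up the critical t value:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
sample = rng.normal(loc=88, scale=15, size=12)   # hypothetical small sample, n = 12

n = len(sample)
x_bar = sample.mean()
s = sample.std(ddof=1)                    # divides by n - 1
se = s / np.sqrt(n)

t_crit = stats.t.ppf(0.975, df=n - 1)     # 95% confidence, n - 1 degrees of freedom
print(f"95% CI: {x_bar - t_crit * se:.1f} to {x_bar + t_crit * se:.1f}")
```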

There are 2 ways in which a test can result in a wrong decision. A Type I error occurs when we reject a null hypothesis that is actually true. A Type II error occurs when we fail to reject a null hypothesis that is false.

χ² Test for Categorical Data

The course covers 3 situations in which the χ² test statistic is the method of choice.

The goodness-of-fit test is used to determine whether a set of categorical data came from a claimed distribution. For example, have you ever played Settlers of Catan and wondered whether the dice were unfair? I mean, if 2 is so improbable, why does it keep showing up?! You can use the χ² test to figure this out. Say you roll a die 60 times. The null hypothesis is 'the die is fair', which means we should expect each number from 1 to 6 to come up 10 times. Then, we tally up the observed counts for each face.

We then plug the counts into the χ² test statistic,

χ² = Σ (observed − expected)² / expected,

summing over the six faces.

Then, we pick the χ² distribution that corresponds to our degrees of freedom (number of categories − 1), which is 5 in this case, and look up our computed χ² value (which is coincidentally 5 as well). The p-value is the area under the curve to the right of this χ² value. If it's below 5%, we can reject the null hypothesis and get new dice.
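With scipy, the whole goodness-of-fit calculation takes a couple of lines. The tallies below are made up, chosen so that χ² comes out to 5 like the example above:

```python
from scipy import stats

observed = [15, 10, 10, 10, 10, 5]   # hypothetical tallies from 60 rolls
expected = [10] * 6                  # fair die: each face expected 10 times

chi2, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"chi-squared = {chi2:.1f}, p-value = {p_value:.3f}")   # chi2 = 5.0
```

With a p-value well above 5%, these particular tallies wouldn't justify replacing the dice.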

The test of homogeneity is useful when we're trying to determine whether 2 or more sub-groups of a population share the same distribution of a single categorical variable. For example, we can look at data on passengers of the Titanic (available on Kaggle) and ask whether the proportion of survivors to non-survivors depends on class. We start with the null hypothesis that the probability of survival is the same for all sub-groups and compute the expected number of survivors in each class by pooling the data together. Say there were 2229 passengers and only 713 survived. That means the probability of survival for everyone on board was about 32%, so if there were 325 passengers in first class we would expect 104 survivors. Then, we count the actual survivors in each class, compare them with the expected counts, and compute χ² using the same formula as above. The degrees of freedom in this case is

(number of columns − 1) × (number of rows − 1) = 3

The test of independence is used to determine whether 2 categorical variables are associated with one another in a given population. Suppose the 2 variables we want to look at are gender and smoking. We sample 500 people at random, record both variables for each person, and tabulate the observed counts alongside the counts we would expect if the null hypothesis were true. If the null hypothesis is correct, then

P(male and smoker) = P(male) × P(smoker)

Multiplying this probability by the total of 500 gives the expected count, which in this example works out to 150 male smokers. Using the same formula as above, we can then obtain χ² and follow an identical procedure to compute the p-value.

There might be some confusion between the test of independence and the test of homogeneity. The difference lies in how the experiment is designed. In the test of independence, units are collected at random from a single population, and 2 categorical variables are recorded for each unit. In the test of homogeneity, a random sample is drawn from each sub-population separately.
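Computationally, both tests boil down to the same contingency-table calculation. Here's a sketch with an invented gender/smoking table (the counts are made up; the row and column totals are chosen so that the expected count of male smokers is the 150 mentioned above):

```python
import numpy as np
from scipy import stats

# Hypothetical counts: rows = (male, female), columns = (smoker, non-smoker)
table = np.array([[170, 130],
                  [ 80, 120]])

# correction=False matches the hand formula above (no continuity correction)
chi2, p_value, dof, expected = stats.chi2_contingency(table, correction=False)

print(f"chi-squared = {chi2:.2f}, dof = {dof}, p-value = {p_value:.4f}")
print("expected counts under the null:\n", expected)
```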


Linear Regression

In linear regression, the correlation coefficient r ranges from −1 to 1. The sign tells us whether the regression line slopes upward (+) or downward (−). The magnitude tells us the strength of the linear relationship between x and y: if the magnitude is close to zero, we shouldn't expect to see any resemblance to a line when we plot x vs. y. However, r = 0 does not mean there is no relationship at all between x and y. It just means there is no linear relationship.

Each point on our line is represented by

ŷᵢ = a + b·xᵢ

The idea behind least squares is that the best line is the one that minimizes the sum of squared residuals,

Σ (yᵢ − ŷᵢ)² = Σ (yᵢ − a − b·xᵢ)²

Linear regression is not always the appropriate method. A good way to check is to visualize a scatterplot of the data and see whether it looks at least roughly football-shaped. Another good (and complementary) check is to plot the residuals (the differences between the predicted and true values) and look for any detectable pattern. Linear regression is only appropriate when the scatterplot of the residuals is patternless. Sometimes we can transform the data so that the resulting plots have the right shape (square root and log transforms are popular).
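As a quick sketch (on synthetic data, just to show the mechanics), fitting the least-squares line and checking the residuals takes only a few lines:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic data: a linear trend plus noise
x = rng.uniform(0, 10, size=200)
y = 3.0 + 0.8 * x + rng.normal(scale=1.5, size=200)

fit = stats.linregress(x, y)              # least-squares slope, intercept, and r
residuals = y - (fit.intercept + fit.slope * x)

print(f"slope = {fit.slope:.2f}, intercept = {fit.intercept:.2f}, r = {fit.rvalue:.2f}")
# A scatterplot of x vs. residuals should look patternless if the linear fit is appropriate.
```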


Wrap-Up

Would I recommend this class? If you're new to statistics or looking to brush up, absolutely. The instructor doesn't waste time and explains the concepts thoroughly, with very instructive examples. Almost every video has mini-quizzes in the middle and/or at the end, and there are non-trivial end-of-week quizzes as well, so you have ample opportunities to test and solidify your understanding. Keep in mind that the class covers many more topics than the ones I touched on here.


More articles by me

6 Common Metrics For Your Next Regression Project

Greece Used Reinforcement Learning to Curb Influx of COVID-19

What Has Machine Learning Been Up To In Healthcare?
