An intuitive guide to basic statistics

A primer on basic statistical concepts

Sahil Gupta
Towards Data Science


One of the most fundamental parts of uncovering the secrets of a dataset is statistics (and its vocabulary, i.e. probability). I had a love-hate relationship with statistics until recently. The subject always intimidated me and never felt intuitive. I recently came across a book called Statistics in Plain English by Timothy C. Urdan. This book made statistics intuitive to me, and I hope to do the same for you by summarizing its ideas here. If you have time, I would highly encourage you to take a look at the book. I will try my best here, but the book is definitely worth it.

Why do we care?

Let’s begin by understanding why we even need statistics. Statistics, in the simplest sense, allows us to manage information: to collect, study, and summarize information (i.e. data). A researcher collects some information from a large number of people, uses this to summarize their experience, and makes some general statements about a population. For instance, imagine you are modeling a dataset where the goal is to develop a predictive model. In a sense, you are doing something similar: you collect some information (called train data in ML), summarize the information (for instance, as model parameters) by making some reasonable assumptions, and make general statements (on test data). So, I think developing a deeper understanding of statistics can help us become better data modelers.

Population vs. Sample

In simple terms, the population is something we wish we had, but the sample is what we have. A population represents all the members of a certain group or category of interest, whereas a sample is a subset drawn from the population. In Venn diagram terms: the population is our universe of interest and the sample is an event within it.

Figure 1 Venn diagram with event A and universe U.

From Figure 1, ‘A’ is the sample and ‘U’ is the population. This is what we are doing in a typical data modeling exercise: trying to build a model using train data that can be generalized to the unknown part of the population, i.e. test data (= U − A). There can be various reasons why we want to work with a sample instead of the population: the population is too large to collect (for instance, in a language modeling exercise the total set of all possible sentences is humongous), the information collection process is expensive and time-consuming, etc. The key here is that because we are working with a small chunk of the population, we want it to be representative of the actual thing. This is why statisticians take pains to think about sampling.

Sampling

To ensure that a sample is representative of the population, we employ what is called random sampling. In the context of statistics, random sampling means that every member of the population has an equal chance of being selected into a sample. With this approach, we can be confident that any differences between our sample and the population are not systematic and are due to random chance. In other words, with random sampling we are not biased towards any specific members of the population. This is one of the most popular sampling approaches and is also used in k-fold cross-validation. There are other (more nuanced) sampling techniques, and you can read more about them here.
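Here is a minimal sketch of simple random sampling in Python (the population values and the seed are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# A hypothetical population of 10,000 heights (any collection of members works).
population = rng.normal(loc=170, scale=10, size=10_000)

# Simple random sampling: every member has an equal chance of being selected,
# and sampling without replacement means no member is picked twice.
sample = rng.choice(population, size=100, replace=False)

print(population.mean(), sample.mean())  # the two means should be reasonably close
```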

Distributions

The collected sample can contain different types of features (i.e. random variables), such as continuous or categorical. A distribution is simply a collection of data, or scores, on a variable (i.e. feature). Any collection of scores on a variable, regardless of the type of variable, forms a distribution, and this distribution can be graphed. We are often interested in the characteristics of these distributions, such as the typical values, the variety of values, the shape of the distribution, and so on. Studying the distribution of a random variable gives us insight into its behavior.

The typical values in a distribution are often measured using statistics such as the mean (other measures include the median and mode) and the variety using the variance (or range, IQR). The popular choices among these are the mean and variance, but others can be more useful depending on the data (for instance, with outliers the median can be a better choice than the mean). The variance provides a statistical average of the amount of dispersion in a distribution of scores. One issue with the variance is that its units are not the same as those of the original variable. To fix this, we commonly look at the standard deviation (which is the square root of the variance).

To gain a deeper understanding of standard deviation, let’s look at the two words: deviation refers to the difference between an individual value and the average score; standard refers to typical, or average. So a standard deviation is the typical, or average, deviation between individual values and the mean of the distribution. It is used to examine the average dispersion of scores in a distribution. A measure of the average value combined with a measure of the dispersion of values gives us a rough picture of the distribution of scores. A note on the correction in the formula for the sample standard deviation and variance can be found here. Another useful way to examine a distribution is a boxplot.
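As a quick sketch (the scores below are made up), here is how these summaries look in Python, with the (n − 1) sample correction applied explicitly:

```python
import numpy as np

scores = np.array([55, 61, 68, 70, 72, 75, 75, 79, 84, 91])  # made-up test scores

mean = scores.mean()
variance = scores.var(ddof=1)   # ddof=1 applies the sample (n - 1) correction
std_dev = scores.std(ddof=1)    # same units as the original scores

print(f"mean={mean:.1f}, variance={variance:.1f}, std dev={std_dev:.1f}")
```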

The Normal Distribution

I was planning on motivating this section by talking about the importance of the Normal distribution, but I think this Q/A thread does a good job. So, I am just going to summarize some facts about the Normal distribution (also called the bell curve) here.

A standard Normal distribution. Photo Credit: Wikimedia

It has three fundamental characteristics: a) it is symmetrical; b) the mean, median, and mode are all in the same place, i.e. the center of the distribution; c) it is asymptotic, i.e. the upper and lower tails never touch the x-axis. The reason the Normal distribution is used in practice is that we often care about the exact probability of something occurring in a sample just due to chance. If we were only interested in describing a sample, it wouldn’t matter whether the values are normally distributed or not. But suppose the average person in a sample consumes 2000 calories a day: what are the chances (or probability) of having a person who consumes 5000 calories a day in the sample? Additionally, we are often interested in making inferences about the population from which the sample is drawn, and these can be made by working with normal distributions.

So, given a distribution, how would we find out whether it follows a normal distribution or not? We can look at the skew and kurtosis, two characteristics used to describe a distribution. We can compare the skew and kurtosis of our distribution with those of a normal distribution to check whether a given random variable is approximately normal. As we will see in the later sections, the theoretical normal distribution is a critical element of statistics because many of the probabilities used in inferential statistics are based on the assumption of normality.
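For a rough sketch of what that check looks like in code (the data here is deliberately skewed, generated just for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=1_000)   # a deliberately right-skewed variable

# For a normal distribution, skewness is ~0 and excess (Fisher) kurtosis is ~0;
# large departures from 0 suggest the variable is not normally distributed.
print("skew:", stats.skew(data))
print("kurtosis:", stats.kurtosis(data))
```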

Z-Score

Sometimes we are interested in describing individual values in a distribution. Using the mean and standard deviation, we can generate a standard score, also called a z-score, to describe the relative standing of individual values. This standardization also helps us compare values from the distributions of two separate variables (because both are now on the same scale).

For instance, suppose we want to compare how Jim did on his English test vs. his Statistics test. Suppose the English test was on a 0–100 scale, whereas the Statistics test was on a 0–200 scale. Furthermore, we looked at the papers and realized that the Statistics test was tougher than the English test. A direct comparison of Jim’s scores on the two tests is therefore not fair. A more reasonable approach is to standardize the scores before comparing them. Standardizing re-expresses the test scores in standard deviation units. Note that even if one test was harder than the other, this difference is accounted for in the mean and standard deviation. In other words, a z-score indicates how far above or below the mean a given score in the distribution is, in standard deviation units. Furthermore, when an entire distribution is standardized, the average z-score of the standardized distribution is always 0 and the standard deviation is always 1.
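As a small sketch with made-up class statistics (none of these numbers come from the article), the calculation is just (score − mean) / standard deviation:

```python
# Hypothetical class means and standard deviations for the two tests.
english_mean, english_sd = 75, 10     # English test, scored out of 100
stats_mean, stats_sd = 110, 30        # Statistics test, scored out of 200

jim_english, jim_stats = 85, 155      # Jim's raw scores (made up)

z_english = (jim_english - english_mean) / english_sd   # z = (x - mean) / sd
z_stats = (jim_stats - stats_mean) / stats_sd

# Both scores are now in standard-deviation units and can be compared directly.
print(f"English z = {z_english:.2f}, Statistics z = {z_stats:.2f}")
```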

Let’s pause here for a moment to understand what a z-score of z = 1 would tell us. A z-score of 1 on the English test would tell us that: a) Jim did better than the average student taking the test; b) his score was 1 standard deviation above the mean; c) if the scores were distributed normally, he did better than roughly 84% of the class (from the 68–95–99.7 rule: 50% of scores fall below the mean and another 34% fall between the mean and one standard deviation above it). But there is still a lot of information that the z-score doesn’t aim to describe: for instance, how many words Jim spelled correctly, whether he is a good speller in absolute terms, or how difficult the test was relative to the other students taking it. As in the example above, if we are sure that the distribution of our random variable is normal, we can compute percentiles (using z-score tables).

It is important to note that if we were simply interested in calculating percentiles, we could have computed those without calculating z-scores: rank-order the observations and use the definition of a percentile.

Standard Error

This is one of the most important concepts in inferential statistics and is used extensively. There are two ways to think about the standard error. Formally, a standard error is defined as the standard deviation of the sampling distribution of some statistic (if this made your head spin, hold on, we will dissect it in a few lines). Another way to think about a standard error is that it is the denominator in the formulas used to calculate many inferential statistics.

Let’s take a step back and try to understand the definition more deeply. Imagine we are interested in measuring the average height in a community. Based on what we have described so far, we draw a sample from all the people in the community. For the sake of simplicity, assume there are 4 people in the community with heights 1 cm, 2 cm, 3 cm, and 4 cm, and that our sample is of size 2. These are all the possible pairs: (1,2), (1,3), (1,4), (2,3), (2,4), (3,4), with average heights of 1.5 cm, 2 cm, 2.5 cm, 2.5 cm, 3 cm, and 3.5 cm, respectively, while the population average height is 2.5 cm. As we can observe, the average height we calculate depends on the sample we draw and changes as we change our sample. In other words, the statistic we are interested in (the average height) has some variation (i.e. a standard deviation) as a result of random sampling, and we call this the standard error.
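A minimal sketch of this toy example in code: it simply enumerates every possible sample and takes the standard deviation of the resulting means.

```python
from itertools import combinations
import numpy as np

heights = [1, 2, 3, 4]  # the toy community from the example, in cm

# All possible samples of size 2 and their means: the sampling distribution of the mean.
sample_means = [sum(pair) / len(pair) for pair in combinations(heights, 2)]
print(sample_means)            # [1.5, 2.0, 2.5, 2.5, 3.0, 3.5]

# The standard error is the standard deviation of this sampling distribution.
print(np.std(sample_means))    # ~0.65 cm for this toy population
```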

If we dissect the definition in this context, it says that there is a sampling distribution (we get this as a result of random sampling; in the example, it is the collection of average heights), this distribution is associated with the average height (which could be any other statistic we are interested in, e.g. weight, IQ, etc.), and the standard deviation of such a distribution is called a standard error. In essence, the standard error is a measure of how much random variation we would expect from samples of equal size drawn from the same population. Although there are standard errors for all statistics, the standard error of the mean is the most frequently used.

To avoid confusion and distinguish a sampling distribution from a simple frequency distribution, the mean and standard deviation of the sampling distribution are given special names: the expected value of the mean and the standard error, respectively. The mean is called the expected value because the mean of the sampling distribution of the mean (i.e. the mean of the distribution generated by repeatedly collecting samples and computing their means) is the same as the population mean: when we select a sample from the population, our best guess is that the mean of the sample will be the same as the mean of the population. This gives us an interpretation of the standard error: it is a measure of how much error (remember, a standard deviation tells us the average difference between an individual value and the mean) we can expect when we say that a sample mean represents the mean of the population (hence the name standard error).

Most of the time, we don’t have the time and resources to draw multiple samples from the population and find the mean and standard deviation of the distribution of sample means (the sampling distribution). So far we have convinced ourselves that the standard error exists, that it is important for the analysis, and that we usually don’t have access to the sampling distribution. This brings us to the question: can we use information from our sample to estimate the standard error?

To examine this question, let’s think about two characteristics of the sample. First, how large is our sample? The larger our sample, the less error there should be in our estimate about the population, because a larger sample is more like the population and so the estimates will be more accurate. Second, we need to examine the standard deviation of the sample. We make the assumption here that the population standard deviation is equal to the sample standard deviation. This assumption about the population may not be true, but we must rely on it because it is all the information we have (if we somehow knew the population standard deviation, we could use that instead). Hence, the standard error of the mean is estimated as the sample standard deviation divided by the square root of the sample size, SE = s / √n (or σ / √n when the population standard deviation σ is known).

If you ended up going through the Q/A thread I linked earlier and know about the Central Limit Theorem (CLT), then we can also say here that the CLT states that when we have a reasonably large sample (e.g. n = 30), the sampling distribution of the mean will be approximately normally distributed.
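A quick simulation sketch of that claim (the exponential population and the sample sizes are arbitrary choices for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Draw many samples of size n=30 from a clearly non-normal (exponential) population
# and look at the distribution of their means.
sample_means = [rng.exponential(scale=2.0, size=30).mean() for _ in range(5_000)]

# The skew of the sample means is much closer to 0 (normal-like)
# than the skew of the exponential population itself (~2).
print("skew of sample means:", stats.skew(sample_means))
```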

P-Value

I think this term is fairly popular, but let me attempt to make it slightly more intuitive. Before moving on, I want to take a small pause and recap what we have talked about so far. We saw that there is a difference between the data we usually work with and the actual population, as it is inconvenient to work with populations. To generate a sample, we think about the appropriate sampling technique. Once we obtain a sample, we can study the distributions of its random variables to gain an understanding of it. We also looked at some characteristics of one of the most fundamental distributions in all of statistics and at ways to check whether a random variable follows the Normal distribution. We also looked at a standardization technique that helps us compare values across random variables on different scales/distributions. Finally, we understood the significance of the standard error in the context of our problem of generalizing the results of our sample to the population. Now, let’s use all these concepts together to further explore the idea of generalizing insights from our sample.

Suppose that we have a sample (from city A) with an average IQ test score of 110, and the national average is 100. We can see that there is a difference of 10 points between the sample and population scores, but is this difference meaningful, or a trivial one? Maybe if we sampled the data again, the difference would change. If this brings the standard error to your mind, you are on the right track! So, how do we know whether this difference is not simply due to random chance (from the random sampling technique)? More concretely, we are looking for the probability of getting a random sample whose difference from the national average is as large as 10 points. Take a pause and think about what this probability can do for us.

If we somehow had this probability and it was small, we would have evidence that the difference is unlikely to be due to random chance alone and is driven by some characteristic of the sample. This is the crux of the p-value. A popular cut-off value for this probability is 0.05: if the probability of seeing a difference of 10 points just from random sampling is ≤ 0.05, we treat the difference as meaningful, in the sense that we don’t attribute it to random sampling alone.

So, let’s go back to our original problem of calculating this probability. You might remember from earlier that the normal distribution enables us to compute probabilities (the 68–95–99.7 rule and z-score tables). To determine the probability of a difference of 10 points, we need a normal distribution of sample means, i.e. a normal sampling distribution of the mean. We can also use the closely related family of t-distributions to figure out this probability. As you can see here, it looks very similar to the Normal distribution, and as the degrees of freedom increase, it approaches a normal distribution. So, one way to think about the choice of a t-distribution is that when we don’t know the population standard deviation and are looking at limited data (in the form of a sample), it is probably better to work with decreased beliefs in certain events: the PDF of the t-distribution is squashed, which means the probabilities at the tail ends are increased relative to a normal distribution. The formula for calculating a t-value is identical to that of the z-score, except that it uses the standard error computed from the sample (because we don’t know the population standard deviation; if we knew it, we could have computed a z-score), and it is called a t-value because we use a t-distribution to compute probabilities.

In essence, if we know the population standard deviation, we can use it to compute a z-score and use the normal distribution to compute the probability of a 10-point difference between the sample and the population; if not, we can use the standard error to compute a t-value and use the t-distribution to compute that probability.
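A minimal sketch of the second case (the sample of IQ scores below is simulated, not real data; scipy’s one-sample t-test performs the t-value and probability calculation in one step):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# A hypothetical sample of IQ scores from city A (centered near 110); the national mean is 100.
sample = rng.normal(loc=110, scale=15, size=25)

# One-sample t-test: how likely is a difference this large under random sampling alone?
t_value, p_value = stats.ttest_1samp(sample, popmean=100)
print(f"t = {t_value:.2f}, p = {p_value:.4f}")

# Equivalently, by hand: t = (sample mean - population mean) / standard error.
se = sample.std(ddof=1) / np.sqrt(len(sample))
print((sample.mean() - 100) / se)   # matches the t-value above
```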

Statistical Significance

In this section, we bring together all the ideas we have learned so far to explore a popular application of statistics called inference. The idea behind inference is to study the sample and infer conclusions about the much larger population. The common theme of the questions we ask with many inferential statistics is whether some statistic we observe in our sample is big or small compared to the amount of variation due to random sampling (i.e. in terms of the standard error). Remember, this is an important question because, as a consequence of random sampling, we expect some variation (compared to the population) in the sample; we quantify this variation using the sample standard deviation and the sample size, and call it the standard error.

Continuing the IQ test example from the previous section, the question we are trying to answer with statistical significance is whether the difference of 10 points is a result of random sampling or not. Three common tools used to reach conclusions about the statistical significance of a statistic are tests, effect sizes, and confidence intervals. Briefly, a test uses the standard error to calculate a p-value, which is then used to judge statistical significance. Effect size is motivated by the observation that larger samples have smaller standard errors (the standard error is inversely proportional to the square root of the sample size), which leads to larger z-scores (or t-values), so it makes sense to also report a measure that discounts the effect of sample size. A confidence interval quantifies a range that would contain the actual population statistic in a specified fraction (e.g. 95%) of repeated samples.
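As a sketch of the third tool, here is a 95% confidence interval for the mean of the same simulated IQ sample used above (the numbers are illustrative, not from the article):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
sample = rng.normal(loc=110, scale=15, size=25)   # same hypothetical IQ sample as before

# 95% confidence interval for the population mean, based on the t-distribution.
se = sample.std(ddof=1) / np.sqrt(len(sample))
ci = stats.t.interval(0.95, len(sample) - 1, loc=sample.mean(), scale=se)
print(ci)   # if 100 falls outside this interval, the 10-point difference is significant at 0.05
```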

In plain English, statistical significance helps us in deciding whether our conclusions from studying the sample also apply to the (broader) population.

Hypothesis Testing

Another popular word in Statistics. Here, all we are trying to do is come up with a hypothesis, and confirm or reject it. It follows from the previous section in the sense that even before we perform a statistical significance test, we would want to establish a benchmark. This benchmark is our hypothesis.

Usually, the primary hypothesis is the null hypothesis (H_0). As the name suggests, this assumes a null effect, i.e. that the effect doesn’t exist (this can be whatever effect we are interested in measuring in our population). Its complement is the alternative hypothesis (H_a): as the name suggests, we are considering the alternative to the null hypothesis, i.e. that the effect is present. So now the question changes slightly: how different does the sample mean (this is just an example; we can pick any statistic we like) have to be from the population mean before we consider the difference to be meaningful, or statistically significant? Note that in this question, the claim that the true mean equals the population mean is our H_0, and the claim that it differs is our H_a.

For instance, in the earlier IQ test example, H_0 is that the city’s true average equals the national average of 100, H_a is that it is different (our sample average was 110), and we are testing whether the 10-point difference reflects something special about this sample or merely random sampling, i.e. whether it is statistically significant.

As you might have noticed, there is a caveat in all this discussion. We mentioned earlier that if the calculated probability (p-value) is small enough (we chose 0.05), then we say the difference is statistically significant, or meaningful. But when we accept a cut-off value of 0.05, there is a possibility that H_0 is in fact true and our random sample just happened to produce such a difference, i.e. we ended up selecting a random sample that was extremely rare (for instance, with a normal distribution, the probability of drawing a sample more than three standard errors away from the mean is only 1 − 0.997 = 0.003, yet it can happen!). In that case, we would end up committing an error. This error is called a Type I error, and the cut-off value is called alpha.

I hope some of the basic statistical concepts are much more intuitive now. I skipped some important concepts that build on the above ideas and are quite interesting: correlation, t-tests, ANOVA, regression. I will leave you with one interesting test related to the things we talked about: a hypothesis test for normality.
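As a parting sketch, here is one such test, the Shapiro-Wilk test from scipy (the data is simulated just to show the call):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
data = rng.normal(loc=0, scale=1, size=200)   # simulated, genuinely normal data

# Shapiro-Wilk test: H_0 is that the data come from a normal distribution.
# A small p-value (< 0.05) would lead us to reject normality.
stat, p = stats.shapiro(data)
print(f"W = {stat:.3f}, p = {p:.3f}")
```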
