The world’s leading publication for data science, AI, and ML professionals.

Key Statistical Concepts in Data Science

A compilation of some must-know statistical concepts

Photo by Dan Farrell on Unsplash

If you work with data, you have likely come across terms like "tests", "scores", and "values" preceded by letters like ‘F’, ‘P’, ‘R’, ‘T’, or ‘Z’. This article offers a layman’s explanation of a few such statistical terms and concepts often encountered in the world of Data Science.

Disclaimer: What this article is not about

This article is not a comprehensive explanation but a concise, high-level compilation of some key statistical measures. Each of the concepts below has several dedicated articles. I will provide links to a few such articles for your reference.

The following topics will be covered:

1) H_0 and H_a – Hypothesis test

2) P-value

3) Z-score

4) t-test

5) F-test

6) R-squared


1) H_0 and H_a – Hypothesis Test

Put simply, a hypothesis is an idea (or premise or claim) that you want to test (or verify). For instance, while sipping a coffee, you start wondering, "how does the average (mean) height of people in the state of New York compare with the average (mean) height of people in the state of California?" This is an idea that you have.

Photo by Franki Chamaki on Unsplash

Now, to test your idea/claim, you need some data. Moreover, you need to frame a problem statement – a hypothesis – and, to test this hypothesis, you need a hypothesis test. More specifically, you need to frame two scenarios: one representing the status quo, and an alternate scenario supporting your claim.

A Null hypothesis is basically the status quo (or the default establishment or the generally accepted value as this video nicely explains). For example, the status quo could be that the average (mean) height of people in both states is equal. This would form your Null hypothesis, denoted as H_0. In a court of law, you have prosecution and defense. Likewise, for every Null hypothesis, you have an alternative (or alternate) hypothesis (in order to test your idea/claim), denoted as H_a (aka research hypothesis as you need to do some research/sampling to challenge H_0).

Having formulated your null and alternative hypotheses, you now need some data. Of course, it is impractical to measure the height of the entire population of New York and California. So, you would collect sample data for a subset of people (say, 50 or 100 or more) to test your claim. Having the sample data, you would perform some statistical tests, which are explained (using other but equivalent examples) in the upcoming sections.
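To make the framing concrete, here is a minimal sketch that simulates drawing height samples from two states. The population parameters (mean 170 cm, standard deviation 8 cm) are invented purely for illustration; the point is that sample means will rarely be exactly equal, which is why a formal hypothesis test is needed.

```python
import random
import statistics

random.seed(0)

# Hypothetical populations: both states drawn from the same distribution,
# so H0 ("the mean heights are equal") is true by construction.
ny_heights = [random.gauss(mu=170, sigma=8) for _ in range(100)]  # New York sample
ca_heights = [random.gauss(mu=170, sigma=8) for _ in range(100)]  # California sample

# H0: the mean heights are equal; Ha: they differ.
# The two sample means will differ slightly by chance alone -- a formal
# test (see the t-test section below) decides whether such a difference
# is statistically significant.
print(statistics.mean(ny_heights))
print(statistics.mean(ca_heights))
```

Even with H_0 true, the printed sample means differ a little – sampling error in action.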

Further Material:


2) P-value

This term is mostly used to measure the statistical significance of the results during hypothesis testing. Here, P stands for "probability". So, it’s a probability value between 0 and 1. You use it to reject or support the Null hypothesis.

Photo by Pietro De Grandi on Unsplash

What goes hand-in-hand with the P-value is the significance level (denoted by the Greek letter alpha), typically 0.05 (or 5%). It is basically the confidence level subtracted from 100%. A higher alpha means a lower confidence level, and vice versa.

If the P-value is less than this significance level, the Null hypothesis is rejected. If the P-value is greater than the significance level, the support in favor of the alternative hypothesis is not strong enough and, therefore, the Null hypothesis is not rejected, i.e., the alternative hypothesis is not accepted.

I personally find the P-value more intuitive when the meaning of "significance" is related quantitatively to "chance". For example, a P-value of 0.01 (indeed a small value) indicates that, if the Null hypothesis were true, there would be a mere 1% chance (probability) of obtaining results at least as extreme as yours purely "by chance", e.g., due to sampling error. In other words, you can be glad that the results you obtained are indeed significant. Simply put, the smaller the P-value, the more significant your results, and the less likely they are to have arisen randomly or by chance. Following this logic, a high P-value, say 0.7, means there is a 70% probability of seeing results like yours by chance alone, which has nothing to do with the way you performed your experiment. Your findings are then far less significant.

Is my significance level (alpha) too strict or too lenient?

Answer: It depends on the problem at hand. Imagine that by rejecting the Null hypothesis you make a wrong decision, i.e., although you now choose the alternative hypothesis, rejecting the Null hypothesis was not a good call. You rejected it simply because your P-value fell below the alpha value. You just followed the rule (based on the alpha value you set before the study). But not every rule serves every situation equally well.

Now, if you are conducting a hypothesis test on the average speed at which players kick penalties, you might be happy with a 95% confidence level, i.e., leaving a relatively broad 5% window (think of it as being 5% lenient) for wrongly rejecting the Null hypothesis. That’s fine – it’s just a kick’s speed. However, if you are studying something critical, say the impact of a non-invasive treatment on a tumor in a patient, you would definitely want a high confidence level, maybe 99.9%, leaving a significance level of 0.1%, i.e., 0.001 (think of it as being only 0.1% lenient; very, very strict about the tumor analysis). Now you have a much smaller window of just 0.1% for making a wrong decision (assuming that "wrong" here means rejecting the Null hypothesis and accepting the alternative hypothesis).

Further articles on P-value:


3) Z-score

Photo by Марьян Блан | @marjanblan on Unsplash

A Z-score (aka standard score) tells you how many standard deviations away (either below/left or above/right) an observation is from the mean of a normal distribution. It can take both positive and negative values. A negative Z-score means the data point is Z standard deviations left of the mean, and a positive Z-score means the data point is Z standard deviations right of the mean of the normal distribution, to which the data point belongs.

Where to use Z-score?

One important application is to compute the (probability) area under the normal curve associated with a given data point. Didn’t understand anything? Keep reading!

Photo by Dylan Gillis on Unsplash

Let’s take an example adapted from this wonderful video. Suppose the age of participants in your Yoga class follows a normal distribution (left curve in the figure below) with a mean of 50 (years) and a standard deviation of 10 (years). I know, as you get closer to retirement, you need Yoga!

Now, you are asked what proportion of participants are below 35 years of age (the red-shaded area under the curve). To answer this, you first have to standardize the normal distribution, i.e., convert it to a normal distribution with mean = 0 and standard deviation = 1, using the formula z = (x - mean) / standard deviation. The values on the x-axis of the standardized distribution are expressed in standard deviations away from the mean (0) and represent the Z-score. The value of 35 years is now (35 - 50) / 10 = -1.5 standard deviations away from the mean, i.e., the Z-score is -1.5. By definition, the Z-score is unitless.

Schematic of normal and standard normal distributions (not to scale).

So, the problem of finding the proportion of participants below 35 years of age transforms into finding the proportion of Z < -1.5. Now, all you have to do is look up the Z-score table for the value corresponding to Z = -1.5. This value is 0.0668. It is the required proportion (the area of the red-shaded region, where the total area under the black curve is 1): 6.68% of participants.
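The same lookup can be done in code: Python’s standard-library `statistics.NormalDist` gives the cumulative area under the normal curve directly, reproducing the Z-table value from the Yoga example.

```python
from statistics import NormalDist

mean_age, sd_age = 50, 10      # Yoga-class distribution from the example
x = 35

# Standardize: z = (x - mean) / standard deviation
z = (x - mean_age) / sd_age
print(z)                       # -1.5

# Proportion of participants below 35 years = area under the standard
# normal curve to the left of z
proportion = NormalDist().cdf(z)
print(round(proportion, 4))    # 0.0668, matching the Z-table lookup
```

`NormalDist(50, 10).cdf(35)` gives the same answer without explicit standardization – the Z-score just makes the table lookup possible.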

Some further applications of Z-score

Likewise, you can use Z-score to compare two or more situations where the data have different scales. For instance, finding the top 10% of students who wrote two different exams (different point systems), like GMAT and SAT.

Another application is to detect outliers in a population/dataset. How? Standardize the population distribution (if normal) and mark the data points that lie below a Z-score of -3 or above a Z-score of +3, because these points lie more than 3 standard deviations from the mean – only about 0.003 in probability, since 99.7% of the area lies within plus-minus 3 standard deviations.
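A minimal outlier-detection sketch along these lines, using invented measurements with one deliberately injected outlier (the value 95):

```python
import statistics

# Hypothetical measurements with one injected outlier (the value 95).
data = [52, 48, 50, 47, 53, 51, 49, 50, 95, 46, 52, 48]

mu = statistics.mean(data)
sigma = statistics.stdev(data)  # sample standard deviation

# Flag points whose absolute Z-score exceeds 3
outliers = [x for x in data if abs((x - mu) / sigma) > 3]
print(outliers)                 # [95]
```

One caveat: a large outlier inflates the standard deviation used to compute the Z-scores, which can partially mask the outlier itself; robust variants (e.g., using the median) exist for that reason.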

Further Material:


4) t-test

It is a hypothesis test (aka Student’s t-test) that allows you to determine whether there is a statistically significant difference between two studied groups/samples, whose data points are assumed to be normally distributed.

Photo by Jonathan Farber on Unsplash

An example:

Suppose you want to compare the height of people from two different continents. You take two independent/unpaired samples, one from each continent’s population, and suppose that they are normally distributed. You could simply compare the means of the two samples and say that the continent with the larger mean has taller people. But what about the variance/spread within each sample’s distribution? It may be that, despite different sample means, the two samples are not statistically significantly different.

This is where the t-value comes to the rescue. As this video nicely explains, it is basically the ratio of the signal (the difference between the means) to the noise (the variation within the two samples), as sketched in the figure below. The higher the difference between the two means, the higher the t-value. The higher the variation, the lower the t-value.

Sketch explaining the calculation of the t-value (adapted from this video)

Now you can use this t-value in a t-test, where your Null hypothesis might state that there is no statistically significant difference between the two samples. If the t-value is above the critical t-value (which plays a role similar to the significance level, alpha, in the context of a P-value), the Null hypothesis is rejected (i.e., the two samples are statistically different) in favor of the alternative hypothesis. If the t-value is below the critical t-value, the Null hypothesis is not rejected.

How to choose the critical t-value?

The critical value can be found using a t-table. To use the t-table, you need a pre-defined significance level (alpha) and the degrees of freedom df, which for a two-sample test is simply n - 2, where n is the total number of data points in both samples (total sample size). Using these two values, you can look up the critical t-value.
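The whole procedure can be sketched in a few lines, computing the pooled two-sample t-value by hand. The height samples below are invented for illustration, and the critical value 2.145 (two-tailed, alpha = 0.05, df = 14) is the standard t-table entry:

```python
import math
import statistics

# Hypothetical height samples (in cm) from two continents, 8 people each.
sample1 = [178, 172, 175, 180, 169, 174, 177, 171]
sample2 = [165, 170, 168, 172, 166, 169, 171, 167]

n1, n2 = len(sample1), len(sample2)
m1, m2 = statistics.mean(sample1), statistics.mean(sample2)
v1, v2 = statistics.variance(sample1), statistics.variance(sample2)

# Pooled variance, assuming equal population variances
sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)

# t = signal (difference in means) / noise (variation within samples)
t = (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))

df = n1 + n2 - 2               # degrees of freedom = n - 2 = 14
t_critical = 2.145             # two-tailed, alpha = 0.05, df = 14 (t-table)

print(round(t, 2))             # 3.79
print(abs(t) > t_critical)     # True -> reject the Null hypothesis
```

Since |t| exceeds the critical value, the two samples differ significantly at the 5% level.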

Further Material:


5) F-test

Photo by Ralph Hutter on Unsplash

Why F? The test is named in honor of R. A. Fisher, who developed the concept. The F-test is widely used to compare statistical models fitted to a dataset, in order to determine which one better explains or captures the variance of the dependent/target variable.

One key application of the F-test is in regression problems. Specifically, given a regression model with p_1 parameters (termed the restricted model), it allows you to determine whether a more complex regression model (with more regressors, termed the unrestricted model) with p_2 parameters (p_2 > p_1) is a better choice for modeling your data.

CAUTION: Do not confuse p_1 and p_2 with the P-value introduced earlier.

The simplest (most naive/baseline) restricted model is an intercept-only model (e.g., the mean of your target data) with p_1 = 1. In an intercept-only model, all regression coefficients are zero. The next competing unrestricted model could be one having a single independent feature in addition to the intercept, i.e., p_2 = 2. Likewise, you can generalize p_1 and p_2 to any values.

Using the F-test

You can use the F-test in the context of hypothesis testing by formulating a Null hypothesis H_0 that states, "The unrestricted model is not significantly better than the restricted one." The corresponding alternative hypothesis, H_a, states that the unrestricted model is significantly better than the restricted one.

In the formula below, the subscripts 1 and 2 correspond to the restricted and the unrestricted models, respectively, with RSS_1 and RSS_2 as their respective residual sums of squares. Evidently, the numerator indicates how much variance remains unexplained by the restricted model (model 1) in contrast to the unrestricted model (model 2).

F = [(RSS_1 - RSS_2) / (p_2 - p_1)] / [RSS_2 / (n - p_2)]

Definition of the F-statistic. The terms (p_2 - p_1) and (n - p_2) represent the degrees of freedom, with n being the number of data points in your sample.

Having computed the F-statistic, you need a pre-defined significance level (as in the t-test above; typically 0.05, meaning the F-statistic is evaluated at a 95% confidence level). Under the assumption of the Null hypothesis, the F-statistic follows an F-distribution, which has two degrees of freedom as its characteristic parameters. Finally, to test your hypothesis (claim), you look up the F-distribution table, where df_1 = p_2 - p_1 is the first degree of freedom and df_2 = n - p_2 the second. For your pre-defined significance level, the value at the intersection of df_1 and df_2 in the F-distribution table is the critical value. If your computed F-statistic is larger than the critical value, you reject the Null hypothesis, and vice versa.

Intuitively, a large numerator in the F-statistic formula means the unrestricted model (model 2) explains much more of the variance in the data than the restricted/simpler model (model 1). Hence, the higher the F-statistic, the more significantly better model 2 is, and the higher the chances of rejecting the Null hypothesis.
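A small worked example of this comparison, with residual sums of squares invented for illustration; the critical value of roughly 4.04 for (df_1 = 1, df_2 = 48) at alpha = 0.05 comes from a standard F-distribution table:

```python
# Hypothetical residual sums of squares for two nested regression models
# fitted to n = 50 data points (numbers invented for illustration).
rss1, p1 = 120.0, 1    # restricted model: intercept only
rss2, p2 = 80.0, 2     # unrestricted model: intercept + one feature
n = 50

# F = [(RSS1 - RSS2) / (p2 - p1)] / [RSS2 / (n - p2)]
f_stat = ((rss1 - rss2) / (p2 - p1)) / (rss2 / (n - p2))
print(f_stat)

# For df1 = p2 - p1 = 1 and df2 = n - p2 = 48, the critical value at
# alpha = 0.05 is roughly 4.04 (from an F-distribution table).
f_critical = 4.04
print(f_stat > f_critical)     # reject H0: the extra feature helps
```

Here the unrestricted model removes a third of the residual variance, producing an F-statistic far above the critical value.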

Another key application of the F-test is in the analysis of variance (ANOVA) for a set of groups of samples to determine if they are statistically different.

Further reading:


6) R-squared

Caution: This is not the square of any variable called R.

Photo by Cris DiNoto on Unsplash

It is known as the coefficient of determination, often denoted R² and pronounced "R-squared". It indicates how good your model’s fit is compared to a simple baseline that always guesses the target variable’s average. More precisely, it measures what proportion of the variation of the target (response/dependent) variable is determined (captured, explained, or predicted) by your model.

The following definition of R² makes it clear:

R² = 1 - RSS / TSS = 1 - Σ(y_i - f_i)² / Σ(y_i - ȳ)²

Sketch explaining the computation of R² (redrawn from Wiki).

What are the two colored terms in the formula?

Both terms are, up to a constant factor, variances. So, the second term is basically the ratio of two variances.

  • The blue term is the residual sum of squares
  • The magenta term is the total sum of squares (by definition, proportional to the variance of the data)

If your model is a perfect fit, your black, straight line in the right-hand figure will pass through each of the data points, and the blue term in the formula will be zero, resulting in R² = 1. If your model f is just an average of the target variable y, i.e., a baseline model, the blue term will equal the magenta term, yielding R² = 0.

For example, an R² value of 0.85 means your model captures 85% of the variance in the target variable you are predicting. The higher the R², the better the prediction. However, a better prediction does not always mean a better model – you can also be overfitting. So be careful!
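The definition is easy to verify by hand. The observed values and predictions below are invented for illustration; the baseline check at the end confirms that predicting the mean everywhere gives R² = 0.

```python
# Hypothetical observed values and model predictions (invented for illustration).
y = [3.0, 5.0, 7.0, 9.0, 11.0]
f = [2.8, 5.1, 7.2, 8.7, 11.2]

y_mean = sum(y) / len(y)
rss = sum((yi - fi) ** 2 for yi, fi in zip(y, f))  # residual sum of squares (blue)
tss = sum((yi - y_mean) ** 2 for yi in y)          # total sum of squares (magenta)

r_squared = 1 - rss / tss
print(round(r_squared, 4))

# Baseline sanity check: a model that always predicts the mean has RSS = TSS,
# so its R² is exactly 0.
baseline_r2 = 1 - tss / tss
print(baseline_r2)             # 0.0
```

Here nearly all the variation in y is explained by the predictions, so R² is close to 1.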

Myth Buster: The article, Looking at R-Squared, explains why a low R-squared is not always bad and a high R-squared is not always good.

Note: A negative R-squared simply means your model is even worse than a simple, average baseline model. Time to check your model!

Adjusted R-squared

If you keep adding new features to your model, the R² value will never decrease. This way, although the score keeps improving, you can end up overfitting your data. To avoid this, the adjusted R-squared is sometimes used. It includes a penalty term that depends on the number of features (variables/regressors) p and the number of data points n.

Adjusted R² = 1 - (1 - R²) (n - 1) / (n - p - 1)

Formula to compute the adjusted R-squared
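A quick numerical illustration of the penalty, with R², n, and p invented for the example:

```python
# Adjusted R-squared for a hypothetical model with R² = 0.85,
# n = 100 data points and p = 5 features (numbers invented for illustration).
r2, n, p = 0.85, 100, 5

adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 4))

# Adding features always keeps R² from falling, but adjusted R² rises only
# if the gain outweighs the penalty; the same R² with p = 20 scores lower:
adj_r2_more = 1 - (1 - 0.85) * (100 - 1) / (100 - 20 - 1)
print(round(adj_r2_more, 4))
```

Note that adjusted R² is always at most R², and the gap widens as the feature count grows relative to the sample size.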

Further articles on R-Squared:


Conclusion:

The aim of this post was to give the reader a basic understanding of the most common statistical methods used in Data Science. The tests/concepts discussed in this article are not limited to any particular field but are general in their framework.

This brings me to the end of this post. To stay updated with my articles, follow me here. If you would like me to add something to this post, feel free to comment with relevant sources.

