
What is the t-distribution?

Discover the origins, theory and uses behind the famous t-distribution

Photo by Agence Olloweb on Unsplash


The t-distribution is a continuous probability distribution that is very similar to the normal distribution; however, it has the following key differences:

  • Heavier tails: More of its probability mass is located at the extremes (higher kurtosis). This means that it is more likely to produce values far from its mean.
  • One parameter: _The t-distribution has only one parameter, the degrees of freedom, as it’s used when we are unaware of the population’s variance._
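
As a quick, hedged illustration of the first point, the snippet below (using scipy.stats, as in the plotting code later in this article) compares the probability of landing more than three units from the mean under a t-distribution with 5 degrees of freedom versus a standard normal; both the 5 degrees of freedom and the cutoff of 3 are illustrative choices.

# Quick illustration of the heavier tails: P(|X| > 3) under a t-distribution
# with 5 degrees of freedom (illustrative choice) vs the standard normal.
from scipy.stats import t, norm

cutoff = 3
p_tail_t = 2 * t.sf(cutoff, 5)       # two-sided tail probability for t(df=5)
p_tail_norm = 2 * norm.sf(cutoff)    # two-sided tail probability for N(0, 1)

print(f"P(|X| > {cutoff}) for t(df=5):  {p_tail_t:.4f}")
print(f"P(|X| > {cutoff}) for N(0, 1): {p_tail_norm:.4f}")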

An interesting fact about the t-distribution is that it is sometimes referred to as the "Student’s t-distribution." This is because the inventor of the distribution, William Sealy Gosset, an English statistician, published it using his pseudonym "Student" to keep his identity anonymous, thus leading to the name "Student’s t-distribution."

Theory & Definition

Let’s go over some theory behind the distribution to build some mathematical intuition.

Origin

The t-distribution originates from the problem of modelling normally distributed data when the population variance of that data is unknown.

For example, say we sample n data points from a normal distribution. The mean and variance of this sample are, respectively:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2$$

Where:

  • x̄ is the sample mean.
  • s is the sample standard deviation.

Combining the above two equations, we can construct the following random variable:

$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$

Here μ is the population mean, and t is the t-statistic, which belongs to the t-distribution!
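
To make this concrete, here is a minimal sketch that computes the sample mean, the sample standard deviation and the resulting t-statistic; the sample itself, its size and the hypothesised population mean of 0 are all assumptions made purely for demonstration.

# Minimal sketch: compute the t-statistic for an illustrative sample
# against a hypothesised population mean of 0 (assumed for this example).
import numpy as np

rng = np.random.default_rng(42)
sample = rng.normal(loc=0.5, scale=2.0, size=20)   # n = 20 illustrative data points

n = len(sample)
x_bar = sample.mean()                 # sample mean
s = sample.std(ddof=1)                # sample standard deviation (n - 1 in the denominator)
mu = 0                                # hypothesised population mean (assumed)

t_stat = (x_bar - mu) / (s / np.sqrt(n))
print(f"t-statistic: {t_stat:.3f} with {n - 1} degrees of freedom")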

See here for a more thorough derivation.

Probability Density Function

As mentioned above, the t-distribution is parameterised by only one value, the degrees of freedom, ν, and its probability density function looks like this:

$$f(t) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\!\left(\frac{\nu}{2}\right)} \left(1 + \frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}}$$

Where:

  • t is the random variable (the t-statistic).
  • ν is the degrees of freedom, which is equal to n−1, where n is the sample size.
  • Γ(z) _is the gamma function, which is:_
$$\Gamma(z) = \int_{0}^{\infty} x^{z-1} e^{-x}\, dx$$

Don’t worry too much about this scary maths (I certainly don’t!), but the key things to know are:

  • The PDF is symmetric and is overall bell-shaped.
  • It closely resembles a standard normally distributed variable (mean 0, variance 1), except that it is a bit shallower and wider, with more probability in the tails.
  • As ν increases, the t-distribution approaches the standard normal distribution.
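
If you want to sanity-check the density formula above rather than take it on faith, here is a small sketch that evaluates it by hand with math.gamma and compares the result with scipy.stats.t.pdf; the point t = 1.5 and ν = 10 are arbitrary illustrative choices.

# Sanity check: evaluate the t-distribution PDF by hand and compare with SciPy.
import math
from scipy.stats import t

def t_pdf(x, nu):
    """Density of the t-distribution with nu degrees of freedom at x."""
    coeff = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return coeff * (1 + x**2 / nu) ** (-(nu + 1) / 2)

x, nu = 1.5, 10            # arbitrary illustrative values
print(t_pdf(x, nu))        # manual evaluation of the formula
print(t.pdf(x, nu))        # SciPy's evaluation -- the two should agree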

Characteristics

  • The mean is defined as follows for ν > 1:
$$E[T] = 0$$
  • And the variance is defined as follows for ν > 2:
$$\mathrm{Var}[T] = \frac{\nu}{\nu - 2}$$
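
As a quick, hedged check of these two results, SciPy reports the mean and variance of the distribution directly, so we can compare them with ν/(ν − 2) for an illustrative ν = 5:

# Compare SciPy's reported mean and variance with the closed-form expressions.
from scipy.stats import t

nu = 5                              # illustrative degrees of freedom (> 2)
print(t.mean(nu), 0)                # mean is 0 for nu > 1
print(t.var(nu), nu / (nu - 2))     # variance is nu / (nu - 2) for nu > 2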

Example Plots

Below is an example plot of the t-distribution as a function of various degrees of freedom and also compared to the standard normal distribution:


# Import packages
import numpy as np
from scipy.stats import t, norm
import plotly.graph_objects as go

# Generate data
x = np.linspace(-5, 5, 1000)
normal_pdf = norm.pdf(x, 0, 1)

# Create plot
fig = go.Figure()

# Add standard normal distribution to plot
fig.add_trace(go.Scatter(x=x, y=normal_pdf, mode='lines', name='Standard Normal'))

# Add t-distributions to plot for various degrees of freedom
for df in [1, 5, 10, 20]:
    t_pdf = t.pdf(x, df)
    fig.add_trace(go.Scatter(x=x, y=t_pdf, mode='lines', name=f't-distribution (df={df})'))

fig.update_layout(title='Comparison of Normal and t-distributions',
                  xaxis_title='Value',
                  yaxis_title='PDF',
                  legend_title='Distribution',
                  font=dict(size=16),
                  title_x=0.5,
                  width=900,
                  height=500,
                  template="simple_white")
fig.show()
Plot generated by author in Python.

Notice that as the degrees of freedom, df, get larger, the t-distribution becomes more and more similar to the normal distribution. At around df=30, we typically say the two distributions are sufficiently similar.
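
To put a rough number on "sufficiently similar", the snippet below measures the largest pointwise gap between the two densities at df=30 over the same grid used in the plot; the grid and the threshold of 30 are conventions rather than exact results.

# Rough measure of how close t(df=30) is to the standard normal:
# the largest pointwise difference between the two PDFs over a grid.
import numpy as np
from scipy.stats import t, norm

x = np.linspace(-5, 5, 1000)
max_gap = np.max(np.abs(t.pdf(x, 30) - norm.pdf(x)))
print(f"Maximum PDF difference at df=30: {max_gap:.4f}")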

Applications

The following are the most common applications of the t-distribution in Data Science and machine learning:

  • T-test: _The most famous application of the t-distribution is hypothesis testing through the use of the t-test, which measures the statistical difference between two sample means. You can check my previous blog about it here:_

Statistical T-Test Simply Explained

  • Confidence intervals: For small sample sizes (typically less than 30), the t-distribution is used to compute confidence intervals for a statistic, reflecting the increased uncertainty from estimating the population variance. You can read more about confidence intervals here:

Confidence Intervals Simply Explained

  • Regression: The t-distribution is used to determine whether we should add certain covariates to our regression model, through hypothesis tests on the significance of their coefficients.
  • Bayesian Statistics: _The t-distribution is sometimes used as a prior distribution in Bayesian inference, which can be applied in all areas of data science, particularly reinforcement learning. See here for more info:_

Bayesian Updating in Python

  • Quantitative Finance: In finance, assets and derivatives often exhibit excess kurtosis, so they are modelled by the t-distribution, which has heavy tails. This is very useful for Data Scientists in the finance space.
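
To ground the first two applications in code, here is a small hedged sketch using scipy.stats: a two-sample t-test on two synthetic groups and a t-based 95% confidence interval for a small sample. All of the data and the group means are assumptions made purely for demonstration.

# Illustrative uses of the t-distribution via SciPy: a two-sample t-test
# and a t-based confidence interval for a small synthetic sample.
import numpy as np
from scipy.stats import ttest_ind, t

rng = np.random.default_rng(0)
group_a = rng.normal(loc=10.0, scale=2.0, size=15)   # synthetic group A
group_b = rng.normal(loc=11.0, scale=2.0, size=15)   # synthetic group B

# Two-sample t-test: is there a statistically significant difference in means?
t_stat, p_value = ttest_ind(group_a, group_b)
print(f"t-statistic: {t_stat:.3f}, p-value: {p_value:.3f}")

# 95% confidence interval for the mean of group A using the t-distribution
n = len(group_a)
mean = group_a.mean()
sem = group_a.std(ddof=1) / np.sqrt(n)
ci_low, ci_high = t.interval(0.95, n - 1, loc=mean, scale=sem)
print(f"95% CI for the mean of group A: ({ci_low:.2f}, {ci_high:.2f})")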

Summary & Further Thoughts

The t-distribution is a useful statistical distribution that is very similar to the normal distribution but with heavier tails. This makes it an important tool in situations where the population variance is unknown. It is parameterised by a single parameter, the degrees of freedom; as this increases, the t-distribution comes to resemble the normal distribution more closely. It has many applications within data science, spanning hypothesis testing with the t-test, constructing confidence intervals for small datasets, and aiding in regression modelling.

The code used in this article is available on my GitHub here:

Medium-Articles/Statistics/Distributions/t_dist.py at main · egorhowell/Medium-Articles


Another Thing!

I have a free newsletter, Dishing the Data, where I share weekly tips for becoming a better Data Scientist. There is no "fluff" or "clickbait," just pure actionable insights from a practicing Data Scientist.

Dishing The Data | Egor Howell | Substack
