
Motivation
Building Machine Learning models is great for making predictions. However, predictions alone are not enough when it comes to gaining a deeper understanding of your business problem, and building that understanding is what takes the most time in statistical modeling.
This article will first build your understanding of the fundamentals of statistics that can benefit Data Scientists' and Data Analysts' day-to-day activities and help the business make actionable decisions. It will also walk you through hands-on practice of those statistical concepts using Python.
If you prefer video instead of reading the article, then this is for you 👇🏽
What is the difference between population and sample?
Before starting to work with data, let’s first understand the concept of population and sample.
→ A population is the set of all items you are interested in (events, people, objects, etc.). In the image below the population is made of seven people.
→ A sample on the other hand is just a subset of a population. The sample from the image contains two people.

In real life, it is hard to find and observe entire populations. Gathering a sample, however, is less time-consuming and cheaper. These are the main reasons why we prefer working with samples, and most statistical tests are designed to work with incomplete data, which corresponds to samples.
A sample needs to satisfy the following two criteria in order to be valid: (1) random and (2) representative.
→ A random sample means that each element within the sample is chosen strictly at random from the population.
→ A sample is representative when it reflects an accurate representation of the population. For instance, a sample should not contain only men when the population is men and women.
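As a quick illustration, the random criterion can be simulated with Python's built-in `random` module (the population names below are hypothetical, standing in for the seven people in the image):

```python
import random

# Hypothetical population of seven people (as in the image above)
population = ["Ana", "Ben", "Chen", "Dana", "Eli", "Fatou", "Gus"]

# Draw a random sample of two people, without replacement
random.seed(42)  # seed only for reproducibility of this example
sample = random.sample(population, k=2)
print(sample)
```

Every member of the population has the same chance of ending up in the sample, which is exactly the random criterion described above.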
What are the different types of data?
Data in real life is made of different types. Knowing them is important because different types of data have different characteristics and are collected and analyzed in different ways.

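As a minimal sketch, assuming the common split into numerical (discrete and continuous) and categorical (nominal and ordinal) data, here are hypothetical examples of each type in Python:

```python
# Illustrative examples of the common data types (hypothetical values)
numerical_discrete = [1, 2, 3, 4]                # counts, e.g. number of children
numerical_continuous = [1.72, 1.68, 1.81]        # measurements, e.g. height in meters
categorical_nominal = ["red", "blue", "green"]   # labels with no inherent order
categorical_ordinal = ["low", "medium", "high"]  # labels with a natural order
```

The distinction matters in practice: the mean and median below only apply to numerical data, while the mode also works for categorical data.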
What are the main Measures of Central Tendency?
There are three main measures of central tendency: mean, median, and mode. When exploring your data, all three measures should be applied together in order to reach a better conclusion; using only one might give a misleading picture of your data.
This section focuses on defining each of them including their pros and cons.
Mean
Also known as the average (µ for a population, x̄ for a sample), the mean corresponds to the center of a finite set of numbers. It is computed by summing all the numbers and dividing by the total number of elements. Considering a set of numbers `x1, x2, …, xn` the sample mean is defined as follows:

x̄ = (x1 + x2 + … + xn) / n

where x̄ denotes the sample mean and n denotes the total number of observations in the sample set.
Below is an implementation in Python.
# Import the mean function from statistics module
from statistics import mean
# Define the set of numbers
data = [5, 53, 4, 8, 6, 9, 1]
# Compute the mean
mean_value = mean(data)
print(f"The mean of {data} is {mean_value} ")
The previous code should generate the following result:
The mean of [5, 53, 4, 8, 6, 9, 1] is 12.285714285714286
Even though the mean is the most commonly used measure, it is easily affected by outliers, and may therefore not always be the best option for drawing relevant conclusions.
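To see the effect of outliers in practice, we can compare the mean of our data with and without the value 53, which acts as an outlier in this small dataset:

```python
from statistics import mean, median

# The value 53 acts as an outlier in this small dataset
data = [5, 53, 4, 8, 6, 9, 1]
data_without_outlier = [5, 4, 8, 6, 9, 1]

print(mean(data))                  # pulled up by the outlier
print(mean(data_without_outlier))  # much closer to the bulk of the data
print(median(data))                # barely affected by the outlier
```

A single extreme value more than doubles the mean, while the median stays put, which motivates the next measure.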
Median
The median represents the middle value of the data after it has been sorted in ascending or descending order. For an odd number of observations n, the median is the ((n + 1)/2)-th value; for an even n, it is the average of the (n/2)-th and (n/2 + 1)-th values.
As opposed to the mean, the median is not affected by the presence of outliers and can be for that reason a better measure of central tendency. However, median and mean only work for numerical data.
Using the same data above we can compute the median as follows:
# Import the median function from statistics module
from statistics import median
# Compute the median
median_value = median(data)
print(f"The median of {data} is {median_value} ")
The execution generates the result below:
The median of [5, 53, 4, 8, 6, 9, 1] is 6
Let’s break down the computation process of the median value of the data:
- Step 1: arrange the data in increasing order: [1, 4, 5, 6, 8, 9, 53]
- Step 2: in our case n = 7, which is odd.
- Step 3: the middle value is the ((n + 1)/2)-th term, which is the (7 + 1)/2 = 4th term, hence 6.
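When the number of observations is even, the median is instead the average of the two middle values. A quick check with a hypothetical even-length dataset:

```python
from statistics import median

# With an even number of observations, the median is the average
# of the two middle values after sorting
even_data = [1, 4, 5, 6, 8, 9]  # already sorted; middle values are 5 and 6
print(median(even_data))  # (5 + 6) / 2 = 5.5
```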
Mode
It corresponds to the most occurring value in the data and can be applied to both numerical and categorical variables.
Similarly to the median, the mode is not sensitive to outliers. However, the mode does not exist when all the values in the data occur the same number of times. A dataset can also have more than one mode: with two modes it is called bimodal, with three trimodal, and beyond that multimodal.
Let’s use a different dataset to illustrate the use of the mode.
# Import the mode function from statistics module
from statistics import mode
# Define the data
data = [5, 9, 4, 9, 7, 9, 1]
# Compute the mode
mode_value = mode(data)
print(f"The mode of {data} is {mode_value} ")
All the values in the data occur one time except 9 which occurs 3 times. Since the mode corresponds to the most occurring value, the result of the above code is shown as:
The mode of [5, 9, 4, 9, 7, 9, 1] is 9
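When a dataset has several equally frequent values, the multimode function from the same statistics module returns all of them (note that since Python 3.8, mode returns the first mode encountered rather than raising an error). A quick sketch with a hypothetical bimodal dataset:

```python
from statistics import mode, multimode

# A dataset with two most-frequent values (bimodal)
data = [5, 9, 4, 9, 7, 5, 1]
print(multimode(data))  # [5, 9]: both occur twice
print(mode(data))       # 5: the first mode encountered
```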
What are the measures of shape?
Skewness and Kurtosis are the two main measures that describe the shape of a given dataset. This section covers each one in detail, including illustrations using Python.
Before diving into the explanation of each concept, let’s import the necessary Python libraries.
- Numpy is used to work with arrays.
- The scipy module is for statistical analysis.
- For visualization purposes, we use the matplotlib library.
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
from scipy.stats import beta, kurtosis
Skewness
The data is said to be skewed when its probability distribution is not symmetric around the mean of that data. Three main scenarios can happen depending on the value of the skewness.
The following helper function is used to illustrate and plot each case.
# Use the seed to get the same results for randomization
np.random.seed(2023)
def plot_skewness(data, label):
    plt.hist(data, density=True, color='orange', alpha=0.7, label=label)
    plt.axvline(data.mean(), color='green', linestyle='dashed', linewidth=2, label='Mean')
    plt.axvline(np.median(data), color='blue', linestyle='dashed', linewidth=2, label='Median')
    plt.legend()
    plt.show()
- The skewness is zero when the distribution is symmetric, as with the normal distribution. In this case Mean = Median = Mode.
# Normal distribution
normal_data = np.random.normal(0, 1, 1000)
label = 'Normal: Symmetric Skewness'
plot_skewness(normal_data, label)

- There is a positive skewness, or right skewness, when the value is greater than zero. The right tail of the distribution is longer, and the mean lies to the right of the median. In this case, we have Mean > Median > Mode.
# Exponential distribution
exp_data = np.random.exponential(1, 1000)
label = 'Exponential: Positive Skewness'
plot_skewness(exp_data, label)

- When the value is less than zero, there is a negative skewness, or left skewness. The left tail is longer, and the mean generally lies to the left of the median. In this scenario Mean < Median < Mode.
# Beta
beta_data = beta.rvs(5, 2, size=10000)
label = 'Beta: Negative Skewness'
plot_skewness(beta_data, label)

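Beyond the visual inspection above, the skewness value itself can be computed with scipy.stats.skew. A quick check on normal and exponential samples like the ones used earlier:

```python
import numpy as np
from scipy.stats import skew

np.random.seed(2023)
normal_data = np.random.normal(0, 1, 1000)     # symmetric: skewness close to 0
exp_data = np.random.exponential(1, 1000)      # right-skewed: skewness clearly positive

print(skew(normal_data))
print(skew(exp_data))
```

The sign of the returned value tells you which of the three scenarios above applies.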
Kurtosis
The kurtosis metric quantifies the proportion of the distribution’s weight in its tails compared to the rest of the distribution. It tells us if the data is spread out or concentrated around the mean.
A distribution with heavier tails and a sharper peak around the mean is said to have high kurtosis. Low kurtosis corresponds to a flatter distribution with lighter tails and less data concentrated around the mean.
Furthermore, kurtosis is used to check whether the data follows a normal distribution, and also for detecting the presence of outliers in the data.
There are overall three main types of kurtosis that a given dataset can display: (1) Mesokurtic, (2) Leptokurtic, and (3) Platykurtic. In addition to explaining each concept, Python code will show how to compute each one.
(1) Mesokurtic: in this case kurtosis = 3. This means that the kurtosis is the same as that of a normal distribution, and it is mainly used as a baseline to compare other distributions against.

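A quick check of the Mesokurtic case: the kurtosis of normally distributed samples is close to 3 under Pearson's definition. Note that scipy's kurtosis function defaults to fisher=True, which reports the "excess" kurtosis (the value minus 3), so a normal distribution scores close to 0 there:

```python
import numpy as np
from scipy.stats import kurtosis

np.random.seed(2023)
normal_data = np.random.normal(0, 1, 100_000)

# fisher=False returns Pearson's definition: ~3 for a normal distribution
print(kurtosis(normal_data, fisher=False))
# fisher=True (the default) subtracts 3, giving an excess kurtosis of ~0
print(kurtosis(normal_data, fisher=True))
```

Keep this convention in mind when reading the "> 3" and "< 3" thresholds below: with fisher=True they become "> 0" and "< 0".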
(2) Leptokurtic, also known as positive kurtosis, has kurtosis > 3. Often referred to as a "peaked" distribution, a Leptokurtic distribution has a higher concentration of data around the mean compared to the normal distribution.

(3) Platykurtic, also known as negative kurtosis, has kurtosis < 3. Often referred to as a "flat" distribution, a Platykurtic distribution has a lower concentration of data around the mean than the normal distribution and also has shorter tails.

The following code from the official documentation of scipy perfectly illustrates how to compute the kurtosis.
x = np.linspace(-5, 5, 100)
ax = plt.subplot()
distnames = ['laplace', 'norm', 'uniform']

for distname in distnames:
    if distname == 'uniform':
        dist = getattr(stats, distname)(loc=-2, scale=4)
    else:
        dist = getattr(stats, distname)
    data = dist.rvs(size=1000)
    kur = kurtosis(data, fisher=True)
    y = dist.pdf(x)
    ax.plot(x, y, label="{}, {}".format(distname, round(kur, 3)))

ax.legend()
plt.show()

- The Laplace distribution carries the properties of a Leptokurtic distribution: its tails are more pronounced than those of the normal distribution.
- The uniform distribution has the least pronounced tails due to its negative kurtosis (Platykurtic).
Conclusion
This first section of the series has covered the different types of data, the difference between sample and population, the main measures of central tendency, and finally, the measures of asymmetry.
Stay tuned for the next section which will cover more topics to help you acquire relevant statistics skills.
If you like reading my stories and wish to support my writing, consider becoming a Medium member. With a $ 5-a-month commitment, you unlock unlimited access to stories on Medium.
Would you like to buy me a coffee ☕️? → Here you go!
Feel free to follow me on Medium, Twitter, and YouTube, or say Hi on LinkedIn. It is always a pleasure to discuss AI, ML, Data Science, NLP, and MLOps stuff!
Source code available on GitHub.