I was never at the top in maths, but I wasn’t bad either and always liked it somehow. At the same time, I thought for a long time: if you’re not great at maths, you can’t go into computer science, data analysis, data science or IT. After a few detours via psychology, I finally ended up in a Salesforce consultant role and successfully completed my degree in business informatics and AI, specializing in software architecture and design, with top marks.
But whether I’m programming with Python, creating machine learning models or analyzing data, I come across mathematical concepts all the time. For data analysis, automation or Python programming, I don’t think it’s necessary to understand every maths topic in detail, but a basic mathematical understanding comes up very often. A few months ago, I wanted to brush up on my understanding of maths and logic and came across the app ‘Brilliant’ (no affiliate link). The app explains topics such as probability and regression models in a super simple way. The cool thing is that the app is structured similarly to Duolingo and is almost a little addictive thanks to the gamification… Read to the end of the article to see how long my streak already is 💪
In this article, I will introduce you to some of the most important mathematical topics that are essential for applied data analysis. If you have beginner or intermediate knowledge and want to deepen your understanding of key mathematical concepts in data analysis, data science and machine learning, this article is perfect for you. The list is of course not exhaustive.
Table of Contents
1 – Basics of Statistics: Descriptive Statistics, Normal Distribution, Probability Theory, Random Numbers and Monte Carlo
2 – Basics of Mathematics for Machine Learning: Linear Regression, Decision Trees, Naive Bayes, Time Series, Correlation vs. Causality, Gradient Descent
3 – Logic and Boolean Algebra
4 – Data Visualisation and Exploratory Data Analysis (EDA)
5 – Final Thoughts
1 – Basics of Statistics
Have you ever wondered why certain numbers in your data stand out or why a model doesn’t work as expected? The answer often lies in statistics. By looking at descriptive statistics and the distribution of our data, we can easily understand the basic properties of our data set. For example, we can recognize outliers (= unusual data points) or initial patterns. We can also better assess the data quality by looking for gaps or inconsistencies. And if we want to use machine learning with our data, the choice of models often also depends on the distribution of the data. For example, we use different models depending on whether the data is distributed linearly or non-linearly.
1.1 – How do we recognize important characteristics with descriptive statistics?
If we use the Pandas library in Python, we can use the command below. It returns the mean, the standard deviation, the min and max values and the quartiles:
# Descriptive statistics
import pandas as pd
data = [1, 2, 2, 3, 3, 3, 4, 4, 5]  # example data
df = pd.DataFrame({'data': data})
summary = df.describe()  # Returns count, mean, standard deviation, min/max and quartiles
print(summary)
The mean is the average of all data values. The median, on the other hand, is the middle value when we sort all the data in ascending order: with an odd number of values it is the middle value, and with an even number it is the average of the two middle values. The median is useful because it is much less sensitive to outliers than the mean. The mode shows us the value that occurs most frequently – for example, in the case of categorical data, the favorite color that is named most often.
1.2 – Why is the normal distribution important?
Our data can be normally distributed or skewed. If you see a symmetrical, bell-shaped distribution, the data is normally distributed. This means that most of the values are around the mean and the frequency of values decreases at the extremes. Many things in nature follow approximately this distribution: for example, the height of the vast majority of adult humans is between 1.55 and 2 meters. Extreme heights are extremely rare. Or the IQ value in the population is close to the average value of around 100.
We can use these commands in Python with the Matplotlib library:
# Data distribution
import matplotlib.pyplot as plt
data = [1, 2, 2, 3, 3, 3, 4, 4, 5]  # example data

## Histogram
plt.hist(data, bins=5)  # Creates a histogram with 5 bins
plt.show()

## Boxplot
plt.boxplot(data)
plt.show()
The dispersion is a measure of how strongly the values in the data set are distributed around the mean value. It shows us whether the values are close together or far apart.
- A high dispersion means that the values are distributed over a wide range. For example, income in a country with high inequality is very widely distributed.
- If, on the other hand, there is a low dispersion, the data is close together. Here you can imagine the height of a primary school class in which all the children are of a similar age.
We use the variance to calculate the average squared deviation of each data point from the mean.
- If the value of the variance is large, this indicates that the data is highly scattered.
- If the value is small, this indicates a low dispersion – most values are therefore close to the mean.
The standard deviation, on the other hand, is the square root of the variance and is much easier for us to interpret, as it indicates the deviation from the mean value in the same unit as the data.
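As a minimal sketch in Python (using NumPy and the same example data as above – the numbers are purely illustrative):
import numpy as np
data = [1, 2, 2, 3, 3, 3, 4, 4, 5]
print(np.var(data))  # Variance: average squared deviation from the mean
print(np.std(data))  # Standard deviation: square root of the variance, in the same unit as the data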
Why are these basics helpful?
Before we carry out in-depth analyses with the data, we need to have a basic understanding of our data. We can recognize unusual patterns, outliers or missing values in the data much more quickly. With visualizations such as the histogram or the boxplot, we can easily explain and present the results of the data to people who do not have much knowledge of mathematics and data analysis. The normal distribution is important to know because many statistical tests and models (such as linear regression) assume that the errors or residuals are normally distributed.
The output of descriptive statistics or boxplot visualizations are part of exploratory data analysis (EDA). In this article, I have summarised 9 important steps for EDA.

1.3 – What is probability theory?
In probability theory, we want to find out the probability of a certain event occurring. An event is a possible outcome of a situation.
For example, we want to find out the probability of a customer canceling their subscription. The probability is always between 0 (=impossible) and 1 (=certain). We specify the probability for event A as P(A).
What if we add conditions to the probability calculation?
In certain situations, the probability of an event A depends on whether another event B has already occurred. This is where the concept of conditional probability helps us. It gives us the probability of event A, but only under the condition that event B has already occurred. We specify this with the notation P(A|B).
You have probably come across the following formula before:
P(A|B) = P(A∩B) / P(B)
Here we first calculate the probability of A and B occurring at the same time. We then divide this result by the probability of B occurring.
Let’s take a quick look at this using an example: We want to know the probability of a customer buying a product (A) if they have previously seen an advert (B). We assume the following numbers:
- P(A∩B): 5% of customers buy a product after seeing an advert.
- P(B): 20% of customers see the advert.
Now we fill in the formula: P(A|B) = 0.05/0.2 = 0.25 (=25%)
The result means that the probability of purchase is 25% if a customer has seen the advert.
What do we need Bayes’ theorem for?
Bayes’ theorem extends conditional probability. It is a method we can use to get from P(B|A) to P(A|B). The formula for this is:
P(A|B) = (P(B|A) × P(A)) / P(B)
Let’s take a look at this using a simple example: We want to answer the question of how high the probability is that a customer actually has an interest in buying (A) if they have clicked on an email campaign (B). Here we assume these figures:
- P(A): 10% of customers have a genuine interest in buying.
- P(B|A): 80% of customers interested in buying click on the email.
- P(B): 30% of all customers click on the email.
Now we fill in the formula: P(A|B) = (0.8×0.1)/0.3 = 0.267 (=26.7%)
Here we see that if a customer clicks on the email, the probability is 26.7% that the customer is actually interested in buying.
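If you want to double-check the two examples, a quick calculation in Python (using the assumed figures from above) looks like this:
# Conditional probability: P(A|B) = P(A∩B) / P(B)
p_a_and_b = 0.05  # 5% of customers buy after seeing the advert
p_b_advert = 0.20  # 20% of customers see the advert
print(p_a_and_b / p_b_advert)  # 0.25 -> 25%

# Bayes' theorem: P(A|B) = (P(B|A) x P(A)) / P(B)
p_a = 0.10  # 10% of customers have a genuine interest in buying
p_b_given_a = 0.80  # 80% of interested customers click on the email
p_b_click = 0.30  # 30% of all customers click on the email
print(round(p_b_given_a * p_a / p_b_click, 3))  # 0.267 -> 26.7%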
Why are these basics helpful?
Understanding probabilities is important for many decisions that we make in companies and, of course, for predictive models – such as the probability of a customer canceling a subscription. Conditional probability helps us to understand relationships between variables – for example, how strong the influence of an advert is on a purchase. The basics of Bayes’ theorem are relevant as many probabilistic algorithms (e.g. in webshop recommendation systems or spam filters) are based on it. It enables us to update probabilities when new information is added.
1.4 – What do we need random numbers and Monte Carlo methods for?
If we want to carry out simulations, probability calculations or probabilistic models, we need random numbers. They help us to mathematically model uncertainties or stochastic processes. Random numbers are numbers that are generated without a predictable pattern.
With Python, we can generate pseudo-random numbers with these commands:
# To generate a random float between 0 and 1
import random
print(random.random())

# To generate a random integer between a and b (inclusive)
a, b = 1, 10
print(random.randint(a, b))

# Random numbers from different distributions
import numpy as np
mean, std_dev, size = 0, 1, 5
print(np.random.normal(mean, std_dev, size))  # Normal distribution
These random numbers are generated by an algorithm and are sufficient for most applications (except cryptography).
What is seeding?
Seeding allows us to use random number generators in such a way that the random numbers generated remain identical each time they are run. We need seeding to achieve reproducible results, for debugging or for comparing models.
You can use this Python command for this:
import random
random.seed(42)
print(random.random())
How can we carry out simulations with random numbers?
Monte Carlo methods use random numbers to solve complex problems that are difficult to calculate analytically. They simulate many possible scenarios and approach a solution based on the results.
Let’s take a look at an example: A company wants to carry out a risk analysis for an investment project. It wants to estimate the probability of a loss on an investment. We assume the following figures:
- Expected return: CHF 10,000
- Existing uncertainty: The actual return can fluctuate by ± CHF 5,000 and follows a normal distribution.
- Threshold value: A loss occurs for the company if the return falls below CHF 0.
With Monte Carlo simulations, we can run many random scenarios to calculate the probability of loss. Here is the Python code we can use for this:
import numpy as np
# Simulation parameters
np.random.seed(42) # Ensure reproducibility
num_simulations = 10000 # Number of scenarios
expected_return = 10000 # Expected return
std_dev = 5000 # Standard deviation of the return
# Simulating random returns
returns = np.random.normal(expected_return, std_dev, num_simulations)
# Calculating the probability of a loss
loss_probability = np.mean(returns < 0)
print(f"Probability of a loss: {loss_probability:.2%}")
The result with these parameters is 0.023. This means that the simulation with the specified numbers indicates that a loss occurs with a probability of 2.3%.
Why are these basics helpful?
Random numbers and Monte Carlo methods help us to simulate real scenarios such as loss risks, market behavior, investment risks or production fluctuations. Random numbers and Monte Carlo methods also provide the basis for many machine learning models such as random forests.
2 – Basics of Mathematics for Machine Learning
Fun Fact: Did you know that the perceptron algorithm was one of the first algorithms developed to simulate a neural node? The node could only solve simple problems, such as recognising lines. But it laid one of the various foundations for today’s neural networks.
Reference: What is Perceptron?
2.1 – Is linear regression the simplest model?
Linear regression is one of the simplest models in machine learning. It is often used as a base model to compare the performance of other models.
You can see how you can use linear regression as a basic model in this article ‘Discovering the Best ML Model for Energy Forecasting: Which Tops the List? (Part 2)’, for example.
Linear regression describes the relationship between a dependent variable and one or more independent variables. For example, we can use it to predict continuous values such as energy consumption. The simplified mathematical formula with a single independent variable looks like this:
y = xw + b
As soon as you use multiple linear regression (several independent variables), you have several x values and corresponding weights w:
y = x1w1 + x2w2 + x3w3 + … + xnwn + b
It is important to know that linear regression makes assumptions about the data, such as that the relationship between the variables is linear, that the errors (= residuals) are normally distributed and that there is homoscedasticity (= the variance of the errors is constant).
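As a small, purely illustrative sketch with scikit-learn (the numbers are made up):
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])  # independent variable
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])  # dependent variable

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)  # learned weight w and bias b
print(model.predict([[6]]))  # prediction for a new x value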
Fun fact: Did you know that the term regression was coined by the British statistician Sir Francis Galton? He discovered that extreme characteristics (e.g. height) ‘regress’ to the average value in offspring – hence the name ‘regression to the mean’.
_Reference: Regression toward the mean_
2.2 – Can we no longer see the wood for the trees with decision trees?
We often use decision trees for classification problems. The extension of this model is the Random Forest model, in which several trees are combined to make more robust predictions. For example, a credit institution can determine whether a person is creditworthy. Or in this article ‘Beginner’s Guide to Predicting House Prices with Random Forest: Step-by-Step Introduction with a Built-in scikit-learn Dataset’ you can see how the price of houses is predicted using the Random Forest model as an example.
How do decision trees work?
The nodes represent a condition. For example, the first node could define that the age is > 30. The leaves indicate the prediction or category – for example, creditworthy vs. not creditworthy. And the paths indicate the decision chain from the root to the leaf.
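A minimal sketch with scikit-learn, using hypothetical creditworthiness data (age and income are made-up features):
from sklearn.tree import DecisionTreeClassifier

X = [[25, 30000], [45, 80000], [35, 50000], [50, 20000]]  # features: age, income
y = [0, 1, 1, 0]  # 1 = creditworthy, 0 = not creditworthy

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(tree.predict([[40, 60000]]))  # prediction for a new applicant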
2.3 – Why is Naive Bayes "naive"?
For this model, you can read the section on probability theory again. The Naive Bayes classifier is based on Bayes’ theorem and is often used in classification problems. Naive Bayes estimates the probability that a data point belongs to a class C based on its features. Let’s look at the following formula:
P(C|X) = (P(X|C) × P(C)) / P(X)
The model is called ‘naive’ because we assume that all features are independent of each other. This assumption is often not realistic, but makes the calculations much easier. For example, if you roll two dice in succession and the first roll shows a 6 and the second roll a 4, the two characteristics are independent. The result of the first throw does not affect the result of the second throw. However, if we look at the characteristics that one team scores the first goal (A) and one team wins the game (B) during a soccer match, the characteristics are dependent on each other. The reason for this is that the first goal (A) strongly influences the probability of a team winning the match (B).
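As a small illustrative sketch with scikit-learn (Gaussian Naive Bayes on made-up data):
from sklearn.naive_bayes import GaussianNB

X = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]]  # features
y = [0, 0, 1, 1]  # class labels

nb = GaussianNB().fit(X, y)
print(nb.predict([[1.2, 1.9]]))  # predicted class
print(nb.predict_proba([[1.2, 1.9]]))  # estimated class probabilities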
2.4 – Where do we encounter time series analyses?
We encounter time series data all the time in today’s world – in sales figures, share prices, IoT data or energy consumption data. If you have a data set where the data is collected over time and each observation has a timestamp, you are dealing with a time series.
The first important thing to understand is whether there is a trend (long-term upward or downward movement) and seasonality (recurring pattern) in your data. The residuals then show the remainder that is not explained by trend or seasonality. In this article ‘Beginner’s Guide to Predicting Time Series Data with Python’ you will find a detailed explanation of time series analysis and forecasting.
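As a minimal sketch of how to separate trend, seasonality and residuals, here is a decomposition with statsmodels on synthetic monthly data (the series is made up for illustration):
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

dates = pd.date_range('2023-01-01', periods=36, freq='MS')  # 36 months
values = np.arange(36) + 10 * np.sin(np.arange(36) * 2 * np.pi / 12)  # upward trend + yearly seasonality
series = pd.Series(values, index=dates)

result = seasonal_decompose(series, model='additive')
print(result.trend.dropna().head())  # long-term upward or downward movement
print(result.seasonal.head())  # recurring pattern
print(result.resid.dropna().head())  # remainder not explained by trend or seasonality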
2.5 – What do margarine and divorce rates have to do with correlation and causality?
I think this point is clear to practically everyone who deals with data. And yet it is important that we remember this when interpreting our data sets. Correlation shows us the strength and direction of the relationship between two variables.
- Correlation coefficient of +1: Perfect positive correlation
- Correlation coefficient of -1: Perfect negative correlation
- Correlation coefficient of 0: No correlation
If, on the other hand, there is causality, this tells us that one of the variables is the cause of the other variable.
Let’s take the following fictional example: A company analyses the productivity of its employees and discovers that teams that work more overtime also achieve higher quarterly results.
Management might jump to the conclusion that overtime is the cause of higher productivity. And based on this, they could introduce a policy that encourages overtime for all teams.
The problem with this conclusion is that there could be many other factors responsible for the higher quarterly results. For example, it may be that projects with better results are by nature more exciting and teams are more motivated as a result. Or it could be that teams that are already more productive are more often involved in particularly successful projects.
When we read this example, it is probably clear to everyone that a direct causal relationship between overtime and quarterly earnings cannot simply be assumed. However, such misinterpretations happen again and again (even I sometimes catch myself doing it…). A well-known example from the past is the assumption that a higher consumption of ice cream leads to an increase in violent crime. However, a closer look showed that both phenomena were caused by the common factor of rising temperatures in summer. It is therefore important to always carry out a more in-depth analysis first, taking various factors into account, before assuming such a causal relationship.
2.6 – How does the gradient descent help us to find the best parameters?
Put simply, you can imagine gradient descent as a small ball that looks for the fastest way down a hill by following the slope of the hill.
Gradient descent minimizes the error of a model by adjusting the parameters so that the error function becomes smaller.
Let’s look at an example: We have set up a machine learning model that predicts energy consumption over the next 48 hours. The results are not perfect. With the error function, we can see how good our model is. The function shows us how big the difference is between the predictions and the real values. With the gradient descent, we now try to adjust the model parameters step by step so that the error function becomes smaller. At this link you can find a course from my university days on GitHub about Gradient Descent (with Python code to replicate).
How gradient descent works:
- We imagine the error function as a hilly landscape: the highest point is the largest error. Our goal is to find the lowest point, which gives us the minimum error.
- Gradient descent calculates the direction of the steepest descent at the current position. Like a ball, it rolls a small distance in this direction.
- The algorithm repeats this process until the lowest point is reached or it gets close enough.
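As a tiny toy sketch of this idea in Python (not the course code linked above), minimizing the simple error function f(w) = (w - 3)²:
def gradient(w):
    return 2 * (w - 3)  # derivative of the error function (w - 3)**2

w = 0.0  # starting position of the 'ball'
learning_rate = 0.1
for step in range(50):
    w -= learning_rate * gradient(w)  # roll a small step in the direction of steepest descent
print(round(w, 4))  # close to 3, the lowest point of the error function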
Why are these basics helpful?
All of these concepts are foundations of modern machine learning models. Linear regression is a basic model for understanding relationships, while decision trees are intuitive models that serve, for example, as the basis for more complex approaches such as random forests. An understanding of Naive Bayes helps us with applications such as spam filters or sentiment analysis. And in today’s world, we have so much timestamped data that a basic understanding of time series analysis is almost essential. Understanding gradient descent helps us because it is the basis of the optimization behind virtually all neural networks.
3 – Logic and Boolean Algebra
We encounter Boolean algebra and logic not only in maths, but also in everyday life – especially when we need to automate processes, perform segmentation or filter data. In Boolean algebra, we have the two values True (1) and False (0). We can use these values and the following three logical operators to define and check conditions:
- AND: Here, both or all conditions must be true. For example, an email should only be sent with marketing automation if the lead has given an opt-in AND the lead score is greater than 80.
- OR: One of the conditions must be true here. For example, we use a marketing automation tool to send a discount voucher to all contacts who are either VIP OR have made a purchase over CHF 100.
- NOT: Here we check whether a condition does not apply. For example, we check whether customers are NOT on a registration list for a webinar. These people will receive a reminder email.
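Translated into Python, the three operators could look like this for a hypothetical contact record (all field names and thresholds are made up):
contact = {'opt_in': True, 'lead_score': 85, 'vip': False, 'purchase_amount': 120, 'registered_for_webinar': False}

send_email = contact['opt_in'] and contact['lead_score'] > 80  # AND: both conditions must be true
send_voucher = contact['vip'] or contact['purchase_amount'] > 100  # OR: at least one condition must be true
send_reminder = not contact['registered_for_webinar']  # NOT: the condition must not apply

print(send_email, send_voucher, send_reminder)  # True True True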
Why are these basics helpful?
We can use the truth values to define complex conditions. They are the basis for workflows or automation in tools such as Salesforce or marketing automation platforms. We also need this logic if we want to segment a contact database in order to create target groups that fulfill certain conditions. And to create dashboards or reports, we can use the logical operators to compile the desired visualizations.
4 – Data Visualisation and Exploratory Data Analysis (EDA)
Fun Fact: Did you know that the data set ‘Anscombe’s Quartet’ by statistician Francis Anscombe showed that four different data sets can have identical statistical properties (mean & variance) but show completely different patterns when visualised?
_Reference: Anscombe’s quartet_
This shows us how important it is to also plot our data and not just analyze it statistically.
4.1 – Which Chart do we use to Visualise what?
Histogram – understanding distribution
We use a histogram to show the distribution of the data (see also point 1.2). The data from the data set is divided into intervals, so-called bins. This is particularly suitable for numerical data to recognize whether the values are normally distributed or whether there are skews, and which data values occur frequently.
In Python, you can use this command for this:
import matplotlib.pyplot as plt
data = [1, 2, 2, 3, 3, 3, 4, 4, 5]
plt.hist(data, bins=5) # Creates a histogram with 5 bins
plt.title('Histogram')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.show()
Boxplot – recognizing scatter and outliers in the data
A boxplot is a summary of all relevant values of the descriptive statistics in a single graph. We can see the median, upper and lower quartiles and potential outliers at a glance. If you are still in training, you can be sure that you will have to explain this diagram in an exam or oral examination 😉
With Python, you can output a boxplot with this command:
plt.boxplot(data)
plt.title('Boxplot')
plt.ylabel('Values')
plt.show()
Scatter plot – seeing patterns and correlations
With scatter plots, we can recognize correlations or patterns in the data – for example, trends or clusters. A scatter plot shows us the relationship between two numerical variables.
You can use this command in Python:
x = [1, 2, 3, 4, 5]
y = [2, 4, 1, 8, 7]
plt.scatter(x, y)
plt.title('Scatter Plot')
plt.xlabel('Variable X')
plt.ylabel('Variable Y')
plt.show()
The correlation shows us how strongly two variables are related. For example, there is a positive correlation if sales increase with higher marketing expenditure. There is a negative correlation if, for example, the sales price falls the older the cars are. There is no correlation if the data points are randomly distributed.
If we can recognize different clusters in the scatter plot, this indicates that there are different groups in the data with similar characteristics (e.g. different customer segments).
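If we want to quantify such a relationship, we can calculate the correlation coefficient – here a minimal sketch with NumPy, reusing the x and y values from the scatter plot example above:
import numpy as np
x = [1, 2, 3, 4, 5]
y = [2, 4, 1, 8, 7]
r = np.corrcoef(x, y)[0, 1]  # Pearson correlation coefficient
print(round(r, 2))  # positive value -> the two variables tend to increase together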
Line chart – trends over time
Line charts show us the trends over time and are super important for time series analyses. Typical use cases are sales trends, share prices or energy consumption data. If you want to dive deeper into the topic of time series analyses, you can read this article ‘Beginner’s Guide to Predicting Time Series Data with Python’.
In Python, we can output a line chart with this command:
time = [1, 2, 3, 4, 5]
value = [2, 3, 5, 7, 11]
plt.plot(time, value)
plt.title('Line Chart')
plt.xlabel('Time')
plt.ylabel('Value')
plt.show()
4.2 – How Do We Identify Patterns in Our Data?
In data analysis, we need to identify trends, anomalies, or clusters:
- Trends indicate long-term changes in the data, such as an upward or downward movement over a certain period.
- Anomalies highlight data points that deviate from the rest, such as outliers or an unusual spike or drop in a trend.
- Clusters reveal whether data points form distinct groups or segments, with similar properties and close proximity to each other.
- Seasonality refers to recurring patterns in the data, such as daily, weekly, or monthly fluctuations. Recognizing these patterns allows us to understand cyclical behaviors.
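For anomalies in particular, a very simple (and deliberately naive) first check is the z-score rule – a small sketch with made-up numbers:
import numpy as np
data = np.array([10, 12, 11, 13, 12, 95, 11, 12])  # one obvious outlier
z_scores = (data - data.mean()) / data.std()
print(data[np.abs(z_scores) > 2])  # values more than 2 standard deviations from the mean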
Why are these basics helpful?
Visualizations allow us to represent data in an intuitive way, making patterns understandable to a broader audience. Before applying Machine Learning models, it’s essential to understand the characteristics of our data. Conducting an Exploratory Data Analysis (EDA) and creating visualizations are the first steps to uncover insights and select the appropriate Machine Learning models.
5 – Final Thoughts
Mathematics is the foundation of data analysis, Data Science, Machine Learning, and Deep Learning. With recent advancements (e.g., drag-and-drop solutions, simplified libraries, etc.), many applications can now be used without deep mathematical expertise. However, having a basic understanding is invaluable.
And now, to answer the question about my Brilliant streak: I’ve already reached a streak of 140 (= 140 consecutive days of completing at least one lesson). Unfortunately, the app does not have a free version. However, in the long run, the investment of CHF 142 (~$160) is worth it to refresh knowledge daily in small, manageable doses. Plus, the app is perfect for filling short waiting times – whether on a train, at a restaurant, or before a meeting.

Continuous learning is crucial in this field. Other highly useful tools include Codecademy and DataCamp (no affiliate links).
Where can you continue learning?
- GeeksForGeeks – Choosing the right chart type
- FreeCodeCamp – Bayes’ Rule
- IBM YouTube-Channel – What is the Monte Carlo Simulation?
- Codecademy – .seed()
- GitHub University Course – Gradient Descent
- Medium – Master Bots Before Starting with AI Agents: Simple Steps to Create a Mastodon Bot with Python
- Medium – Who Does What in Data? A Practical Introduction to the Role of a Data Engineer & Data Scientist