The data science field is vast and complex, often lacking clear-cut answers. While seeking to resolve doubts and learn new concepts online, I’ve come across numerous low-quality, error-ridden answers, some surprisingly well-received despite fundamental misunderstandings. To help others navigate these pitfalls, I’m starting a series to share mistakes found in online content (some of which may be mistakes I made myself in the past).
In this article, I will share 4 such examples, together with a counterexample for each of them to disprove those statements. For Part 1, these examples will centre around basic machine learning and statistics concepts.
The examples will be structured in this way:
Mistake X: <Wrong Statement>
<Why it is wrong>
Mistake 1: In Linear Regression, one of the assumptions is the target Y must be normally distributed
This statement is incomplete; it should be:
"In Linear Regression (LR), one of the assumptions is the target Y conditional on X must be normally distributed"
Let’s recall the definition of LR, albeit in its simplest form: the target Y is estimated as a linear combination of p predictors:

Y = β0 + β1·X1 + β2·X2 + ... + βp·Xp + ε    (Equation 1)

where ε is the error term.
When modelling a dataset with linear regression, some assumptions are made:
(1) [Linearity] There is a true linear relationship between the target Y and the predictors
(2) [Homoscedasticity] The errors have a constant variance across all X values
(3) [Independence] The errors are independent of one another
(4) [Exogeneity] The errors are not correlated with the predictors
(5) [Normality] The errors conditioned on the predictors follow a normal distribution
(6) [No multicollinearity] No two predictors are highly correlated with each other
Using assumption (5) and Equation 1 above, we can derive the conditional distribution of target Y given predictors X:

Y | X ~ N(β0 + β1·X1 + ... + βp·Xp, σ²)
Note: the assumptions are made about the error terms, not the residuals (which are estimates of the error terms).
Notice that Y | X is normally distributed. But does this mean that the unconditional distribution of Y is also normal? Let’s give a counterexample to show this is false.
Example
Let’s sample X i.i.d. from a uniform distribution:

X_i ~ Uniform(0, 100)
and generate Y using the population regression model:

Y_i = 3 + 5·X_i + ε_i,  where ε_i ~ N(0, 1)
Does this obey the assumptions of the LR model?
(1) [Linearity] There is a true linear relationship between the target Y and the predictors
Yes. Y is generated as a linear function of X plus noise.
(2) [Homoscedasticity] The errors have a constant variance across all X values
Yes. The errors have a constant variance of 1.
(3) [Independence] The errors are independent of one another
Yes. Each ε_i is drawn independently, so the errors are independent of one another.
(4) [Exogeneity] The errors are not correlated with the predictors
Yes. The errors are generated independently of X, so E[ε | X] = E[ε] = 0.
(5) [Normality] The errors conditioned on the predictors follow a normal distribution
Yes. The errors follow N(0, 1) regardless of X.
(6) [No multicollinearity] No two predictors are highly correlated with each other
Not applicable, since in the above example there is only one predictor variable.
The model adheres to all the assumptions of LR!
Plotting the best-fit line together with its residuals and the target distribution:

[Figure 1: best-fit line, residuals vs X, and the distribution of Y]
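For readers who want to reproduce a figure like this, here is a minimal sketch using matplotlib; the layout and styling are illustrative choices of mine, and I use a smaller sample so the scatter plots stay readable:

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
N = 10000
X = np.random.uniform(low=0, high=100, size=N)
Y = 3 + 5*X + np.random.normal(loc=0, scale=1, size=N)

# Fit a simple OLS line and compute the residuals
slope, intercept = np.polyfit(X, Y, deg=1)
residuals = Y - (intercept + slope*X)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
axes[0].scatter(X, Y, s=1, alpha=0.3)
axes[0].plot(np.sort(X), intercept + slope*np.sort(X), color="red")
axes[0].set_title("Best-fit line")
axes[1].scatter(X, residuals, s=1, alpha=0.3)
axes[1].axhline(0, color="red")
axes[1].set_title("Residuals vs X")
axes[2].hist(Y, bins=100)
axes[2].set_title("Distribution of Y")
plt.show()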
Let’s check with the Shapiro-Wilk test (a statistical hypothesis test whose null hypothesis is that the sample comes from a normal distribution) that Y does not follow a normal distribution:

import numpy as np
from scipy.stats import shapiro

np.random.seed(42)

# Generate data from the population regression model Y = 3 + 5X + epsilon
N = 1000000
X = np.random.uniform(low=0, high=100, size=N)
epsilon = np.random.normal(loc=0, scale=1, size=N)
Y = 3 + 5*X + epsilon

# Shapiro-Wilk test: the null hypothesis is that the sample is normally
# distributed (SciPy warns that the p-value may not be accurate for
# N > 5000, but the conclusion here is unambiguous)
shapiro_stat, shapiro_p = shapiro(Y)
print(f"Test Statistic: {shapiro_stat}")
print(f"p-value: {shapiro_p}")
Test Statistic: 0.9550506543292717
p-value: 3.4572639724334247e-129
Observing Y’s distribution in Figure 1 (a platykurtic distribution) and finding that the p-value is extremely small (~0), we conclude that Y does not follow a normal distribution, which completes the counterexample.
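As a sanity check in the other direction, the same test applied to the residuals, which estimate the errors conditional on X, shows no evidence against normality. Here is a minimal sketch; the subsample size of 5,000 is my own choice, to stay within the range where SciPy’s Shapiro-Wilk p-value is well calibrated:

import numpy as np
from scipy.stats import shapiro

np.random.seed(42)
N = 1000000
X = np.random.uniform(low=0, high=100, size=N)
Y = 3 + 5*X + np.random.normal(loc=0, scale=1, size=N)

# Fit OLS and compute the residuals (estimates of the error terms)
slope, intercept = np.polyfit(X, Y, deg=1)
residuals = Y - (intercept + slope*X)

# Test a random subsample of the residuals for normality
idx = np.random.choice(N, size=5000, replace=False)
stat, p = shapiro(residuals[idx])
print(f"p-value: {p}")  # typically large: no evidence against normality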
Mistake 2: Skewness vs Kurtosis: In a distribution, Skewness measures the length of the tails, while Kurtosis measures the height of the peak
This is not true. Kurtosis measures how "heavy" the tails are, in other words, how much of the distribution’s mass sits in the tails, which need not correspond to how tall the peak is.
Example:
Let’s denote

X1 ~ Uniform(-1, 1)

and

X2 ~ N(0, 1)

[Figure 2: densities of X1 and X2 plotted side by side]
As we can see from Figure 2, the uniform distribution has a much higher peak, but since it has no outliers (and is therefore less tail-heavy), its kurtosis is lower than that of the standard normal distribution.
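We can verify this numerically; the sample sizes below are illustrative, and note that scipy.stats.kurtosis returns excess kurtosis, on which the normal distribution scores 0:

import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(42)
n = 1000000

x_unif = rng.uniform(low=-1, high=1, size=n)   # tall peak, no tails
x_norm = rng.normal(loc=0, scale=1, size=n)    # lower peak, heavier tails

# Excess kurtosis: Uniform is about -1.2, the normal is about 0
print(f"Uniform excess kurtosis: {kurtosis(x_unif):.2f}")
print(f"Normal excess kurtosis:  {kurtosis(x_norm):.2f}")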
Mistake 3: A function f(x) is convex because it has a single global minimum
The condition above is not sufficient to ensure convexity.
The term "convex function" comes up regularly in machine learning in the context of optimization algorithms such as gradient descent, which is one of the keys behind logistic regression.
Recall the definition of convexity (for simplicity I will only use univariate functions):

(1) f is convex if, for all x1, x2 and all t in [0, 1]:
f(t·x1 + (1 - t)·x2) ≤ t·f(x1) + (1 - t)·f(x2)

(2) Equivalently, for a twice-differentiable f, f is convex if and only if:
f''(x) ≥ 0 for all x
Let’s show a counterexample of why this is not sufficient. Consider the function:

f(x) = x^4 - x^3
Taking the first derivative:

f'(x) = 4x^3 - 3x^2
Taking the second derivative:

f''(x) = 12x^2 - 6x
Just by substituting x = 0.25:

f''(0.25) = 12(0.25)^2 - 6(0.25) = 0.75 - 1.5 = -0.75 < 0
Voila! This violates the 2nd definition, and therefore the function is not convex.
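We can also probe the 1st definition numerically. The sketch below is a heuristic grid check (the grid and tolerance are my own choices), not a proof:

import numpy as np

def is_convex_on_grid(f, xs, ts=np.linspace(0, 1, 11)):
    """Heuristically check f(t*x1 + (1-t)*x2) <= t*f(x1) + (1-t)*f(x2)."""
    for x1 in xs:
        for x2 in xs:
            for t in ts:
                if f(t*x1 + (1 - t)*x2) > t*f(x1) + (1 - t)*f(x2) + 1e-12:
                    return False  # found a violating triple
    return True

f = lambda x: x**4 - x**3
print(is_convex_on_grid(f, np.linspace(-1, 2, 41)))               # False
print(is_convex_on_grid(lambda x: x**2, np.linspace(-1, 2, 41)))  # True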
Let’s plot the function and see:

[Plot of f(x) = x^4 - x^3: a single global minimum at x = 0.75 and a flat inflection point at x = 0]
As we can see, the function does indeed have only one global minimum. However, the inflection point at x = 0 (or a saddle point, in the multivariate case) is also a stationary point where an optimization algorithm can converge, yielding a suboptimal solution.
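To make this concrete, here is a small sketch running vanilla gradient descent on f; the learning rate, iteration count, and starting points are my own illustrative choices:

def grad(x):
    return 4*x**3 - 3*x**2  # f'(x) for f(x) = x**4 - x**3

lr = 0.01
for x0 in (-0.5, 1.5):
    x = x0
    for _ in range(200000):
        x -= lr * grad(x)
    print(f"start = {x0:+.1f} -> converged to x = {x:.3f}")

# start = -0.5 -> x ~ 0.000 (stuck near the inflection point, where f' = 0)
# start = +1.5 -> x = 0.750 (the global minimum)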
Mistake 4: Definition of a 95% confidence interval for the parameter: There is a 95% chance that the parameter falls within this confidence interval
I have seen this flawed understanding of confidence intervals many times, even back when I was in college.
This misconception arises from treating the parameter as random when, in the frequentist view, the parameter is fixed and it is the interval that varies: different random samples yield different confidence intervals.
Correct Definition: A 95% confidence interval (for the parameter) means that if you
- draw N random samples from the population, and
- calculate the 95% confidence interval for the parameter from each sample,
you get N different confidence intervals, and on average about 0.95N of them will contain the true population parameter, as illustrated in the image below.
[Image: many simulated 95% confidence intervals; roughly 95% of them cover the true population parameter]
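The following sketch simulates this process; the population, sample size, and number of repetitions are my own illustrative choices. It builds a 95% t-interval for the mean of each sample and counts how often the true mean is covered:

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

true_mu, sigma = 10.0, 2.0  # population parameters (known here by construction)
n, n_samples = 50, 10000

covered = 0
for _ in range(n_samples):
    sample = rng.normal(loc=true_mu, scale=sigma, size=n)
    # 95% t-interval for the population mean
    low, high = stats.t.interval(0.95, df=n - 1,
                                 loc=sample.mean(), scale=stats.sem(sample))
    covered += (low <= true_mu <= high)

print(f"Coverage: {covered / n_samples:.3f}")  # close to 0.95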
Conclusion
In the ever-evolving field of Data Science, misunderstandings of basic concepts can cascade and affect one’s ability to absorb more advanced concepts later on. At the beginning of my data science journey, I made the mistake of blindly trusting online content, assuming it was correct. As I progressed and gained more experience, I started approaching such content with a different mindset: not blindly trusting it, but evaluating it critically with the help of other sources, so as to have higher confidence that the content I am absorbing is correct.
In data science, understanding the "why" behind each concept is just as important as knowing the "how" – it is very different from software engineering where problems are often more structured in nature.
This is the first article of the series; future parts will continue to explore and debunk errors and misconceptions in various aspects of data science.
As always, if you have any feedback after reading the article, feel free to comment!
All images, unless otherwise noted, are by the author.