When I first started working on data science projects, I didn’t care about data visualization or outlier detection; I only cared about building cool models. But as soon as I started reading other data scientists’ code, I realized that data quality can matter even more than the model itself. So I started paying more attention to the exploratory data analysis (EDA) part, and I realized how foolish I had been.
Introduction
In almost any data science project, we have to understand the data we are going to work with. The data varies a lot from project to project, but the process is almost always the same:
- Read the data from the source (.csv, .xlsx, relational database…)
- Check the descriptive statistics of every column (mean, max, min, standard deviation, median…)
- Clean and wrangle the data (removing or filling NaN, inf and -inf values, dropping useless columns, creating new columns…)
- Visualize the data using different types of plots (bar plots, scatter plots, box plots…)
- Handle outliers (remove or transform them where possible)
- Create, train and test models
- Iterate on the best models with fine-tuning, feature selection and A/B testing
- Choose the best model and deploy it
Of course, explaining this whole process in detail would take ages, so in this article I’m going to focus on outlier detection, handling and visualization. But wait a minute… What are outliers, and why should I care about them?
According to Wikipedia, this is the definition of an outlier:
In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the data set. An outlier can cause serious problems in statistical analyses.
Some Outlier Theory
Now that we understand what an outlier is and why it’s important to handle outliers, we can start with some theory.
![Normal distribution and standard deviations [2]](https://towardsdatascience.com/wp-content/uploads/2021/06/1PCMZh7xGtjnSXwxyvuEzkg.png)
The image above shows a perfectly normally distributed data set. The mean sits at the center of the curve, 0 standard deviations away from itself, and according to the theory about 99.7% of the data points of a normally distributed data set lie within 3 standard deviations of the mean. As a rule of thumb, that means every value more than 3 standard deviations above or below the mean will be considered an outlier.
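The 99.7% figure is easy to check with a quick simulation. The sketch below draws samples from a standard normal distribution with NumPy and measures the share that falls within 1, 2 and 3 standard deviations of the mean. The sample size and the seed are arbitrary choices made for illustration.

```python
import numpy as np

# Simulated, normally distributed data (seed and size are arbitrary)
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=1_000_000)

# Distance of every sample from the mean, measured in standard deviations
z = (samples - samples.mean()) / samples.std()

# Share of the samples within k standard deviations of the mean
shares = {k: float(np.mean(np.abs(z) <= k)) for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.2%}")
```

The printed shares should land close to the 68%, 95% and 99.7% of the empirical rule, so only roughly 0.3% of normally distributed points fall outside the 3 standard deviation band.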
Visualizing Outliers with Python
A very helpful way of detecting outliers is to visualize them. The best type of graph for visualizing outliers is the box plot. But before visualizing anything, let’s load a data set:
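As a minimal sketch, the snippet below builds a small synthetic data frame with a right-skewed `MedInc` column so that the examples can run anywhere. The values are made up for illustration and are not the article's data; the `HouseAge` column, the distribution parameters and the seed are all assumptions. If the data set is the California housing data that ships with scikit-learn (a guess based on the `MedInc` column name), `fetch_california_housing(as_frame=True).frame` would be one way to load it instead.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (an assumption, not the original data set)
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "MedInc": rng.lognormal(mean=1.2, sigma=0.5, size=1_000),  # skewed, income-like values
    "HouseAge": rng.integers(1, 53, size=1_000),               # hypothetical second column
})

print(df.head())      # first rows of the data frame
print(df.describe())  # descriptive statistics for every column
```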

Box Plot
As I said before, when it comes to outlier visualization the box plot is the easiest way to get valuable information about your data’s outliers. But before visualizing any outliers, let’s understand what a box plot is and what its components are:
![Components of a box plot [3]](https://towardsdatascience.com/wp-content/uploads/2021/06/10MPDTLn8KoLApoFvI0P2vQ.png)
As we can see in the image above, a box plot has a lot of components and every one of them helps us to represent and understand the data:
- Q1. 25% of the data is below this data point.
- Median. The central value of the data set. It can also be represented as Q2. 50% of the data is below this data point.
- Q3. 75% of the data is below this data point.
- Minimum. The data point with the smallest value in the data set that isn’t an outlier.
- Maximum. The data point with the biggest value in the data set that isn’t an outlier.
- IQR. The interquartile range: the distance between Q1 and Q3 (Q3 − Q1). The middle 50% of the data lies inside it.
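The components above can be computed by hand. The sketch below does so for a synthetic `MedInc` series (made-up values, not the article's data) and assumes the common 1.5 × IQR convention for deciding which points count as outliers in a box plot, which is also matplotlib's default (`whis=1.5`).

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the MedInc column (an assumption)
rng = np.random.default_rng(42)
med_inc = pd.Series(rng.lognormal(mean=1.2, sigma=0.5, size=1_000), name="MedInc")

# Quartiles and interquartile range
q1, median, q3 = med_inc.quantile([0.25, 0.50, 0.75])
iqr = q3 - q1

# Fences under the 1.5 * IQR convention (assumed here)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# "Minimum" and "maximum" of the box plot: the extreme values that are not outliers
inside = med_inc[(med_inc >= lower_fence) & (med_inc <= upper_fence)]
whisker_min, whisker_max = inside.min(), inside.max()

# Points beyond the fences are the ones drawn as individual outliers
flagged = med_inc[(med_inc < lower_fence) | (med_inc > upper_fence)]

print(f"Q1={q1:.2f}, median={median:.2f}, Q3={q3:.2f}, IQR={iqr:.2f}")
print(f"minimum={whisker_min:.2f}, maximum={whisker_max:.2f}, outliers={len(flagged)}")
```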
Now that we understand all the components of a box plot, let’s visualize one for a given variable in our data set:
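One way to draw such a plot is with matplotlib's `boxplot`. This is a sketch on synthetic stand-in data (made-up values, not the article's data set), and the output file name is a hypothetical choice. The non-interactive `Agg` backend is selected only so that the script also runs on machines without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, works without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the MedInc column (an assumption)
rng = np.random.default_rng(42)
df = pd.DataFrame({"MedInc": rng.lognormal(mean=1.2, sigma=0.5, size=1_000)})

fig, ax = plt.subplots(figsize=(4, 6))
bp = ax.boxplot(df["MedInc"])  # default whiskers reach 1.5 * IQR beyond the box
ax.set_xticks([])
ax.set_ylabel("MedInc")
ax.set_title("Box plot of MedInc")

fig.savefig("medinc_boxplot.png", dpi=150)  # hypothetical output file name

# The individually drawn points ("fliers") are the values matplotlib treats as outliers
n_fliers = len(bp["fliers"][0].get_ydata())
print(f"{n_fliers} points are drawn as outliers")
```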

Wow! It looks like there are a lot of outlier data points in our data set’s MedInc variable. But what if we want to inspect these rows in a pandas data frame? How can we select only these rows?
In the following paragraphs, we are going to see how to detect outliers with Python from scratch and with the scipy package.
Detecting Outliers from Scratch
As we said in the theory section, all the data points more than 3 standard deviations above or below the mean are outliers. Let’s code this for the MedInc column (median income column):
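A minimal sketch of this rule with pandas, again on synthetic stand-in data (made-up values, not the article's data set). The variable names are illustrative choices.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the MedInc column (an assumption)
rng = np.random.default_rng(42)
df = pd.DataFrame({"MedInc": rng.lognormal(mean=1.2, sigma=0.5, size=1_000)})

mean = df["MedInc"].mean()
std = df["MedInc"].std()  # pandas uses the sample standard deviation (ddof=1)

# Limits at 3 standard deviations below and above the mean
lower_limit = mean - 3 * std
upper_limit = mean + 3 * std

# Boolean mask: True for rows outside the limits
is_outlier = (df["MedInc"] < lower_limit) | (df["MedInc"] > upper_limit)

outliers = df[is_outlier]         # only the outlier rows
df_no_outliers = df[~is_outlier]  # the data set without MedInc outliers

print(f"{len(outliers)} outliers out of {len(df)} rows")
```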
These would be the outliers:

This would be the data set if we removed all MedInc outliers:

Detecting Outliers with Scipy
There is an even easier way of detecting outliers. Thanks to the scipy package, we can calculate the z-score for any given variable. The z-score tells you how many standard deviations away from the mean a data point is. So, if the z-score is -1.8, our data point is 1.8 standard deviations below the mean. Let’s check the code:
If we display the data frames, the result is the same as with the from-scratch method we used earlier:
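A sketch of the z-score approach with `scipy.stats.zscore`, on synthetic stand-in data (made-up values, not the article's data set). One detail worth knowing: `zscore` uses the population standard deviation (`ddof=0`) by default, while pandas' `.std()` uses `ddof=1`, so on small data sets a point sitting right on the limit could be classified differently by the two methods.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the MedInc column (an assumption)
rng = np.random.default_rng(42)
df = pd.DataFrame({"MedInc": rng.lognormal(mean=1.2, sigma=0.5, size=1_000)})

# z-score of every value: (value - mean) / standard deviation, with ddof=0 by default
z_scores = stats.zscore(df["MedInc"].to_numpy())

# A point is an outlier when it is more than 3 standard deviations from the mean
is_outlier = np.abs(z_scores) > 3

outliers = df[is_outlier]
df_no_outliers = df[~is_outlier]

print(f"{len(outliers)} outliers out of {len(df)} rows")
```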


Handling Outliers
We have learnt how to detect and visualize outliers, but how do we handle them? There is no short answer to this question but I’ll try to be as brief as possible. The answer is that it depends a lot on the kind of project you’re doing:
If you’re doing an exploratory data analysis (EDA), it is possible that some of your insights will be wrong, because outliers can lead you to the wrong conclusions. To prevent this, you should analyze your outliers separately from the rest of the data and also repeat the analysis with the outliers removed. Once you finish this iterative process, your insights will be much more consistent.
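One simple way to do this comparison is to put the descriptive statistics of the three groups side by side. The sketch below uses synthetic stand-in data (made-up values, not the article's data set) and the 3 standard deviation rule from earlier; the column labels of the summary table are illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the MedInc column (an assumption)
rng = np.random.default_rng(42)
df = pd.DataFrame({"MedInc": rng.lognormal(mean=1.2, sigma=0.5, size=1_000)})

# Flag values more than 3 standard deviations away from the mean
mean, std = df["MedInc"].mean(), df["MedInc"].std()
is_outlier = (df["MedInc"] - mean).abs() > 3 * std

# Descriptive statistics for all rows, for the rows without outliers,
# and for the outliers on their own
summary = pd.DataFrame({
    "all rows": df["MedInc"].describe(),
    "without outliers": df.loc[~is_outlier, "MedInc"].describe(),
    "outliers only": df.loc[is_outlier, "MedInc"].describe(),
})

print(summary.round(2))
```

Comparing the mean and standard deviation across the columns shows how much a handful of extreme values moves the summary statistics.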
If you’re creating a machine learning model, outliers could make your model perform poorly. To prevent this from happening, you can try different things. Here are two examples:
- If you have a lot of data and very few outliers, you could try removing them and training your model on less data.
- If the outliers are caused by faulty measurements, for example in sensor-collected data, you could try replacing the outlier values with the mean.
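Both options can be sketched in a few lines of pandas. The example uses synthetic stand-in data (made-up values, not the article's data set). For the replacement it uses the mean of the non-outlier rows, which is one possible reading of "the mean"; using the mean of the whole column would be another.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the MedInc column (an assumption)
rng = np.random.default_rng(42)
df = pd.DataFrame({"MedInc": rng.lognormal(mean=1.2, sigma=0.5, size=1_000)})

# Flag values more than 3 standard deviations away from the mean
mean, std = df["MedInc"].mean(), df["MedInc"].std()
is_outlier = (df["MedInc"] - mean).abs() > 3 * std

# Option 1: drop the outlier rows
df_removed = df[~is_outlier].copy()

# Option 2: keep every row but overwrite the outlier values with the mean
# of the remaining values (a design choice made for this sketch)
clean_mean = df.loc[~is_outlier, "MedInc"].mean()
df_replaced = df.copy()
df_replaced.loc[is_outlier, "MedInc"] = clean_mean

print(f"rows after removing: {len(df_removed)}, rows after replacing: {len(df_replaced)}")
```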
Summary
In this article we have seen:
- The definition of an outlier
- Some theory about outliers and data distributions
- How to visualize outliers
- How to detect outliers
- How to handle outliers