
Finding outliers is technical, but understanding outliers is a mix of skill and art. As data scientists, we tend to "remove" outliers because they mess up a predictive model, and this practice has led to the wrong conclusion that an outlier is something bad that must be removed.
Instead, the focus should be on understanding the outliers. In this story, I will explain some techniques for understanding outlier data points better.
We will take a dataset of cars which has columns such as make, fuel-type, aspiration, num-of-doors, price, etc.

Basic and essential – a Box-plot, but with a jitter
One of the simplest ways to understand outliers for individual columns is by using a box-plot. There are two variants of box-plot – the standard one and the slightly fancy one. First, let us look at the standard box plot as shown here:

The box-plot is a great tool for showing extreme values. Values flagged as extreme fall outside the left and right vertical lines (the ends of the whiskers) and are drawn as individual dots. As you can see, there are some very high price values, indicated by the dots on the right-hand side. So we now know that the extreme price values start at around 30000.
However, beware that a box plot might be telling only part of the truth.
What is missing in the standard box-plot is an overall picture of the other data points. This is where a fancy box-plot with jitter is useful. A box-plot with jitter, shown below, is a box plot with all data points overlaid. To keep points with the same value from hiding each other, we add a small random jitter so that all data points are visible.

There are two interesting observations in the box-plot with jitter that were missing in the standard box plot. First, because you can see all the other data points, you may also want to treat some points in the price range between 25000 and 30000 as outliers. They have not crossed the right vertical line, but they are still a rare and sparse bunch of data points.
Second, we can also see points lying exactly on the extreme vertical lines, such as the data points on the extreme left with zero price. A standard box plot does not show such points as outliers because they lie exactly on the limit. With a box-plot with jitter, however, we can see these points. And clearly, a car with zero price is an outlier that indicates some issue with the data.
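To make the whisker limits concrete, here is a minimal sketch of the common 1.5 × IQR rule that many box-plot implementations use, together with a small random jitter for plotting. The price list, the helper names `tukey_fences` and `jitter`, and the jitter width are all hypothetical and are not taken from the car dataset.

```python
import random
import statistics


def tukey_fences(values, k=1.5):
    """Return (lower, upper) limits: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def jitter(n, width=0.2, seed=0):
    """Return n small random offsets used to spread overlapping points."""
    rng = random.Random(seed)
    return [rng.uniform(-width, width) for _ in range(n)]


# Hypothetical prices, only for illustration
prices = [0, 5000, 6000, 7000, 8000, 9000, 10000, 12000, 15000, 26000, 36880]

lower, upper = tukey_fences(prices)
flagged = [p for p in prices if p < lower or p > upper]

# One offset per point; plot each price at (price, offset) over the box
offsets = jitter(len(prices))

print("limits:", lower, upper)
print("flagged by the box-plot rule:", flagged)
```

In this made-up sample the zero price stays inside the lower limit, so the rule alone does not flag it, even though it is obviously suspicious. That is the kind of point the jittered view helps you spot.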
Going Multi-dimensional
One of the biggest mistakes a data scientist can make is to judge a data point as an outlier based on a single dimension.
Let’s take one example: a BMW sedan with an engine size of 209, 182 horsepower and a price of 36880 was flagged as an outlier by the box-plot. That is not a very smart way to conclude that it is an outlier. A BMW with those characteristics is bound to be expensive and should not be treated as an outlier.
This is why outlier analysis should be done on multiple dimensions. In our case, we should decide whether a data point is an outlier by taking all dimensions into account: brand, fuel-type, aspiration, number of doors, horsepower, length, height, price, etc. Various algorithms can help detect outliers in a multi-column setting.
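One such algorithm is Isolation Forest. The sketch below shows how it could be run with scikit-learn on several numeric columns at once. The data is synthetic (a stand-in for columns like engine size, horsepower and price), and the `contamination` value is an assumption, not a setting from this story. Categorical columns such as make or fuel-type would need to be encoded as numbers first.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic stand-in for three numeric car columns:
# engine size, horsepower, price (not the real dataset)
X = rng.normal(loc=[120, 100, 12000], scale=[20, 15, 3000], size=(200, 3))

# Add one deliberately inconsistent row: huge engine, low power, huge price
X = np.vstack([X, [[300, 50, 90000]]])

# contamination is the assumed share of outliers in the data
model = IsolationForest(n_estimators=100, contamination=0.05, random_state=0)
labels = model.fit_predict(X)      # -1 = outlier, 1 = inlier
scores = model.score_samples(X)    # lower score = more anomalous

print("number of flagged rows:", int((labels == -1).sum()))
print("label of the inconsistent row:", labels[-1])
```

The point of scoring all columns together is that a row is judged by its combination of values, not by one column in isolation.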
However, the difficulty is in interpreting why a point is an outlier. Earlier, when we were looking at the box-plot, we had only one column, price, to make sense of. But now, with multiple dimensions, if a data point is an outlier, how should one interpret the result? Let us look at techniques that help us interpret outlier results in a multi-dimensional setting.
Making a dimension reduction plot and adding a nice tool-tip
To interpret results, we first need to visualize them. Though we have many columns in the data, our eyes can only see and interpret 2 or at most 3 dimensions. Fortunately, dimensionality reduction techniques such as PCA (principal component analysis) come in handy, as they reduce many dimensions to a few while keeping as much of the variation in the data as possible.
Let’s say we have run an algorithm such as Isolation Forest on our multi-dimensional car dataset and determined the outliers. We also reduce all columns to just 2 columns using the dimensionality reduction technique. We then plot a 2D scatter-plot and mark the outliers with a separate color. The result of this awesome "data-sciency" work would look something like the plot below.
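As a rough sketch of the reduction step, the snippet below standardizes the columns and projects them onto 2 principal components with scikit-learn. The input is random synthetic data standing in for the numeric car columns, so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic stand-in: 150 rows, 6 numeric columns (not the real dataset)
X = rng.normal(size=(150, 6))

# Put all columns on the same scale so that a large-valued column
# such as price does not dominate the components
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
coords = pca.fit_transform(X_scaled)   # shape (150, 2): x and y for the plot

# Share of the total variance kept by the 2 components
retained = pca.explained_variance_ratio_.sum()
print("variance retained by 2 components:", round(float(retained), 3))
```

The two columns of `coords` can then be used as the x and y positions of the scatter plot, with each point colored by its outlier label. Checking `retained` is a quick way to see how much of the data the 2D picture actually represents.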

We can make the visualization more interactive using a tooltip. Tooltips give us more information about an outlier when we hover over it. Shown below is how this looks. We are trying to find out more about the bunch of dark red points (outliers) at the bottom of the scatter plot.

As you hover over the red points at the bottom, you will see that most of them correspond to the Mercedes-Benz brand and a price of more than 20000. This is very useful information: some of our outliers seem to be high-end cars. That is already a big step in understanding the outliers.
Taking it to the next level with agglomerative clustering and a dendrogram
The technique shown above is a good way to visualize a high-dimensional dataset. However, it has two important shortcomings. First, the red points are scattered all over, so it is difficult to tell whether the outliers follow any pattern. Second, not all outliers may be visible, as some may be hidden behind other very close points. Of course, we can inspect an outlier by hovering over it, but with some outlier points hidden and no clear pattern, we might not feel very confident about our conclusions.
Here is where agglomerative clustering and a dendrogram can do wonders. Agglomerative clustering is a bottom-up clustering technique: it starts with every data point as its own cluster and repeatedly merges the clusters that are closest to each other. A dendrogram visualizes the resulting clusters as a tree, in such a way that all data points (called leaves) are visible and nicely arranged at the same depth. So what you get is data points clustered together as well as visibility of all points. Pure magic.
Shown below is the result of agglomerative clustering and a dendrogram on the car data. To add a touch of beauty, you can give the outlier points a different color from the non-outlier points.
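A minimal sketch of this step with SciPy could look like the following. The data and the `is_outlier` flags are synthetic and hypothetical, and Ward linkage is just one possible choice of merge rule; the story does not say which one was used.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(2)

# Synthetic stand-in: 20 "normal" rows and 5 rows far away from them
X = np.vstack([
    rng.normal(0, 1, size=(20, 4)),
    rng.normal(8, 1, size=(5, 4)),
])

# Hypothetical outlier flags, e.g. produced earlier by an outlier detector
is_outlier = np.array([False] * 20 + [True] * 5)

# Bottom-up (agglomerative) clustering; Z records every merge
Z = linkage(X, method="ward")

# no_plot=True only computes the layout, including the leaf order
tree = dendrogram(Z, no_plot=True)
leaf_order = tree["leaves"]   # row indices, left to right in the tree

# One color per leaf, in the order the leaves appear in the dendrogram
leaf_colors = ["red" if is_outlier[i] else "grey" for i in leaf_order]
print(leaf_colors)
```

Because every row is a leaf, no point can hide behind another, and rows that were merged early end up next to each other. In this toy data the red leaves form one contiguous block, which is what a cluster of outliers looks like in the tree.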

We can see that there is a strong cluster of outliers, with many red points in the same sub-tree, and all points are nicely arranged and visible. This means that we have identified not only outliers but also a possible pattern or cluster of outliers! Now things are getting really interesting.
Now we use the tooltip technique to understand what this big red cluster is about. Shown below is the tooltip when you hover over these points.

As you can observe, this big red cluster has mostly two brands. One is Mercedes-Benz diesel with prices above 20K, and the other is Peugeot diesel with prices above 17K. This is a significant step forward compared to the PCA + scatter plot technique, as we can see all outlier points with nothing hidden or overlapping. This helps us gain confidence in our outlier analysis. What's more, we have even found an additional pattern beyond the Mercedes-Benz one found with the PCA + scatter plot technique.
Mixing sophistication and style – Adding a Heatmap to the Dendrogram
Ok, now we go up one level of sophistication to really nail our understanding of the outliers. The dendrogram and the tooltip above already gave a very good understanding of our outliers. However, we still have to manually look for the common values in the tooltips to understand the meaning of the outlier cluster. This is where a heatmap can be very handy, as it visually singles out the fields with similar values.

Also, you can hover over the heatmap cells to find values that commonly occur in our outlier cluster, as shown in the animation below.
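One way to prepare the data behind such a heatmap is sketched below: scale every column to the range 0 to 1, then reorder the rows to match the dendrogram's leaf order, so that rows in the same sub-tree sit next to each other. The data is synthetic, and the min-max scaling and Ward linkage are assumptions made for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage

rng = np.random.default_rng(3)

# Synthetic stand-in: 12 "normal" rows and 4 rows with unusual values
X = np.vstack([
    rng.normal(0, 1, size=(12, 3)),
    rng.normal(6, 1, size=(4, 3)),
])

# Min-max scale each column so that all columns share one color scale
col_min = X.min(axis=0)
col_max = X.max(axis=0)
X_scaled = (X - col_min) / (col_max - col_min)

# Row order taken from the dendrogram (left-to-right leaf order)
order = leaves_list(linkage(X_scaled, method="ward"))

# Rows of the heatmap, aligned with the dendrogram leaves
heat = X_scaled[order]
print(heat.round(2))
```

Plotted as colored cells next to the dendrogram, `heat` shows blocks of similar color wherever neighboring rows share similar values in a column, which is the visual cue described above.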

In summary, finding outliers is more or less technical work. However, understanding outliers is a mix of science and art. The process of interpreting an outlier is more interesting and important than finding it. With good use of data science algorithms and visualization techniques, as explained in this story, you should be able to confidently interpret an outlier.
Additional resources
Website
You can visit my website to do analytics with zero coding: https://experiencedatascience.com
YouTube channel
Here is a link to my YouTube channel: https://www.youtube.com/c/DataScienceDemonstrated