Local Outlier Factor (LOF) — Algorithm for outlier identification

Detection of the anomaly using LOF values

Vaibhav Jayaswal
Towards Data Science

--

Image by Markus Spiske, from Pexels

Table of Contents:

  1. INTRODUCTION
  2. K-DISTANCE AND K-NEIGHBORS
  3. REACHABILITY DISTANCE (RD)
  4. LOCAL REACHABILITY DISTANCE (LRD)
  5. LOCAL OUTLIER FACTOR (LOF)
  6. EXAMPLE
  7. ADVANTAGES OF LOF
  8. DISADVANTAGES OF LOF
  9. CONCLUSION
  10. REFERENCES

1. INTRODUCTION

An outlier is a data point that is different or far from the rest of the data points. The question that arises here is that can we identify the outliers present in the data?

Image by Pixabay, from Pexels

Local outlier factor (LOF) is an algorithm that identifies the outliers present in the dataset. But what does the local outlier mean?

When a point is considered as an outlier based on its local neighborhood, it is a local outlier. LOF will identify an outlier considering the density of the neighborhood. LOF performs well when the density of the data is not the same throughout the dataset.

To understand LOF, we have to learn a few concepts sequentially:

  • K-distance and K-neighbors
  • Reachability distance (RD)
  • Local reachability density (LRD)
  • Local Outlier Factor (LOF)

2. K-DISTANCE AND K-NEIGHBORS

K-distance is the distance between the point, and it’s Kᵗʰ nearest neighbor. K-neighbors denoted by Nₖ(A) includes a set of points that lie in or on the circle of radius K-distance. K-neighbors can be more than or equal to the value of K. How’s this possible?

We’ll see an example. Let’s say we have four points A, B, C, and D (shown below).

K-distance of A with K=2

If K=2, K-neighbors of A will be C, B, and D. Here, the value of K=2 but the ||N₂(A)|| = 3. Therefore, ||Nₖ(point)|| will always be greater than or equal to K.

3. REACHABILITY DENSITY (RD)

It is defined as the maximum of K-distance of Xj and the distance between Xi and Xj. The distance measure is problem-specific (Euclidean, Manhattan, etc.)

Illustration of reachability distance with K=2

In layman terms, if a point Xi lies within the K-neighbors of Xj, the reachability distance will be K-distance of Xj (blue line), else reachability distance will be the distance between Xi and Xj (orange line).

4. LOCAL REACHABILITY DENSITY (LRD)

LRD is inverse of the average reachability distance of A from its neighbors. Intuitively according to LRD formula, more the average reachability distance (i.e., neighbors are far from the point), less density of points are present around a particular point. This tells how far a point is from the nearest cluster of points. Low values of LRD implies that the closest cluster is far from the point.

5. LOCAL OUTLIER FACTOR (LOF)

LRD of each point is used to compare with the average LRD of its K neighbors. LOF is the ratio of the average LRD of the K neighbors of A to the LRD of A.

Intuitively, if the point is not an outlier (inlier), the ratio of average LRD of neighbors is approximately equal to the LRD of a point (because the density of a point and its neighbors are roughly equal). In that case, LOF is nearly equal to 1. On the other hand, if the point is an outlier, the LRD of a point is less than the average LRD of neighbors. Then LOF value will be high.

Generally, if LOF> 1, it is considered as an outlier, but that is not always true. Let’s say we know that we only have one outlier in the data, then we take the maximum LOF value among all the LOF values, and the point corresponding to the maximum LOF value will be considered as an outlier.

6. EXAMPLE

4 points: A(0,0), B(1,0), C(1,1) and D(0,3) and K=2. We will use LOF to detect one outlier among these 4 points.

A(0,0), B(1,0), C(1,1), and D(0,3)

Following the procedure discussed above:

First, calculate the K-distance, distance between each pair of points, and K-neighborhood of all the points with K=2. We will be using Manhattan distance as a measure of distance.

K-distance(A) –> since C is the 2ᴺᴰ nearest neighbor of A –> distance(A, C) =2
K-distance(B) –> since A, C are the 2ᴺᴰ nearest neighbor of B –> distance(B,C) OR distance(B,A) = 1
K-distance(C) –> since A is the 2ᴺᴰ nearest neighbor of C –> distance(C,A) =2
K-distance(D) –> since A,C are the 2ᴺᴰ nearest neighbor of D –> distance(D,A) or distance(D,C) =3

K-neighborhood (A) = {B,C} , ||N2(A)|| =2
K-neighborhood (B) = {A,C}, ||N2(B)|| =2
K-neighborhood (C)= {B,A}, ||N2(C)|| =2
K-neighborhood (D) = {A,C}, ||N2(D)|| =2

K-distance, the distance between each pair of points, and K-neighborhood will be used to calculate LRD.

LRD for each point A, B, C, and D

Local reachability density (LRD) will be used to calculate the Local Outlier Factor (LOF).

LOF for each point A, B, C, and D

Highest LOF among the four points is LOF(D). Therefore, D is an outlier.

7. ADVANTAGES OF LOF

A point will be considered as an outlier if it is at a small distance to the extremely dense cluster. The global approach may not consider that point as an outlier. But the LOF can effectively identify the local outliers.

8. DISADVANTAGES OF LOF

Since LOF is a ratio, it is tough to interpret. There is no specific threshold value above which a point is defined as an outlier. The identification of an outlier is dependent on the problem and the user.

9. CONCLUSION

Local outlier factor (LOF) values identify an outlier based on the local neighborhood. It gives better results than the global approach to find outliers. Since there is no threshold value of LOF, the selection of a point as an outlier is user-dependent.

10. REFERENCES

Breunig, M. M., Kriegel, H. P., Ng, R. T., and Sander, J. (2000). LOF: identifying density-based local outliers. In ACM sigmod record (Vol. 29, №2, pp. 93–104). ACM.

--

--