Local Outlier Factor (LOF) — Algorithm for outlier identification

Detection of the anomaly using LOF values

Published in

Towards Data Science

5 min readAug 31, 2020

LRD is inverse of the average reachability distance of A from its neighbors. Intuitively according to LRD formula, more the average reachability distance (i.e., neighbors are far from the point), less density of points are present around a particular point. This tells how far a point is from the nearest cluster of points. Low values of LRD implies that the closest cluster is far from the point.

5. LOCAL OUTLIER FACTOR (LOF)

LRD of each point is used to compare with the average LRD of its K neighbors. LOF is the ratio of the average LRD of the K neighbors of A to the LRD of A.

Intuitively, if the point is not an outlier (inlier), the ratio of average LRD of neighbors is approximately equal to the LRD of a point (because the density of a point and its neighbors are roughly equal). In that case, LOF is nearly equal to 1. On the other hand, if the point is an outlier, the LRD of a point is less than the average LRD of neighbors. Then LOF value will be high.

Generally, if LOF> 1, it is considered as an outlier, but that is not always true. Let’s say we know that we only have one outlier in the data, then we take the maximum LOF value among all the LOF values, and the point corresponding to the maximum LOF value will be considered as an outlier.

6. EXAMPLE

4 points: A(0,0), B(1,0), C(1,1) and D(0,3) and K=2. We will use LOF to detect one outlier among these 4 points.

Following the procedure discussed above:

First, calculate the K-distance, distance between each pair of points, and K-neighborhood of all the points with K=2. We will be using Manhattan distance as a measure of distance.

K-distance(A) –> since C is the 2ᴺᴰ nearest neighbor of A –> distance(A, C) =2
K-distance(B) –> since A, C are the 2ᴺᴰ nearest neighbor of B –> distance(B,C) OR distance(B,A) = 1
K-distance(C) –> since A is the 2ᴺᴰ nearest neighbor of C –> distance(C,A) =2
K-distance(D) –> since A,C are the 2ᴺᴰ nearest neighbor of D –> distance(D,A) or distance(D,C) =3

K-neighborhood (A) = {B,C} , ||N2(A)|| =2
K-neighborhood (B) = {A,C}, ||N2(B)|| =2
K-neighborhood (C)= {B,A}, ||N2(C)|| =2
K-neighborhood (D) = {A,C}, ||N2(D)|| =2

K-distance, the distance between each pair of points, and K-neighborhood will be used to calculate LRD.

Local reachability density (LRD) will be used to calculate the Local Outlier Factor (LOF).

Highest LOF among the four points is LOF(D). Therefore, D is an outlier.

7. ADVANTAGES OF LOF

A point will be considered as an outlier if it is at a small distance to the extremely dense cluster. The global approach may not consider that point as an outlier. But the LOF can effectively identify the local outliers.

8. DISADVANTAGES OF LOF

Since LOF is a ratio, it is tough to interpret. There is no specific threshold value above which a point is defined as an outlier. The identification of an outlier is dependent on the problem and the user.

9. CONCLUSION

Local outlier factor (LOF) values identify an outlier based on the local neighborhood. It gives better results than the global approach to find outliers. Since there is no threshold value of LOF, the selection of a point as an outlier is user-dependent.

10. REFERENCES

Breunig, M. M., Kriegel, H. P., Ng, R. T., and Sander, J. (2000). LOF: identifying density-based local outliers. In ACM sigmod record (Vol. 29, №2, pp. 93–104). ACM.

Local Outlier Factor (LOF) — Algorithm for outlier identification

Detection of the anomaly using LOF values

Table of Contents:

1. INTRODUCTION

2. K-DISTANCE AND K-NEIGHBORS

3. REACHABILITY DENSITY (RD)

4. LOCAL REACHABILITY DENSITY (LRD)