Density-based algorithms

The pure apprehension of two density-based algorithms: DBSCAN and OPTICS

Zanfina Svirca
Towards Data Science

--

Image credit: Fabrice Jazbinsek

Clustering methods like partitional methods or hierarchical clusters are not effective in finding clusters of arbitrary shapes. The density-based clustering method is efficient in finding the clusters of arbitrary shapes also prevents outliers and noise.

Object clustering when using a density-based method is simply based on the concept of density. And depending on the density, different types of algorithms are created using this method, for example, if clusters are created by using the density of neighborhood objects then the DBSAN algorithm is used or if clusters are created according to a density function then DENCLUE is used. Whereas OPTICS is a density-based which generates an enhanced order of the data collection structure.

DBSCAN

DBSCAN estimates the density by counting the number of points in a fixed-radius neighborhood or ɛ and deem that two points are connected only if they lie within each other’s neighborhood. So this algorithm uses two parameters such as ɛ and MinPts. ɛ denotes the Eps-neighborhood of a point and MinPts denotes the minimum points in an Eps neighborhood. So ɛ and MinPts are parameters which are specified by the user

There are a few key points to keep in mind:

Core point: as the core point is taken that particular point which In Eps — neighborhood, has greater value than a precise number of points that is MinPts.

Border Point: as the border point is taken that particular point which In Eps — neighborhood, has less value than a precise number of points that is MinPts.

Noise Point: A point that does not come under in core or border is said to be a noise point.

Density points

Algorithm

Input: N objects to be clusters and global parameters ɛ and MinPts.
Output: Clusters of objects

1.Arbitrary select a point P.
2. Retrieve all point density reachable from P wrt ɛ and MinPts.
3.If P is a core point a cluster is formed.
4. If P is a border point, then there is no point that is density-reachable and DBSCAN moves to the next point.
5. This process is continued until all the points are processed.

Of course, there are some advantages and disadvantages of using DBSAN for example:

Advantage
- DBSCAN can find arbitrary shaped clusters using MinPts parameters
- The order of the point in the database is insensitive
- Handle noise and outliers
Disadvantage
- Can not perform well with large differences in densities
- Not suitable when various density involve

OPTICS

OPTICS works like an extension of DBSCAN. The only difference is that it does not assign cluster memberships but stores the order in which the points are processed. So for each object stores: Core distance and Reachability distance. Order Seeds is called the record which constructs the output order.

Core Distance: The minimum value of ɛ which is present in the ɛ-neighborhood of a P is a core distance. Of course, it’s needed to hold the minimum MinPts objects.

Reachability Distance: Reachability distance between p and q is defined as the least radius value that formulates p density reachable from q.

Algorithm

1. Randomly selects an unvisited point P
2. Selects all point’s density reachable from P w.r.t Eps, MinPts.
3. Assign core distance & reachability distance = NULL
4. If P is not a core point
5. Move next point in the order Seeds list
6. If P is a core point
7. For each object q, in the ɛ — neighborhood of P
8. UPDATE reachability distance from P
9. If q is unvisited INSERT q into Order Seeds
10. Until no object is unvisited

Note: The algorithm above is taken from ‘Review on Density-Based Clustering Algorithms for Big Data’.

Advantage
-
It does not require density parameters.
- The clustering order is useful to extract the basic clustering information.
Disadvantage
- It only produces a cluster ordering.
- It can’t handle high dimensional data.

Note: It is not good to think that only these two density-based algorithms that are explained above exist. There are other algorithms of this kind that are not mentioned here eg DENCLUE but the most popular is DBSCAN because it is considered the simplest of density-based methods.

Conclusion

This article was intended to give a brief explanation of density-based methods. So to sum up density-based methods meet these requirements :
- Produce nonspherical clusters which actually occurs only on spatial data. The term spatial data is used to express points, lines, and polygons. Clusters formed in spatial data clusters may have arbitrary shapes.
-The need for subsequent full efficiency in large size databases.
-The ability to detect and reduce noise an outliers.

OPTICS and DBSCAN are not the only algorithms that implement the density-based method. Even though I mentioned them, it does not mean that only these algorithms are required to be used. However, these are among the most popular and each type of density-based algorithm has its advantages and disadvantages, so before using it you need to look at the dataset, to understand the dataset first and see whether or not that type of algorithm fits your needs.

References

T.Miranda Lakshmi, R.Josephine Sahana, V.Prasanna Venkatesan. Review on Density-Based Clustering Algorithms for Big Data p.15, p.17

Data Clustering: Algorithms and Applications(Chapman & Hall/CRC Data Mining and Knowledge Discovery Series,p 113–117

--

--

A dedicated Software Developer, currently pursuing a degree in Computer Science. Blessed with an inquisitive mind and an eager desire to learn new things.