AI teams are frequently asked to generate insights from unstructured data. Consider, for example, customer state analysis applications (which customers are at risk of churning? Who is a potential upsell?). Commonly, unlabelled data will be the most available relevant resource (product usage, previous escalations, purchase records, etc.), and therefore the algorithms in use will be unsupervised ones (such as clustering the customers based on their raw usage matrices). The question then becomes how to process these algorithms' output (such as the projected customer clustering plane). A common solution is to rely on anomaly detection tools to highlight outliers worth observing. When we think of anomaly detection we mostly picture applications like time series outlier detection or identifying errors in repetitive visual patterns (like in the picture above). But anomaly detection is in fact much wider, including many other sub-domains. One of them is spatial anomaly detection, commonly used for such needs. Below are a few relevant anomaly detection techniques, followed by an example use case walkthrough.
Spatial outlier detection
Spatial outlier detection is commonly used for scenarios like traffic analysis. First, each point is compared to its own previous values and to the other points in the plane (in order to find global outliers, like blocked junctions). Second, each point is compared to its 'context' (in order to find 'local' outliers: ones which are normal compared to the whole population, but outliers compared to their neighbours). The challenge is deciding the bounds of the local context; where to draw the line between neighbours and non-neighbours. The motivation is not only to find issues like malfunctioning traffic lights (which will probably generate extreme values both compared to the overall distribution and compared to the same junction's previous records), but also to identify local anomalies, like junctions with sporadic traffic light issues or bad road design: normal overall, but strange compared to their local context, the neighbouring junctions.
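To make the global vs. local distinction concrete, here is a minimal sketch on synthetic junction data (all names, values and thresholds are illustrative) that scores each junction against the overall distribution, against its own history, and against its nearest neighbours:

```python
# A minimal sketch of the global vs. local comparison described above.
# `counts`, `history` and `coords` are synthetic stand-ins for per-junction
# traffic measurements and junction locations.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
coords = rng.uniform(0, 10, size=(200, 2))               # junction locations
history = rng.poisson(50, size=(200, 24))                # past hourly counts per junction
counts = history.mean(axis=1) + rng.normal(0, 3, 200)    # current reading

# Global outliers: compare each junction to the overall distribution.
global_z = (counts - counts.mean()) / counts.std()

# Temporal outliers: compare each junction to its own previous values.
temporal_z = (counts - history.mean(axis=1)) / (history.std(axis=1) + 1e-9)

# Local outliers: compare each junction to its k nearest neighbours.
# Choosing k is exactly the "where to draw the local context" question.
k = 8
nn = NearestNeighbors(n_neighbors=k + 1).fit(coords)
_, idx = nn.kneighbors(coords)                           # idx[:, 0] is the point itself
neighbour_counts = counts[idx[:, 1:]]                    # shape (n_junctions, k)
local_z = (counts - neighbour_counts.mean(axis=1)) / (neighbour_counts.std(axis=1) + 1e-9)

# A junction that is normal globally but extreme locally is a 'local' outlier.
local_only = np.where((np.abs(local_z) > 3) & (np.abs(global_z) < 2))[0]
```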
Multivariate anomaly detection
Multi-dimensional values can be normal according to each axis by itself, yet anomalous in the combined space (example in the picture below). Therefore it is important to take the variables' relationships into account as well. Two important phenomena to consider when analysing multivariate anomalies (Acuna and Rodriguez, 2004) are the Masking effect (outlier A masks outlier B when B is considered an outlier only after the deletion of A; this commonly happens when a few outliers skew the mean and the covariance towards them, making B look normal) and the Swamping effect (outlier A swamps point B when B is considered an outlier only in the presence of A; this commonly happens when a few outliers skew the mean and the covariance away from the non-outlier points, generating a high distance from the non-outlier points to the mean, which is reverted once the skewing points are deleted). In order to find multivariate outliers we commonly try to detect points which are relatively far from the centre of the data distribution. A common statistics-based approach is to use the Mahalanobis distance for that need: using the dataset's covariance and mean to estimate the probability that a point belongs to the dataset. A large Mahalanobis distance can indicate an outlier (commonly 'large' means more than 3 standard deviations away, or falling outside the 93%-95% range of the values). Masking can lower an outlier's distance and swamping can increase a non-outlier point's distance (Ben-Gal, 2005). Both can be mitigated by using the median instead of the mean, giving more weight to the non-outlier points. Other common approaches for multivariate anomaly detection look into the structure of the plane, relying on properties like the distance from a point to its n-th closest neighbour (high distance = outlier), or analysing the generated cluster groups (a cluster with a low number of points = outlier).
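A minimal sketch of the Mahalanobis-based approach on synthetic data, including the median-centred variant mentioned above (the chi-square cutoff is one common convention, not the only one):

```python
# Mahalanobis-based multivariate outlier scoring on synthetic 2D data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=500)
X = np.vstack([X, [[4, -4], [5, -5]]])        # two points that break the correlation

def mahalanobis_distances(X, center):
    cov = np.cov(X, rowvar=False)
    inv_cov = np.linalg.inv(cov)
    diff = X - center
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, inv_cov, diff))

d_mean = mahalanobis_distances(X, X.mean(axis=0))          # classic, mean-centred
d_median = mahalanobis_distances(X, np.median(X, axis=0))  # median-centred, more robust

# Flag points whose distance exceeds a chi-square based cutoff
# (the exact percentile is a tuning choice).
cutoff = np.sqrt(stats.chi2.ppf(0.95, df=X.shape[1]))
outliers = np.where(d_median > cutoff)[0]
```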

Back to the beginning: let's assume we want to analyse our customers' behaviour in order to find those worth attention. Below is a step-by-step script with common pitfalls to avoid and some pivots to consider.
Clustering + Anomaly Detection
Out of the many possible operational (unlabelled) metrics that will be available to start with, we will commonly choose product usage, assuming it to be a significant indicator of customer state (underuse can indicate a churn risk and overuse can indicate a potential upsell opportunity). Starting by plotting the customers' usage patterns will enable us to quickly (manually) identify outliers. Comparing aggregated values (median, max, etc.) instead of the raw values can enable deeper visibility, but at the cost of losing much of the data through the aggregation process (again, probably only extreme values will be visible). The common next step would be to use clustering in order to enable direct comparison of the usage values (raw time series). But in many cases, beyond detecting the obvious small anomalous clusters, the general clustering structure won't be very informative (most of the customers will land in the same mega-cluster). What we are truly looking for is a way to analyse the customers' plane, to identify how each customer deviates from the rest. This is where manifold methods come to our aid.
_An important note before we continue: clustering requires a distance method, and the default choice is commonly the Euclidean distance. The issue is that when comparing time series data it may pick up time zones as a significant distance indicator, clustering together customers from similar time zones (implicitly letting the weekend shift generate the largest share of the distance). Choose the distance method wisely, to make sure it reflects the way you want to analyse the data plane, and consider pre-processing the data to avoid such issues. One possible solution is to aggregate the raw values into a weekly normalised view prior to applying the clustering, as sketched below._
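A possible sketch of that pre-processing, assuming a hypothetical `usage` matrix of hourly counts per customer: fold it into a normalised average-week profile so the clustering distance compares usage shape rather than volume or time-zone dominated offsets:

```python
# Fold raw hourly usage into a normalised weekly profile before clustering.
# `usage` is a synthetic (customers x hours) stand-in matrix.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
usage = rng.poisson(5, size=(300, 24 * 7 * 8)).astype(float)    # 8 weeks of hourly data

hours_per_week = 24 * 7
weeks = usage.shape[1] // hours_per_week
weekly = usage[:, :weeks * hours_per_week].reshape(len(usage), weeks, hours_per_week)
profile = weekly.mean(axis=1)                                    # average week per customer

# Normalise each customer's profile so overall volume doesn't dominate the distance.
profile = profile / (profile.sum(axis=1, keepdims=True) + 1e-9)

labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(profile)
```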
Manifold + Anomaly Detection
Manifold methods can help project the customers' usage values into a lower dimensional plane. MDS (Multi-Dimensional Scaling) is one of many possible manifold methods, searching for 'a low-dimensional representation of the data in which the distances respect well the distances in the original high-dimensional space' (scikit-learn). Using it we could draw a 2D plane and visually (manually) identify customers which seem to deviate from the others. As MDS is known to be sensitive to extreme values (Blouvshtein and Cohen-Or, 2018), we could do it in phases: alert on the 1st outlier, remove it, and calculate MDS again to alert on the next one, implicitly enabling us to generate an outlier hierarchy. But since we don't want to rely on manual analysis to generate these insights, how can we automate that detection? The Mahalanobis distance can assist us here.
An important note before we continue: make sure to filter duplicate columns prior to this step. As MDS is commonly Euclidean distance based, all attributes share the same importance, and therefore dependent attributes will get a double impact (which is probably not what we want).
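A minimal sketch of the projection step using scikit-learn's MDS, including the duplicate-column filtering from the note (the data here is a synthetic stand-in; in practice `df` would hold the normalised usage profiles from the earlier sketch):

```python
# Project the customers' usage matrix onto a 2D plane with MDS.
import numpy as np
import pandas as pd
from sklearn.manifold import MDS

rng = np.random.default_rng(3)
df = pd.DataFrame(rng.poisson(5, size=(300, 168)).astype(float))   # customers x features
df = df.T.drop_duplicates().T                 # filter duplicate (identical) columns

plane = MDS(n_components=2, random_state=0).fit_transform(df.values)  # 2D plane to inspect
```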
Manifold + Mahalanobis + Anomaly Detection
As mentioned earlier, looking at the MDS generated 2D plane we could use the Mahalanobis distance to automatically highlight the points which are relatively far from the centre of the data distribution. Doing so iteratively (take the 1st outlier, remove it and calculate the Mahalanobis distance again) can help us avoid the swamping and masking phenomena, as well as give us a within-run (per MDS run) hierarchy. But looking at a typical output (like the picture below), the extreme outliers will dominate the algorithm's output; commonly these are the ones with some extreme raw values. What we need is a way to better scale (normalise) our data, in order to pay attention to the less visible outliers.
An important note before we continue: make sure to filter empty vectors (e.g. legacy customers) prior to this step. Empty vectors will affect the mean and covariance, which will later affect the Mahalanobis outputs, as it relies heavily on them.
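Putting the last two notes together, a possible sketch of the automated loop: filter empty vectors, project with MDS, then repeatedly flag and remove the single most extreme point by Mahalanobis distance to build an outlier hierarchy (function and variable names are illustrative):

```python
# Iterative MDS + Mahalanobis outlier ranking on synthetic usage data.
import numpy as np
from sklearn.manifold import MDS

def mahalanobis(X):
    center = X.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - center
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, inv_cov, diff))

def ranked_outliers(usage, n_outliers=5):
    # Filter empty (all-zero) vectors so they don't skew the mean / covariance.
    active = usage[usage.sum(axis=1) > 0]
    remaining = np.arange(len(active))
    hierarchy = []
    for _ in range(n_outliers):
        plane = MDS(n_components=2, random_state=0).fit_transform(active[remaining])
        worst = np.argmax(mahalanobis(plane))
        hierarchy.append(remaining[worst])        # index into the filtered data
        remaining = np.delete(remaining, worst)   # remove it and repeat
    return hierarchy

rng = np.random.default_rng(4)
usage = rng.poisson(5, size=(200, 168)).astype(float)
usage[:10] = 0                                    # a few empty (legacy) customers
top_outliers = ranked_outliers(usage, n_outliers=3)
```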

Manifold + Mahalanobis + Scaling + Anomaly Detection
MDS, being Euclidean distance based, is sensitive to the attributes' scale; for example, values in the range [1,100] will have a bigger effect on the distance calculation than ones in the range [0,1]. And as the MDS task is to preserve the original distances as well as possible while lowering the data dimensionality, the generated plane will be highly influenced by the higher-range attributes (as they will most likely dominate the large distances across the original plane). The most obvious immediate solution is to normalise the raw product usage values, to make sure they are on the same scale (as can be seen below). On the other hand, naive normalisation will generate distance measurements where all attributes share the same importance; what if some of the attributes are more important than others? (Even within a time series, extreme values on weekends may be more important than ones during the weekdays.) Moreover, in many cases we will have a few usage types (like logins and activity records), and handling them all together may be misleading (not comparing apples to apples). The solution should be to normalise while making sure to scale similar sub-attributes together.
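A minimal sketch of the naive scaling step, using min-max normalisation on two hypothetical metrics with very different ranges:

```python
# Min-max scale every attribute to [0, 1] before MDS, so no single
# high-range metric dominates the distance.
import numpy as np
from sklearn.manifold import MDS
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(5)
logins = rng.integers(0, 5, size=(300, 1)).astype(float)       # small-range metric
actions = rng.integers(0, 1000, size=(300, 1)).astype(float)   # large-range metric
raw_usage = np.hstack([logins, actions])

scaled = MinMaxScaler().fit_transform(raw_usage)    # every column now in [0, 1]
plane = MDS(n_components=2, random_state=0).fit_transform(scaled)
```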

Manifold + Mahalanobis + Sub-Scaling + Anomaly Detection
In order to preserve the per-domain and per-time-frame uniqueness, a possible solution is to apply sub-MDS normalisation, dividing the process into steps: first calculate a 1D MDS per data type (we will still face the scale issue, but that's fine, since all vectors are of the same type and we want extreme-scale outliers to remain that way). Then normalise each sub-MDS score to a [0,1] scale (now all values are comparable, representing points in the same space). Finally, generate an MDS on the joined table of the normalised sub-type MDS values. The result is a smoother projection which preserves the hints of each sub-domain analysis.
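A possible sketch of the sub-scaling pipeline, assuming a hypothetical dictionary of per-type usage matrices:

```python
# Sub-MDS normalisation: 1D MDS per usage type, rescale each score to [0, 1],
# then a final MDS over the joined scores.
import numpy as np
from sklearn.manifold import MDS
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(6)
usage_by_type = {
    'logins': rng.poisson(3, size=(300, 168)).astype(float),
    'actions': rng.poisson(40, size=(300, 168)).astype(float),
}

sub_scores = []
for name, matrix in usage_by_type.items():
    # Step 1: 1D MDS per data type (within-type scale issues are acceptable).
    score = MDS(n_components=1, random_state=0).fit_transform(matrix)
    # Step 2: normalise each sub-score to [0, 1] so the types become comparable.
    sub_scores.append(MinMaxScaler().fit_transform(score))

# Step 3: final MDS on the joined, normalised sub-type scores.
joined = np.hstack(sub_scores)                       # customers x n_types
plane = MDS(n_components=2, random_state=0).fit_transform(joined)
```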

Manifold + Clustering + Sub-Scaling + Anomaly Detection
Commonly we would like not only to highlight the outliers but also to mark their potential clusters. The EM (Expectation Maximization) clustering algorithm can help us continue our automated analysis, as it can automatically find a fitting number of clusters (while algorithms like kMeans require us to use techniques like the Elbow method to decide on it). As EM is also Euclidean distance based, it makes sense to apply it after the sub-scaling phase, to make sure it pays more attention to the distances between sub-types than to the distances within sub-types (the kind of distance which may not be comparable, like raw login values vs. raw action values).
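A minimal sketch of this last step: scikit-learn's GaussianMixture (EM) still takes the number of components as an input, so here it is selected by BIC as one common approximation of the 'automatic' behaviour; `plane` is a synthetic stand-in for the sub-scaled 2D projection from the previous sketch:

```python
# EM (Gaussian mixture) clustering on the sub-scaled plane, with the number
# of components chosen by BIC; small clusters are flagged as anomalous groups.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(7)
plane = rng.normal(size=(300, 2))                   # stand-in for the sub-scaled MDS plane

candidates = [GaussianMixture(n_components=k, random_state=0).fit(plane)
              for k in range(1, 9)]
best = min(candidates, key=lambda gm: gm.bic(plane))   # pick the model BIC prefers

labels = best.predict(plane)
cluster_sizes = np.bincount(labels)
# Small clusters are the anomalous groups worth a closer look.
suspicious_clusters = np.where(cluster_sizes < 0.05 * len(plane))[0]
```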
Tune, adjust and monitor
It’s important to note that many of the described choices highly depend on the data in hand; if for example only a single usage type is available then sub scaling won’t be that important. For some cases we would like to replace the Euclidean distance with other distance methods or maybe to apply some data preparations (like to smooth the raw values using windowing) prior to calculating the distances. Make sure to explore and trick, plug and play, to make sure you find the configuration that fits you best. It’s also important to make sure you keep the generated insights feedback, assuming it could be your next development labelled data, enabling a pivot towards a supervised model.