
In this post, we explore in more detail the specific properties of spatial networks and use them to gain some insight into two popular Machine Learning algorithms, k-Nearest Neighbors and DBSCAN.

We start our exploration of Spatial Networks by focusing on their simplest example, the Random Geometric Graph (RGG). RGGs are constructed by simply placing nodes at random in some (say, 2D) space and adding an edge between any pair of nodes that lie within a certain distance d of each other.
While this construction might seem like the kind of stuff that only the purest of mathematicians might play with, it is a surprisingly useful graph structure with real-world practical applications in ad hoc networks, such as the ones used to connect drones in close proximity.

Amazon even holds a recent patent for this kind of application (you can read more over at the USPTO), and researchers at the University of Nevada Robotics Lab have proposed an algorithm for drones to coordinate search and rescue operations.
Random Geometric Graph
Without further ado, let us build our very own RGG. We start by randomly placing 100 nodes in the unit square:
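If you want to follow along, a minimal sketch of this step might look as follows (the seed and variable names are my own choices, fixed purely for reproducibility):

import numpy as np

# Place N nodes uniformly at random in the unit square
np.random.seed(42)
N = 100
pos = np.random.random((N, 2))  # pos[i] is the (x, y) position of node i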

The next step is to connect all the nodes that are within a certain distance, d. For this we compute the distance matrix containing the distance between all pairs of nodes.
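Using the positions generated above, one way to get all pairwise distances in a single call is SciPy's distance_matrix (the same helper used in the DBSCAN code later in this post):

from scipy.spatial import distance_matrix

# N x N matrix of Euclidean distances between all pairs of nodes
D = distance_matrix(pos, pos)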

Computing the full distance matrix is a bit wasteful as it requires O(N²) work. A more sophisticated approach would use KD-trees to avoid the quadratic cost, but that’s beyond the scope of this post as we’re just trying to get familiar with these concepts rather than develop a practical implementation.
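For the curious, a rough sketch of what the KD-tree approach might look like with SciPy (not used in the rest of this post; the radius value is arbitrary):

from scipy.spatial import cKDTree

tree = cKDTree(pos)
# All pairs of nodes within distance d of each other, without
# materializing the full N x N distance matrix
pairs = tree.query_pairs(r=0.2)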
Using this matrix we can easily compute the distribution of distances, P(d), and the cumulative distribution of distances, C(d), over all pairs of nodes.
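One possible way to estimate both distributions from the distance matrix (a sketch that uses the upper triangle so each pair is counted only once; bin count is arbitrary):

# Unique pairwise distances (upper triangle, excluding the diagonal)
dists = D[np.triu_indices_from(D, k=1)]

# P(d): histogram of pairwise distances
P, bin_edges = np.histogram(dists, bins=50, density=True)

# C(d): fraction of pairs closer than each bin edge
C = np.cumsum(P * np.diff(bin_edges))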

The distribution follows a quasi-bell-shaped curve (the real distribution is a bit more interesting). From the cumulative distribution, it becomes clear that the threshold d we choose for our RGG determines how many edges we will end up with.
In the GitHub repository, you can play around with an interactive tool to see how the distance matrix and graph change as you change the threshold:

If you’re more mathematically inclined, you can think of this thresholded distance matrix as a weighted adjacency matrix for the graph. We will have a lot more to say about adjacency matrices in future posts, but for now all you need to know is that an adjacency matrix is just a matrix where each non-zero element corresponds to an individual edge between two nodes.
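Putting these pieces together, here is a sketch of how the thresholded distance matrix might be turned into a NetworkX graph (the threshold value is an arbitrary choice for illustration):

import networkx as nx

d = 0.15  # distance threshold (illustrative value)

# Thresholded (weighted) adjacency matrix: keep only distances below d
A = D.copy()
A[A >= d] = 0

# Every non-zero entry corresponds to an edge between two nodes
G = nx.Graph()
G.add_nodes_from(range(N))
G.add_edges_from(zip(*np.nonzero(A)))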
k-Nearest Neighbors
The k-Nearest Neighbors algorithm is a well-known family of Machine Learning algorithms that can be used for classification or regression. The fundamental idea in both cases is fairly straightforward: if you want to know what the label or value of a new data point should be, you can consult its k nearest neighbors and simply go with the majority (or the average, in the regression case). The specific value of k is a hyperparameter that must be defined when training the model.
The connection with Random Geometric Graphs should now be clear. Instead of just connecting each node with every other node within a certain distance d we simply connect it with its k nearest nodes.
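A possible sketch of this construction, reusing the distance matrix from before (each row is argsorted; column 0 is the node itself, so we skip it):

k = 5

# For each node, the indices of all nodes sorted by distance; the closest
# entry is the node itself, so its k nearest neighbors are columns 1..k
neighbors = np.argsort(D, axis=1)[:, 1:k + 1]

# Directed graph: an edge from each node to each of its k nearest neighbors
G_knn = nx.DiGraph()
G_knn.add_nodes_from(range(N))
for i in range(N):
    for j in neighbors[i]:
        G_knn.add_edge(i, j)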
When we play this game and connect each of our previous nodes to its 5 nearest neighbors we obtain:

If you examine this matrix closely you’ll start noticing something… unusual. It is not symmetric! The easiest way to confirm this is to compare entries across the main diagonal, where you will occasionally notice a value on one side that isn’t present on the other. This seems like a pretty strange result. If our distance metric is symmetric, and the Euclidean distance certainly is, then how can the network not be symmetric? The reason can be made clear with a simple example.
Imagine that by some strange twist of fate, you find yourself completely alone and lost in the middle of the Sahara desert. The 5 people who at that precise moment are geographically closest to you might be in whatever the nearest city is, but they will all have many other people closer to them than you! You’re just a weird… outlier. One way to verify this intuition is to check the in-degree distribution of the nodes. You’ll find that, unlike the out-degree, where each node has exactly k outgoing edges, a node can have either more or fewer incoming edges depending on how close to the periphery of a cluster it is and how many outliers there are.
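A quick sketch of how one might check this on the directed k-NN graph built above:

from collections import Counter

# Every node has exactly k outgoing edges...
out_degrees = Counter(d for _, d in G_knn.out_degree())

# ...but the number of incoming edges varies from node to node
in_degrees = Counter(d for _, d in G_knn.in_degree())
print(out_degrees)  # e.g. Counter({5: 100})
print(in_degrees)   # spread around k, with outliers at the low end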
Here lies the fundamental insight that this graph perspective can give us about k-NN algorithms: nodes that are more central will be connected through many short-distance links, while outliers will be connected to nodes that are far away and have few or no incoming edges at all.
A visualization can drive this point home. Here we plot the directed graph corresponding to the distance matrix in the previous figure. If you look closely at the isolated node close to the center of the figure, you’ll notice that while it has 5 long outgoing edges, it has only one incoming edge due to its peripheral position… lost and alone in the middle of the Sahara.

Naturally, if we were to measure the distance distribution in this graph it would look very different from the one we saw before, since we severely restrict the allowed distances by keeping only the k shortest outgoing edges for each node.
DBSCAN
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is an unsupervised clustering algorithm that has the explicit goal of identifying clusters of arbitrary shapes in an efficient way. The original KDD paper outlines the algorithm as:
To find a cluster, DBSCAN starts with an arbitrary point p and retrieves all points density-reachable from p wrt. Eps and MinPts. If p is a core point, this procedure yields a cluster wrt. Eps and MinPts (see Lemma 2). If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database.
We can make this algorithm more intuitive by casting it as a network algorithm that explicitly takes advantage of the features of the RGG and k-NN approaches we discussed above. DBSCAN identifies clusters in a dataset by connecting (with directed edges) points (nodes) that lie less than a certain distance away from each other (as in RGGs). Peripheral and outlier nodes have fewer than k outgoing edges (as in k-NN). By contrast, any node that has k or more neighbors within the prescribed distance is said to be a "core node". Finally, the clusters in the data can be uniquely identified with the connected components of the resulting graph.
This approach of explicitly building out the neighborhood of each node is computationally intensive and requires O(N²) storage space, while more efficient implementations (such as the one in the original paper) can reduce memory and computational requirements significantly.
from scipy.spatial import distance_matrix
import networkx as nx
import numpy as np

def net_DBSCAN(X, eps=0.17, min_samples=5):
    # Build the neighborhood matrix and truncate it
    # at the maximum distance
    matrix = distance_matrix(X, X)
    matrix[matrix >= eps] = 0
    # Build the directed graph using the non-zero elements of the matrix
    G = nx.DiGraph()
    G.add_edges_from(np.asarray(np.nonzero(matrix)).T)
    # Create an undirected version of the network for ease of core computation
    G2 = G.to_undirected()
    # Find all core nodes: a core node has at least min_samples points
    # within eps, counting itself (hence the -1, since a node's degree
    # does not include the node itself)
    results = {}
    results['core_sample_indices_'] = [node_i
                                       for node_i, k in G2.degree()
                                       if k >= min_samples - 1]
    # Use the connected components to label each node as belonging
    # to a cluster; points with no neighbors keep the label -1
    labels = np.full(X.shape[0], -1)
    for label, comp in enumerate(nx.connected_components(G2)):
        for node in comp:
            labels[node] = label
    results['labels_'] = labels
    results['G'] = G
    # Identify the outlier (non-core) nodes
    results['non_core'] = set(np.arange(X.shape[0])) - set(results['core_sample_indices_'])
    return results
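To try it out, a hedged usage sketch, assuming scikit-learn's make_moons to generate the two moons data (the sample size and noise level are illustrative choices):

from sklearn.datasets import make_moons

# Two interleaving half-circles with a bit of noise
X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)

results = net_DBSCAN(X, eps=0.17, min_samples=5)
print(np.unique(results['labels_']))  # cluster labels (-1 marks isolated points)
print(len(results['non_core']))       # number of non-core points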
When we apply the code above to the two moons dataset, we obtain:

Here, blue and purple identify the two clusters in the dataset and the green squares are the ‘non-core’ nodes. A perfect match with the sklearn implementation:
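For reference, the corresponding scikit-learn call might look something like this (a sketch with the same eps and min_samples values and the same X as above):

from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.17, min_samples=5).fit(X)
# Compare the cluster labels and core samples with our network version
print(np.unique(db.labels_))
print(len(db.core_sample_indices_))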


We hope you enjoyed this way of looking at the venerable k-NN and DBSCAN algorithms and that you gained some new insights into how they work. You can find all the code for the analysis in this post in our companion GitHub repository: https://github.com/DataForScience/Graphs4Sci
And of course, don’t forget to sign up for my newsletter so that you never miss another post.