
Exploring the Helium Network with Graph Theory

Using blockchain data to extract insights about network coverage and "suspicious" hotspots

Photo by Alina Grubnyak on Unsplash

Quick Primer on the Helium Network

This article is going to assume that you have some basic familiarity with the Helium Blockchain and specifically, its purpose-built work algorithm called Proof of Coverage. Essentially, each wireless radio in the network periodically broadcasts "beacons", which are just small packets of encrypted data. Any hotspot that "witnesses" (receives) this transmission reports it to the Blockchain in a receipt. This constant communication between nearby hotspots helps prove that adequate coverage is provided in a given geographic region, and miners are rewarded for providing this service in an honest way. I’d encourage you to read more in Helium’s documentation.

Dataset Curation

All the data you will see here was generated by making requests to the publicly-available Helium Blockchain API. Open-source ledgers provide fault-tolerance and auditability, but they are also fantastic datasets for data mining. Here are some example API calls that I implemented with the requests library in Python:
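Here is a minimal sketch of that pattern (the exact endpoint paths, pagination scheme, and response fields are assumptions, so double-check them against the API documentation before reusing this):

```python
import requests

API_BASE = "https://api.helium.io/v1"  # public Helium blockchain API (paths assumed)


def get_city_hotspots(city_id: str) -> list:
    """Fetch metadata for every hotspot registered in a given city."""
    url = f"{API_BASE}/cities/{city_id}/hotspots"
    hotspots, cursor = [], None
    while True:
        params = {"cursor": cursor} if cursor else {}
        resp = requests.get(url, params=params)
        resp.raise_for_status()
        payload = resp.json()
        hotspots.extend(payload["data"])
        cursor = payload.get("cursor")  # results are paged with a cursor token
        if cursor is None:
            return hotspots


def get_witnesses(hotspot_address: str) -> list:
    """Fetch the hotspots that recently witnessed beacons from this hotspot."""
    resp = requests.get(f"{API_BASE}/hotspots/{hotspot_address}/witnesses")
    resp.raise_for_status()
    return resp.json()["data"]
```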

What is a Graph?

Often, the way we choose to represent a dataset is as important as the data itself. Some inputs, like images and financial time series, lend themselves to well-defined representations, like arrays, vectors, and tensors. Since many ML models, like convolutional neural networks, rely heavily on linear algebra, we often prefer matrix-like data structures. However, for many complex applications, such as drug discovery, recommender systems, and of course, wireless networks, we would benefit from a bit less structure.

Graphs are a data structure defined by discrete points, called nodes, and expressions of adjacency between nodes, called edges. For example, you could generate the graph for a molecule, where nodes are individual atoms and edges are the bonds between them. For our purposes, we can think of the Helium Network as one giant graph, with 200,000+ nodes (hotspots) and millions of edges (witness paths). Fortunately, there are a number of open-source libraries that help with constructing and analyzing graphs; in this article, I will use the Python modules NetworkX and PyTorch Geometric.
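As a toy illustration of the data structure (the hotspot names and RSSI value here are made up):

```python
import networkx as nx

G = nx.Graph()
# Nodes are individual hotspots, keyed by address, with arbitrary feature dictionaries
G.add_node("hotspot-a", lat=40.79, lon=-77.86, gain=1.2)
G.add_node("hotspot-b", lat=40.80, lon=-77.85, gain=3.0)
# An edge means hotspot-b witnessed a beacon from hotspot-a, weighted by signal strength
G.add_edge("hotspot-a", "hotspot-b", rssi=-105)
```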

An example of witnesses for a given Helium hotspot (image by author).

Generating the Helium Network Graph for a City

The following code snippet defines the CityGraph class, which we will use to generate the NetworkX graph for a given city. First, we create a node for each hotspot and include features like their location, elevation, antenna gain, and recent earnings. The index for each node is its unique hotspot address. Next, we loop through all the recent witness receipts for each hotspot and create edges between that node and its unique witnesses. We can even apply weights to the edges, such as metrics about the strength of the signal (e.g. RSSI or SNR) or the reported distance traveled by the packet.
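Here is a simplified sketch of that class (the API helpers behave like the examples above, and the exact field and attribute names are assumptions):

```python
import networkx as nx


class CityGraph:
    """Builds a NetworkX graph of the Helium hotspots in a single city."""

    def __init__(self, city_id: str):
        self.city_id = city_id
        self.graph = nx.Graph()

    def build(self) -> nx.Graph:
        hotspots = get_city_hotspots(self.city_id)  # API helpers sketched earlier
        # One node per hotspot, indexed by its unique address
        for h in hotspots:
            self.graph.add_node(
                h["address"],
                lat=h["lat"],
                lng=h["lng"],
                elevation=h["elevation"],
                gain=h["gain"],
                reward_scale=h.get("reward_scale"),
            )
        # One edge per (hotspot, witness) pair, weighted by signal metrics
        for h in hotspots:
            for w in get_witnesses(h["address"]):
                if w["address"] in self.graph:
                    self.graph.add_edge(
                        h["address"],
                        w["address"],
                        rssi=w.get("rssi"),  # field names are assumptions
                        snr=w.get("snr"),
                    )
        return self.graph


# Build one city and lay the nodes out by their coordinates (the city id is hypothetical)
G = CityGraph("state-college-pennsylvania").build()
pos = {n: (G.nodes[n]["lng"], G.nodes[n]["lat"]) for n in G.nodes}
nx.draw(G, pos=pos, node_size=20, width=0.5)
```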

We can then visualize the graphs with the nx.draw() function. Here is an example of the Helium Network’s presence in State College, PA:

The graph for the Helium Network's presence in State College, PA. Nodes represent individual hotspots and edges represent witness paths (i.e. hotspots that can "hear" each other). "Lone wolves" cannot prove their coverage and thus, earn minimal rewards (image by author).

Feature Extraction: PageRank & Betweenness Centrality

Once we have this representation, we can leverage decades of research in Graph Theory to extract insights about our dataset. Algorithms that were originally developed for completely different domains may even be applicable to our analysis of wireless networks. For example, the PageRank metric was designed (in 1999!) as a way for search engines to find the most relevant websites for a given query. In these citation graphs, individual sites are nodes and hyperlinks between sites express adjacency. The most influential websites are those that users end up navigating to most often. Does this have any relevance to an IoT network? Let's take a look at the PageRank values for the Helium Network in Pittsburgh using plotly.
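Computing and plotting the scores only takes a few lines (a sketch; the coordinates come from the lat/lng attributes stored on each node):

```python
import networkx as nx
import plotly.graph_objects as go

pagerank = nx.pagerank(G)  # G is the Pittsburgh city graph built earlier

fig = go.Figure(
    go.Scattermapbox(
        lat=[G.nodes[n]["lat"] for n in G.nodes],
        lon=[G.nodes[n]["lng"] for n in G.nodes],
        mode="markers",
        text=list(G.nodes),
        marker=dict(
            size=9,
            color=[pagerank[n] for n in G.nodes],  # color each hotspot by its PageRank score
            colorscale="Viridis",
            showscale=True,
        ),
    )
)
fig.update_layout(mapbox_style="open-street-map")
fig.show()
```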

The Helium Network presence in Pittsburgh, colored by PageRank score (image by author).

The bright yellow nodes are those with the highest PageRank scores, while dark blue points have low scores. Having a lot of witnesses certainly helps your PageRank value, but it is not the only factor. The algorithm also takes into account the relative influence of any connected nodes, as that is an indicator of how well traffic will flow toward that hub (e.g. traffic coming from a high-value site, like CNN.com, is weighted more heavily than traffic coming from my blog). On the other hand, hotspots that are placed in the densest areas, but relatively underperforming their neighbors, are deemed insignificant by the PageRank score.

Another potentially relevant metric is called Betweenness Centrality (or just "betweenness"). In this case, the algorithm uses shortest paths to identify influential nodes. You could apply this to air travel; because of its central location in the continental U.S., Chicago O'Hare is a common layover destination for both east-west and north-south flights. Betweenness has also been used to identify the Reddit communities with the highest influence on viral trends, since these are the subs that provide the most direct link between seemingly disparate topics, like politics and pop culture. Again, let's take a look at this metric as it relates to the Helium Network in Pittsburgh.
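Computing it is another one-liner in NetworkX, and the same plotting code as above works once we swap in the new scores:

```python
betweenness = nx.betweenness_centrality(G)

# The hotspots that most often sit on shortest paths between other hotspots
top_connectors = sorted(betweenness, key=betweenness.get, reverse=True)[:10]
```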

The Helium Network in Pittsburgh, colored by Betweenness Centrality (image by author).

The hotspots with the highest betweenness scores are those that uniquely connect geographically sparse regions, such as those that provide the only witness path between Downtown Pittsburgh and the South Hills. I would contend that these hotspots are lynchpins for the network; although they may not have the highest witness counts, they perform critical proofs of coverage.

Outlier Detection: Graph Neural Networks

Let’s bring these concepts together with some Deep Learning. "Graph Neural Networks" (GNNs) encompass a subset of DL models designed to extract insights from – you guessed it – graph-based datasets. Just like with image- and NLP-based AI, researchers are constantly coming up with new variations on GNN algorithms, optimizers, and activation functions to suit various tasks. Rather than dwell on the methods themselves, which are better explained elsewhere, I want to focus on how we will apply them to our dataset.

In the Helium Network, one of the premier challenges for the engineering team is to identify bad actors who may be spoofing the system to increase their rewards. This gaming can take many forms, but one of the most popular approaches is to make hotspots appear more spread out than they really are. With sparse placement, miners benefit from optimal reward scaling, since Proof of Coverage disincentivizes closely-packed hotspots. While the network uses the physical limits of radio propagation to weed out some dishonest activity (i.e. there is an upper limit for the strength of a signal traveling a given distance), it is not incredibly difficult to get around these relatively conservative thresholds in many cases. The nonlinear nature of deep learning means that the decision boundaries are less predictable and thus, potentially more difficult to circumvent.

Back to the model itself. Our inputs will be the city graphs we assembled earlier. At the time I generated this dataset, hotspots were placed in over 11,000 unique cities around the world – cities of a wide variety of sizes, geographic features, and regulatory conditions. For each hotspot, we will use the PageRank score and Betweenness Centrality as features, and each witness path will use the RSSI and SNR of the reported beacon as edge weights. Finally, the output of our model will be a regression prediction of each node's reward scale value. You can download the full dataset (stored as a pickle binary) here.
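Converting each NetworkX city graph into a PyTorch Geometric Data object can be sketched roughly like this (the attribute names follow the earlier snippets and are assumptions):

```python
import networkx as nx
import torch
from torch_geometric.data import Data


def city_to_data(G: nx.Graph) -> Data:
    """Turn a NetworkX city graph into a PyTorch Geometric Data object."""
    nodes = list(G.nodes)
    index = {addr: i for i, addr in enumerate(nodes)}

    # Node features: PageRank and betweenness centrality
    pagerank = nx.pagerank(G)
    betweenness = nx.betweenness_centrality(G)
    x = torch.tensor([[pagerank[n], betweenness[n]] for n in nodes], dtype=torch.float)

    # Regression target: each hotspot's reward scale
    y = torch.tensor(
        [G.nodes[n].get("reward_scale") or 0.0 for n in nodes], dtype=torch.float
    )

    # Edges in both directions (witness paths are undirected), with RSSI/SNR weights
    edges, weights = [], []
    for u, v, attrs in G.edges(data=True):
        for a, b in ((u, v), (v, u)):
            edges.append([index[a], index[b]])
            weights.append([attrs.get("rssi") or 0.0, attrs.get("snr") or 0.0])

    edge_index = torch.tensor(edges, dtype=torch.long).t().contiguous()
    edge_attr = torch.tensor(weights, dtype=torch.float)
    return Data(x=x, edge_index=edge_index, edge_attr=edge_attr, y=y)
```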

My idea here is that our model should generalize what a "nominal" city looks like, but it will have a hard time fitting outlier cases. In other words, when reward scales are suspiciously high (or low) given a certain feature set, our model will show high loss (measured as mean squared error between predicted and actual). Speaking of the model, here's the code for our GNN, using PyTorch Geometric's GCNConv layers and the Adam optimizer:
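What follows is a minimal sketch of that model; the layer widths, learning rate, number of epochs, and the bookkeeping around the train/test split are illustrative, and the RSSI/SNR edge weights are left out of the convolutions to keep the example simple:

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv


class RewardScaleGNN(torch.nn.Module):
    """Two GCNConv layers and a linear head that regress each node's reward scale."""

    def __init__(self, num_features: int = 2, hidden: int = 32):
        super().__init__()
        self.conv1 = GCNConv(num_features, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.head = torch.nn.Linear(hidden, 1)

    def forward(self, data):
        h = F.relu(self.conv1(data.x, data.edge_index))
        h = F.relu(self.conv2(h, data.edge_index))
        return self.head(h).squeeze(-1)  # one predicted reward scale per node


model = RewardScaleGNN()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# train_cities: list of Data objects from city_to_data()
# test_cities: list of (city_name, Data, hotspot_addresses) tuples
for epoch in range(100):
    model.train()
    for data in train_cities:
        optimizer.zero_grad()
        loss = F.mse_loss(model(data), data.y)
        loss.backward()
        optimizer.step()

# Rank every hotspot in the test set by its squared error
model.eval()
errors = []
with torch.no_grad():
    for city_name, data, addresses in test_cities:
        squared_error = (model(data) - data.y) ** 2
        errors += [(e.item(), city_name, addr) for e, addr in zip(squared_error, addresses)]

for err, city_name, addr in sorted(errors, reverse=True)[:25]:
    print(f"{city_name} | {addr} | squared error = {err:.4f}")
```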

This will print out the cities and individual hotspots from our test set with the highest loss, since we’re making predictions on a node-by-node basis. Let’s use the Helium Explorer to take a closer look at some of our outliers:

Outlier City #1: Cardiff, Wales

Impeccably-placed hotspots in Cardiff (image by author).

Outlier City #2: Cape Coral, FL

Impressive signal propagation in another coastal city, Cape Coral, FL (image by author).

You can see what the model is getting at. In both of these low-lying coastal cities, most hotspots have near-optimal reward scales and impressive signal propagation. It's possible there is some spoofing happening here, or maybe this is just an artifact of how well radio waves travel over water. In future models, I would like to account for geographic factors, like local elevation and significant bodies of water.

Outlier Hotspot #1: Tiny Marmalade Pheasant

Image by author.

Outlier Hotspot #2: Urban Pine Raven

Image by author.

Outlier Hotspot #3: Boxy Brown Chameleon

Image by author.

Now, we're looking at individual hotspots with the highest loss. In this case, the model is capturing situations where the hotspot is placed in a relatively densely-populated area (i.e. there are lots of other miners nearby), but it still manages a perfect transmit scale. Another phenomenon we see here is some insanely long witness paths, on the order of hundreds of kilometers. Perhaps these hotspots just have extremely powerful antennas and great locations, or maybe this is a sign of suspicious activity.

Closing Thoughts

In this article, we leveraged Graph Theory to extract insights about the Helium Network. We found that metrics like PageRank and Betweenness, which were originally designed for completely different application domains, are relevant for finding the most (and least) significant hotspots in a given city. Using these features and state-of-the-art Graph Neural Network-based models, we can start to tackle the formidable task of identifying "suspicious" activity on the Helium blockchain. I am looking forward to further optimizing the model architecture and feature extraction methodologies, as well as extending the classification tasks to other areas, like coverage mapping. In the span of just a couple of years, the Helium Network has grown to become the largest public LoRaWAN network on the planet. For folks excited about decentralization and cheap IoT coverage, this is already a massive victory. For data scientists attempting to model this vast (and publicly available) dataset, the work is just beginning.


References

Helium Whitepaper

NetworkX Documentation

PyTorch Geometric Documentation

City Graph Dataset (111 MB)

