Unlocking the Power of Big Data: The Fascinating World of Graph Learning

Harnessing Deep Learning to Transform Untapped Data into a Strategic Asset for Long-Term Competitiveness.

Mathieu Laversin
Towards Data Science


Photo by Nathan Anderson on Unsplash

Large companies generate and collect vast amounts of data; for example, 90% of it was created in recent years. Yet 73% of this data remains unused [1]. However, as you may know, data is a goldmine for companies working with Big Data.

Deep learning is constantly evolving, and today, the challenge is to adapt these new solutions to specific goals to stand out and enhance long-term competitiveness.

My previous manager had a good intuition: these two trends could come together to facilitate data access and requests and, above all, to stop wasting time and money.

Why is this data left unused?

Accessing it simply takes too long: rights verification, and especially content checks, are necessary before granting access to users.

Visualizing the reasons why data is left unused (generated by Bing Image Creator)

Is there a solution to automatically document new data?

If you’re not familiar with large enterprises, no problem; I wasn’t either. An interesting concept in such environments is the use of Big Data, particularly HDFS (Hadoop Distributed File System), a cluster designed to consolidate all of the company’s data. Within this vast pool you can find structured data, and within that structured data, Hive columns are referenced. Some of these columns are used to create additional tables and likely serve as sources for various datasets. Companies keep track of the relationships between these tables through lineage.

These columns also have various characteristics (domain, type, name, date, owner…). The goal of the project was to document the data known as physical data with business data.

Distinguishing between physical and business data:

To put it simply, physical data is a column name in a table, and business data is the usage of that column.

For example: a table named Friends contains the columns (character, salary, address). Our physical data are character, salary, and address. Our business data are, for example:

  • For “Character” -> Name of the Character
  • For “Salary” -> Amount of the salary
  • For “Address” -> Location of the person

This business data would make accessing data easier because you would directly have the information you need: you would know that this is the dataset you want for your project, and that the information you’re looking for is in this table. You would just have to ask for access, without wasting time and money.

“During my final internship, I, along with my team of interns, implemented a Big Data / Graph Learning solution to document this data.

The idea was to create a graph to structure our data and, in the end, to predict business data based on features. In other words, from the data stored in the company’s environment, document each dataset by associating it with a usage, so as to reduce search costs in the future and become more data-driven.

We had 830 labels to classify and not that many rows. Fortunately, the power of graph learning comes into play. I’ll let you read on…”

Article Objectives: This article aims to provide an understanding of Big Data concepts, Graph Learning, the algorithm used, and the results. It also covers deployment considerations and how to successfully develop a model.

To help you follow my journey, the outline of this article contains:

  • Data Acquisition: Sourcing the Essential Data for Graph Creation
  • Graph-based Modeling with GSage
  • Effective Deployment Strategies

Data Acquisition

As I mentioned earlier, data is often stored in Hive columns. If you didn’t already know, this data is stored in large containers: we extract, transform, and load it through processes known as ETL (Extract, Transform, Load).
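To make this concrete, here is a minimal, hypothetical sketch of how Hive column metadata could be pulled with PySpark; the table and database names are illustrative, not the ones used in the project.

from pyspark.sql import SparkSession

# Hypothetical example: list the columns (physical data) of a Hive table.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

columns = spark.catalog.listColumns(tableName="friends", dbName="default")
for col in columns:
    print(col.name, col.dataType)  # e.g. "salary", "decimal(10,2)"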

What type of data did I need?

  • Physical data and their characteristics (domain, name, data type).
  • Lineage (the relationships between physical data, if they have undergone common transformations).
  • A mapping of some physical data to their business data, in order to then “let” the algorithm perform on its own.

1. Characteristics/features are obtained directly when we store the data; they are mandatory as soon as data is stored. For example (this depends on your case):

Example of main features (made by the author)

For the features, based on empirical experience, we decided to use a feature hasher on three columns.

Feature Hasher: a technique used in machine learning to convert high-dimensional categorical data, such as text or categorical variables, into a lower-dimensional numerical representation, reducing memory and computational requirements while preserving meaningful information.

You could also go with a One-Hot Encoding technique if you have similar patterns, but if you want to ship your model, my advice would be to use the Feature Hasher.
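As a hedged illustration, here is a minimal sketch of feature hashing with scikit-learn’s FeatureHasher; the column names and the number of hashed features are assumptions for the example, not the project’s exact configuration.

import pandas as pd
from sklearn.feature_extraction import FeatureHasher

# Illustrative physical-data characteristics (not the real schema).
df = pd.DataFrame({
    "domain": ["finance", "hr", "finance"],
    "name": ["salary_amount", "employee_name", "account_id"],
    "data_type": ["decimal", "string", "string"],
})

# Hash each row's categorical values into a fixed-size numerical vector.
hasher = FeatureHasher(n_features=16, input_type="string")
hashed = hasher.transform(df.astype(str).values.tolist())

print(hashed.toarray().shape)  # (3, 16): one 16-dimensional vector per row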

2. Lineage is a bit more complex, but not impossible to understand. Lineage is like a history of physical data: we have a rough idea of what transformations have been applied and where the data is stored elsewhere.

Picture your big data environment and all of this data. In some projects, we take data from a table and apply a transformation through a job (Spark, for example).

Atlas Lineage visualized, from Atlas Website, LINK

We gather this information for all the physical data we have in order to create the connections in our graph, or at least one type of connection.

3. The mapping is the foundation that adds value to our project. It’s where we associate our business data with our physical data. This provides the algorithm with verified information so that it can classify new incoming data in the end. This mapping had to be done by someone who understands the company’s processes and has the skills to recognize difficult patterns without asking.

ML advice, from my own experience :

To quote Andrew Ng: in classical machine learning, there’s something called the algorithm lifecycle. We often think about the algorithm, making it complicated, instead of just using a good old Linear Regression (I’ve tried; it doesn’t work). In this lifecycle, there are all the stages of preprocessing, modeling, and monitoring… but most importantly, there is data focusing.

This is a mistake we often make: we take the data for granted and start doing data analysis, drawing conclusions from the dataset without questioning its relevance. Don’t forget data focusing, my friends; it can boost your performance or even lead to a change of project :)

Returning to our article, after obtaining the data, we can finally create our graph.

Plot (networkx) of the distribution of our dataset, in a graph. (made by the author)

This plot is based on a batch of 2,000 rows, i.e., 2,000 columns from datasets and tables. In the center you can find the business data, and off-center the physical data.

In mathematics, we denote a graph as G(N, E, f), where N represents the nodes (vertices), E the edges, and f the features. Let’s assume all three are non-empty sets.

For the nodes, we have the business data IDs from the mapping table, as well as the physical data, which we can trace through lineage.

Speaking of lineage, it partly serves as the edges, along with the links we already have through the mapping and the IDs. We had to extract it through an ETL process using the Apache Atlas APIs.
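Here is a minimal sketch of how such a graph could be assembled with networkx; the DataFrames lineage_df and mapping_df and their column names are hypothetical stand-ins for the output of the Atlas ETL and the mapping table.

import networkx as nx
import pandas as pd

# Hypothetical ETL outputs: lineage links between physical columns,
# and the mapping between physical columns and business data.
lineage_df = pd.DataFrame({"src": ["col_a", "col_b"], "dst": ["col_c", "col_c"]})
mapping_df = pd.DataFrame({"physical": ["col_a"], "business": ["salary_amount"]})

G = nx.Graph()
# Edges from lineage: physical data linked by common transformations.
G.add_edges_from(lineage_df[["src", "dst"]].itertuples(index=False, name=None))
# Edges from the mapping: physical data linked to their business data.
G.add_edges_from(mapping_df[["physical", "business"]].itertuples(index=False, name=None))

print(G.number_of_nodes(), G.number_of_edges())  # 4 nodes, 3 edges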

You can see how a big data problem, after laying the foundations, can become easy to understand but more challenging to implement, especially for a young intern…

“Ninja cartoon on a computer” (generated by Dall.E 3)

Graph-based Modeling with GSage

Basics of Graph Learning

This section will be dedicated to explaining GSage and why it was chosen both mathematically and empirically.

Before this internship, I was not accustomed to working with graphs. That’s why I purchased the book [2], which I’ve included in the references, as it greatly helped me understand the principles.

The principle is simple: when we talk about graph learning, we will inevitably discuss embedding. In this context, nodes and their proximity are mathematically translated into coefficients that reduce the dimensionality of the original dataset, making it more efficient for calculations. During the reduction, one of the key principles of the decoder is to preserve the proximities between nodes that were initially close.

Another source of inspiration was Maxime Labonne [3] for his explanations of GraphSAGE and Graph Convolutional Networks. He demonstrated great pedagogy and provided clear, comprehensible examples, making these concepts accessible to anyone who wishes to dive into them.

GraphSage’s model

If this term doesn’t ring a bell, rest assured, just a few months ago, I was in your shoes. Architectures like Attention networks and Graph Convolutional Networks gave me quite a few nightmares and, more importantly, kept me awake at night.

But to save you from taking up your entire day and, especially, your commute time, I’m going to simplify the algorithm for you.

Once you have the embeddings in place, that’s when the magic can happen. But how does it all work, you ask?

Schema based on the Scooby-Doo Universe to explain GSage (made by the author).

“You are known by the company you keep” is the sentence you must remember.

This is because one of the fundamental assumptions underlying GraphSAGE is that nodes residing in the same neighborhood should exhibit similar embeddings. To achieve this, GraphSAGE employs aggregation functions that take a neighborhood as input and combine each neighbor’s embedding with specific weights. That’s why the Mystery Inc. gang’s embeddings would be in Scooby’s neighborhood.

In essence, it gathers information from the neighborhood, with the weights being either learned or fixed depending on the loss function.

The true strength of GraphSAGE becomes evident when the aggregator weights are learned. At this point, the architecture can generate embeddings for unseen nodes using their features and neighborhood, making it a powerful tool for various applications in graph-based machine learning.
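To make the aggregation idea concrete, here is a toy sketch of mean aggregation over a neighborhood; the feature matrix and the neighbor dictionary are made up for the example, and the real SAGEConv layer used below also applies learned weight matrices on top of this step.

import torch

x = torch.randn(4, 8)                               # 4 nodes, 8 features each
neighbors = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}  # toy adjacency

def mean_aggregate(node):
    # Combine the node's own features with the mean of its neighbors' features.
    neigh = torch.stack([x[n] for n in neighbors[node]]).mean(dim=0)
    return torch.cat([x[node], neigh])

print(mean_aggregate(0).shape)  # torch.Size([16])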

Difference in training time between architectures, Maxime Labonne’s Article, Link

As you can see in this chart, training time decreases on the same dataset with the GraphSAGE architecture. GAT (Graph Attention Network) and GCN (Graph Convolutional Network) are also really interesting graph architectures; I really encourage you to look into them!

At the first run, I was shocked, shocked to see only 25 seconds to train 1,000 batches on thousands of rows.

I know at this point you’re interested in Graph Learning and want to learn more; my advice would be to read his work (great examples, great advice).

As a Medium reader, I’m always curious to see code when I’m looking at a new article, so for you, here is how we can implement a GraphSAGE architecture in PyTorch Geometric with the SAGEConv layer.

Let’s create a network with two SAGEConv layers:

  • The first one uses ReLU as the activation function and a dropout layer;
  • The second one directly outputs the node embeddings.

In our multi-class classification task, we’ve chosen to employ the cross-entropy loss as our primary loss function. This choice is driven by its suitability for classification problems with multiple classes. Additionally, we’ve incorporated L2 regularization with a strength of 0.0005.

This regularization technique helps prevent overfitting and promotes model generalization by penalizing large parameter values. It’s a well-rounded approach to ensure model stability and predictive accuracy.

import torch
from torch.nn import Linear, Dropout
from torch_geometric.nn import SAGEConv, GATv2Conv, GCNConv
import torch.nn.functional as F


class GraphSAGE(torch.nn.Module):
    """GraphSAGE with two SAGEConv layers."""

    def __init__(self, dim_in, dim_h, dim_out):
        super().__init__()
        self.sage1 = SAGEConv(dim_in, dim_h)
        self.sage2 = SAGEConv(dim_h, dim_out)  # dim_out = 830 classes in my case
        self.optimizer = torch.optim.Adam(self.parameters(),
                                          lr=0.01,
                                          weight_decay=5e-4)  # L2 regularization

    def forward(self, x, edge_index):
        h = self.sage1(x, edge_index).relu()
        h = F.dropout(h, p=0.5, training=self.training)
        h = self.sage2(h, edge_index)
        return F.log_softmax(h, dim=1)

    def fit(self, data, epochs):
        # train_loader is a mini-batch loader (e.g. a PyTorch Geometric
        # NeighborLoader) defined elsewhere in the project.
        criterion = torch.nn.CrossEntropyLoss()
        optimizer = self.optimizer

        self.train()
        for epoch in range(epochs + 1):
            total_loss = 0
            acc = 0
            val_loss = 0
            val_acc = 0

            # Train on batches
            for batch in train_loader:
                optimizer.zero_grad()
                out = self(batch.x, batch.edge_index)
                loss = criterion(out[batch.train_mask], batch.y[batch.train_mask])
                total_loss += loss
                acc += accuracy(out[batch.train_mask].argmax(dim=1),
                                batch.y[batch.train_mask])
                loss.backward()
                optimizer.step()

                # Validation
                val_loss += criterion(out[batch.val_mask], batch.y[batch.val_mask])
                val_acc += accuracy(out[batch.val_mask].argmax(dim=1),
                                    batch.y[batch.val_mask])

            # Print metrics every 10 epochs
            if epoch % 10 == 0:
                print(f'Epoch {epoch:>3} | Train Loss: {total_loss/len(train_loader):.3f} '
                      f'| Train Acc: {acc/len(train_loader)*100:>6.2f}% | Val Loss: '
                      f'{val_loss/len(train_loader):.2f} | Val Acc: '
                      f'{val_acc/len(train_loader)*100:.2f}%')


def accuracy(pred_y, y):
    """Calculate accuracy."""
    return ((pred_y == y).sum() / len(y)).item()


@torch.no_grad()
def test(model, data):
    """Evaluate the model on the test set and return the accuracy score."""
    model.eval()
    out = model(data.x, data.edge_index)
    acc = accuracy(out.argmax(dim=1)[data.test_mask], data.y[data.test_mask])
    return acc
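For completeness, here is a hedged usage sketch: the hidden dimension (64) and the number of epochs (200) are illustrative values, and data and train_loader are assumed to come from your own PyTorch Geometric pipeline (for instance, a NeighborLoader over the graph).

# Hypothetical usage; dimensions and epoch count are illustrative.
model = GraphSAGE(dim_in=data.num_features, dim_h=64, dim_out=830)
model.fit(data, epochs=200)
print(f'Test accuracy: {test(model, data)*100:.2f}%')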

Deployment of the model:

In the development and deployment of our project, we harnessed the power of three key technologies, each serving a distinct and integral purpose:

The three tools’ logos (images from Google)

Airflow: To efficiently manage and schedule our project’s complex data workflows, we used the Airflow orchestrator. Airflow is a widely adopted tool for orchestrating tasks, automating processes, and ensuring that our data pipelines ran smoothly and on schedule.
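As a rough illustration only, here is a minimal, hypothetical Airflow DAG; the DAG id, task names, schedule, and commands are placeholders, not the project’s actual pipeline.

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical DAG: extract metadata/lineage, then retrain the model.
with DAG(
    dag_id="graph_learning_documentation",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@weekly",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract_metadata", bash_command="echo 'run ETL'")
    train = BashOperator(task_id="train_gsage", bash_command="echo 'train model'")
    extract >> train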

Mirantis: Our project’s infrastructure was built and hosted on the Mirantis cloud platform. Mirantis is renowned for providing robust, scalable, and reliable cloud solutions, offering a solid foundation for our deployment.

Jenkins: To streamline our development and deployment processes, we relied on Jenkins, a trusted name in the world of continuous integration and continuous delivery (CI/CD). Jenkins automated the building, testing, and deployment of our project, ensuring efficiency and reliability throughout our development cycle.

Additionally, we stored our machine learning code in the company’s Artifactory. But what exactly is an Artifactory?

Artifactory: An Artifactory is a centralized repository manager for storing, managing, and distributing various artifacts, such as code, libraries, and dependencies. It serves as a secure and organized storage space, ensuring that all team members have easy access to the necessary assets. This enables seamless collaboration and simplifies the deployment of applications and projects, making it a valuable asset for efficient development and deployment workflows.

By housing our machine learning code in the Artifactory, we ensured that our models and data were readily available to support our deployment via Jenkins.

ET VOILÀ! The solution was deployed.

I talked a lot about the infrastructure, but not so much about the machine learning itself and the results we obtained.

Results:

The confidence of the predictions:

For each physical data point, we take into consideration the top 2 predictions, because of the model’s performance.

How’s that possible?

probabilities = torch.softmax(raw_output, dim=1)
# torch.topk to get the top 2 probabilities and their indices for each prediction
topk_values, topk_indices = torch.topk(probabilities, k=2, dim=1)

First, I used a softmax to make the outputs comparable, and then I used a function named torch.topk, which returns the k largest elements of the given input tensor along a given dimension.

So, back to the first prediction, here was our distribution after training. Let me tell you boys and girls, that’s great!

Plot (from matplotlib) of the probabilities of the model outputs, First prediction (made by the author)

Accuracies, Losses on Train / Test / Validation.

I won’t teach you what accuracies and losses are in ML; I presume you are all pros… (ask ChatGPT if you’re not sure, no shame). During training, even at different scales, you can see the curves converge, which is great and shows stable learning.

Plot (matplotlib) of accuracies and losses. (made by the author)

t-SNE :

t-SNE (t-Distributed Stochastic Neighbor Embedding) is a dimensionality reduction technique used for visualizing and exploring high-dimensional data by preserving the pairwise similarities between data points in a lower-dimensional space.
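As a hedged sketch, here is how such a projection could be produced with scikit-learn’s TSNE; the embeddings array below is a random placeholder standing in for the model’s node representations.

import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

embeddings = np.random.rand(500, 64)  # placeholder for the real node embeddings
coords = TSNE(n_components=2, perplexity=30).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1], s=5)
plt.title("t-SNE projection of node embeddings")
plt.show()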

In other words, imagine a random distribution before training :

Data Distribution before training, (made by the author)

Remember, we are doing multi-class classification, so here’s the distribution after training. The aggregation of features seems to have done a satisfactory job: clusters have formed, and the physical data seems to have joined groups, demonstrating that the training went well.

Data distribution after training, (made by the author)

Conclusion:

Our goal was to predict business data based on physical data (and we did it). I am pleased to inform you that the algorithm is now in production and is onboarding new users for the future.

While I cannot provide the entire solution due to proprietary reasons, I believe you have all the necessary details or are well-equipped to implement it on your own.

My last piece of advice, I swear: have a great team, not only people who work well but people who make you laugh every day.

If you have any questions, please don’t hesitate to reach out to me. Feel free to connect with me, and we can have a detailed discussion about it.

In case I don’t see ya: good afternoon, good evening, and goodnight!

Have you grasped everything?

As Chandler Bing would have said :

“It’s always better to lie, than to have the complicated discussion”

Don’t forget to like and share!

References and Resources

[1] Inc (2018), Web Article from Inc

[2] Claudio Stamile, Graph Machine Learning: Take graph data to the next level by applying machine learning techniques and algorithms (2021)

[3] Maxime Labonne, GraphSAGE: Scaling up Graph Neural Networks (2021)
