
Here is what we are going to accomplish in this post:
- Build a custom graph dataset formatted to work in DGL (Main subject in this post 😎 )
- Prepare Train and Test datasets randomly (Good to know 👍 )
- Define the Graph Conv Network (Piece of cake 🍰 )
- Train and evaluate Accuracy (As usual 🧠)
1. Build a custom graph dataset formatted to work in DGL
The dataset we are going to work with is taken from the AIcrowd Learning to Smell Challenge, and consists of a column with the SMILES string identifying a given molecule and a second column with the names of the scents of that molecule.
For example, the second molecule in the table below, Dimethyl carbonate-13C3, whose SMILES string is COC(=O)OC, has scents labeled as "fresh, ethereal, fruity", which certainly matches what is known about this substance [1]. There are 100+ different scents forming unordered combinations over these 4316 molecules.
We then want to a) transform the SMILES into a DGL graph for each molecule, b) label whether or not it has a fruity scent, and c) get rid of any molecule that is problematic. The following blocks of code address each of these tasks:
a) A DGL graph object has to be defined for each molecule. We can accomplish this with RDKit: first we build an RDKit mol object from the SMILES string, then we obtain its adjacency matrix. The indices of the nonzero entries identify the atoms that are connected and give the source and destination nodes of a bidirectional graph, as sketched below.
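A minimal sketch of that conversion, assuming RDKit and DGL are installed (the helper name smiles_to_dgl_graph is illustrative):

```python
import dgl
import torch
from rdkit import Chem

def smiles_to_dgl_graph(smiles):
    mol = Chem.MolFromSmiles(smiles)      # RDKit mol object from the SMILES string
    adj = Chem.GetAdjacencyMatrix(mol)    # adjacency matrix (0/1 numpy array)
    src, dst = adj.nonzero()              # indices of nonzero entries = connected atoms
    # The adjacency matrix is symmetric, so both (i, j) and (j, i) are included
    # and the resulting DGL graph is bidirectional.
    return dgl.graph((torch.tensor(src), torch.tensor(dst)), num_nodes=mol.GetNumAtoms())

g = smiles_to_dgl_graph("COC(=O)OC")      # dimethyl carbonate from the table above
print(g)                                   # Graph(num_nodes=6, num_edges=10)
```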
We use the atom_to_feature_vector function from the ogb library [2] to generate the feature vectors for the atoms. For example, when we apply feat_vec on the string COC(=O)CO we get the output shown below.
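A possible implementation of the feat_vec helper, given as a sketch (the helper name comes from the post, its body is an assumption):

```python
import numpy as np
from rdkit import Chem
from ogb.utils.features import atom_to_feature_vector

def feat_vec(smiles):
    """Return a (num_atoms, 9) array with one ogb feature vector per atom."""
    mol = Chem.MolFromSmiles(smiles)
    return np.array([atom_to_feature_vector(atom) for atom in mol.GetAtoms()])

print(feat_vec("COC(=O)CO"))
# One row per atom, each with 9 integer features (atomic number, chirality, degree,
# formal charge, number of hydrogens, radical electrons, hybridization,
# aromaticity, ring membership).
```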
Notice that each row corresponds to an atom with 9 features. These features are physico-chemical properties of the atoms, for example the atomic number, the number of hydrogens, or whether the atom is in a ring, among others.
b) Now that we have the graphs with features, it is time to define the labels. As we have seen, each molecule is associated with more than one scent. Here, for the sake of simplicity, the problem will be a binary classification to determine whether or not the molecule has a fruity scent. The following block of code does that task, assigning a label of 1 if the molecule has a fruity scent and 0 if not.
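A minimal sketch of this labeling step, assuming the challenge data has been loaded into a pandas DataFrame with SMILES and SENTENCE columns (the file and column names are assumptions, adjust them to your data):

```python
import pandas as pd

# Load the challenge data (hypothetical file name).
df = pd.read_csv("train.csv")

# Label 1 if "fruity" appears among the molecule's scents, 0 otherwise.
labels = [1 if "fruity" in scents else 0 for scents in df["SENTENCE"]]
```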
c) Within this dataset, I noticed that a small number of molecules are not correctly converted to DGL graphs, so I decided to get rid of them. I did this by simply ignoring the exceptions, as shown below:
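A sketch of the cleanup loop, reusing the smiles_to_dgl_graph and feat_vec helpers sketched above; any molecule that raises an exception during conversion is simply skipped:

```python
import torch

graphs, graph_labels = [], []
for smiles, label in zip(df["SMILES"], labels):
    try:
        g = smiles_to_dgl_graph(smiles)                   # DGL graph from step a)
        g.ndata["feat"] = torch.tensor(feat_vec(smiles),  # attach the 9 atom features
                                       dtype=torch.float32)
        graphs.append(g)
        graph_labels.append(label)
    except Exception:
        pass                                              # drop problematic molecules
```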
2. Prepare Train and Test datasets randomly
With all these steps the dataset is ready, but we are going to take one further step and convert it into a DGL dataset, in order to smoothly handle the preprocessing and the formation of the training and test subsets. Here I will quote the overview from the "Make Your Own Dataset" official tutorial by the DGL team [3]:
Your custom graph dataset should inherit the dgl.data.DGLDataset class and implement the following methods:
- __getitem__(self, i): retrieve the i-th example of the dataset. An example often contains a single DGL graph, and occasionally its label.
- __len__(self): the number of examples in the dataset.
- process(self): load and process raw data from disk.
The process method is where we give the graphs their labels. It would be possible to read them from a file, but in our case it is just the list of graphs and the list of labels converted to a torch tensor. Thus this is a really simple class:
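A minimal sketch of such a class, closely following the DGL tutorial [3]; the class name SmellDataset is illustrative:

```python
import torch
from dgl.data import DGLDataset

class SmellDataset(DGLDataset):
    def __init__(self, graphs, labels):
        # Keep references to the raw lists; process() is called by the base class.
        self.graphs_raw = graphs
        self.labels_raw = labels
        super().__init__(name="smell")

    def process(self):
        self.graphs = self.graphs_raw
        self.labels = torch.LongTensor(self.labels_raw)

    def __getitem__(self, i):
        return self.graphs[i], self.labels[i]

    def __len__(self):
        return len(self.graphs)

dataset = SmellDataset(graphs, graph_labels)
```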
Then we build the train and test subsets by random sampling. Doing it this way has the advantage that you can set the batch size; here we set it to five graphs per batch:
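A sketch of the split in the spirit of the DGL tutorial [3]: SubsetRandomSampler draws the examples in random order and GraphDataLoader batches them; the 80/20 split ratio is an assumption:

```python
import torch
from torch.utils.data.sampler import SubsetRandomSampler
from dgl.dataloading import GraphDataLoader

num_examples = len(dataset)
num_train = int(num_examples * 0.8)          # assumed 80/20 split

train_sampler = SubsetRandomSampler(torch.arange(num_train))
test_sampler = SubsetRandomSampler(torch.arange(num_train, num_examples))

train_dataloader = GraphDataLoader(dataset, sampler=train_sampler,
                                   batch_size=5, drop_last=False)
test_dataloader = GraphDataLoader(dataset, sampler=test_sampler,
                                  batch_size=5, drop_last=False)
```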
3. Define the Graph Conv Network
We are now ready to define the GCN. I explained in a previous post how the GCN class is built. Each node has an output feature tensor, so in order to do a binary classification we can set the output feature length equal to the number of classes (here two), and then average all node features with dgl.mean_nodes to get one averaged (two-dimensional) tensor per graph:
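A sketch of such a network, not necessarily the exact class from the previous post: two GraphConv layers (the hidden size is left as a parameter) followed by dgl.mean_nodes as the per-graph readout:

```python
import dgl
import torch.nn as nn
import torch.nn.functional as F
from dgl.nn import GraphConv

class GCN(nn.Module):
    def __init__(self, in_feats, h_feats, num_classes):
        super().__init__()
        self.conv1 = GraphConv(in_feats, h_feats)
        self.conv2 = GraphConv(h_feats, num_classes)

    def forward(self, g, in_feat):
        h = F.relu(self.conv1(g, in_feat))
        h = self.conv2(g, h)                 # per-node output of length num_classes
        g.ndata["h"] = h
        return dgl.mean_nodes(g, "h")        # average over nodes -> one vector per graph
```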
4. Train and evaluate Accuracy
Finally, we build the model, which accepts 9 atomic features per atom and returns a two-dimensional tensor used to decide whether the molecule has a fruity scent. The epochs run over the batches and the loss is calculated with cross entropy. The last part is simply a loop that counts the correct predictions to evaluate the accuracy:
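A sketch of the training and evaluation loops; the hidden size, learning rate, and number of epochs are assumptions:

```python
import torch
import torch.nn.functional as F

model = GCN(in_feats=9, h_feats=16, num_classes=2)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# Training: each epoch runs over the batches of the train dataloader.
for epoch in range(20):
    for batched_graph, batch_labels in train_dataloader:
        logits = model(batched_graph, batched_graph.ndata["feat"])
        loss = F.cross_entropy(logits, batch_labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Evaluation: count correct predictions on the test batches.
num_correct, num_tests = 0, 0
for batched_graph, batch_labels in test_dataloader:
    logits = model(batched_graph, batched_graph.ndata["feat"])
    num_correct += (logits.argmax(dim=1) == batch_labels).sum().item()
    num_tests += len(batch_labels)

print("Test accuracy:", num_correct / num_tests)
```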
The accuracy we get is around 0.78, which is pretty good. Now you can try predicting other scents or, even better, build your own custom DGL datasets with molecules or whatever you want.
If you want to try the code, open this Colab Notebook. In case you have any questions, I will be more than happy to answer. Finally, consider subscribing to the mailing list:
and check my previous posts:
Start with Graph Convolutional Neural Networks using DGL
Making network graphs interactive with Python and Pyvis.
Graph convolutional nets for classifying COVID-19 incidence on states
References:
1)
2) GitHub – snap-stanford/ogb: Benchmark datasets, data loaders, and evaluators for graph machine learning
3) Make Your Own Dataset, DGL official tutorial