Winner of the OGB-LSC Knowledge Graph competition at NeurIPS 2022

Large-Scale Knowledge Graph Completion on Graphcore IPUs

Accurate predictions through fast experiments, careful tuning, and a large ensemble

Daniel Justus
Towards Data Science
8 min read · Dec 6, 2022


Rendering of the computational graph of a Knowledge Graph Embedding model. Image by author.

Machine learning methods for representing graph-structured data keep growing in importance. One of the central challenges that researchers in the field are facing is the scalability of models to large datasets. As part of the NeurIPS 2022 Competition Track Programme, the Open Graph Benchmark Large-Scale Challenge (OGB-LSC) aims to push the boundaries of graph representation learning by encouraging the graph ML research community to work with realistically sized datasets and develop solutions able to meet real-world needs.

In this blog post we present Graphcore’s winning submission to the Knowledge Graph track of OGB-LSC@NeurIPS 2022. We give insights into the machine learning models, dataset considerations, the ensembling strategy and our efficient execution scheme that enabled this success. More in-depth information on our method can be found in the paper [1].

What is a Knowledge Graph?

Knowledge Graphs are a very natural way to structure and represent knowledge by capturing the relationship between real-world entities. Facts are represented as triples (head, relation, tail), where the relation describes the link between the head and tail entities. For example, the triple (Geoffrey Hinton, graduated from, King’s College Cambridge) is one of seven facts stated by the Knowledge Graph in Figure 1. Applications of Knowledge Graphs range from drug discovery to question-answering and recommender systems [2–4].

Figure 1: A small subgraph of the WikiKG90Mv2 Knowledge Graph. A possible query might be (Geoffrey Hinton, born in, ?). Image by author. Adapted from [5].

Knowledge Graph Embedding (KGE) models perform reasoning on Knowledge Graphs by learning embeddings of entities and relations in low-dimensional vector spaces, such that the plausibility of triples is measured by a scoring function of the head, relation, and tail embeddings. KGE models are trained on batches of both positive (true) triples and negative (randomly drawn) triples by maximising the score of positive and minimising the score of negative samples.

Figure 2: Three different examples of scoring functions: TransE measures the distance between head + relation and tail embedding; TransH projects head and tail embeddings onto a relation-dependent hyperplane before applying TransE; In RotatE the relation describes a rotation of the complex-valued head embedding. In all three models the negative distance can be used as score. Image by author.
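To make this concrete, below is a minimal PyTorch-style sketch of the TransE and RotatE scores for a batch of triples. The function names, tensor shapes and choice of norm are illustrative and do not necessarily match our actual implementation.

```python
import torch

def transe_score(head, relation, tail):
    # TransE: score = negative distance between (head + relation) and tail.
    # head, relation, tail: real-valued embeddings of shape (batch, dim).
    return -torch.norm(head + relation - tail, p=2, dim=-1)

def rotate_score(head, relation, tail):
    # RotatE: the relation is an element-wise rotation in the complex plane.
    # relation holds rotation angles of shape (batch, dim); head and tail are
    # complex-valued tensors of shape (batch, dim), e.g. dtype torch.cfloat.
    rotation = torch.polar(torch.ones_like(relation), relation)  # unit-modulus rotations
    return -torch.abs(head * rotation - tail).sum(dim=-1)
```

TransH applies TransE after projecting head and tail onto a relation-specific hyperplane, while DistMult and ComplEx score a triple with a (complex-valued) trilinear product of the three embeddings instead of a distance.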

While the Knowledge Graph literature often focuses on relatively small graphs, real-world applications of commercial value increasingly require reasoning on graphs with hundreds of millions, or even billions, of entities and triples. WikiKG90Mv2 is a large-scale Knowledge Graph based on Wikidata, consisting of more than 90 million nodes and 600 million triples. Knowledge Graphs are typically incomplete: while existing triples can be presumed to be true, the absence of a relation between two entities has no truth value. The task of the competition is therefore to complete triples of the form (head, relation, ?), for example (Geoffrey Hinton, born in, ?); see Figure 1. The huge size of the WikiKG90Mv2 dataset is one of the major challenges for implementing a fast and accurate Knowledge Graph Completion model, as our largest models consume over 300 GiB for parameters, optimiser state and features. To overcome this problem, we implemented BESS (Balanced Entity Sampling and Sharing), a distributed processing scheme for training KGE models that efficiently balances communication and compute over multiple workers.

Distributing Knowledge Graph Training with BESS

BESS randomly and uniformly partitions the set of entity embeddings across D available workers. This gives rise to a partitioning of the triples in the dataset into buckets {Tₘₙ, m, n = 1,…,D}, where Tₘₙ is the set of triples with head entity stored on worker m and tail entity stored on worker n. Then, during training, for each worker m a batch is sampled uniformly from {Tₘₙ, n = 1,…,D}. Similarly, for each micro-batch BESS uniformly draws tails from all available workers to construct negative samples. These negative samples are shared across all triples in a micro-batch to increase the effective negative sample size without increasing the cost of communicating and scoring negative samples.

Animation by author.
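The sketch below illustrates the partitioning and sampling logic described above in plain Python/NumPy. The worker count, data structures and helper names are our own; the real implementation executes this across IPUs with the embedding tables sharded in memory.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4                     # number of workers, e.g. IPUs
num_entities = 1_000_000  # illustrative size

# Randomly and uniformly partition the entity embeddings across the D workers.
entity_to_worker = rng.integers(0, D, size=num_entities)

def bucket_triples(triples):
    """Partition (head, relation, tail) triples into D x D buckets T[m][n],
    where the head embedding lives on worker m and the tail on worker n."""
    buckets = [[[] for _ in range(D)] for _ in range(D)]
    for h, r, t in triples:
        buckets[entity_to_worker[h]][entity_to_worker[t]].append((h, r, t))
    return buckets

def sample_step(buckets, m, batch_size, negatives_per_worker):
    """Sample one micro-batch for worker m: positives drawn uniformly from the
    buckets T[m][0..D-1], plus shared negative tails drawn from every worker."""
    per_bucket = batch_size // D
    positives = []
    for n in range(D):
        idx = rng.integers(0, len(buckets[m][n]), size=per_bucket)
        positives.extend(buckets[m][n][i] for i in idx)
    # Negative tails are drawn uniformly from each worker's entity partition and
    # shared by all positive triples in the micro-batch.
    negative_tails = [rng.choice(np.flatnonzero(entity_to_worker == n),
                                 size=negatives_per_worker) for n in range(D)]
    return positives, negative_tails
```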

Since the number of different relation types is typically small compared to the number of entities, relation embeddings are replicated across all workers and updated using an AllGather operation.

At inference time, the complete set of entities is traversed for a given query (head, relation, ?) and an ordered list of the top-K results is returned.
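A minimal sketch of this exhaustive scoring step, assuming one of the scoring functions above and ignoring the sharding of entity embeddings across workers:

```python
import torch

def top_k_tails(head_emb, relation_emb, all_entity_emb, score_fn, k=10):
    # For a query (head, relation, ?), score every candidate tail entity and
    # return the ids of the k highest-scoring tails, best first.
    num_entities = all_entity_emb.shape[0]
    heads = head_emb.expand(num_entities, -1)
    relations = relation_emb.expand(num_entities, -1)
    scores = score_fn(heads, relations, all_entity_emb)
    return torch.topk(scores, k).indices
```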

The BESS approach guarantees that only tail embeddings have to be exchanged across workers. At the same time, it balances the compute and communication between workers. We trained KGE models using BESS on a Graphcore Bow Pod16, where the scheme benefits from collective communications running over fast IPU-IPU links, removing the need for a separate parameter server. Moreover, the 14.4 GB of in-processor memory allows the aggregate DRAM bandwidth across the whole system to be used efficiently.

Matching the dataset distribution

As detailed by the challenge organisers, the validation and test datasets have been created by sampling triples such that the count of each relation type is proportional to the cube root of its original count. This leads to a severe generalisation gap if the sampling from the training dataset is not adjusted accordingly. We therefore bias the sampling of training triples so that the resulting distribution of relation types matches the cube root of the relation type counts in the training dataset. This procedure aligns the frequency of relation types used during training with the distribution of relations in the validation and test sets and reduces the gap between training and validation MRR (Figure 3).

Figure 3: The cube root sampling strategy aligns the training and validation/test distribution of relation types (left) and reduces the gap between training and validation MRR (right). Image by author.
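The following is one way such cube-root sampling weights could be computed; the function and variable names are our own rather than taken from the competition code.

```python
import numpy as np

def cube_root_sampling_weights(triple_relations, num_relations):
    """Per-triple sampling probabilities such that the expected frequency of each
    relation type is proportional to the cube root of its training-set count."""
    counts = np.bincount(triple_relations, minlength=num_relations).astype(float)
    target = np.cbrt(counts)  # desired relative frequency per relation type
    per_triple = np.divide(target, counts, out=np.zeros_like(target), where=counts > 0)
    weights = per_triple[triple_relations]
    return weights / weights.sum()
```

The resulting probabilities can then be fed to a weighted sampler when drawing training batches.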

Diverse Ensembles to get the best of all worlds

The throughput achieved with BESS on a Graphcore Bow Pod16 enabled us to train a very diverse ensemble of 85 KGE models that combines five different scoring functions (TransE, TransH, RotatE, DistMult, ComplEx [6–10]) and two different loss functions. Each of the scoring functions has different advantages and disadvantages, as summarised in Table 1, making ensembles of models with different scoring functions particularly promising. To further boost model diversity, we adopted two different loss functions: log-sigmoid loss and sampled softmax cross entropy loss. See the paper [1] for further details.

Table 1: Scoring functions and their ability to model fundamental relation properties: S = Symmetry; AS = Antisymmetry; I = Inversion; C = Composition. Image by author.
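As a rough sketch of the two loss functions, applied to the scores of a positive triple and its shared negatives (the exact formulation, margins and negative-sample weighting used in our models may differ; see the paper [1]):

```python
import torch
import torch.nn.functional as F

def logsigmoid_loss(pos_scores, neg_scores):
    # Log-sigmoid loss: push scores of positive triples up and of negatives down.
    # pos_scores: (batch,), neg_scores: (batch, num_negatives).
    return -(F.logsigmoid(pos_scores).mean() + F.logsigmoid(-neg_scores).mean())

def sampled_softmax_ce_loss(pos_scores, neg_scores):
    # Sampled softmax cross-entropy: treat each positive triple as the correct
    # "class" among itself and the shared negatives.
    logits = torch.cat([pos_scores.unsqueeze(1), neg_scores], dim=1)
    targets = torch.zeros(logits.shape[0], dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, targets)
```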

In addition to the structure of the knowledge graph (the set of triples), the WikiKG90Mv2 dataset provides 768-dimensional feature vectors for each entity and relation. Our models utilise the entity features by projecting them into the entity embedding space using a learnable weight matrix and adding the result to the learnable entity embeddings. As the number of different relation types is small, we hypothesise that relation features are not critical for learning good relation embeddings, so we omit these in our model.
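A minimal sketch of this feature handling; the module and attribute names are illustrative:

```python
import torch
import torch.nn as nn

class EntityRepresentation(nn.Module):
    """Learnable entity embedding plus a learned linear projection of the fixed
    768-dimensional entity feature vector."""
    def __init__(self, num_entities, embedding_dim, feature_dim=768):
        super().__init__()
        self.embedding = nn.Embedding(num_entities, embedding_dim)
        self.feature_projection = nn.Linear(feature_dim, embedding_dim, bias=False)

    def forward(self, entity_ids, entity_features):
        # entity_features: (batch, feature_dim) fixed features of the looked-up entities.
        return self.embedding(entity_ids) + self.feature_projection(entity_features)
```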

Predictions of multiple models are combined using a power-rank ensembling strategy, generalising [11]. Each predicted tail entity is assigned a score based on the reciprocal square root of its rank in the individual predictions:

Image by author.
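The formula shown in the original image is not reproduced here; based on the description above and generalising reciprocal rank fusion [11], it presumably takes the form

$$\mathrm{score}(t) = \sum_{k=1}^{K} \frac{1}{\sqrt{\mathrm{rank}_k(t)}},$$

where rank_k(t) is the rank of tail entity t in the ordered predictions of the k-th ensemble member.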

The final predictions are then selected by ranking entities using this score.
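For illustration, a small sketch of this rank fusion over a list of per-model rankings (function and variable names are our own):

```python
import numpy as np

def power_rank_ensemble(model_rankings, k=10):
    """model_rankings: list of sequences, each an ordered list of predicted tail
    ids (best first) from one model. Returns the top-k tails after fusing the
    rankings with reciprocal-square-root scores."""
    scores = {}
    for ranking in model_rankings:
        for rank, tail in enumerate(ranking, start=1):
            scores[tail] = scores.get(tail, 0.0) + 1.0 / np.sqrt(rank)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```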

Results

We trained a wide range of models to a validation Mean Reciprocal Rank (MRR) of at least 0.2, with the best individual model achieving an MRR of 0.243 (Figure 4). We found that, depending on the scoring function, models lend themselves to ensembling to different degrees: despite having lower individual validation MRRs, DistMult and ComplEx models perform exceptionally well in ensembles (Figure 4).

Figure 4: Validation MRR of the K-th best individual model (dashed) and of the ensemble of the K best models (solid) per scoring function. Image by author.

Based on this evidence, we chose an ensemble consisting of 85 KGE models (the 25 best TransE, DistMult and ComplEx models, and the 5 best TransH and RotatE models), achieving a validation MRR of 0.2922 and an MRR of 0.2562 on the test-challenge dataset, winning the WikiKG90Mv2 track of the 2022 Open Graph Benchmark Large-Scale Challenge.

Fast and accurate reasoning on large Knowledge Graphs is an important yet challenging machine learning application. We use a novel distributed processing scheme to improve the scalability of KGE models to large graphs and demonstrate the advantage of a diverse ensemble of well-tuned models. While these methods are no magic bullet, we hope that our insights help the community create fast and accurate Knowledge Graph Embedding models and accelerate their adoption in real-world applications.

Read the paper | Access the code

References

[1] A. Cattaneo, D. Justus, H. Mellor, D. Orr, J. Maloberti, Z. Liu, T. Farnsworth, A. Fitzgibbon, B. Banaszewski, C. Luschi, BESS: Balanced Entity Sampling and Sharing for Large-Scale Knowledge Graph Completion (2022), arXiv preprint arXiv:2211.12281.

[2] S. Bonner, I.P. Barrett, C. Ye, R. Swiers, O. Engkvist, A. Bender, C.T. Hoyt, W.L. Hamilton, A review of biomedical datasets relating to drug discovery: A knowledge graph perspective (2022), Briefings in Bioinformatics, 23(6), p.bbac404.

[3] Y. Hao, Y. Zhang, K. Liu, S. He, Z. Liu, H. Wu, J. Zhao, An end-to-end model for question answering over knowledge base with cross-attention combining global knowledge (2017), In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 221–231).

[4] F. Zhang, N.J. Yuan, D. Lian, X. Xie, W.Y. Ma, Collaborative knowledge base embedding for recommender systems (2016), In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining (pp. 353–362).

[5] W. Hu, M. Fey, H. Ren, M. Nakata, Y. Dong, J. Leskovec, OGB-LSC: A large-scale challenge for machine learning on graphs (2021), arXiv preprint arXiv:2103.09430.

[6] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, O. Yakhnenko, Translating embeddings for modeling multi-relational data (2013). Advances in neural information processing systems, 26.

[7] Z. Wang, J. Zhang, J. Feng, Z. Chen, Knowledge graph embedding by translating on hyperplanes (2014), In Proceedings of the AAAI conference on artificial intelligence (Vol. 28, No. 1).

[8] Z. Sun, Z.H. Deng, J.Y. Nie, J. Tang, RotatE: Knowledge graph embedding by relational rotation in complex space (2019), arXiv preprint arXiv:1902.10197.

[9] B. Yang, W.T. Yih, X. He, J. Gao, L. Deng, Embedding entities and relations for learning and inference in knowledge bases (2014), arXiv preprint arXiv:1412.6575.

[10] T. Trouillon, C.R. Dance, J. Welbl, S. Riedel, É. Gaussier, G. Bouchard, Knowledge graph completion via complex tensor factorization (2017), arXiv preprint arXiv:1702.06879.

[11] G.V. Cormack, C.L. Clarke, S. Buettcher, Reciprocal rank fusion outperforms condorcet and individual rank learning methods (2009), In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval (pp. 758–759).

