
Gentle Introduction to Knowledge Representation Learning


Knowledge representation learning (KRL) mainly focuses on learning knowledge graph embeddings while preserving their semantic similarities. These embeddings have proven extremely useful as feature inputs for a wide variety of prediction and graph analysis tasks [1, 2, 3]. We will elaborate on specific applications in future articles; for now, we will mainly focus on the technical implementation.

KRL systems, in general, consist of three components (Fig. 1): the knowledge graph data, the entity/relation embeddings that encode the graph into a low-dimensional vector space, and the scoring component, which measures the plausibility of facts so that the embeddings capture semantic similarities. Here, we can differentiate between similarity-based and distance-based scoring functions.

Figure 1: Three main components of knowledge representation learning systems

Knowledge Graph

A knowledge graph (KG) represents knowledge as interconnections between nodes and edges in a graph (Fig. 2). It provides a structured representation of facts drawn from the relationships between entities, where entities can be real-world objects or abstract concepts and relationships are the arrows connecting them.

Figure 2: Part of a larger biomedical knowledge graph

Formally, facts are defined as triples (head entity, relation, tail entity). Let E = {e_1, …, e_n} be the set of all entities (heads and tails) and R = {r_1, …, r_m} be the set of all relations in the knowledge graph. Each potential triple x_ijk over this set of entities and relations can then be represented by a binary random variable y_ijk ∈ {0, 1}, and all such variables together form a three-dimensional adjacency tensor (Fig. 3).

Figure 3: Adjacency tensor representing existing interactions E × R × E ∈ KG
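
To make the adjacency-tensor view concrete, here is a minimal sketch in plain NumPy that encodes a handful of toy triples (the entity and relation names are illustrative assumptions, not taken from a real KG) as a binary tensor y of shape |E| × |R| × |E|:

    import numpy as np

    # Hypothetical toy knowledge graph as (head, relation, tail) triples
    triples = [
        ("Corona", "is_a", "Virus"),
        ("Corona", "belongs_to", "Influenza_family"),
        ("Virus", "infects", "Human"),
    ]

    entities = sorted({h for h, _, _ in triples} | {t for _, _, t in triples})
    relations = sorted({r for _, r, _ in triples})
    e_idx = {e: i for i, e in enumerate(entities)}
    r_idx = {r: k for k, r in enumerate(relations)}

    # y[i, k, j] = 1 if the triple (e_i, r_k, e_j) is observed in the KG, else 0
    y = np.zeros((len(entities), len(relations), len(entities)), dtype=np.int8)
    for h, r, t in triples:
        y[e_idx[h], r_idx[r], e_idx[t]] = 1

    print(y.shape)                                            # (|E|, |R|, |E|) = (4, 3, 4)
    print(y[e_idx["Corona"], r_idx["is_a"], e_idx["Virus"]])  # 1 -> existing fact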

Entity/Relation Embedding

In this first step, we embed triples via latent features of entities: a possible explanation for the fact that Corona is a virus is its association with the family of influenza viruses that spread rapidly among humans. This explanation uses latent features of entities (e.g., being part of the influenza family) to explain observable facts. The concept is less intuitive because these features are not directly observable in the data. Figure 4 provides a simple illustration, in which each entity is represented by two features: the latent feature representation θ_i of the entity Corona is a two-dimensional feature vector. Note that, in practice, the feature dimension is around 200 and, unlike in this example, the features are usually inferred by the latent feature models, which generally makes them impossible to translate or interpret. The basic intuition of relational latent feature models is that the relationships between entities can be derived from mathematical interactions of their latent features. There are various ways to model these interactions, and several methods to derive the existence of a relationship from them, some of which we will explore in the second step.

Figure 4: Example of a two dimensional embedding vector for each entity node.
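
As a toy illustration of Figure 4, the following sketch assigns hand-picked (not learned) two-dimensional feature vectors θ_i to a few hypothetical entities and checks that connected entities end up close to each other in vector space:

    import numpy as np

    # Hand-picked (not learned) two-dimensional latent feature vectors theta_i,
    # e.g. dimension 1 = "member of a virus family", dimension 2 = "human host"
    theta = {
        "Corona":           np.array([0.9, 0.2]),
        "Influenza_family": np.array([0.8, 0.3]),
        "Human":            np.array([0.1, 0.9]),
    }

    def cosine(a, b):
        # Cosine similarity between two embedding vectors
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Connected/similar entities should end up close in vector space
    print(cosine(theta["Corona"], theta["Influenza_family"]))  # high similarity
    print(cosine(theta["Corona"], theta["Human"]))             # lower similarity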

Scoring Component

Here, the idea is to embed similar/connected nodes close to each other in vector space. Internally, the model tries to maximize the score f(θ_h, θ_r, θ_t) for every triple in the KG and to minimize it for triples not in the KG. This process is iterated over multiple epochs to jointly learn the embeddings of all nodes and relations in the graph. In a neural link prediction model, both steps are combined in a multi-layer neural network consisting of encoding layers (CNNs work very well) and one or more scoring components (mostly fully connected layers) that measure the plausibility of triples. The central difference between individual models lies in their choice of the scoring function f. According to the scoring function used, models can be roughly divided into two mainstreams: translational distance models and semantic matching models. Distance-based scoring functions measure the plausibility of a fact by calculating the distance between entities, where the additive translation h + r ≈ t is widely used. Semantic-similarity-based scoring measures the plausibility of a fact by semantic matching, which usually adopts a multiplicative formulation, i.e., h⊤M_r ≈ t⊤, to transform the head entity close to the tail in the representation space.

Figure 5: Translational distance-based scoring (left) and semantic similarity-based scoring function (right) [4]
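
The following sketch contrasts the two scoring families with randomly initialized (untrained) embeddings: a translational distance score in the spirit of TransE (h + r ≈ t) and a multiplicative semantic-matching score in the spirit of RESCAL (h⊤M_r t), together with a margin-based ranking loss as a hint of how such scores are typically trained. The dimension, names, and margin value are assumptions for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    dim = 8

    # Randomly initialised (untrained) embeddings for one example triple (h, r, t)
    h, r, t = rng.normal(size=dim), rng.normal(size=dim), rng.normal(size=dim)
    M_r = rng.normal(size=(dim, dim))   # relation-specific matrix for semantic matching

    def score_distance(h, r, t):
        # Translational distance score (TransE-style): higher = more plausible
        return -np.linalg.norm(h + r - t)

    def score_semantic(h, M_r, t):
        # Multiplicative semantic-matching score (RESCAL-style bilinear form h^T M_r t)
        return float(h @ M_r @ t)

    # Training typically pushes scores of observed triples above those of corrupted
    # (negative) triples, e.g. via a margin-based ranking loss:
    t_neg = rng.normal(size=dim)        # embedding of a corrupted tail entity
    margin = 1.0
    loss = max(0.0, margin - score_distance(h, r, t) + score_distance(h, r, t_neg))
    print(score_distance(h, r, t), score_semantic(h, M_r, t), loss)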

References

  1. Shaosheng Cao, Wei Lu, and Qiongkai Xu. Grarep: Learning graph representations with global structural information. In Proceedings of the 24th ACM international on conference on information and knowledge management, pages 891–900, 2015.
  2. Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 855–864, 2016.
  3. Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 701–710, 2014.
  4. Shaoxiong Ji et al. A survey on knowledge graphs: Representation, acquisition and applications. arXiv preprint arXiv:2002.00388, 2020.
