Text Generation from Knowledge Graphs with Graph Transformers

Alec Robinson
Towards Data Science
7 min read · Aug 25, 2020


A summary of the structure

This 2019 paper is a bit of an anachronism, given the speed of transformer model development and the world-changing impact of GPT-3. Yet, as someone with a cognitive science background, I enjoy models which, by their structure rather than raw computational power, could help to peel back the curtain on cognitive processes. This model is an attempt to solve the problem of representing long-term dependencies.

Language generation broadly consists of two components: planning and realisation. Realisation is purely the task of generating grammatical text, regardless of its meaning, its relation to other text, or its overall sense. Planning is the process of ensuring that long-term dependencies between entities in the text are resolved, and that items relate to each other semantically in the proper way. In the following phrase, the key entities are John, bartender, London, and England.

[Figure from Moryossef et al., 2019 | https://arxiv.org/abs/1904.03396]

Planning is managing those entities and the relations between them, and realisation is generating the phrase “John, who works as a bartender, lives in London, the capital of England” (Moryossef, 2019).
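
Concretely, the planner's output for that sentence can be thought of as a small set of typed entities plus relation triples between them. A minimal sketch (my own illustration, not taken from either paper):

```python
# A minimal sketch (my own illustration): the "plan" for the example sentence
# is just a set of typed entities plus relation triples between them.
entities = {
    "John": "PERSON",
    "bartender": "OCCUPATION",
    "London": "CITY",
    "England": "COUNTRY",
}

triples = [
    ("John", "works_as", "bartender"),
    ("John", "lives_in", "London"),
    ("London", "capital_of", "England"),
]

# Realisation then turns this plan into the surface string:
# "John, who works as a bartender, lives in London, the capital of England"
```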

GraphWriter generates an abstract from the words in the title and the constructed knowledge graph. Contributions of this paper include:

  • A new graph transformer encoder that applies the sequence transformer to graph-structured inputs
  • A demonstration of how IE output can be transformed into a connected unlabeled graph for use in attention-based encoders
  • A dataset of knowledge graphs paired with scientific texts for further study

Before the input goes into the encoder (more on that later), it has to be arranged in the right way. Input for this model comes in two channels: the title, and a knowledge graph of the entities and relations.

Dataset

For this, the AGENDA dataset was introduced: 40k paper titles and abstracts from the proceedings of 12 top AI conferences, drawn from the Semantic Scholar Corpus (Ammar et al., 2018). After the graph creation and preprocessing below, the dataset was fed to the model. The full repo, including the dataset, is available here.
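
To make the two input channels concrete, each AGENDA-style example pairs a title and abstract with the entities, entity types, and relations extracted from the abstract. The structure below is purely illustrative; the field names are my assumptions, not the released dataset's exact schema:

```python
# Purely illustrative structure of one AGENDA-style example; the field names
# here are assumptions, not the released dataset's exact schema.
example = {
    "title": "Text Generation from Knowledge Graphs with Graph Transformers",
    "entities": ["graph transformer", "knowledge graph", "scientific text"],
    "types": ["Method", "Material", "Material"],
    "relations": [
        ("graph transformer", "USED-FOR", "scientific text"),
        ("knowledge graph", "USED-FOR", "scientific text"),
    ],
    "abstract": "...",  # the target text the model learns to generate
}
```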

Graph Pre-Processing:

Creation of a knowledge graph:

  1. The NER/IE SciIE system of Luan et al. (2018) is applied, which extracts and labels entities and creates coreference annotations
  2. These annotations are then collapsed into single labelled edges
  3. Each graph is then converted to a connected graph using an added global node that all other nodes are connected to
  4. Each labelled edge is then replaced with two nodes representing the relation in each direction, and the new edges are unlabeled. This allows the graph to be represented as an adjacency matrix, which makes it straightforward to process

[Figure: graph construction and preprocessing, Koncel-Kedziorski et al. (Allen Institute), 2019]

One of the key features in the formation of this graph is the addition of a global node G, to which all entities are connected; this transforms what would otherwise be a set of disconnected subgraphs into a single connected graph. The end product is a connected, unlabelled, bipartite graph.
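
A rough sketch of that preprocessing, starting from labelled triples (my own illustration under the assumptions above, not the paper's released code): each labelled edge becomes a forward and a reverse relation node, a global node is connected to every entity, and the result is stored as a plain adjacency matrix.

```python
import numpy as np

def build_bipartite_graph(entities, triples):
    """Sketch of the preprocessing described above (not the paper's released
    code): each labelled edge becomes two relation nodes (one per direction),
    a global node is connected to every entity, and everything is stored as
    a plain, unlabeled adjacency matrix."""
    nodes = list(entities) + ["<GLOBAL>"]
    edges = [(e, "<GLOBAL>") for e in entities]            # connect entities to G
    for i, (head, rel, tail) in enumerate(triples):
        fwd, rev = f"{rel}#{i}", f"{rel}-inv#{i}"          # relations become nodes
        nodes += [fwd, rev]
        edges += [(head, fwd), (fwd, tail),                # head -> rel -> tail
                  (tail, rev), (rev, head)]                # tail -> rel^-1 -> head
    index = {n: j for j, n in enumerate(nodes)}
    adj = np.zeros((len(nodes), len(nodes)), dtype=np.float32)
    for a, b in edges:
        adj[index[a], index[b]] = 1.0                      # unlabeled edges only
    return nodes, adj

nodes, adj = build_bipartite_graph(
    ["John", "bartender", "London", "England"],
    [("John", "works_as", "bartender"),
     ("John", "lives_in", "London"),
     ("London", "capital_of", "England")],
)
```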

Model Architecture:

The model uses an encoder-decoder architecture, with the unlabeled bipartite graph and the title as inputs.

[Figure: GraphWriter model overview, Koncel-Kedziorski et al. (Allen Institute), 2019]

Title encoder: The title is encoded with a BiLSTM, using 512-dimensional embeddings. No pre-trained word embeddings were used for this.
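
A sketch of such a title encoder in PyTorch, using the 512-dimensional embeddings mentioned above; the hidden size and other details are my guesses, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class TitleEncoder(nn.Module):
    """BiLSTM over the title tokens, with embeddings learned from scratch.
    A sketch consistent with the description above; hidden size and other
    details are guesses rather than the authors' configuration."""
    def __init__(self, vocab_size, emb_dim=512, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)     # no pre-trained vectors
        self.bilstm = nn.LSTM(emb_dim, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, title_ids):                          # (batch, title_len)
        emb = self.embed(title_ids)                        # (batch, len, emb_dim)
        states, _ = self.bilstm(emb)                       # (batch, len, 2 * hidden_dim)
        return states                                      # one vector per title word
```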

Graph encoder:

The first step is that each vertex representation vi is contextualized by attending to all the other vertices in vi's neighbourhood.

The graph encoder then creates a set of vertex embeddings by concatenating the attention-weighted outputs of N separate attention heads.

These embeddings are then augmented with "block" networks, consisting of multiple stacked blocks of two layers each.

The end result is a set of encodings for the entities, relations, and global node, each contextualized by its graph neighbourhood, called the graph contextualized vertex encodings.
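
Putting those pieces together, the encoder amounts to attention restricted to each vertex's neighbourhood followed by a small feed-forward block, stacked several times. The sketch below follows that description, using the adjacency matrix as an attention mask and a single attention head for brevity; it is my reading of the idea, not the paper's exact equations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphTransformerLayer(nn.Module):
    """One encoder block, as I read the description above: attention restricted
    to each vertex's graph neighbourhood, then a two-layer feed-forward network,
    both with residual connections. A single attention head is used for brevity;
    the paper concatenates N heads. This is a sketch, not the paper's equations."""
    def __init__(self, dim=512):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x, adj):
        # x: (nodes, dim) vertex embeddings; adj: (nodes, nodes) 0/1 adjacency
        adj = adj + torch.eye(adj.size(0))                    # let nodes see themselves
        scores = self.q(x) @ self.k(x).T / x.size(-1) ** 0.5  # scaled dot-product
        scores = scores.masked_fill(adj == 0, float("-inf"))  # neighbours only
        ctx = F.softmax(scores, dim=-1) @ self.v(x)           # neighbourhood context
        x = self.norm1(x + ctx)                               # residual + layer norm
        x = self.norm2(x + self.ffn(x))                       # the two-layer "block"
        return x                                              # contextualized vertices
```

Because the global node is connected to every entity, stacking a few of these layers gives every vertex a short path to information from the rest of the graph.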

Decoder (graph and title)

The decoder is attention-based, with a copy mechanism for copying input from the entities in the knowledge graph and from the words in the title. It also maintains a hidden state ht at each timestep t.

From the encodings created by the encoder, context vectors Cg and Cs are computed using multi-headed attention.

Here, Cs (title context vector) is computed the same way as Cg. The two are concatenated together to make Ct, the total context vector.

The probability of copying a token from a word in the title or from an entity name is then computed by multiplying the total context vector by Wcopy.

The final next-token probability distribution interpolates between this copy distribution over title words and entity names and a softmax over the output vocabulary, weighted by the copy probability.
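
A sketch of that output layer, assuming the standard copy-mechanism formulation; only Wcopy is named in the description above, so the other layer names and the exact parameterisation here are my assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CopyOutputLayer(nn.Module):
    """Final next-token distribution for a copy-mechanism decoder: a gate
    p_copy (from the total context vector and Wcopy) chooses between copying
    an input item (entity or title word) and generating a vocabulary word.
    A sketch of the standard formulation; the paper's exact parameterisation
    may differ."""
    def __init__(self, ctx_dim, vocab_size):
        super().__init__()
        self.w_copy = nn.Linear(ctx_dim, 1)            # the Wcopy projection
        self.w_vocab = nn.Linear(ctx_dim, vocab_size)

    def forward(self, c_t, copy_attn):
        # c_t: (batch, ctx_dim) total context vector Ct (graph + title)
        # copy_attn: (batch, n_copyable) attention over entities and title words
        p_copy = torch.sigmoid(self.w_copy(c_t))                  # (batch, 1)
        copy_dist = copy_attn / copy_attn.sum(-1, keepdim=True)   # normalise
        vocab_dist = F.softmax(self.w_vocab(c_t), dim=-1)         # (batch, vocab)
        # concatenate [copyable items ; vocabulary], weighted by the gate
        return torch.cat([p_copy * copy_dist, (1 - p_copy) * vocab_dist], dim=-1)
```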

Experiments

GraphWriter is compared to GAT (the graph transformer replaced with graph attention), EntityWriter (like GraphWriter, but without using the graph structure), and Rewriter (Wang et al., 2018), which uses only the title encodings.

Human and automatic evaluation metrics used:

  • Human evaluation: domain experts judging whether abstracts fit their titles, scored with Best-Worst scaling (Louviere and Woodworth, 1991); a quick sketch of computing these metrics follows this list
  • BLEU (Papineni et al, 2002)
  • METEOR (Denkowski and Lavie, 2014) (precision and recall over the unigram frequencies of the generated output versus the original abstracts)
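
For reference, here is a minimal way to compute BLEU on a generated abstract with NLTK, plus the usual Best-Worst scaling score; this is my own illustration, not the paper's evaluation scripts.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# BLEU between a generated abstract and the reference abstract
# (my own illustration; not the paper's evaluation scripts).
reference = "we present a learning architecture for lexical semantic classification".split()
generated = "we present a novel learning architecture for lexical semantic classification".split()
bleu = sentence_bleu([reference], generated,
                     smoothing_function=SmoothingFunction().method1)

# Best-Worst scaling: a system's score is the share of comparisons in which
# it was chosen best minus the share in which it was chosen worst.
def best_worst_score(times_best, times_worst, n_judgments):
    return (times_best - times_worst) / n_judgments
```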

[Table: automatic evaluation results (BLEU, METEOR), Koncel-Kedziorski et al., 2019]

[Table: human evaluation results, Koncel-Kedziorski et al., 2019]

There was one thing about this paper that I found mildly disappointing: the 'subject matter experts' were undergraduate CS students. While undergrads can know a lot by the time they graduate, I don't know how much familiarity they typically had with academic papers in computer science.

What I thought to be most telling about the human judgments of abstract quality was that while the human output wasn’t always deemed to be the best, it was never the worst. That led me to suspect that there was something about the generated abstracts that the relatively unseasoned undergrads couldn’t quite put their fingers on, but nonetheless tipped them off that something was a bit amiss.

The question, after looking at the decoding mechanism used for generation, is: how much was ultimately copied versus generated? To find out, I emailed the author, Rik Koncel-Kedziorski, for copies of the generated abstracts and the original documents, which he graciously provided.

Original abstract:

we present a learning architecture for lexical semantic classification problems that supplements task-specific training data with background data encoding general '' world knowledge '' . the learning architecture compiles knowledge contained in a dictionary-ontology into additional training data , and integrates task-specific and background data through a novel hierarchical learning architecture . experiments on a word sense disambiguation task provide empirical evidence that this '' hierarchical learning architecture '' outperforms a state-of-the-art standard '' flat '' one .

The GraphWriter-generated abstract:

in this paper , we present a novel learning architecture for lexical semantic classification problems , in which a learning architecture can be trained on a large amount of training data . in the proposed learning architecture , a hierarchical learning architecture is learned to select the most informative examples from the source domain to the target domain . the learning architecture is trained in a hierarchical learning architecture , where the background data is learned from the training data in the target domain . the learning architecture is trained on data from the source domain and a target domain in the target domain . experimental results show that the proposed learning architecture is effective in improving the performance of lexical semantic classification problems .
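
One crude way to put a number on "copied versus generated" is to measure how many of the generated tokens already appear in the title or in the extracted entity names. This is my own quick check, not a metric from the paper:

```python
def copy_rate(generated_abstract, title, entity_names):
    """Fraction of generated tokens that already occur in the title or in an
    extracted entity name; a crude proxy for 'copied vs. generated', and my
    own quick check rather than a metric from the paper."""
    source_tokens = set(title.lower().split())
    for name in entity_names:
        source_tokens.update(name.lower().split())
    tokens = generated_abstract.lower().split()
    copied = sum(token in source_tokens for token in tokens)
    return copied / max(len(tokens), 1)

# e.g. copy_rate(graphwriter_abstract, paper_title, ["learning architecture",
#                "lexical semantic classification", "training data"])
```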

Analysis:

One of the key questions investigated here was whether knowledge helps when it is explicitly encoded, as opposed to just being absorbed by a model with many parameters. In this case, it did. Rewriter, an LSTM-based model that used only title words to generate abstracts, performed the worst on every evaluation metric chosen. EntityWriter used more information than Rewriter, and was an effective control in that it used the same extracted entities as GraphWriter, but without the context provided by the graph. It performed better than using no extracted knowledge at all, but was still outperformed by GraphWriter, which used the context created by the graph.

I think it's important here not to view GraphWriter as being in direct competition with GPT-3 for plausible human output; it clearly isn't, and the non-experts I unscientifically polled had no difficulty telling which abstract was human-written and which one clearly was not. But that wasn't the test, or the goal. The test was whether a system that used structured domain knowledge would do better than one that didn't, such as EntityWriter, and in that case, it did.

References:

Amit Moryossef, Yoav Goldberg, and Ido Dagan. 2019. Step-by-step: Separating planning from realization in neural data-to-text generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2267–2277, Minneapolis, Minnesota. Association for Computational Linguistics. https://arxiv.org/abs/1904.03396

Rik Koncel-Kedziorski, Dhanush Bekal, Yi Luan, Mirella Lapata, and Hannaneh Hajishirzi. 2019. Text Generation from Knowledge Graphs with Graph Transformers. In Proceedings of NAACL-HLT 2019. https://arxiv.org/abs/1904.02342
