Knowledge Graphs in Natural Language Processing @ ACL 2020

State of the Art Mid 2020

Michael Galkin
Towards Data Science

--

This post commemorates the first anniversary of the series where we examine advancements in NLP and Graph ML powered by knowledge graphs! 🎂 1️⃣
The feedback of the audience drives me to continue, so fasten your seatbelts (and maybe brew some ☕️): in this episode, we are looking at the KG-related ACL 2020 proceedings!

ACL 2020 went fully virtual this year and I can’t imagine how hard was it for the chairs to organize such an enormous event online catering for multiple time zones and 700+ accepted papers. Thanks to all involved as well as speakers and participants, the conference went smoothly given its size 👏

So did something change in the KG & NLP field comparing to ACL 2019? Yes!
I would summarize this year’s contribution as:

Knowledge graphs demonstrate better capabilities to reveal higher-order interdependencies in otherwise unstructured data

Today in our agenda:

  1. Question Answering over Structured Data
  2. KG Embeddings: Hyperbolic and Hyper-relational
  3. Data-to-text NLG: Prepare your Transformer
  4. Conversational AI: Improving Goal-Oriented Bots
  5. Information Extraction: OpenIE and Link Prediction
  6. Conclusion

Question Answering over Structured Data

In this context, questions are sent against structured sources like SPARQL-based KGs or SQL databases (other query languages are not that prominent).
This year, we can observe increasing efforts incorporating complex (also known as multi-hop) questions.

For example, Saxena et al tackle the problem of complex KGQA coupling KG embeddings with a question embedding vector in their EmbedKGQA. 1️⃣ First, an underlying KG is embedded with some algorithm (the authors choose ComplEx), so that each entity and relation is associated with a specific vector. In some cases, the authors freeze them, or keep fine-tuning depending on the KG size. 2️⃣ The input question is encoded via RoBERTA ([CLS] token from the last layer) and passed through 4 FC layers (well, if you ask why exactly 4, I don’t have an answer, looks like a magic number 🧙‍♂️) that are supposed to project the question into the complex space. 3️⃣ The crucial part happens in scoring where the authors adopt the KG embeddings framework and build a triple (head entity, question, candidate entity). The scoring function is the same as in ComplEx. Here, the head is the main entity in the question, the question itself is considered a relation (might seem a bit stretched though), and candidate entities are either all entities in the KG (if small) or 2-hop subgraph around the head (when pruning is required). Yes, it does resemble a typical 1-N scoring mechanism used for training KGE algorithms. ✂️ A candidate space can be further pruned by computing and thresholding a dot product (h_q, h_r) between the question embedding h_q and each relation embedding h_r .

🧪 In the experiments performed over MetaQA and WebQuestionsSP, the authors probe an additional scenario of an incomplete KG randomly removing 50% of the edges, so that the system has to learn to infer such missing links. In the full scenario, EmbedKGQA performs on par with PullNet (slightly better on 3-hop questions) and 10–40% better in absolute Hits@1 scores compared to baselines when not using additional text to augment the KG. 👏
Still, it is interesting to check how EmbedKGQA would process questions that require aggregations or have several grounded entities 🤔.

EmbedKGQA. Source: Saxena et al

Somewhat on the other side of the spectrum, Lan et al propose to use iterative an RL-based query generation approach (KG embeddings-free 😉 ). Based on the topic entity (obtained via some entity linker, the authors resort to Google KG API for linking to Freebase), there is a set of 3 operations, namely, extend, connect, and aggregate that are applied to the seed entity building a query pattern. Naturally, the operations allow for complex multi-hop patterns with min/max aggregations. 🔦 At each step, the authors use beam search to keep K best patterns and rank them by deriving a 7d feature vector followed by a feed-forward net with a softmax. 🤨 You’d ask: “Wait, but where is BERT? Everybody’s using BERT now!”. Well, no panic, here you go: the surface forms of entities and relations participating in a query graph are linearized, concatenated with the input question and fed into BERT to get a [CLS] representation of the last layer (which is one of 7d features).

🌡 The authors evaluate the approach on ComplexWebQuestions, WebQuestionsSP, and ComplexQuestions (looks a bit tailored to Freebase, isn’t it?), and find a noticeable improvement over the baselines. Further ablations show the importance of the 3 chosen operators. And here is the cliffhanger: it is a short paper! 👀 I would recommend this paper as a good example of a short paper that conveys the main idea, shows experiments, and demonstrates the validity of the approach with ablations 👍

Source: Lan et al

📑 Structured QA also includes semantic parsing over SQL tables, and many new complex datasets drive the research in SQLandia.

Source: Wang et al

Among others, I’d outline Wang et al and their RAT-SQL (relation-aware transformer, no 🐀) where they define explicit edges between columns and tables to encode the database schema. The authors also define an initial schema and value linking to obtain candidate columns and tables. Further, columns, tables, and question tokens are jointly passed through modified self-attention layers. ➡️ Finally, the tree-structured decoder builds an SQL query.
RAT-SQL shows drastic improvements in Spider with bigger gains when using BERT for initial embeddings of tokens 📈

Source: Elgohary et al

Often, interacting with a semantic parsing system you’d want to fix small problems on the fly, or point 🔴 the parser at its mistake. Elgohary et al tackle this exact problem and propose SPLASH, a dataset for correcting SQL parses with natural language feedback. The correction scenario is different from conversational text2SQL tasks, so even recent SOTA models like EditSQL exhibit a large gap in the correction task compared to human annotators, i.e., 25% against 81%. That’s quite a gap 👀.
Along the same dimension, Zeng et al develop Photon which is fully-fledged text-to-SQL system capable of performing some query correction as well 😉

KG Embeddings: Hyperbolic and Hyper-relational

Hyperbolic spaces are among recent hot topics in ML. Refraining from 🤯, in simpler terms, in a hyperbolic space 🔮 (thanks to its properties) you can represent hierarchies and tree-like structures more efficiently and at the same time use fewer dimensions!

Source: Chami et al

With that motivation, Chami et al propose AttH, a hyperbolic KG embedding algorithm that uses rotations, reflections, and translations to model logical and hierarchical patterns in the KG. Att comes from the hyperbolic attention that is applied to rotated and reflected vectors. 🎩The trick to bypass shaky Riemannian optimization is to use tangent spaces, to which every point of the d-dimensional Poincare ball can be mapped. In this obviously non-trivial setup, each relation is associated not only to one vector, but also to the parameters that describe relation-specific reflections and rotations. Still, in real-world KGs R << V , so the overhead is not that big.

⚖️ In the experiments, AttH performs especially well on WN18RR and Yago 3–10 that exhibit some hierarchical structure, with lesser margins on FB15k-237. More importantly, just 32-dimensional AttH shows huge margins compared to 32-d models in real and complex planes. Moreover, 32-d scores are just 0.02–0.03 MRR points smaller than SOTA 500-d embedding models on WN18RR and FB15k-237. Ablation studies demonstrate the importance of having learnable curvatures whereas its closest match, MurP, has them fixed.

Source: Rosso et al. Published at WebConf 2020.

Another growing trend in graph representation learning is to go beyond simple KGs consisting of triples and learn representations for more complex, hyper-relational KGs (as coined in the work by Rosso et al), when every triple might have a set of key-value attribute pairs that give fine-grained details on the validity of the triple in various contexts. In fact, Wikidata adopts the hyper-relational model in its Wikidata Statement Model where attributes are called qualifiers. It’s important not to mix the model with n-ary facts (that generate redundant predicates) and hypergraphs. That is, if you work with Wikidata only on the triple level, you lose a good half of the content 😃

Guan et al do not want to lose a good half of Wikidata and propose NeuInfer, an approach for learning embeddings of hyper-relational KGs (their previous work, NaLP, was more suited to n-ary facts).

Source: Guan et al

The idea of NeuInfer is to compute a validity and compatibility score of a hyper-relational fact (cf the illustration). First, (h,r,t) embeddings are fed into a fully-connected net (FCN) to estimate the likelihood of this triple (validity). Second, for each key-value pair a quintuple (h,r,t,k,v) is constructed and passed through another set of FCNs. Having m pairs, m vectors are min-pooled and the result represents the compatibility score, i.e., how well those qualifiers live with the main triple. Finally, the authors use a weighted sum of two scores to get the final score.

The authors evaluate NeuInfer on standard benchmarks JF17K (extracted from Freebase) and WikiPeople (from Wikidata) and report significant improvement in JF17K compared to NaLP when predicting heads, tails, and attribute values. 📈 I would encourage the authors to compare their numbers with HINGE (from Rosso et al) as both approaches are conceptually similar.

💡 And now we need to talk. We need to talk about reproducibility of KG embedding algorithms published even at top conferences like ACL 2019. Sun, Vashishth, Sanyal et al find that several recent KGE models that reported SOTA results (drastically better than existing baselines) suffer from test set leakages, or have unusually many zeroified neurons after ReLU activations scoring valid triples ☢️. Further, they show that their performance metric scores (like Hits@K and MRR) depend on the position of the valid triple among sampled negatives (which should not happen, actually). On the other hand, existing strong baselines perform exactly the same despite any position. The take-away message is to use the evaluation protocol that places a valid triple at a random position among negatives.

Source: Sun, Vashishth, Sanyal et al

[start of a shameless self-promotion 😊] Well, our team also has something to say about this issue: in our new paper “Bringing Light Into the Dark: A Large-scale Evaluation of Knowledge Graph Embedding Models Under a Unified Framework” we performed 65K+ experiments and spent 21K+ GPU hours evaluating 19 models spanning from RESCAL first published in 2011 to the late 2019’s RotatE and TuckER, 5 loss functions, various training strategies with/without negative sampling, and many more hyper-parameters that turn out to be important to consider. We are also releasing the best found hyperparameters for all the models for you folks and our beloved community 🤗. In addition, we are releasing PyKEEN 1.0, a PyTorch library for training and benchmarking KG embeddings models! [end of the self-promotion]

🔥 Several other works I’d encourage you to read thoroughly: Sachan studies the problem of compression of KG entity embeddings by discretization, e.g., “Barack Obama” instead of a 200d float32 vector would be encoded as “2–1–3–3” and “Michelle Obama” as “2–1–3–2”.

An example of saving CD space with (hydraulic) compression

That is, you only need a D-long vector of K values (here, D=4, K=3). For discretization, tempered softmax is found to work better. And as a reverse function from the KD code back to N-dimensional vector of floats the author suggests using a simple Bi-LSTM. Experimental results are astonishing 👀 — compression rates for FB15k-237 and WN18RR reach 100–1000x with a negligible (max 2% MRR) performance drop and computation overhead at inference time (when a KD code has to be decoded back). 🤔 I suggest we all sit down for a minute and re-think our KGE pipelines (especially in production scenarios). For example, 200d embeddings of 78M Wikidata entities obtained via PyTorch-BigGraph require 👉 110 GB 👈 of space. Just imagine what would be possible with a gentle 100x compression 😏.

➕ There is also a lineup of works that improve popular KGE models:
* Tang et al generalize RotatE from 2D rotations to high-dimensional spaces with orthogonal relation transforms which works better for 1-N and N-N relations.
* Xu et al generalize bilinear models to multi-linear by chunking dense vectors in K parts. It is then shown that if K=1 the approach is equal to DistMult, if K=2 the approach reduces to ComplEx and HolE, and the authors experiment with K=4 and K=8.
* Xie et al extend ConvE by replacing standard conv filters with those from the Inception network famous in the Computer Vision domain.
* Nguyen et al apply a self-attention style encoder and a CNN decoder for triple classification and search personalization tasks.

Data-to-text NLG: Prepare your Transformer

An example RDF-to-text task from WebNLG 2017. Source: Zhao et al

As KGs (and structured data in general) become widely adopted in NLP, in 2020, we can observe a surge of natural language generation (NLG) approaches that take a set of RDF triples / an AMR graph / a set of table cells and produce a coherent human-readable text like description or question.

By the way, current RDF-to-text approaches are only evaluated on WebNLG 2017 data, but there is a new run of the challenge, WebNLG 2020! 🎉 If you are into NLG, be sure to participate 😉

Well, the NLG trend of the year is exhaustively summarized in one tweet:

A recent summary of SOTA algorithms in NLP

Sophisticated planners and executors? Some structural alignment? NO 😅. Just spin up your favorite pre-trained LM.

🤔 In fact, plugging in a pre-trained LM and feeding it a few examples indeed works. Chen et al demonstrate this phenomenon on tables coupled with the GPT-2 decoder. That is, table cells are first passed through a learnable LSTM encoder to get a hidden state for the copy mechanism. On the other hand, the text goes into GPT-2 with frozen weights. ✍️The copy mechanism on top helps to retain rare tokens from table cells. The experiments on WikiBio show that as few as 200 training examples are enough to generate texts much better than sophisticated strong baselines. Guess how many would GPT-3 need 😏

Source: Chen et al
Source: Chen et al

Continuing with tables, Chen et al build a new dataset, LogicNLG, that requires using additional logic on top of standard text generation. For example (cf the illustration), some comparative and counting operations are needed to include parts like “1 more gold medal” or “most gold medals” which make the text much more natural and lively 🌼. Baseline models for the dataset use pre-traind GPT-2 and BERT, but looks like there is still some room for LMs to improve.

In the graph-to-text domain, Song et al apply a slightly modified Transformer encoder that explicitly processes relation surface forms. The input is just a linearized graph (which you can build, say, with DFS). The decoder is kept intact though. 🎩 The key component of the approach is to add (along with the standard LM loss) two autoencoding losses that are designed to better capture structure of the verbalized graphs. The first loss reconstructs triple relations, whereas another reconstructs nodes and edge labels of linearized input graphs. 🧪 Experiments conducted on AMR and RDF graphs (WebNLG) suggest that just adding those two losses could yield about 2 BLEU points.

Aux losses try to reconstruct View 1 and View 2. Source: Song et al
Don’t hide your pain, just retire BLEU!

🗒 At this point I should make a small remark that everybody should stop using BLEU for evaluating NLG quality (one of the best ACL’20 paper nominees, I’d trust them). The organizers of WebNLG 2020 pretty much share this opinion as they officially measure chrF++ and BertScore in addition to the classic (or shall we say outdated?) metrics. Furhermore, here at ACL’20 a new metric, BLEURT, was proposed and shown to better correlate with human judgement. Let’s invest into those new evaluation metrics and let ol’ BLEU go for some rest 🏖

🌳 Still, there is a vivid life in the world without (or at least not that much of) transformers! Applied to the graph-to-text task, Zhao et al propose DualEnc, an encoder-planner-decoder model. 1️⃣ First, the input graph is pre-processed to transform a relation into an explicit node. Hence, there are several induced labeled edges s->p, p->s, p->o, o->p. The graph is then encoded via R-GCN to obtain entity and relation embeddings. 2️⃣ The same graph is encoded through another R-GCN with additional features showing whether a relation has been used already or not. The plan is constructed in the following manner: while there are unvisited relations, softmax selects the most probable relation, the relation is then appended to the plan. Once the sequence is ready, it is extended with subjects and objects of the those relations. Finally, the resulting sequence is encoded via LSTM. Both graph encoding and plan encoding are fed into the decoder to generate the output.
📏 The experiments show: 1) that DualEnc shows very good generalization on the unseen test set in plan building, 2) text generation quality outperforms straightly applied transformers, 3) great speedup of the planning stage, i.e., 2019 SOTA needs 250 seconds to solve one 7-triple instance, whereas DualEnc solves all 4928 examples in 10 seconds 🚀

Source: Zhao et al

Finally, let’s move from data-to-text to summarization. In the domain of abstractive summarization Huang et al employ KGs built from a document to enhance the generation procedure in their ASGARD approach.

Concretely, the encoder consists of two parts. 1️⃣ RoBERTa is used to encode input paragraphs. The final layer embeddings are fed into a BiLSTM to obtain hidden states. 2️⃣ OpenIE is used to extract triples and induce a graph from the input document. Tokens of relations are transformed into explicit nodes similar to DualEnc, and initial node states are taken from the BiLSTM from step 1. Graph Attention Net (GAT) is then used to update the node states with a readout function to get a graph context vector. 3️⃣ The generation process is conditioned by both vectors obtained at steps 1 and 2.

🧙‍♂️Some magic happens in training: ASGARD resorts to reinforcement learning where thee reward function depends on ROUGE and cloze score. The cloze part consists in extracting OpenIE graphs from human-written summaries and generating cloze-style questions based on them for the system to better learn the meaning of the summarized document. 📦 So you kinda have a QA model inside! The authors generate 1M+ cloze questions for CNN and NYT datasets. 📏 Experiments report improvements over previous baselines! However, the unanimous winner is a pre-trained BART fine-tuned on the target dataset 😅 Well, looks like the “TPUs go brrr” strategy works here, too.

Conversational AI: Improving Goal-Oriented Bots

In the ConvAI domain, I am a bit biased towards goal-oriented systems as KGs and structured data naturally extend their capabilities.

Source: Campagna et al

🔥 First, Campagna et al propose a method to synthesize goal-oriented dialogues as additional training data for the dialogue state tracking (DST) task. The authors create an abstract model (one could name it an ontology, too) that defines basic states, actions, and transitions. Why this is cool: 1️⃣ the model can be applied to various domains like restaurant booking or train connection search with any slots and values; 2️⃣ synthesized data allows for a zero-shot transfer in domains where you have very limited supervised data. 3️⃣ In fact, the experiments show that using only the synthesized corpus for training (and evaluating on the real MultiWoz 2.1 test) reaches about 2/3 of the accuracy of the original full training set 💪
I believe the method could be used as a general data augmentation practice in developing dialogue systems in specific domains or where annotated training data is limited.

When a new Friends-based dataset arrives

Focusing on relation extraction in dialogues, Yu et al develop DialogRE, a new dataset comprised of 36 relations taken from about 2k dialogues from Friends. Although the relations are not annotated with Wikidata or DBpedia URIs, the dataset still poses a considerable challenge even for BERT. Furthermore, the authors propose a new metric that shows how many turns a system needs to extract a certain relation.

OpenDialKG was one of the best paper award nominees at ACL 2019 for the efforts promoting graph-based reasoning in dialogue systems in a new dataset. Zhou et al did a great job adopting the main ideas of OpenDialKG in their new KdConv dataset suited for Chinese 👏

There is also a lineup of works studying how to incorporate external knowledge in end-to-end dialogue systems. If your background knowledge is expressed as textual triples or table cells (or even plain text), Lin et al suggest using Transformer as knowledge encoder, while Qin et al rely on memory network-like encoder. If you have a commonsense KG like ConceptNet, Zhang et al extract concepts from utterances to build a local graph and then employ a GNN encoder to encode the “central concept” of the conversation which will affect the decoder. If you are interested in even more recent ConvAI goodies, be sure to check the proceedings of the NLP for ConvAI workshop co-located (well, virtually) with ACL!

Information Extraction: OpenIE and Link Prediction

If your work happens to be about building KGs from raw text documents, probably you already know that OpenIE is a de-facto standard to start from. As we have seen in previous sections, rule-based frameworks like OpenIE4 or OpenIE 5 are still actively used. That is, increasing the quality of OpenIE extractions could alleviate many problems in KG construction. A small note: KGs obtained after Open IE are also called Open KGs.

Kolluru et al propose a generative OpenIE approach, IMoJIE (Iterative Memory-based Joint Information Extraction). Inspired by the CopyAttention paradigm, the authors propose an iterative generative seq2seq IE algorithm: at each iteration, the original sequence is concatenated with previous extractions and passed through BERT to obtain final embeddings. The LSTM decoder with copy and attention mechanisms is then tasked to generate a new extraction (tokens that would comprise a triple, for instance). 🤖To further improve the training set, the authors aggregate and rank outputs of OpenIE 3, OpenIE 4, and other systems as “silver labels” for generation.

Although the architecture looks pretty simple, it does bring significant improvements 📈 compared to existing baselines. The ablation study reports that BERT is pretty crucial for the overall quality, so I’d hypothesize that quality could be further improved plugging in a bigger Transformer, or using a domain-specific pre-trained LM, e.g., if your text is from legal or biomed field.

While link prediction (LP) on RDF-like KGs has an established track record and several milestones, we could not say the same about LP on open KGs.

Source:

But now we can! 🤩 Broscheit et al define the task of open link prediction given the challenges of open KGs:

  • Given a (‘subject text’, ‘relation text’, ?) query, the system has to predict genuine, new facts that can’t be trivially explained.
  • However, no entity or relation URIs available that would bind many surface forms to one representation.
  • Still, various surface forms of the same entity or relation might constitute a test set leakage, so the test set has to be carefully constructed and cleaned.

The authors propose a methodology how to build and clean the dataset, an evaluation protocol, and the benchmark itself. The OLPBench is one of the largest datasets for LP with KG embeddings: it contains 30M triples, 1M distinct open relations, and 2.5M mentions of 800K unique entities 👀. For experiments, the authors use ComplEx, where multi-token mentions are aggregated via LSTM. The open LP task turns out to be very difficult 😯: even mighty 768d ComplEx yields only 3.6 MRR, 2 Hits@1, and 6.6 Hits@10.
Clearly, this is a very challenging dataset: it is quite interesting to see the approaches that would be not only scalable to such a large graph but also increase the performance to FB15k-237-like numbers (FYI, currently it is about 35 MRR points and 55 Hits@10).

By the way, if you are further interested in building KGs from text, I’d encourage you to check the proceedings of the recent AKBC 2020 conference which attracted a great lineup of speakers and publications 👏

Conclusion

This year at ACL’20 we see less KG-augmented language models (but do have a look at TaPas and TABERT that were designed to work over tables) and maybe a bit less of NER. On the other hand, graph-to-text NLG is on the rise!
Still, you made it to the end, dear reader, and you deserve some applause :)

Let me know in the comments what you liked and what is to be improved. Thanks for reading and stay tuned for more publications 😌

--

--

AI Research Scientist @ Intel Labs. Working on Graph ML, Geometric DL, and Knowledge Graphs