State Of The Art, Summer’21
Welcome to the third iteration of our regular overview of NLP papers around Knowledge Graphs, this time published at ACL 2021! What will be (or has been) trending this year that you wouldn’t want to miss? 👀

ACL’21 remains one of the largest NLP venues: 700+ full papers and 500+ Findings of ACL papers were accepted this year 📈 . Besides, don’t forget the often-impactful short papers and a wide selection of workshops and tutorials. I tried to distill some KG papers from all those tracks into one post. Here are today’s topics:
- Neural Databases & Retrieval
- KG-augmented Language Models
- KG Embeddings & Link Prediction
- Entity Alignment
- KG Construction, Entity Linking, Relation Extraction
- KGQA: Temporal, Conversational, and AMR
- tl;dr
For dataset aficionados, I marked every new dataset with 💾 , so you can search and navigate a bit more easily. Having said that, you’ll probably want some navigation through this ocean of high-quality content 🧭.

Neural Databases & Retrieval
Neural retrieval continues to be one of the fastest-growing and hottest 🔥 topics in NLP: it now works with billions of vectors and with indices of 100+ GB in size. If the NLP stack is mature enough, can we approach the holy grail of database research, well, databases, from the neural side? 🤨

Yes! Thorne et al introduce the concept of natural language databases (denoted as NeuralDB): there is no pre-defined rigid schema; instead, you store facts directly as the text utterances in which you write them.
NB: if you are more of a database person and rank "proper DB venues" higher, the foundational principles were also laid out in the recent VLDB’21 paper by the same team of authors.
How does it work? What is the query engine? Are there any joins? (no joins – not a database!)
The proposed NL DB consists of K textual facts (25–1000 in this study). Essentially, query answering over textual facts is framed as retrieval 🔎 + extractive QA 📑 + aggregation 🧹 (to support min/max/count queries). Given a natural language question, we first want to retrieve several relevant facts (supporting facts). Then, having a query and m supporting sets, we perform a join (select-project-join, the SPJ operator; okay, now it qualifies as a database 😀 ) over each (query, support) pair to find an answer or confirm its absence (extractive QA). Finally, join results are aggregated with simple post-processing.
🧱 Can we just concatenate all K facts and put them into one big transformer? Technically, yes, but the authors show it becomes rather inefficient once the DB grows beyond 25 facts. In contrast, a multi-staged approach allows for parallel processing and better scaling. It seems that, currently, the crux of NL DBs is in the retrieval mechanism – we don’t want to enumerate the powerset of all possible fact combinations but extract only the relevant ones. So far this is done via DPR-like dense retrieval (the Support Set Generator, SSG) trained on annotated supporting sets for each query.
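To make the retrieve 🔎 + extract 📑 + aggregate 🧹 flow more tangible, here is a minimal sketch of the idea. The models, the toy facts, and the question decomposition are purely illustrative – this is not the WikiNLDB setup nor the paper’s actual SPJ implementation.

```python
# Hedged sketch of a staged NL DB query: dense retrieval of supporting facts,
# extractive QA over each (query, support) pair, then simple aggregation.
import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import pipeline

facts = [                                   # a toy "database" of textual facts
    "Nicholas lives in Paris.",
    "Sheryl lives in Paris.",
    "Teuvo was born in 1912 in Ruskala.",
]
question = "How many people live in Paris?"

# 1) Retrieval (the SSG role): encode facts and question, keep top supports
enc = SentenceTransformer("all-MiniLM-L6-v2")
scores = enc.encode(facts) @ enc.encode([question])[0]
supports = [facts[i] for i in np.argsort(-scores)[:2]]

# 2) Extractive QA (a crude stand-in for the Neural SPJ operator)
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
answers = [qa(question="Who lives in Paris?", context=f)["answer"] for f in supports]

# 3) Aggregation: simple post-processing for a count-style query
print(len(set(answers)))
```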
Speaking of annotation and training, the authors support NLDBs with a new collection of datasets 💾 WikiNLDB: KG triples from Wikidata are verbalized into sentences (and you can generate your own DBs varying the number of facts).
🧪 Experimentally, T5 and Longformer (with bigger context windows) can only compete with the Neural SPJ operator on the smallest databases, and only when given golden retrieval results. Otherwise, on bigger DBs of 25+ facts, their performance quickly deteriorates, while SPJ + SSG remains far more stable. The paper is very accessible to the general NLP audience – definitely one of my favorites this year 👍 !

As retrieval becomes more important (even outside the context of neural databases), ACL’21 offers a rich collection of new methods building on the seminal Dense Passage Retrieval (DPR) and its family of related retrievers.

Chen et al tackle a common and important IR problem, entity disambiguation: many entities share the same name (surface form) but have different properties (🖼 👈 Abe Lincoln the politician vs. Abe Lincoln the musician).
To evaluate retrievers more systematically, the authors design a new dataset 💾 AmbER (ambiguous entity retrieval) collected from Wikipedia–Wikidata page alignment. Specifically, the dataset emphasizes the 🌟 "popularity gap" 🌟 : in most cases, retrievers fall back to the most prominent entities (for example, the most viewed pages with more content) in their index, and we want to quantify that shift. Wikidata entities and predicates are used as a reference KG collection to generate new complex disambiguation tasks (called AmbER sets).
AmbER consists of two parts: AmbER-H (disambiguating humans) and AmbER-N (non-humans like films, music bands, companies), and measures performance in 3 tasks: QA, slot filling, and fact checking.
🧪 In the experiments, the authors show that current SOTA retrievers do suffer from poor disambiguation – performance on tasks involving rare entities drops by 15–20 points 📉 . In other words, there is still a lot to be done to improve retrievers’ precision.

A common computational issue of modern retrievers is their index size: DPR with 21M items takes ~65GB of memory. Yamada et al propose an elegant solution: following the learning-to-hash idea, train a hash layer that approximates the sign function so that continuous vectors become binary vectors of +1/-1. Then, instead of a costly dot product (MIPS over the index), we can use highly efficient CPU implementations of the Hamming distance to compute a rough top-K candidate set (1000 in the paper), and afterwards compute an exact dot product between the question and those 1000 candidates only.
The BPR approach (Binary Passage Retriever) enjoys several wins: 1️⃣ the index size is reduced to ~2 GB (down from ~65GB!) without big performance drops (only Top-1 accuracy is expectedly affected); 2️⃣ BPR is among the top performers of the EfficientQA NeurIPS Challenge 👏 . Overall, this is a very good example of an impactful short paper!
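For intuition, here is a small numpy sketch of the two-stage idea; the binary codes stand in for the learned hash layer, and the index size is arbitrary.

```python
# Hedged sketch of BPR-style retrieval: Hamming-distance candidate generation
# over packed binary codes, then exact dot-product re-ranking of the top-K.
import numpy as np

rng = np.random.default_rng(0)
index = rng.standard_normal((100_000, 768)).astype(np.float32)  # dense passage index
codes = np.packbits(index > 0, axis=1)      # 96 bytes per passage instead of 3 KB

def hamming_topk(query, k=1000):
    q_code = np.packbits(query > 0)
    # XOR + bit count = Hamming distance to every passage code
    dists = np.unpackbits(codes ^ q_code, axis=1).sum(axis=1)
    return np.argpartition(dists, k)[:k]

query = rng.standard_normal(768).astype(np.float32)
candidates = hamming_topk(query)                          # cheap, CPU-friendly stage
best = candidates[np.argmax(index[candidates] @ query)]   # exact re-ranking on 1000
```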
📬 Finally, I’d highlight a few more retrieval-centric works from the conference: Sachan et al examine how pre-training on the Inverse Cloze Task and masked salient spans improves DPR performance on open-domain QA tasks. Maillard, Karpukhin et al design a 🌐 universal retriever 🌐 , a multi-task trained retriever suitable for many NLP tasks and evaluated on the KILT benchmark combining QA, entity linking, slot filling, and dialogue tasks. Ravfogel et al present a cool demo (over Covid-19 data) of a neural extractive search system that everybody can play around with.
KG-augmented Language Models: 🪴🚿
One of the major trends in BERTology is probing the factual knowledge of large LMs, e.g., feeding a query "Steve Jobs was born in [MASK]" to predict "California". We can then quantify such probes using benchmarks like LAMA. In other words, can we treat language models as knowledge bases? So far, we have evidence that LMs can correctly predict a few simple facts.
But really, can they? 🤔
Our findings strongly question the conclusions of previous literatures, and demonstrate that current MLMs can not serve as reliable knowledge bases when using prompt-based retrieval paradigm. – Cao et al

The work of Cao et al is pretty much a cold shower for the whole area – they find that most of the reported performance can be attributed to spurious correlations 🥴 rather than actual "knowledge". The authors study 3 types of probing (illustrated 👈 ): prompts, cases (aka few-shot learning), and contexts. In all scenarios, LMs exhibit numerous flaws, e.g., cases can only help to identify the answer type (person, city, etc.) but cannot point to a particular entity within this class. The paper is very easy to read and follow and has lots of illustrative examples 🖌 , so I’d recommend giving it a proper read even if you don’t actively work in this area.
Interestingly, a similar result in the open-domain QA is reported by Wang et al here at ACL’21, too. They analyze BART and GPT-2 performance and arrive at pretty much the same conclusions. Time to rethink how we pack explicit knowledge in LMs? 🤔

From the previous posts, we know there is already quite a number of Transformer language models enriched with facts from knowledge graphs. Let’s welcome two new family members! 👨 👩 👦 👦
Wang et al propose K-Adapters, a knowledge-infusion mechanism on top of pre-trained LMs. With K-Adapters, you don’t need to train a large Transformer stack from scratch. Instead, the authors suggest placing a few adapter layers between the layers of an already pre-trained, frozen model (they experiment with BERT and RoBERTa), for example, after layers 0, 12, and 23. The frozen LM features are concatenated with learnable adapter features and trained on a set of new tasks – here, 1️⃣ relation prediction based on T-REx, a dataset of aligned Wikipedia–Wikidata text–triple pairs, and 2️⃣ dependency-tree relation prediction. Experimentally, this approach improves performance on entity typing, commonsense QA, and relation classification tasks.
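In spirit, an adapters-on-a-frozen-LM setup looks roughly like the sketch below; the layer indices, sizes, and concatenation-based fusion are my simplifications rather than the exact K-Adapters architecture.

```python
# Rough sketch: keep the pre-trained encoder frozen and learn small adapter
# blocks fed by its intermediate hidden states; concatenate features at the end.
import torch
import torch.nn as nn
from transformers import AutoModel

class Adapter(nn.Module):
    def __init__(self, hidden=768, bottleneck=128):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)
        self.up = nn.Linear(bottleneck, hidden)

    def forward(self, h):
        return self.up(torch.relu(self.down(h))) + h       # residual adapter features

class AdapterEnhancedLM(nn.Module):
    def __init__(self, name="bert-base-uncased", tap_layers=(0, 6, 11)):
        super().__init__()
        self.lm = AutoModel.from_pretrained(name, output_hidden_states=True)
        for p in self.lm.parameters():                      # the backbone stays frozen
            p.requires_grad = False
        self.tap_layers = tap_layers
        self.adapters = nn.ModuleList(Adapter() for _ in tap_layers)

    def forward(self, **inputs):
        hidden = self.lm(**inputs).hidden_states            # tuple of per-layer states
        adapted = [a(hidden[i]) for a, i in zip(self.adapters, self.tap_layers)]
        # frozen LM output concatenated with learned adapter features
        return torch.cat([hidden[-1]] + adapted, dim=-1)
```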


Qin et al design ERICA, a contrastive mechanism to enrich LMs with entity and relational information. Specifically, they add two more losses to the standard MLM objective: entity discrimination (🖼 👈 ) and relation discrimination. Taking entity discrimination as an example: pre-training documents carry pairwise annotations 🍏 🍏 of entity spans, and the model is asked to yield higher cosine similarities for true pairs 🍏 🍏 than for negative ones 🍏 🍅 through a contrastive loss term. ERICA performs particularly well in low-resource fine-tuning scenarios (1–10% of training data) on relation prediction and multi-hop QA tasks.
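Underneath, entity discrimination is a fairly standard contrastive objective over span representations. A toy version (my notation: a single anchor span, cosine similarities, an InfoNCE-style loss) could look like this:

```python
# Toy sketch: pull two mentions of the same entity 🍏 🍏 together and push
# mentions of other entities 🍅 away via a temperature-scaled contrastive loss.
import torch
import torch.nn.functional as F

def entity_discrimination_loss(anchor, positive, negatives, tau=0.07):
    """anchor, positive: (d,) span embeddings; negatives: (n, d)."""
    pos = F.cosine_similarity(anchor, positive, dim=0) / tau
    neg = F.cosine_similarity(anchor.unsqueeze(0), negatives, dim=1) / tau
    logits = torch.cat([pos.unsqueeze(0), neg]).unsqueeze(0)     # gold is index 0
    return F.cross_entropy(logits, torch.zeros(1, dtype=torch.long))

loss = entity_discrimination_loss(torch.rand(768), torch.rand(768), torch.rand(32, 768))
```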
KG Embeddings and Link Prediction
Can the strengths of multi-relational KG embedding models also be weaknesses that leave them prone to adversarial attacks? The zoo of algorithms is often compared by the ability to capture certain relational patterns like symmetry, inversion, composition, and more. The short answer is yes :/

An insightful work by Bhardwaj et al studies various types and directions of poisoning 🔫 embedding models by adding adversarial triples (check the example illustration 👈 ). It does assume full access to pre-trained weights and the ability to perform forward calls (a white-box setup). After suggesting several ways of searching for adversarial relations and potential decoy entities, the experiments show that the most effective attack leverages symmetry 🦋 patterns (at least on the standard FB15k-237 and WN18RR graphs). Interestingly, ConvE, a convolutional model without geometric or translational priors, looks most resilient 🛡 to the designed attacks, i.e., vanilla TransE or DistMult get poisoned more severely.
⚖️ I’d also highlight a long-anticipated study by Kamigaito and Hayashi on the theoretical similarities of two popular families of loss functions for training KG embedding models: softmax cross-entropy and negative sampling, in particular self-adversarial negative sampling. In numerous studies (e.g., shameless plug, or Ruffinelli et al from ICLR’20) we’ve seen that models trained with one or the other loss exhibit similar performance. And finally, in this work, the authors study their theoretical properties through the lens of Bregman divergence. Two important messages to take home after reading this article: 1️⃣ self-adversarial negative sampling is very similar to cross-entropy with label smoothing; 2️⃣ cross-entropy models might fit better than negative sampling ones. ProTip: you can now cite this paper if you forgot to run experiments with more loss functions 😉
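To see why the two objectives end up so close, here is a hedged side-by-side sketch for a single positive triple and its sampled negatives; the scores can come from any KG scoring function, and the hyperparameters are illustrative.

```python
# Self-adversarial negative sampling (RotatE-style) vs. label-smoothed softmax
# cross-entropy over {positive, negatives} – a minimal, simplified comparison.
import torch
import torch.nn.functional as F

def self_adversarial_ns(pos_score, neg_scores, gamma=9.0, alpha=1.0):
    # harder negatives get larger weights; no gradient flows through the weights
    weights = torch.softmax(alpha * neg_scores, dim=0).detach()
    return (-F.logsigmoid(gamma + pos_score)
            - (weights * F.logsigmoid(-neg_scores - gamma)).sum())

def label_smoothed_ce(pos_score, neg_scores, eps=0.1):
    logits = torch.cat([pos_score.view(1), neg_scores])
    log_probs = F.log_softmax(logits, dim=0)
    targets = torch.full_like(logits, eps / logits.numel())
    targets[0] += 1.0 - eps                  # most of the mass stays on the positive
    return -(targets * log_probs).sum()

pos, negs = torch.tensor(2.0), torch.randn(64)
print(self_adversarial_ns(pos, negs), label_smoothed_ce(pos, negs))
```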

🚨 New LP Dataset Alert 🚨 Freebase and WordNet graphs have served as benchmarks for too long, and we, as a community, should finally adopt new datasets with fewer biases and larger scale as our 2021–2022 testing suites. Cao et al explore the test sets of FB15k-237 and WN18RR and find (as in the picture 👈 ) that, often, test triples are either unpredictable even for humans or do not make much practical sense. Motivated by that, they created a new pair of datasets 💾 InferWiki16K & InferWiki64K (based on Wikidata 😍 ) where test cases do have grounding in the train set. They also created a set of unknown triples for the triple classification task (in addition to true/false ones). 🧪 The main hypothesis is confirmed in the experiments – embedding models indeed perform much better on non-random splits where test triples are grounded in the train set.
Let’s welcome 👋 a few new approaches for link prediction. 1️⃣ BERT-ResNet by Lovelace et al encodes entity names and descriptions through BERT and passes triples through a ResNet-style deep CNN with subsequent re-ranking and distillation (quite a bit of everything put in there!). The model yields large improvements 📈 on commonsense-style graphs like SNOMED CT Core and ConceptNet, where lots of knowledge is encoded in textual descriptions. 2️⃣ Next up, Chao et al propose PairRE, an extension of RotatE where relation embeddings are split into head-specific and tail-specific parts. PairRE shows quite competitive results on the OGB datasets. By the way, the model is already available in the PyKEEN library for training and evaluating KG embedding models (see the quick sketch below) 😉 3️⃣ Li et al design CluSTeR, a model for temporal KG link prediction. CluSTeR employs RL at the first, clue-search stage and runs R-GCN on top of the retrieved clues at the second stage. 4️⃣ Finally, I am excited to see more research on hyper-relational KGs! 🎇 (find my review article here). Wang et al build their GRAN model on top of the Transformer with a modified attention mechanism that includes qualifier interactions. I’d be interested to see its performance on our new WD50K hyper-relational benchmarks!
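Since PairRE ships with PyKEEN, trying it out takes a few lines; the dataset and hyperparameters below are illustrative rather than the paper’s OGB setup.

```python
# Hedged sketch: training PairRE with the PyKEEN pipeline.
# PairRE scores a triple roughly as -|| h ∘ r_head - t ∘ r_tail ||.
from pykeen.pipeline import pipeline

result = pipeline(
    dataset="fb15k237",                      # any built-in dataset or your own triples
    model="PairRE",
    model_kwargs=dict(embedding_dim=200),
    training_kwargs=dict(num_epochs=100),
)
result.save_to_directory("pairre_fb15k237")
```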
Entity Alignment: 2 New Datasets 💾
In the task of entity alignment (EA), you have two graphs (possibly sharing the same set of relations) with two disjoint sets of entities, like entities from the English and Chinese DBpedia, and you have to identify which entities from one graph can be mapped onto the other.
For years ⏳ ⌛️, entity alignment datasets assumed a perfect 1–1 mapping between the two graphs, but that is quite an artificial assumption for real-world tasks. Finally, Sun et al study this setup more formally through the notion of dangling entities (those that don’t have a counterpart in the other graph).

The authors build a new dataset 💾 , DBP 2.0, where only 30–50% of entities are "mappable" and the rest are dangling. This means that your alignment model has to learn to decide whether a node can be mapped at all – the authors explore 3 possible approaches for doing that.
As most EA benchmarks are already saturated around very high values, it’s intriguing to see that adding "noisy" entities drastically drops 📉 the overall performance. One more step towards more practical setups!

Often, some edges of a graph are contained implicitly in text – then we talk about KG–text alignment. In particular, we are interested in whether there is a way to enrich graph embeddings with text embeddings and vice versa. Pahuja et al provide a large-scale study of this problem by designing a novel dataset 💾 derived from the whole English Wikipedia and Wikidata: 15M entities and 261M facts 🏋 . The authors analyze 4 alignment methods (e.g., projecting KG embeddings into the text embedding space, as sketched below) and train KG / text embeddings jointly.
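As a flavor of the projection-style alignment, here is a minimal sketch; the table sizes, the overlap of aligned entities, and the MSE objective are my assumptions for illustration only.

```python
# Toy sketch: learn a linear map from a frozen KG embedding space into a
# frozen text embedding space using entities present in both modalities.
import torch
import torch.nn as nn

kg_emb = nn.Embedding(15_000, 200).requires_grad_(False)    # stand-in for a trained KG table
text_emb = nn.Embedding(15_000, 300).requires_grad_(False)  # stand-in for a trained text table
proj = nn.Linear(200, 300)

optimizer = torch.optim.Adam(proj.parameters(), lr=1e-3)
aligned_ids = torch.arange(10_000)                           # entities seen in both spaces
for _ in range(100):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(proj(kg_emb(aligned_ids)), text_emb(aligned_ids))
    loss.backward()
    optimizer.step()
```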
🧪 Task-wise, the authors measure performance on few-shot link prediction (over the KG triples) and analogical reasoning (over the textual part). Indeed, all 4 alignment methods improve the quality on both tasks compared to a single modality alone, e.g., in analogical reasoning the best method of fusing KG information brings a 16% absolute improvement in Hits@1 over the Wikipedia2Vec baseline 💪 . On the link prediction task, fusion yields up to a 10% absolute Hits@1 improvement.
It’s worth noting that the approach assumes joint training of two separate models. It would be definitely interesting to probe KG-augmented LMs (one model pre-trained on KGs) on this new task bypassing the alignment issue.
KG Construction, Entity Linking, Relation Extraction
🧩 Automatic KG construction from text is a highly non-trivial and sought-after task suitable for many industrial applications.
Mondal et al propose a workflow for constructing a KG of NLP papers from the ACL Anthology (which contains, for example, all the papers reviewed in this article). The resulting graph is called SciNLP-KG. It’s not exactly end-to-end as stated in the title (the authors justify this by error propagation in Section 5) and consists of 3 stages (🖼 👇 ) built around relation extraction. SciNLP-KG builds upon a line of previous research (NAACL’21) on extracting mentions of Tasks, Datasets, and Metrics (TDM). The KG schema has 4 distinct predicates – evaluatedOn, evaluatedBy, coreferent, and related – to capture links among TDM entities. The authors build two versions of SciNLP-KG: a small MVP and a fully-fledged one with 5K nodes and 15K edges. A solid plus of the approach is that the automatically built large graph has a substantial overlap (about 50% of entities) with Papers With Code!
Yes, it’s just 4 relations on a restricted domain, but it’s a good start – surely, more scalable and end-to-end approaches will follow.


In the era of neural entity linkers like BLINK and ELQ, the work by Jiang, Gurajada et al takes an unorthodox view of the problem: let’s combine textual heuristics with neural features in a weighted rule-based framework (Logical Neural Networks). In fact, LNN-EL is a component of the bigger neuro-symbolic NSQA system, but more on that in the KGQA section below.
📝 The approach, LNN-EL, requires entity mentions to be given in advance, along with top-K candidates from a target KG. Textual features can be, for instance, the Jaro-Winkler distance between a mention and a candidate, or node centrality in the underlying KG. BLINK and other neural methods can be plugged in as features, too. Then, an expert creates a set of rules with weights, e.g., assign w1 to Jaro-Winkler and w2 to BLINK, and the weights are learned with a margin loss and negative sampling (a toy sketch follows below).
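In spirit (and heavily simplified – real LNN rules are logical formulas over features, not a flat linear scorer), the learnable part resembles the toy sketch below; the feature set and margin value are illustrative.

```python
# Toy sketch: combine per-candidate features (string similarity, popularity,
# a BLINK score, ...) with learnable non-negative weights, trained with a
# margin loss between the gold candidate and the negatives.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedRules(nn.Module):
    def __init__(self, n_features=3):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_features))    # one weight per rule/feature

    def forward(self, feats):                            # feats: (num_candidates, n_features)
        return feats @ F.relu(self.w)                    # keep rule weights non-negative

linker = WeightedRules()
opt = torch.optim.Adam(linker.parameters(), lr=0.01)

feats = torch.rand(10, 3)                                # feature scores for one mention
scores = linker(feats)                                   # index 0 is the gold entity
loss = F.relu(0.5 - scores[0] + scores[1:]).mean()       # margin loss vs. negatives
loss.backward(); opt.step()
```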
🧪 LNN-EL performs on par with BLINK and returns an explainable tree of weighted rules. Moreover, it can generalize onto other datasets that use the same underlying KG 👏
➖There are some drawbacks, too: it seems that BLINK is actually the crucial factor in the overall performance, responsible for 70–80% of the total weight in the rules. So the natural question is – is it practically worth coming up with sophisticated expert-heavy rules? 🤔 Second, the authors use DBpedia lookup for retrieving top-K candidates and "assume that similar services exist or can be implemented on top of other KGs". Unfortunately, this is often not the case – such candidate retrieval systems exist only for DBpedia and (partially) Wikidata, while for the rest of the large KGs it is highly non-trivial to create such a mechanism. Nevertheless, LNN-EL lays a strong foundation for neuro-symbolic entity linking for KGQA.

Entity linking often goes hand in hand 🤝 with entity typing. Onoe et al tackle the problem of fine-grained entity typing (when you have hundreds or thousands of types) with box embeddings (Box4Types). Usually, fine-grained types are modeled as vectors, with a dot product between the encoded mention+context vector and a matrix of all type vectors. Instead, the authors propose to move from vectors to 📦 boxes (d-dimensional hyper-rectangles). Moreover, not "just boxes" but Gumbel (soft) boxes (NeurIPS’20) that allow backprop in the corner cases when "just boxes" do not intersect. The 🖼 👇 gives a nice intuition: essentially, we model all interactions as geometric operations on 📦, and normalizing their volume to 1 gives the additional bonus of a probabilistic interpretation.
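The geometry behind that probabilistic reading is simple. With hard boxes (the paper’s Gumbel boxes soften exactly this computation), the conditional probability of a type given a mention is an intersection-over-mention-volume ratio:

```python
# Hedged sketch: P(type | mention) = Vol(type ∩ mention) / Vol(mention)
# for axis-aligned boxes given by their lower/upper corners.
import torch

def box_volume(lo, hi):
    return torch.clamp(hi - lo, min=0).prod(-1)

def p_type_given_mention(type_lo, type_hi, men_lo, men_hi):
    inter_lo = torch.maximum(type_lo, men_lo)
    inter_hi = torch.minimum(type_hi, men_hi)
    return box_volume(inter_lo, inter_hi) / box_volume(men_lo, men_hi)

# two 2-D boxes that partially overlap -> probability 0.25
print(p_type_given_mention(torch.tensor([0., 0.]), torch.tensor([2., 2.]),
                           torch.tensor([1., 1.]), torch.tensor([3., 3.])))
```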
🧪 Experimentally, boxes work at least as well as heavier vector-based models, and in some cases outperform them by a good margin of 5–7 F1 points 👏 . Besides, there are numerous qualitative experiments with insightful figures. Overall, I enjoyed reading this paper a lot – I highly recommend it as an example of a strong paper.

Let’s add a few words on relation extraction papers that slightly improve SOTA on several benchmarks. Hu et al investigate how pre-trained KG entity embeddings can help in bag-level relation extraction (in fact, just a bit), and create a new dataset BagRel-Wiki73K 💾 based on entities and relations from Wikidata! Tian et al present a stereoscopic 🧊 perspective, StereoRel, on the RE task, i.e., entities, relations, and words in a paragraph are modeled as a 3D cube; the BERT encoding of a passage is sent to several decoders to reconstruct the correct relational triples. Finally, Nadgeri et al present KGPool, where known entities from a sentence induce a local neighborhood processed with subsequent GCN layers and pooling.
Question Answering over KGs: Temporal, Conversational, AMR
Contemporary KGQA focuses predominantly on classical static graphs, i.e., when you have a fixed set of entities and edges, and questions do not have any temporal dimension.
⏳But the time has come! Saxena et al introduce a large-scale task of QA over temporal KGs – those that have a timestamp on an edge indicating its validity, like (Barack Obama, position held, POTUS, 2008, 2016). It opens a whole new variety of simple and complex questions around the time dimension: "Who was POTUS before/after Obama?", "Who portrayed Iron Man when Obama was POTUS?", and so on. The authors created a new dataset 💾 CronQuestions (based on Wikidata 😍 ) with 410K questions over a KG with 123K entities, ~200 relations, and 300K triples enriched with timestamps.
🧐 Not surprisingly, BERT and T5 cannot handle such questions with any decent accuracy, so the authors combine EmbedKGQA (an approach from ACL’20 that we highlighted in the previous ACL review) with pre-trained temporal KG embeddings from TNT-ComplEx (covered in the ICLR’20 review – see, with this series you can stay up-to-date with most of the recent goodies 😉 ) in a new model, CronKGQA. Essentially, we take the BERT embedding of a question as a relation embedding and pass it into static & temporal scoring functions, as depicted below. 🧪 Experimentally, CronKGQA reaches around 99% Hits@1 on simple questions but still has room for improvement on more complex ones. For the rest of the KGQA community: look, there is a new, non-saturated benchmark 👀 !
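A rough sketch of that recipe is below: the projection is untrained, the pre-trained TComplEx embeddings are replaced with random stand-ins, and the scoring function is my own simplification.

```python
# Hedged sketch of the CronKGQA idea: the BERT [CLS] encoding of the question
# plays the role of the relation inside a TComplEx-style temporal scoring function.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def tcomplex_score(s, r, o, t):
    # embeddings are complex-valued, stored as (real | imaginary) halves
    def split(x): return torch.chunk(x, 2, dim=-1)
    (sr, si), (rr, ri), (outr, outi), (tr, ti) = map(split, (s, r, o, t))
    rtr, rti = rr * tr - ri * ti, rr * ti + ri * tr         # relation ⊙ timestamp
    return ((sr * rtr - si * rti) * outr + (sr * rti + si * rtr) * outi).sum(-1)

question = "Who was the president of the USA in 2010?"
cls = bert(**tok(question, return_tensors="pt")).last_hidden_state[:, 0]
q_rel = torch.nn.Linear(768, 400)(cls)       # untrained projection to the TKG dimension

# random stand-ins for pre-trained temporal KG embeddings
subject, entities, timestamp = torch.rand(1, 400), torch.rand(123_000, 400), torch.rand(1, 400)
answer = tcomplex_score(subject, q_rel, entities, timestamp).argmax()
```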

🗣 Conversational KGQA deals with sequential question-answer steps where context and dialogue history are of high importance when generating queries to an underlying KG and forming predictions. In conversational KGQA, follow-up questions are often the hardest to deal with. Traditionally, the dialogue history is encoded as one vector and there is no special treatment of recently mentioned entities. Furthermore, explicit entity names are often omitted in follow-up questions (since humans are generally good at coreference resolution), so the natural question is: how can we keep track of the most relevant entity in the current conversation?
🎯 Lan and Jiang propose the concept of focal entities, i.e., the entity currently being discussed in a conversation, about which follow-up questions are most likely to be asked. The approach assumes we have access to a SPARQL endpoint to query the KG on the fly (obviously, it’s not end-to-end neural, but in return we can operate on much bigger graphs, on the scale of the whole Wikidata).
The main idea is that we can dynamically change the focus of an ongoing conversation by computing a distribution over entities in an Entity Transition Graph (ETG). 1️⃣ First, we build such an ETG by expanding the graph around the starting node of a conversation (by 1–2 hops). 2️⃣ Then, the ETG is passed through a GCN encoder to get updated entity states. 3️⃣ The updated entity states are aggregated with the dialogue history in the Focal Entity Predictor (see the illustration below and the toy sketch after it), which builds a distribution over candidate focal entities. 4️⃣ Finally, the updated distribution is sent to an off-the-shelf Answer Predictor that returns an answer to the current utterance.
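Stripped of all details, step 3️⃣ boils down to scoring every ETG node against the encoded dialogue history and taking a softmax; the sizes below are arbitrary, and the real Focal Entity Predictor aggregates history and entity states more carefully.

```python
# Toy sketch of the focal-entity distribution over Entity Transition Graph nodes.
import torch

entity_states = torch.rand(50, 256)       # ETG node states after the GCN encoder
history = torch.rand(256)                 # encoded dialogue history
focal_dist = torch.softmax(entity_states @ history, dim=0)
focal_entity = focal_dist.argmax()        # fed to the off-the-shelf Answer Predictor
```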
🧪 The idea of changing focal entities yields significant gains (10 points on average over strong baselines) on ConvQuestions and the conversational version of CSQA 💾 ! The biggest error source stems from incorrect relation prediction, so there is certainly room for improvement.

Finally, let’s talk vanilla KGQA – one question, one answer, given a graph. Kapanipathi and 29 more folks from IBM Research present a huge neuro-symbolic KGQA system, NSQA, built around AMR parsing. NSQA is a pipelined system with specifically tailored components 🧩. That is, an input question is first parsed into an AMR tree (pre-trained component 1️⃣), then entities in the tree are linked to a background KG (2️⃣ – that’s the LNN-EL described above!). A query graph is constructed via a rule-based BFS traversal of the AMR tree, and relation linking is handled by a separate component, SemRel (3️⃣, presented in another ACL’21 paper by Naseem et al).
NSQA heavily relies on AMR frames and their interconnections to better parse a tree into a SPARQL query, e.g., an "amr-unknown" node becomes a variable in the query. For sure, a lot of work was put into the meticulously created rules for processing AMR output 👏 On the other hand, all the other components employ Transformers in one way or another. Looks pretty neuro-symbolic indeed! 🧪 Experimentally, AMR parsing is ~84% accurate (compared to trees created by human experts) on the LC-QuAD 1.0 benchmark, while the overall NSQA improves the F1 measure by ~11 points. Some openly available source code would be quite handy, dear IBM 😉

tl;dr
You made it to the final section! Whether you jumped here from the table of contents or arrived after reading the relevant sections – either way, thank you for your time and interest in this area 😍 . Let me know in the comments what you think about this whole endeavor and the format in general!
From neural databases to question answering, KGs are being applied in more tasks than ever. Overall, I think it’s a great time to do KG research: you can always find a niche and tackle both theoretical and practical challenges whose solutions might be used by (tens of) thousands of folks in the community.
Looking forward to what we’ll see at the next conference!
