The New Benchmark for Question Answering over Knowledge Graphs — QALD-9-Plus

And the problem of multilinguality that the benchmark aims to solve

Aleksandr Perevalov
Towards Data Science

--

TL;DR

For heterogeneous user groups (e.g., by language or age), the ability to interact with web applications equally effectively is one of the most important factors in the concept of "accessibility". This also applies to Knowledge Graph Question Answering (KGQA) systems, which provide access to the data and knowledge of the Semantic Web through a natural language interface. While working on the topic of multilingual accessibility of KGQA systems, my colleagues and I identified some of the most pressing problems. One of them is the lack of multilingual benchmarks for KGQA.

In our paper, we improve one of the most popular benchmarks for KGQA, QALD-9, by translating the questions of the original dataset into 8 different languages (German, French, Russian, Ukrainian, Belarusian, Armenian, Bashkir, and Lithuanian). Most importantly, the translations were provided and validated by native speakers of the respective languages. Five of these languages (Armenian, Ukrainian, Lithuanian, Bashkir, and Belarusian) have, to our knowledge, never been considered by KGQA systems before, and two of them (Bashkir and Belarusian) are considered "endangered" by UNESCO. We have called the new extended dataset "QALD-9-plus". The dataset is available online.

Knowledge Graph Question Answering (KGQA)

KGQA systems convert a natural language question into a query over a particular knowledge graph, allowing the user to access "knowledge" without having to learn a query language (e.g., SPARQL). This is the main difference between KGQA systems and text-based QA systems (also referred to in the literature as MRC, ODQA, or IR-based QA), which work on unstructured data.
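To make this concrete, here is a minimal Python sketch of what the output of such a conversion can look like: the question "How old is Donald Trump?" mapped to a SPARQL query and executed against the public Wikidata endpoint via the SPARQLWrapper library. The entity and property IDs (Q22686 for Donald Trump, P569 for date of birth) are shown for illustration; a real KGQA system has to find them automatically.

```python
# Minimal sketch: the question "How old is Donald Trump?" expressed as a
# SPARQL query over Wikidata and executed via SPARQLWrapper.
# Q22686 = Donald Trump, P569 = date of birth (illustrative IDs; a KGQA
# system derives them from the question automatically).
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
sparql.setQuery("""
SELECT ?birthDate WHERE {
  wd:Q22686 wdt:P569 ?birthDate .
}
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["birthDate"]["value"])  # e.g. 1946-06-14T00:00:00Z
```

The hard part of KGQA is precisely the step this sketch skips: mapping the words of the question to the right entities, properties, and query structure.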

Question Answering systems based on knowledge graphs — an example of a question and a query. To the right are some of the most famous knowledge graphs (Image by Author).

Knowledge graphs are typically built on the Resource Description Framework (RDF). Data in RDF is represented as triples with a "subject-predicate-object" structure, e.g., John-Is_Friend_Of-Mary, which is why it is convenient to visualize the data as a graph. The well-known schema.org is also based on RDF and is used by many websites to mark up their content (in practice, to improve how it appears in search results). This structuring of the World Wide Web is the foundation of the previously mentioned Semantic Web, where all resources are structured and linked to each other. KGQA systems are thus our guides to the world of structured information and knowledge across the World Wide Web.
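As a small illustration, here is the John-Is_Friend_Of-Mary triple expressed with the Python rdflib library; the example.org namespace is purely hypothetical.

```python
# A minimal sketch of the RDF data model: one "subject-predicate-object"
# triple built with rdflib. The ex: namespace is illustrative only.
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.John, EX.isFriendOf, EX.Mary))  # subject, predicate, object

# Serializing the graph shows the same triple in Turtle syntax.
print(g.serialize(format="turtle"))
```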

An example of a knowledge graph. Source: https://www.w3.org/TR/rdf11-primer/

The Problem of Multilinguality in Question Answering Systems over Knowledge Graphs

The seemingly natural accessibility of information through Google does not hold at all for speakers of languages spoken not by hundreds of millions of people (e.g., German, Russian) but by a few million (e.g., Belarusian) or even fewer (e.g., Bashkir). Often, people who speak "small" languages are also able to speak one of the major ones. For example, people who speak Belarusian or Bashkir usually also speak Russian, which gives them access to the second-largest segment of the Web. But this does not work for all languages, and, as usual, everything is relative: Russian speakers can understand only 6.9% of World Wide Web content, while English speakers have access to 63.6%. This situation gave rise to the term "digital language divide": the languages people speak affect their Web usage experience and its effectiveness.

We ran a small experiment on how Google handles "big" and "small" languages, using English, German, Belarusian, and Bashkir as examples. We asked one simple question, "How old is Donald Trump?", in each of these languages. The results, as they say, speak for themselves. In the picture below, you can see how Google successfully answered the question asked in English and German, and how it failed in Belarusian and Bashkir. Is that not an indicator of the problem? It is worth noting that when the answer is successful, Google presents it in a structured form; this is where the Google Knowledge Graph comes into play, aided by schema.org markup.

Illustration of Google’s work with English, German, Belarusian, and Bashkir (Image by Author).

How do others deal with this problem?

There is a misconception that with the advent of unsupervised, weakly-supervised, and semi-supervised methods (e.g., word2vec or BERT), the multilinguality problem has been solved, since such methods do not require large amounts of labeled data. However, this is not the case. While it is possible to evaluate a language model without labeled data, it is not possible to evaluate more complex systems (e.g., KGQA) that way. Therefore, the lack of structured "gold standard" data (benchmarks) in multiple languages is still a pressing issue.
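To illustrate why gold-standard benchmarks are needed, here is a sketch of how KGQA systems are typically scored: the answer set a system returns for each question is compared against the benchmark's gold answers, e.g., with precision, recall, and F1 (the metric family used in the QALD challenges). The answer values below are purely illustrative.

```python
# Sketch of per-question KGQA evaluation: compare the predicted answer set
# with the gold answer set from the benchmark. Without gold answers in a
# given language, a system cannot be evaluated in that language at all.
def f1_score(gold: set, predicted: set) -> float:
    if not gold and not predicted:
        return 1.0
    if not gold or not predicted:
        return 0.0
    tp = len(gold & predicted)  # true positives: answers present in both sets
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

print(f1_score({"Berlin"}, {"Berlin", "Bonn"}))  # 0.666...
```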

Question Answering over Knowledge Graphs is still a rather specialized area of applied science, so not many papers have been published on the topic. At the time of writing this article, there were only 3 multilingual benchmarks for KGQA: QALD, RuBQ, and CWQ (see the illustration below).

Existing KGQA multilingual benchmarks (Image by Author).

None of the above datasets is perfect. QALD-9, for example, covers 10 languages, but the quality of its translations, to put it mildly, leaves much to be desired. RuBQ 2.0 and CWQ obtained their translations via automatic machine translation, which is, of course, a limitation.

What have we done? QALD-9-Plus dataset

In order to improve the multilingual accessibility of KGQA systems, we decided to completely update the QALD-9 dataset, keeping only the English questions and engaging crowdsourcing platforms (Amazon Mechanical Turk, Yandex Toloka) for the translation work. Volunteers from the Open Data Science community were also involved in the translation process.

The translation task consisted of 2 steps: (1) a native speaker translates a question from English into their native language, and (2) another native speaker validates the translations obtained in the previous step. Both steps were conducted independently of each other.

An example of the translation and validation process. Each question has been translated at least 2 times (Image by Author).

As a result, we obtained translations into 8 different languages: Russian, Ukrainian, Lithuanian, Belarusian, Bashkir, Armenian, German, and French. Five of these languages (Ukrainian, Lithuanian, Belarusian, Bashkir, Armenian) had never been represented in the KGQA field before, and two of them (Belarusian, Bashkir) are considered endangered by UNESCO.

In addition to the translations, we also improved the usability of our benchmark. The original QALD-9 allowed systems to be evaluated only against the DBpedia knowledge graph. In our work on QALD-9-Plus, we decided to port the benchmark to another knowledge graph, Wikidata. This turned out to be quite a difficult task: automatic SPARQL query converters between different knowledge graphs don't exist yet, so we had to do it manually. It is remarkable how different the queries for the same question can be on DBpedia and Wikidata because of the different data models (see the code below).
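As an illustration of this difference, here are two SPARQL queries for the same question, "What is the capital of Germany?", one per knowledge graph. These are reconstructions for this article, not queries taken verbatim from the benchmark.

```python
# The same question against two knowledge graphs. Note how the resource
# identifiers and the overall data model differ between the two graphs.

# DBpedia: human-readable resource and property names.
dbpedia_query = """
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?capital WHERE { dbr:Germany dbo:capital ?capital . }
"""

# Wikidata: opaque identifiers (Q183 = Germany, P36 = capital).
wikidata_query = """
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?capital WHERE { wd:Q183 wdt:P36 ?capital . }
"""
```

For this simple question, the mapping is a one-to-one substitution of identifiers, but for questions involving aggregation or qualifiers, the query structure itself often has to change, which is why the porting could not be automated.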

The final characteristics of the QALD-9-Plus benchmark, as well as an example of its structure, are presented in the table and the code fragment below.

QALD-9-Plus benchmark and its characteristics (Image by Author).
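To give a feel for the data itself, here is a sketch of how a single entry is structured. The field names follow the QALD JSON format; the values are illustrative rather than copied from the dataset.

```python
# Sketch of one QALD-9-Plus entry: each question carries parallel
# formulations in several languages plus an executable SPARQL query.
# Values are illustrative, not verbatim from the dataset.
example_entry = {
    "id": "1",
    "question": [
        {"language": "en", "string": "How old is Donald Trump?"},
        {"language": "de", "string": "Wie alt ist Donald Trump?"},
        {"language": "ru", "string": "Сколько лет Дональду Трампу?"},
        # ... translations into the remaining languages
    ],
    "query": {
        "sparql": "SELECT ?birthDate WHERE { wd:Q22686 wdt:P569 ?birthDate . }"
    },
}
```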

In lieu of a conclusion

I will be very glad if you have read all the way to this point! Here, at the end of the article, I would like to share some useful links related to this work, namely:

Acknowledgments

I would like to thank the co-authors of this contribution, namely Dr. Dennis Diefenbach, Prof. Dr. Ricardo Usbeck, and Prof. Dr. Andreas Both. Additionally, I would like to thank all the contributors involved in the translation of the dataset, specifically: Konstantin Smirnov, Mikhail Orzhenovskii, Andrey Ogurtsov, Narek Maloyan, Artem Erokhin, Mykhailo Nedodai, Aliaksei Yeuzrezau, Anton Zabolotsky, Artur Peshkov, Vitaliy Lyalin, Artem Lialikov, Gleb Skiba, Vladyslava Dordii, Polina Fominykh, Tim Schrader, Susanne Both, and Anna Schrader.

--

PhD Student at Anhalt University of Applied Sciences; Research Assistant at Leipzig University of Applied Sciences (both Germany)