Thoughts and Theory
Our dataset enables researchers to build and evaluate CLIR systems in the medical domain, covering English and seven other European languages.

Introduction
In this story, I present our contribution to extending existing Cross-Lingual Information Retrieval (CLIR) datasets that were released during the Information Retrieval (IR) tasks of the CLEF eHealth Evaluation Lab. The resulting dataset is intended for building and evaluating CLIR systems in the medical domain. The supported languages are English, Czech, French, German, Hungarian, Polish, Spanish, and Swedish.
What is Cross-Lingual Information Retrieval (CLIR)?
CLIR enables users to search for information by posing queries in a language that is different from the language of the document collection. This helps break the language barrier between system users and the vast amount of data available in other languages. The task has attracted the attention of the IR research community since the late 1990s, when the rapid growth of the internet, and with it the amount of digital content produced across the globe, made the need for CLIR systems evident.
CLIR and COVID-19
During the COVID-19 pandemic, CLIR became more important than ever before: individuals, policy-makers and medical doctors wanted to learn more about COVID-19 and to read reports, treatment protocols and accounts of fighting the disease from all over the world. Much of this information is, of course, available only in languages they do not speak.
One of the latest efforts to improve search and information access around COVID-19 is COVID-19 MLIA Eval: "Covid-19 MLIA Eval organizes a community evaluation effort aimed at accelerating the creation of resources and tools for improved MultiLingual Information Access (MLIA) in the current emergency situation with a reference to a general public use case" (source: the MLIA website).
Approaches to CLIR
A CLIR system usually involves two steps. The first is the translation step: either the queries are translated into the language of the document collection, or the document collection is translated into the query language. Once the translation is done, the task reduces to a monolingual IR task.
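To make the query-translation variant concrete, here is a minimal sketch: the translation step is a placeholder function (any machine translation system could be plugged in), and the monolingual retrieval step uses the rank_bm25 package over a toy English collection. This is an illustration of the idea, not the exact pipeline from our work.

```python
# Query translation followed by monolingual BM25 retrieval (sketch).
from rank_bm25 import BM25Okapi

def translate(query: str) -> str:
    # Placeholder for a real MT system; returns a hard-coded example translation.
    return "shortness of breath after exertion"

# Toy English collection (in practice: ~1.1 million medical web pages).
documents = [
    "shortness of breath after physical exercise",
    "causes and treatment of chronic cough",
]
bm25 = BM25Okapi([doc.split() for doc in documents])

czech_query = "dušnost po námaze"            # query in the user's language
english_query = translate(czech_query)       # step 1: translation
top_docs = bm25.get_top_n(english_query.split(), documents, n=2)  # step 2: monolingual IR
print(top_docs[0])
```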
Different approaches and studies have investigated two main questions in CLIR:
- What is better to translate: the queries or the document collection? Or should both be translated into a common representation?
- How should the translation be done? Is translation in CLIR similar to standard machine translation, which aims at generating human-readable translations?
For more information about these two approaches, I refer the reader to a paper I published last year at the Association for Computational Linguistics (ACL) conference, in which I present a thorough comparison between them.
Our CLIR Test Dataset
The test dataset is based on three test sets that were released during the CLEF eHealth patient-centered IR tasks 2013–2015 [Goeuriot et al., 2015, 2014; Suominen et al., 2013]. We extend the test collection mainly by translating the queries into more languages and by extending the relevance assessments to more than twice their original size. The extended test collection is available online via the LINDAT/CLARIN repository.
The test dataset consists of three main parts:
1- Document Collection
The document collection in our extended dataset is taken from the CLEF eHealth IR task 2015. The documents were provided in HTML format; each document contains HTML markup, CSS and JavaScript code.
The collection includes around 1.1 million documents crawled from medical websites. More information about the document collection can be found in [1].
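Because the raw documents contain markup, CSS and JavaScript, some cleaning is typically needed before indexing. Here is a minimal sketch of such a step, assuming the BeautifulSoup (bs4) package; it is an illustration, not the exact preprocessing pipeline used in our experiments.

```python
# Strip scripts, styles and tags from one crawled HTML document,
# keeping only the visible text for indexing.
from bs4 import BeautifulSoup

def html_to_text(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):  # markup with no retrievable content
        tag.decompose()
    return " ".join(soup.get_text(separator=" ").split())

# Toy document standing in for one of the ~1.1 million crawled pages.
raw_html = """
<html><head><style>body { color: red; }</style></head>
<body><script>track();</script><p>Chronic cough: causes and treatment.</p></body></html>
"""
print(html_to_text(raw_html))  # -> Chronic cough: causes and treatment.
```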
2- Queries
The queries in this work are adopted from the test sets that were released during the CLEF eHealth CLIR tasks 2013–2015, as follows:
Queries from 2013 and 2014: In the CLEF eHealth IR task 2013 [Goeuriot et al., 2013] and the CLEF eHealth IR task 2014 [Goeuriot et al., 2014], queries were generated by medical experts based on patients' discharge summaries.
The motivation for choosing medical experts (nurses and clinical practitioners) for query generation is that these experts are in touch with patients on a daily basis and can therefore understand their information needs.
Queries were generated as follows: the medical experts were given discharge summaries and asked to randomly select a disorder, then write a short query describing it, on the assumption that patients would use a similar query when looking for more information about that disorder. Involving medical experts and discharge summaries shaped the nature of the queries: they contain medical terms and tend to be short.
Queries from 2015: In the CLEF eHealth Evaluation Lab 2015, the IR task focused on retrieving information about medical symptoms [Palotti et al., 2015]. The goal of the task was to design IR systems that help laypeople (users without medical expertise) find information related to their health conditions and understand what caused their symptoms (self-diagnosis). The creation of the queries therefore attempted to simulate this real-world scenario as closely as possible.
Participants in the query creation step were university students without medical experience, chosen to approximate the average search engine user.
They were shown images and videos containing symptoms of medical issues and were asked to generate, for each case, the queries they thought would represent their information need and eventually lead them to relevant documents.
New data split: As shown in the previous two paragraphs, the main difference between the queries of the 2013, 2014 and 2015 CLEF eHealth IR labs is their source, and the tendency towards medical terminology in 2013–2014 as opposed to 2015.
We want to design a CLIR system that is robust to such diversity in user queries, rather than a system biased towards one type of query (short queries with medical terms, or longer queries without medical terminology).
To this end, we took the test queries from each IR task in 2013 (50 queries), 2014 (50 queries) and 2015 (66 queries), mixed them to obtain a more representative and balanced query set, and then split them into two sets: 100 queries for training (33 from the 2013 test set, 32 from 2014 and 35 from 2015) and 66 queries for testing (17 from 2013, 18 from 2014 and 31 from 2015).
The two sets are stratified with respect to the year of origin, the number of relevant/non-relevant documents in the relevance assessments, and the query length (number of tokens).
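The sketch below illustrates such a year-stratified split with scikit-learn. It stratifies by year only (not by relevance counts or query length) and uses illustrative query IDs, so it will not reproduce the official split exactly.

```python
# A rough sketch of a year-stratified query split; IDs are illustrative.
from sklearn.model_selection import train_test_split

queries = (
    [(f"2013-{i:02d}", 2013) for i in range(1, 51)]    # 50 queries from 2013
    + [(f"2014-{i:02d}", 2014) for i in range(1, 51)]  # 50 queries from 2014
    + [(f"2015-{i:02d}", 2015) for i in range(1, 67)]  # 66 queries from 2015
)
years = [year for _, year in queries]

train, test = train_test_split(
    queries, test_size=66, stratify=years, random_state=0
)
print(len(train), len(test))  # 100 training queries, 66 test queries
```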
Queries in all years are represented in the TREC format (TREC is an abbreviation for NIST's Text REtrieval Conference), with the following fields:
• Title: this field contains the title of the query, usually referred to simply as the query. It represents the user's information need and is the field that is eventually fed to the IR system to conduct retrieval.
• Description: this field describes the title in a longer sentence.
• Narrative: this field describes to the annotator what relevant documents should contain. It is useful for the relevance assessment process, not for the retrieval phase.
• Profile: information about the patient who is supposedly performing the self-diagnosis, such as their gender, age and other medical information.
• Discharge_summary: this field contains a handler (ID) of a text file containing the discharge summary of the corresponding patient.
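As an illustration, the snippet below reads one TREC-style topic with Python's xml.etree.ElementTree. The tag names and the example content are hypothetical; they mirror the fields listed above, not the exact CLEF eHealth markup.

```python
# A hypothetical TREC-style topic with the fields described above; only the
# title is fed to the retrieval system, the other fields support assessment.
import xml.etree.ElementTree as ET

topic_xml = """
<top>
  <id>qtest.1</id>
  <title>shortness of breath after exercise</title>
  <desc>What can cause shortness of breath after light exercise?</desc>
  <narr>Relevant documents discuss possible causes of exertional dyspnea.</narr>
  <profile>Male, 45 years old, non-smoker.</profile>
  <discharge_summary>00123-DISCHARGE_SUMMARY.txt</discharge_summary>
</top>
"""

topic = ET.fromstring(topic_xml)
query_for_retrieval = topic.findtext("title")
print(query_for_retrieval)  # -> shortness of breath after exercise
```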

3- Relevance Assessment
Relevance assessment is the process in which judges (humans with expertise in the domain) determine whether each document is relevant to a specific query.
We built a pool of the top-ranked documents retrieved by multiple systems; assessors then looked at each document-query pair and determined its relevance degree. The relevance degree can be:
- Not relevant: the document is not related at all to the information need.
- Somewhat relevant: the document partially answers the information need; some information is missing, and the searcher has to read more documents to get their question fully answered.
- Highly relevant: the document completely satisfies the information need, and there is no need to read any other documents.
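The pooling step and the resulting graded judgments can be sketched roughly as follows. The run format, pool depth and qrels-style output below are assumptions for illustration, not the exact tooling of the official labs.

```python
# A minimal pooling sketch: take the union of the top-k documents from several
# runs per query, then record assessor judgments with graded labels
# (0 = not relevant, 1 = somewhat relevant, 2 = highly relevant).
from collections import defaultdict

POOL_DEPTH = 10  # illustrative pool depth, not the official setting

def build_pool(runs):
    """runs: list of {query_id: [doc_ids ranked by one system]} mappings."""
    pool = defaultdict(set)
    for run in runs:
        for qid, ranked_docs in run.items():
            pool[qid].update(ranked_docs[:POOL_DEPTH])
    return pool

run_a = {"qtest.1": ["doc3", "doc7", "doc1"]}
run_b = {"qtest.1": ["doc7", "doc9"]}
pool = build_pool([run_a, run_b])  # {'qtest.1': {'doc1', 'doc3', 'doc7', 'doc9'}}

# After assessment, each pooled (query, document) pair gets a graded label,
# written here in a TREC-qrels-like format.
judgments = {("qtest.1", "doc7"): 2, ("qtest.1", "doc3"): 1,
             ("qtest.1", "doc1"): 0, ("qtest.1", "doc9"): 0}
with open("qrels.txt", "w") as out:
    for (qid, doc_id), grade in sorted(judgments.items()):
        out.write(f"{qid} 0 {doc_id} {grade}\n")
```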
We used the Relevation toolkit, an open-source tool for conducting relevance assessments for IR evaluation [Koopman and Zuccon, 2014].

The following table shows statistics of the official assessments (done in 2013, 2014 and 2015) compared with our extension, in terms of the number of assessed documents. The extended dataset contains a total of 38,109 document-query pairs, 14,368 of which were assessed by us.

Conclusion
In this story I presented our effort to extend existing datasets to support CLIR between English and seven other European languages in the medical domain. The dataset is publicly available via the LINDAT/CLARIN repository.
A full description of this dataset was published as a short paper at the European Conference on Information Retrieval (ECIR) 2019 [2].
If you have any questions regarding this work, do not hesitate to ask them in the comments.
References
- [1] Palotti et al.: CLEF eHealth Evaluation Lab 2015, Task 2: Retrieving Information About Medical Symptoms. CEUR-WS.org.
- [2] Shadi Saleh and Pavel Pecina: An Extended CLEF eHealth Test Collection for Cross-Lingual Information Retrieval in the Medical Domain. European Conference on Information Retrieval (ECIR) 2019, Springer.