NLP Tutorial

Creating a Dutch question-answering machine learning model

Creating a new dataset by using NLP translation

Erwin van Crasbeek
Towards Data Science
20 min read · Jan 29, 2023


Pipeline for the creation of a Dutch question-answering model

Natural language processing models are currently a hot topic. The release of ‘Attention Is All You Need’ by Google [1] has spurred the development of many Transformer models like BERT, GPT-3, and ChatGPT, which have received a lot of attention all over the world. While many language models are trained on English or on multiple languages, models and datasets for specific languages can be difficult to find or of questionable quality.

NLP has a vast number of applications, including but not limited to translation, information extraction, summarization and question answering; the latter is what I’ve personally been working on. As an Applied Artificial Intelligence student, I have been working on question answering NLP models and have found it challenging to find a useful Dutch dataset for training purposes. To address this issue, I have developed a translation-based solution that can be applied to various NLP problems and to practically any language, which may be of interest to other students. I believe it is also valuable to the AI development and research community: there are hardly any Dutch datasets available, especially for specific tasks like question answering. By translating a large and well-known dataset, I have been able to create a Dutch question answering model with relatively little effort.

If you are interested in learning more about my process, the challenges I faced, and the potential applications of this solution, please continue reading. This article is aimed at students with a basic NLP background. However, I’ve also included introductions to various concepts for those who are not yet familiar with the field or simply need a refresher.

To properly explain my solution for using translated datasets, I have divided this article into two main sections: the translation of a dataset and the training of a question answering model. I’ve written this article both to show you my progress towards the solution and to serve as a step-by-step guide. The article consists of the following chapters:

  1. Refresher on NLP and a brief history of NLP
  2. The problem, the dataset and question answering
  3. Translating the dataset
  4. Building a question answering model
  5. What has been achieved and what has not been achieved?
  6. Future plans
  7. Sources

Refresher on NLP and a brief history of NLP

To get a better understanding of the various elements of this solution, I want to start with a refresher on NLP and its recent history. The languages we know can be split into two groups: formal and natural. Formal languages have been designed for specific tasks, like mathematics and programming. A natural or ordinary language is a language that has developed and evolved among humans without any planning ahead. This can take multiple forms, like the different kinds of human speech we know or even sign language [2].

NLP in its broadest form is the application of computational methods to natural languages. By combining rule-based modelling of language with AI models, we have been able to get computers to ‘understand’ our human language well enough to process it in both text and speech form [3]. How this understanding works, if it can even be called understanding, is up for debate. But recent developments like ChatGPT have shown that the output of these models often makes them feel sentient to us and gives the impression of a high level of understanding [4].

Of course, this understanding didn’t come out of nowhere. NLP has a long history dating back to the 1940s, just after World War II [5]. During this period, people realized the importance of translation and hoped to create a machine that could do it automatically. However, this proved to be quite the challenge. Around 1960, NLP research split into rule-based and stochastic approaches. Rule-based, or symbolic, research covered mainly formal languages and the generation of syntax; many of the linguistic researchers and computer scientists in this group saw it as the beginning of artificial intelligence research. Stochastic research focused more on statistics and on problems like pattern recognition between texts.

Since then, many more developments in NLP have been made and many more areas of research have emerged. However, the actual text produced by NLP models long remained quite limited and had few real-world applications. That is, until the early 2000s. After that, developments in NLP made big leaps every few years, which led to where we are now.

The problem, the dataset and question answering

Now that I’ve given a short refresher on NLP, it is time to introduce the actual problem that I have been working on. In short, my goal was to train a Dutch question answering machine learning model. However, the lack of suitable datasets made this quite difficult, which is why I created my own by using translation. In this article I will go through the creation of a dataset and the training of the machine learning model step by step, so you can follow along and either replicate the entire solution or pick the parts that are of importance to you.

This article can be split into two main components: the creation of a Dutch dataset and the training of a question answering machine learning model. In this chapter I will give some background information on both, introduce my solutions and explain my choices.

The dataset

If we want to find a useful Dutch dataset, it is important to look at what exactly is needed to train a question answering model. There are two main approaches to generating answers to questions: extractive and abstractive.

· Extractive question answering models are trained to extract an answer from the context (the source text) [7]. Older approaches used to do this by training a model to output a start and end index of the location of the answer in the context. However, the introduction of Transformers has made this approach obsolete.

· Abstractive question answering models are trained to generate new text based on the context and the question [8].

Figure 1 shows an example of the output extractive and abstractive models might give.

Although different approaches are possible, nowadays both extractive and abstractive question answering models are often based on Transformers like BERT [8], [9].

Figure 1. An example of an answer generated in an extractive versus abstractive way.
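
To make the distinction concrete, below is a minimal, made-up sketch of the two output styles; the context, question and values are invented purely for illustration.

# Invented example purely to illustrate the two output styles.
context = "The Eiffel Tower was completed in 1889 and is located in Paris."
question = "When was the Eiffel Tower completed?"

# Extractive: the answer is a span of the context, often described by start/end indices.
start = context.find("1889")
extractive_answer = {"text": context[start:start + 4], "answer_start": start, "answer_end": start + 4}

# Abstractive: the answer is newly generated text that does not have to appear verbatim in the context.
abstractive_answer = "The Eiffel Tower was completed in 1889."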

Based on the information about extractive and abstractive models, we now know that we need a dataset with contexts, questions, answers and, optionally, start and end indices of the location of the answer in the context. I have explored the following options in order to find a suitable dataset.

  • I have used a 2020 paper by Cambazoglu et al. [10] to get a clear image of what datasets are available for question answering. Their research resulted in a table with the most prominent question answering datasets. Unfortunately, none of these big datasets are in the Dutch language.
  • Another option was Huggingface, which hosts a large collection of datasets [11]. At first glance, there are a few question answering datasets available for the Dutch language. However, further inspection shows that these datasets are often incomplete, include website domains instead of contexts or mix various languages, which makes them unusable or too incomplete for our goal.

Concluding from these observations, there are practically no public datasets that can be used to train a Dutch question answering model. Creating our own dataset manually would take far too much time, so what other options do we have? Firstly, we could simply use an English model, translate the input from Dutch to English and then translate the output back to Dutch (a rough sketch of this follows below). However, a quick test of this approach with Google Translate showed that the results are far from desirable and almost feel passive-aggressive. Perhaps too much information and context got lost during the double translation step? That leads to the second option: translating the entire dataset and training on it. During my research I have come across a few instances where this was mentioned. For example, a post by Zoumana Keita on Towards Data Science [16] uses translation for data augmentation. Chapter three will dive into my execution of the translation of a dataset.
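
For illustration, a round-trip pipeline along the lines of that first option could look like the sketch below. It assumes the Googletrans module used later in this article and an arbitrary English extractive QA checkpoint from Huggingface; distilbert-base-cased-distilled-squad is only an example choice.

from googletrans import Translator
from transformers import pipeline

translator = Translator()
# Any English QA model from the Huggingface hub works here; this checkpoint is only an example.
qa_en = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

def answer_dutch_question(context_nl, question_nl):
    # Dutch -> English
    context_en = translator.translate(context_nl, src="nl", dest="en").text
    question_en = translator.translate(question_nl, src="nl", dest="en").text
    # Answer with the English model
    answer_en = qa_en(question=question_en, context=context_en)["answer"]
    # English -> Dutch
    return translator.translate(answer_en, src="en", dest="nl").text

print(answer_dutch_question("De grote bruine vos springt over de luie hond heen.",
                            "Waar springt de vos overheen?"))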

Lastly, we need to select which dataset to use for our translation approach. Since we decided to translate the entire dataset, it does not matter what language the original dataset is in. The Stanford Question Answering Dataset (SQuAD) [12] is quite popular and is used by Papers with Code for the question answering benchmark [13]. It also contains a large number (100,000+) of questions with answers and, upon closer inspection, does not seem to have any unexpected data. This is the dataset we will be working with.
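
For reference, a single (heavily shortened) entry in the SQuAD v2.0 JSON has roughly the following nested structure, which is also what the parsing code in chapter three walks through:

# Shortened illustration of the SQuAD v2.0 JSON layout.
squad_entry = {
    "data": [
        {
            "title": "Normans",
            "paragraphs": [
                {
                    "context": "The Normans were the people who in the 10th and 11th centuries ...",
                    "qas": [
                        {
                            "question": "In what country is Normandy located?",
                            "id": "56ddde6b9a695914005b9628",
                            "answers": [{"text": "France", "answer_start": 159}],
                            "is_impossible": False,
                        }
                    ],
                }
            ],
        }
    ]
}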

The machine learning model

Now that we have determined how we are going to get a dataset, we need to decide what kind of machine learning model is suitable for the goal of answering questions. In the previous chapter we established that we can choose between an extractive model and an abstractive model. In my research I have used an abstractive model because it is based on newer technology and gives more interesting results. However, in case anyone wants to take this approach for an extractive model, I will cover that as well. This is also in line with the selection of the dataset, since it contains the start indices of the answers.

Training a Transformer from scratch would be, to say the least, inefficient. The book Transfer Learning for Natural Language Processing by P. Azunre [14] goes in depth on why transfer learning is used and shows a number of examples of how to do it. A large number of big NLP models are hosted on Huggingface [15] and are available for transfer learning. I have chosen the t5-v1_1-base model because it is multi-task trained on multiple languages. Chapter 4 will cover the transfer learning of this model.

Translating the dataset

In this chapter I will show how I translated the dataset by giving snippets of code and explaining them. Taken together, these code blocks form the entire dataset translation script I’ve written. Feel free to follow along or take the specific parts that are of use to you.

Imports

The solution uses a few modules. First of all, we need to translate text as fast as possible. In my research I tried various translation AI models from Huggingface, but by far the fastest translator was the Googletrans module, which makes use of the Google Translate API. The solution also uses Timeout from httpx to define a timeout for the translations, json for parsing the SQuAD dataset, Pandas for dataframes and time to measure how long everything takes.

from googletrans import Translator, constants
from httpx import Timeout

import json
import pandas as pd
import time

Initialization

First of all we should define a few constants that will be used throughout the script. For ease of access I have defined the source language and the target language here.

The Googletrans module provides us with a Translator that can have a custom timeout defined. I have used a relatively long timeout because translations kept timing out during my tests. I will provide a bit more information on this issue further along in the guide.

src_lang = "en"
dest_lang = "nl"

translator = Translator(timeout = Timeout(60))

Reading the SQuAD dataset

The following code extracts contexts, questions and answers from the train and validation JSON files. This is done by reading the files as JSON and looping through the data to extract the three lists. For each question and answer, the context is copied and added to the contexts list. This way we can easily access a question with its relevant context and answer by using an index.

def read_squad(path):
    with open(path, 'rb') as f:
        squad_dict = json.load(f)
    contexts, questions, answers = [], [], []
    for group in squad_dict['data']:
        for passage in group['paragraphs']:
            context = passage['context']

            for qa in passage['qas']:
                question = qa['question']
                if 'plausible_answers' in qa.keys():
                    access = 'plausible_answers'
                else:
                    access = 'answers'
                for answer in qa[access]:
                    contexts.append(context)
                    questions.append(question)
                    answers.append(answer['text'])
    return contexts, questions, answers

train_c, train_q, train_a = read_squad('squad-train-v2.0.json')
val_c, val_q, val_a = read_squad('squad-dev-v2.0.json')

Timing

The following code gives a very rough estimate of how long translating each part of the dataset will take, based on the duration of a single translation.

def time_translation(entries, name):
    start_time = time.time()
    translation = translator.translate(entries[0], dest=dest_lang, src=src_lang)
    duration = time.time() - start_time
    total_duration = len(entries) * duration
    print(f"translating {name} takes {total_duration/60/60} hours")

time_translation(train_c, "train contexts")
time_translation(train_q, "train questions")
time_translation(train_a, "train answers")
time_translation(val_c, "validation contexts")
time_translation(val_q, "validation questions")
time_translation(val_a, "validation answers")

Translating

Remember how I mentioned translations timing out? During my research I kept bumping into the issue where translations timed out and the resulting dataset got corrupted. It turns out that the Googletrans module is not 100% reliable since it relies on the Google Translate API. The way I found around this is to create a small wrapper function that keeps trying until the translation has been successful. After doing this I no longer experienced the timeout problem.

def get_translation(text):
    success = False
    translation = ""
    while not success:
        try:
            translation = translator.translate(text, dest=dest_lang, src=src_lang).text
            success = True
        except Exception:
            # The request failed or timed out; retry until it succeeds.
            pass
    return translation

Because of the way we have extracted contexts from the dataset, they have been duplicated for each question and answer pair. Simply translating all contexts would be redundant and very slow so the following translation function compares the previous context to the current one first. If they match, the previous translation is used.

def translate_context(contexts, name):
    start_time = time.time()
    context_current = ""
    context_translated = ""
    translated_contexts = []
    index = 0

    for context in contexts:
        index += 1
        if context != context_current:
            context_current = context
            print(f"[{index}/{len(contexts)}]")
            context_translated = get_translation(context)
            translated_contexts.append(context_translated)
        else:
            # The context is identical to the previous one, so reuse its translation.
            translated_contexts.append(context_translated)

    duration = time.time() - start_time
    print(f"Translating {name} took {round(duration, 2)}s")
    return translated_contexts

Translating the questions and answers is pretty straightforward since we just need to loop through the lists to translate all of them.

def translate_qa(input, name):
    start_time = time.time()
    input_translated = []
    index = 0
    for text in input:
        text_nl = get_translation(text)
        input_translated.append(text_nl)
        index += 1
        print(f"[{index}/{len(input)}]")
    duration = time.time() - start_time
    print(f"Translating {name} took {round(duration, 2)}s")
    return input_translated

Now we can use the functions we have defined to translate all parts of the dataset.

train_c_translated = translate_context(train_c, "train contexts")
train_q_translated = translate_qa(train_q, "train questions")
train_a_translated = translate_qa(train_a, "train answers")

val_c_translated = translate_context(val_c, "val contexts")
val_q_translated = translate_qa(val_q, "val questions")
val_a_translated = translate_qa(val_a, "val answers")

Exporting

All that is left is exporting the translations for later use. We can do this by converting the lists to dataframes and then using the to_csv function. One thing to keep in mind is that the Googletrans module outputs translations with characters that caused problems with utf-8 encoding in my setup. That is why we use utf-16 encoding here. It would make sense to convert the text to utf-8 at some point since that might be more useful in an AI model. However, since we are just working on the dataset here, we can leave that step for later, when we do the data preprocessing for training our model.

def save_data(data, name, header):
    data_df = pd.DataFrame(data)
    data_df.to_csv(name + "_pdcsv.csv", encoding='utf-16', index_label="Index", header=[header])

save_data(train_c_translated, "train_contexts", "contexts")
save_data(train_q_translated, "train_questions", "questions")
save_data(train_a_translated, "train_answers", "answers")
save_data(val_c_translated, "val_contexts", "contexts")
save_data(val_q_translated, "val_questions", "questions")
save_data(val_a_translated, "val_answers", "answers")

Building a question answering model

Figuring out how to train a question answering model turned out to be a bit of a challenge. However, by taking inspiration from a notebook by P. Suraj [17], I was able to create a Transformer-based model that can be trained on question answering. In line with the notebook, I have used PyTorch to create the model.

Imports

Starting with the imports, the following modules are used. We also define variables for the maximum input and output length of the model.

import pandas as pd
import unicodedata

import torch
from torch.utils.data import DataLoader

from transformers import T5Tokenizer
from transformers import T5ForConditionalGeneration
from transformers import AdamW
from tqdm import tqdm

from sklearn.model_selection import train_test_split
from datetime import datetime

max_text_length = 512
max_output_length = 256

Loading data

Now we can load the dataset that we have previously created. Since we used Pandas to export a CSV, we can easily load it and convert it to a list. I have also defined a function that will be used later on to normalize any training or input data to plain utf-8 text, which is the format we will train the model on.

def load_data(path):
    df = pd.read_csv(path, encoding='utf-16')
    df = df.drop('Index', axis=1)
    data = df.values.tolist()
    data = [a[0] for a in data]
    return data

def to_utf8(text):
    try:
        text = unicode(text, 'utf-8')
    except NameError:
        pass
    # Normalize accented characters and strip anything that cannot be represented in ASCII.
    text = unicodedata.normalize('NFD', text).encode('ascii', 'ignore').decode("utf-8")
    return str(text)

Now we can actually load the data. For the training of the model I only used the train data and split this with a test size of 0.2.

contexts_csv = 'train_contexts_pdcsv.csv'
questions_csv = 'train_questions_pdcsv.csv'
answers_csv = 'train_answers_pdcsv.csv'

contexts = load_data(contexts_csv)
questions = load_data(questions_csv)
answers = load_data(answers_csv)

c_train, c_val, q_train, q_val, a_train, a_val = train_test_split(
    contexts, questions, answers, test_size=0.2, random_state=42)

Preparing data

As I mentioned before, it is possible to train an extractive model and an abstractive model. During my research I developed both. In this article I only cover the abstractive version but, for anyone interested, I will also explain how I preprocessed my data for the extractive model. This was necessary to create the start and end indices of the answers in the contexts.

Abstractive

The dataset does not need much preprocessing in order to train an abstractive model. We simply convert all training data to utf-8. The last three lines can be uncommented to decrease the size of the training set, which will reduce training time and might help with debugging.

def clean_data(contexts, questions, answers):
    cleaned_contexts, cleaned_questions, cleaned_answers = [], [], []
    for i in range(len(answers)):
        cleaned_contexts.append(to_utf8(contexts[i]))
        cleaned_questions.append(to_utf8(questions[i]))
        cleaned_answers.append(to_utf8(answers[i]))
    return cleaned_contexts, cleaned_questions, cleaned_answers

cc_train, cq_train, ca_train = clean_data(c_train, q_train, a_train)
cc_val, cq_val, ca_val = clean_data(c_val, q_val, a_val)

print("Original data size: " + str(len(q_train)))
print("Filtered data size: " + str(len(cq_train)))

#cc_train = cc_train[0:1000]
#cq_train = cq_train[0:1000]
#ca_train = ca_train[0:1000]

Extractive

In many cases, extractive models need start and end indices of the answer in the context. However, since we translated our dataset with machine translation, a few issues can occur. For example, answers might be worded differently than in the context, or the position or length of the answer might have changed. To solve this, we can try to find the answer in the context and, if the answer is found, add it to the cleaned answers. This also gives us the start index, and the end index is simply the start index plus the length of the answer.

def clean_data(contexts, questions, answers):
    cleaned_contexts, cleaned_questions, cleaned_answers = [], [], []
    for i in range(len(answers)):
        index = contexts[i].find(answers[i])
        if index != -1:
            #print(str(index) + " + " + str(index+len(answers[i])))
            cleaned_contexts.append(contexts[i])
            cleaned_questions.append(questions[i])
            cleaned_answers.append({
                'text': answers[i],
                'answer_start': index,
                'answer_end': index + len(answers[i])
            })
    return cleaned_contexts, cleaned_questions, cleaned_answers

cc_train, cq_train, ca_train = clean_data(c_train, q_train, a_train)
cc_val, cq_val, ca_val = clean_data(c_val, q_val, a_val)

Tokenizer

The next step is tokenizing. Since we are using t5-v1_1-base, we can simply load the matching tokenizer from Huggingface. We then tokenize the contexts together with the questions so that the tokenizer joins them with end-of-sequence tokens. We also specify the previously defined max_text_length. Lastly, the tokenized answers are added to the encodings as the target.

tokenizer = T5Tokenizer.from_pretrained('google/t5-v1_1-base')
train_encodings = tokenizer(cc_train, cq_train, max_length=max_text_length, truncation=True, padding=True)
val_encodings = tokenizer(cc_val, cq_val, max_length=max_text_length, truncation=True, padding=True)

def add_token_positions(encodings, answers):
    tokenized = tokenizer(answers, truncation=True, padding=True)
    encodings.update({'target_ids': tokenized['input_ids'], 'target_attention_mask': tokenized['attention_mask']})

add_token_positions(train_encodings, ca_train)
add_token_positions(val_encodings, ca_val)

Dataloader

We’ll use a DataLoader to train the PyTorch model as follows. Here the batch size can also be specified. The server I trained on had limited memory, so I had to use a batch size of 2. If possible, a bigger batch size would be preferable.

class SquadDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings
        print(encodings.keys())

    def __getitem__(self, idx):
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings.input_ids)

train_dataset = SquadDataset(train_encodings)
val_dataset = SquadDataset(val_encodings)

train_loader = DataLoader(train_dataset, batch_size=2, shuffle=True)

Training the model

The model we use is T5ForConditionalGeneration based on t5-v1_1-base. If CUDA is installed on the PC or server that is used for training, we can utilize it to significantly increase training speed. We also put the model in training mode.

The optimizer we use is AdamW with a learning rate of 1e-4. This is based on the T5 documentation [18] which mentions that it is a good value to use in our situation:

Typically, 1e-4 and 3e-4 work well for most problems (classification, summarization, translation, question answering, question generation).

Lastly we define a function that saves our model for later usage after it is done training.

model = T5ForConditionalGeneration.from_pretrained('google/t5-v1_1-base')
cuda = torch.cuda.is_available()
device = torch.device('cuda') if cuda else torch.device('cpu')
model.to(device)
model.train()

optimizer = AdamW(model.parameters(), lr=1e-4)

def save_model():
    now = datetime.now()
    date_time = now.strftime(" %m %d %Y %H %M %S")
    torch.save(model.state_dict(), "answer_gen_models/nlpModel" + date_time + ".pt")

The actual training of the model is done in three epochs; the notebook I have used [17] and the T5 documentation both state that this is a good number of epochs to train for. On my PC, which has an RTX 3090, this would take about 24 hours per epoch. The server I used has an Nvidia Tesla T4 and took about 6 hours per epoch.

The tqdm module is used for visual feedback on the training state. It shows the elapsed time and the estimated time training will take. The steps between the two commented arrows are important for our goal of question answering; this is where we define what input to give the model. The other steps in this code block are pretty standard for training a PyTorch model.

for epoch in range(3):
    loop = tqdm(train_loader, leave=True)
    for batch in loop:
        optimizer.zero_grad()

        # >
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        target_ids = batch['target_ids'].to(device)
        target_attention_mask = batch['target_attention_mask'].to(device)

        outputs = model(input_ids, attention_mask=attention_mask,
                        labels=target_ids,
                        decoder_attention_mask=target_attention_mask)
        # >
        loss = outputs[0]
        loss.backward()
        optimizer.step()

        loop.set_description(f'Epoch {epoch}')
        loop.set_postfix(loss=loss.item())
save_model()

Results

If you have followed along, congratulations! You have created your own Dutch dataset and trained a Dutch question answering model! If you are like me, you probably can’t wait to try the model to see what results it gives. You can use the following code to evaluate the model. Interestingly enough, you might find that the model is not only capable of answering Dutch questions! It is also somewhat capable of answering questions in different (mostly Germanic) languages. This is most likely due to the fact that the original T5-v1_1-base model has been trained on four different languages.

model = T5ForConditionalGeneration.from_pretrained('google/t5-v1_1-base')
model.load_state_dict(torch.load("answer_gen_models/some_model.pt"))

cuda = torch.cuda.is_available()
device = torch.device('cuda') if cuda else torch.device('cpu')
model.to(device)
model.eval()

def test(context, question):
    input = tokenizer([to_utf8(context)],
                      [to_utf8(question)],
                      max_length=max_text_length,
                      truncation=True,
                      padding=True)
    with torch.no_grad():
        input_ids = torch.tensor(input['input_ids']).to(device)
        attention_mask = torch.tensor(input['attention_mask']).to(device)
        out = model.generate(input_ids,
                             attention_mask=attention_mask,
                             max_length=max_output_length,
                             early_stopping=True)
    print([tokenizer.decode(ids, skip_special_tokens=True) for ids in out][0])

test("Dit is een voorbeeld", "Wat is dit?")

Here are some example contexts and questions together with the answers that have been generated by the model:

Context: We zijn met de klas van de master Applied Artificial Intelligence naar keulen geweest.
Question: Waar is de klas heen geweest?
Answer: Keulen

Context: De grote bruine vos springt over de luie hond heen.
Question: Waar springt de vos overheen?
Answer: Luie hond

Context: The big brown fox jumps over the lazy dog.
Question: What does the fox do?
Answer: Jumps over the lazy dog

Context: Twee maal twee is tien.
Question: Wat is twee maal twee?
Answer: Tien

What has been achieved and what has not been achieved?

So, to summarize, we have selected an English dataset for question answering, translated it to Dutch using the Google Translate API and we have trained a PyTorch encoder-decoder model based on T5-v1_1-base. What exactly have we achieved with this and can this be used in real-life situations?

First of all, it is important to realize that we have not properly evaluated the model, as that was not part of the scope of this article. However, to properly interpret the results and say something about their usability, I suggest looking into metrics like ROUGE [19] or a human evaluation. The approach I have taken is a human evaluation. Table 2 shows the average rating between one and five that five people gave the generated answers for various context sources and questions. The average score is 2.96. This number alone does not tell us much, but we can conclude from the table that the model we created can in some cases generate near-perfect answers. However, it also quite often generates answers that the panel of human evaluators considered to be complete nonsense.

Table 2. Human evaluation scores (1–5) of various articles, papers and theses.
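
For anyone who wants to complement a human evaluation with an automated metric, a minimal ROUGE computation with Huggingface’s evaluate library could look like the sketch below; the predictions and references lists are placeholders for the model’s generated answers and the translated reference answers.

import evaluate

# Placeholder data: generated answers and the corresponding reference answers from the dataset.
predictions = ["Keulen", "de luie hond"]
references = ["Keulen", "luie hond"]

rouge = evaluate.load("rouge")
scores = rouge.compute(predictions=predictions, references=references)
print(scores)  # rouge1, rouge2, rougeL and rougeLsum scores between 0 and 1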

It is also important to note that, by translating a dataset, we have most likely introduced a bias. The AI behind Google Translate has been trained on a dataset which, since it is based on natural language, naturally contains a bias. By translating our data with it, this bias will be passed on to any model that’s trained with the dataset. Before a dataset created like this can be used in a real-life situation, it should be evaluated thoroughly to indicate what biases there are and how they impact the results.

However, this solution can be very interesting to people who are experimenting with AI, developing a new kind of machine learning model or are simply learning about NLP. It is a very accessible way to get a big dataset in any language for almost any NLP problem. Many students do not have access to big datasets because they are only accessible to big companies or are too expensive. With an approach like this, any big English dataset can be transformed into a dataset in a specific language.

Future plans

Personally, I am interested in seeing where I can take this approach. I am currently working on a question generation model that uses exactly the same approach and dataset. I would like to investigate using these two models combined so I can learn more about potential biases or errors that have been introduced. This is in line with chapter 5, in which I talked about the need for evaluation. I have done a human evaluation by asking five people to rate the results of the created model. However, I intend to learn more about different metrics, which can hopefully tell me more about how the model works, why it generates certain results and what biases it contains.

I have also learned that version 2.0 of the Stanford Question Answering Dataset includes questions that cannot be answered. Even though it is not directly related to the solution offered in this article, I am curious about the differences in results when I apply the solution of this article to the full SQuAD 2.0 dataset.
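
As a small pointer for that experiment: SQuAD 2.0 marks these unanswerable questions with an is_impossible flag, so a sketch for counting them (reusing the JSON loading from chapter three) could look like this:

import json

with open('squad-train-v2.0.json', 'rb') as f:
    squad_dict = json.load(f)

# Count the questions that SQuAD 2.0 marks as unanswerable.
unanswerable = sum(
    qa.get('is_impossible', False)
    for group in squad_dict['data']
    for passage in group['paragraphs']
    for qa in passage['qas']
)
print(f"Unanswerable questions: {unanswerable}")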

Sources

[1] A. Vaswani et al., “Attention Is All You Need,” 2017.

[2] D. Khurana, A. Koli, K. Khatter, and S. Singh, “Natural language processing: state of the art, current trends and challenges,” Multimedia Tools and Applications, Jul. 2022, doi: 10.1007/s11042-022-13428-4.

[3] “What is Natural Language Processing? | IBM,” www.ibm.com. https://www.ibm.com/topics/natural-language-processing (accessed Jan. 11, 2023).

[4] E. Holloway, “Yes, ChatGPT Is Sentient — Because It’s Really Humans in the Loop,” Mind Matters, Dec. 26, 2022. https://mindmatters.ai/2022/12/yes-chatgpt-is-sentient-because-its-really-humans-in-the-loop/ (accessed Jan. 18, 2023).

[5] “NLP — overview,” cs.stanford.edu. https://cs.stanford.edu/people/eroberts/courses/soco/projects/2004-05/nlp/overview_history.html (accessed Jan. 18, 2023).

[6] S. Ruder, “A Review of the Recent History of Natural Language Processing,” Sebastian Ruder, Oct. 01, 2018. https://ruder.io/a-review-of-the-recent-history-of-nlp/ (accessed Jan. 18, 2023).

[7] S. Varanasi, S. Amin, and G. Neumann, “AutoEQA: Auto-Encoding Questions for Extractive Question Answering,” Findings of the Association for Computational Linguistics: EMNLP 2021, 2021.

[8] “What is Question Answering? — Hugging Face,” huggingface.co. https://huggingface.co/tasks/question-answering (accessed Jan. 18, 2023).

[9] R. E. López Condori and T. A. Salgueiro Pardo, “Opinion summarization methods: Comparing and extending extractive and abstractive approaches,” Expert Systems with Applications, vol. 78, pp. 124–134, Jul. 2017, doi: 10.1016/j.eswa.2017.02.006.

[10] B. B. Cambazoglu, M. Sanderson, F. Scholer, and B. Croft, “A review of public datasets in question answering research,” ACM SIGIR Forum, vol. 54, no. 2, pp. 1–23, Dec. 2020, doi: 10.1145/3483382.3483389.

[11] “Hugging Face — The AI community building the future.,” huggingface.co. https://huggingface.co/datasets?language=language:nl&task_categories=task_categories:question-answering&sort=downloads (accessed Jan. 18, 2023).

[12] “The Stanford Question Answering Dataset,” rajpurkar.github.io. https://rajpurkar.github.io/SQuAD-explorer/ (accessed Jan. 18, 2023).

[13] “Papers with Code — Question Answering,” paperswithcode.com. https://paperswithcode.com/task/question-answering (accessed Jan. 18, 2023).

[14] P. Azunre, Transfer Learning for Natural Language Processing. Simon and Schuster, 2021.

[15] “Hugging Face — On a mission to solve NLP, one commit at a time.,” huggingface.co. https://huggingface.co/models (accessed Jan. 18, 2023).

[16] Z. Keita, “Data Augmentation in NLP Using Back Translation With MarianMT,” Medium, Nov. 05, 2022. https://towardsdatascience.com/data-augmentation-in-nlp-using-back-translation-with-marianmt-a8939dfea50a (accessed Jan. 18, 2023).

[17] P. Suraj, “Google Colaboratory,” colab.research.google.com. https://colab.research.google.com/github/patil-suraj/exploring-T5/blob/master/T5_on_TPU.ipynb (accessed Jan. 25, 2023).

[18] “T5,” huggingface.co. https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5Model (accessed Jan. 25, 2023).

[19] “ROUGE — a Hugging Face Space by evaluate-metric,” huggingface.co. https://huggingface.co/spaces/evaluate-metric/rouge (accessed Jan. 25, 2023).

All images unless otherwise noted are by the author.
