
MultiChoice Question Answering In HuggingFace

Unveiling the power of question answering

Image from unsplash.com

Natural language processing techniques are demonstrating immense capability on question answering (QA) tasks. In this post, we leverage the HuggingFace library to tackle a multiple choice question answering challenge.

Specifically, we fine-tune a pre-trained BERT model on a multi-choice question dataset using the Trainer API. This allows adapting the powerful bidirectional representations from pre-trained BERT to our target task. By adding a classification head, the model learns textual patterns that help determine the correct choice out of a set of answer options per question. We then evaluate performance using accuracy across the held-out test set.

The HuggingFace Transformers library makes it easy to experiment with different model architectures, tokenizer options, and training approaches. In this analysis, we demonstrate a step-by-step recipe for achieving competitive performance on multiple choice QA with HuggingFace Transformers.

First step: Install and Import libraries

The first step is to install and import the libraries. To install them, use the pip install command as follows:

!pip install datasets transformers[torch] --quiet

and then import the necessary libraries:

import numpy as np
import pandas as pd
import os
import json
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

from transformers.modeling_outputs import SequenceClassifierOutput
from transformers import (
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    set_seed,
    DataCollatorWithPadding,
    DefaultDataCollator
)
from datasets import load_dataset, load_metric
from dataclasses import dataclass, field
from typing import Optional, Union

Second step: Load the dataset

In the second step, we load the train and test datasets. We use the CODAH dataset, which is available for commercial use and is released under the ODC-BY license [1].

from datasets import load_dataset

codah = load_dataset("codah", "codah")
Image by author
def update_columns(example):
    example['question'] = example['question_propmt']
    example['answer'] = example['candidate_answers'][example['correct_answer_idx']]
    example['choice_list'] = example['candidate_answers']
    example['label'] = example['correct_answer_idx']
    return example

codah_processed = codah['train'].map(update_columns)
codah_processed = codah_processed.remove_columns(['id','question_category','question_propmt','candidate_answers', 'correct_answer_idx'])

We process the dataset in the above code snippet so that it has only four columns: question, answer, choice_list, and label.

Let’s look at the first row of the train data.
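Indexing the processed dataset directly gives us the first example (a quick check; codah_processed is the processed dataset we built above):

codah_processed[0]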

{'question': 'I am always very hungry before I go to bed. I am',
 'answer': 'tempted to snack when I feel this way.',
 'choice_list': ['concerned that this is an illness.',
  'glad that I do not have a kitchen.',
  'fearful that there are monsters under my bed.',
  'tempted to snack when I feel this way.'],
 'label': 3}

As we see above, the train data has the following columns:

  1. question: the text of the question.
  2. answer: the choice that correctly answers the question.
  3. choice_list: the list containing the answer and its distractors.
  4. label: the index of the answer within the choice_list.

Let’s take 10% of the data for the test dataset:

codah_split = codah_processed.train_test_split(test_size=0.1)
Image by author

We will use test dataset later to predict the correct choice.

Creating train-validation split:

Let’s split the training data to create a validation dataset. We split the training data into 90% train and 10% validation. To do so, we use the train_test_split function and pass a split ratio.


dataset = codah_split['train'].train_test_split(test_size=0.1)
train_dataset = dataset['train']
validation_dataset = dataset['test']
test_dataset = codah_split['test']
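Note that train_test_split shuffles the data randomly, so the splits change between runs. If you want reproducible splits, you can fix the seeds (a small optional sketch; the seed value 42 is arbitrary):

# make the splits reproducible: seed the python/numpy/torch RNGs and pass a seed to the split
set_seed(42)
dataset = codah_split['train'].train_test_split(test_size=0.1, seed=42)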

Third Step: Setting Parameters

When fine-tuning models in HuggingFace, there are 3 main classes of arguments we need to set up:

  1. Model Arguments: These configure the model architecture itself. For example:

  • model_name_or_path – pretrained model name from the Hub
  • num_labels – number of output labels
  • cache_dir – where to cache pre-trained models

  2. Data Arguments: These configure the input data and how it is processed. For example:

  • train_file/validation_file/test_file – filenames for splits
  • max_seq_length – truncate input sequences
  • pad_to_max_length – pad token sequences

  3. Training Arguments: These determine training behavior. For example:

  • output_dir – where to save model checkpoints
  • num_train_epochs – number of training epochs
  • per_device_train_batch_size – training batch size

Properly configuring arguments from these 3 classes is crucial for effectively training HuggingFace models.

@dataclass
class ModelArguments:
    """
    Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
    """

    model_name_or_path: str = field(
        metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
    )
    config_name: Optional[str] = field(
        default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
    )
    tokenizer_name: Optional[str] = field(
        default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
    )
    cache_dir: Optional[str] = field(
        default=None,
        metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"},
    )
    use_fast_tokenizer: bool = field(
        default=True,
        metadata={"help": "Whether to use one of the fast tokenizer (backed by the tokenizers library) or not."},
    )
    model_revision: str = field(
        default="main",
        metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."},
    )
    token: str = field(
        default=None,
        metadata={
            "help": (
                "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token "
                "generated when running `huggingface-cli login` (stored in `~/.huggingface`)."
            )
        },
    )
    use_auth_token: bool = field(
        default=None,
        metadata={
            "help": "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead."
        },
    )
    trust_remote_code: bool = field(
        default=False,
        metadata={
            "help": (
                "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option"
                "should only be set to `True` for repositories you trust and in which you have read the code, as it will "
                "execute code present on the Hub on your local machine."
            )
        },
    )

Let’s instantiate from the above class:

model_args = ModelArguments(model_name_or_path='bert-base-uncased', tokenizer_name='bert-base-uncased', cache_dir='./', use_fast_tokenizer=True)

Next, we define the DataTrainingArguments class, which holds the arguments pertaining to the data we feed our model for training and evaluation.

@dataclass
class DataTrainingArguments:
    """
    Arguments pertaining to what data we are going to input our model for training and eval.
    """

    train_file: Optional[str] = field(default=None, metadata={"help": "The input training data file (a text file)."})

    validation_file: Optional[str] = field(
        default=None,
        metadata={"help": "An optional input evaluation data file to evaluate the perplexity on (a text file)."},
    )
    overwrite_cache: bool = field(
        default=False, metadata={"help": "Overwrite the cached training and evaluation sets"}
    )
    preprocessing_num_workers: Optional[int] = field(
        default=None,
        metadata={"help": "The number of processes to use for the preprocessing."},
    )
    max_seq_length: Optional[int] = field(
        default=None,
        metadata={
            "help": (
                "The maximum total input sequence length after tokenization. If passed, sequences longer "
                "than this will be truncated, sequences shorter will be padded."
            )
        },
    )
    pad_to_max_length: bool = field(
        default=False,
        metadata={
            "help": (
                "Whether to pad all samples to the maximum sentence length. "
                "If False, will pad the samples dynamically when batching to the maximum length in the batch. More "
                "efficient on GPU but very bad for TPU."
            )
        },
    )
    max_train_samples: Optional[int] = field(
        default=None,
        metadata={
            "help": (
                "For debugging purposes or quicker training, truncate the number of training examples to this "
                "value if set."
            )
        },
    )
    max_eval_samples: Optional[int] = field(
        default=None,
        metadata={
            "help": (
                "For debugging purposes or quicker training, truncate the number of evaluation examples to this "
                "value if set."
            )
        },
    )

and we instantiate from this class too:

data_args = DataTrainingArguments(train_file=train_dataset, validation_file=validation_dataset,
                      max_train_samples=len(train_dataset), max_eval_samples = len(validation_dataset))

Next, we instantiate TrainingArguments class from HuggingFace. We don’t need to define this class as it already exists in HuggingFace.

model_name = model_args.model_name_or_path.split("/")[-1]

training_args = TrainingArguments(
    f"{model_name}-finetuned-demo",
    evaluation_strategy = "epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=1,
    weight_decay=0.01,
    do_train = True,
    do_eval = True,
    do_predict = True
)

Fourth Step: Loading tokenizer

We load a pre-trained tokenizer with the AutoTokenizer.from_pretrained function to encode our datasets.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name, use_fast=model_args.use_fast_tokenizer)

print(tokenizer)
BertTokenizerFast(name_or_path='bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
 0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
 100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
 101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
 102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
 103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

Once we print the tokenizer, we see its specifications, such as the vocabulary size, the maximum context length, the special tokens, etc.
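These properties are also available as attributes on the tokenizer object, which is handy when we set the maximum sequence length later (a small sketch):

# the same specifications, accessed programmatically
print(tokenizer.vocab_size)        # 30522
print(tokenizer.model_max_length)  # 512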

Now, let’s tokenize the data

Note that every example consists of a question, which is a string, and a choice_list, which is a list of four strings. The tokenizer encodes one question–choice pair at a time, so we cannot pass a question together with its whole list of choices as a single example; we have to pair the question with each choice separately. Let’s look at a few examples of the data:

# convert the train split to a pandas DataFrame for easier inspection
train_df = train_dataset.to_pandas()

c = 0
for row in train_df[['question','choice_list']].values:
  c += 1
  print(row[0])
  for i, choice in enumerate(row[1]):
    print(str(i+1)+") "+choice)
  print("------------------------------------------")
  if c > 3:
    break
Mr. and Mrs. Mustard have six daughters and each daughter has one brother. But there are only 9 people in the family, how is that possible?
1) Some daughters get married and have their own family.
2) Each daughter shares the same brother.
3) Some brothers were not loved by family and moved away.
4) None of above.
------------------------------------------
The six daughters of Mr. and Mrs. Mustard each have one brother. However, the family only consists of nine people; how is that possible?
1) Some brothers were not loved by family and moved away.
2) Some daughters get married and have their own family.
3) Each daughter shares the same brother.
4) None of above.
------------------------------------------
A chess team has five players, and each player has one coach. But there are only six participants in the team. How is that possible?
1) Each player shares the same coach.
2) Some players are backups and not allowed to play.
3) Some coaches get a raise.
4) None of above.
------------------------------------------
A woman shoots her husband. Then she holds him underwater for over 5 minutes. Finally, she hangs him. But 5 minutes later, they both go out and enjoy a wonderful dinner together. How can this be?
1) The woman gets arrested for murder after dinner.
2) The woman gets a new partner.
3) The woman was a photographer. She shot a picture of her husband, developed it, and hung it up to dry.
4) None of above.
------------------------------------------

To tokenize each example, we have to replicate the question four times and pass it to the tokenizer along with each choice. To start, let’s tokenize one question and one choice:

# To just tokenize one question and one choice:

print(validation_dataset[0]['question'])
print(validation_dataset[0]['choice_list'])

tokenizer(validation_dataset[0]['question'], validation_dataset[0]['choice_list'][0])

and we will get the following:

Some months' names contain eight letters, while others only contain five. Which month has three?
['August.', 'They all do.', 'June.', 'None of above.']
{'input_ids': [101, 2070, 2706, 1005, 3415, 5383, 2809, 4144, 1010, 2096, 2500, 2069, 5383, 2274, 1012, 2029, 3204, 2038, 2093, 1029, 102, 2257, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Now, let’s tokenize one question and all four of its choices. We have to replicate the question four times since there are four choices. We pass the question as a list containing it four times, and we pass the choices as a list of length four as well.

# To tokenize one question and all its choices:

repeated_question = [validation_dataset[0]['question']]*4
tokenizer(repeated_question, validation_dataset[0]['choice_list'])

We get the following:

{'input_ids': [[101, 2070, 2706, 1005, 3415, 5383, 2809, 4144, 1010, 2096, 2500, 2069, 5383, 2274, 1012, 2029, 3204, 2038, 2093, 1029, 102, 2257, 1012, 102], [101, 2070, 2706, 1005, 3415, 5383, 2809, 4144, 1010, 2096, 2500, 2069, 5383, 2274, 1012, 2029, 3204, 2038, 2093, 1029, 102, 2027, 2035, 2079, 1012, 102], [101, 2070, 2706, 1005, 3415, 5383, 2809, 4144, 1010, 2096, 2500, 2069, 5383, 2274, 1012, 2029, 3204, 2038, 2093, 1029, 102, 2238, 1012, 102], [101, 2070, 2706, 1005, 3415, 5383, 2809, 4144, 1010, 2096, 2500, 2069, 5383, 2274, 1012, 2029, 3204, 2038, 2093, 1029, 102, 3904, 1997, 2682, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

Now, we are going to replicate it for all examples in the dataset. We create two lists:

  1. first_sentences: contains the questions, each repeated four times.
  2. second_sentences: contains the choices of all questions.

# repeat each question four times, add it to first_sentences list
examples = validation_dataset
first_sentences = []
for example in examples:
  for i in range(4):
    first_sentences.append(example['question'])

# add choices to second_sentences list
second_sentences = []
for example in validation_dataset:
  for choice in example['choice_list']:
    second_sentences.append(choice)

As a sanity check, we ensure first_sentences and second_sentences have the same length:

print(len(first_sentences), len(second_sentences))
# prints 204, 204 -> this makes sense since validation data was 51 rows.

Now we pass them to the tokenizer:

# Tokenize
max_seq_length = tokenizer.model_max_length

tokenized_examples = tokenizer(first_sentences, second_sentences, truncation=True, max_length=max_seq_length)

tokenized_examples has the following keys:

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

The tokenized_examples['input_ids'] is a flat list of lists, i.e. [[],[],[],[],….,[]], because we passed the data as the flat first_sentences and second_sentences lists. Each pair first_sentences[i], second_sentences[i] is tokenized into one list of input_ids, for example [101, 2070, 2706, 1005, 3415, 5383, 2809, 4144, 1010, ...]. In the process, we have lost track of which lists of input_ids belong to the same question. To bring the structure back, we need to unflatten the flat list into a list of lists of four, so that the choices of each question are grouped together. To summarize, we are changing it from input_ids = [ [],[],[],[],[],[],…[] ] to input_ids = [ [ [],[],[],[] ], [ [],[],[],[] ], ….].

To unflatten, we do the following:

dic = {'input_ids':[], 'token_type_ids':[], 'attention_mask':[]}

for k, v in tokenized_examples.items():
    for i in range(0, len(v), 4):
        dic[k].append(v[i:i+4])
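As a quick sanity check, the unflattened dictionary should now group the encodings by question: one entry per question, each holding the four encoded question–choice pairs (a small sketch against the validation split used above):

print(len(dic['input_ids']))     # 51 -> one entry per validation question
print(len(dic['input_ids'][0]))  # 4  -> four encoded choices per question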

Putting it all together for Tokenizer

We put all the code we discussed above into a function called preprocess_function and then call it on train, validation and test data.

max_seq_length = tokenizer.model_max_length

def preprocess_function(examples):
    # replicating first sentences 4 times
    first_sentences = []
    for q in examples['question']:
      for i in range(4):
        first_sentences.append(q)

    # putting all choices in a list
    second_sentences = []
    for choice_list in examples['choice_list']:
      for choice in choice_list:
        second_sentences.append(choice)

    # Tokenize
    tokenized_examples = tokenizer(first_sentences, second_sentences, truncation=True, max_length=max_seq_length)

    # Un-flatten
    dic = {'input_ids':[], 'token_type_ids':[], 'attention_mask':[]}

    for k, v in tokenized_examples.items():
        for i in range(0, len(v), 4):
            dic[k].append(v[i:i+4])

    return dic

# note: the 'id' column was already removed earlier, so we only drop the remaining text columns
encoded_train = train_dataset.map(preprocess_function, batched=True, remove_columns=['question', 'answer', 'choice_list'])
encoded_validation = validation_dataset.map(preprocess_function, batched=True, remove_columns=['question', 'answer', 'choice_list'])
encoded_test = test_dataset.map(preprocess_function, batched=True, remove_columns=['question', 'choice_list'])

And the encoded train dataset looks as follows:

Image by author

Let’s take a look at the lengths of input_ids for one example in the validation data:

for item in encoded_validation[0]['input_ids']:
  print(len(item))
24
26
24
26

As we see, the lengths of input_ids are not the same: some are 24 and some are 26. We need a data collator to pad them to equal length.

Fifth Step: Padding data via Data collator

Take one input item of the encoded_train dataset as an example. It looks as follows:

{ label:1, input_ids:[ [],[],[],[] ], attention_mask:[ [],[],[],[] ] }.

This is a list of lists. But the data collator works on a flat list of items. Similar to how every item passed to the tokenizer has to be a string, every item passed to the data collator has to be a single flat list of token ids. So we have to flatten the input to the following format:

{input_ids: [], attention_mask:[]},

{input_ids: [], attention_mask:[]},

{input_ids: [], attention_mask:[]},

{input_ids: [], attention_mask:[]}

So we will flatten, then pad and then unflatten to bring the multi-choice structure back to the data. We will demonstrate it on validation dataset:

## flatten: convert  {input_ids: [[],[],[],[]], attention_mask:[[],[],[],[]]} to
## {input_ids: [], attention_mask:[]}, {input_ids: [], attention_mask:[]}, {input_ids: [], attention_mask:[]}, {input_ids: [], attention_mask:[]}

features = encoded_validation

accepted_keys = ["input_ids", "attention_mask", "label", "token_type_ids"]
features = [{k: v for k, v in encoded_validation[i].items() if k in accepted_keys} for i in range(len(features))]
labels = [feature.pop('label') for feature in features]

## to flatten:
flattened_features = []
for feature in features:
    for i in range(4):
        dic = {}
        for k,v in feature.items():
            dic[k] =v[i]
        flattened_features.append(dic)

Now flattened_features is a list of dictionaries where each dictionary is {input_ids: [], attention_mask: []}. We then pad the sequences.

'''
max_length=None means that the sequences will not be truncated at all.
The sequences will only be padded to match the longest sequence in the batch, but not truncated to a fixed length.
'''

batch = tokenizer.pad(
            encoded_inputs=flattened_features,
            padding=True,
            max_length=None,
            pad_to_multiple_of=None,
            return_tensors="pt",
        )

The flattened batch has the following shape:

flatten batch – image by author

Now that we have formed the padded batch, we unflatten it to the original shape:

# Un-flatten
batch_size = len(encoded_validation)
num_choices = 4
batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}

This results in the new shape:

unflatten batch – Image by author
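To verify the reshaping, we can also print the tensor shapes directly (a small sketch; the last dimension is the length of the longest sequence in the batch):

# every tensor now has shape (num_questions, num_choices, sequence_length)
for k, v in batch.items():
    print(k, tuple(v.shape))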

Putting it together for Data Collator

from dataclasses import dataclass
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from typing import Optional, Union
import torch

@dataclass
class DataCollatorForMultipleChoice:
    """
    Data collator that will dynamically pad the inputs for multiple choice received.
    """

    tokenizer: PreTrainedTokenizerBase

    '''
    Union indicates that the padding parameter accepts multiple different types. Union combines several types into one.
    This is a list of allowable types for the padding parameter. It can be a boolean, string, or an instance of PaddingStrategy.
    The default parameter is set to True
    '''
    padding: Union[bool, str, PaddingStrategy] = True

    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None

    def __call__(self, features):
        flag = False
        if 'label' in features[0].keys():
          flag = True
          labels = [feature.pop("label") for feature in features]

        batch_size = len(features)
        num_choices = len(features[0]["input_ids"])

        flattened_features = [[{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features]

        flattened_features = sum(flattened_features, [])

        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )

        # Un-flatten
        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        # Add back labels
        if flag:
          batch["labels"] = torch.tensor(labels, dtype=torch.int64)
        return batch
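As a quick check, we can run the collator on a couple of encoded validation examples and inspect the resulting shapes (a small sketch; encoded_validation comes from the tokenization step above):

data_collator = DataCollatorForMultipleChoice(tokenizer)

# collate two encoded validation examples into one padded batch
sample_batch = data_collator([encoded_validation[i] for i in range(2)])
for k, v in sample_batch.items():
    print(k, tuple(v.shape))
# expected: input_ids / token_type_ids / attention_mask -> (2, 4, max_len); labels -> (2,)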

Sixth Step: Model and Training

Let’s load the model via AutoModelForMultipleChoice [2].

from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer

model = AutoModelForMultipleChoice.from_pretrained(model_args.model_name_or_path)

Then we define a function to compute the accuracy metric:

import numpy as np

def compute_metrics(eval_predictions):
    predictions, label_ids = eval_predictions
    preds = np.argmax(predictions, axis=1)
    return {"accuracy": (preds == label_ids).astype(np.float32).mean().item()}

We then initialize the trainer object:

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_train,
    eval_dataset=encoded_validation,
    tokenizer=tokenizer,
    data_collator=DataCollatorForMultipleChoice(tokenizer),
    compute_metrics=compute_metrics,
)

and call the training:

# Training

train_result = trainer.train()
trainer.save_model() # Saves the tokenizer too for easy upload

metrics = train_result.metrics

max_train_samples = (
    data_args.max_train_samples if data_args.max_train_samples is not None else len(train_dataset)
)
metrics["train_samples"] = min(max_train_samples, len(train_dataset))

trainer.log_metrics("train", metrics)
trainer.save_metrics("train", metrics)
trainer.save_state()

This is the performance we get after training for one epoch:

Image by author

and train_result is as follows:

TrainOutput(global_step=114, training_loss=0.7121551245973822, metrics={'train_runtime': 30.786, 'train_samples_per_second': 14.812, 'train_steps_per_second': 3.703, 'total_flos': 84326336103936.0, 'train_loss': 0.7121551245973822, 'epoch': 1.0, 'train_samples': 456})

Evaluation on validation dataset:

Next let’s evaluate our model on the validation dataset.

# Evaluation
if training_args.do_eval:

    metrics = trainer.evaluate()
    max_eval_samples = data_args.max_eval_samples if data_args.max_eval_samples is not None else len(encoded_validation)
    metrics["eval_samples"] = min(max_eval_samples, len(encoded_validation))

    trainer.log_metrics("eval", metrics)
    trainer.save_metrics("eval", metrics)

And the result is as following:

Evaluation result – Image by author

We see that the evaluation accuracy is 96% and the loss is 0.2646. Next, let’s predict the labels on the test dataset.

Prediction on test dataset:

Remember that we set aside our test data in test_dataset and encoded it using our tokenizer. We will use the encoded_test dataset to predict the labels.

test_results = trainer.predict(encoded_test)

argmax_idxs = np.argmax(test_results[0], axis=1)

argmax_idxs contains the index of the predicted choice for all questions:

Image by author

Let’s pick the right choice based on the predicted index:

# test_df is the test split as a pandas DataFrame, so we can attach prediction columns to it
test_df = test_dataset.to_pandas()
test_df['predicted_index'] = argmax_idxs

def get_answer(choice_list, predicted_index):
  return choice_list[predicted_index]

test_df['predicted_answer'] = test_df.apply(lambda row: get_answer(row['choice_list'], row['predicted_index']), axis=1)

The test_df now looks as follows:

predicted choice for test dataset – Image by author
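Since the test split kept its label column, we can also compare the predicted indices with the true labels to get the test accuracy mentioned at the start of the post (a small sketch; column names follow the processing above):

# fraction of test questions where the predicted choice index matches the true label
test_accuracy = (test_df['predicted_index'] == test_df['label']).mean()
print(f"test accuracy: {test_accuracy:.4f}")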

Conclusion

In this post, we looked at multiple choice question answering in HuggingFace. We processed a dataset and passed it through the tokenizer and data collator, reviewing the data format each module expects. We then used the Trainer API to fine-tune a model and evaluated it on the validation data. Finally, we predicted the correct choice for the test data using the trained model.


If you have any questions or suggestions, feel free to reach out to me: Email: [email protected] LinkedIn: https://www.linkedin.com/in/minaghashami/

References

  1. Codah dataset: https://huggingface.co/datasets/codah
  2. HuggingFace multiple choice: https://huggingface.co/docs/transformers/tasks/multiple_choice
