
Natural language processing techniques are demonstrating immense capability on question answering (QA) tasks. In this post, we leverage the HuggingFace library to tackle a multiple choice question answering challenge.
Specifically, we fine-tune a pre-trained BERT model on a multi-choice question dataset using the Trainer API. This allows adapting the powerful bidirectional representations from pre-trained BERT to our target task. By adding a classification head, the model learns textual patterns that help determine the correct choice out of a set of answer options per question. We then evaluate performance using accuracy across the held-out test set.
The Transformer framework allows quickly experimenting with different model architectures, tokenizer options, and training approaches. In this analysis, we demonstrate a step by step recipe for achieving competitive performance on multiple choice QA through HuggingFace Transformers.
First step: Install and Import libraries
The first step is to install and import the libraries. To install the libraries, use the pip install command as follows:
!pip install datasets transformers[torch] --quiet
and then import the necessary libraries:
import numpy as np
import pandas as pd
import os
import json
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from transformers.modeling_outputs import SequenceClassifierOutput
from transformers import (
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    set_seed,
    DataCollatorWithPadding,
    DefaultDataCollator,
)
from datasets import load_dataset, load_metric
from dataclasses import dataclass, field
from typing import Optional, Union
Second step: Load the dataset
In the second step, we load the data. We use the CODAH dataset, which is released under the ODC-BY license and is available for commercial use [1].
from datasets import load_dataset
codah = load_dataset("codah", "codah")

def update_columns(example):
    # note: 'question_propmt' is the prompt column name exactly as it appears in the dataset
    example['question'] = example['question_propmt']
    example['answer'] = example['candidate_answers'][example['correct_answer_idx']]
    example['choice_list'] = example['candidate_answers']
    example['label'] = example['correct_answer_idx']
    return example
codah_processed = codah['train'].map(update_columns)
codah_processed = codah_processed.remove_columns(['id','question_category','question_propmt','candidate_answers', 'correct_answer_idx'])
In the above code snippet, we process the dataset so that it has only four columns: question, answer, choice_list, and label.
Let’s look at the first row of the train data:
{'question': 'I am always very hungry before I go to bed. I am',
'answer': 'tempted to snack when I feel this way.',
'choice_list': ['concerned that this is an illness.',
'glad that I do not have a kitchen.',
'fearful that there are monsters under my bed.',
'tempted to snack when I feel this way.'],
'label': 3}
As we see in the example above, the train data has the following columns:
- question: this is the question text.
- answer: this is the choice that correctly answers the question.
- choice_list: this is the list of answer options (the correct answer plus the distractors).
- label: this is the index of the answer in the choice_list.
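As a quick sanity check on the processed data (a minimal sketch), we can confirm that label really does index answer within choice_list:
# Sanity check: the label should point at the answer inside choice_list
example = codah_processed[0]
assert example['choice_list'][example['label']] == example['answer']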
Let’s hold out 10% of the data as the test dataset:
codah_split = codah_processed.train_test_split(test_size=0.1)

We will use the test dataset later to predict the correct choice.
Creating train-validation split:
Let’s split the training data to create a validation dataset. We split the training data into 90% train and 10% validation. To do so, we use the train_test_split function and pass a split ratio.
dataset = codah_split['train'].train_test_split(test_size=0.1)
train_dataset = dataset['train']
validation_dataset = dataset['test']
test_dataset = codah_split['test']
Third Step: Setting Parameters
When fine-tuning models in HuggingFace, there are 3 main classes of arguments we need to set up:
1. Model Arguments: These configure the model architecture itself. For example:
- model_name_or_path – pretrained model name from the Hub
- num_labels – number of output labels
- cache_dir – where to cache pre-trained models
2. Data Arguments: These configure the input data and how it is processed. For example:
- train_file/validation_file/test_file – filenames for splits
- max_seq_length – truncate input sequences
- pad_to_max_length – pad token sequences
3. Training Arguments: These determine training behavior. For example:
- output_dir – where to save model checkpoints
- num_train_epochs – number of training epochs
- per_device_train_batch_size – training batch size
Properly configuring arguments from these 3 classes is crucial for effectively training HuggingFace models.
@dataclass
class ModelArguments:
    """
    Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
    """
    model_name_or_path: str = field(
        metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
    )
    config_name: Optional[str] = field(
        default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
    )
    tokenizer_name: Optional[str] = field(
        default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
    )
    cache_dir: Optional[str] = field(
        default=None,
        metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"},
    )
    use_fast_tokenizer: bool = field(
        default=True,
        metadata={"help": "Whether to use one of the fast tokenizer (backed by the tokenizers library) or not."},
    )
    model_revision: str = field(
        default="main",
        metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."},
    )
    token: str = field(
        default=None,
        metadata={
            "help": (
                "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token "
                "generated when running `huggingface-cli login` (stored in `~/.huggingface`)."
            )
        },
    )
    use_auth_token: bool = field(
        default=None,
        metadata={
            "help": "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead."
        },
    )
    trust_remote_code: bool = field(
        default=False,
        metadata={
            "help": (
                "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option "
                "should only be set to `True` for repositories you trust and in which you have read the code, as it will "
                "execute code present on the Hub on your local machine."
            )
        },
    )
Let’s instantiate from the above class:
model_args = ModelArguments(model_name_or_path='bert-base-uncased', tokenizer_name='bert-base-uncased', cache_dir='./', use_fast_tokenizer=True)
Next, we define the DataTrainingArguments class, which holds the arguments pertaining to the data we feed the model for training and evaluation.
@dataclass
class DataTrainingArguments:
    """
    Arguments pertaining to what data we are going to input our model for training and eval.
    """
    train_file: Optional[str] = field(default=None, metadata={"help": "The input training data file (a text file)."})
    validation_file: Optional[str] = field(
        default=None,
        metadata={"help": "An optional input evaluation data file to evaluate the perplexity on (a text file)."},
    )
    overwrite_cache: bool = field(
        default=False, metadata={"help": "Overwrite the cached training and evaluation sets"}
    )
    preprocessing_num_workers: Optional[int] = field(
        default=None,
        metadata={"help": "The number of processes to use for the preprocessing."},
    )
    max_seq_length: Optional[int] = field(
        default=None,
        metadata={
            "help": (
                "The maximum total input sequence length after tokenization. If passed, sequences longer "
                "than this will be truncated, sequences shorter will be padded."
            )
        },
    )
    pad_to_max_length: bool = field(
        default=False,
        metadata={
            "help": (
                "Whether to pad all samples to the maximum sentence length. "
                "If False, will pad the samples dynamically when batching to the maximum length in the batch. More "
                "efficient on GPU but very bad for TPU."
            )
        },
    )
    max_train_samples: Optional[int] = field(
        default=None,
        metadata={
            "help": (
                "For debugging purposes or quicker training, truncate the number of training examples to this "
                "value if set."
            )
        },
    )
    max_eval_samples: Optional[int] = field(
        default=None,
        metadata={
            "help": (
                "For debugging purposes or quicker training, truncate the number of evaluation examples to this "
                "value if set."
            )
        },
    )
We instantiate this class too:
data_args = DataTrainingArguments(train_file=train_dataset, validation_file=validation_dataset,
                                  max_train_samples=len(train_dataset), max_eval_samples=len(validation_dataset))
Next, we instantiate the TrainingArguments class. We don’t need to define this class ourselves, as it already exists in HuggingFace.
model_name = model_args.model_name_or_path.split("/")[-1]
training_args = TrainingArguments(
    f"{model_name}-finetuned-demo",
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=1,
    weight_decay=0.01,
    do_train=True,
    do_eval=True,
    do_predict=True,
)
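As an aside, the official HuggingFace example scripts usually parse all three argument groups from the command line instead of instantiating them by hand. A minimal sketch, using the dataclasses defined above, would look like this:
from transformers import HfArgumentParser

# Parse ModelArguments, DataTrainingArguments and TrainingArguments from command-line flags
parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))
model_args, data_args, training_args = parser.parse_args_into_dataclasses()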
Fourth Step: Loading tokenizer
We load a pre-trained tokenizer with the AutoTokenizer.from_pretrained function to encode our datasets.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name, use_fast=model_args.use_fast_tokenizer)
print(tokenizer)
BertTokenizerFast(name_or_path='bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True), added_tokens_decoder={
0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
When we print the tokenizer, we see its specifications, such as the vocabulary size, the maximum context length, and the special tokens.
Now, let’s tokenize the data.
Note that every example consists of a question, which is a string, and a choice_list, which is a list of four strings. The tokenizer pairs a first sequence with a second sequence, so we cannot pass a question together with the whole list of choices in one go; we have to pair the question with each choice separately. Let’s look at a few examples of the data:
# Peek at a few examples; convert the HF dataset to pandas for convenient slicing
train_df = train_dataset.to_pandas()

c = 0
for row in train_df[['question', 'choice_list']].values:
    c += 1
    print(row[0])
    for i, choice in enumerate(row[1]):
        print(str(i+1) + ") " + choice)
    print("------------------------------------------")
    if c > 3:
        break
Mr. and Mrs. Mustard have six daughters and each daughter has one brother. But there are only 9 people in the family, how is that possible?
1) Some daughters get married and have their own family.
2) Each daughter shares the same brother.
3) Some brothers were not loved by family and moved away.
4) None of above.
------------------------------------------
The six daughters of Mr. and Mrs. Mustard each have one brother. However, the family only consists of nine people; how is that possible?
1) Some brothers were not loved by family and moved away.
2) Some daughters get married and have their own family.
3) Each daughter shares the same brother.
4) None of above.
------------------------------------------
A chess team has five players, and each player has one coach. But there are only six participants in the team. How is that possible?
1) Each player shares the same coach.
2) Some players are backups and not allowed to play.
3) Some coaches get a raise.
4) None of above.
------------------------------------------
A woman shoots her husband. Then she holds him underwater for over 5 minutes. Finally, she hangs him. But 5 minutes later, they both go out and enjoy a wonderful dinner together. How can this be?
1) The woman gets arrested for murder after dinner.
2) The woman gets a new partner.
3) The woman was a photographer. She shot a picture of her husband, developed it, and hung it up to dry.
4) None of above.
------------------------------------------
To tokenize each example, we have to replicate the question 4 times and pair it with each choice for the tokenizer. To start, let’s tokenize one question and one choice:
# To just tokenize one question and one choice:
print(validation_dataset[0]['question'])
print(validation_dataset[0]['choice_list'])
tokenizer(validation_dataset[0]['question'], validation_dataset[0]['choice_list'][0])
and we will get the following:
Some months' names contain eight letters, while others only contain five. Which month has three?
['August.', 'They all do.', 'June.', 'None of above.']
{'input_ids': [101, 2070, 2706, 1005, 3415, 5383, 2809, 4144, 1010, 2096, 2500, 2069, 5383, 2274, 1012, 2029, 3204, 2038, 2093, 1029, 102, 2257, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
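To see how the question and the choice are joined with BERT’s special tokens, we can decode the input_ids back to text (a quick illustrative check; the lowercasing comes from the uncased tokenizer):
# Decode the encoded question/choice pair to inspect the [CLS] ... [SEP] ... [SEP] structure
enc = tokenizer(validation_dataset[0]['question'], validation_dataset[0]['choice_list'][0])
print(tokenizer.decode(enc['input_ids']))
# [CLS] some months' names contain eight letters, while others only contain five. which month has three? [SEP] august. [SEP]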
Now, let’s tokenize one question and all its four choices. We have to replicate the question 4 times since there are four choices for the question. We pass the question in a list containing the question four times, and we pass the choices in a list of length four as well.
# To tokenize one question and all its choices:
repeated_question = [validation_dataset[0]['question']]*4
tokenizer(repeated_question, validation_dataset[0]['choice_list'])
We get the following:
{'input_ids': [[101, 2070, 2706, 1005, 3415, 5383, 2809, 4144, 1010, 2096, 2500, 2069, 5383, 2274, 1012, 2029, 3204, 2038, 2093, 1029, 102, 2257, 1012, 102], [101, 2070, 2706, 1005, 3415, 5383, 2809, 4144, 1010, 2096, 2500, 2069, 5383, 2274, 1012, 2029, 3204, 2038, 2093, 1029, 102, 2027, 2035, 2079, 1012, 102], [101, 2070, 2706, 1005, 3415, 5383, 2809, 4144, 1010, 2096, 2500, 2069, 5383, 2274, 1012, 2029, 3204, 2038, 2093, 1029, 102, 2238, 1012, 102], [101, 2070, 2706, 1005, 3415, 5383, 2809, 4144, 1010, 2096, 2500, 2069, 5383, 2274, 1012, 2029, 3204, 2038, 2093, 1029, 102, 3904, 1997, 2682, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
Now, we are going to replicate this for all examples in the dataset. We create two lists:
- first_sentences: contains the questions, each repeated four times.
- second_sentences: contains the choices of all questions.
# repeat each question four times, add it to first_sentences list
examples = validation_dataset
first_sentences = []
for example in examples:
    for i in range(4):
        first_sentences.append(example['question'])

# add choices to second_sentences list
second_sentences = []
for example in validation_dataset:
    for choice in example['choice_list']:
        second_sentences.append(choice)
As a sanity check, we make sure first_sentences and second_sentences have the same length:
print(len(first_sentences), len(second_sentences))
# prints 204, 204 -> this makes sense since validation data was 51 rows.
Now we pass them to the tokenizer:
# Tokenize
max_seq_length = tokenizer.model_max_length
tokenized_examples = tokenizer(first_sentences, second_sentences, truncation=True, max_length=max_seq_length)
The resulting tokenized_examples has the following keys:
dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
The tokenized_examples['input_ids'] is a flat list of lists, i.e. [[],[],[],[],…,[]]. This is because we passed the data as first_sentences, second_sentences: each pair first_sentences[i], second_sentences[i] is tokenized into one list of input_ids, for example [101, 2070, 2706, 1005, 3415, 5383, 2809, 4144, 1010, ...]. In the process we have lost track of which lists of input_ids belong to the same question. To bring back the structure, we need to un-flatten the flat list into a list of lists of lists, so that the four choices of each question are grouped together. In other words, we change input_ids = [ [],[],[],[],[],[],…,[] ] into input_ids = [ [ [],[],[],[] ], [ [],[],[],[] ], …].
To unflatten, we do the following:
dic = {'input_ids': [], 'token_type_ids': [], 'attention_mask': []}
for k, v in tokenized_examples.items():
    for i in range(0, len(v), 4):
        dic[k].append(v[i:i+4])
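As a quick check (the counts assume the 51-row validation split used above), every entry of dic['input_ids'] should now hold the four tokenized sequences of one question:
# Each question now maps to a group of 4 tokenized choice sequences
print(len(dic['input_ids']), len(dic['input_ids'][0]))  # expect: 51 4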
Putting it all together for Tokenizer
We put all the code we discussed above into a function called preprocess_function and then call it on the train, validation, and test data.
max_seq_length = tokenizer.model_max_length
def preprocess_function(examples):
    # replicating first sentences 4 times
    first_sentences = []
    for q in examples['question']:
        for i in range(4):
            first_sentences.append(q)
    # putting all choices in a list
    second_sentences = []
    for choice_list in examples['choice_list']:
        for choice in choice_list:
            second_sentences.append(choice)
    # Tokenize
    tokenized_examples = tokenizer(first_sentences, second_sentences, truncation=True, max_length=max_seq_length)
    # Un-flatten
    dic = {'input_ids': [], 'token_type_ids': [], 'attention_mask': []}
    for k, v in tokenized_examples.items():
        for i in range(0, len(v), 4):
            dic[k].append(v[i:i+4])
    return dic
encoded_train = train_dataset.map(preprocess_function, batched=True, remove_columns=['question', 'answer', 'choice_list'])
encoded_validation = validation_dataset.map(preprocess_function, batched=True, remove_columns=['question', 'answer', 'choice_list'])
encoded_test = test_dataset.map(preprocess_function, batched=True, remove_columns=['question', 'choice_list'])
After encoding, the train dataset contains the label column plus the nested input_ids, token_type_ids, and attention_mask columns.
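To verify, we can print the encoded train dataset object (a minimal check; the exact row count depends on your random split):
print(encoded_train)
# Dataset({
#     features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
#     num_rows: ...
# })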
Let’s take a look at the input_ids of one example in the validation data:
for item in encoded_validation[0]['input_ids']:
    print(len(item))
24
26
24
26
As we can see, the lengths of the input_ids are not the same: some are 24 and some are 26. We need a data collator to pad them to a common length.
Fifth Step: Padding data via Data collator
Take one input item of the encoded_train dataset as an example; it looks as follows:
{ label:1, input_ids:[ [],[],[],[] ], attention_mask:[ [],[],[],[] ] }
Its values are lists of lists. But padding works on a list of flat items: just as the tokenizer expects each item to be a string, the padding step expects each item to be a single (un-nested) sequence. So we have to flatten the input into the following format:
{input_ids: [], attention_mask:[]},
{input_ids: [], attention_mask:[]},
{input_ids: [], attention_mask:[]},
{input_ids: [], attention_mask:[]}
So we will flatten, then pad and then unflatten to bring the multi-choice structure back to the data. We will demonstrate it on validation dataset:
## flatten: convert {input_ids: [[],[],[],[]], attention_mask:[[],[],[],[]]} to
## {input_ids: [], attention_mask:[]}, {input_ids: [], attention_mask:[]}, {input_ids: [], attention_mask:[]}, {input_ids: [], attention_mask:[]}
features = encoded_validation
accepted_keys = ["input_ids", "attention_mask", "label", "token_type_ids"]
features = [{k: v for k, v in encoded_validation[i].items() if k in accepted_keys} for i in range(len(features))]
labels = [feature.pop('label') for feature in features]
## to flatten:
flattened_features = []
for feature in features:
    for i in range(4):
        dic = {}
        for k, v in feature.items():
            dic[k] = v[i]
        flattened_features.append(dic)
Now flattened_features is a list of dictionaries, where each dictionary is {input_ids: [], attention_mask: []}. We then pad the sequences.
'''
max_length=None means that the sequences will not be truncated at all.
The sequences will only be padded to match the longest sequence in the batch, not truncated to a fixed length.
'''
batch = tokenizer.pad(
    encoded_inputs=flattened_features,
    padding=True,
    max_length=None,
    pad_to_multiple_of=None,
    return_tensors="pt",
)
The flattened batch contains 2-D tensors of shape (batch_size × num_choices, padded_length); for our 51-row validation split that is 204 rows.
Now that we have formed the padded batch, we unflatten it to the original shape:
# Un-flatten
batch_size = len(encoded_validation)
num_choices = 4
batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
This results in tensors of shape (batch_size, num_choices, padded_length), i.e. (51, 4, padded_length) for the validation split.
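We can confirm the shapes with a quick print (the last dimension equals the longest sequence in this batch, so the exact number may differ on your run):
# Inspect the un-flattened batch: (batch_size, num_choices, padded_length)
print({k: tuple(v.shape) for k, v in batch.items()})
# e.g. {'input_ids': (51, 4, <max_len>), 'token_type_ids': (51, 4, <max_len>), 'attention_mask': (51, 4, <max_len>)}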
Putting it together for Data Collator
from dataclasses import dataclass
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from typing import Optional, Union
import torch
@dataclass
class DataCollatorForMultipleChoice:
    """
    Data collator that will dynamically pad the inputs for multiple choice received.
    """
    tokenizer: PreTrainedTokenizerBase
    '''
    Union indicates that the padding parameter accepts multiple different types. Union combines several types into one.
    This is a list of allowable types for the padding parameter. It can be a boolean, string, or an instance of PaddingStrategy.
    The default parameter is set to True.
    '''
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None

    def __call__(self, features):
        flag = False
        if 'label' in features[0].keys():
            flag = True
            labels = [feature.pop("label") for feature in features]
        batch_size = len(features)
        num_choices = len(features[0]["input_ids"])
        flattened_features = [[{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features]
        flattened_features = sum(flattened_features, [])
        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )
        # Un-flatten
        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        # Add back labels
        if flag:
            batch["labels"] = torch.tensor(labels, dtype=torch.int64)
        return batch
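Before training, we can sanity-check the collator on a couple of encoded validation examples (illustrative; the padded length depends on the longest sequence in the mini-batch):
# Collate two encoded validation examples and inspect the resulting tensor shapes
data_collator = DataCollatorForMultipleChoice(tokenizer)
accepted_keys = ["input_ids", "attention_mask", "token_type_ids", "label"]
sample_features = [{k: v for k, v in encoded_validation[i].items() if k in accepted_keys} for i in range(2)]
sample_batch = data_collator(sample_features)
print({k: tuple(v.shape) for k, v in sample_batch.items()})
# e.g. {'input_ids': (2, 4, <max_len>), ..., 'labels': (2,)}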
Sixth step: Model and Training
Let’s load the model via AutoModelForMultipleChoice [2].
from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer
model = AutoModelForMultipleChoice.from_pretrained(model_args.model_name_or_path)
Then we define a function to compute the accuracy metric:
import numpy as np
def compute_metrics(eval_predictions):
    predictions, label_ids = eval_predictions
    preds = np.argmax(predictions, axis=1)
    return {"accuracy": (preds == label_ids).astype(np.float32).mean().item()}
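To make sure the metric behaves as expected, we can call it on a couple of hypothetical logit rows (purely illustrative values):
# Dummy check: row 0 predicts choice 2 (correct), row 1 predicts choice 0 (wrong) -> accuracy 0.5
dummy_logits = np.array([[0.1, 0.2, 0.6, 0.1],
                         [0.9, 0.05, 0.03, 0.02]])
dummy_labels = np.array([2, 1])
print(compute_metrics((dummy_logits, dummy_labels)))  # {'accuracy': 0.5}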
We then initialize the trainer object:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_train,
    eval_dataset=encoded_validation,
    tokenizer=tokenizer,
    data_collator=DataCollatorForMultipleChoice(tokenizer),
    compute_metrics=compute_metrics,
)
and call the training:
# Training
train_result = trainer.train()
trainer.save_model() # Saves the tokenizer too for easy upload
metrics = train_result.metrics
max_train_samples = (
    data_args.max_train_samples if data_args.max_train_samples is not None else len(train_dataset)
)
metrics["train_samples"] = min(max_train_samples, len(train_dataset))
trainer.log_metrics("train", metrics)
trainer.save_metrics("train", metrics)
trainer.save_state()
After training for one epoch, the trainer logs the training metrics, and train_result is as follows:
TrainOutput(global_step=114, training_loss=0.7121551245973822, metrics={'train_runtime': 30.786, 'train_samples_per_second': 14.812, 'train_steps_per_second': 3.703, 'total_flos': 84326336103936.0, 'train_loss': 0.7121551245973822, 'epoch': 1.0, 'train_samples': 456})
Evaluation on validation dataset:
Next let’s evaluate our model on the validation dataset.
# Evaluation
if training_args.do_eval:
    metrics = trainer.evaluate()
    max_eval_samples = data_args.max_eval_samples if data_args.max_eval_samples is not None else len(encoded_validation)
    metrics["eval_samples"] = min(max_eval_samples, len(encoded_validation))
    trainer.log_metrics("eval", metrics)
    trainer.save_metrics("eval", metrics)
The evaluation accuracy is 96% and the evaluation loss is 0.2646. Next, let’s predict the labels on the test dataset.
Prediction on test dataset:
Recall that we set aside a test split, test_dataset, and encoded it with our tokenizer into encoded_test. We will use encoded_test to predict the labels, and convert test_dataset to a pandas dataframe, test_df, to attach the predictions.
test_results = trainer.predict(encoded_test)
argmax_idxs = np.argmax(test_results[0], axis=1)
argmax_idxs contains the index of the predicted choice for every question.
Let’s pick the right choice based on the predicted index:
test_df = test_dataset.to_pandas()  # dataframe view of the test split to hold the predictions
test_df['predicted_index'] = argmax_idxs

def get_answer(choice_list, predicted_index):
    return choice_list[predicted_index]

test_df['predicted_answer'] = test_df.apply(lambda row: get_answer(row['choice_list'], row['predicted_index']), axis=1)
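Since test_df still carries the ground-truth label column, we can also compute the test accuracy (a small illustrative check):
# Compare predicted choice indices against the ground-truth labels
test_accuracy = (test_df['predicted_index'] == test_df['label']).mean()
print(f"Test accuracy: {test_accuracy:.3f}")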
The test_df dataframe now contains the predicted_index and predicted_answer columns alongside the original question, choice_list, answer, and label columns.
Conclusion
In this post, we looked at multiple-choice question answering with HuggingFace. We prepared a dataset and passed it through the tokenizer and data collator, reviewing the expected data format for each module. We then used the Trainer API to fine-tune a model and evaluated it on the validation data. Finally, we predicted the correct choice for the test data using the trained model.
If you have any questions or suggestions, feel free to reach out to me: Email: [email protected] LinkedIn: https://www.linkedin.com/in/minaghashami/
References
1. CODAH dataset: https://huggingface.co/datasets/codah
2. HuggingFace multiple choice: https://huggingface.co/docs/transformers/tasks/multiple_choice