
Data Collators in HuggingFace

What they are and what they do

Image from unsplash.com

When I started learning HuggingFace, data collators were one of the least intuitive components for me. I had a hard time understanding them, and I could not find resources that explained them intuitively enough.

In this post, we take a look at what data collators are, how they differ, and how to write a customized data collator.

Data Collators: High Level

Data collators are an essential part of data processing in HuggingFace. We have all used them after tokenizing the data and before passing it to the Trainer object to train the model.

In a nutshell, they put together a list of samples into a mini training batch. What they do depends on the task they are defined for, but at the very least they pad or truncate input samples to make sure all samples in a mini batch are of the same length. Typical mini-batch sizes range from 16 to 256 samples, depending on the model size, data, and hardware constraints.

Data collators are task-specific. There is a data collator for each of the following tasks:

  • Causal Language Modeling (CLM)
  • Masked language modeling (MLM)
  • Sequence classification
  • Seq2Seq
  • Token classification
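
For reference, these tasks map to data collator classes that you can import directly from the transformers library:

from transformers import (
    DataCollatorForLanguageModeling,     # causal LM (mlm=False) and masked LM (mlm=True)
    DataCollatorWithPadding,             # sequence classification
    DataCollatorForSeq2Seq,              # seq2seq tasks such as translation or summarization
    DataCollatorForTokenClassification,  # token classification, e.g. NER
)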

Some data collators are simple. For example, for the "sequence classification" task, the data collator only needs to pad all sequences in a mini batch to the same length and then concatenate them into one tensor.
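
To make this concrete, here is a minimal, simplified sketch of what such a pad-and-stack collate function does conceptually. It is not the actual implementation; the real DataCollatorWithPadding also handles token_type_ids, labels, and different tensor types.

import torch

def pad_and_stack(features, pad_token_id=0):
    # Pad every sample to the longest sequence in the mini batch,
    # then stack everything into one tensor per field.
    max_len = max(len(f["input_ids"]) for f in features)
    input_ids, attention_mask = [], []
    for f in features:
        pad_len = max_len - len(f["input_ids"])
        input_ids.append(f["input_ids"] + [pad_token_id] * pad_len)
        attention_mask.append(f["attention_mask"] + [0] * pad_len)
    return {
        "input_ids": torch.tensor(input_ids),
        "attention_mask": torch.tensor(attention_mask),
    }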

Some data collators are quite complex, as they need to handle the data processing for that task.

Basic Data Collators

Two of the most basic data collators are the following:

1) DefaultDataCollator: This does not do any padding or truncation. It assumes all input samples are of the same length; if they are not, it will throw an error.

import torch
from transformers import DefaultDataCollator

texts = ["Hello world", "How are you?"]

# Tokenize
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
tokens = [tokenizer(t) for t in texts]

# Default collate function 
collate_fn = DefaultDataCollator()

# Pass it to dataloader
dataloader = torch.utils.data.DataLoader(dataset=tokens, collate_fn=collate_fn, batch_size=2) 

# this will raise an error because the samples have different lengths
for batch in dataloader:
    print(batch)
    break

After tokenizing, the output is:

[{'input_ids': [101, 7592, 2088, 102], 'token_type_ids': [0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1]},
 {'input_ids': [101, 2129, 2024, 2017, 1029, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1]}]

As you can see, the two sequences have different lengths: the first sequence is 4 tokens and the second is 6 tokens. This is what causes the error when you iterate over the data loader.

Now if we change the input to two sequences of the same length, for example:

texts = ["Hello world", "How are"]

Then the code works fine and the output will be:

{'input_ids': tensor([[ 101, 7592, 2088,  102],
                      [ 101, 2129, 2024,  102]]), 
'token_type_ids': tensor([[0, 0, 0, 0],
                          [0, 0, 0, 0]]), 
'attention_mask': tensor([[1, 1, 1, 1],
                          [1, 1, 1, 1]])}

Notice that it does not return labels.

2) DataCollatorWithPadding: This collator pads the input samples so that they are all of the same length. For padding,

  • either it pads to the max_length argument provided
  • or it pads to the largest sequence in the batch.

See the documentation for more details.

In addition, this collator takes the tokenizer as an argument: different tokenizers use different padding tokens, so DataCollatorWithPadding needs the tokenizer to figure out which token to use when padding the sequences.

import torch
from transformers import DataCollatorWithPadding

texts = ["Hello world", "How are you?"]

# Tokenize
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
tokens = [tokenizer(t) for t in texts]

# Padding collate function 
collate_fn = DataCollatorWithPadding(tokenizer, padding=True) # padding=True pads to the longest sequence in the batch; padding='max_length' pads to max_length

dataloader = torch.utils.data.DataLoader(dataset=tokens, collate_fn=collate_fn, batch_size=2) 

for batch in dataloader:
    print(batch)
    break

The output of the above code is:

{'input_ids': tensor([[ 101, 7592, 2088,  102,    0,    0],
                      [ 101, 2129, 2024, 2017, 1029,  102]]), 
'token_type_ids': tensor([[0, 0, 0, 0, 0, 0],
                          [0, 0, 0, 0, 0, 0]]), 
'attention_mask': tensor([[1, 1, 1, 1, 0, 0],
                          [1, 1, 1, 1, 1, 1]])}

Note that the first sequence, which was 4 tokens, is now padded to 6 tokens. The padding token id is 0 in the bert-base-uncased tokenizer. Let’s see that in code:

print(tokenizer.special_tokens_map)

This code outputs

{'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}

and pad_token = [PAD] is what is used for padding. Let’s check its token id.

print(tokenizer.convert_tokens_to_ids('[PAD]'))

and this prints 0.
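
If you instead want every sample padded to a fixed length, you can pass padding='max_length' together with max_length. Here is a small variation of the example above; max_length=8 is just an illustrative value:

# pad every sample to a fixed length of 8 instead of the longest sequence in the batch
collate_fn = DataCollatorWithPadding(tokenizer, padding='max_length', max_length=8)

dataloader = torch.utils.data.DataLoader(dataset=tokens, collate_fn=collate_fn, batch_size=2)

for batch in dataloader:
    print(batch['input_ids'].shape)  # torch.Size([2, 8])
    break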

Note that DataCollatorWithPadding, like DefaultDataCollator, does not create labels. If your data already contains labels, they are passed through; otherwise, the collator will not create them!

In a nutshell, use these two collators if your labels are straightforward and your data does not need any special processing before feeding it to the model for training.

Language Modeling Data Collators

The language modeling data collator, DataCollatorForLanguageModeling, comes in two modes:

  • MLM data collator: this is for masked language modeling, where we randomly mask a fraction of the tokens (15% by default) and the model predicts them.
  • CLM data collator: this is for causal language modeling, where the model cannot attend to tokens to the right of the current token and is expected to predict the next token at each step.

In code, the MLM collator is:

collate_fn = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

and the CLM collator is:

collate_fn = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

Let’s see an example of MLM collator:

from transformers import DataCollatorForLanguageModeling
from transformers import AutoTokenizer
import torch

## input text
texts = [
  "The quick brown fox jumps over the lazy dog.",
  "I am learning about NLP and AI today"  
]

## tokenize
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

data = [tokenizer(t) for t in texts] 

## MLM collator
collate_fn = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

## pass collator to dataloader
dataloader = torch.utils.data.DataLoader(data, collate_fn=collate_fn, batch_size=2)

## let's look at one sample
for batch in dataloader:
    print(batch)
    break

It prints the following:

{'input_ids': tensor([[  101,  1996,  4248,  2829,  4419, 14523,  2058,  1996, 13971,  3899,
          1012,   102],
                      [  101,  1045,  2572,   103,  2055, 17953,  2361,  1998,  9932,  2651,
           102,     0]]), 
'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                          [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 
'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                          [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]]), 
'labels': tensor([[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100],
                  [-100, -100, -100, 4083, -100, -100, -100, -100, -100, -100, -100, -100]])}

A few points to pay attention to:

1) The input_ids for both examples start with 101 and end with 102. In the second example, after 102, there is a 0, which we already know is the padding token. So what are 101 and 102?

print(tokenizer.decode(101), tokenizer.decode(102))
## prints [CLS] and [SEP]

It prints [CLS] and [SEP] tokens, respectively.

2) It returns labels, unlike the basic data collators. We see there are many -100 values in labels. Note that labels has the same length as input_ids for each sample! A label of -100 means the corresponding token was not masked, so it has to be ignored when computing the loss. MLM uses the cross entropy loss function, and if you check the documentation, the default is ignore_index=-100; that means any position whose label is -100 is ignored.
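
To see this behavior in isolation, here is a toy sketch (the vocabulary size and label values are only illustrative) showing that PyTorch’s cross entropy loss skips positions labeled -100:

import torch
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()           # ignore_index defaults to -100
logits = torch.randn(4, 30522)            # 4 token positions, BERT-sized vocabulary
labels = torch.tensor([-100, -100, 4083, -100])

# only the position whose label is 4083 contributes to the loss;
# the three positions labeled -100 are ignored
print(loss_fn(logits, labels))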

3) The third point to pay attention to is that for the first example, the labels are [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100], which means none of its tokens were masked. However, for the second input, the labels are [-100, -100, -100, 4083, -100, -100, -100, -100, -100, -100, -100, -100]; we see that one token was masked and its corresponding label is 4083. The corresponding entry in input_ids is 103, which is the token id for the [MASK] token.
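
We can verify both ids with the same tokenizer:

print(tokenizer.decode(103), tokenizer.decode(4083))
## prints [MASK] learning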

The CLM collator is much simpler, as it is for causal language modeling, i.e., predicting the next token. Let’s take a look at an example:

import torch
from transformers import DataCollatorForLanguageModeling
from transformers import AutoTokenizer

texts = [
  "The quick brown fox jumps over the lazy dog.",
  "I am learning about NLP and AI today"  
]

# Tokenize
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
data = [tokenizer(t) for t in texts]

# CLM collator: mlm=False turns off masking
collate_fn = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Pass it to dataloader
dataloader = torch.utils.data.DataLoader(data, collate_fn=collate_fn, batch_size=2)

for batch in dataloader:
    print(batch)

And the output is the following:

{'input_ids': tensor([[  101,  1996,  4248,  2829,  4419, 14523,  2058,  1996, 13971,  3899,
          1012,   102],
                      [  101,  1045,  2572,  4083,  2055, 17953,  2361,  1998,  9932,  2651,
           102,     0]]), 
'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                          [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 
'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                           [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]]), 
'labels': tensor([[  101,  1996,  4248,  2829,  4419, 14523,  2058,  1996, 13971,  3899,
          1012,   102],
                  [  101,  1045,  2572,  4083,  2055, 17953,  2361,  1998,  9932,  2651,
           102,  -100]])}

You see that for both examples, labels is a copy of input_ids (with the padding position set to -100). This is because in causal language modeling the task is to predict each token given all the previous tokens, so the collator simply copies the inputs as labels.
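
Note that the collator itself does not shift anything; shifting the labels by one position happens inside the model’s forward pass when the loss is computed. A simplified sketch of that standard pattern looks like this:

import torch.nn.functional as F

def causal_lm_loss(logits, labels):
    # predict token t+1 from positions up to t: drop the last logit and
    # the first label so that predictions and targets line up
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )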

Customize A Data Collator

Let’s assume you have a dataset that contains two columns: instruction and response.

Image by author

And you want to do instruction tuning on a pre-trained model. Without getting into too many details of training, note that we need a customized data collator that computes the loss only on the response and not on the instruction.

Let’s say we write a function that combines the two columns into the following format:

Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: {instruction} ### Response: {response} ### End
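
A minimal sketch of such a formatting function could look like the following; the function name, the template string, and the use of datasets.map are illustrative, not a fixed recipe:

PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n{response}\n\n"
    "### End"
)

def format_example(example):
    # combine the instruction and response columns into a single text field
    example["text"] = PROMPT_TEMPLATE.format(
        instruction=example["instruction"], response=example["response"]
    )
    return example

# e.g. dataset = dataset.map(format_example)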

The data now looks like this:

Image by author

We see that the format introduces three special markers:

  • ### Instruction:

  • ### Response:

  • ### End

So let’s get started by writing a customized data collator. We want a collator that masks out the instruction in the labels and keeps only the response as the training target. Why? Because we want the model to learn to generate the response, not the instruction.

from typing import Any, Dict, List, Union
import numpy as np

from transformers import DataCollatorForLanguageModeling

RESPONSE_KEY = "### Response:\n"

class DataCollatorForCompletionLM(DataCollatorForLanguageModeling):    
    def torch_call(self, examples: List[Union[List[int], Any, Dict[str, Any]]]) -> Dict[str, Any]:

        # The torch_call method overrides the same method in the base class and 
        # takes a list of examples as input.  
        batch = super().torch_call(examples)

        labels = batch["labels"].clone()

        # Encode the response key (the end of the prompt followed by a newline)
        # so we can search for it in each sequence of token ids and find
        # where the response starts.
        response_token_ids = self.tokenizer.encode(RESPONSE_KEY)

        for i in range(len(examples)):

            response_token_ids_start_idx = None
            for idx in np.where(batch["labels"][i] == response_token_ids[0])[0]:
                response_token_ids_start_idx = idx
                break

            if response_token_ids_start_idx is None:
                # If the response key is not found in the sequence, raise a RuntimeError. 
                # Otherwise, determine the end index of the response key.
                raise RuntimeError(
                    f'Could not find response key {response_token_ids} in token IDs '
                    f'{batch["labels"][i]}'
                )

            response_token_ids_end_idx = response_token_ids_start_idx + 1

            # To train the model to predict only the response and ignore the prompt tokens, 
            # set the label values before the response key to -100. 
            # This ensures that those tokens are ignored by the PyTorch loss function during training.
            labels[i, :response_token_ids_end_idx] = -100

        batch["labels"] = labels

        return batch

This data collator finds the location of ### Response:\n and changes the labels of every token before that point to -100. This way, the loss function will ignore those tokens.

And when we instantiate the collator, we use it as:

data_collator = DataCollatorForCompletionLM(
        tokenizer=tokenizer, mlm=False, return_tensors="pt"
)

As usual, we pass this data_collator to the Trainer object before training the model.
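
A rough sketch of wiring it into the Trainer; the model, tokenized train dataset, and training arguments below are placeholders:

from transformers import Trainer, TrainingArguments

# model, tokenizer, and train_dataset are assumed to be defined elsewhere
training_args = TrainingArguments(output_dir="out", per_device_train_batch_size=8)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=data_collator,   # our custom collator
)
trainer.train()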

Conclusion

In this post, we looked at data collators in HuggingFace. We learned that data collators are responsible for padding the sequences so that all samples in a batch are of the same length. We also saw four different examples of data collators. One important data collator is DataCollatorForLanguageModeling, which is used for both MLM and CLM training. Finally, we saw an example of how to customize a data collator for instruction tuning.


If you have any questions or suggestions, feel free to reach out to me: Email: [email protected] LinkedIn: https://www.linkedin.com/in/minaghashami/

