Using a Dataloader in Hugging Face

The PyTorch Version

Natan Katz
Towards Data Science


Everyone who has dug into the DL world has probably heard, believed, or been the target of attempts to convince them that this is the era of Transformers. Since their very first appearance, Transformers have been the subject of massive study in several directions:

  • Researchers have searched for architectural improvements.
  • Others have studied the theory that governs this domain.
  • Many have searched for applications that may use this method.

Readers who aim to study Transformers in depth will find plenty of resources that discuss them in detail. In short, Transformers are commonly used to develop language models for NLP problems. These models are used for tasks such as sentence construction, question answering, and translation. At a very high level, a Transformer can be viewed as a sophisticated autoencoder that receives triples of key, value, and query (words) as input and learns a language model in which each word has a representation that depends on its semantic context.

BERT & Hugging Face

BERT (Bidirectional Encoder Representations from Transformers) was introduced here. Following the appearance of Transformers, the idea behind BERT was to take models that had been pre-trained with a Transformer and fine-tune their weights on specific tasks (downstream tasks). This approach opened up new classes of NLP problems that can be solved by starting from a Transformer, such as classification problems (e.g. sentiment analysis). It is achieved by adapting the upper layers of the network to output classes or a different type of sequence. As a result, we have banks of BERT models. One such great “model bank” is Hugging Face. This framework offers a package that provides three essential components:

  • A variety of pre-trained models and tools
  • A tokenizer engine
  • Framework flexibility (e.g. PyTorch, Keras)

A massive number of NLP tasks can be handled with this package.
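As a quick, hedged illustration of these components (the checkpoint name "bert-base-uncased" and the two-label head are merely example choices, not something this post depends on):

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Example checkpoint; any BERT-like model from the Hugging Face hub would do
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# The same encoder with a classification head on top (two labels is an
# assumption, e.g. for sentiment analysis)
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)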

Why did I write this post, then?

When I began working with Hugging Face, I was impressed by the excellent “end to end” pipeline it offers and by the convenience of its data structures. Nevertheless, I felt that one part of the tutorials is not well covered. After I managed to find a solution myself, I felt that, as a “radical open-sourcer”, I had to share it.

In order to illustrate the problem, I will briefly describe the feature-extraction mechanism offered by Hugging Face. Our given data is simple: documents and labels.

The most basic function is the tokenizer:

from transformers import AutoTokenizer

# "bert-base-uncased" is just an example checkpoint
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.batch_encode_plus(documents)
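The call above returns a dictionary-like object whose values are, by default, plain Python lists of token ids, one entry per document. A minimal sketch (the two example sentences are made up):

documents = ["I love this movie.", "I was bored."]
tokens = tokenizer.batch_encode_plus(documents)
print(tokens.keys())            # typically input_ids, token_type_ids, attention_mask
print(tokens["input_ids"][0])   # the token ids of the first document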

This process maps the documents into the standard Transformer representation, so they can be fed directly to Hugging Face’s models. Here is a generic feature-extraction process:

from transformers import InputFeatures

def regular_procedure(tokenizer, documents, labels):
    tokens = tokenizer.batch_encode_plus(documents)

    # One InputFeatures object per document, carrying its label and every
    # field the tokenizer produced (input_ids, attention_mask, ...)
    features = [InputFeatures(label=labels[j],
                              **{key: tokens[key][j] for key in tokens.keys()})
                for j in range(len(documents))]

    return features

The output of this method, features, is a list that one can use for training and evaluation. The obstacle I found was the absence of tutorials on working with a DataLoader.

In all the tutorials, the assumption is that the data is fully available in memory during training and evaluation. This assumption is fine for bootcamp needs, but it is wrong for real-world tasks. We are working with big data:

In big data the code goes to the data and not the data to the code

So I began to experiment. My objective was to create a folder of feature files that I could access with a PyTorch DataLoader. My first attempt was the following:

import numpy as np

def generate_files_no_tensor(tokenizer, documents, labels):
    tokens = tokenizer.batch_encode_plus(documents)

    file_pref = "my_file_"
    for j in range(len(documents)):
        inputs = {k: tokens[k][j] for k in tokens}
        feature = InputFeatures(label=labels[j], **inputs)
        # One .npy file per document; the InputFeatures object is pickled
        file_name = file_pref + "_" + str(j) + ".npy"
        np.save(file_name, np.array(feature))
    return

This code works pretty well, but it is not optimal. Its main disadvantage is that it saves numpy objects, while Hugging Face’s models require tensors. This means that my __getitem__ function has to do additional work beyond simply loading the files:

def __getitemnumpy__(self, idx):
    # Load the pickled InputFeatures object back from its .npy file
    aa = np.load(self.list_of_files[idx], allow_pickle=True)
    cc = aa.item()          # recover the InputFeatures instance
    c1 = cc.input_ids
    c2 = cc.attention_mask
    c3 = cc.label
    # The conversion to tensors has to happen here, on every single access
    return torch.tensor(c1), torch.tensor(c2), c3

We need to convert the numpy objects into tensors during the training process.

I decided to work the other way around: I wrote the __getitem__ that I wanted and forced the data to “obey its rules”.

def __getitem__(self, idx):
    # The file already contains tensors: [input_ids, attention_mask, label]
    aa = torch.load(self.list_of_files[idx])
    return aa[0], aa[1], aa[2]

Now let’s face the challenge of generating files that already contain tensors. Let’s try this:

def generate_files_no_tensor(tokenizer, documents, labels):
    # Ask the tokenizer for PyTorch tensors directly; padding=True is needed
    # so that documents of different lengths fit into one batch tensor
    tokens = tokenizer.batch_encode_plus(documents, return_tensors='pt',
                                         padding=True)

    file_pref = "my_file_"
    for j in range(len(documents)):
        inputs = {k: tokens[k][j] for k in tokens}
        feature = InputFeatures(label=labels[j], **inputs)
        file_name = file_pref + "_" + str(j) + ".pt"
        # Note the argument order: torch.save(object, path)
        torch.save(feature, file_name)
    return

It works! We even have direct access to the tensors. But this loop ran extremely slowly, and when I examined the data on disk I noticed two “phenomena”:

  • All the files have the same size
  • The files are huge!!

It took me a while to realize that every file stores the entire batch tensor: indexing a tensor returns a view, and torch.save serializes the whole underlying storage that the view points to. Thus we have to slice out an independent copy for each document.
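A minimal sketch of this behavior (the shapes and file names are illustrative; exact byte counts depend on the PyTorch version):

import os
import torch

big = torch.zeros(1000, 512)        # stand-in for the full batch tensor
row_view = big[0]                   # a view that shares storage with `big`
row_copy = big[0].clone()           # an independent copy of a single row

torch.save(row_view, "view.pt")
torch.save(row_copy, "copy.pt")
print(os.path.getsize("view.pt"))   # roughly the size of the whole batch
print(os.path.getsize("copy.pt"))   # roughly the size of one row

Cloning breaks the link to the batch-sized storage; index_select, used below, does the same because it allocates a new tensor rather than returning a view.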

def generate_files_with_tensor(tokenizer, documents, labels):
    tokens = tokenizer.batch_encode_plus(documents, return_tensors='pt',
                                         padding=True)

    file_pref = "my_file_"
    for j in range(len(documents)):
        file_name = file_pref + "_" + str(j) + ".pt"
        # index_select allocates a new tensor (not a view of the batch),
        # and squeeze drops the leftover dimension of size 1
        input_t = torch.squeeze(torch.index_select(tokens["input_ids"], dim=0,
                                                    index=torch.tensor([j])))
        input_m = torch.squeeze(torch.index_select(tokens["attention_mask"], dim=0,
                                                    index=torch.tensor([j])))
        torch.save([input_t, input_m, labels[j]], file_name)
    return

The index_select function slices out a new tensor for each document, and squeeze removes the leftover dimension of size 1. This achieved what was required: the files now contain only their own data, and I have a fast __getitem__ that does nothing but load it.
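To put the pieces together, here is a minimal sketch of how a file-backed Dataset and its DataLoader might look; the class name FeatureFilesDataset and the glob pattern are my own illustrative choices, not part of the original code:

import glob
import torch
from torch.utils.data import Dataset, DataLoader

class FeatureFilesDataset(Dataset):
    """A hypothetical Dataset that reads one pre-saved .pt file per sample."""

    def __init__(self, file_pattern="my_file_*.pt"):
        self.list_of_files = sorted(glob.glob(file_pattern))

    def __len__(self):
        return len(self.list_of_files)

    def __getitem__(self, idx):
        # Each file already holds [input_ids, attention_mask, label] as tensors
        aa = torch.load(self.list_of_files[idx])
        return aa[0], aa[1], aa[2]

# Batches are assembled lazily, file by file, so the whole corpus never has
# to sit in memory
loader = DataLoader(FeatureFilesDataset(), batch_size=8, shuffle=True)
for input_ids, attention_mask, labels in loader:
    pass  # feed the batch to the model here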

A code example with the data structure and the DataLoader exists here.

I hope you will find it useful.
