Working with NLP datasets in Python

Tutorial: Comparing the new HuggingFace Datasets library with the TensorFlow Datasets library and other options

Gergely D. Németh
Towards Data Science


In the field of Deep Learning, datasets are an essential part of every project. To train a neural network that can handle new situations, one has to use a dataset that represents the scenarios the model will face in the real world. An image classification model trained on animal images will not perform well on a car classification task.

Alongside training the best models, researchers use public datasets as a benchmark of their model performance. I personally think that easy-to-use public benchmarks are one of the most useful tools to help facilitate the research process. A great example of this is the Papers With Code state-of-the-art charts.

Another great tool is the ready-to-use dataset libraries. In this post, I will review the new HuggingFace Datasets library using the IMDB sentiment analysis dataset as an example and compare it to the TensorFlow Datasets library with a Keras biLSTM network. The story can also serve as a tutorial for using these libraries.

All the code is available on Google Colab.

IMDb Sentiment Analysis chart on PapersWithCode

Raw dataset publications

When someone publishes a new dataset, the most straightforward thing to do is to share it on the research team’s webpage. For example, the IMDB sentiment analysis dataset was published by a team of Stanford researchers and is available on their own webpage: Large Movie Review Dataset. In the case of a scientific publication, it usually comes with a published article: see Maas et al. [1] for example.

The original publication page of the IMDB Sentiment Dataset

I think the two major problems with this are: 1) it is hard to find, especially if you are an early-career scientist; 2) there is no standardised format for storing the data, so using a new dataset always comes with a dataset-specific preprocessing step.

Online dataset collections

To make a dataset accessible, one should not only make it available but also make sure that users can find it. Google realised the importance of this when they dedicated a search platform to datasets at datasetsearch.research.google.com. However, searching for the IMDB Large Movie Reviews Sentiment Dataset, the results do not include the original webpage of the study. Browsing the Google dataset search results, one will find that Kaggle is one of the largest online public dataset collections.

Kaggle

Kaggle is the world’s largest online machine learning community with various competition tasks, dataset collections and discussion topics. If you have never heard of Kaggle but are interested in deep learning, I strongly recommend taking a look at it. On Kaggle, anyone can upload new datasets (with a limit of 10GB), and the community can rate a dataset based on its documentation, machine-readability and the existence of code examples to work with it.

The IMDB Sentiment dataset on Kaggle has a usability score of 8.2 and 164 public notebook examples to start working with it. Users can read the documentation of the dataset and preview it before downloading.

It is important to note that this dataset does not include the original train/test split of the data. This hurts the reproducibility of models trained on it unless the builders describe their splitting procedure.

The IMDB Dataset on Kaggle

When working with Kaggle datasets, the most important precautions are: 1) make sure you use the exact dataset you intend to, as many users share altered/improved versions of datasets; 2) make sure that you have the license to work with it and that the right person receives credit for it. Many datasets on Kaggle are not shared by their original creator.

Dataset libraries

While the main purpose of dataset collections is to store many datasets in one place, dataset libraries focus on ready-to-use accessibility and performance.

Machine learning libraries often come with a few built-in datasets. Here is a list of the Scikit-learn datasets. I chose the IMDB dataset because it is the only text dataset included in Keras.

The TensorFlow team dedicates a separate package to datasets. It includes several datasets and is compatible with TensorFlow and Keras neural networks.

In the following, I will compare the TensorFlow Datasets library with the new HuggingFace Datasets library focusing on NLP problems.

Common datasets

Currently, TensorFlow Datasets lists 155 entries from various fields of machine learning, while HuggingFace Datasets contains 165 entries focusing on Natural Language Processing. Here is the list of the 39 datasets that share the same name in both libraries:

import tensorflow_datasets as tfds
import datasets

tfds_list = tfds.list_builders()
hfds_list = datasets.list_datasets()
list(set(tfds_list).intersection(set(hfds_list)))

['xnli', 'multi_news', 'multi_nli_mismatch', 'wikihow', 'squad', 'xsum', 'super_glue', 'cos_e', 'newsroom', 'lm1b', 'eraser_multi_rc', 'aeslc', 'civil_comments', 'gap', 'cfq', 'gigaword', 'esnli', 'multi_nli', 'scan', 'librispeech_lm', 'opinosis', 'snli', 'reddit_tifu', 'wikipedia', 'scicite', 'tiny_shakespeare', 'scientific_papers', 'qa4mre', 'c4', 'definite_pronoun_resolution', 'flores', 'math_dataset', 'trivia_qa', 'para_crawl', 'movie_rationales', 'natural_questions', 'billsum', 'cnn_dailymail', 'glue']

Note that the IMDB dataset is not on the list! In TensorFlow Datasets, it lives under the name imdb_reviews, while HuggingFace Datasets refers to it as the imdb dataset. I think this is quite unfortunate, and the library builders should strive to use the same names.

Dataset description

HuggingFace Datasets has a dataset viewer site, where samples of the dataset are presented. This site shows the splits of the data, a link to the original website, the citation and examples. Along with this, they have a separate dataset description site, which shows how to load the dataset and which models are related to it.

TensorFlow Datasets has a single dataset description site, where the previously mentioned metadata is available. Compared to the HuggingFace site, TensorFlow offers multiple download options: plain text and encoded numeric word tokens.

HuggingFace (above) and TensorFlow (below) IMDb dataset description pages

Keras example for Sentiment Analysis

In the Keras version of the IMDB dataset, the plain text is already preprocessed. I use the same processing steps to illustrate the use-cases of the other libraries. The steps are as follows:

  1. Load the dataset
  2. Tokenize the plain text and encode
  3. Truncate long examples
  4. Pad short examples
  5. Shuffle, batch the data
  6. Fit the same model for every dataset
Building the Keras model originally by fchollet

The Keras network will expect integer vectors of length 200 with a vocabulary of [0, 20000). The words of the vocabulary are chosen based on their frequency in the dataset. The network consists of an Input layer to read the input, an Embedding layer to project the word ids from integers into a 128-dimensional vector space, two bidirectional LSTM layers and a Dense layer to match the output dimension.
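
A minimal sketch of such a model, following the well-known keras.io biLSTM example this gist is based on (the 64-unit LSTM width is taken from that example; the variable names are mine):

from tensorflow import keras
from tensorflow.keras import layers

max_features = 20000  # vocabulary size: token ids in [0, 20000)
maxlen = 200          # every input is truncated/padded to 200 tokens

# Variable-length sequences of integer token ids as input
inputs = keras.Input(shape=(None,), dtype="int32")
# Project each token id into a 128-dimensional embedding space
x = layers.Embedding(max_features, 128)(inputs)
# Two bidirectional LSTM layers
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(64))(x)
# Single sigmoid unit for the binary sentiment label
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.summary()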

As the data is already preprocessed, the early steps are only a few lines:

Preprocessing the Keras data originally by fchollet
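
A sketch of those few lines with the built-in Keras loader (the variable names are mine):

# The Keras loader already returns integer-encoded reviews,
# restricted to the max_features most frequent words
(x_train, y_train), (x_val, y_val) = keras.datasets.imdb.load_data(num_words=max_features)
# Truncate long reviews and pad short ones to exactly maxlen tokens
x_train = keras.preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_val = keras.preprocessing.sequence.pad_sequences(x_val, maxlen=maxlen)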

And the training lines:

Training the model with the Keras data originally by fchollet
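
And a sketch of the training lines (the batch size and epoch count are assumptions):

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=32, epochs=2, validation_data=(x_val, y_val))
model.evaluate(x_val, y_val)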

Each epoch trained on the Colab machine takes 450s on CPU and 55s on GPU; the final accuracy on the test data is 0.8447.

TensorFlow Datasets for Sentiment Analysis

TensorFlow offers really good tutorials: they are detailed yet relatively short to read. If you want to read more, start here.

Loading the IMDB dataset from tensorflow_datasets
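
A minimal sketch of that loading call (the split names follow the TFDS catalogue entry for imdb_reviews):

import tensorflow_datasets as tfds

(train_data, test_data), info = tfds.load(
    "imdb_reviews",            # dataset name in the TFDS catalogue
    split=["train", "test"],   # which splits to return; "train[:10%]" also works
    as_supervised=True,        # yield (text, label) pairs that Keras can consume
    with_info=True,            # also return a DatasetInfo object
)
print(info.citation)           # credits the original dataset builders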

The first parameter specifies the dataset by name. Next, the split parameter tells the library which data splits should be included; it can also be a percentage of a split, such as train[:10%]. The as_supervised parameter specifies the format; this setting allows the Keras model to train directly from the TensorFlow dataset. The with_info parameter adds an extra return value containing various information about the data. My favourite is the citation, which credits the original dataset builders.

I think the most powerful feature of the TensorFlow Datasets library is that you don't have to load the full dataset at once, only the batches needed during training. Unfortunately, to build a vocabulary based on word frequency, we have to iterate over the data before training.

Tokenize and truncate for TensorFlow Datasets
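
A rough sketch of this tokenization step, reusing max_features and maxlen from the Keras section (the helper names are mine and chosen to match the description below):

from collections import Counter
import tensorflow as tf

# Count word frequencies over the raw training texts
counter = Counter()
for text, _ in train_data:
    counter.update(text.numpy().decode("utf-8").split())

# Keep the top max_features-2 words; id 0 is the padding token,
# id max_features-1 is the out-of-vocabulary (OOV) token
word_to_id = {
    word: idx + 1
    for idx, (word, _) in enumerate(counter.most_common(max_features - 2))
}
oov_id = max_features - 1

def encode(text, label):
    words = text.numpy().decode("utf-8").split()
    ids = [word_to_id.get(word, oov_id) for word in words]
    # Truncate overly long reviews to maxlen tokens
    return tf.constant(ids[:maxlen], dtype=tf.int64), tf.cast(label, tf.int64)

def encode_map_fn(text, label):
    # Wrap the Python encoder so the tf.data pipeline can call it
    encoded, label = tf.py_function(encode, inp=[text, label], Tout=(tf.int64, tf.int64))
    encoded.set_shape([None])
    label.set_shape([])
    return encoded, label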

The tokenizer building is based on this tutorial. However, I added a Counter to count the frequency of the words in the training dataset. The code first splits the sentences by whitespace, then the Counter counts the word frequencies. To match the Keras model’s vocabulary size, the counter keeps only the top max_features-2 words. The two additional tokens are the padding token (0) and the out-of-vocabulary (OOV) token for words not included in the most-common list (max_features-1). The next lines build an encoder that assigns every word in the vocabulary a unique integer value. The encode_map_fn function wraps the encoder in a TensorFlow function so the Dataset objects can work with it. This code snippet also includes the truncation step: it cuts off the ends of sentences that are too long.

TensorFlow Datasets pipeline
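
A sketch of that pipeline, reusing encode_map_fn and maxlen from above (tf.data.AUTOTUNE requires a reasonably recent TensorFlow; older versions expose it as tf.data.experimental.AUTOTUNE):

batch_size = 32

train_dataset = (
    train_data
    .map(encode_map_fn)           # (sentence, label) -> (integer_list, label), truncated
    .cache()                      # keep processed examples in memory if they fit
    .shuffle(buffer_size=1024)    # shuffle within a sliding window of the data
    .padded_batch(batch_size, padded_shapes=([maxlen], []))  # pad to 200 tokens -> (32, 200)
    .prefetch(tf.data.AUTOTUNE)   # prepare the next batches while the model trains
)

test_dataset = (
    test_data
    .map(encode_map_fn)
    .cache()
    .padded_batch(batch_size, padded_shapes=([maxlen], []))
    .prefetch(tf.data.AUTOTUNE)
)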

The next part of the code builds the pipeline of processing steps. First, it reads the (sentence, label) pairs, encodes them into (integer_list, label) pairs and truncates the long sentences (lists). Then the cache step can speed up the work if the data fits in memory. The shuffle step is only necessary for the training data: the buffer_size=1024 parameter means the program shuffles within a smaller window of the data rather than the whole dataset at once. This is useful if the data is too large to fit in memory; however, truly random shuffling is only achieved if the buffer size is greater than the number of samples. The padded_batch step batches the data into groups of 32 and pads the shorter sentences to 200 tokens. After this step, the input shape is (32, 200) and the output shape is (32, 1). Lastly, the prefetch step adds parallelism: while the model is training on a batch, the pipeline loads the next batches so they are ready when the model finishes the previous one.

Finally, the data is ready to train the model:
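
A minimal sketch of that call, assuming a freshly built copy of the biLSTM model from the Keras section (the epoch count is an assumption):

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
history = model.fit(train_dataset, validation_data=test_dataset, epochs=3)
model.evaluate(test_dataset)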

The first epoch is slower on GPU (82s), but after the processed data is loaded into the cache, the epoch duration is similar to the Keras one (55s). The final accuracy on the test data is 0.8014. If you run the code, you can see that the Keras fit progress bar can't guess the first epoch's duration: the pipeline reads the data without knowing its final length!

Reading the data one batch at a time (on CPU) — without knowing the end of it

Sentiment analysis with HuggingFace Datasets

Firstly, I want to talk about the HuggingFace package names: the company has three pip packages, transformers, tokenizers and datasets. While I understand the PR value of securing these short names, and transformers and tokenizers are perhaps the first of their kind, I do not like them because they can be confusing. For example, if I use both the TensorFlow Datasets library and the HuggingFace datasets library, and one of them is simply named datasets, it is not obvious which one an import refers to. TensorFlow prefixes its package with its own name (tensorflow_datasets); it would be nice if HuggingFace did the same.

Secondly, having worked with both the tokenizers and the datasets libraries, I have to note that while transformers and datasets have nice documentation, the tokenizers library lacks it. I also came across an issue while building this example following the documentation; it was reported to them in June.

HuggingFace Datasets can build a pipeline similar to TensorFlow's to work with large data. I do not use it in this experiment; please read the tutorial if you need it!

HuggingFace Datasets reading IMDB dataset
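
A minimal sketch of the loading step with the datasets library:

from datasets import load_dataset

train_ds = load_dataset("imdb", split="train")
test_ds = load_dataset("imdb", split="test")
# Percentage slices work here too, e.g. load_dataset("imdb", split="train[:10%]")
print(train_ds.citation)  # dataset metadata is exposed as properties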

The data loading works similarly to the previous one. The HuggingFace library can handle percentage splits just as TensorFlow does. The Dataset object exposes information about the data as properties, such as the citation info.

Tokenizers

Here comes a powerful feature of the HuggingFace libraries: as the company focuses on Natural Language Processing, it has more and better-suited features for the field than TensorFlow. If we want to work with a specific transformer model, we can import its tokenizer from the corresponding package, or we can train a new one from scratch. I talked about the differences between tokenizers in a previous post; read it here if you are interested.

In this experiment, I built a WordPiece [2] tokenizer based on the training data. This is the tokenizer used by the famous BERT model [3]. I also show how to use the vocabulary from the previous part as the data of the tokenizer, achieving the same functionality as before.

Building WordPiece[2] using the training data — based on this by HuggingFace
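
A sketch of training such a tokenizer with the tokenizers library, using the BertWordPieceTokenizer wrapper mentioned later in this section (the temporary file name and vocabulary size are assumptions):

from tokenizers import BertWordPieceTokenizer

# The trainer reads from files, so the raw reviews are written out temporarily
with open("imdb_train.txt", "w", encoding="utf-8") as f:
    for example in train_ds:
        f.write(example["text"] + "\n")

wp_tokenizer = BertWordPieceTokenizer()  # lowercases by default
wp_tokenizer.train(files=["imdb_train.txt"], vocab_size=20000)

print(wp_tokenizer.encode("This is a hideout.").tokens)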

This code sample shows how to build a WordPiece tokenizer using the Tokenizer implementation. Unfortunately, the trainer works with files only, so I had to save the plain texts of the IMDB dataset temporarily. The size of the vocabulary is customizable in the train function.

Building frequency list tokenizer

To build the tokenizer with the most frequent words, one has to update the vocabulary; the only trick is to keep track of the special tokens. To see the data inside the tokenizer, a convenient way is to save it to a JSON file: it is readable and contains all the information needed.
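
One possible sketch of this, building a simple whole-word tokenizer from the frequency vocabulary of the TensorFlow section with a WordLevel model from recent versions of the tokenizers library (counter and max_features are reused from above):

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.normalizers import Lowercase
from tokenizers.pre_tokenizers import Whitespace

# Reserve ids for the special tokens, then add the most frequent words
vocab = {"[PAD]": 0, "[UNK]": 1}
for word, _ in counter.most_common(max_features - 2):
    vocab.setdefault(word.lower(), len(vocab))

freq_tokenizer = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
freq_tokenizer.normalizer = Lowercase()
freq_tokenizer.pre_tokenizer = Whitespace()

# Saving to JSON keeps the tokenizer human-readable for inspection
freq_tokenizer.save("freq_tokenizer.json")
print(freq_tokenizer.encode("This is a hideout.").tokens)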

The difference is clear when both tokenizers are called on the sentence “This is a hideout.”. The first version splits hideout into word pieces and recognizes the ‘.’ character, while the second one keeps the whole word as a single token but does not recognize punctuation characters. By default, the Tokenizer lowercases the text; I did not use this step in the previous part.

WordPiece: ['this', 'is', 'a', 'hide', '##out', '.']

From Vocab: ['this', 'is', 'a', 'hideout', '[UNK]']

To make sure that I don't accidentally reuse the tokens cached from the first run, I called datasets.Dataset.cleanup_cache_files() between the two runs.

Formatting to TensorFlow Dataset for Keras training

Following the example in the Datasets tutorial, we can convert the data to the TensorFlow Dataset format and train the Keras model with it.

HuggingFace Dataset to TensorFlow Dataset — based on this Tutorial
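
A sketch of that conversion, tokenizing with the wp_tokenizer trained above and padding to maxlen (the column name input_ids and the manual padding are my choices, not necessarily the tutorial's):

import tensorflow as tf

def tokenize_batch(batch):
    # Encode the raw reviews with the trained tokenizer and truncate to maxlen
    encodings = [wp_tokenizer.encode(text) for text in batch["text"]]
    batch["input_ids"] = [enc.ids[:maxlen] for enc in encodings]
    return batch

train_encoded = train_ds.map(tokenize_batch, batched=True)
test_encoded = test_ds.map(tokenize_batch, batched=True)

def to_tf_dataset(ds):
    # Pad every example to maxlen and wrap the columns in a tf.data.Dataset
    padded = [ids + [0] * (maxlen - len(ids)) for ids in ds["input_ids"]]
    return tf.data.Dataset.from_tensor_slices(
        (tf.constant(padded), tf.constant(ds["label"]))
    )

tf_train = to_tf_dataset(train_encoded)
tf_test = to_tf_dataset(test_encoded)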

This code snippet is similar to the one in the HuggingFace tutorial. The only difference comes from the use of a different tokenizer: the tutorial uses the tokenizer of a BERT model from the transformers library, while I use a BertWordPieceTokenizer from the tokenizers library. Unfortunately, these two logically similar classes from different libraries of the same company are not entirely compatible.

Pipeline and training on HuggingFace data

The final step is almost identical to the one with the TensorFlow data. The only difference is the shuffle buffer. The samples in the IMDB dataset of HuggingFace Datasets are sorted by label. In this case it is not a problem, but it disables the TensorFlow feature that allowed loading only portions of the data at once: if we shuffle only within a small window of this data, in almost all cases the window contains only one of the label values. Hopefully, this is not true for the other datasets.
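
A sketch of that last step, shuffling over the whole training set because of the label ordering (the epoch count is an assumption):

buffer_size = len(train_encoded)  # the HF imdb splits are sorted by label, so shuffle over everything

tf_train_batches = tf_train.shuffle(buffer_size).batch(32).prefetch(tf.data.AUTOTUNE)
tf_test_batches = tf_test.batch(32).prefetch(tf.data.AUTOTUNE)

# Assumes a freshly built and compiled copy of the biLSTM model
history = model.fit(tf_train_batches, validation_data=tf_test_batches, epochs=3)
model.evaluate(tf_test_batches)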

The final accuracies of the two models were similar: 0.8241 and 0.8224.


Summary

In this story, I showed how to use TensorFlow's and HuggingFace's dataset libraries. I talked about why I think building dataset collections is important for the research field. Overall, I think that HuggingFace, by focusing on NLP problems, will be a great facilitator of the field. The library already has more NLP datasets than TensorFlow's. I think it is important for them to work closely with TensorFlow (as well as PyTorch) to ensure that every feature of both libraries can be utilized properly.

References

[1] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).

[2] Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., … & Klingner, J. (2016). Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

[3] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
