
Distributed Hugging Face Tokenizer using PySpark

How to broadcast tokenizer and use it with UDFs

Photo by Eric Prouzet on Unsplash

In this short post, I will show how to use a Hugging Face tokenizer with PySpark. Before going into the details of applying a tokenizer to a DataFrame, let us first look at some simple tokenizer code.
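A minimal sketch of that first step, assuming the Hugging Face transformers package is installed:

```python
# Load a pre-trained tokenizer and tokenize a single word.
# Requires the transformers package (pip install transformers).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encoded = tokenizer("hello")
print(encoded)
# A dictionary with two keys: 'input_ids' and 'attention_mask'
```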

The above code simply loads the pre-trained roberta-base tokenizer. It prints out a dictionary that has two keys. In this post, we will focus only on input_ids, which are the tokens corresponding to the word "hello".

The first step in using the tokenizer on a DataFrame is to wrap it in a UDF. In the code below, we define a method tokenize that takes a string, applies the tokenizer initialized above to it, and returns only the value for the key input_ids. Since that value is a list of integers, we declare the return schema as ArrayType(IntegerType()).

Now let us use this UDF on a test DataFrame. First, we create a DataFrame with two rows containing the field "value". Then we apply tokenize_udf to tokenize the "value" field in each row.

Broadcasting Tokenizer

In the above implementation, the tokenizer is not broadcast explicitly but implicitly: Spark serializes it as part of the UDF's closure and ships a copy with every task. Instead, we can broadcast the tokenizer once and use the broadcast variable for tokenization.

In the code above, we broadcast the tokenizer and use the broadcast variable bc_tokenizer inside the bc_tokenize method, which in turn is wrapped in bc_tokenize_udf to be used on the DataFrame.

In this short post, I explained how a tokenizer can be used in a distributed manner on a DataFrame. Hugging Face tokenizers are generally fast thanks to their Rust implementation, but in pipelines where data is processed with Spark, UDF-based tokenization comes in handy: it avoids collecting all the data onto a single machine, tokenizing it there, and converting it back into a DataFrame. Streaming pipelines in particular benefit from UDF-based tokenizers.
