NLP Transformers pipelines with ONNX

How to build real-world NLP applications with ONNX, not just for benchmarking tensors.

Thomas Chaigneau
Towards Data Science


ONNX is a machine learning format for neural networks. It is portable, open-source, and really awesome for boosting inference speed without sacrificing accuracy.

I found a lot of articles about ONNX benchmarks, but none of them presented a convenient way to use it for real-world NLP tasks. I have also answered a lot of questions about ONNX and the best way to use it for NLP on Hugging Face’s Discord server.

This is why I decided to write this blog post: I want to help you get the best possible results using ONNX with awesome Transformers pipelines.

This tutorial will show you how to export Hugging Face’s NLP Transformers models to ONNX and how to use the exported model with the appropriate Transformers pipeline. I use a Named Entity Recognition (NER) model for the example, but the approach is not limited to NER. (More about NER in this great article)

All code snippets are available in a dedicated notebook in the associated GitHub repository. So don’t worry about copying anything: just clone the repository and run the notebook while reading this blog post.

📦 Working environment

First of all, you need to install all required dependencies. It is recommended to use an isolated environment to avoid conflicts.

The project requires Python 3.8 or higher. You can use any package manager you want. I recommend using conda for the tutorial. All required dependencies are listed in the requirements.txt file. To install them, run the following commands:

$ conda create -y -n hf-onnx python=3.8
$ conda activate hf-onnx
$ git clone https://github.com/ChainYo/transformers-pipeline-onnx.git
$ cd transformers-pipeline-onnx
$ pip install -r requirements.txt

🍿 Export the model to ONNX

For this example, we can use any TokenClassification model from Hugging Face’s library because the task we are trying to solve is NER.

I have chosen the dslim/bert-base-NER model because it is a base-sized model, which means moderate computation time on CPU. Plus, the BERT architecture is a good choice for NER.

Hugging Face’s Transformers library provides a convenient way to export the model to ONNX format. You can refer to the official documentation for more details.

We use the dslim/bert-base-NER model mentioned above and token-classification as the feature. Token classification is the task we are trying to solve. You can see the list of available features for a model type by executing the following code:

Check all supported features for a specific model type
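
Here is a minimal sketch of what that check can look like, using the FeaturesManager helper from transformers.onnx (the exact snippet in the notebook may differ slightly):

from transformers.onnx.features import FeaturesManager

# "bert" is the model type behind dslim/bert-base-NER
supported_features = FeaturesManager.get_supported_features_for_model_type("bert")
print(list(supported_features.keys()))
# The list should include "token-classification" among the other features.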

When invoking the conversion script, you have to specify the model name, either from a local directory or directly from the Hugging Face Hub. You also need to specify the feature as seen above. The output file will be saved in the specified output directory.

We give onnx/ as the output directory; this is where the ONNX model will be saved.

We leave the opset parameter at its default value, which is defined in the model’s ONNX config.

Finally, we also leave the atol parameter at its default value of 1e-05. This is the numerical precision tolerance allowed between the original PyTorch model and the ONNX model.

So here is the command to export the model to ONNX format:

$ python -m transformers.onnx \
--model=dslim/bert-base-NER \
--feature=token-classification \
onnx/

💫 Use the ONNX model with Transformers pipeline

Now that we have exported the model to ONNX format, we can use it with the Transformers pipeline.

The process is simple:

  • Create a session with the ONNX model, which lets you load the model into the pipeline and run inference.
  • Override the _forward and preprocess methods of the pipeline to use the ONNX model.
  • Run the pipeline.

Let’s first import the required packages:

All needed imports
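
As a sketch, the imports used throughout the rest of this post look roughly like this (the exact list in the notebook may differ):

import torch

from onnxruntime import InferenceSession, get_all_providers
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    TokenClassificationPipeline,
)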

⚙️ Create a session with the ONNX model

Create a session with onnxruntime
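
A minimal sketch, assuming the export step above produced onnx/model.onnx (the default file name used by transformers.onnx):

from onnxruntime import InferenceSession

session = InferenceSession(
    "onnx/model.onnx",                   # path produced by the export step
    providers=["CPUExecutionProvider"],  # swap in CUDAExecutionProvider to run on GPU
)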

Here we will use only the CPUExecutionProvider, which is the default execution provider for the ONNX model. You can pass one or more execution providers to the session; for example, you can use the CUDAExecutionProvider to run the model on GPU. The session will pick the first provider in the list that is available on the machine.

ONNX Runtime provides a function to list all the available execution providers:

All possible execution providers from onnxruntime
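
A one-liner sketch of that call:

from onnxruntime import get_all_providers

print(get_all_providers())
# Prints every provider known to your onnxruntime build, e.g.
# CPUExecutionProvider, CUDAExecutionProvider, TensorrtExecutionProvider, ...
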
List of all execution providers

As you can see, there are a lot of providers available, covering every use case and configuration.

⚒️ Create a pipeline with the ONNX model

Now that we have a session with the ONNX model ready to use, we can override the original TokenClassificationPipeline class to make it use the ONNX model.

To fully understand what is happening, you can refer to the source code of the TokenClassificationPipeline python class.

We will only override the _forward and the preprocess methods, because the other methods are not dependent on the model format.

Adapted pipeline class to fit the ONNX model’s needs
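
Here is a sketch of what such a subclass can look like, reconstructed from the description above. It reuses the session object created earlier and may differ in details from the original notebook (and from newer transformers versions, where the pipeline internals have changed):

class OnnxTokenClassificationPipeline(TokenClassificationPipeline):
    """TokenClassificationPipeline that delegates the forward pass to ONNX Runtime."""

    def preprocess(self, sentence, offset_mapping=None):
        # Same tokenization as the original pipeline; tensors stay in PyTorch
        # format here and are converted to NumPy just before calling ONNX Runtime.
        truncation = bool(self.tokenizer.model_max_length and self.tokenizer.model_max_length > 0)
        model_inputs = self.tokenizer(
            sentence,
            return_tensors="pt",
            truncation=truncation,
            return_special_tokens_mask=True,
            return_offsets_mapping=self.tokenizer.is_fast,
        )
        if offset_mapping:
            model_inputs["offset_mapping"] = offset_mapping
        model_inputs["sentence"] = sentence
        return model_inputs

    def _forward(self, model_inputs):
        # Keep the items that postprocess() will need later.
        special_tokens_mask = model_inputs.pop("special_tokens_mask")
        offset_mapping = model_inputs.pop("offset_mapping", None)
        sentence = model_inputs.pop("sentence")

        # ONNX Runtime expects plain NumPy arrays, not torch tensors.
        inputs = {name: tensor.cpu().numpy() for name, tensor in model_inputs.items()}
        output_name = session.get_outputs()[0].name
        logits = session.run(output_names=[output_name], input_feed=inputs)[0]

        return {
            "logits": torch.tensor(logits),  # back to torch for postprocess()
            "special_tokens_mask": special_tokens_mask,
            "offset_mapping": offset_mapping,
            "sentence": sentence,
            **model_inputs,
        }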

🏃 Run the pipeline

We now have everything set up, so we can run the pipeline.

As usual, the pipeline needs a tokenizer, a model and a task. We will use the ner task.

Create the full pipeline with the new overridden class
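
A sketch of that instantiation: the tokenizer and the PyTorch model come straight from the Hub (the model is only needed for its configuration, as discussed in the conclusion), and the aggregation strategy shown here is an assumption:

model_name = "dslim/bert-base-NER"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

onnx_pipeline = OnnxTokenClassificationPipeline(
    task="ner",
    model=model,
    tokenizer=tokenizer,
    framework="pt",
    aggregation_strategy="simple",  # assumption: any supported strategy works
)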

Let’s see if we can run the pipeline and check the outputs:

Run the ONNX pipeline
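
For example, with a made-up sentence (the one in the notebook may differ):

# Hypothetical example text, just to exercise the NER pipeline
sequence = "Apple was founded in 1976 by Steve Jobs, Steve Wozniak and Ronald Wayne in Cupertino, California."
print(onnx_pipeline(sequence))
# Expected: a list of entity dictionaries with keys such as
# "entity_group", "score", "word", "start" and "end".
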
ONNX Pipeline outputs

There it is: the pipeline runs well with the ONNX model! We now have a fully working NER pipeline with ONNX. 🎉

Have a look at the optional benchmarking section to see how it performs compared to the original PyTorch model, or jump directly to the conclusion for a quick summary of the process.

🧪 Benchmarking a full pipeline (Optional)

We will benchmark by measuring the inference time of the pipeline with the ONNX model and the PyTorch model.

We first need to load the PyTorch model and create a pipeline with it.

Create PyTorch pipeline
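
A sketch, reusing the model and tokenizer loaded above with the stock pipeline class:

pt_pipeline = TokenClassificationPipeline(
    task="ner",
    model=model,
    tokenizer=tokenizer,
    framework="pt",
    aggregation_strategy="simple",
)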

We will test both pipelines with the same data and 3 different sequence lengths.

Benchmark sequences, with three different lengths
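
For illustration, something like the following works. These are just placeholder sentences of increasing length; the actual sequences used in the notebook are different:

# Hypothetical benchmark inputs: one short, one medium, one long sequence
base = "Hugging Face is a company based in New York City, founded by French entrepreneurs. "

sequences = {
    "short": base,        # a single sentence
    "medium": base * 4,   # a short paragraph
    "long": base * 16,    # a long paragraph
}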

Let’s compare the inference time of each pipeline for the 3 different sequence lengths. We will repeat each measurement 300 times per sequence length to get a more accurate benchmark, and put everything in a table to compare the results.

Benchmark loop
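
A sketch of such a loop, assuming pandas is available to hold the results table:

from time import perf_counter

import pandas as pd

results = []
for length, sequence in sequences.items():
    for name, pipe in [("pytorch", pt_pipeline), ("onnx", onnx_pipeline)]:
        timings = []
        for _ in range(300):  # 300 repetitions per sequence length
            start = perf_counter()
            pipe(sequence)
            timings.append(perf_counter() - start)
        results.append({
            "pipeline": name,
            "sequence_length": length,
            "mean_inference_time_s": sum(timings) / len(timings),
        })

benchmark_df = pd.DataFrame(results)
print(benchmark_df)
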
Benchmark results

Wow that looks great! 🎉

It seems that for each sequence length, the ONNX model is much faster than the original PyTorch model. Let’s calculate the ratio between the inference time of the ONNX model and the PyTorch model.

Benchmark ratios
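
One way to compute the ratios, assuming the benchmark_df table sketched above:

# PyTorch time divided by ONNX time, per sequence length
pivot = benchmark_df.pivot(
    index="sequence_length", columns="pipeline", values="mean_inference_time_s"
)
pivot["speedup"] = pivot["pytorch"] / pivot["onnx"]
print(pivot)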

We nearly achieved a 3x speedup on long sequences! 🎉

We didn’t even apply any of the optimizations that ONNX allows based on the model architecture and the hardware the model runs on.

Optimization can be very useful, but it is a deep topic that can’t be covered in this post. It’s good to know that it’s possible, though, and we could explore it in a future post.

Plus, our tests were done on CPU, but all the GPU benchmarks I have seen are even more impressive than the CPU ones. Check out this great article for more benchmarks on different architectures and inference configurations.

📍 Conclusion

To summarize, we have built a fully working NER pipeline with ONNX: we converted a PyTorch model to ONNX and overrode the original pipeline class to fit the ONNX model’s requirements. Finally, we benchmarked the ONNX model against the original PyTorch model and compared the results.

It’s important to note that, unfortunately, the PyTorch model has to be loaded alongside the ONNX model. This is because the Transformers pipeline requires a PyTorch model, mainly for the model’s configuration.

I am looking for a way to avoid this loading of the PyTorch model, because it could create RAM issues on some systems.

The process we used here is the same for every model and task available in the Hugging Face Transformers library.

The only thing you need to check is whether the model architecture has an ONNX configuration implemented. You can find the full list of supported architectures in the documentation.

If the architecture you are looking for has not been implemented, you can still create it and make a Pull Request to the Transformers library to add it. This is exactly what I did for the CamemBERT architecture months ago. You can check out the full PR on the Transformers GitHub repository.

I hope you found this article useful and interesting. Let me know if you have any questions or face any issues. I would love to add more examples and support for other NLP tasks, so please tell me if you have any ideas or requests! 🤗

For questions or issues, please open an issue on GitHub or in the comments below.

P.S. I’m also planning to add another benchmark section to test if the ONNX model is achieving the same results as the original model (spoiler: yes!).
