
A sub-50ms neural search with DistilBERT and Weaviate

How to build a fast and production-ready neural search using Transformers, DistilBERT, and the vector search engine Weaviate

When leveraging the capabilities of modern NLP models such as BERT, getting accurate results is often not the biggest problem. UX-focused libraries have popped up that make working with NLP models a lot more comfortable. Most notably, Hugging Face Transformers has made it considerably easier to experiment with and tweak BERT and similar models.

However, getting an experiment from a Jupyter Notebook into production is a different story. The main challenges are the speed of such heavy neural models and the task of operating them in production, a discipline known as MLOps.

In this tutorial, we will leverage the ability of the vector search engine Weaviate to run any model in production. Weaviate also comes with a wide variety of built-in models which can be configured through Weaviate’s module feature. This also includes support for transformer-based models. However, for this tutorial, we have decided to do all vectorization outside of Weaviate to demonstrate how Weaviate can be used with any model.

High speed and low latencies are going to be key characteristics of the neural search we are about to build in this tutorial. Photo by Mathew Schwartz on Unsplash

How we chose the technologies for this exercise

The main technology choices include:

  • Hugging Face Transformers: Transformers have quickly become the de-facto standard for working with neural-network-based NLP models, such as BERT. With deep integrations with both PyTorch and TensorFlow, the library offers maximum flexibility and ease of use.
  • DistilBERT: One of the key aspects of this article is to achieve high speed and very low latencies in production. DistilBERT offers 97% of the accuracy of regular BERT, while being considerably lighter and about 60% faster. This makes it a great choice. If you want to use a different model, you can very easily switch it out for an alternative.

  • Weaviate Vector Search Engine: Weaviate is a real-time vector search engine that is both very fast at query time and suitable for production use. It’s built in Go with a cloud-native mindset, which makes it both very reliable and easy to run on Kubernetes and similar container orchestrators. There is also a python client available which will make the connection between our transformers code and Weaviate an easy feat. Under the hood, Weaviate uses a variation of the HNSW index which is customized to provide all the features you expect from a proper database.

Should I run this with CPUs or GPUs?

Neural-network-based models tend to be considerably faster on GPUs than on CPUs; however, the barrier to entry is higher when a GPU is required. Luckily, all code below will work without GPUs as well. We will highlight the places where minimal changes need to be made to run on CPUs vs. GPUs. Weaviate itself does not require any GPUs, as it runs plenty fast on CPUs. BERT and BERT-derivative transformers, however, benefit greatly from a CUDA-enabled GPU, which is why we chose to run with GPUs for this example.

Where do I get a GPU from?

If you want to run with GPU support, there are (at least) two convenient options:

  1. You might already work with a computer that has a CUDA-compatible GPU.
  2. You can spin one up in the cloud. For this example, we are using the smallest and therefore cheapest GPU available on Google Cloud: an NVIDIA Tesla T4. At the time of writing, it costs about $180 per month. We won’t even need it for more than an hour, making the cost negligible.

Our Roadmap: What will we build?

In this tutorial, we will cover the following steps:

  1. First, we will spin up the Weaviate Vector Search engine locally using docker-compose.
  2. We will download and preprocess a freely available text-based dataset. You can also replace the data in the set with your own data if you like.
  3. We will then use transformers and DistilBERT to encode our text into vectors. You can easily switch to another transformer-based model of your choosing.
  4. We can now import the text objects and their vectors into Weaviate. Weaviate will automatically build up both a vector index and an inverted index to make all kinds of search queries possible.
  5. Finally, we will use the DistilBERT model one more time to vectorize a search query and then perform a vector search with the resulting vector.

Prerequisites

For this tutorial you are going to need:

  • A bash-compatible shell
  • python >= 3.8 & pip installed
  • Docker and Docker Compose installed

Optional

  • A CUDA-compatible GPU with the respective drivers installed for your OS

Step 1 – Spin up Weaviate

The easiest way to spin up Weaviate locally is to download a sample docker-compose file. If you want, you can tweak the configuration, but for this tutorial, we will leave in all the default settings:

To download the docker-compose file, run something like the following (the exact configurator URL depends on your Weaviate version; the documentation lists the current one):
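$ curl -o docker-compose.yml "https://configuration.semi.technology/docker-compose?weaviate_version=v1.2.0"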

You can then start it in the background using:

$ docker-compose up -d

Weaviate should now be running and exposed on your host port 8080. You can verify with docker ps or simply send a test query to Weaviate which should return something like the following:

$ curl localhost:8080/v1/schema
# {"classes":[]}

It shows an empty schema, as we haven’t imported anything yet.

You can also look at the documentation for alternative ways to run Weaviate, such as Kubernetes, Google Cloud Marketplace, or the Weaviate cloud service.

Step 2 – Download an example dataset

For this tutorial, we have chosen to use the 20 newsgroups dataset, a set that is commonly used for NLP-related tasks.

We will create a folder called data into which we’ll download and extract the dataset. For example, using the commonly available mirror of the dataset:
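$ mkdir data
$ curl -o data/20news.tar.gz http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz
$ tar -xzf data/20news.tar.gz -C data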

Note that we will have to do some preprocessing, such as removing the headers of each file so we are only left with the posts themselves. We will do this as part of the python script we’ll write next.

Step 3 – Install python dependencies

We will need torch and transformers for the BERT model, nltk for the tokenization and weaviate-client to import everything into Weaviate.

$ pip3 install torch transformers nltk weaviate-client

Step 4 – Start building our python script

We will now build our python script which will load and preprocess our data, encode it into vectors, import everything into Weaviate, and finally run a vector search.

The following code snippets have been split into smaller chunks to make them easier to read and explain; you can also download the complete python script here.

Initialize everything

First, let’s initialize all the libraries we are going to use. Please note that you have to remove the line model.to('cuda') if you aren’t running with GPU support.
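A minimal sketch of this initialization could look as follows (the variable names and the distilbert-base-uncased checkpoint are our illustrative choices):

import os
import random
import time

import nltk
import torch
import weaviate
from transformers import DistilBertModel, DistilBertTokenizer

nltk.download('punkt')  # tokenizer data used by nltk below

# load the DistilBERT tokenizer and model from the Hugging Face hub
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertModel.from_pretrained('distilbert-base-uncased')
model.to('cuda')  # remove this line if you aren't running with GPU support
model.eval()  # we only run inference, no training

# connect to the locally running Weaviate instance
client = weaviate.Client("http://localhost:8080")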

Load and preprocess the dataset

Next, we will build two helper functions to read our dataset from disk. The first function will get a shuffled selection of filenames. The second function then reads those filenames and does some simple preprocessing. We will strip the headers from the newsgroup posts by identifying the first occurrence of two line breaks. Additionally, we will replace all line breaks and tabs with regular spaces. Then, we skip every post that has fewer than 10 words to remove very noisy posts. And finally, we will truncate all posts to a maximum of 1,000 characters. Feel free to adjust those parameters as you wish.
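A sketch of these two helpers, under the assumptions just described (the function names are our own):

def get_file_names(data_dir, limit=4000):
    # recursively collect all file paths in the dataset folder
    file_names = []
    for root, _, files in os.walk(data_dir):
        for name in files:
            file_names.append(os.path.join(root, name))
    # shuffle so we get a random selection across all newsgroups
    random.shuffle(file_names)
    return file_names[:limit]

def read_posts(file_names, min_words=10, max_length=1000):
    posts = []
    for file_name in file_names:
        with open(file_name, encoding='utf-8', errors='ignore') as f:
            raw = f.read()
        # strip the headers: everything before the first occurrence of two line breaks
        pos = raw.find('\n\n')
        body = raw[pos:] if pos != -1 else raw
        # replace all line breaks and tabs with regular spaces
        body = body.replace('\n', ' ').replace('\t', ' ').strip()
        # skip very noisy posts with fewer than min_words words
        if len(nltk.word_tokenize(body)) < min_words:
            continue
        # truncate to a maximum of max_length characters
        posts.append(body[:max_length])
    return posts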

Vectorize using the DistilBERT model

Next up, we build two more helper functions that are used to vectorize the posts we previously read from the dataset. Note that we are extracting the text2vec process into a separate function. This means we can reuse it later at query time.
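A sketch of the two vectorization helpers (mean-pooling DistilBERT's last hidden state is one common way to obtain a single vector per text; the original script may pool differently):

def text2vec(text):
    # tokenize and truncate to DistilBERT's maximum sequence length
    tokens_pt = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    tokens_pt.to('cuda')  # remove this line if you are running without GPU support
    with torch.no_grad():
        outputs = model(**tokens_pt)
    # mean-pool the token embeddings into a single vector for the whole text
    return outputs.last_hidden_state.mean(dim=1).squeeze()

def vectorize_posts(posts):
    post_vectors = []
    before = time.time()
    for i, post in enumerate(posts, start=1):
        post_vectors.append(text2vec(post))
        if i % 100 == 0:
            print(f"So far {i} objects vectorized in {time.time() - before}s")
    print(f"Vectorized {len(posts)} items in {time.time() - before}s")
    return post_vectors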

Note that you have to remove the line tokens_pt.to('cuda') if you are running without GPU support.

Init Weaviate Schema

Weaviate has a very simple schema model. For each class that you create, Weaviate will internally create one vector index. Classes can have properties. For our Post class, a single property content of type text will suffice. Note that we are also explicitly telling Weaviate to use the none vectorizer, meaning that Weaviate will not vectorize anything itself – we provide the vectors which we created with DistilBERT above.
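A sketch of the schema initialization (the helper name is our own):

def init_schema():
    schema = {
        "classes": [{
            "class": "Post",
            # "none" vectorizer: we supply the DistilBERT vectors ourselves
            "vectorizer": "none",
            "properties": [{
                "name": "content",
                "dataType": ["text"],
            }],
        }]
    }
    client.schema.create(schema)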

Import our data into Weaviate

We can now import all our data into Weaviate. Note that for each newsgroup post, we will import both the text and the vector. This will make our results very easy to read and even allow for mixing text-based and vector-based searching.

We can use Weaviate’s batch import feature which will make use of parallelization internally to speed up the import process. Choose the batch size according to your resources; our test VM should easily be able to handle 256 objects at a time.
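A sketch of the import, assuming a recent weaviate-client with the batch context manager (older client versions used a slightly different batching API):

def import_posts(posts, vectors, batch_size=256):
    # the client's batching parallelizes requests under the hood
    client.batch.configure(batch_size=batch_size)
    with client.batch as batch:
        for post, vector in zip(posts, vectors):
            batch.add_data_object(
                data_object={"content": post},
                class_name="Post",
                vector=vector.tolist(),  # the DistilBERT vector computed earlier
            )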

Vectorize a search term with DistilBERT and perform a vector search with Weaviate

We can now reuse the text2vec function we defined above to vectorize our search query. Once we get the vector from DistilBERT, we can pass it to Weaviate’s nearVector API using the client’s with_near_vector method.

Let’s also take some measurements of how long it takes, because speed was one of the major motivators for this article.
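A sketch of such a search helper (the _additional {certainty} field asks Weaviate to return the match score alongside the content):

def search(query, limit=1):
    before = time.time()
    vec = text2vec(query)  # reuse the same model for the query
    vec_took = time.time() - before

    before = time.time()
    result = client.query \
        .get("Post", ["content", "_additional {certainty}"]) \
        .with_near_vector({"vector": vec.tolist()}) \
        .with_limit(limit) \
        .do()
    search_took = time.time() - before

    print(f'Query "{query}" with {limit} results took '
          f'{vec_took + search_took:.3f}s '
          f'({vec_took:.3f}s to vectorize and {search_took:.3f}s to search)')
    for post in result["data"]["Get"]["Post"]:
        print(f'{post["_additional"]["certainty"]:.4f}: {post["content"]}')
    print('---')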

Run it all

It’s finally time to run all of our methods. Let’s first init the schema, read and vectorize some posts, and then import them into Weaviate. If you are running without GPU support, it might make sense to reduce the number of posts.
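For example (the dataset path matches the extracted 20news-bydate archive; adjust the limit to your hardware):

init_schema()
posts = read_posts(get_file_names('./data/20news-bydate-train', limit=4000))
vectors = vectorize_posts(posts)
import_posts(posts, vectors)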

You should be seeing something like the following:

So far 100 objects vectorized in 2.073981523513794s
So far 200 objects vectorized in 4.021450519561768s
So far 300 objects vectorized in 6.142252206802368s
...
So far 3800 objects vectorized in 79.79721140861511s
So far 3900 objects vectorized in 81.93943810462952s
Vectorized 3969 items in 83.24402093887329s

Now let’s perform a few searches, using the search helper sketched above:
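search("the best camera lens", limit=1)
search("which software do i need to view jpeg files", limit=1)
search("windows vs mac", limit=1)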

It should print something along the lines of:

Query "the best camera lens" with 1 results took 0.018s (0.008s to vectorize and 0.010s to search)
0.8837:    Nikon L35 Af camera. 35/2.8 lens and camera case. Package $50  Send e-mail
---
Query "which software do i need to view jpeg files" with 1 results took 0.022s (0.007s to vectorize and 0.015s to search)
0.9486:   How do I view .eps files on X? I have an image in color encapsulated postscript, and need to view it on my screen.  Are there any utilities that will let me convert between encapsulated postscript and plain postscript?  Joseph Sirosh
---
Query "windows vs mac" with 1 results took 0.019s (0.007s to vectorize and 0.011s to search)
0.8491:   Appsoft Image is available for NeXTStep. It is a image processing program similar to Adobe Photoshop. It is reviewed in the April '93 issue of Publish! Magazine.   Richardt

Note how all three response times were below 25ms – we managed to beat our latency goal by 50%!

Feel free to play around with different queries and limits. Don’t be surprised if the content of some of the posts seems a bit outdated; the 20-newsgroup dataset is a really old one. There certainly was no iPhone at the time, but there is plenty of Macintosh content.

Is there an easier way?

One of the points of the exercise above was to show that Weaviate is compatible with any ML model, as long as it can produce vectors. As a result, we had to do quite a few steps ourselves. Luckily, Weaviate comes with optional modules that can help vectorize your data. For example, the [text2vec-contextionary](https://www.semi.technology/developers/weaviate/current/modules/text2vec-contextionary.html) module can vectorize all your data at import time using a fastText-based algorithm. If you want to use BERT and friends, check out the soon-to-be-released text2vec-transformers module.

A recap and where to go from here

We have shown that we can combine Weaviate, Hugging Face Transformers, and (Distil)BERT to produce a production-quality neural search which is able to search through thousands of objects in about 25ms. That was a nice example for this demo, but your datasets are likely to be much bigger in real-life applications. Luckily, Weaviate is built to scale really well and will serve searches through millions or even billions of objects in under 50ms. To learn more about how Weaviate tackles large-scale vector searches, read about how HNSW is utilized inside Weaviate.

