What’s New in txtai 4.0

Semantic search with SQL, content storage and more

David Mezzetti
NeuML

--

txtai is an open-source platform for semantic search and workflows powered by language models. Previous articles have covered txtai and its features in more detail.

txtai 4.0 brings a number of major feature enhancements, most importantly the ability to store full document content and metadata right in txtai. This article covers all the changes, with examples.

Install and run txtai

The following code snippet shows how to install txtai.

pip install txtai

Content storage

Prior to 4.0, once text was vectorized by txtai, it was no longer possible to trace a result back to the input text. Only document ids and vectors were stored, and results consisted of ids and scores. It was the developer's responsibility to resolve matches against an external data store.

txtai 4.0 brings a major paradigm shift. Content can now be stored alongside embeddings vectors. This opens up a number of exciting possibilities with txtai!

Let’s see this in action with the classic txtai example below.

Basic example with content storage
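For reference, here's a minimal sketch of that example with content enabled. The model path follows the standard txtai examples and is an assumption; any supported model works.

from txtai.embeddings import Embeddings

data = ["US tops 5 million confirmed virus cases",
        "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
        "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
        "The National Park Service warns against sacrificing slower friends in a bear attack",
        "Maine man wins $1M from $25 lottery ticket",
        "Make huge profits without work, earn up to $100,000 a day"]

# Create an embeddings index with content storage enabled
embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2", "content": True})

# Index the data
embeddings.index([(uid, text, None) for uid, text in enumerate(data)])

# Results now include the matching text, not just ids and scores
print(embeddings.search("feel good story", 1))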

The only change above is setting the content flag to True. This enables storing text and metadata content (if provided) alongside the index. Note how the text is pulled right from the query result!

Query with SQL

When content is enabled, the entire dictionary will be stored and can be queried. In addition to similarity queries, txtai accepts SQL queries. This enables combined queries using both a similarity index and content stored in a database backend.

Query with SQL
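Here's a sketch of the kinds of queries behind the results shown below. It continues from the example above; the exact filter values are assumptions for illustration.

# Reindex the data with an additional length field
embeddings.index([(uid, {"text": text, "length": len(text)}, None) for uid, text in enumerate(data)])

# Similarity query expressed as SQL
print(embeddings.search("select text, score from txtai where similar('feel good story') limit 1"))

# Similarity clause combined with a filter on the length metadata field
print(embeddings.search("select text, length, score from txtai where similar('feel good story') and length <= 50 limit 1"))

# Aggregate over stored metadata
print(embeddings.search("select count(*), min(length), max(length), sum(length) from txtai"))

# Aggregates grouped by text
print(embeddings.search("select count(*), min(length), max(length), sum(length), text, score from txtai group by text order by text", 10))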
{
"text": "The National Park Service warns against sacrificing slower friends in a bear attack",
"score": 0.3151373267173767
}
{
"text": "Maine man wins $1M from $25 lottery ticket",
"length": 42,
"score": 0.08329004049301147
}
{
"count(*)": 6,
"min(length)": 39,
"max(length)": 94,
"sum(length)": 387
}
{
"count(*)": 1,
"min(length)": 72,
"max(length)": 72,
"sum(length)": 72,
"text": "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
"score": None
}
{
"count(*)": 1,
"min(length)": 94,
"max(length)": 94,
"sum(length)": 94,
"text": "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
"score": None
}
{
"count(*)": 1,
"min(length)": 42,
"max(length)": 42,
"sum(length)": 42,
"text": "Maine man wins $1M from $25 lottery ticket",
"score": None
}
{
"count(*)": 1,
"min(length)": 57,
"max(length)": 57,
"sum(length)": 57,
"text": "Make huge profits without work, earn up to $100,000 a day",
"score": None
}
{
"count(*)": 1,
"min(length)": 83,
"max(length)": 83,
"sum(length)": 83,
"text": "The National Park Service warns against sacrificing slower friends in a bear attack",
"score": None
}
{
"count(*)": 1,
"min(length)": 39,
"max(length)": 39,
"sum(length)": 39,
"text": "US tops 5 million confirmed virus cases",
"score": None
}

The example above adds a simple additional field, text length. Starting with txtai 4.0, the index method accepts dictionaries in the data field.

Note that the second query filters on the metadata field length along with a similarity clause. This blends similarity search with traditional filtering to help identify the best results.

Object storage

In addition to metadata, binary content can also be associated with documents. The example below downloads an image and upserts it, along with associated text, into the embeddings index.

Object Storage
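Here's a sketch of how that can look, assuming an index created with both content and objects enabled. The image URL and text are placeholders for illustration.

import urllib.request

from txtai.embeddings import Embeddings

# Embeddings index with both content and object storage enabled
embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2",
                         "content": True,
                         "objects": True})

# Download an image (placeholder URL)
request = urllib.request.urlopen("https://raw.githubusercontent.com/neuml/txtai/master/logo.png")

# Index the text data, then upsert a record with both text and a binary object
embeddings.index([(uid, text, None) for uid, text in enumerate(data)])
embeddings.upsert([("txtai", {"text": "txtai logo", "object": request.read()}, None)])

# Retrieve the binary object with a SQL query
result = embeddings.search("select object from txtai where similar('logo') limit 1")[0]["object"]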

Reindex

Now that content is stored, embedding indexes can be rebuilt with different configuration settings.

Reindex
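For example, a minimal sketch, assuming a new model path:

# Rebuild the existing index with different settings, no original data required
embeddings.reindex({"path": "sentence-transformers/paraphrase-MiniLM-L3-v2"})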

Index compression

txtai normally saves index files to a directory. With 4.0, it is now possible to save compressed indexes in tar.gz, tar.bz2, tar.xz and zip formats. txtai loads compressed indexes transparently, treating them like directories.

Compressed indexes can be used as a backup strategy and/or as the primary storage mechanism.

Compression
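A minimal sketch of saving and reloading a compressed index; the file name is arbitrary.

# Save the index as a compressed tar.gz file, format is detected from the extension
embeddings.save("index.tar.gz")

# Load the compressed index
embeddings.load("index.tar.gz")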

Note the compression ratio. Depending on the type of data stored, this could be quite substantial (text will compress much better than objects).

External vector models

txtai supports generating vectors with Hugging Face Transformers, PyTorch, ONNX and Word Vector models.

This release adds support for pre-computed vectors from external models. An external model can be an API, a custom library or any other method of vectorizing data. This adds flexibility, given the high computational cost of building embeddings vectors. Embeddings generation can be outsourced or consolidated to a group of servers with GPUs, leaving index servers to run on lower-resourced machines.

The example below uses the Hugging Face Inference API to build embeddings vectors. We’ll use the same model as in the first example and produce the same results.

External vectors with HF API
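Here's a sketch of what that transform can look like, using the Hugging Face Inference API feature-extraction pipeline. Depending on usage, an API token may be required.

import numpy as np
import requests

from txtai.embeddings import Embeddings

def transform(inputs):
    # Vectorize input text with the Hugging Face Inference API
    response = requests.post("https://api-inference.huggingface.co/pipeline/feature-extraction/sentence-transformers/nli-mpnet-base-v2",
                             json={"inputs": inputs})
    return np.array(response.json(), dtype=np.float32)

# Pre-computed vectors come from the external transform function
embeddings = Embeddings({"method": "external", "transform": transform, "content": True})
embeddings.index([(uid, text, None) for uid, text in enumerate(data)])
print(embeddings.search("feel good story", 1))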

The next example uses spaCy to build vectors and loads them into txtai. Vectors from this model are much faster to generate, at the expense of accuracy.

External vectors with spaCy
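A sketch using spaCy word vectors; this assumes the en_core_web_md model is installed.

import numpy as np
import spacy

from txtai.embeddings import Embeddings

# Load a spaCy model with word vectors
nlp = spacy.load("en_core_web_md")

def transform(inputs):
    # Document vector is the average of the token word vectors
    return np.array([nlp(text).vector for text in inputs], dtype=np.float32)

embeddings = Embeddings({"method": "external", "transform": transform, "content": True})
embeddings.index([(uid, text, None) for uid, text in enumerate(data)])
print(embeddings.search("feel good story", 1))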

Wrapping up

This article gave a quick overview of the new features in txtai 4.0, which is out now!

See the txtai project on GitHub for more information.

--

David Mezzetti
Founder/CEO at NeuML. Building easy-to-use semantic search and workflow applications with txtai.