The Hidden World of (Vector) Indexes

Everything you always wanted to know about (vector) indexes but were afraid to ask.

Olivier Ruas
Towards Data Science

--

Since the public release of ChatGPT, hardly one day has passed without new content discussing LLMs, RAGs, and vector databases. The technology world buzzes with the possibilities of LLMs, seen as the latest technology that will change our lives: for the best for some, for the worst for others. Alongside them, Retrieval Augmented Generation (RAG) has emerged as a dynamic solution to adapt to the ever-changing landscape of knowledge. But there’s a crucial player behind the scenes: vector indexes and databases.

While LLMs, RAGs, and vector databases are intensively discussed, the (vector) indexes that empower these innovations are less known. In this article, we’ll demystify the concept of indexes to help you understand how an index makes finding information in vast collections a breeze.

1. What is an index?

We all have encountered such a situation. You are meeting with your friend at her place. The only information she gave you is “I live in Metro Town district”. When you arrive at the said Metro Town district:

Photo by Manson Yim on Unsplash

Well, without any help, finding her place will take a while! If only there were a map at the entrance…

This is precisely what indexes are about: how to quickly find where people (or data) are.

Yellow Pages are an index that allows you to find people’s home based on their names.

An index is a data structure made to improve the speed of data retrieval operations over data. In other words, it is how you organize information so that you can quickly find what you are looking for.

The data is indexed using keys. The order is based on the keys, and multiple keys can be used. In Yellow Pages, the first key is the family name, and the second is the first name.

The index does not necessarily store the whole data. It only focuses on the critical parts used to quickly locate and access specific pieces of data within the entire data.

The index at the end of the book is a good example: it shows you where to find pages using the word, so it maps each word to page numbers and not to the sentences themselves.

Indexes are behind search engines and databases: they play a crucial role in improving the efficiency and speed of data retrieval operations.

The choice of how to organize your data is critical and depends on the context.

For example, in the example of Yellow Pages, if the indexes were organized by phone numbers instead, and you only knew the names, finding their addresses would be quite challenging!

The information is there; you will find it eventually, but the required time will prevent you from even trying. On the other hand, using Yellow Pages, one glance at the page lets you know precisely if you need to look backward or forward! The lexicographic order allows you to make a roughly logarithmic search. That’s why the choice of the index is essential.

In general, an index has a very precise purpose: it can be designed to perform quick insertions or retrievals of the data or a more exotic query such as a range query (“retrieve all the data dated between the 1st of May and the 15th of August of this year”). The choice of the operation to optimize will determine what the index will look like.

The main difference between online transactional processing (OLTP) and online analytical processing (OLAP) databases is the choice of the operations they want to optimize: OLTP focuses on operations over rows (like updating an entry), while the other is aimed at operations over columns (computing an average for example). Both databases will not use the same indexes as they do not aim at the same operations.

1.1 What is the difference between indexes and data structures?

💡A data structure is a way of organizing and storing data in a computer so that it can be efficiently accessed and manipulated. Explained like this, the difference between indexes and data structures is sometimes hard to see, so what are the differences? Indexes focus on inserting, searching, sorting, or filtering data. Data structures are more generic.

Indexes are built using data structures but typically do not store the data itself.

If you consider a database of movies, you don’t want to move around large files whenever the index is updated: you store a pointer to the file, not the file itself. A pointer can be seen as the address of the file on the disk.

Now that you have a general idea about what an index is, let’s focus on numeric examples. Here are some common (numeric) indexes:

  • Inverted index
  • Hash indexes
  • B-trees
  • Locality-sensitive-Hashing (LSH).

To better understand how indexes work, let’s explore one of the most basic indexes: the inverted index.

1.2 Inverted index

The inverted index is a standard index used in search engines.

It is designed to find where the information is quickly: it aims at optimizing the retrieval time.

In a nutshell, an inverted index maps contents to their locations, a bit like the index of a book.
It is often used to map a characteristic to the data that have it.

For example, suppose you want to know who lives in the same building.

First, you should have a table where, for each name, you have the building (the table that would have helped you to find Alice):

This table is updated whenever a person arrives or leaves the district.

If you want to find who lives in building B in this table, you have to iterate through the whole table.

While it is technically possible, it won’t scale as the computation time will linearly increase with the size of the table.

Think about the number of apartments in the district: if you want to find all the persons who live in building B by checking all the profiles one by one, it will take a while!

Another solution is to use an inverted index: you maintain a table where the buildings are used as keys and connected to the people living in it:

This table is maintained simultaneously as the previous one: the cost to add or remove a new person is slightly higher than before, but the retrieval time has shrunk to nearly nothing!

To find who lives in building B, you simply need to access this table’s line “Building B” and have your result!

A reverse phone Lookup is a reverse index of phone numbers!

In practice, inverted indexes are a bit more complicated as they manipulate more complex data than pair (user, interest). The indexes are typically stored as hash tables.

Despite their relative simplicity, inverted indexes are among the most common indexes used in search engines.

1.3 Indexes and Databases

Databases are built over indexes. The indexes enhance data retrieval in the database by storing pointers or references to the database’s data. It does not store the actual data but acts as a means to quickly access it, significantly improving query performance.

The database is more than the index as it is a comprehensive system for data management. It stores, organizes, and manages the actual data, enforces data integrity, handles transactions, and provides a range of features beyond indexing, making it a central hub for data storage and manipulation. While indexes speed up data retrieval within databases, databases serve as the complete ecosystem for data storage, management, and retrieval.

In summary, indexes are like signposts in a database, pointing the way to the data you seek. In contrast, databases are the repositories where the actual data resides and are equipped with various tools and features to manage and manipulate that data.

Depending on your use case, you may not need an entire database but only the indexes, as the overlay managing the data may be costly.

2. Vector indexes and vector databases

2.1 What is a vector index?

In a nutshell, a vector index is an index where the keys are vectors.

In our reversed index example, the keys were words (hobbies and names). In vector indexes, we manipulate vectors: fixed-size sequences of numbers.

Two vectors of size 4.

I know, I know, I can hear you say, “I’m bad at math, I don’t want to use vectors”.

Don’t worry, you don’t need to be good at math to understand vector indexes.

All you need to know is that using vectors enables you to rely on powerful and optimized operations.

The first question you might ask yourself is, “What is so interesting about your vectors”?

Let’s say that you finally found Alice at her place, and now you want to find something to eat. You may want to find the nearest restaurant. You look for a list of restaurants and end up with a table of restaurants, specialties, and addresses. Let’s take a look at the information you can find:

This doesn’t look helpful right? Your only option is to scan through the list, reading addresses one by one and manually assessing how close it is to you. We could try to automate the ranking of the nearest place, but computing distances based on raw addresses is difficult (two streets may be nearby but have different names).

However, imagine now that you have a table with GPS positions representing each restaurant’s exact latitude and longitude:

Each position is a vector of size 2. With those vectors, you can easily compute the distance to your own position with a simple -and fast- mathematical operation. Then you can quickly retrieve the closest ones, in other words, the one with the smallest distance to you!

Now you can easily find the closest restaurant to you!

The interesting point is that by indexing the table directly by the vector (the GPS Position in this example), we can optimize the index so that finding the entries with the lowest distance is extremely fast.

Vector indexes are specialized indexes designed to efficiently retrieve vectors that are closest, or the most similar, to a given vector. These indexes rely on optimized mathematical operations to efficiently identify the most similar vectors.

In our example, the distance used was the classical distance, but there are indexes for all the existing distances or similarities, such as the cosine similarity metric.

Locality Sensitive Hashing (LSH) is one of the most widely used indexes to find the k most similar data points in a dataset, and it works with different distances or similarities.

“That’s nice, but I’m not using vectors in my database”.

That’s the exciting part: you can transform anything into a vector.

Simply taking the binary representation would be inefficient since it may contain noise, so it’s essential to find a representation that preserves the characteristics of the data.

Representing different information as a vector to use vector indexes has become a standard way to improve the efficiency of a system. Vectorization has become an art.

For example, if you have an image dataset and you want a database where you can find the images the most similar to a given one, you can use the SIFT descriptor of the images.

2.2 What is the difference between Vector Indexes and Vector databases?

The difference between vector indexes and vector databases is the same as the difference between indexes and databases: indexes are meant to simply quickly find where the data is, while vector databases use vector indexes to perform the retrieval queries fast, but they also store and maintain the data while providing additional operations and properties.

3. What is the link between LLMs and RAGs?

Now that you’ve learned about vector indexes, you might wonder why so many discussions about LLMs and RAGs also discussed vector indexes. To understand why, let’s first quickly explain what Retrieval Augmented Generation, or RAG, is. AG serves as a clever workaround for one of the inherent limitations of LLMs, namely, their limited knowledge.

LLMs are only aware of the data they were trained on. One technique to increase their knowledge is prompt engineering, where additional data is integrated into the query prompt: “Given this data {data}, answer this question: {question}”.

While effective, this approach faces a new challenge: scalability. Not only the size of the prompt is limited, but the more data you include, the more costly the query becomes.

To overcome this, Retrieval Augmented Generation limits the quantity of data by only inserting the most similar data, which is where vector indexes come into play!

Here’s how it works: All the documents are initially transformed into vectors using LLMs (1). To be more specific, the encoder part of the LLM is used.

These vectors are used as keys for indexing the documents in a vector index (2).

When doing a query, the query is vectorized using the LLM (3). The resulting vector is then queried in the vector index to retrieve the most similar documents (4). These documents are then used to answer the query using prompt engineering (5).

Retrieval-Augmented Generation (RAG) relies on both LLMs and vector indexes.

That’s it!

As you can see, similarly to LLM, the vector index holds a central position in RAGs.

Some people prefer to use a vector database instead of a vector index. That’s okay whenever you want to reuse the same data in multiple applications. However, if your primary concern is retrieval efficiency or flexibility in defining the index for each application, a single vector index is often simpler and faster to deploy.

Conclusion

Congratulations to the courageous who have read until there! I believe you now have all the background knowledge to engage in those passionate discussions about LLMs and RAGs.

Indexes hold a central role in data retrieval. As data retrieval is likely to remain a key component of data technologies, it is primordial to understand what indexes, including vector indexes, are about.

If you want to learn about more advanced indexes, I’d suggest you read my article about LSH. If you want to learn something more practical and are curious to experience real-time Retrieval Augmented Generation (RAG) in action, consider exploring the LLM-app, where you can experience firsthand the power of these technologies.

--

--