
Table of Contents
Intro
- What is so special about Vector Databases?
- How do we map the meaning of a sentence to a numerical representation?
- How does that help our LLM app?
- Why can’t we just give the LLM all the data we have?
Hands-On Tutorial – Text to Embeddings and Distance Metrics
1. Text to Embeddings
2. Plot 384 dimensions in 2 using PCA
3. Calculate the distance metrics
Towards Vector Stores
- How to accelerate the Similarity Search?
- What are the different Vector Stores we can choose from?
Hands-On Tutorial – Set up your first Vector Store
1. Install chroma
2. Get/create a chroma client and collection
3. Add some text documents to the collection
4. Extract all entries from database to excel file
5. Query the collection
Vector databases are a hot topic right now. Companies keep raising money to develop their vector databases or to add vector search capabilities to their existing SQL or NoSQL databases.

What is so special about Vector Databases?
Vector Databases make it possible to quickly search and compare large collections of vectors. This is so interesting because the most up-to-date embedding models are highly capable of understanding the semantics/meaning behind words and translating them into vectors. This allows us to efficiently compare sentences with each other.
Okay, but why should we care?
For most Large Language Model (LLM) applications, we rely on that capability because our LLM can never know everything. It only sees a frozen snapshot of the world, defined by the training data it was trained on.

So we need to feed our model with additional data, information that the LLM cannot possibly know by itself. And all of that needs to happen at runtime. So we must have a process in place that decides, as quickly as possible, which additional data to feed our model.
With traditional keyword search, we run into limitations, mainly because of two problems:
- Languages are complex. In most languages, you can ask more or less the same question in 20 different ways. It is often not enough to simply search our data for keywords. We need a way to map the meaning behind words and sentences to find content that’s related to the question.
- We also need to make sure that this search is done within milliseconds, not seconds or minutes. So we need a step that allows us to search the Vector Collection as efficiently as possible.
First things first – How do we map the meaning of a sentence to a numerical representation?
Before we can search our database, we need to translate our text content into vectors that capture the meaning of words and sentences. Pre-trained embedding models from OpenAI, Google, Meta AI, or the open source community help us do this. They learn from a huge corpus of text how words are normally used and in what contexts, and they use this extracted knowledge to map words into a multi-dimensional vector space. The location of a new data point in that vector space tells us which words are related to each other.
For the simple example below, we arrange the "meaning" of various fruits and vegetables in a simple two-dimensional vector space. If the embedding model does what it is supposed to do, we would expect apple and pear to be closer to each other than apple and onion (at least, that is what I would expect with my limited knowledge of language and fruits). Similarity can be influenced by various features; for fruits and vegetables this may be size, colour, taste, country of origin, etc. It all depends on what object you want to describe.

The embedding models learn by observing how words are used in context, similar to how humans learn a language. When you’re growing up, you learn the meaning of words by listening to conversations and reading books.
The training process of our model is not so different. During training, it learned that "pears and apples" are more likely to be seen together in a sentence than "apples and onions," so it assumes they have something in common. When it comes to food, it’s likely that what matters most is whether different types of food are eaten together.
Objects, words and sentences have lots of features that don’t fit into two dimensions. To tackle this, modern embedding models turn words and sentences into vectors with hundreds or thousands of dimensions.
By transforming our content into a vector, we can measure the distance between them. This is easy in the two-dimensional example below, but it can become more complex and require more computing power when the vector has hundreds or thousands of dimensions.
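Before we get to that, here is a minimal sketch of what "measuring the distance" means in the simple two-dimensional case. It uses NumPy and made-up 2-dimensional vectors (not real embeddings) and computes the cosine similarity, the metric we will also use later in this article:

import numpy as np

# made-up 2-dimensional "embeddings", purely for illustration
apple = np.array([0.9, 0.2])
pear = np.array([0.8, 0.3])
onion = np.array([0.1, 0.9])

def cosine_similarity(a, b):
    # dot product divided by the product of the vector lengths
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(apple, pear))   # ~0.99 -> very similar
print(cosine_similarity(apple, onion))  # ~0.32 -> much less similar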
This is where vector databases come into play.
They incorporate several techniques that allow us to efficiently store and search our collection of text content. Once we have a collection of vectors, we want to compare them to each other and somehow quantify the similarities between them. Usually we are interested in the k-nearest neighbors, i.e. the data points in our vector space that are closest to our query vector. In the example below, our query would be the word "apple", or rather not the word "apple" itself, but the vector we get back from the embedding model for it.

How does that help our LLM app?
We use this approach in many of our LLM applications when the LLMs themselves reach the limits of their knowledge:
Things LLMs don’t know out of the box:
- Data that is too new – Articles about current events, recent innovations, etc. In short, any new content created after the collection of the LLM training set.
- Data that is not public – personal data, internal company data, secret data, etc.

Why can’t we just give the model all the data we have?
Short answer: The models have a limit, a token limit.
If we don’t want to train or fine-tune the model, we have no choice but to give the model all the necessary information within the prompt. We have to respect the token limits of the models.

LLMs have a token limit for practical and technical reasons. The latest models from OpenAI have token limits of about 4,000–32,000 tokens, while the open source LLM LLaMA has 2,048 tokens (if not fine-tuned). You can increase the maximum number of tokens by fine-tuning, but more data is not always better. A 32,000-token limit lets us pack even large texts into a prompt at once; whether that makes sense is another matter. (Lample, 2023; OpenAI, 2023)
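If you want to get a feeling for how quickly a prompt eats into those limits, you can count the tokens yourself. Below is a small sketch using OpenAI’s tiktoken library (an assumption on my side: it is not part of the requirements listed later and needs to be installed separately with pip install tiktoken; the exact count also depends on the tokenizer of the model you use):

import tiktoken

# tokenizer used by recent OpenAI chat models
encoding = tiktoken.get_encoding("cl100k_base")

prompt = "Vector databases make it possible to quickly search and compare large collections of vectors."
tokens = encoding.encode(prompt)

# number of tokens this sentence consumes from the model's token budget
print(len(tokens), "tokens")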
The quality of the data is more important than the sheer amount of data, and irrelevant data can have a negative impact on the result.
Even reorganizing the information within the prompt can make a big difference in how accurately LLMs understand the task. Researchers from Stanford University have found that when important information is placed at the beginning or end of the prompt, the answer is usually more accurate. If the same information is located in the middle of the prompt, accuracy can decrease significantly. It’s important to give careful consideration to what data we provide our model with and how we structure our prompt. (Liu et al., 2023; Raschka, 2023)

The quality of the input data is key to the success of our LLM application, so it is essential we implement a process that accurately identifies the relevant content and avoids adding too much unnecessary data. To ensure this, we must use effective search processes to highlight the most relevant information.
How exactly do Vector databases help us do that?
Vector databases are made up of several parts that help us quickly find what we’re looking for. Indexing is the most important part and is done just once, when we insert new data into our dataset. After that, searches are much faster, saving us time and effort.

To feed the data into our vector database, we first have to convert all our content into vectors. As described in the first section of this article, we can use so-called embedding models for that. Simply because it’s more convenient, we often use one of the ready-to-use services from OpenAI, Google, and others.
In the image below you can see some Embedding Models you can choose from.

Hands-On Tutorial – Text to Vectors
To use our example, we’ll need the Hugging Face API and the sentence transformer model "all-MiniLM-L6-v2". To get started, just go to https://huggingface.co/settings/tokens and get your token.
Using the code snippet below, we:
- Transform the text snippets into vectors with 384 dimensions (Depends on the Embedding Model you use). This allows us to capture the meaning of sentences in a vector.
- We can then identify similarities between data points by calculating the distance between the sentences.
- To visualise this in a simple 2-dimensional plot, we reduce the 384 dimensions to two using Principal Component Analysis. This may result in a huge loss of information, but it’s worth a try!
For the example below I am generating 10 random sample sentences:
text_chunks = [
"The sky is blue.",
"The grass is green.",
"The sun is shining.",
"I love chocolate.",
"Pizza is delicious.",
"Coding is fun.",
"Roses are red.",
"Violets are blue.",
"Water is essential for life.",
"The moon orbits the Earth.",
]
If you’d like to try the tutorial for yourself, I’ve broken down the steps into data pipelines that you can run independently.
Create a new virtual environment .venv using:
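python -m venv .venv
Then activate it (on Linux/macOS: source .venv/bin/activate, on Windows: .venv\Scripts\activate).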
Install all modules in "requirements.txt":
pip install -r requirements.txt
anyio==4.0.0
backoff==2.2.1
bcrypt==4.0.1
certifi==2023.7.22
charset-normalizer==3.2.0
chroma-hnswlib==0.7.3
chromadb==0.4.10
click==8.1.7
colorama==0.4.6
coloredlogs==15.0.1
contourpy==1.1.1
cycler==0.11.0
exceptiongroup==1.1.3
fastapi==0.99.1
filelock==3.12.4
flatbuffers==23.5.26
fonttools==4.42.1
fsspec==2023.9.1
h11==0.14.0
httptools==0.6.0
huggingface-hub==0.16.4
humanfriendly==10.0
idna==3.4
importlib-resources==6.0.1
joblib==1.3.2
kiwisolver==1.4.5
matplotlib==3.8.0
monotonic==1.6
mpmath==1.3.0
numpy==1.26.0
onnxruntime==1.15.1
overrides==7.4.0
packaging==23.1
pandas==2.1.0
Pillow==10.0.1
posthog==3.0.2
protobuf==4.24.3
pulsar-client==3.3.0
pydantic==1.10.12
pyparsing==3.1.1
PyPika==0.48.9
pyreadline3==3.4.1
python-dateutil==2.8.2
python-dotenv==1.0.0
pytz==2023.3.post1
PyYAML==6.0.1
requests==2.31.0
scikit-learn==1.3.0
scipy==1.11.2
six==1.16.0
sniffio==1.3.0
starlette==0.27.0
sympy==1.12
threadpoolctl==3.2.0
tokenizers==0.14.0
tqdm==4.66.1
typing_extensions==4.7.1
tzdata==2023.3
urllib3==2.0.4
uvicorn==0.23.2
watchfiles==0.20.0
websockets==11.0.3
If you want to follow along you can simply set up a similar folder structure with a folder for the data pipelines and the data (or change the path to the .csv files within the pipelines).

1. Text to Embeddings
The first pipeline, "01_text_to_embeddings.py", uses an embedding model to transform the 10 sample sentences. The results are saved in a data frame called "embeddings_df" and then exported to a CSV file named "embeddings_df.csv". Make sure you replace the Hugging Face token with your own token.

##########################################################################################################
'''
The following script translates the list of strings "text_chunks" into vector embeddings and
saves the DataFrame including "text_chunks" and "embeddings" in the csv file "embeddings_df.csv"
'''
##########################################################################################################
import os
import requests
import pandas as pd
import numpy as np

# hugging face token
os.environ['hf_token'] = os.environ.get('HF_TOKEN')
# os.environ['hf_token'] = 'testtoken123'

# example text snippets we want to translate into vector embeddings
text_chunks = [
    "The sky is blue.",
    "The grass is green.",
    "The sun is shining.",
    "I love chocolate.",
    "Pizza is delicious.",
    "Coding is fun.",
    "Roses are red.",
    "Violets are blue.",
    "Water is essential for life.",
    "The moon orbits the Earth.",
]

def _get_embeddings(text_chunk):
    '''
    Use embedding model from hugging face to calculate embeddings for the text snippets provided

    Parameters:
        - text_chunk (string): the sentence or text snippet you want to translate into embeddings

    Returns:
        - embedding (list): list with all embedding dimensions
    '''
    # define the embedding model you want to use
    model_id = "sentence-transformers/all-MiniLM-L6-v2"

    # you can find the token to the hugging face api in your settings page https://huggingface.co/settings/tokens
    hf_token = os.environ.get('hf_token')

    # API endpoint for embedding model
    api_url = f"https://api-inference.huggingface.co/pipeline/feature-extraction/{model_id}"
    headers = {"Authorization": f"Bearer {hf_token}"}

    # call API
    response = requests.post(api_url, headers=headers, json={"inputs": text_chunk, "options": {"wait_for_model": True}})

    # load response from embedding model into json format
    embedding = response.json()

    return embedding

def from_text_to_embeddings(text_chunks):
    '''
    Translate sentences into vector embeddings

    Parameters:
        - text_chunks (list): list of example strings

    Returns:
        - embeddings_df (DataFrame): data frame with one column per embedding dimension and the column "text_chunk"
    '''
    # create new data frame using text chunks list
    embeddings_df = pd.DataFrame(text_chunks).rename(columns={0: "text_chunk"})

    # use the _get_embeddings function to retrieve the embeddings for each of the sentences
    embeddings_df["embeddings"] = embeddings_df["text_chunk"].apply(_get_embeddings)

    # split the embeddings column into individual columns for each vector dimension
    embeddings_df = embeddings_df['embeddings'].apply(pd.Series)
    embeddings_df["text_chunk"] = text_chunks

    return embeddings_df

# get embeddings for each of the text chunks
embeddings_df = from_text_to_embeddings(text_chunks)

# save data frame with text chunks and embeddings to csv
embeddings_df.to_csv('../02_Data/embeddings_df.csv', index=False)
You should now find the csv-file "embeddings_df.csv" in the folder "02_Data". It should look something like this:

2. Plot 384 dimensions vector in 2 dimensions using Principal Component Analysis
Now that you have the csv-file "embeddings_df.csv" with the vectors in the folder "02_Data", let’s try to visualize them. We can use Principal Component Analysis to extract the two most important principal components and plot them. We cannot capture all of the original information in just two dimensions, but maybe it gives us a feeling for what is happening when we translate our text into vectors.

Create a new file for the second pipeline; I call it "02_create_PCA_analysis.py".
The second pipeline:
- Loads the embeddings from "embeddings_df.csv"
- Performs a Principal Component Analysis using scikit-learn to find the two most important (principal) components
- Creates a scatter plot to visualize the result in a 2-dimensional plot
##########################################################################################################
# Create PCA plot using embeddings df
##########################################################################################################
import pandas as pd
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

def create_pca_plot(embeddings_df):
    '''
    The function performs a principal component analysis to reduce the dimensions to 2 so we can plot them

    Parameters
        - embeddings_df (DataFrame): data frame with one column per embedding dimension and the column "text_chunk"

    Returns
        - df_reduced (DataFrame): data frame with the 2 most relevant Principal Components
    '''
    # Perform PCA with 2 components
    pca = PCA(n_components=2)

    # apply principal component analysis to the embedding columns (all columns except "text_chunk")
    df_reduced = pca.fit_transform(embeddings_df[embeddings_df.columns[:-1]])

    # Create a new DataFrame with reduced dimensions
    df_reduced = pd.DataFrame(df_reduced, columns=['PC1', 'PC2'])

    ############################################################################################
    # Create a scatter plot
    ############################################################################################
    def create_scatter_plot(df_reduced):
        plt.scatter(df_reduced['PC1'], df_reduced['PC2'])

        # Add labels and title
        plt.xlabel('X')
        plt.ylabel('Y')
        plt.title('Scatter Plot')

        # Add labels to each dot
        for i, label in enumerate(embeddings_df.iloc[:, -1].to_list()):
            plt.text(df_reduced['PC1'][i], df_reduced['PC2'][i], label)

        # Save and display the plot
        plt.savefig('../02_Data/principal_component_plot.png', format='png')

    # create and save scatter plot
    create_scatter_plot(df_reduced=df_reduced)

    return df_reduced

# Load embeddings_df.csv into data frame
embeddings_df = pd.read_csv('../02_Data/embeddings_df.csv')

# use the function create_pca_plot to reduce the embeddings to two dimensions and create the scatter plot
df_reduced = create_pca_plot(embeddings_df)

I’m not sure how good the example is, but you can at least see that sentences about food and drink tend to be on the right side of the plot, the two sentences describing the weather are close together on the left side, and the two sentences about flowers are close together at the bottom of the plot. I’ll take this as "not so scientific" evidence that the model is at least somewhat successful in mapping the semantics behind the words and phrases.

As humans, we find it relatively easy to tell which points are closer together when they are plotted in a simple two-dimensional space. But how can we quantify that in a vector space with hundreds or thousands of dimensions? We need a metric for that, a metric that describes similarity. Therefore, we calculate the distance between the points in the space.
3. Calculate the distance
How is the distance defined?
In statistics, we have several metrics to measure the distance between data points. One frequently used metric is cosine similarity, which we’ll be using for the examples below. It is defined as the dot product of two vectors divided by the product of their lengths: cos_sim(A, B) = (A · B) / (||A|| * ||B||).

What we want to do with vector databases is find similar entries as quickly as possible. But what makes vector databases so special? Why can’t we just perform the similarity search by calculating the distance metric to every data point?
To make it easier to understand why we need an alternate approach, I have created a straightforward process that finds the closest data points. It does this by calculating the distance to each data point and then sorting them to find the nearest neighbors.
For the example below, we use the 10 pieces of text that we converted into vectors earlier in this post.
So a common query of our database might look like this:
Let’s say we have a new sentence (user query), here: "Lilies are white."
- First, we obtain the embeddings for our new sentence. This vector gives us a location in our embedding space and allows us to compare it with the other text chunks in our collection.

- Then we are calculating the distance (here: cosine similarity) to each data point

- Usually, we are interested in the k-nearest neighbors. We simply filter for the ones with the smallest distance to our new query vector.
For the simple example, we use the sentence "Lilies are white." and try to describe the similarity to the other sentences in the dataset by calculating their distance from each other.
To do that, we create a third pipeline; I call it "03_calculate_cosine_similarity.py":
##########################################################################################################
# Calculate cosine similarity between the query vector and all other embedding vectors
##########################################################################################################
import numpy as np
from numpy.linalg import norm
import time
import pandas as pd
import os
import requests

def _get_embeddings(text_chunk):
    '''
    Use embedding model from hugging face to calculate embeddings for the text snippets provided

    Parameters:
        - text_chunk (string): the sentence or text snippet you want to translate into embeddings

    Returns:
        - embedding (list): list with all embedding dimensions
    '''
    # define the embedding model you want to use
    model_id = "sentence-transformers/all-MiniLM-L6-v2"

    # you can find the token to the hugging face api in your settings page https://huggingface.co/settings/tokens
    hf_token = os.environ.get('HF_TOKEN')

    # API endpoint for embedding model
    api_url = f"https://api-inference.huggingface.co/pipeline/feature-extraction/{model_id}"
    headers = {"Authorization": f"Bearer {hf_token}"}

    # call API
    response = requests.post(api_url, headers=headers, json={"inputs": text_chunk, "options": {"wait_for_model": True}})

    # load response from embedding model into json format
    embedding = response.json()

    return embedding

def calculate_cosine_similarity(text_chunk, embeddings_df):
    '''
    Calculate the cosine similarity between the query sentence and every other sentence
    1. Get the embeddings for the text chunk
    2. Calculate the cosine similarity between the embeddings of our text chunk and every other entry in the data frame

    Parameters:
        - text_chunk (string): the text snippet we want to use to look for similar entries in our database (embeddings_df)
        - embeddings_df (DataFrame): data frame with the embedding columns and the column "text_chunk"

    Returns:
        - embeddings_cosine_df (DataFrame): data frame with the text chunks, embeddings and calculated cosine similarities
    '''
    # use the _get_embeddings function to retrieve the embeddings for the text chunk
    sentence_embedding = _get_embeddings(text_chunk)

    # combine all dimensions of the vector embeddings to one array
    embeddings_df['embeddings_array'] = embeddings_df.apply(lambda row: row.values[:-1], axis=1)

    # start the timer
    start_time = time.time()
    print(start_time)

    # create a list to store the calculated cosine similarity
    cos_sim = []

    for index, row in embeddings_df.iterrows():
        A = row.embeddings_array
        B = sentence_embedding

        # calculate the cosine similarity
        cosine = np.dot(A, B) / (norm(A) * norm(B))
        cos_sim.append(cosine)

    embeddings_cosine_df = embeddings_df
    embeddings_cosine_df["cos_sim"] = cos_sim
    embeddings_cosine_df = embeddings_cosine_df.sort_values(by=["cos_sim"], ascending=False)

    # stop the timer
    end_time = time.time()

    # calculate the time needed to calculate the similarities
    elapsed_time = (end_time - start_time)
    print("Execution Time: ", elapsed_time, "seconds")

    return embeddings_cosine_df

# Load embeddings_df.csv into data frame
embeddings_df = pd.read_csv('../02_Data/embeddings_df.csv')

# test query sentence
text_chunk = "Lilies are white."

# calculate cosine similarity
embeddings_cosine_df = calculate_cosine_similarity(text_chunk, embeddings_df)

# save data frame with text chunks and embeddings to csv
embeddings_cosine_df.to_csv('../02_Data/embeddings_cosine_df.csv', index=False)

# rank based on similarity
similarity_ranked_df = embeddings_cosine_df[["text_chunk", "cos_sim"]].sort_values(by=["cos_sim"], ascending=False)

If we look at the similarity scores, the sentences "Violets are blue." and "Roses are red." are significantly more similar to our query sentence "Lilies are white." than "Coding is fun.".
Makes sense, I guess.
The similarity search seems to work. So why do we need another approach at all?
Using the time.time() function, I measure how long it takes to calculate the cosine similarity between our query vector and the 10 other vectors.
According to that, it takes around 0.005 seconds to search our 10 entries.

Comparing each point to every other point is called "exhaustive search", and the time it takes grows linearly with the number of entries: if comparing 10 points takes about 0.005 seconds, then 1 million text chunks would take roughly 500 seconds, i.e. several minutes.
We need to find an efficient way to speed up the similarity search process.
How to accelerate the similarity search?
Approximate Nearest Neighbor algorithms are used to find the closest neighbors, even though they may not always be the exact closest. This trade-off of accuracy for speed is usually acceptable for LLM applications, since speed is more important and often the same information is found in multiple text snippets anyway. (Trabelsi, 2021)
I guess it is not essential for the average user to have a deep understanding of the indexing techniques, but I don’t want to leave you completely in the dark. To give you at least a basic understanding, here is an example of how an Inverted File Index works. This way, you can get a sense of where accuracy can be lost with these techniques.
The Inverted File Index (IVF) is a popular method for finding similarities between different items. It works by building an index that divides the whole collection of items into partitions, each represented by its centroid. Each item belongs to exactly one partition at a time. When we search for similarities, we use the partition centroids to quickly narrow down where the items we are looking for are most likely located.
If we’re looking for nearby points, we usually just search in the centroid closest to our point. But if there are points close to the edge of the neighboring centroid, we may miss them.

To avoid this issue, we search multiple partitions instead of just one. However, the underlying problem still remains. We lose some accuracy. But that is usually ok, speed is more important.
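To make this a bit more tangible, here is a small, simplified sketch of the IVF idea using NumPy and scikit-learn’s KMeans (purely illustrative, with random vectors instead of real embeddings; real vector databases use heavily optimized implementations of this and other algorithms). We partition the collection once with k-means, and at query time we only compute exact distances inside the few partitions whose centroids are closest to the query:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# 10,000 random stand-ins for 384-dimensional sentence embeddings
vectors = rng.normal(size=(10_000, 384))

# indexing step (done once): partition the collection into 50 clusters
kmeans = KMeans(n_clusters=50, n_init="auto", random_state=42).fit(vectors)
labels = kmeans.labels_              # partition id for every vector
centroids = kmeans.cluster_centers_

def ivf_search(query, n_probe=3, k=5):
    # 1. find the n_probe partitions whose centroids are closest to the query
    centroid_dist = np.linalg.norm(centroids - query, axis=1)
    probe_partitions = np.argsort(centroid_dist)[:n_probe]

    # 2. only compute exact distances inside those partitions
    candidate_ids = np.where(np.isin(labels, probe_partitions))[0]
    candidate_dist = np.linalg.norm(vectors[candidate_ids] - query, axis=1)

    # 3. return the k nearest candidates (true neighbors in unprobed partitions are missed)
    return candidate_ids[np.argsort(candidate_dist)[:k]]

query = rng.normal(size=384)
print(ivf_search(query))

Increasing n_probe searches more partitions and recovers some accuracy at the cost of speed.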
Chroma supports multiple approximate nearest neighbor (ANN) algorithms, including HNSW, IVFADC, and IVFPQ. (Yadav, 2023)
- Hierarchical Navigable Small World (HNSW): HNSW is an algorithm that creates a hierarchical graph structure to quickly store and search high-dimensional vectors with minimal memory usage.
- Inverted File with Product Quantization (IVFPQ): IVFPQ uses product quantization to compress vectors before indexing, resulting in a high-accuracy search that can handle massive datasets.

If you want to know in detail how the different indexing methods work, I can recommend James Briggs’ YouTube channel. I also liked the blog posts from Peggy Chang about the inverted file index (IVF), product quantization (PQ), and co.
What is really relevant for us?
I think all we need to know is that through the indexing step, we store our embeddings in a form that allows us to quickly find "similar" vectors without having to calculate the distance to all the data points each time. By doing that, we trade a bit of accuracy for speed.

The accuracy of the task should still be good enough for most of what we try to do with it. Translating language into embeddings is not an exact science anyway. The same word can have different meanings depending on the context or region of the world in which you use it. Therefore, it is usually okay if we lose a few accuracy points, but what is much more important is the speed of the response.
Google’s fast response time is what makes it so successful. The speed of each step of the process is even more important for our application, because we not only have to perform the Vector Search, but also pass the questions and context to our LLM.
However, this process takes a bit longer than Google, which is (I guess) why Bing Chat (with its LLM support) has not yet conquered the world. It is only a few milliseconds or seconds slower, but this small difference is enough to keep Google on top.
To illustrate these steps – Let’s say we want to create a chatbot like Bing Chat
We still have the (traditional) search part that looks for the most relevant content and news. Only once we have found some relevant results do we provide them to our LLM and let it interpret the data and formulate a well-sounding answer.

Our vector store takes care of tokenizing, embedding and indexing the data when it’s loaded. Once the data is in the store, we can query it with new data points.
Suppose we decide to use a vector store – What options do we have?
Vector databases come in different shapes. We distinguish between:
- Pure vector databases
- Extended capabilities in SQL, NoSQL or text search databases
- Simple vector libraries

Text search databases can search through large amounts of text for specific words or phrases. Recently, some of these databases have begun to use vector search to further improve their ability to find what you’re looking for. At the last Microsoft Developer Conference, Elasticsearch explained how they use both traditional search and vector search to create a ‘Hybrid Scoring’ system, giving you the best possible search results.

Vector Search is also gradually being adopted by more and more SQL and NoSQL databases such as Redis, MongoDB or Postgres. Pgvector, for example, is the open source vector similarity search for Postgres. It supports (Github, 2023):
- exact and approximate nearest neighbor search
- L2 distance, inner product, and cosine distance
For smaller projects, vector libraries are a great option and provide most of the features needed.
Facebook AI Research released one of the first vector libraries, FAISS, in 2017. FAISS is a library for efficiently searching and clustering dense vectors, and can handle vector sets of any size, even those that don’t fit in memory. It’s written in C++ and comes with a Python wrapper, making it easy for data scientists to integrate into their code. (FAISS Documentation, 2023)
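To give you an idea of how little code that takes, here is a rough sketch of the FAISS workflow with a flat (exact) index, assuming faiss-cpu is installed and using random vectors in place of real embeddings:

import numpy as np
import faiss  # pip install faiss-cpu

dimension = 384
vectors = np.random.random((10_000, dimension)).astype("float32")

# build a flat index that performs exact L2 search
index = faiss.IndexFlatL2(dimension)
index.add(vectors)

# search the 5 nearest neighbors for a single query vector
query = np.random.random((1, dimension)).astype("float32")
distances, ids = index.search(query, 5)
print(ids, distances)

Swapping the flat index for one of FAISS’s approximate index types is what buys the speed-up on larger collections.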

Chroma, Pinecone, and Weaviate, on the other hand, are pure vector databases that can store your vector data and be searched like any other database. In this article, I’ll show you how to set up a vector database with Chroma and how to fill it with your vector data. If you’re looking for a quick solution, vector libraries like FAISS can help you get started easily with all the necessary indexing methods.
So which one should we take?
I guess you’ll have to answer that one for yourself and your specific project, but don’t make things unnecessarily complex. Andrej Karpathy described it on Twitter with the words (Andrej Karpathy, 2023):
"Np.array – people keep reaching for much fancier things way too fast these days" – Andrej Karpathy
If you only need to search a few pages of a PDF or text file, you can simply use an np.array or a pandas data frame to store your embeddings (see the short sketch below). The capabilities of vector databases become interesting when we are talking about hundreds, thousands, or millions of vectors that we want to search on a regular basis. For this article, I am using Chroma, but the same principles apply to all databases.
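For the small end of that spectrum, a plain NumPy array really is enough. A minimal sketch (with random vectors as stand-ins for your embeddings) of a brute-force top-k search:

import numpy as np

# your embeddings stored as a plain (n_vectors, n_dimensions) array
embeddings = np.random.random((1_000, 384))
query = np.random.random(384)

# cosine similarity of the query against every stored vector, fully vectorized
scores = embeddings @ query / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query))

# indices of the 3 most similar entries
top_k = np.argsort(scores)[::-1][:3]
print(top_k, scores[top_k])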
Hands-On Tutorial – Set up your first Vector Store
To store our content as vectors and improve the performance of our similarity search, we need to set up our own vector database. Chroma is a great open-source option, as it is free to use and comes with an Apache 2.0 license. Alternatives such as FAISS, Weaviate, and Pinecone also exist; some of them are open source and free to use, while others are only available as a commercial service.
With just a few steps, we can get chromadb up and running. All we need to do is install the chromadb library with the package manager pip. Once we have that, we can set up our first vector store and get going.
1. Install chroma
https://pypi.org/project/chromadb/
pip install chromadb
2. Get/create a chroma client and collection
A collection is the designated storage for your embeddings, documents, and any additional metadata. If you want to persist it, so you can reuse your indexes and collections later, you can create a persistent client and give it a storage path (the "persist_directory").

import chromadb
from chromadb.config import Settings
import pandas as pd

# vector store settings
VECTOR_STORE_PATH = r'../02_Data/00_Vector_Store'
COLLECTION_NAME = 'my_collection'

# Load embeddings_df.csv into data frame
embeddings_df = pd.read_csv('../02_Data/embeddings_df.csv')

def get_or_create_client_and_collection(VECTOR_STORE_PATH, COLLECTION_NAME):
    # get/create a chroma client
    chroma_client = chromadb.PersistentClient(path=VECTOR_STORE_PATH)

    # get or create collection
    collection = chroma_client.get_or_create_collection(name=COLLECTION_NAME)

    return collection

# get or create collection
collection = get_or_create_client_and_collection(VECTOR_STORE_PATH, COLLECTION_NAME)
3. Add some text documents to the collection
Chroma makes it easy to store and organize your text documents. It takes care of the tokenization, embedding, and indexing processes for you, and if you have already created your own embeddings, you can load them directly into Chroma’s vector store.
We want to store the already created Data Frame "embeddings_df" into our new data store:
# Load embeddings_df.csv into data frame
embeddings_df = pd.read_csv('../02_Data/embeddings_df.csv')

def add_to_collection(embeddings_df):
    # add a sample entry to collection
    # collection.add(
    #     documents=["This is a document", "This is another document"],
    #     metadatas=[{"source": "my_source"}, {"source": "my_source"}],
    #     ids=["id1", "id2"]
    # )

    # combine all dimensions of the vector embeddings to one array
    embeddings_df['embeddings_array'] = embeddings_df.apply(lambda row: row.values[:-1], axis=1)
    embeddings_df['embeddings_array'] = embeddings_df['embeddings_array'].apply(lambda x: x.tolist())

    # add data frame to collection
    collection.add(
        embeddings=embeddings_df.embeddings_array.to_list(),
        documents=embeddings_df.text_chunk.to_list(),
        # create a list of string as index
        ids=list(map(str, embeddings_df.index.tolist()))
    )

# add the embeddings_df to our vector store collection
add_to_collection(embeddings_df)

4. Extract all entries from database to excel file
If you want to export all entries in your vector store, you can use:
def get_all_entries(collection):
    # query collection
    existing_docs = pd.DataFrame(collection.get()).rename(columns={0: "ids", 1: "embeddings", 2: "documents", 3: "metadatas"})
    existing_docs.to_excel(r"..//02_Data//01_vector_stores_export.xlsx")

    return existing_docs

# extract all entries in vector store collection
existing_docs = get_all_entries(collection)
5. Query the collection
Chroma makes it easy to find the n most similar results to a query text:
def query_vector_database(VECTOR_STORE_PATH, COLLECTION_NAME, query, n=2):
    # query collection
    results = collection.query(
        query_texts=query,
        n_results=n
    )

    print(f"Similarity Search: {n} most similar entries:")
    print(results["documents"])

    return results

# similarity search
similar_vector_entries = query_vector_database(VECTOR_STORE_PATH, COLLECTION_NAME, query=["Lilies are white."])

For this example we will get back the two most similar entries in our vector store.
Below you can find a summary of all the steps in one script:
##########################################################################################################
'''
Includes some functions to create a new vector store collection, fill it and query it
'''
##########################################################################################################
import chromadb
from chromadb.config import Settings
import pandas as pd

# vector store settings
VECTOR_STORE_PATH = r'../02_Data/00_Vector_Store'
COLLECTION_NAME = 'my_collection'

# Load embeddings_df.csv into data frame
embeddings_df = pd.read_csv('../02_Data/embeddings_df.csv')

def get_or_create_client_and_collection(VECTOR_STORE_PATH, COLLECTION_NAME):
    # get/create a chroma client
    chroma_client = chromadb.PersistentClient(path=VECTOR_STORE_PATH)

    # get or create collection
    collection = chroma_client.get_or_create_collection(name=COLLECTION_NAME)

    return collection

# get or create collection
collection = get_or_create_client_and_collection(VECTOR_STORE_PATH, COLLECTION_NAME)

def add_to_collection(embeddings_df):
    # add a sample entry to collection
    # collection.add(
    #     documents=["This is a document", "This is another document"],
    #     metadatas=[{"source": "my_source"}, {"source": "my_source"}],
    #     ids=["id1", "id2"]
    # )

    # combine all dimensions of the vector embeddings to one array
    embeddings_df['embeddings_array'] = embeddings_df.apply(lambda row: row.values[:-1], axis=1)
    embeddings_df['embeddings_array'] = embeddings_df['embeddings_array'].apply(lambda x: x.tolist())

    # add data frame to collection
    collection.add(
        embeddings=embeddings_df.embeddings_array.to_list(),
        documents=embeddings_df.text_chunk.to_list(),
        # create a list of string as index
        ids=list(map(str, embeddings_df.index.tolist()))
    )

# add the embeddings_df to our vector store collection
add_to_collection(embeddings_df)

def get_all_entries(collection):
    # query collection
    existing_docs = pd.DataFrame(collection.get()).rename(columns={0: "ids", 1: "embeddings", 2: "documents", 3: "metadatas"})
    existing_docs.to_excel(r"..//02_Data//01_vector_stores_export.xlsx")

    return existing_docs

# extract all entries in vector store collection
existing_docs = get_all_entries(collection)

def query_vector_database(VECTOR_STORE_PATH, COLLECTION_NAME, query, n=2):
    # query collection
    results = collection.query(
        query_texts=query,
        n_results=n
    )

    print(f"Similarity Search: {n} most similar entries:")
    print(results["documents"])

    return results

# similarity search
similar_vector_entries = query_vector_database(VECTOR_STORE_PATH, COLLECTION_NAME, query=["Lilies are white."])
These k-nearest neighbors are what we then use to feed our LLM: build a simple prompt template around them, insert the found text chunks into it, and you can send it to GPT, LLaMA, or any other LLM of your choice. I described how this works in one of my previous articles.
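To close the loop, here is a minimal, hypothetical sketch of that last step. The template wording and the variable user_question are my own, and the actual LLM call is left out because it depends on the provider you choose:

# top-k documents returned by the Chroma query above ("documents" holds one list per query text)
retrieved_chunks = similar_vector_entries["documents"][0]
user_question = "Which flowers are mentioned in my notes?"

prompt = (
    "Answer the question using only the context below.\n\n"
    "Context:\n"
    + "\n".join(f"- {chunk}" for chunk in retrieved_chunks)
    + f"\n\nQuestion: {user_question}\nAnswer:"
)

print(prompt)  # send this string to GPT, LLaMA, or any other LLM of your choice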
Summary
Vector search is becoming increasingly popular as Machine Learning models can now accurately convert various content into vectors. Not only are there more and more dedicated vector databases, but existing SQL, NoSQL, and text search databases are also incorporating vector search capabilities into their products. This is to either improve their search mechanisms or provide a product for those specifically looking for databases with vector search capabilities.
The interest in Vector Stores is only growing. Thanks to advances in Transformer Models in recent years, we can now turn text modules into vectors with confidence. This unlocks a world of mathematical possibilities when working with text.
Enjoyed the story?
- Subscribe for free to get notified when I publish a new story.
- Want to read more than 3 free stories a month? – Become a Medium member for 5$/month. You can support me by using my referral link when you sign up. I’ll receive a commission at no extra cost to you.
Feel free to reach out to me on LinkedIn !
You can find all code snippets on Github. Have fun 🙂
References
Andrej Karpathy. (2023, April 15). @sinclanich np.array people keep reaching for much fancier things way too fast these days [Tweet]. Twitter. https://twitter.com/karpathy/status/1647374645316968449
Chroma. (2023, April 7). Chroma raises $18M seed round. https://www.trychroma.com/blog/seed
Cook, J. (2022, March 1). SeMI Technologies secures $16 million in Series A Round – Business Leader News. Business Leader. https://www.businessleader.co.uk/semi-technologies-secures-16-million-in-series-a-round/
Github. (2023, August 23). Pgvector Github Repo. https://github.com/pgvector/pgvector
Lample, G. (2023, August 3). Inquiry about the maximum number of tokens that Llama can handle · Issue #148 · facebookresearch/llama. GitHub. https://github.com/facebookresearch/llama/issues/148
Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2023). Lost in the Middle: How Language Models Use Long Contexts (arXiv:2307.03172). arXiv. http://arxiv.org/abs/2307.03172
Miller, R. (2022, March 29). Pinecone announces $28M Series A for purpose-built database aimed at data scientists | TechCrunch. https://techcrunch.com/2022/03/29/pinecone-announces-28m-series-a-for-purpose-built-database-aimed-at-data-scientists/
OpenAI. (2023, July 24). OpenAI Platform. https://platform.openai.com
Raschka, S. (2023). LinkedIn Sebastian Raschka. https://www.linkedin.com/posts/sebastianraschka_llm-ai-machinelearning-activity-7083427280605089792-MS_N/?utm_source=share&utm_medium=member_desktop
Trabelsi, E. (2021, September 8). Comprehensive Guide To Approximate Nearest Neighbors Algorithms. Medium. https://towardsdatascience.com/comprehensive-guide-to-approximate-nearest-neighbors-algorithms-8b94f057d6b6
Yadav, R. (2023, May 3). An Evaluation of Vector Database Systems: Features, and Use Cases. Medium. https://blog.devgenius.io/an-evaluation-of-vector-database-systems-features-and-use-cases-9a90b05eb51f