Getting Started with Weaviate: A Beginner’s Guide to Search with Vector Databases

If you landed on this article, I assume you have been playing around with building an app with a large language model (LLM) and came across the term vector database.
The tool landscape around building apps with LLMs is growing rapidly, with tools such as LangChain or LlamaIndex gaining popularity.
In a recent article, I described how to get started with LangChain, and in this article, I want to continue exploring the LLM tool landscape by playing around with Weaviate.
What is Weaviate?
Weaviate is an open-source vector database. It enables you to store data objects and vector embeddings and query them based on similarity measures.
Vector databases have been getting much attention since LLMs entered the media spotlight. Probably the most popular use case of vector databases in the context of LLMs is to "provide LLMs with long-term memory".
If you need a refresher on the concept of vector databases, you might want to have a look at my previous article.
In this tutorial, we will walk through how to populate a Weaviate vector database with embeddings of your dataset. Then we will go over three different ways you can retrieve information from it:

- Vector search
- Question answering
- Generative search
Prerequisites
To follow along in this tutorial, you will need to have the following:
- Python 3 environment
- OpenAI API key (or alternatively, an API key for Hugging Face, Cohere, or PaLM)
A note on the API key: In this tutorial, we will generate embeddings from text via an inference service (in this case, OpenAI). Depending on which inference service you use, make sure to check the provider’s pricing page to avoid unexpected costs. For example, the Ada model (version 2) used here costs $0.0001 per 1,000 tokens at the time of writing and resulted in less than one cent of inference costs for this tutorial.
Setup
You can run Weaviate either on your own instances (using Docker, Kubernetes, or Embedded Weaviate) or as a managed service using Weaviate Cloud Services (WCS). For this tutorial, we will run a Weaviate instance with WCS, as this is the recommended and most straightforward way.
How to Create a Cluster with Weaviate Cloud Services (WCS)
To be able to use the service, you first need to register with WCS.
Once you are registered, you can create a new Weaviate Cluster by clicking the "Create cluster" button.

For this tutorial, we will be using the free trial plan, which will provide you with a sandbox for 14 days. (You won’t have to add any payment information. Instead, the sandbox simply expires after the trial period. But you can create a new free trial sandbox anytime.)
Under the "Free sandbox" tab, make the following settings:
- Enter a cluster name
- Enable Authentication (set to "YES")

Finally, click "Create" to create your sandbox instance.
How to Install Weaviate in Python
Last but not least, add the `weaviate-client` package to your Python environment with `pip`:

```
$ pip install weaviate-client
```

and import the library:

```python
import weaviate
```
How To Access a Weaviate Cluster Through a Client
For the next step, you will need the following two pieces of information to access your cluster:
- The cluster URL
- Weaviate API key (under "Enabled – Authentication")

Now, you can instantiate a Weaviate client to access your Weaviate cluster as follows:

```python
auth_config = weaviate.AuthApiKey(api_key="YOUR-WEAVIATE-API-KEY")  # Replace with your Weaviate instance API key

# Instantiate the client
client = weaviate.Client(
    url="https://<your-sandbox-name>.weaviate.network",  # Replace with your Weaviate cluster URL
    auth_client_secret=auth_config,
    additional_headers={
        "X-OpenAI-Api-Key": "YOUR-OPENAI-API-KEY",  # Replace with your OpenAI key
    }
)
```
As you can see, we are using the OpenAI API key under `additional_headers` to access the embedding model later. If you are using a provider other than OpenAI, change the header key to the one that applies: `X-Cohere-Api-Key`, `X-HuggingFace-Api-Key`, or `X-Palm-Api-Key`.

To check if everything is set up correctly, run:

```python
client.is_ready()
```

If it returns `True`, you’re all set for the next steps.
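If you are scripting the setup, a small guard (a sketch, not part of the client library) can turn a silent `False` into a clear error:

```python
# Sketch: fail fast with a helpful message if the cluster is unreachable.
# `client` is assumed to be the Weaviate client instantiated above.
def ensure_ready(client) -> None:
    if not client.is_ready():
        raise RuntimeError(
            "Weaviate cluster is not ready - check the cluster URL and API key."
        )
```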
How to Create and Populate a Weaviate Vector Database
Now, we’re ready to create a vector database in Weaviate and populate it with some data.
For this tutorial, we will use the first 100 rows of the 200.000+ Jeopardy Questions dataset [1] from Kaggle.
```python
import pandas as pd

df = pd.read_csv("your_file_path.csv", nrows=100)
```
![First few rows of the 200.000+ Jeopardy Questions dataset [1] from Kaggle.](https://towardsdatascience.com/wp-content/uploads/2023/07/1PFx77R8i42o_1Zs3PC_HiQ.png)
A note on the number of tokens and related costs: In the following example, we will embed the columns "category", "question", and "answer" for the first 100 rows. Based on a calculation with the `tiktoken` library, this will result in roughly 3,000 tokens to embed, which comes to roughly $0.0003 in inference costs with OpenAI’s Ada model (version 2) as of July 2023.
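The arithmetic behind that estimate is simple (the token count comes from the tiktoken calculation mentioned above; the per-token price is OpenAI's published Ada v2 rate as of July 2023):

```python
# Back-of-the-envelope inference cost for embedding the first 100 rows.
# 3,000 tokens comes from the tiktoken calculation mentioned above;
# $0.0001 per 1,000 tokens is the Ada v2 price as of July 2023.
tokens = 3_000
price_per_1k = 0.0001  # USD per 1,000 tokens

cost = tokens / 1_000 * price_per_1k
print(f"Estimated embedding cost: ${cost:.4f}")  # roughly $0.0003
```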
Step 1: Create a Schema
First, we need to define the underlying data structure and some configurations:

- `class`: What will the collection of objects in this vector space be called?
- `properties`: The properties of an object, including the property name and data type. In the Pandas DataFrame analogy, these would be the columns of the DataFrame.
- `vectorizer`: The model that generates the embeddings. For text objects, you would typically select one of the [text2vec](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules) modules (`text2vec-cohere`, `text2vec-huggingface`, `text2vec-openai`, or `text2vec-palm`) according to the provider you are using.
- `moduleConfig`: Here, you can define the details of the used modules. E.g., the vectorizer is a module for which you can define which model and version to use.
```python
class_obj = {
    # Class definition
    "class": "JeopardyQuestion",

    # Property definitions
    "properties": [
        {
            "name": "category",
            "dataType": ["text"],
        },
        {
            "name": "question",
            "dataType": ["text"],
        },
        {
            "name": "answer",
            "dataType": ["text"],
        },
    ],

    # Specify a vectorizer
    "vectorizer": "text2vec-openai",

    # Module settings
    "moduleConfig": {
        "text2vec-openai": {
            "vectorizeClassName": False,
            "model": "ada",
            "modelVersion": "002",
            "type": "text"
        },
    },
}
```
In the above schema, you can see that we will create a class called `"JeopardyQuestion"` with the three text properties `"category"`, `"question"`, and `"answer"`. The vectorizer we are using is OpenAI’s Ada model (version 2). All properties will be vectorized, but not the class name (`"vectorizeClassName": False`). If you have properties you don’t want to embed, you can specify this per property (see the docs).
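For example, per the Weaviate docs, a property can be excluded from vectorization via a property-level `moduleConfig`. A sketch with a hypothetical `airDate` property (not part of this tutorial's schema):

```python
# Sketch: a property excluded from vectorization via property-level moduleConfig.
# The "airDate" property is hypothetical, for illustration only.
skipped_property = {
    "name": "airDate",
    "dataType": ["text"],
    "moduleConfig": {
        "text2vec-openai": {
            "skip": True,  # do not include this property in the vector embedding
        }
    },
}
```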
Once you have defined the schema, you can create the class with the `create_class()` method:

```python
client.schema.create_class(class_obj)
```
To check if the class has been created successfully, you can review its schema as follows:
```python
client.schema.get("JeopardyQuestion")
```
The created schema looks as shown below:
```json
{
    "class": "JeopardyQuestion",
    "invertedIndexConfig": {
        "bm25": {
            "b": 0.75,
            "k1": 1.2
        },
        "cleanupIntervalSeconds": 60,
        "stopwords": {
            "additions": null,
            "preset": "en",
            "removals": null
        }
    },
    "moduleConfig": {
        "text2vec-openai": {
            "model": "ada",
            "modelVersion": "002",
            "type": "text",
            "vectorizeClassName": false
        }
    },
    "properties": [
        {
            "dataType": ["text"],
            "indexFilterable": true,
            "indexSearchable": true,
            "moduleConfig": {
                "text2vec-openai": {
                    "skip": false,
                    "vectorizePropertyName": false
                }
            },
            "name": "category",
            "tokenization": "word"
        },
        {
            "dataType": ["text"],
            "indexFilterable": true,
            "indexSearchable": true,
            "moduleConfig": {
                "text2vec-openai": {
                    "skip": false,
                    "vectorizePropertyName": false
                }
            },
            "name": "question",
            "tokenization": "word"
        },
        {
            "dataType": ["text"],
            "indexFilterable": true,
            "indexSearchable": true,
            "moduleConfig": {
                "text2vec-openai": {
                    "skip": false,
                    "vectorizePropertyName": false
                }
            },
            "name": "answer",
            "tokenization": "word"
        }
    ],
    "replicationConfig": {
        "factor": 1
    },
    "shardingConfig": {
        "virtualPerPhysical": 128,
        "desiredCount": 1,
        "actualCount": 1,
        "desiredVirtualCount": 128,
        "actualVirtualCount": 128,
        "key": "_id",
        "strategy": "hash",
        "function": "murmur3"
    },
    "vectorIndexConfig": {
        "skip": false,
        "cleanupIntervalSeconds": 300,
        "maxConnections": 64,
        "efConstruction": 128,
        "ef": -1,
        "dynamicEfMin": 100,
        "dynamicEfMax": 500,
        "dynamicEfFactor": 8,
        "vectorCacheMaxObjects": 1000000000000,
        "flatSearchCutoff": 40000,
        "distance": "cosine",
        "pq": {
            "enabled": false,
            "bitCompression": false,
            "segments": 0,
            "centroids": 256,
            "encoder": {
                "type": "kmeans",
                "distribution": "log-normal"
            }
        }
    },
    "vectorIndexType": "hnsw",
    "vectorizer": "text2vec-openai"
}
```
Step 2: Import data into Weaviate
At this stage, the vector database has a schema but is still empty. So, let’s populate it with our dataset. This process is also called "upserting".
We will upsert the data in batches of 200. If you paid attention, you know this isn’t necessary here because we only have 100 rows of data. But once you are ready to upsert larger amounts of data, you will want to do this in batches. That’s why I’ll leave the code for batching here:
```python
from weaviate.util import generate_uuid5

with client.batch(
    batch_size=200,  # Specify batch size
    num_workers=2,   # Parallelize the process
) as batch:
    for _, row in df.iterrows():
        question_object = {
            "category": row.category,
            "question": row.question,
            "answer": row.answer,
        }
        batch.add_data_object(
            question_object,
            class_name="JeopardyQuestion",
            uuid=generate_uuid5(question_object)
        )
```
Although Weaviate generates a universally unique identifier (`uuid`) automatically, we manually generate the `uuid` with the `generate_uuid5()` function from the `question_object` itself, to avoid importing duplicate items.
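The deduplication works because `generate_uuid5()` produces a deterministic, name-based UUID (version 5): the same object content always hashes to the same identifier. The idea can be sketched with the standard library alone (the namespace below is arbitrary, and Weaviate's own implementation may differ in detail):

```python
import json
import uuid

def deterministic_uuid(obj: dict) -> str:
    # Serialize the object canonically, then hash it into a version-5 UUID.
    serialized = json.dumps(obj, sort_keys=True)
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, serialized))

a = {"category": "SCIENCE", "question": "Q", "answer": "A"}
b = {"answer": "A", "question": "Q", "category": "SCIENCE"}  # same content, different key order

assert deterministic_uuid(a) == deterministic_uuid(b)  # identical content -> identical id
```

Importing the same object twice therefore targets the same UUID, so a re-run overwrites the existing object instead of creating a duplicate.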
For a sanity check, you can review the number of imported objects with the following code snippet:
```python
client.query.aggregate("JeopardyQuestion").with_meta_count().do()
```

```
{'data': {'Aggregate': {'JeopardyQuestion': [{'meta': {'count': 100}}]}}}
```
How to Query the Weaviate Vector Database
The most common operation you will do with a vector database is to retrieve objects. To retrieve objects, you query the Weaviate vector database with the `get()` function:

```python
client.query.get(
    <Class>,
    [<properties>]
).<arguments>.do()
```
- `Class`: specifies the name of the class of objects to be retrieved. Here: `"JeopardyQuestion"`
- `properties`: specifies the properties of the objects to be retrieved. Here: one or more of `"category"`, `"question"`, and `"answer"`.
- `arguments`: specifies the search criteria to retrieve the objects, such as limits or aggregations. We will cover some of these in the following examples.
Let’s retrieve some entries from the `JeopardyQuestion` class with the `get()` function to see what they look like. In the Pandas analogy, you can think of the following as `df.head(2)`. Because the `get()` function’s response is in JSON format, we will import the related library to display the result in a visually appealing format.

```python
import json

res = (
    client.query.get("JeopardyQuestion",
                     ["question", "answer", "category"])
    .with_additional(["id", "vector"])
    .with_limit(2)
    .do()
)

print(json.dumps(res, indent=4))
```
```
{
    "data": {
        "Get": {
            "JeopardyQuestion": [
                {
                    "_additional": {
                        "id": "064fee53-f8fd-4513-9294-432170cc9f77",
                        "vector": [ -0.02465364, ...]  # Vector is truncated for better readability
                    },
                    "answer": "(Lou) Gehrig",
                    "category": "ESPN's TOP 10 ALL-TIME ATHLETES",
                    "question": "No. 10: FB/LB for Columbia U. in the 1920s; MVP for the Yankees in '27 & '36; \"Gibraltar in Cleats\""
                },
                {
                    "_additional": {
                        "id": "1041117a-34af-40a4-ad05-3dae840ad6b9",
                        "vector": [ -0.031970825, ...]  # Vector is truncated for better readability
                    },
                    "answer": "Jim Thorpe",
                    "category": "ESPN's TOP 10 ALL-TIME ATHLETES",
                    "question": "No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves"
                }
            ]
        }
    }
}
```
In the above code snippet, you can see that we are retrieving objects from the `"JeopardyQuestion"` class. We specified to retrieve the properties `"category"`, `"question"`, and `"answer"`.

We specified two additional arguments: With the `.with_additional()` argument, we retrieve additional information about the object’s `id` and its vector embedding. And with the `.with_limit(2)` argument, we retrieve only two objects. This limitation is important, and you will see it again in the later examples: retrieving objects from a vector database does not return exact matches but the most similar objects, so you need to cap the number of results.
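Since the response is just a nested dictionary, a tiny helper (a sketch, not part of the Weaviate client) makes it convenient to pull out the returned objects:

```python
def extract_objects(res: dict, class_name: str) -> list:
    # Walk the data -> Get -> <ClassName> nesting of a query response.
    return res.get("data", {}).get("Get", {}).get(class_name, [])

# Minimal response in the shape shown above
sample = {"data": {"Get": {"JeopardyQuestion": [{"answer": "Jim Thorpe"}]}}}
objects = extract_objects(sample, "JeopardyQuestion")
print(objects[0]["answer"])  # Jim Thorpe
```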
Vector search
Now, we’re ready to do some vector search! What’s cool about retrieving information from a vector database is that you can, for example, tell it to retrieve Jeopardy questions related to the concept of animals.

For this, we can use the `.with_near_text()` argument and pass it the `"concepts"` we are interested in, as shown below:
```python
res = (
    client.query.get(
        "JeopardyQuestion",
        ["question", "answer", "category"])
    .with_near_text({"concepts": ["animals"]})
    .with_limit(2)
    .do()
)
```
The specified `vectorizer` then converts the input text (`"animals"`) to a vector embedding and retrieves the two closest results:
```
{
    "data": {
        "Get": {
            "JeopardyQuestion": [
                {
                    "answer": "an octopus",
                    "category": "SEE & SAY",
                    "question": "Say the name of <a href=\"http://www.j-archive.com/media/2010-07-06_DJ_26.jpg\" target=\"_blank\">this</a> type of mollusk you see"
                },
                {
                    "answer": "the ant",
                    "category": "3-LETTER WORDS",
                    "question": "In the title of an Aesop fable, this insect shared billing with a grasshopper"
                }
            ]
        }
    }
}
```
You can already see how cool this is: The vector search returned two questions whose answers are animals, drawn from two completely different categories. With a classical keyword search, you would first have had to define a list of animals and then retrieve all questions that contain one of them.
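For contrast, the keyword-search workaround would look roughly like this in plain Python (the animal list and the question data are made up for illustration):

```python
# Naive keyword search: you must enumerate the animals up front,
# and you only match questions that literally contain one of those words.
animals = {"octopus", "ant", "grasshopper"}

questions = [
    {"question": "Say the name of this type of mollusk you see", "answer": "an octopus"},
    {"question": "In the title of an Aesop fable, this insect shared billing with a grasshopper", "answer": "the ant"},
]

matches = [
    q for q in questions
    if any(animal in (q["question"] + " " + q["answer"]).lower() for animal in animals)
]
print(len(matches))  # 2
```

Any question that mentions an animal not on your list is silently missed, whereas the vector search matches on semantic similarity.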
Question answering
Question answering is one of the most popular examples when it comes to combining LLMs with vector databases.
To enable question answering, you need to specify a vectorizer (which you should already have) and a question-answering module under the module configuration, as shown in the following example:
```python
# Module settings
"moduleConfig": {
    "text2vec-openai": {
        ...
    },
    "qna-openai": {
        "model": "text-davinci-002"
    }
},
```
For question answering, you need to add the `with_ask()` argument and also retrieve the `_additional` properties:
```python
ask = {
    "question": "Which animal was mentioned in the title of the Aesop fable?",
    "properties": ["answer"]
}

res = (
    client.query
    .get("JeopardyQuestion", [
        "question",
        "_additional {answer {hasAnswer property result} }"
    ])
    .with_ask(ask)
    .with_limit(1)
    .do()
)
```
The above piece of code looks through all questions that may contain the answer to the question "Which animal was mentioned in the title of the Aesop fable?" and returns the answer `"The ant"`:
```
{
    "JeopardyQuestion": [
        {
            "_additional": {
                "answer": {
                    "hasAnswer": true,
                    "property": "",
                    "result": " The ant"
                }
            },
            "question": "In the title of an Aesop fable, this insect shared billing with a grasshopper"
        }
    ]
}
```
Generative search
By incorporating LLMs, you can also transform the data before returning the search result. This concept is called generative search.
To enable generative search, you need to specify a generative module under the module configuration, as shown in the following example:
```python
# Module settings
"moduleConfig": {
    "text2vec-openai": {
        ...
    },
    "generative-openai": {
        "model": "gpt-3.5-turbo"
    }
},
```
For generative search, you only need to add the `with_generate()` argument to your previous vector search code, as shown below:
```python
res = (
    client.query.get(
        "JeopardyQuestion",
        ["question", "answer"])
    .with_near_text({"concepts": ["animals"]})
    .with_limit(1)
    .with_generate(single_prompt="Generate a question to which the answer is {answer}")
    .do()
)
```
The above piece of code does the following:

- Search for the question closest to the concept of `"animals"`
- Return the question `"Say the name of this type of mollusk you see"` with the answer `"an octopus"`
- Generate a completion for the prompt `"Generate a question to which the answer is an octopus"`, with the final result:
```
{
    "generate": {
        "error": null,
        "singleResult": "What sea creature has eight arms and is known for its intelligence and camouflage abilities?"
    }
}
```
Summary
The popularity of the LLM space has not only brought up many interesting new developer tools, such as LangChain or LlamaIndex. It has also shown us how to use existing tools, such as vector databases, to enhance the potential of LLM-powered applications.
In this article, we have started playing around with Weaviate to not only use vector databases for vector search but also for question-answering and generative search in combination with LLMs.
If you are interested in a more in-depth walkthrough, I recommend checking out this comprehensive four-part course on vector databases and Weaviate.
Disclaimer: I am a Developer Advocate at Weaviate at the time of this writing.
References
Dataset
[1] Ulrik Thyge Pedersen (2023). 200.000+ Jeopardy Questions in Kaggle Datasets.
License: Attribution 4.0 International (CC BY 4.0)
Image References
If not otherwise stated, all images are created by the author.
Web & Literature
[2] Weaviate (2023). Weaviate documentation (accessed July 14, 2023).