Efficient semantic search over unstructured text in Neo4j
Integrate the newly added vector index into LangChain to enhance your RAG applications
Since the advent of ChatGPT six months ago, the technology landscape has undergone a transformative shift. ChatGPT’s exceptional capacity for generalization has diminished the requirement for specialized deep learning teams and extensive training datasets to create custom NLP models. This has democratized access to a range of NLP tasks, such as summarization and information extraction, making them more readily available than ever before. However, we soon realized the limitations of ChatGPT-like models, such as the knowledge cutoff date and the lack of access to private information. In my opinion, what followed was the second wave of generative AI transformation with the rise of Retrieval Augmented Generation (RAG) applications, where you feed relevant information to the model at query time to construct better and more accurate answers.
RAG applications require a smart search tool that can retrieve additional information based on the user input, which allows the LLMs to produce more accurate and up-to-date answers. At first, the focus was mostly on retrieving information from unstructured text using semantic search. However, it soon became evident that a combination of structured and unstructured data is the best approach to RAG applications if you want to move beyond “Chat with your PDF” applications.
Neo4j was and is an excellent fit for handling structured information, but it struggled a bit with semantic search due to its brute-force approach. However, the struggle is in the past as Neo4j has introduced a new vector index in version 5.11 designed to efficiently perform semantic search over unstructured text or other embedded data modalities. The newly added vector index makes Neo4j a great fit for most RAG applications as it now works great with both structured and unstructured data.
In this blog post, I will show you how to set up a vector index in Neo4j and integrate it into the LangChain ecosystem. The code is available on GitHub.
Neo4j Environment setup
You need to set up Neo4j 5.11 or later to follow along with the examples in this blog post. The easiest way is to start a free instance on Neo4j Aura, which offers cloud instances of the Neo4j database. Alternatively, you can set up a local instance of the Neo4j database by downloading the Neo4j Desktop application and creating a local database instance.
After you have instantiated the Neo4j database, you can use the LangChain library to connect to it.
from langchain.graphs import Neo4jGraph
NEO4J_URI="neo4j+s://1234.databases.neo4j.io"
NEO4J_USERNAME="neo4j"
NEO4J_PASSWORD="-"
graph = Neo4jGraph(
    url=NEO4J_URI,
    username=NEO4J_USERNAME,
    password=NEO4J_PASSWORD
)
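If you would rather not hardcode credentials in a notebook, a minimal sketch like the following reads them from environment variables instead. The variable names mirror the ones used above; `load_neo4j_config` is a hypothetical helper introduced only for illustration and is not part of LangChain.

```python
import os
from typing import Dict, Mapping, Optional


def load_neo4j_config(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Collect Neo4j connection settings from environment variables.

    Hypothetical helper for illustration, not a LangChain API.
    """
    env = os.environ if env is None else env
    # The URI and password have no sensible default, so fail early.
    missing = [key for key in ("NEO4J_URI", "NEO4J_PASSWORD") if not env.get(key)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "url": env["NEO4J_URI"],
        # Assumption: fall back to the "neo4j" user when none is set.
        "username": env.get("NEO4J_USERNAME", "neo4j"),
        "password": env["NEO4J_PASSWORD"],
    }
```

The returned dictionary uses the same keyword names as the constructor call above, so it can be unpacked as `Neo4jGraph(**load_neo4j_config())`.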
Setting up the Vector Index
The Neo4j vector index is powered by Lucene, which implements a Hierarchical Navigable Small World (HNSW) graph to perform approximate nearest neighbor (ANN) queries over the vector space.
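To see what an ANN index saves you, here is a minimal pure-Python sketch of the exact, brute-force alternative: score every stored vector against the query and sort. Its cost grows linearly with the number of stored vectors, whereas an HNSW graph visits only a fraction of them in exchange for approximate results. The function names are hypothetical and this is not how Neo4j or Lucene is implemented internally.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_top_k(
    query: Sequence[float], vectors: Sequence[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Exact nearest neighbors: compare the query with every stored vector."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    # Highest similarity first, then keep the k best (position, score) pairs.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```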
Neo4j’s implementation of the vector index is designed to index a single node property of a node label. For example, if you wanted to index nodes with the label Chunk on their node property embedding, you would use the following Cypher procedure.
CALL db.index.vector.createNodeIndex(
'wikipedia', // index name
'Chunk', // node label
'embedding', // node property
1536, // vector size
'cosine' // similarity metric
)
Along with the index name, node label, and property, you must specify the vector size (embedding dimension), and the similarity metric. We will be using OpenAI’s text-embedding-ada-002 embedding model, which uses vector size 1536 to represent text in the embedding space. At the moment, only the cosine and Euclidean similarity metrics are available. OpenAI suggests using the cosine similarity metric when using their embedding model.
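If you create indexes from Python, the same procedure call can be issued with query parameters rather than literals. The sketch below builds the statement and its parameters and rejects metrics other than the two mentioned above. The helper name `create_vector_index_call` and the parameter names are assumptions made for this example.

```python
from typing import Any, Dict, Tuple

# The two similarity metrics mentioned in the text.
SUPPORTED_METRICS = {"cosine", "euclidean"}


def create_vector_index_call(
    name: str, label: str, prop: str, dimensions: int, metric: str = "cosine"
) -> Tuple[str, Dict[str, Any]]:
    """Return a parameterized Cypher statement and its parameters.

    Hypothetical helper: it only prepares the call, it does not run it.
    """
    if metric not in SUPPORTED_METRICS:
        raise ValueError(f"Unsupported similarity metric: {metric!r}")
    if dimensions < 1:
        raise ValueError("The vector size must be a positive integer")
    query = (
        "CALL db.index.vector.createNodeIndex("
        "$name, $label, $property, $dimensions, $metric)"
    )
    params = {
        "name": name,
        "label": label,
        "property": prop,
        "dimensions": dimensions,
        "metric": metric,
    }
    return query, params
```

With the connection from earlier, the pair could then be passed on as `graph.query(query, params)`.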
Populating the Vector index
Neo4j is schema-less by design, which means it doesn’t enforce any restrictions on what goes into a node property. For example, the embedding property of the Chunk node could store integers, lists of integers, or even strings. Let’s try this out.
WITH [1, [1,2,3], ["2","5"], [x in range(0, 1535) | toFloat(x)]] AS exampleValues
UNWIND range(0, size(exampleValues) - 1) as index
CREATE (:Chunk {embedding: exampleValues[index], index: index})
This query creates a Chunk node for each element in the list and uses the element as the embedding property value. For example, the first Chunk node will have the embedding property value 1, the second node [1,2,3], and so on. Neo4j doesn’t enforce any rules on what you can store under node properties. However, the vector index has strict requirements for the type and dimension of the values it indexes.
We can test which values were indexed by performing a vector index search.
CALL db.index.vector.queryNodes(
'wikipedia', // index name
3, // topK neighbors to return
[x in range(0,1535) | toFloat(x) / 2] // input vector
)
YIELD node, score
RETURN node.index AS index, score
If you run this query, you will get only a single node back, even though you requested the top 3 neighbors. Why is that? The vector index only indexes property values that are lists of floats of the specified size. In this example, only one embedding property value was a list of floats with the configured length of 1536.
A node is indexed by the vector index if all the following are true:
- The node contains the configured label.
- The node contains the configured property key.
- The respective property value is of type LIST<FLOAT>.
- The size() of the respective value is the same as the configured dimensionality.
- The value is a valid vector for the configured similarity function.
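The property-value rules in the list above can be mirrored in a few lines of Python, which is handy for checking embeddings before you write them. This is only a sketch: `is_indexable` is a hypothetical helper, and it assumes that "a valid vector" for the cosine metric means a finite, non-zero norm.

```python
import math
from typing import Any


def is_indexable(value: Any, dimensions: int) -> bool:
    """Approximate the property-value checks of a cosine vector index."""
    # The value must be a list ...
    if not isinstance(value, list):
        return False
    # ... whose elements are all floats (integers and strings do not count).
    if not all(isinstance(x, float) for x in value):
        return False
    # Its size must match the configured dimensionality.
    if len(value) != dimensions:
        return False
    # Assumption: a valid cosine vector has a finite, non-zero norm.
    norm = math.sqrt(sum(x * x for x in value))
    return math.isfinite(norm) and norm > 0.0


# The four example values from the Cypher statement above.
example_values = [1, [1, 2, 3], ["2", "5"], [float(x) for x in range(1536)]]
flags = [is_indexable(v, 1536) for v in example_values]
```

Only the last value passes all checks, which matches the single node returned by the index query.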
Integrating the vector index into the LangChain ecosystem
Now we will implement a simple custom LangChain class that will use the Neo4j Vector index to retrieve relevant information to generate accurate and up-to-date answers. But first, we have to populate the vector index.
The task will consist of the following steps:
- Retrieve a Wikipedia article
- Chunk the text
- Store the text along with its vector representation in Neo4j
- Implement a custom LangChain class to support RAG applications
In this example, we will fetch only a single Wikipedia article. I have decided to use the Baldur’s Gate 3 page.
import wikipedia
bg3 = wikipedia.page(pageid=60979422)
Next, we need to chunk and embed the text. We will split the text by section using the double newline delimiter and then use OpenAI’s embedding model to represent each section with an appropriate vector.
import os
from langchain.embeddings import OpenAIEmbeddings
os.environ["OPENAI_API_KEY"] = "API_KEY"
embeddings = OpenAIEmbeddings()
chunks = [{'text':el, 'embedding': embeddings.embed_query(el)} for
el in bg3.content.split("\n\n") if len(el) > 50]
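The list comprehension above calls the embedding model once per chunk. As a sketch of an alternative, the helper below separates chunking from embedding and embeds all chunks in one call through an injected function. `build_chunks` and its parameter names are hypothetical; the only requirement is a callable that maps a list of strings to a list of vectors.

```python
from typing import Callable, Dict, List

# Any function that maps a list of texts to one vector per text.
Embedder = Callable[[List[str]], List[List[float]]]


def build_chunks(
    content: str, embed_documents: Embedder, min_length: int = 50
) -> List[Dict[str, object]]:
    """Split text on blank lines, drop short pieces, embed the rest."""
    # Same delimiter and length filter as the comprehension above.
    texts = [el for el in content.split("\n\n") if len(el) > min_length]
    # One call for all chunks instead of one call per chunk.
    vectors = embed_documents(texts) if texts else []
    return [{"text": t, "embedding": v} for t, v in zip(texts, vectors)]
```

With LangChain, the `embed_documents` method of the embeddings object fits this shape, so a call such as `build_chunks(bg3.content, embeddings.embed_documents)` should produce the same list structure as before.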
Before we move on to the LangChain class, we need to import the text chunks into Neo4j.
graph.query("""
UNWIND $data AS row
CREATE (c:Chunk {text: row.text})
WITH c, row
CALL db.create.setVectorProperty(c, 'embedding', row.embedding)
YIELD node
RETURN distinct 'done'
""", {'data': chunks})
You may have noticed that I used the db.create.setVectorProperty procedure to store the vectors in Neo4j. This procedure verifies that the property value is indeed a list of floats. It also has the added benefit of reducing the storage space of the vector property by approximately 50%. Therefore, it is recommended to always use this procedure when storing vectors in Neo4j.
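A single article fits comfortably into one request, but for larger corpora you may prefer to send the rows in smaller batches so that each transaction stays small. The generator below is a minimal sketch of that idea. The name `batched` and the default batch size of 100 are assumptions, not recommendations from Neo4j.

```python
from typing import Dict, Iterator, List


def batched(rows: List[Dict], batch_size: int = 100) -> Iterator[List[Dict]]:
    """Yield consecutive slices of at most batch_size rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


# Sketch of the import loop, where import_query is a hypothetical variable
# holding the UNWIND statement shown above:
#
# for batch in batched(chunks):
#     graph.query(import_query, {"data": batch})
```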
Now we can go and implement the custom LangChain class used to retrieve information from Neo4j vector index and use it to generate answers. First, we will define the Cypher statement used to retrieve information.
vector_search = """
WITH $embedding AS e
CALL db.index.vector.queryNodes('wikipedia',$k, e) yield node, score
RETURN node.text AS result
"""
As you can see, I have hardcoded the index name. You can make this dynamic by adding appropriate parameters if you wish.
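One way to make the index name dynamic is to pass it as a query parameter alongside the embedding and k. The sketch below is a variation on the statement above, not the version used in the rest of this post: the parameter name `$index`, the helper `vector_search_params`, and the extra score column are assumptions added for illustration.

```python
from typing import Any, Dict, Sequence

# Variation of the search statement with the index name as a parameter.
VECTOR_SEARCH_DYNAMIC = """
CALL db.index.vector.queryNodes($index, $k, $embedding) YIELD node, score
RETURN node.text AS result, score
"""


def vector_search_params(
    index_name: str, embedding: Sequence[float], k: int = 3
) -> Dict[str, Any]:
    """Build the parameter map for VECTOR_SEARCH_DYNAMIC."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    return {"index": index_name, "k": k, "embedding": list(embedding)}
```

The statement and parameters could then be run as `graph.query(VECTOR_SEARCH_DYNAMIC, vector_search_params("wikipedia", embedding))`.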
The custom LangChain class is fairly straightforward to implement.
class Neo4jVectorChain(Chain):
    """Chain for question-answering against a Neo4j vector index."""

    graph: Neo4jGraph = Field(exclude=True)
    input_key: str = "query"  #: :meta private:
    output_key: str = "result"  #: :meta private:
    embeddings: OpenAIEmbeddings = OpenAIEmbeddings()
    qa_chain: LLMChain = LLMChain(llm=ChatOpenAI(temperature=0), prompt=CHAT_PROMPT)

    def _call(self, inputs: Dict[str, str], run_manager, k=3) -> Dict[str, Any]:
        """Embed a question and do vector search."""
        question = inputs[self.input_key]
        # Embed the question
        embedding = self.embeddings.embed_query(question)
        # Retrieve the k most similar chunks from the vector index
        context = self.graph.query(
            vector_search, {'embedding': embedding, 'k': k})
        context = [el['result'] for el in context]
        # Generate the answer
        result = self.qa_chain(
            {"question": question, "context": context},
        )
        final_result = result[self.qa_chain.output_key]
        return {self.output_key: final_result}
I have omitted some boilerplate code to make it more readable. Essentially, when you call the Neo4jVectorChain, the following steps are executed:
- Embed the question using the relevant embedding model
- Use the text embedding value to retrieve most similar content from the vector index
- Use the provided context from similar content to generate the answer
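Stripped of the LangChain machinery, the three steps above amount to a small pipeline. The sketch below expresses it with injected callables so it can be run and tested without a database or an API key. `answer_question` and its parameter names are hypothetical and stand in for the embedding model, the vector index query, and the LLM call.

```python
from typing import Callable, List


def answer_question(
    question: str,
    embed: Callable[[str], List[float]],
    search: Callable[[List[float], int], List[str]],
    generate: Callable[[str, List[str]], str],
    k: int = 3,
) -> str:
    """Embed the question, retrieve similar text, then generate an answer."""
    # Step 1: embed the question.
    embedding = embed(question)
    # Step 2: fetch the k most similar chunks for that embedding.
    context = search(embedding, k)
    # Step 3: let the model answer using the retrieved context.
    return generate(question, context)
```

Keeping the three stages as separate callables also makes it easy to swap the embedding model or the retrieval query independently.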
We can now test our implementation.
vector_qa = Neo4jVectorChain(graph=graph, embeddings=embeddings, verbose=True)
vector_qa.run("What is the gameplay of Baldur's Gate 3 like?")
[Image: Response]
By using the verbose option, you can also evaluate the retrieved context from the vector index that was used to generate the answer.
Summary
Leveraging Neo4j’s new vector indexing capabilities, you can create a unified data source that powers Retrieval Augmented Generation applications effectively. This allows you to not only implement “Chat with your PDF or documentation” solutions but also to conduct real-time analytics, all from a single, robust data source. This multi-purpose utility can streamline your operations and enhance data synergy, making Neo4j a great solution for managing both structured and unstructured data.
As always, the code is available on GitHub.