Recreating Andrej Karpathy’s Weekend Project — a Movie Search Engine

Building a movie recommender system with OpenAI embeddings and a vector database

Leonie Monigatti
Towards Data Science


Stylized screenshot of the final movie recommender demo, built with OpenAI and Weaviate (Image by the author)

In April 2023, Andrej Karpathy, one of the founding members of OpenAI and former Director of AI at Tesla, shared a fun weekend hack: a movie search and recommendation engine.

The user interface is simple, with two key functionalities. First, there is a search bar where you can search for movies by title. Then, when you click on any movie, you get a list of its 40 most similar movies as recommendations.

Demo live at https://awesome-movies.life/

Despite its popularity, Karpathy unfortunately has not publicly shared the project’s source code.

Screenshot of comment under original Tweet (Screenshot by author)

So, grab yourself some popcorn, and let’s recreate it ourselves!

Prerequisites

This project is built on four primary components:

  • An OpenAI embedding model to generate the embeddings
  • A Weaviate vector database to store the embeddings, populated with a Python script
  • A frontend built with HTML, CSS, and JavaScript
  • A backend built with Node.js

Thus, to follow along in this tutorial, you will need the following:

  • Python for data processing and populating the vector database
  • Docker and Docker Compose for running the vector database locally
  • Node.js and npm for running the application locally
  • An OpenAI API key to access the OpenAI embedding model

Implementing a Movie Search Engine

This section analyzes Karpathy’s weekend hack and recreates it with a few twists of its own. To build a simple movie search engine, follow the steps below.

The full code is open source, and you can find it on GitHub.

Preparation: Movie dataset

Karpathy’s project indexes all 11,762 movies since 1970, including the plot and the summary from Wikipedia.

To achieve something similar without manually scraping Wikipedia, you can combine two datasets from Kaggle: one containing general movie metadata and descriptions, and one containing movie plots from Wikipedia.

The two datasets are merged on the movie title and release year and then filtered to movies released after 1970. You can find the detailed preprocessing steps in the add_data.py file. The resulting DataFrame contains roughly 35,000 movies, of which about 8,500 have a plot in addition to the description, and looks as follows:

Preprocessed movies dataframe (Screenshot by author)
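
If you want to reproduce the preprocessing yourself, the merge-and-filter step boils down to a few lines of pandas. The following is only a minimal sketch; the file names and join-key column names are assumptions for illustration, so check add_data.py for the actual details.

import pandas as pd

# Hypothetical file names for illustration; see add_data.py for the real preprocessing
movies = pd.read_csv("movies_metadata.csv")  # titles, descriptions, genres, release year, ...
plots = pd.read_csv("wiki_movie_plots.csv")  # movie plots from Wikipedia

# Merge on movie title and release year (column names assumed),
# then keep only movies released after 1970
df = movies.merge(plots, on=["Name", "Year"], how="left")
df = df[df["Year"] > 1970]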

Step 1: Generate and store embeddings

The core of this demo project is the embeddings of the movie data objects, which are mainly used to recommend movies by plot similarity. In Karpathy’s project, vector embeddings are generated for each movie’s summary and plot, with two options for generating them: a simple term-frequency (TF-IDF) representation or OpenAI’s text-embedding-ada-002 embedding model.

The similarity between movies is then calculated on these representations of each movie’s Wikipedia summary and plot, with two choices for the similarity ranker:

  • k-Nearest Neighbor (kNN) using cosine similarity
  • Support Vector Machine

Karpathy suggests a combination of text-embedding-ada-002 and kNN for a good and fast default setting.

And last but not least, as stated in this infamous response, the vector embeddings are stored in np.array:

Screenshot of comment under original Tweet (Screenshot by author)
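
For intuition, ranking with kNN and cosine similarity over embeddings held in a plain NumPy array takes only a few lines. The following is an illustrative sketch of that approach, not Karpathy’s actual code (text-embedding-ada-002 vectors have 1,536 dimensions):

import numpy as np

# embeddings: array of shape (n_movies, 1536), one ada-002 vector per movie
# query_vec: vector of shape (1536,) for the selected movie
def knn_cosine(query_vec, embeddings, k=40):
    # Normalize so that the dot product equals the cosine similarity
    emb_norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    query_norm = query_vec / np.linalg.norm(query_vec)
    similarities = emb_norm @ query_norm
    # Return the indices of the k most similar movies, best first
    return np.argsort(-similarities)[:k]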

In this project, we will also use the text-embedding-ada-002 embedding model from OpenAI but store the vector embeddings in a vector database.

Namely, we will use Weaviate*, an open source vector database. Although I could argue that vector databases are much faster than storing your embeddings in a np.array because they use vector indexing, let’s be honest: at this scale (thousands of objects), you won’t notice any difference in speed. My main reason for using a vector database here is that Weaviate has many convenient built-in functionalities you can use out of the box, such as automatic vectorization using embedding models.

First, as shown in the add_data.py file, you need to set up your Weaviate client, which connects to a local Weaviate database instance, as follows. Additionally, you will define your OpenAI API key here to enable the usage of the integrated OpenAI modules.

# pip install weaviate-client
import os
import weaviate

openai_key = os.environ.get("OPENAI_API_KEY", "")

# Set up the client to connect to the local Weaviate instance
client = weaviate.Client(
    url="http://localhost:8080",
    additional_headers={
        "X-OpenAI-Api-Key": openai_key,
    },
)

Next, you will define a data collection called Movies to store the movie data objects, which is analogous to creating a table in a relational database. In this step, you define the text2vec-openai module as the vectorizer, which enables automatic data vectorization at import and query time, and in the module settings you specify the text-embedding-ada-002 embedding model. Additionally, you define cosine distance as the similarity measure.

movie_class_schema = {
    "class": "Movies",
    "description": "A collection of movies since 1970.",
    "vectorizer": "text2vec-openai",
    "moduleConfig": {
        "text2vec-openai": {
            "vectorizeClassName": False,
            "model": "ada",
            "modelVersion": "002",
            "type": "text"
        },
    },
    "vectorIndexConfig": {"distance": "cosine"},
}

Next, you define the movie data objects’ properties and specify for which properties to generate vector embeddings. In the following shortened code snippet, you can see that no vector embeddings are generated for the properties movie_id and title because of the "skip": True setting in the vectorizer module config. This is because we only want to generate vector embeddings for the description and plot.

movie_class_schema["properties"] = [
    {
        "name": "movie_id",
        "dataType": ["number"],
        "description": "The id of the movie",
        "moduleConfig": {
            "text2vec-openai": {
                "skip": True,
                "vectorizePropertyName": False
            }
        }
    },
    {
        "name": "title",
        "dataType": ["text"],
        "description": "The name of the movie",
        "moduleConfig": {
            "text2vec-openai": {
                "skip": True,
                "vectorizePropertyName": False
            }
        }
    },
    # shortened for brevity ...
    {
        "name": "description",
        "dataType": ["text"],
        "description": "Overview of the movie",
    },
    {
        "name": "plot",
        "dataType": ["text"],
        "description": "Plot of the movie from Wikipedia",
    },
]

# Create the Movies class in Weaviate
client.schema.create_class(movie_class_schema)

Finally, you define a batch process to populate the vector database:

# Configure the batch process for faster imports
client.batch.configure(batch_size=10)

# Import the data
for i in range(len(df)):
    item = df.iloc[i]

    movie_object = {
        'movie_id': float(item['id']),
        'title': str(item['Name']).lower(),
        # shortened for brevity ...
        'description': str(item['Description']),
        'plot': str(item['Plot']),
    }

    client.batch.add_data_object(movie_object, "Movies")

# Send any objects remaining in the last, partially filled batch
client.batch.flush()
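
Once the import has finished, a quick sanity check is to count the objects in the Movies collection. The following snippet assumes the same weaviate-client (v3) setup used above:

# Count the imported objects in the Movies collection
response = client.query.aggregate("Movies").with_meta_count().do()
print(response["data"]["Aggregate"]["Movies"][0]["meta"]["count"])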

Step 2: Search for movies

In Karpathy’s project, the search bar is a simple keyword-based search that tries to match your query to movie titles verbatim. When some users stated that they expected the search bar to support semantic search for movies, Karpathy agreed that this would be a good extension of the project:

Screenshot of comment under original Tweet (Screenshot by author)

In this project, you will enable three types of searches in the queries.js file:

  • keyword-based search (BM25),
  • semantic search, and
  • hybrid search, which is a combination of keyword-based search and semantic search.

Each of these searches will return num_movies = 20 movies with the properties ['title', 'poster_link', 'genres', 'year', 'director', 'movie_id'].

To enable keyword-based search, you can use a .withBm25() search query across the properties ['title', 'director', 'genres', 'actors', 'keywords', 'description', 'plot']. You can give the property 'title' a bigger weight by specifying 'title^3'.

async function get_keyword_results(text) {
    let data = await client.graphql
        .get()
        .withClassName('Movies')
        .withBm25({
            query: text,
            properties: ['title^3', 'director', 'genres', 'actors', 'keywords', 'description', 'plot'],
        })
        .withFields(['title', 'poster_link', 'genres', 'year', 'director', 'movie_id'])
        .withLimit(num_movies)
        .do()
        .then(info => {
            return info
        })
        .catch(err => {
            console.error(err)
        })
    return data;
}

To enable semantic search, you can use a .withNearText() search query. This will automatically vectorize the search query and retrieve its closest movies in the vector space.

async function get_semantic_results(text) {
    let data = await client.graphql
        .get()
        .withClassName('Movies')
        .withFields(['title', 'poster_link', 'genres', 'year', 'director', 'movie_id'])
        .withNearText({concepts: [text]})
        .withLimit(num_movies)
        .do()
        .then(info => {
            return info
        })
        .catch(err => {
            console.error(err)
        });
    return data;
}

To enable hybrid search, you can use a .withHybrid() search query. Setting alpha: 0.5 means that keyword-based search and semantic search are weighted equally.

async function get_hybrid_results(text) {
    let data = await client.graphql
        .get()
        .withClassName('Movies')
        .withFields(['title', 'poster_link', 'genres', 'year', 'director', 'movie_id'])
        .withHybrid({query: text, alpha: 0.5})
        .withLimit(num_movies)
        .do()
        .then(info => {
            return info
        })
        .catch(err => {
            console.error(err)
        });
    return data;
}

Step 3: Get similar movie recommendations

To get similar movie recommendations, you can do a .withNearObject() search query, as shown in the queries.js file. By passing the movie’s id, the query returns the num_movies = 20 closest movies to the given movie in the vector space.

async function get_recommended_movies(mov_id) {
    let data = await client.graphql
        .get()
        .withClassName('Movies')
        .withFields(['title', 'genres', 'year', 'poster_link', 'movie_id'])
        .withNearObject({id: mov_id})
        .withLimit(20)
        .do()
        .then(info => {
            return info;
        })
        .catch(err => {
            console.error(err)
        });
    return data;
}

Step 4: Run the demo

Finally, wrap everything up nicely in a web application with the iconic 2000s GeoCities aesthetic (I’m not going to bore you with frontend stuff), and voila! You’re all set!

To run the demo locally, clone the GitHub repository.

git clone git@github.com:weaviate-tutorials/awesome-moviate.git

Navigate to the demo’s directory and set up a virtual environment.

python -m venv .venv             
source .venv/bin/activate

Make sure the OPENAI_API_KEY environment variable is set in your virtual environment. Additionally, run the following command in the directory to install all required dependencies in your virtual environment.

pip install -r requirements.txt

Next, set your OPENAI_API_KEY in the docker-compose.yml file and run the following command to run Weaviate locally via Docker.

docker compose up -d

Once your Weaviate instance is up and running, run the add_data.py file to populate your vector database.

python add_data.py

Before you can run your application, install all required node modules.

npm install

Finally, run the following command to start up your movie search engine application locally.

npm run start

Now, navigate to http://localhost:3000/ and start playing around with your application.

Summary

This article has recreated Andrej Karpathy’s fun weekend project of a movie search engine/recommender system. Below, you can see a short video of the finished live demo:

Demo live at https://awesome-moviate.weaviate.io/

In contrast to the original project, this project uses a vector database to store the embeddings. Additionally, the search functionality was extended to allow for semantic and hybrid search.

If you play around with it, you’ll notice that it is not perfect, but just as Karpathy has said:

“it works ~okay hah, have to tune it a bit more.”

You can find the project’s open source code on GitHub and tweak it if you like. Some suggestions for further improvement are to play around with vectorizing different properties, to tweak the weighting between keyword-based and semantic search, or to switch out the embedding model for an open source alternative.

Enjoyed This Story?

Subscribe for free to get notified when I publish a new story.

Find me on LinkedIn, Twitter, and Kaggle!

Disclaimer

*At the time of this writing, I am a Developer Advocate at Weaviate.
