Modern Semantic Search for Images

A how-to article leveraging Python, Pinecone, Hugging Face, and the OpenAI CLIP model to create a semantic search application for your cloud photos.

Josh Poduska
Towards Data Science

--

Image by the author

You want to find “that one picture” from several years ago. You remember a few details about the setting and want to search based on a specific phrase. Apple Photos doesn’t offer semantic search, and Google Photos is limited to a few predetermined item classifiers. Neither will do well with this kind of search. I’ll demonstrate the issue with two unusual queries of my Google Photos: “donut birthday cake” and “busted lip from a snowball fight”. Then I’ll share how to build your own semantic image search application.

Demonstration: current limitations compared to modern semantic image search

Example #1

I like birthday cakes. I also like donuts. Last year, I had the brilliant idea to combine the two with a stack of donuts as my birthday cake. Let’s try to find it.

Google Photos query: “donut birthday cake”

Results: Six pictures of cakes with no donuts followed by the one I wanted.

Image by the author

Semantic Search App query: “donut birthday cake”

Results: Two images and a video that were exactly what I wanted.

Image by the author

Example #2

I went to the snow with my teenage son and a big group of his friends. They climbed on top of an abandoned train tunnel. “Throw snowballs all at once, and I’ll get a slow-motion video of it!”, I yelled. It was not my brightest moment as I didn’t foresee the obvious conclusion that I would end up being target practice for twenty teenage boys with strong arms.

Google Photos query: “busted lip from a snowball fight”

Results:

Image created by the author

The current Google image classification model is limited to words it has been trained on.

Semantic Search App query: “busted lip from a snowball fight”

Results: The busted-lip picture (not shown) and the video that preceded it came back as results one and two.

Image by the author

OpenAI CLIP model and application architecture

CLIP stands for Contrastive Language-Image Pretraining. It is an open-source, multi-modal, zero-shot model trained on millions of images paired with descriptive captions. That training teaches it to associate image pixels with text, which gives it the flexibility to look for things like “donut cakes” and “busted lips” — things you’d never think to include when training an image classifier.

Given an image and text descriptions, the model can predict that image's most relevant text description, without optimizing for a particular task.

Source: Nikos Karfitsas, Towards Data Science
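As a quick illustration (not part of the app itself), here is a minimal sketch of that zero-shot behavior using the Hugging Face transformers library; the image path and candidate captions are placeholders I made up.

from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# load the pretrained CLIP model and its processor
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("birthday.jpg")  # placeholder: any local photo
captions = [
    "a stack of donuts as a birthday cake",
    "a frosted layer cake",
    "a snowball fight",
]

# score the image against each candidate caption
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=1)
print(dict(zip(captions, probs[0].tolist())))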

The CLIP architecture that you find in most online tutorials is good enough for a POC but is not enterprise-ready. In these tutorials, CLIP and the Hugging Face processors hold embeddings in memory to act as the vector store for running similarity scores and retrieval.

Image created by the author
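For reference, that in-memory pattern looks roughly like the sketch below, assuming `images` is a list of PIL images you have already loaded. Everything stays in local memory, so nothing persists between runs and every query has to scan all of the embeddings.

from sentence_transformers import SentenceTransformer, util

img_model = SentenceTransformer('clip-ViT-B-32')
text_model = SentenceTransformer('sentence-transformers/clip-ViT-B-32-multilingual-v1')

# every embedding lives in local memory and acts as the "vector store"
img_embeddings = img_model.encode(images)
query_embedding = text_model.encode("donut birthday cake")

# cosine similarity of the query against every image, then take the best match
scores = util.cos_sim(query_embedding, img_embeddings)
best_match = scores.argmax().item()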

A vector database like Pinecone is a key component to scaling an application like this. It provides simplified, robust, enterprise-ready features such as batch and stream processing of images, enterprise management of embeddings, low latency retrieval, and metadata filtering.

Image created by the author

Building the app

The code and supporting files for this application can be found on GitHub at https://github.com/joshpoduska/llm-image-caption-semantic-search. Use them to build a semantic search application for your cloud photos.

The application runs locally on a laptop with sufficient memory. I tested it on a MacBook Pro.

Components needed to build the app

  • Pinecone or similar vector database for embedding storage and semantic search (the free version of Pinecone is sufficient for this tutorial)
  • Hugging Face models and pipelines
  • OpenAI CLIP model for image and query text embedding creation (accessible from Hugging Face)
  • Google Photos API to access your personal Google Photos

Helpful information before you start

Access your images

The Google Photos API has several key data fields of note. See the API reference for more details.

  • id is immutable
  • baseUrl lets you access the bytes of the media item; base URLs are only valid for 60 minutes

A combination of the pandas, json, and requests libraries can be used to load a DataFrame of your image IDs, URLs, and dates, as sketched below.
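Here is a rough sketch of that loading step. It assumes you have already completed the Google Photos API OAuth flow and hold a valid access token; ACCESS_TOKEN is a placeholder.

import pandas as pd
import requests

ACCESS_TOKEN = "..."  # obtained via the Google Photos API OAuth flow
url = "https://photoslibrary.googleapis.com/v1/mediaItems"
headers = {"Authorization": f"Bearer {ACCESS_TOKEN}"}

# page through the library and collect every media item
items, page_token = [], None
while True:
    params = {"pageSize": 100}
    if page_token:
        params["pageToken"] = page_token
    resp = requests.get(url, headers=headers, params=params).json()
    items.extend(resp.get("mediaItems", []))
    page_token = resp.get("nextPageToken")
    if not page_token:
        break

# keep the fields we care about: id, baseUrl, and creation date
df = pd.DataFrame({
    "id": [i["id"] for i in items],
    "baseUrl": [i["baseUrl"] for i in items],
    "creationTime": [i["mediaMetadata"]["creationTime"] for i in items],
})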

Generate image embeddings

With Hugging Face and the OpenAI CLIP model, this step is the simplest of the entire application.

# load the CLIP image embedding model from Hugging Face
from sentence_transformers import SentenceTransformer
img_model = SentenceTransformer('clip-ViT-B-32')

# `images` is a list of PIL images pulled from the Google Photos baseUrls
embeddings = img_model.encode(images)
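The images variable above is a list of PIL images. One way to build it is to download a resized copy of each photo from its baseUrl; the =w512-h512 suffix follows the baseUrl conventions in the API docs, the size is my choice, and depending on your setup you may also need to send your OAuth header with the request.

from io import BytesIO

import requests
from PIL import Image

# download a resized copy of each photo from its baseUrl
images = [
    Image.open(BytesIO(requests.get(url + "=w512-h512").content))
    for url in df["baseUrl"]
]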

Creating metadata

Semantic search is often enhanced with metadata filters. In this application, I extract the year, month, and day from each photo’s date and store them as a dictionary in a DataFrame column. Pinecone queries can then use this dictionary to filter searches on any of these metadata fields.
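Here is a small sketch of that step, assuming the DataFrame column names from the loading code earlier:

import pandas as pd

df["creationTime"] = pd.to_datetime(df["creationTime"])

# one metadata dictionary per photo, stored in its own DataFrame column
df["metadata"] = df["creationTime"].apply(
    lambda ts: {"year": ts.year, "month": ts.month, "day": ts.day}
)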

Here is the first row of my pandas DataFrame with the image fields, vectors, and metadata dictionary field.

Image by the author

Load embeddings

There are Pinecone optimizations for async and parallel loading. The base loading function is simple, as follows.

index.upsert(vectors=ids_vectors_chunk, async_req=True)
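Putting it together, here is a sketch of index creation and chunked, asynchronous loading, written against the classic pinecone-client interface used throughout this article. The index name, chunk size, and thread count are my own choices; the dimension of 512 matches the clip-ViT-B-32 embeddings.

import itertools

import pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")
pinecone.create_index("photo-search", dimension=512, metric="cosine")
index = pinecone.Index("photo-search", pool_threads=30)

def chunks(iterable, size=100):
    # yield successive chunks of (id, vector, metadata) tuples
    it = iter(iterable)
    chunk = list(itertools.islice(it, size))
    while chunk:
        yield chunk
        chunk = list(itertools.islice(it, size))

vectors = zip(df["id"], embeddings.tolist(), df["metadata"])

# send the upserts asynchronously, then wait for them all to finish
async_results = [
    index.upsert(vectors=ids_vectors_chunk, async_req=True)
    for ids_vectors_chunk in chunks(vectors, size=100)
]
[result.get() for result in async_results]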

Query embeddings

To query the images with the CLIP model, we need to embed the text of our semantic query. We do this by loading the matching CLIP text embedding model.

text_model = SentenceTransformer('sentence-transformers/clip-ViT-B-32-multilingual-v1')

Now we can create an embedding for our search phrase and compare that to the embeddings of the images stored in Pinecone.

# create the query vector from the search phrase
xq = text_model.encode(query).tolist()

# query Pinecone, filtering on the year and month metadata
# (years_filter, months_filter, and top_k are supplied by the user)
xc = index.query(
    xq,
    filter={
        "year": {"$in": years_filter},
        "month": {"$in": months_filter},
    },
    top_k=top_k,
    include_metadata=True,
)
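The response contains the match IDs, similarity scores, and stored metadata. Joining the IDs back to the DataFrame recovers each photo’s row for display:

# inspect the matches and pull the corresponding rows out of the DataFrame
for match in xc["matches"]:
    print(match["id"], round(match["score"], 3), match["metadata"])

hits = df[df["id"].isin([m["id"] for m in xc["matches"]])]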

Conclusion

The CLIP model is amazing. It is a general knowledge, zero-shot model that has learned to associate images with text in a way that frees it from the constraints of training an image classifier on pre-defined classes. When we combine this with the power of an enterprise-grade vector database like Pinecone, we can create semantic image search applications with low latency and high fidelity. This is just one of the exciting applications of generative AI sprouting up daily.
