Up and running with Milvus

A Vector Similarity Search Engine. Build recommendation and search for image/video, audio, or unstructured text.

David Shvimer
Towards Data Science


Photo by Markus Winkler on Unsplash

What is Milvus?

A vector similarity search engine that seems to be production ready. According to the website:

  • It offers a variety of similarity metrics and index types.
  • Scales horizontally.
  • Reads and writes happen in near real time, which means we can insert records while the engine is online.
  • Exposes a REST interface.

Sounds pretty cool!

For those who don’t yet know, there are many use cases for a technology like this. With the help of machine learning, we can search images, video, and audio.

Say we train a CNN that classifies images. If we look at the output of any layer before the final one, we find N-dimensional feature vectors that describe the input. As we move through the network, the features become more specific. Early layers pick up textures, shapes, etc. Towards the end, they pick up objects like cat ears and dog tails. We can take the output of any of these layers, flatten it by some method, and index it in a search engine! Voila! The layer we choose affects what is considered “similar”, and the right choice varies by use case. This example is an application of Content Based Image Retrieval (CBIR).
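One common way to "flatten by some method" is global average pooling, which keeps one number per channel. Here is a minimal sketch in plain Python; the function name and the toy numbers are made up for illustration, and a real pipeline would do this on tensors rather than nested lists.

```python
def global_average_pool(feature_map):
    """Collapse a (channels, height, width) nested list into one value per channel."""
    vector = []
    for channel in feature_map:
        # Flatten the 2-D channel, then average all of its activations.
        values = [v for row in channel for v in row]
        vector.append(sum(values) / len(values))
    return vector


# Toy activation with 2 channels of size 2x2 (made-up numbers).
feature_map = [
    [[1.0, 3.0], [5.0, 7.0]],
    [[0.0, 0.0], [2.0, 2.0]],
]
print(global_average_pool(feature_map))  # [4.0, 1.0]
```

The result has as many dimensions as the layer has channels, which is what makes it easy to index.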

What are we building?

A very simple implementation of CBIR using Milvus, in a Dockerized environment. Here is the complete repo and a list of the technologies we will be using. If you want to download the repo and just mess around, it’s ready to go.

  • Python — because
  • Docker — so everyone has a standard environment
  • PyTorch — because I always jump for Keras and wanted to learn something new.
  • Jupyter notebook — simple way to interact with Milvus

Setting up the project

In a new directory, let’s create some more empty directories.

- project
  - notebook
  - milvus
    - conf
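If you would rather script this step, here is a small sketch using pathlib. The helper name make_layout is hypothetical, and it also creates notebook/images, which is used later in this post.

```python
from pathlib import Path


def make_layout(root):
    """Create the project folders and return the directories found under root."""
    root = Path(root)
    for sub in ("notebook/images", "milvus/conf"):
        # parents=True also creates the intermediate notebook/ and milvus/ folders.
        (root / sub).mkdir(parents=True, exist_ok=True)
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_dir()
    )


if __name__ == "__main__":
    print(make_layout("project"))
```

Running it twice is harmless because exist_ok=True skips folders that already exist.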

In the top level directory, create a file named docker-compose.yml with the following contents:

version: '2.0'
services:
  notebook:
    build:
      context: ./notebook
    ports:
      - '8888:8888'
    volumes:
      - ./notebook:/home/jovyan
    links:
      - milvus
  milvus:
    image: milvusdb/milvus:0.9.1-cpu-d052920-e04ed5
    ports:
      - '19530:19530'
      - '19121:19121'
    volumes:
      - ./milvus/db:/var/lib/milvus/db
      - ./milvus/conf:/var/lib/milvus/conf
      - ./milvus/logs:/var/lib/milvus/logs
      - ./milvus/wal:/var/lib/milvus/wal

We have defined two Docker containers, one for Milvus and another for our Jupyter notebook. The Milvus container is visible to the notebook via the links attribute. The volumes we declared let the Milvus container share some folders with the host file system, which lets us configure and monitor Milvus with ease. Since we gave the notebook container a context to build from, we need to create a file named Dockerfile in the notebook directory with the following contents:

FROM jupyter/scipy-notebook
RUN pip install pymilvus==0.2.12
RUN conda install --quiet --yes pytorch torchvision -c pytorch

Not the prettiest way to declare dependencies, but that can be optimized later. We should also download some images to play with. Feel free to download the ones I used here. Place them into notebook/images.

Last step, download the starter Milvus config file and place it into milvus/conf/. Now as long as Docker is installed and running, we run docker-compose up and we are live!

If you see the following lines in the console output:

milvus_1    | Milvus server exit...
milvus_1    | Config check fail: Invalid cpu cache capacity: 1. Possible reason: sum of cache_config.cpu_cache_capacity and cache_config.insert_buffer_size exceeds system memory.
milvus_1    | ERROR: Milvus server fail to load config file

It means Docker is not configured with enough memory to run Milvus. If we open the config file and search for “cpu_cache_capacity” we see some helpful documentation. “The sum of ‘insert_buffer_size’ and ‘cpu_cache_capacity’ must be less than system memory size.”

Set both values to 1, then open Docker’s settings and make sure its memory limit is set to a value greater than 2GB (it must be strictly greater, not equal). Make sure to apply the settings and restart Docker. Then try docker-compose up again. If something else isn’t working, please let me know in the comments.
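The rule from the config file is easy to sanity check before restarting anything. This is only a sketch of the arithmetic with a hypothetical helper name; Milvus performs its own check at startup.

```python
def check_memory_settings(cpu_cache_capacity_gb, insert_buffer_size_gb, docker_memory_gb):
    """Return True when the two Milvus buffers fit strictly inside Docker's memory."""
    return cpu_cache_capacity_gb + insert_buffer_size_gb < docker_memory_gb


print(check_memory_settings(1, 1, 2))  # False: 1 + 1 is not less than 2
print(check_memory_settings(1, 1, 3))  # True
```

This is why exactly 2GB is not enough when both values are set to 1.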

Feature Vectors with PyTorch

The fun stuff, here we go. Once everything is running, we should have a URL to access our jupyter instance. Let’s create a new notebook and get to coding. One cell at a time.

First the imports:

import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as transforms
from torch.autograd import Variable
from PIL import Image
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

Now let’s define a helper class to extract the feature vectors:

Why did I pick ResNet18? Because it has a layer that outputs a flat vector with length 512. Convenient and easy is reasonable when learning something new. There is quite a bit of room to extend this class. We could extract features from multiple layers at once, and input multiple images at once. For the time being, this is good enough.

Now we can load our images and start looking at similarity:

feat_vec = FeatureVector()

dog1 = Image.open('./images/dog1.jpg')
dog2 = Image.open('./images/dog2.jpg')
dog3 = Image.open('./images/dog3.jpg')
cat1 = Image.open('./images/cat1.jpg')
cat2 = Image.open('./images/cat2.jpg')
person1 = Image.open('./images/person1.jpg')
person2 = Image.open('./images/person2.jpg')

def compare(a, b):
    plt.figure()
    plt.subplot(1, 2, 1)
    plt.imshow(a)
    plt.subplot(1, 2, 2)
    plt.imshow(b)
    a_v = feat_vec.get_vector(a)
    b_v = feat_vec.get_vector(b)
    print('Similarity: {}'.format(feat_vec.similarity(a_v, b_v)))

compare(dog1, dog2)

In the first example, the puppy and adult golden retriever images get a similarity score of ~0.79. When we compare a puppy golden retriever and a pug, we get a similarity score of ~0.58.
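To make those scores less of a black box, here is what a similarity function of this kind computes, assuming the score is cosine similarity (the post does not say so explicitly). The sketch uses plain Python lists and toy 2-D vectors instead of real 512-dimensional features.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))            # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # 0.0 (orthogonal)
print(round(cosine_similarity([1.0, 1.0], [1.0, 0.0]), 3))  # 0.707
```

Under that assumption, higher means more similar, and identical directions score 1.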

Working with Milvus: Connect, Insert, Query

Let’s do what we came here to do. We start by connecting to Milvus and creating a collection:

from milvus import Milvus, IndexType, MetricType, Status

# Milvus server IP address and port.
# Because the link to milvus in docker-compose
# was named `milvus`, that's what the hostname will be
_HOST = 'milvus'
_PORT = '19530'  # default value

# Vector parameters
_DIM = 512  # dimension of vector
_INDEX_FILE_SIZE = 32  # max file size of stored index

milvus = Milvus(_HOST, _PORT, pool_size=10)

# Create the collection if it doesn't exist.
collection_name = 'resnet18_simple'
status, ok = milvus.has_collection(collection_name)
if not ok:
    param = {
        'collection_name': collection_name,
        'dimension': _DIM,
        'index_file_size': _INDEX_FILE_SIZE,  # optional
        'metric_type': MetricType.L2  # optional
    }
    print(milvus.create_collection(param))

# Print the collection's configuration
_, collection = milvus.get_collection_info(collection_name)
print(collection)

Now we can insert the feature vectors from our images. We need to convert the vectors into python lists:

images = [
    dog1,
    dog2,
    dog3,
    cat1,
    cat2,
    person1,
    person2
]

# 7 vectors with 512 dimensions each
# element per dimension is float32 type
# vectors should be a 2-D array
vectors = [feat_vec.get_vector(i).tolist() for i in images]

# Insert vectors into the collection, return status and vectors id list
status, ids = milvus.insert(collection_name=collection_name, records=vectors)
if not status.OK():
    print("Insert failed: {}".format(status))
else:
    print(ids)

If this was successful, we should see a list of IDs that Milvus uses to identify the images. They are in the same order as our list of images, so let’s create a quick lookup table to easily access an image, given some ID:

lookup = {}
for ID, img in zip(ids, images):
    lookup[ID] = img

for k in lookup:
    print(k, lookup[k])

We can flush the new items onto the disk and get some info for the collection:

# Flush the inserted data to disk.
milvus.flush([collection_name])

# Get the collection's row count
status, result = milvus.count_entities(collection_name)
print(result)

# Present collection statistics info
_, info = milvus.get_collection_stats(collection_name)
print(info)

Let’s search!

# execute vector similarity search
search_param = {
    "nprobe": 16
}

print("Searching ... ")

param = {
    'collection_name': collection_name,
    'query_records': [vectors[0]],
    'top_k': 10,
    'params': search_param,
}

status, results = milvus.search(**param)
if status.OK():
    print(results)
else:
    print("Search failed. ", status)

If we see a list of results, that means all is well. We can visualize them with the following snippet:

for neighbors in results:
    for n in neighbors:
        plt.figure()
        plt.subplot(1, 2, 1)
        plt.imshow(images[0])
        plt.subplot(1, 2, 2)
        plt.imshow(lookup[n.id])
        print('Distance: {}'.format(n.distance))
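To build some intuition for those distances, here is a sketch of what an unindexed ("flat") L2 search does conceptually: compute the distance from the query to every stored vector and keep the closest top_k. This is plain Python with made-up IDs and 2-D vectors, not Milvus code, and the exact values Milvus prints may differ from this sketch (for example, if the engine reports squared distances). Only the ranking idea is illustrated.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def flat_search(query, records, top_k):
    """Return (id, distance) pairs for the top_k records closest to query."""
    scored = [(rec_id, l2_distance(query, vec)) for rec_id, vec in records.items()]
    scored.sort(key=lambda pair: pair[1])  # smaller distance = more similar
    return scored[:top_k]


# Hypothetical IDs and toy vectors.
records = {
    101: [0.0, 0.0],
    102: [3.0, 4.0],
    103: [1.0, 0.0],
}
print(flat_search([0.0, 0.0], records, top_k=2))  # [(101, 0.0), (103, 1.0)]
```

Note that with a distance metric, unlike the similarity score earlier, lower means more similar. Searching with a vector that is already stored returns that same record first, at distance 0.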

To drop the collection:

milvus.drop_collection(collection_name)

Conclusion

It was pretty easy to get up and running. Most of the Milvus-related code here came from the getting started example available on the website. One of our vectors is about 2KB in size (512 float32 values at 4 bytes each), so one million feature vectors would fit into roughly 2GB of memory. We didn’t use an index here, which would increase that cost, but it’s still quite an efficient way to index an image. The website docs are great, but I think reading through the config file is a great way to understand what this thing is capable of. We kept it very simple in this post, but here are some ideas for those who want to take it further:

  • Preprocess the feature vectors (e.g. normalizing)
  • Experiment with different layers; if they aren’t flat, try max pooling or average pooling followed by flattening
  • Apply dimensionality reduction techniques: t-SNE, PCA, LDA
  • Preprocess using an autoencoder
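As a starting point for the first idea, here is a sketch of L2 normalization in plain Python. It also shows why normalizing is attractive with an L2 collection: for unit-length vectors, the squared L2 distance equals 2 - 2 * cosine similarity, so both measures rank neighbors the same way. The helper names and toy vectors are illustrative only.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


a = l2_normalize([3.0, 4.0])  # approximately [0.6, 0.8]
b = l2_normalize([4.0, 3.0])  # approximately [0.8, 0.6]
cos = dot(a, b)               # for unit vectors, the dot product is the cosine

print(round(cos, 4))                                      # 0.96
print(round(squared_l2(a, b), 4), round(2 - 2 * cos, 4))  # 0.08 0.08
```

In the notebook, the same normalization would be applied to each feature vector before inserting it and before querying.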

This was my first post ever. So if you liked it let me know. If something didn’t work or the formatting was off, also let me know! Stay classy out there.
