
Finding Familiar Faces with a Tensorflow Object Detector, Pytorch Feature Extractor, and Spotify’s…


In a few different posts I have put together facial recognition pipelines, or something similar, using object detectors and something like a Siamese network for feature extraction to find similar images, but I haven’t really dug into how to use those feature vectors in a more practical, larger scale way. What I mean is that you don’t want to do pairwise comparisons across your entire database; it just isn’t practical. So in this post I am going to demo how to use Spotify’s approximate nearest neighbor library (annoy) to find similar game characters based on some initial image.

Annoy is a library that Spotify developed to help power their music recommendations at massive scale. It has some other useful properties, notably that you can precompute indexes and then load them later to find similar items, which helps when you are working at that kind of scale. You pass it some sort of vector representation of whatever you care about (music, text, pictures of anime characters) and it builds its model and generates indexes based on that.
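To make the mechanics concrete, here is a minimal sketch of the annoy workflow (the vector length, tree count, and file names are placeholder values, not anything from my notebooks): add your vectors, build the index once, save it, then load and query it later without rebuilding.

```python
from annoy import AnnoyIndex
import random

vector_length = 1000                          # length of each feature vector
index = AnnoyIndex(vector_length, "angular")  # angular = cosine-style distance

# Placeholder vectors; in the real pipeline these come from a feature extractor.
for i in range(100):
    index.add_item(i, [random.gauss(0, 1) for _ in range(vector_length)])

index.build(10)                # 10 trees; more trees -> better accuracy, bigger index
index.save("characters.ann")   # precompute once, reuse later

# Later / elsewhere: load the saved index and query it.
index2 = AnnoyIndex(vector_length, "angular")
index2.load("characters.ann")
print(index2.get_nns_by_item(0, 5))  # ids of the 5 items closest to item 0
```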

The simplest version of this pipeline would be to use some neural network for feature extraction, pass those feature vectors into annoy, and then use annoy to find similar images. Whether that works depends on your use case. In this specific case I want to see if I can get annoy to return similar or identical characters across different images. I tested the simple approach and ended up going with a two stage approach instead: build a facial recognition network and pass annoy the feature vectors generated from the faces, to focus the pipeline on finding similar characters.

For this project I will be using the same Fate Grand Order dataset I have used in a number of posts, mostly because it is the largest of the recent small datasets I have built, at around 410 images, which makes it a good test case for something like this.

Going into this project I figured that the key (like in many data science projects) would be finding a good way to represent my dataset for the problem I wanted to solve. For this type of task, condensing the 3D array of pixels into a 1D feature vector is mechanically straightforward, but making sure that final 1D feature vector describes what you actually care about is just as important.

Feel free to check out the notebooks I used for this here. The notebooks aren’t super readable since I was hacking through things pretty quickly. As usual, I also don’t provide model/dataset files in the repo.

Version 1: Full Image Feature Extraction

Since this was the easiest thing to do and a good way to set up my general pipeline for later tests, I used a pretrained Pytorch network (a ResNet101, which yields a 1000-dimensional feature vector to feed into annoy) and passed it full images, which annoy then used to find similar images. The extraction step is sketched below, and this initial pass yielded results like the ones that follow it: the leftmost image is the base image and the next 4 are the 4 most similar ones in the "database".
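For reference, the full-image feature extraction step looks roughly like the sketch below (the helper name and preprocessing values are my own standard Pytorch boilerplate, not lifted from the notebooks); the pretrained ResNet101’s final layer gives the 1000-dimensional vector per image that gets handed to annoy.

```python
import torch
from torchvision import models, transforms
from PIL import Image

model = models.resnet101(pretrained=True)  # ImageNet-pretrained backbone
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def extract_features(image_path):
    """Return a 1000-dim feature vector for one image."""
    image = Image.open(image_path).convert("RGB")
    batch = preprocess(image).unsqueeze(0)  # shape: (1, 3, 224, 224)
    with torch.no_grad():
        features = model(batch)             # shape: (1, 1000)
    return features.squeeze(0).numpy()
```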

Left most image is the seed image followed by 4 of the top ones. (there was a duplicate in the dataset for this one)

It looks like the feature extractor is picking up a lot of similar color detail: dark colors with a red and black focus, matching the original image. However, since I am trying to find similar characters, this is not the best output.

Here is another image example.

Left most image is the seed image followed by 4 of the top ones.

This one is a bit tricky since there are two characters in the main image, but when the full image is passed into the annoy model the similar images have similar-ish color palettes (maybe…).

So, mechanically this process is working, but I would say that the output is poor. The next step is to think of a way to make the output more coherent. Since what I want to build is something that displays similar characters given some base image, why not make the comparison based on just the character rather than the full image? There are a number of ways I can think of to do this, but the easiest for me was to build a single class object detector to recognize faces in images, pass each detected face into a pretrained network for feature extraction, then pass that feature vector into annoy. When annoy pulls the similar faces from its database, I can have it return the base images they appeared in: it finds which faces are most similar to the newly extracted face and returns the images those faces came from.

Facial Recognition Based Similarity

For this project I built a simple single class object detector to recognize anime character faces in images. I used LabelImg to quickly annotate the images, and I am pretty sure it only took me around 20 minutes to label the 400 images for my test and training splits. I have done quite a few of these at this point, and having only a single class speeds up the process significantly. For the detector I used a Faster-RCNN-Inception-v2 model with the COCO weights from the TensorFlow model zoo and trained it for around 3 hours, from midnight on Friday until around 3am on Saturday, which has thrown a bit of a wrench in my sleep schedule since I was up working on some other stuff at the time.

The object detector trained up fairly quickly and its output looks quite clean, which is heartening since it will be the key piece of this pipeline for finding more character-specific similar images.

Example of some cropped image outputs
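For reference, running inference with a frozen graph exported from the TF1 Object Detection API looks roughly like the sketch below; the tensor names are the standard ones that API exports, while the file path is a placeholder rather than the exact layout of my project.

```python
import numpy as np
import tensorflow as tf
from PIL import Image

# Load the frozen inference graph exported by the Object Detection API (TF1-style).
graph = tf.Graph()
with graph.as_default():
    graph_def = tf.GraphDef()
    with tf.gfile.GFile("frozen_inference_graph.pb", "rb") as f:
        graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name="")

def detect_faces(image_path):
    """Return normalized boxes and confidence scores for one image."""
    image = np.array(Image.open(image_path).convert("RGB"))
    with tf.Session(graph=graph) as sess:  # fine for a sketch; reuse the session for bulk runs
        boxes, scores = sess.run(
            ["detection_boxes:0", "detection_scores:0"],
            feed_dict={"image_tensor:0": image[np.newaxis, ...]})
    return boxes[0], scores[0]
```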

While using the object detector to crop the heads out of the original dataset, I saved a CSV mapping each head back to its original image. The idea is that I can run a feature extractor on the headshot, store that vector in the annoy model, and, when the time comes, match the annoy output back to the original image.
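That crop-and-record step looks roughly like the following sketch (the column names, paths, and confidence threshold are assumptions on my part; detect_faces is the inference helper sketched above).

```python
import csv
import os
from PIL import Image

def crop_faces(image_path, boxes, scores, out_dir, writer, threshold=0.5):
    """Save each confident face crop and record which image it came from.

    boxes are normalized [ymin, xmin, ymax, xmax], as returned by the detector.
    """
    image = Image.open(image_path).convert("RGB")
    width, height = image.size
    for i, (box, score) in enumerate(zip(boxes, scores)):
        if score < threshold:
            continue
        ymin, xmin, ymax, xmax = box
        crop = image.crop((xmin * width, ymin * height, xmax * width, ymax * height))
        crop_path = os.path.join(out_dir, f"{i}_{os.path.basename(image_path)}")
        crop.save(crop_path)
        writer.writerow([crop_path, image_path])  # headshot -> original image

# Usage sketch: run the detector over the dataset and log every crop.
with open("head_to_image.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["head_path", "source_path"])
    # for image_path in dataset_paths:
    #     boxes, scores = detect_faces(image_path)
    #     crop_faces(image_path, boxes, scores, "heads", writer)
```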

Feature Extraction with Pytorch and Annoy

Now that I can extract heads from images, all I had to do was pass those heads through a feature extractor (once again a ResNet101) and then pass the resulting feature vectors to annoy.
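Putting those pieces together for the headshot version looks roughly like this, reusing the extract_features helper sketched earlier and the CSV mapping from the cropping step (file names and the tree count are placeholders).

```python
import csv
from annoy import AnnoyIndex

index = AnnoyIndex(1000, "angular")  # one 1000-dim ResNet vector per headshot
id_to_source = {}

with open("head_to_image.csv") as f:
    for i, row in enumerate(csv.DictReader(f)):
        index.add_item(i, extract_features(row["head_path"]))
        id_to_source[i] = row["source_path"]  # so results map back to full images

index.build(10)
index.save("heads.ann")

# Query time: detect and crop a face from a new image, then look up its neighbors.
query_vector = extract_features("query_head.png")
neighbor_ids = index.get_nns_by_vector(query_vector, 4)
similar_images = [id_to_source[i] for i in neighbor_ids]
```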

As a demo, here is one of the images from before where the raw image model had some issues. The object detector finds two faces in this image, so each face has features extracted from it and is then matched against the larger database with annoy.

The first output is for the character on the left of the main image (who does appear in the dataset), and the first two and final images of the 4 similar ones are of that character. This is an improvement over the raw image input, which had 0 matches for this image.

The second character (the one on the right) doesn’t actually appear in the database… but she is basically identical in facial features to the one on the left, so of the four returned images, 2 match (the 1st and 4th most similar).

Left is the base image, followed by 4 most similar

This appears to be a good improvement over just using the base image, since the goal is to return similar characters. Now let’s check out the other example image I used before for the base model.

Left is the base image, followed by 4 most similar

In this one, rather than getting a bunch of just red and black images, the results seem a little more tailored. While the first, second, and fourth similar images are of a different character, the 3rd is of the same character, and this time all of them are at least the same gender as the base image; the previous version paired a female base character with all male characters. While this is not a great result, it seems to be an improvement over the previous version.

Closing Thoughts

After looking through the output of these two pipelines I felt that the results were acceptable, but not great. Using the base images returned images that have a similar feel but not necessarily similar characters, while the face detector helped focus the outputs on similar characters but often did not return images of an overall similar style.

original full image model

While 2 of the 4 returned images are not of the same character, I do actually like the 2nd result in the middle because it has a similar "feel" to the base image.

headshot based model

As I mentioned before, the headshot based model focuses well on the characters in question; in this case all of the returned characters are the same. However, it doesn’t match the feel of the original image. What I really want is some combination of the two, where I get similar characters and similar overall images (basically I selfishly want to have my cake and eat it too).

After some experimenting I found that I was able to get pretty close to that.

New model!

As I alluded to at the beginning of this post, getting better output from this pipeline basically comes down to modifying what data gets condensed into that final feature vector that gets passed into annoy. While in most things having your cake and eating it too isn’t feasible, in this case it is! I would argue that this "new model" does a better job than the other two: it gets all of the correct characters (beating the base image model) and displays images that have a closer "feel" to the base image than the headshot model.

Per usual this was just a situation where I had to attack the problem from a new point of view.

Still really enjoy big hero 6 and code to Immortals as a theme song for my life

I just had to rethink what information I was encoding into that final feature vector. What I ended up doing was passing information from both the detected headshot and the base image into annoy as a combined feature vector that captures the character’s face (to get similar characters) and the base image (to get the overall "feel").
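One simple way to illustrate that idea is to concatenate the two ResNet vectors before handing them to annoy, as in the sketch below (reusing the hypothetical extract_features helper and CSV from earlier). This is only an illustration of the concept, not the exact recipe I used, since the actual combined model was a bit more involved.

```python
import csv
import numpy as np
from annoy import AnnoyIndex

combined_index = AnnoyIndex(2000, "angular")  # 1000 head dims + 1000 full-image dims
id_to_source = {}

with open("head_to_image.csv") as f:
    for i, row in enumerate(csv.DictReader(f)):
        head_vec = extract_features(row["head_path"])     # character's face
        image_vec = extract_features(row["source_path"])  # overall "feel" of the image
        combined_index.add_item(i, np.concatenate([head_vec, image_vec]))
        id_to_source[i] = row["source_path"]

combined_index.build(10)
combined_index.save("combined.ann")
```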

However, this final model wasn’t quite that straightforward and took me a bit to figure out, so I will give it a second follow-up post to keep this one to a reasonable length.

So tune in next time where I will walk through how I combined the face and base image information into a dense representation to let Spotify’s annoy find similar images in terms of character and feel.

Follow-up blog post here

Once again, feel free to check out the notebooks I used for this here. The notebooks aren’t super readable since I was hacking through things pretty quickly. As usual, I also don’t provide model/dataset files in the repo.

