The world’s leading publication for data science, AI, and ML professionals.

How to Unlock Powerful Computer Vision Applications by Adding a Flavor of NLP

A deep learning model that combines NLP with Computer Vision

image from Pixabay

What is CLIP?

Self-supervised learning in Computer Vision has shown great potential for learning representations of images.

It is one approach where a neural network learns representations that can later be used for different tasks such as image classification and object detection.

Another approach for learning representations from a dataset is CLIP, developed by OpenAI.

This approach uses (image, text) pairings to learn an image encoder AND a text encoder, which means it combines NLP with Computer Vision.

I think this is very cool!

Once these encoders are learned, they can be used later in a zero-shot setting to do different computer vision tasks.

High level overview of how CLIP works

In the image below (on the left) we see how two encoders are learned using (image, text) pairings. One is a text encoder and the other is an image encoder. It’s as if we’re trying to map language to vision and vice versa.

Then these two encoders can be used in a zero-shot setting, for example to predict what an image contains, as shown in the same image below on the right.

CLIP high level overview diagram (from original paper [1])
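Concretely, the zero-shot step on the right of the diagram boils down to cosine similarity between the image embedding and each candidate text embedding, followed by a softmax. Here is a toy sketch of that computation (the embedding size, temperature value, and random vectors are placeholders for illustration, not CLIP's real weights):

```python
import numpy as np

def zero_shot_probs(image_emb, text_embs, temperature=0.01):
    """Zero-shot classification: cosine similarity between one image
    embedding and N candidate text embeddings, scaled and softmaxed."""
    # L2-normalize so dot products become cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = text_embs @ image_emb / temperature
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

# toy example: 3 candidate labels in a 4-dimensional embedding space
rng = np.random.default_rng(0)
image_emb = rng.normal(size=4)
text_embs = rng.normal(size=(3, 4))
probs = zero_shot_probs(image_emb, text_embs)
print(probs)  # one probability per candidate label, summing to 1
```

The label whose text embedding is most similar to the image embedding gets the highest probability, which is exactly what the experiments below show.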

Why is CLIP impressive?

ResNet-50 was trained on 1.28 million crowd-labeled training examples from ImageNet. It reached a high top-1 accuracy (around 76%) through pure supervision.

CLIP reaches a comparable accuracy by using NONE of those 1.28 million images!

CLIP also reaches a comparable top-5 accuracy to InceptionV4. The latter was trained in a pure supervision setting.

This is incredible if you think about it!

I tested the CLIP deep learning model

Using the open-source code provided by OpenAI, I gave CLIP the image below, which I got from the stock photo website Unsplash.

WALL-E (from Unsplash)

The image shows WALL-E, the robot from the movie of the same name.

I tested CLIP by giving it a list of candidate sentences, one of which was "a robot".

image made by the author

Label probs: [[5.436e-04 3.622e-04 4.041e-04 9.985e-01]]
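For reference, here is roughly what such a test looks like with OpenAI's open-source clip package (a sketch based on the repo's README API; the image path and the three extra candidate labels are my placeholders, since the actual list appears only in the screenshot above):

```python
# Sketch of a zero-shot test, assuming OpenAI's open-source
# package from github.com/openai/CLIP is installed.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# "wall-e.jpg" is a placeholder path for the Unsplash photo above;
# the first three labels are illustrative guesses
image = preprocess(Image.open("wall-e.jpg")).unsqueeze(0).to(device)
labels = ["a human", "a dog", "a car", "a robot"]
text = clip.tokenize(labels).to(device)

with torch.no_grad():
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)
```

The model scores the image against every candidate sentence at once, so adding or removing labels just means editing the list.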

CLIP got it right! It gave this label the highest probability.

Then I gave it the word "wall-e" in addition to "a robot".

image made by the author

Label probs: [[1.329e-05 8.881e-06 9.756e-01 2.443e-02]]

Lo and behold, it got it right too!

In fact, I kept both sentences, "a robot" and "wall-e", and it gave the highest probability to the latter label!

Remember, CLIP was NOT trained to recognize a robot or wall-e in a pure supervision setting, yet it was still capable of recognizing them in a ZERO-SHOT setting. This is incredible!

Not just that! By giving the model both labels, "wall-e" and "a robot", I was trying to understand how much CLIP actually knows about the image. I thought the label "wall-e" would be too specific, but apparently the model is that powerful!

If you want me to test it on other specific cases, please let me know!

Conclusion

In this article, we took a look at how combining NLP with Computer Vision can lead to some incredible results. CLIP is a Deep Learning model that showcases exactly this. We saw how CLIP can be used in a zero-shot setting and still make accurate predictions. Is this the future of deep learning?

References

[1] Alec Radford et al., "Learning Transferable Visual Models From Natural Language Supervision" (2021)


I am a Machine Learning Engineer working on solving challenging computer vision problems. I want to help you learn Machine Learning applied to Computer Vision problems. Here’s how.

  1. By helping you stay up-to-date with what’s happening in the field. I do this by sharing bite-size ML posts on LinkedIn and Twitter almost daily. So follow me there!
  2. By giving you a weekly digest of those bite-size posts on my newsletter. So subscribe to it!
  3. By writing articles here on Medium about different topics in Machine Learning. So follow me here!
  4. By giving you a free machine learning job-ready checklist to help you check all points you need to learn if you’re planning a career in ML, specifically in Computer Vision. You can get the checklist here.
  5. Last but not least, by sharing with you my FREE introductory TensorFlow course that has more than 4 hours of video content, and you can ask me any question you have there.

Also, feel free to contact me on LinkedIn or Twitter if you have any questions or you just want to chat about ML!

