Teaching Machines with Few Examples

Rahul Parundekar
Towards Data Science
8 min read · Apr 8, 2022


When product managers and domain experts approach me with an idea that might benefit from AI, I often get asked how many examples we will need to train a Machine Learning (ML) model and test whether it can solve the problem. The answer has always been along the lines of: "It depends, but you typically need a few hundred examples." With the latest advances in ML, however, that answer has started to shift toward needing only a handful of examples to get a quick prototype or even an MVP out the door. This article explains these developments and how they enable a User Experience that empowers non-data-scientists (namely product managers, domain experts, and business users) to try out their ideas.

Just from a handful of examples, we can get a sense of what a teapot might be. Photo by 童 彤 on Unsplash

From large datasets to just a few examples, a.k.a. how did we get here?

When training an ML model, Data Scientists often grapple with the question: "How much data do I need?" Intuitively, they understand that more data is better. Until just three years ago, the rule of thumb was that you needed at least about 1,000 examples per class you wanted to detect. As a result, Data Scientists have always tried to get their hands on as much data as was feasible.

The first breakthrough that allowed models to be trained with less data was Transfer Learning. By leveraging ML models that were trained on a similar task and customizing the final decision logic for your domain ("fine-tuning the last few layers" in Deep Learning lingo), you could potentially get away with needing only a few hundred examples. Transfer learning achieves this because the task-agnostic part of the model is already good at extracting key features from the input text or image. To adapt the model to your task, you only need to train it to make decisions from those extracted features. This approach is now so popular that practically no one starts from scratch ("random weights" in Deep Learning lingo); instead, they start with a pre-trained model. For example, even when training a chest X-ray image classifier, starting with a model pre-trained on a general dataset like ImageNet gives a statistically significant performance boost.
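
As a minimal sketch of what fine-tuning the last few layers looks like (assuming PyTorch and a recent torchvision; the model and class count are illustrative):

```python
import torch.nn as nn
from torchvision import models

# Load a backbone pre-trained on ImageNet (torchvision's ResNet-18 here).
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Freeze the task-agnostic feature-extraction layers.
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully-connected layer with one sized for your classes;
# only this new "head" will be trained on your few hundred examples.
num_classes = 2  # e.g. cats vs. dogs
model.fc = nn.Linear(model.fc.in_features, num_classes)
```

Since only the new head receives gradients during training, a few hundred labeled examples can be enough to adapt the model.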

Next came the enthusiasm for Synthetic Data. Data augmentation has long been a staple trick to improve model performance. While this often means creating randomized variations of your inputs (e.g. flipping or rotating images, replacing words with similar words, etc.), teams at autonomous driving companies found great success in building pipelines that generate synthetic data on demand for photo-realistic simulations. This reduced the dependency on expensive, large-scale data collection. However, these pipelines for generating realistic data were themselves cost-prohibitive for many companies because of the technical expertise needed to create them, so their use remained limited.
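
For instance, a few common randomized image augmentations, sketched with torchvision (the exact transforms and parameters depend on your domain):

```python
from torchvision import transforms

# Each training epoch sees a slightly different variant of every image.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),  # mirror the image half the time
    transforms.RandomRotation(degrees=15),   # small random rotations
    transforms.ColorJitter(brightness=0.2,   # mild lighting variation
                           contrast=0.2),
    transforms.ToTensor(),
])
```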

In the last few years, Self-supervised Learning has emerged as a way to train the aforementioned feature-extraction parts of models using signals from the data itself. As a result, the dependency on expensive data labeling is reduced: the bulk of the model is trained robustly by exploiting structure in the underlying data. Once trained, the feature extractor can be combined with a decision-making "head" that is trained on just a few examples to customize it for the task at hand.
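
A minimal sketch of that recipe, assuming a self-supervised backbone from the Hugging Face hub (DINO is one example choice) and placeholder `train_images` / `train_labels` / `new_images` standing in for your own data:

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoImageProcessor, AutoModel

# A backbone pre-trained with self-supervision (no labels were needed).
processor = AutoImageProcessor.from_pretrained("facebook/dino-vits16")
backbone = AutoModel.from_pretrained("facebook/dino-vits16")

def embed(images):
    # Extract features; the [CLS] token embedding summarizes each image.
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        outputs = backbone(**inputs)
    return outputs.last_hidden_state[:, 0].numpy()

# Train only a small decision "head" on a handful of labeled examples.
# train_images / train_labels / new_images are placeholders for your data.
head = LogisticRegression(max_iter=1000).fit(embed(train_images), train_labels)
predictions = head.predict(embed(new_images))
```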

Photo by Nashad Abdu on Unsplash

Rise of Few-shot Learning

In the last couple of years, one research area that has emerged is Few-shot Learning: essentially, trying to predict from just a few examples. For instance, can you recognize whether the image below is of a teapot, given the examples above?

Photo by Na visky on Unsplash

More formally, in the few-shot learning setting you have (a) a model trained on images/text and classes you already know about, (b) the image/text you want to predict on, and (c) a support set of unseen classes, with a handful of examples each, into which you want to classify the image/text. You use the feature-extractor part of the trained model to extract features (i.e. embeddings in Deep Learning lingo) and compare the features of the prediction image/text with those of the support examples. The class whose support examples are most similar on average (i.e. whose mean embedding is closest) is the predicted class. When you have k unseen classes and each of them has n support examples, this is called k-way, n-shot classification.
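
Here is a sketch of that prediction step in NumPy, assuming you already have embeddings from a feature extractor (the function name and inputs are illustrative):

```python
import numpy as np

def few_shot_predict(query_emb, support_embs, support_labels):
    """k-way, n-shot prediction: compare a query embedding against the
    mean ("prototype") embedding of each unseen class's support set."""
    support_labels = np.array(support_labels)
    classes = sorted(set(support_labels.tolist()))

    # One prototype per class: the mean of its n support embeddings.
    prototypes = np.stack(
        [support_embs[support_labels == c].mean(axis=0) for c in classes]
    )

    # Cosine similarity between the query and each prototype.
    sims = prototypes @ query_emb / (
        np.linalg.norm(prototypes, axis=1) * np.linalg.norm(query_emb)
    )
    return classes[int(np.argmax(sims))]
```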

While this formalization might sound academic, the implications for creating a UX that can predict from a few examples are huge. You can now provide just a handful of examples of previously undefined classes to create a prototype or an MVP. The "How many images do I need?" question is now answered with "Let's start with a few." The same goes for text classification.

It's important to note, though, that the accuracy may not be high enough to launch production software with. Getting there will, of course, likely require more data, but these approaches can get you within 5–10% of your accuracy targets.

Zero-shot Learning: Using language to go a step further

The emergence of powerful language models (such as BERT) has changed the landscape of how machines understand textual data. Building on these advances, zero-shot learning aims to exploit the semantic meaning of the class names themselves. If we can map the features we extract from an image or text to the meaning of the class names, we can predict classes we have never seen a single example of.

Again, more formally, in zero-shot learning you learn a mapping from the features to the semantic meaning of the class (i.e. embeddings of the class names, in Deep Learning lingo), where the meaning of the classes has been extracted using some language model trained on a large corpus. When you have an image/text to predict on with an unseen set of new classes, you extract the features, generate the semantic embeddings of the class names, and use the mapping to see which class embedding your features land closest to.

Photo by Scarbor Siu on Unsplash

An interesting advantage of this technique is that your classes need not be single words; they can be more descriptive. For example, instead of saying "teapot" you could say "ceramic-ware with handle and spout" (assuming the model has seen other things with the attributes ceramic, handle, and spout). The UX this could lead to is a more "teachable" machine interface, where the machine tries to predict your class and you can correct it by describing the attributes of the object.
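
As a minimal sketch of this idea for text, assuming the sentence-transformers library (a simplified setup where the input and the class descriptions share one embedding space, so no extra mapping needs to be learned; the model name and example strings are illustrative):

```python
from sentence_transformers import SentenceTransformer, util

# A language model that maps text to semantic embeddings.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

text = "Steam rose from the spout as she poured the tea."

# Classes can be single words or richer descriptions.
class_names = [
    "ceramic-ware with handle and spout",
    "two-wheeled vehicle with pedals",
    "portable computer",
]

# Embed the input and the class descriptions into the same space,
# then pick the class whose embedding is closest.
text_emb = encoder.encode(text, convert_to_tensor=True)
class_embs = encoder.encode(class_names, convert_to_tensor=True)
scores = util.cos_sim(text_emb, class_embs)[0]
print(class_names[int(scores.argmax())])
```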

Bringing it all together: Teaching Machines with a Few Examples

I believe the ability to learn from a few examples imbues machines with a renewed sense of "intelligence": a machine seems more intelligent if it can learn from just a handful of examples. On the UX side, this means your users can customize your software by walking it through a few examples. While this was previously restricted to Data Scientists who meticulously trained models on large datasets, a Domain Expert or Business user can now be empowered to teach the machine through a guided UI. Once trained, the model can be used by the expert to automate their work and reduce repetitive tasks, or even to personalize for their own end-users.

With AI Hero, we have created this guided UI to enable you to quickly build out a prototype/MVP. Let's take an example of image classification: teaching the machine to categorize cats vs. dogs.

a) You’ll first need to upload the images you need to classify.

Upload the images you want to train AI Hero with.

b) You’ll then tell AI Hero the classes or their descriptions.

Tell AI Hero about the classes you’d like to categorize the images into

c) Spot-check what AI Hero predicts

AI Hero uses its best guess (via zero-shot learning) to predict classes for the images and creates a list of images for you to "Spot-check". As you confirm or correct AI Hero, it learns automatically!

Teach AI Hero with the Spot-check

d) Iterate with more examples as AI Hero learns.

Once you’re satisfied with its performance, you can start using your automation by downloading your predictions.

Download your predictions when you’re satisfied with AI Hero’s performance

It’s that easy!

The role of the Data Scientist

All this doesn't mean we don't need Data Scientists anymore. The core of these techniques is the feature-extraction model and the language-understanding part, which need to be trained on a large dataset (one that matches your use case may already be available in the public domain).

If you already have a Data Science team, here’s a quick example to get your team started with a text classification prototype.

GIF from Hugging Face’s blog on Zero-shot classification.
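
In code, that demo boils down to a few lines with Hugging Face's zero-shot classification pipeline (the model shown is one common choice):

```python
from transformers import pipeline

# The pipeline scores the text against arbitrary candidate labels,
# with no task-specific training required.
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

result = classifier(
    "One day I will see the world.",
    candidate_labels=["travel", "cooking", "dancing"],
)
print(result["labels"][0], result["scores"][0])  # best label and its score
```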

As your dataset grows and you move toward productionizing your automation, Data Scientists can help you build these models internally so they understand the quirks of your domain.

Shout out to the Data Scientists building these incredible capabilities!

Conclusion

With zero-shot and few-shot learning, you can create a prototype or MVP from just a handful of examples. These techniques enable an easy UX that can empower product managers, domain experts, and business users to create their own automation. AI Hero simplifies this further by providing a guided UI that can get your MVP/prototype built in no time.

I’d love to hear what you think — you can reach me directly over here.

Photo by Wengang Zhai on Unsplash

This article is written by Rahul Parundekar for AI Hero.

Rahul is an AI expert with 13+ years of experience in architecting and building AI products, engineering, research, and leadership, and is passionate about improving the Human Experience through AI. He loves to learn about how ML is used in practice and helps fellow ML practitioners with his experience. Talk to him here!

AI Hero is a no-code platform that helps you go from Zero to ML in minutes. You can choose from our growing list of automations to tag text, recommend products, tag images, detect customer sentiment, and other tasks — all with a simple, no-code, self-serve platform.
