
Small Training Dataset? You Need SetFit

The enterprise-friendly way to train NLP classifiers with Python in 2025

Image by author

Data scarcity is a big problem for many data scientists.

That might sound ridiculous ("isn’t this the age of Big Data?"), but in many domains there simply isn’t enough labelled training data to train performant models using traditional ML approaches.

In classification tasks, the lazy approach to this problem is to "throw AI at it": take an off-the-shelf pre-trained LLM, add a clever prompt, and Bob’s your uncle.

But LLMs aren’t always the best tool for the job. At scale, LLM pipelines can be slow, expensive, and unreliable.

An alternative option is to use a fine-tuning/training technique that’s designed for few-shot scenarios (where there’s little training data).

In this article, I’ll introduce you to a favourite technique of mine: SetFit, a fine-tuning framework that can help you build highly performant NLP classifiers with as few as 8 labelled samples per class.

I first learned about SetFit on a project I delivered for a client in the financial sector.

We were trying to build a model that could classify domain-specific texts with only subtle differences between them. Unfortunately, we had only around 10 samples per class (and 200+ classes), and weren’t having much luck with our usual arsenal of NLP tools (TF-IDF, BERT, DistilBERT, RoBERTa, OpenAI/Llama 3).

While searching for a solution, I came across SetFit and decided to try it out after seeing this remarkable graph, which shows the results of an experiment run by the original creators of SetFit:

Source

In the experiment, the researchers trained a RoBERTa Large model to classify the sentiment of customer reviews. They started with a training dataset of just 3(!) samples per class/sentiment, and gradually increased the size of the training dataset up to the full sample of 3k, recording the accuracy at each interval. As you can see from the graph (the orange line), the RoBERTa model was highly performant with a large sample size, and terrible at small sample sizes.

Next, the researchers trained a series of models using the SetFit framework (the blue line). And they found that:

SetFit-trained models easily outperform RoBERTa Large with small sample sizes

This blew my mind, but, to understand why it happens, we need to get a grasp on what SetFit actually is.

SetFit is a framework for few-shot fine-tuning of Sentence Transformers (text embedding models). It was developed by researchers at HuggingFace 🤗, and is completely free to use via the transformers and setfit Python libraries.

What’s a Sentence Transformer? It’s just a specific type of neural network designed for encoding/embedding text. They’re a key part of modern NLP toolkits because they let you convert text into high-dimensional, dense vectors which capture the semantic meaning of the text. In turn, those vector representations can then be used as features for training predictive models (e.g., text classifiers).
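To make that concrete, here’s a minimal sketch of encoding two sentences with the sentence-transformers library (using the same general-purpose model we’ll fine-tune later in this article) and comparing their embeddings:

from sentence_transformers import SentenceTransformer, util

# Load a general-purpose, pre-trained sentence transformer
encoder = SentenceTransformer("sentence-transformers/paraphrase-mpnet-base-v2")

# Encode two sentences into dense vectors (768 dimensions for this model)
embeddings = encoder.encode(["I love drinking tea", "Coffee is my favourite drink"])

# Semantically related sentences get a relatively high cosine similarity
print(util.cos_sim(embeddings[0], embeddings[1]))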

So SetFit is a framework for fine-tuning Sentence Transformers. But what does that mean?

In plain English, that means it can be used to create bespoke, fine-tuned embedding models using very small training datasets.

This happens through two phases:

  1. Fine-tuning the sentence transformers (with contrastive learning)
  2. Training a classification head (e.g., logistic regression)
Image by author

What is contrastive learning, and why does it work so well on small, domain-specific datasets?

Most embedding models/sentence transformers are trained to embed text based on co-occurrence patterns – they learn embeddings by looking at how often words appear near each other.

For example, in large training corpora like Wikipedia, the words "tea" and "coffee" might frequently co-occur in the same sorts of articles, in the same sorts of sentences, near the same sorts of words:

Image by author. Screenshot of Wikipedia. And yes, we Brits really do call cafes "greasy spoons".

As a result, the embedding model learns to encode these co-occurring words with similar embeddings:

Image by author

These embeddings are really useful for general-purpose language models (they give the model the ability to recognise that tea and coffee fit in the same broad semantic space – i.e., they’re both hot drinks), but they’re not much use when we need to build fine-grained classifiers that can distinguish between such closely related classes.

In contrastive learning, embedders are explicitly trained to generate task-specific embeddings which are great at distinguishing between different classes or categories (in this case, tea vs coffee).

This happens in three stages:

  1. Create pairs: The SetFit algorithm starts by creating pairs of sentences. These pairs are labelled as either positive (if the sentences have the same label) or negative (if they have different labels); there’s a small code sketch of this pairing step after the examples below:
Image by author

  2. Embed sentences: Each sentence is passed through a pre-trained sentence transformer to obtain embeddings (vector representations). These embeddings capture the semantic meaning of the sentences.

  3. Adjust the embeddings using contrastive loss: Using a loss function like cosine similarity loss or triplet margin loss (depending on the config you set), the model adjusts the embeddings so that:

  • Embeddings of sentences in positive pairs are pulled closer together in the embedding space.
  • Embeddings of sentences in negative pairs are pushed farther apart.

For example:

  • The embeddings of "I love drinking tea" and "The tea here is terrible" are adjusted to be closer, because they’re both about tea (i.e., they’re a "positive" pair).
  • The embeddings of "I love coffee" and "Tea is amazing" are adjusted to be farther apart, because they’re not about the same subject.
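Here’s that pairing step as a minimal sketch. It’s illustrative only: SetFit’s actual implementation samples pairs rather than enumerating every combination.

from itertools import combinations

def make_contrastive_pairs(texts, labels):
    # Pair up every two sentences: same label -> positive (1.0), different label -> negative (0.0)
    pairs = []
    for (text_a, label_a), (text_b, label_b) in combinations(zip(texts, labels), 2):
        target_similarity = 1.0 if label_a == label_b else 0.0
        pairs.append((text_a, text_b, target_similarity))
    return pairs

texts = ["I love drinking tea", "The tea here is terrible", "I love coffee"]
labels = ["tea", "tea", "coffee"]
print(make_contrastive_pairs(texts, labels))
# [('I love drinking tea', 'The tea here is terrible', 1.0),
#  ('I love drinking tea', 'I love coffee', 0.0),
#  ('The tea here is terrible', 'I love coffee', 0.0)]

These target similarities are exactly what a loss like cosine similarity loss pushes the embeddings towards during fine-tuning.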

The net result is that the model actively learns to separate sentences from different classes in the vector space, ensuring that sentences from the same class are more similar, while sentences from different classes are less similar:

Image by author

This maximises the "information" (in a mathematical sense) contained in the vectors and can help you build really powerful classifiers using very little training data.

A worked example: Classifying technical news articles

Let’s look at a real example to see how SetFit compares to other approaches. We’ll use a subset of the 20 newsgroups dataset (CC BY 4.0 license), which comprises thousands of newsgroup posts, each labelled with the category it belongs to.

We’ll take a sample of 20 posts from each of the 5 closely related (but distinct) computing categories: graphics, Microsoft Windows, IBM PC hardware, Mac hardware, and the X Window System.

Our goal is to build a classifier that can identify, for a given post, the appropriate category. The semantic similarity of these categories makes this a tricky challenge: how will our classifiers do?

First, let’s pip install our required packages and import the required libraries:

!pip install setfit transformers==4.42.2 peft==0.10.0 scikit-learn pandas

import re

import pandas as pd
from datasets import Dataset
from setfit import SetFitModel, Trainer
from sklearn.datasets import fetch_20newsgroups

Next, let’s prepare the data.

# Fetch data for the five comp.* (computing) categories
cats = ['comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware','comp.windows.x']
train = fetch_20newsgroups(subset='train', categories=cats, remove=('headers', 'footers', 'quotes'))
test = fetch_20newsgroups(subset='test', categories=cats, remove=('headers', 'footers', 'quotes'))

# Convert to DataFrame
train_df = pd.DataFrame({'text': train.data, 'label': train.target})
train_df['label'] = train_df['label'].apply(lambda x: train.target_names[x])
test_df = pd.DataFrame({'text': test.data, 'label': test.target})
test_df['label'] = test_df['label'].apply(lambda x: test.target_names[x])

# Lowercase and strip out special characters/numbers
def preprocess_text(text):
    text = text.lower() # Lowercase
    text = re.sub(r'[^a-zA-Z\s]', '', text) # Remove special characters and numbers
    tokens = text.split()
    return ' '.join(tokens)

train_df['text'] = train_df['text'].apply(preprocess_text)
test_df['text'] = test_df['text'].apply(preprocess_text)

# Stratified sample in the training dataset: 20 samples per class
train_df = train_df.groupby('label', group_keys=False).apply(lambda x: x.sample(20, random_state=42)).reset_index(drop=True)

train_df # Preview
Image by author

Before these texts can be used to train a classifier, we need to embed them (i.e., we need to encode them as vectors).

Using SetFit, we can fine-tune an off-the-shelf Sentence Transformer so that it’s attuned to the specific nuances of our text, and then fit a logistic regression classification head:

import os
os.environ["WANDB_DISABLED"] = "true" # If running in Google Colab

# SetFit needs a Dataset class with two cols: `text` and `label`
train_dataset = Dataset.from_pandas(train_df)
val_dataset = Dataset.from_pandas(test_df)

# A popular and performant sentence transformer from HuggingFace
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

# Give the model the distinct class labels to map predictions onto
model.labels = sorted(set(train_dataset['label']))

trainer = Trainer(
    model=model,
    train_dataset=train_dataset,
)

trainer.train()

metrics = trainer.evaluate(val_dataset) # Returns a dict of metrics, e.g. {'accuracy': ...}
y_pred = model.predict(val_dataset['text']) # Array of predictions we can inspect in a `results` DataFrame

results = pd.DataFrame({
    'text': test_df.text,
    'label': test_df.label,
    'y_pred': y_pred
})

print(f"Accuracy: {metrics['accuracy']}")
# 0.63

So, the SetFit-trained model achieved an accuracy of 63% with 20 samples per class. This climbed to 68% using 40 samples per class.
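Once trained, the model can be used and persisted like any other SetFit model. A quick sketch (the directory path is just an example):

# Predict categories for unseen posts
print(model.predict([
    "my new gpu renders 3d scenes much faster",
    "the taskbar keeps freezing after the latest update",
]))

# Persist the fine-tuned model locally, then reload it later
model.save_pretrained("./setfit-newsgroups-classifier")
loaded_model = SetFitModel.from_pretrained("./setfit-newsgroups-classifier")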

How does this compare to other approaches? Let’s look at TF-IDF:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# TF-IDF vectorizer
tfidf = TfidfVectorizer(max_features=5000)
X_train_tfidf = tfidf.fit_transform(train_df['text'])
X_val_tfidf = tfidf.transform(test_df['text'])

# Logistic Regression classifier
lr = LogisticRegression(max_iter=1000)
print("Training TF-IDF + Logistic Regression model...")
lr.fit(X_train_tfidf, train_df['label'])

# Evaluate
tfidf_y_pred = lr.predict(X_val_tfidf)
tfidf_accuracy = accuracy_score(test_df['label'], tfidf_y_pred)

# Collect results so we can compare models later
results_summary = {}
results_summary['TF-IDF'] = {
    'accuracy': tfidf_accuracy,
    'predictions': tfidf_y_pred,
}

print(f"TF-IDF Model Accuracy: {tfidf_accuracy:.4f}")
# 0.48

An accuracy of 48% – much lower.
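As a quick diagnostic (not part of the pipeline above), you can peek at the highest-weighted features the logistic regression learned for each class. With only 20 short posts per class, TF-IDF simply has very little signal to work with:

import numpy as np

# Top 5 highest-weighted TF-IDF features for each class
feature_names = tfidf.get_feature_names_out()
for class_index, class_label in enumerate(lr.classes_):
    top_features = np.argsort(lr.coef_[class_index])[-5:]
    print(class_label, [feature_names[j] for j in top_features])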

RoBERTa models perform better, but only when using a lot more data. Take a look:

import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification, Trainer as HFTrainer, TrainingArguments
from torch.utils.data import Dataset as TorchDataset

# RoBERTa needs integer labels, so map each class name to an id
label2id = {label: i for i, label in enumerate(sorted(set(train_df['label'])))}

# Prepare the data for RoBERTa
class TextClassificationDataset(TorchDataset):
    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.texts = texts
        self.labels = [label2id[label] for label in labels]
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        encoding = self.tokenizer(
            self.texts[idx],
            max_length=self.max_length,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )
        item = {key: val.squeeze(0) for key, val in encoding.items()}
        item["labels"] = torch.tensor(self.labels[idx]) # The HF Trainer expects a `labels` key
        return item

# Tokenizer and model initialization
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
roberta_model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=len(set(train_df['label'])))

train_dataset_roberta = TextClassificationDataset(
    train_df['text'].tolist(),
    train_df['label'].tolist(),
    tokenizer,
)

val_dataset_roberta = TextClassificationDataset(
    test_df['text'].tolist(),
    test_df['label'].tolist(),
    tokenizer,
)

# Training arguments for RoBERTa
training_args = TrainingArguments(
    output_dir="./roberta-results",
    evaluation_strategy="epoch",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    save_steps=1000,
    logging_dir="./logs",
    logging_steps=10,
)

hf_trainer = HFTrainer(
    model=roberta_model,
    args=training_args,
    train_dataset=train_dataset_roberta,
    eval_dataset=val_dataset_roberta,
)

print("Training RoBERTa model...")
hf_trainer.train()

# Evaluate RoBERTa Model
roberta_y_pred = hf_trainer.predict(val_dataset_roberta).predictions.argmax(axis=-1)
roberta_accuracy = accuracy_score(test_df['label'].map(label2id), roberta_y_pred)

results_summary['RoBERTa'] = {
    'accuracy': roberta_accuracy,
    'predictions': roberta_y_pred,
}

print(f"RoBERTa Model Accuracy: {roberta_accuracy:.4f}")

When using the full training set of roughly 3,000 samples, the accuracy was 74%. But when running that code with just 20 samples per class (i.e., the same as SetFit and TF-IDF), the accuracy was only 47%. This matches the findings of the original authors of SetFit:

SetFit is significantly more sample efficient and robust to noise than standard fine-tuning.

In conclusion

I love SetFit. It’s a great, quick way to train performant classifiers in domains where you don’t have much training data, and I’d love to see more data scientists making use of it.

I hope you’ve found this thought-provoking. If you disagree with anything I’ve said, let’s chat! I’d love to hear your take.

Also, feel free to connect with me on LinkedIn, or get my Data Science/AI writing in your inbox via AI in Five!

Until next time 🙂

