
Zero-Shot vs. Similarity-Based Text Classification

An evaluation of unsupervised text classification approaches

Image by Gertrūda Valasevičiūtė on Unsplash

This post is based on our NLPIR 2022 paper "Evaluating Unsupervised Text Classification: Zero-shot and Similarity-based Approaches". You can read more details there.

What is unsupervised text classification?

Unsupervised text classification approaches aim to categorize texts without using annotated data during training and therefore offer the potential to reduce annotation costs 💰. In general, they map texts to labels based on the labels' textual descriptions. There are two main categories of approaches.

👯 The first category comprises similarity-based approaches. These approaches generate semantic embeddings of both the texts and the label descriptions and then match the texts to the labels using a similarity measure such as cosine similarity.

0️⃣ 🔫 The second category uses zero-shot learning to classify texts of unseen classes. Zero-shot learning uses labeled training instances belonging to seen classes to learn a classifier that can predict testing instances belonging to different, unseen classes. For example, a zero-shot classification model may learn to correctly classify texts about soccer ⚽ during training and then use this knowledge during testing to classify texts about basketball 🏀 without ever having seen texts about basketball before. The idea is that the model can transfer the knowledge learned about soccer to the very similar class of basketball. Although zero-shot learning techniques employ annotated data for training, they do not use labeled instances of the target classes; instead, they rely on their knowledge of the previously seen classes to classify instances of unseen classes. Since pretrained zero-shot text classification models do not require fine-tuning on labeled data from the target classes, we categorize them as an unsupervised text classification strategy.

In this blog post, we summarize the contributions of our paper 📄 "Evaluating Unsupervised Text Classification: Zero-shot and Similarity-based Approaches (2022)" as follows:

  • We evaluate the similarity-based and zero-shot learning categories for unsupervised text classification of topics. Thereby, we conduct experiments with representative approaches of each category on different benchmark datasets.
  • We propose simple but strong baselines for unsupervised text classification based on SimCSE and SBERT sentence embedding similarities. Previous work has mostly been evaluated against weak baselines such as Word2Vec similarities, which are easy to outperform and therefore tend to make new unsupervised text classification approaches look better than they are.
  • Since transformer-based text representations have been widely established as state-of-the-art for semantic text similarity in recent years, we further adapt Lbl2Vec, one of the most recent and well-performing similarity-based methods for unsupervised text classification, to be used with transformer-based language models.

Unsupervised text classification approaches

👯 Similarity-based text classification with Lbl2Vec

Numerous similarity-based approaches for unsupervised text classification exist. However, the recently introduced [Lbl2Vec](https://medium.com/towards-data-science/unsupervised-text-classification-with-lbl2vec-6c5e040354de) approach yields improved performance compared with other similarity-based approaches. Therefore, we focus on this approach in this study. Lbl2Vec works by jointly embedding word, document, and label representations. First, word and document representations are learned with Doc2Vec. Then, the average of the label keyword representations for each class is used to find a set of most similar candidate document representations via cosine similarity. The average of these candidate document representations, in turn, yields the label vector for each class. Finally, each document is assigned to the class whose label vector has the highest cosine similarity with the document vector. Here, you can find more information about how Lbl2Vec works.
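The steps described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the Lbl2Vec implementation: the 2-D vectors are made up and stand in for learned Doc2Vec embeddings, the fixed `n_candidates` cutoff is an assumption made for brevity, and all function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_vector(vectors):
    """Component-wise average of a list of vectors."""
    return [sum(component) / len(vectors) for component in zip(*vectors)]

def label_vector(keyword_vecs, doc_vecs, n_candidates=2):
    """Build a label vector from keyword embeddings and document embeddings."""
    # Step 1: average the keyword embeddings of the class.
    keyword_centroid = mean_vector(keyword_vecs)
    # Step 2: pick the documents most similar to that average (assumed top-n cutoff).
    ranked = sorted(doc_vecs, key=lambda d: cosine(keyword_centroid, d), reverse=True)
    # Step 3: the average of the candidate documents becomes the label vector.
    return mean_vector(ranked[:n_candidates])

def classify(doc_vec, label_vecs):
    """Assign the class whose label vector is most similar to the document."""
    return max(label_vecs, key=lambda name: cosine(doc_vec, label_vecs[name]))

# Made-up embeddings for illustration only.
word_vecs = {
    "soccer": [1.0, 0.1], "goal": [0.9, 0.2],
    "election": [0.1, 1.0], "vote": [0.2, 0.9],
}
doc_vecs = [[0.95, 0.15], [0.8, 0.3], [0.15, 0.9], [0.3, 0.8], [0.6, 0.5]]
label_keywords = {"sports": ["soccer", "goal"], "politics": ["election", "vote"]}

label_vecs = {
    name: label_vector([word_vecs[w] for w in keywords], doc_vecs)
    for name, keywords in label_keywords.items()
}
predictions = [classify(d, label_vecs) for d in doc_vecs]
print(predictions)
```

Note that the label vector is built from documents rather than directly from the keywords, so a class is represented by what its documents actually look like in the embedding space.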

Furthermore, we adapt the Lbl2Vec approach, using transformer-based text representations instead of Doc2Vec to create jointly embedded word, document, and label representations. Since transformer-based text representations currently achieve state-of-the-art results in text-similarity tasks, we investigate the effect of the different resulting text representations on this similarity-based text classification strategy. In this paper, we use SimCSE and SBERT transformer models to create text representations. In the following, this approach is referred to as Lbl2TransformerVec.

0️⃣ 🔫 Zero-shot text classification using the entailment approach

Although newer zero-shot text classification (0SHOT-TC) approaches exist, the zero-shot entailment approach still produces state-of-the-art 0SHOT-TC results in predicting instances of unseen classes compared to models of similar size. As the name already implies, the zero-shot entailment approach deals with 0SHOT-TC as a textual entailment problem. The underlying idea is similar to that of similarity-based text classification approaches. Conventional 0SHOT-TC classifiers fail to understand the actual problem since the label names are usually converted into simple indices. Therefore, these classifiers can hardly generalize from seen to unseen classes. Considering 0SHOT-TC as an entailment problem provides the classifier with a textual label description and therefore enables it to understand the meaning of labels.
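To make the entailment formulation concrete, here is a minimal sketch under stated assumptions. Each label is turned into a textual hypothesis, and the label whose hypothesis is most strongly entailed by the text wins. The hypothesis template, the function names, and especially `toy_entailment_score` are hypothetical: the toy scorer only measures word overlap so that the example runs without a model, and it is not an entailment model.

```python
import re

def classify_by_entailment(text, labels, entailment_score,
                           template="This text is about {}."):
    """Pick the label whose hypothesis is most entailed by the text.

    `entailment_score(premise, hypothesis)` is any callable returning a
    number that grows with the entailment confidence.
    """
    # The label name stays visible to the model as part of the hypothesis,
    # instead of being reduced to a class index.
    scores = {
        label: entailment_score(text, template.format(label))
        for label in labels
    }
    return max(scores, key=scores.get), scores

def toy_entailment_score(premise, hypothesis):
    """Stand-in scorer (NOT a real NLI model): share of hypothesis words
    that also occur in the premise."""
    premise_tokens = set(re.findall(r"[a-z]+", premise.lower()))
    hypothesis_tokens = set(re.findall(r"[a-z]+", hypothesis.lower()))
    return len(premise_tokens & hypothesis_tokens) / len(hypothesis_tokens)

best, scores = classify_by_entailment(
    "The basketball game is tonight",
    ["basketball", "politics"],
    toy_entailment_score,
)
print(best, scores)
```

In practice, the scorer would be a pretrained natural language inference model; the Hugging Face `transformers` library, for instance, offers a `zero-shot-classification` pipeline built on this idea.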

0️⃣ 🔫 Zero-shot text classification using TARS

TARS also uses the textual label description to classify text in a zero-shot setting. However, TARS approaches the task as a binary classification problem, where a text and a textual label description are given to the model, which then predicts whether that label applies or not. The TARS authors state that this approach significantly outperforms GPT-2 in 0SHOT-TC.

📐 Baselines

We compare the results of current state-of-the-art unsupervised text classification approaches with some basic baselines to evaluate their performance.

LSA: For each dataset, we apply LSA to learn 𝑛 concepts, where 𝑛 is the number of classes. Afterwards, the text documents are classified according to the highest cosine similarity between the resulting LSA vectors of documents and label keywords.

Word2Vec: This produces semantic vector representations of words based on surrounding context words. The average of word embeddings is used to represent the text documents and label keywords. For classification, each text document is assigned to the label whose keyword representation has the highest cosine similarity with the document representation.
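The averaging-and-matching procedure of this baseline can be sketched as follows. The tiny embedding table is made up for illustration (real Word2Vec vectors are learned from a corpus and have far more dimensions), and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def average_embedding(tokens, embeddings):
    """Average the embeddings of all tokens that are in the vocabulary."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        raise ValueError("no token of the input is in the vocabulary")
    return [sum(component) / len(vectors) for component in zip(*vectors)]

def classify(doc_tokens, label_keywords, embeddings):
    """Return the label whose averaged keywords are closest to the document."""
    doc_vec = average_embedding(doc_tokens, embeddings)
    similarities = {
        label: cosine(doc_vec, average_embedding(keywords, embeddings))
        for label, keywords in label_keywords.items()
    }
    return max(similarities, key=similarities.get)

# Made-up 3-D word embeddings for illustration only.
embeddings = {
    "match": [0.9, 0.1, 0.0], "team": [0.8, 0.2, 0.1], "sports": [1.0, 0.0, 0.1],
    "stocks": [0.0, 0.9, 0.2], "market": [0.1, 1.0, 0.1], "business": [0.0, 1.0, 0.0],
}
label_keywords = {"Sports": ["sports"], "Business": ["business"]}

sports_doc = classify("the team won the match".split(), label_keywords, embeddings)
business_doc = classify("stocks market fell".split(), label_keywords, embeddings)
print(sports_doc, business_doc)
```

The same matching step carries over to any encoder that maps documents and label keywords into a shared vector space; only `average_embedding` would be swapped for the encoder's own output.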

SimCSE: This is a contrastive learning framework that produces sentence embeddings which achieve state-of-the-art results in semantic similarity tasks. We use SimCSE document embeddings as document representations and SimCSE label keyword embeddings as class representations. Finally, the text documents are classified according to the highest cosine similarity between the resulting SimCSE representations of documents and label keywords.

SBERT: This is a modification of [BERT](https://aclanthology.org/N19-1423/) that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings. We use the same classification approach as with SimCSE, except that we now use SBERT embeddings instead of SimCSE embeddings.

🔬 Experiments

For our SimCSE experiments, we use the sup-simcse-roberta-large model. To create embeddings with SBERT, we use two different pretrained SBERT models. We choose the general-purpose models all-mpnet-base-v2 and all-MiniLM-L6-v2, which were trained on more than one billion training pairs and are expected to perform well on sentence similarity tasks. The all-mpnet-base-v2 model is larger than the all-MiniLM-L6-v2 model and provides slightly better sentence embeddings. The smaller all-MiniLM-L6-v2 model, on the other hand, encodes about five times faster while still providing sentence embeddings of high quality.

For evaluation of 0SHOT-TC, we conduct experiments with three different pretrained zero-shot entailment models: a DeBERTa model, a large BART model, and a smaller DistilBERT model. For TARS experiments, we use the BERT-based pretrained tars-base-v8 model.

💾 Data

Our evaluation is based on four publicly available text classification datasets from different domains: 20Newsgroups, AG’s Corpus, Yahoo! Answers, and Medical Abstracts. As we use the semantic meaning of class descriptions for unsupervised text classification, we infer label keywords from each class name, which then serve as textual class descriptions. This inference step simply consists of using the class names provided by the official documentation of the datasets as label keywords. In a few cases, we additionally substituted the class names with synonymous or semantically similar keywords if we considered this to be a more appropriate description of a certain class.

Overview of the used text classification datasets.

📊 Evaluation Results

F1-scores (micro) of the examined text classification approaches on different datasets. The best results on the respective dataset are displayed in bold. Since we use micro-averaging to calculate our classification metrics, the F1, precision, and recall scores are identical.
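The equality of micro-averaged F1, precision, and recall can be checked with a small worked example. In single-label classification, every misclassified document counts once as a false positive (for the predicted class) and once as a false negative (for the true class), so the two totals always match. The labels and predictions below are made up, and `micro_scores` is a hypothetical helper.

```python
def micro_scores(y_true, y_pred):
    """Micro-averaged precision, recall, and F1 for single-label predictions."""
    tp = fp = fn = 0
    # Sum true positives, false positives, and false negatives over all classes.
    for label in set(y_true) | set(y_pred):
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label:
                fp += 1
            elif truth == label:
                fn += 1
    if tp == 0:
        return 0.0, 0.0, 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up gold labels and predictions: 3 of 5 documents are classified correctly.
y_true = ["sports", "sports", "politics", "medicine", "politics"]
y_pred = ["sports", "politics", "politics", "medicine", "sports"]

precision, recall, f1 = micro_scores(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(precision, recall, f1, accuracy)
```

All three micro-averaged scores coincide with plain accuracy here, which is why a single number per approach and dataset is sufficient.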

None of the baselines achieves the highest F1-score on any dataset. This indicates that advanced unsupervised text classification approaches usually yield better results than simple baseline approaches. The LSA and Word2Vec approaches generally yield the worst results and are easy to outperform 👎. In contrast, the SimCSE and SBERT baselines produce strong F1-scores 💪 that even some of the advanced approaches could not surpass in certain cases. In particular, the SimCSE and SBERT baselines produce better results than the Lbl2Vec similarity-based approach on three datasets. Nevertheless, we can deduce that advanced similarity-based approaches generally produce better unsupervised text classification results than simple baseline approaches or 0SHOT-TC. Specifically, the Lbl2TransformerVec approaches using SBERT embeddings appear promising, as they consistently perform well across all datasets 🏆 and outperform the baseline results.

In contrast, the 0SHOT-TC approaches perform consistently poorly and in the majority of cases did not even manage to outperform the baseline results 😲. However, the DeBERTa zero-shot entailment model classified the domain-specific medical abstracts surprisingly well and achieved the best F1-score of all classifiers on this dataset. We observe that the large DeBERTa zero-shot entailment model always significantly outperforms the smaller BART-large and DistilBERT zero-shot entailment models. Additionally, the BERT-based TARS model performs slightly better than the smaller DistilBERT zero-shot entailment model, except on the domain-specific Medical Abstracts dataset. Thus, we conclude that the performance of the 0SHOT-TC approaches improves with increasing model size 🤔.

💡 Conclusion

  • Similarity-based TC approaches generally outperform 0SHOT-TC approaches in a variety of different domains.
  • The characteristics of text embeddings enable representations of similar topics or classes to be located close to each other in embedding space. This implies that text representations which are capable of coherently clustering topics in the embedding space perform well when used in unsupervised text classification approaches. This characteristic is also evident in our work and can be seen in the figure below 👇 .
DensMAP visualizations of the document embeddings for each dataset. The document embeddings were created using SBERT (all-mpnet-base-v2).
  • Simple approaches such as LSA or Word2Vec are easy to outperform and therefore are not recommended to be used as baselines for text classification of unseen classes.
  • SimCSE and SBERT baseline approaches generate strong unsupervised text classification results, outperforming even some more advanced classifiers. Therefore, we propose to use SimCSE and SBERT baselines for evaluating unsupervised text classification approaches and 0SHOT-TC performance on unseen classes in future work.
  • Lbl2TransformerVec, our proposed similarity-based text classification approach, yields the best F1-scores for almost all datasets.

We made our Lbl2TransformerVec code publicly available at https://github.com/sebischair/Lbl2Vec. If you want to read more details about our approach, you can read our paper here.

Sources

Evaluating Unsupervised Text Classification: Zero-shot and Similarity-based Approaches

Lbl2Vec: An Embedding-Based Approach for Unsupervised Document Retrieval on Predefined Topics

Unsupervised Text Classification with Lbl2Vec

GitHub – sebischair/Lbl2Vec: Lbl2Vec learns jointly embedded label, document and word vectors to…

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

SimCSE: Simple Contrastive Learning of Sentence Embeddings

Benchmarking Zero-shot Text Classification: Datasets, Evaluation and Entailment Approach

Task-Aware Representation of Sentences for Generic Text Classification

