Written by Samuel Algherini and Leonardo Rigutini

The Large Language Models era
Over the past decade, NLP has gone through an important paradigm shift. With the emergence of Deep Learning, end-to-end approaches have gradually replaced the original machine learning approaches, which relied on a rigorous phase of feature analysis and selection, and they (Transformers in particular) are currently the state of the art (SOTA) in NLP and related fields. These methods normally consist of very large artificial neural network models (billions of parameters) trained to model language with a statistics-based approach. Training such networks usually requires a large set of data and substantial computational resources. Because of this demand for data, power and budget, the models are normally pre-trained on very large datasets using generic tasks and then released to be used as they are or integrated into proprietary tools. You can find a useful discussion about these giants in this post.
The integration phase normally requires a fine-tuning stage that tailors the model to the data and task at hand, so that it responds to specific business needs. Notably, this process requires a large amount of labeled (supervised) data reflecting the characteristics and requirements of the task to replicate. However, in real-world scenarios that amount of labeled data is usually not available, and producing it is time consuming and fairly expensive (here you can find an interesting article on a similar problem). Therefore, two of the greatest challenges of applying machine learning techniques to business or industry-specific use cases are the scarcity of data and the lack of computational resources.
The research around "few-shot learning" [11] has focused mostly on studying and comparing approaches that learn from small sets of supervised data and large sets of unlabeled data. In this field, we’ve seen an increasing number of new hybrid approaches in which the dense representations returned by common Large Language Models (LLM) are combined with symbolic representations of knowledge (often enriched with linguistic and semantic information), providing a remarkable boost even when working with smaller datasets.
But AI is not just Deep Learning
Gartner defines "Composite AI" as the combination of different AI techniques to achieve better results. That’s right folks, AI is not exclusively Machine Learning or Deep Learning anymore. For instance, rule-based systems are a classic in the AI field, and they are a different but effective approach to solving specific tasks. The hybrid approach discussed in this article, instead, consists of an ML algorithm powered by a symbolic representation of text.
This symbolic representation of text leverages a rich set of linguistic information (morpho-syntactic and semantic data) produced by a previous step of [NLP](https://en.wikipedia.org/wiki/Natural_language_processing) analysis, which includes many of the most common NLP tasks such as lemmatization, dependency and constituency parsing, PoS tagging, morphological analysis, relation detection and more. Moreover, these techniques rely on a rich knowledge graph: a huge body of knowledge made of nodes and connections representing concepts and the relations between them. The different types of information produced by this analysis step are represented in separate vector spaces, which are concatenated and fed to the Machine Learning algorithm.
Basically, in this hybrid approach, the most common representation techniques based on dense vectors (i.e. embeddings) resulting from the use of the Large Language Model (LLM) are replaced with symbolic representations of text in which each dimension of the vector encodes a clear and explicit linguistic characteristic of text.
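To make the idea of concatenated symbolic vector spaces more concrete, here is a minimal sketch in plain Python. It is not the expert.ai implementation: the annotations (lemmas, PoS tags, knowledge graph concepts) are hard-coded, hypothetical stand-ins for what an NLP analysis step would produce, and the names `SPACES`, `build_vocab` and `vectorize` are invented for the example. Each feature space gets its own vocabulary, and the per-space count vectors are concatenated so that every dimension keeps an explicit, readable meaning.

```python
from typing import Dict, List

# Hypothetical output of an NLP analysis step for two short texts.
# A real pipeline would compute these annotations; here they are hard-coded.
docs = [
    {"lemma": ["bank", "charge", "fee"],
     "pos": ["NOUN", "VERB", "NOUN"],
     "concept": ["finance.bank", "finance.fee"]},
    {"lemma": ["card", "be", "block"],
     "pos": ["NOUN", "AUX", "VERB"],
     "concept": ["finance.card"]},
]

# One separate vector space per type of linguistic information (assumed names).
SPACES = ["lemma", "pos", "concept"]


def build_vocab(docs: List[dict], space: str) -> Dict[str, int]:
    """Map each distinct symbol of one feature space to a column index."""
    symbols = sorted({s for d in docs for s in d[space]})
    return {s: i for i, s in enumerate(symbols)}


def vectorize(doc: dict, vocabs: Dict[str, Dict[str, int]]) -> List[int]:
    """Count symbols within each space, then concatenate the per-space vectors."""
    vector: List[int] = []
    for space in SPACES:
        vocab = vocabs[space]
        counts = [0] * len(vocab)
        for symbol in doc[space]:
            if symbol in vocab:
                counts[vocab[symbol]] += 1
        vector.extend(counts)
    return vector


vocabs = {space: build_vocab(docs, space) for space in SPACES}
# Every dimension has an explicit name, unlike the dimensions of a dense embedding.
feature_names = [f"{space}={sym}" for space in SPACES for sym in sorted(vocabs[space])]
X = [vectorize(d, vocabs) for d in docs]

for name, value in zip(feature_names, X[0]):
    if value:
        print(name, value)
```

The resulting rows of `X` could then be passed to any standard classifier. The point of the sketch is only that each column, such as `pos=NOUN`, can be read directly.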

The challenge
Combining Symbolic AI with Machine Learning can be a game changer when supervised data are scarce. In fact, fine-tuning an LLM is extremely complex and ineffective when there is simply not enough data.
In this experiment we compared the performance of several mainstream Machine Learning techniques with the Hybrid ML used in the expert.ai platform under data scarcity conditions: the size of the training set (supervised data) varied from a few examples per class to around a hundred examples per class. The comparison focused on:
- Models based on Transformer architecture: BERT [1,2] and its derivatives DistilBERT [3,4] and RoBERTa [5,6];
- Models provided by SpaCy [7]: BoW, CNN and Ensemble;
- Standard SkLearn [8,9] ML algorithms (SVM, NB, RF, LR) on BoW representation;
- expert.ai’s hybrid model [10].
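As an illustration of what a classical algorithm on a BoW representation looks like, here is a minimal sketch of a multinomial Naive Bayes classifier over word counts. It is written in plain Python so that it runs without dependencies, and it is not the SkLearn setup used in the experiment. The toy complaints and the two label names are invented for the example and do not come from the dataset.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Very simple bag-of-words tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def train_nb(texts, labels, alpha=1.0):
    """Fit multinomial Naive Bayes with add-alpha (Laplace) smoothing."""
    class_docs = Counter(labels)           # number of documents per class
    word_counts = defaultdict(Counter)     # word frequencies per class
    for text, label in zip(texts, labels):
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return {"class_docs": class_docs, "word_counts": word_counts,
            "vocab": vocab, "alpha": alpha, "n_docs": len(texts)}


def predict_nb(model, text):
    """Return the class with the highest log posterior for the given text."""
    vocab, alpha = model["vocab"], model["alpha"]
    best_label, best_score = None, -math.inf
    for label, n in model["class_docs"].items():
        counts = model["word_counts"][label]
        total = sum(counts.values()) + alpha * len(vocab)
        score = math.log(n / model["n_docs"])          # log prior
        for word in tokenize(text):
            if word in vocab:                          # skip unseen words
                score += math.log((counts[word] + alpha) / total)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy data, for illustration only.
texts = [
    "my credit card was charged twice",
    "the card fee is too high",
    "my mortgage payment was lost",
    "the mortgage rate changed",
]
labels = ["card", "card", "mortgage", "mortgage"]

model = train_nb(texts, labels)
print(predict_nb(model, "card charged a fee"))
print(predict_nb(model, "mortgage payment"))
```

With only a handful of documents per class, a model like this can rely only on the exact words it has seen, which is the limitation an enriched symbolic representation is meant to reduce.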
For this comparison, we selected the Consumer Complaint Dataset (CCD): a collection of complaints about financial products and services that is publicly available on Kaggle. The dataset consists of real-life complaints collected from existing companies. Each complaint has been tagged by product, which yields a supervised text classification task with 9 target categories (labels).
For our experiment, the longest and shortest texts were removed, leaving a final dataset of 80,523 supervised documents: 10% of them (8,052) were used as the test set and the remaining ones as training data. To measure the classification capabilities of each model in a few-shot learning scenario, we built 4 training sets of increasing size by random subsampling:
- T90: 10 documents per class (total size 90);
- T450: 50 documents per class (total size 450);
- T810: 90 documents per class (total size 810);
- TFull: no subsampling (total size 72,471).
We used an incremental procedure to build the training sets, so that the supervised documents selected for each category in a smaller set were always included in the larger sets.
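The incremental construction can be sketched as follows. This is a minimal illustration and not the authors' actual procedure: the function name `nested_subsets`, the seed and the synthetic labels are assumptions made for the example. Each class is shuffled once, and every training set takes the first k documents of that fixed order, so a smaller set is always contained in the larger ones.

```python
import random
from collections import defaultdict


def nested_subsets(labels, sizes_per_class, seed=0):
    """Return {k: sorted document indices} with k documents per class.

    Subsets are nested: the set built with a smaller k is always
    contained in the set built with a larger k.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    for indices in by_class.values():
        rng.shuffle(indices)  # one fixed random order per class

    subsets = {}
    for k in sorted(sizes_per_class):
        chosen = []
        for label, indices in by_class.items():
            if len(indices) < k:
                raise ValueError(f"class {label!r} has fewer than {k} documents")
            chosen.extend(indices[:k])  # prefix of the fixed order
        subsets[k] = sorted(chosen)
    return subsets


# Synthetic stand-in for the training data: 9 classes, 200 documents each.
labels = [c for c in range(9) for _ in range(200)]
subsets = nested_subsets(labels, [10, 50, 90])

for k, indices in subsets.items():
    print(f"T{len(indices)}: {k} documents per class")
```

Taking prefixes of a single shuffled order per class is one simple way to guarantee the inclusion property. Drawing each training set independently would not guarantee it.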

And the winner is …
Table 1 shows the categorization performance of all the models as the training set size increases. For the expert.ai Hybrid model, the values are the best results obtained across 4 distinct algorithms (SVM, Naive Bayes, Random Forest and Logistic Regression), each leveraging the symbolic representation of text.

With smaller training sets (T90, T450 and T810), the Hybrid approach achieves the best performance, with a significant boost especially compared to transformer-based models (e.g. BERT) and, more generally, to deep neural networks (such as the SpaCy models).
The results were not unexpected, since artificial neural networks, especially deep ones, usually require large amounts of supervised data for effective fine-tuning. In the absence of the necessary data, performance is likely to be very poor.
An interesting finding of the experiment is that the enriched representation of text coming from symbolic AI generally provided better results than mainstream algorithms too. This is clear from the results of the SkLearn models, which are lower than those of expert.ai’s hybrid approach without exception.
As expected, the gap between the approaches shrinks as the training set grows, and it all but disappears when the entire training set is used (TFull). In this scenario, the best performing model is RoBERTa-Large, although all the others, including expert.ai’s Hybrid ML approach, follow with a very small deviation.
Conclusions
These experiments support the hypothesis that most deep neural network approaches may not be very effective in the typical real-world scenario of supervised data scarcity.
In these situations, hybrid approaches leveraging symbolic representations of text proved more effective and produced the best results. This was not unexpected, as deep models usually require large amounts of supervised data for the fine-tuning phase as well, and their performance tends to degrade when such data is missing.
In hybrid approaches, the enriched symbolic representation of text compensates for the scarcity of supervised data, outperforming even the classical methods based on BoW representation.
References:
[1] Devlin, Jacob, et al. "BERT: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).
[2] BERT-Large uncased on Hugging Face: https://huggingface.co/bert-large-uncased
[3] Sanh, Victor, et al. "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter." arXiv preprint arXiv:1910.01108 (2019).
[4] DistilBERT on Hugging Face: https://huggingface.co/distilbert-base-uncased
[5] Liu, Yinhan, et al. "RoBERTa: A robustly optimized BERT pretraining approach." arXiv preprint arXiv:1907.11692 (2019).
[6] RoBERTa-Large on Hugging Face: https://huggingface.co/roberta-large
[7] SpaCy: https://spacy.io
[8] Pedregosa, Fabian, et al. "Scikit-learn: Machine learning in Python." Journal of Machine Learning Research 12 (2011): 2825–2830.
[9] SkLearn: https://scikit-learn.org/stable/index.html
[10] expert.ai: https://www.expert.ai
[11] Wang, Yaqing, et al. "Generalizing from a few examples: A survey on few-shot learning." ACM Computing Surveys (CSUR) 53.3 (2020): 1–34. https://arxiv.org/pdf/1904.05046.pdf