Towards sustainable technology: “green” approaches to NLP

A comparative study in terms of performance and energy consumption


By Leonardo Rigutini and Samuel Algherini


BERTology

Over the past decade, we have observed an important paradigm shift in the world of NLP. The increasing diffusion of end-to-end approaches has resulted in a wide range of Large Language Models (LLMs) based on different neural network architectures and consisting of billions of parameters. Given their huge training costs, these giant models are typically exclusive to the handful of global companies (Google, Facebook, Amazon, etc.) that can sustain such costs. These large neural networks are usually released as models pre-trained on millions of documents with a generic language modeling task, and they require a further training phase to fine-tune the model for the customer's requirements, such as categorization, information extraction or similar tasks. In this fine-tuning stage for downstream applications, one or more fully connected layers are typically added on top of the final encoder layer.
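To make this setup concrete, here is a minimal sketch (not the exact configuration used in our experiments) of how a classification head is typically attached to a pre-trained encoder, using the Hugging Face transformers library and the generic "bert-base-uncased" checkpoint:

```python
# Sketch only: a pre-trained BERT encoder with a randomly initialized
# classification head on top, ready to be fine-tuned on labeled data.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=100,  # e.g. one label per category of the downstream task
)

inputs = tokenizer("This Agreement shall be governed by ...", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # one (still untrained) score per category
```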

The best-known family of LLMs is BERT [1], which was developed by Google and is based on the Transformer architecture. This model has spread rapidly in the academic and scientific NLP world (e.g., NER, sentiment analysis, etc.) and in automatic text analysis fields (e.g., categorization and information extraction), where it has become a mandatory reference in any experimental assessment. Its use has been so widespread that it has even created a growing field of study, which some call "BERTology" [2,3]. However, while these approaches have reported excellent performance in scientific and academic scenarios, their limitations begin to emerge when they are applied to real-world problems.

Welcome to the real world

So, what happens when these LLMs are used for real-world tasks and with real customers? In a previous post, we showed how LLM performance decreases significantly in the case of data scarcity, a fairly common situation in the enterprise. In this post, we will discuss how BERT-based approaches tend to become less effective when used in linguistic vertical domains and how expensive they are in terms of computational resources and energy requirements.

In particular, we will report and analyze the results of a recent experiment that was performed with two primary goals in mind:

  1. Performance: To compare the classification performance of BERT-based approaches, “light” models and a hybrid (symbolic + ML) approach when used in a vertical domain. For the symbolic analysis, we used the proprietary technology of expert.ai [12].
  2. Carbon footprint: To evaluate and compare the energy consumption requirements of these approaches.

To do this, we selected a domain that makes frequent use of automatic text analysis systems and that normally processes thousands or millions of large documents per year: LEGAL. Companies in this domain regularly analyze and process millions of contracts each year to identify important passages (“unfair” or “dangerous” clauses, for example) or to locate buyers, sellers or other entities involved.

The LexGLUE Benchmark

Following the spread of multitask benchmarks in the NLP field, such as GLUE and SuperGLUE, the LexGLUE Benchmark [4,10] was recently released. The LexGLUE (Legal General Language Understanding Evaluation) benchmark is a collection of seven datasets focused on the legal domain and built for evaluating model performance across a diverse set of legal NLP tasks. The first version of the benchmark only covers the English language, but more datasets, tasks and languages are expected to be added in later versions of LexGLUE as new legal NLP datasets become available.

The datasets were built from different legal sources, including the European Court of Human Rights (ECtHR), the U.S. Supreme Court (SCOTUS), European Union legislation (EUR-LEX), contract provisions filed with the U.S. Securities and Exchange Commission (LEDGAR), the Terms of Service of well-known online platforms (Unfair-ToS) and a U.S. legal dataset (CaseHOLD). Their details are summarized in Table 1:

Table 1: The 7 datasets included in the LexGLUE benchmark
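For readers who want to explore the data, the benchmark is also distributed through the Hugging Face Hub. The snippet below is a minimal sketch of loading the LEDGAR subset, assuming the publicly released "lex_glue" dataset identifier (a detail not taken from the original paper):

```python
# Sketch only: loading the LEDGAR subset of LexGLUE from the Hugging Face Hub.
from datasets import load_dataset

ledgar = load_dataset("lex_glue", "ledgar")
example = ledgar["train"][0]
print(example["text"][:200])  # the beginning of a contract clause
print(example["label"])       # the index of its category
```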

The analysis

As previously mentioned, our investigation had two goals: (1) to compare the classification performance of LLMs and “light” approaches in a very specific, vertical domain, and (2) to analyze the relative energy consumption of the different approaches, especially in light of the quality of the classification results obtained. In particular, we conducted a series of experiments to examine how LLMs lose their advantage when used in vertical domains, while also showing clear disadvantages in terms of required computational resources and energy consumption.

To do this, we focused on the LEDGAR [5,13] dataset. LEDGAR was presented at the LREC 2020 conference [15] and is freely available [16,17] under the MIT License. It consists of 80,000 clauses extracted from contracts downloaded from the EDGAR [11] site of the U.S. Securities and Exchange Commission [14]. Each clause is assigned to a taxonomy of about 100 categories in a multi-class categorization task. In the original LEDGAR article, the authors compared the performance of several models and showed that a light approach based on Bag-of-Words (BOW) encoding, a Term Frequency (TF) weighting scheme and a Support Vector Machine (SVM) classifier performed similarly to a set of BERT-like models. We replicated these tests and, in addition, introduced a comparison with a symbolic approach based on the proprietary expert.ai NLP platform. The symbolic approach adopted for the experiments involved the following five steps (a minimal sketch of the pipeline is shown after the list):

  1. Linguistic analysis: Data processing with expert.ai NLP technology to extract linguistic information;
  2. Symbolic data representation: Representation of the data using different symbolic representations, namely words, lemmas and semantic concepts (the last provided by the expert.ai Word Sense Disambiguation engine);
  3. Language modeling: Extraction of n-grams for all the previous symbolic representations in a range between 1 and 3;
  4. Feature space representation: Vectorization of the obtained symbolic features using three independent sub-spaces by exploiting a Bag-Of-Words (BOW) encoding and the Term Frequency-Inverse Document Frequency (TF-IDF) weighting scheme with a minimum threshold on document frequency set to 5;
  5. Model training: Training of a linear Support Vector Machine (SVM), performing validation on three different C values: 0.1, 1.0, 10.0.
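Steps 3 to 5 correspond to a fairly standard scikit-learn pipeline. The following is a minimal sketch of that part of the setup, assuming each clause has already been converted into whitespace-separated symbolic tokens (words, lemmas or concepts) by the linguistic analysis step; it is an illustration, not the exact code used in the experiments:

```python
# Sketch of steps 3-5: n-gram extraction, BoW + TF-IDF vectorization and
# a linear SVM validated over three values of C.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    # 1- to 3-grams over the symbolic tokens, discarding features that
    # appear in fewer than 5 documents
    ("tfidf", TfidfVectorizer(ngram_range=(1, 3), min_df=5)),
    ("svm", LinearSVC()),
])

grid = GridSearchCV(pipeline, param_grid={"svm__C": [0.1, 1.0, 10.0]}, cv=3)
# grid.fit(train_texts, train_labels)       # hypothetical training data
# predictions = grid.predict(test_texts)    # hypothetical test data
```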

To compare the models in terms of energy consumption and carbon footprint, we logged the time and energy required using the Python library “codecarbon” [9] (a minimal usage sketch is shown after the hardware list below). All of the experiments were performed on machines with the following hardware configurations:

  • For both the SVM (BoW+TF) and the expert.ai approaches, we used a laptop with an Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz and 8GB of RAM;
  • For the BERT-based approaches, we used the Google Colab environment running on a Tesla T4 GPU with 16 GB RAM.
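As anticipated, the energy and emission figures were logged with codecarbon. A minimal sketch of how such tracking can wrap an experiment is shown below; the tracked function is a placeholder, not the actual training code:

```python
# Sketch only: measuring energy consumption and estimated CO2 emissions
# around an arbitrary experiment with the codecarbon library.
from codecarbon import EmissionsTracker

def run_experiment():
    # placeholder for the actual training/validation/test run
    pass

tracker = EmissionsTracker(project_name="ledgar-experiment")
tracker.start()
try:
    run_experiment()
finally:
    emissions_kg = tracker.stop()  # estimated kg of CO2-equivalent
    print(f"Estimated emissions: {emissions_kg:.6f} kg CO2eq")
```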

Quality …

The classification performance obtained on the LEDGAR dataset is reported in Table 2:

Table 2 — The classification performance of the tested models on the LEDGAR dataset.

As we can see, except for BERT without fine-tuning and for the two spaCy models (BOW and CNN), all of the remaining models show very similar performance. These results show that using an LLM in a specific domain (in this case, the LEGAL domain) does not yield a significant advantage over the other approaches, even when a domain-specific LLM such as LegalBERT [6] is used. Conversely, a symbolic approach that injects linguistic knowledge of the domain into the feature representation (for example, through an NLP platform, as in the expert.ai approach we tested) achieves good performance.

… or savings?

With these classification performances in mind, we can look at the resources consumed during the experiments and their costs.

Table 3 reports the energy consumption and carbon footprint measured for the different models during the experimental phase, which consists of the training, validation and test stages. The results of each model are expressed as a percentage of BERT's consumption, which is set to 100.

Table 3 — The Energy Consumption comparison of the different models in the LEDGAR experiment.

Looking at Table 3, we see a significant gap between the energy consumption of the light models (such as the SVM and the symbolic expert.ai method) and that of the BERT-based approaches. In particular, the expert.ai approach, which also achieved excellent classification performance, reduces consumption by more than 99% compared to the BERT models.

Normally, NLP project development involves three main phases:

  1. Bootstrapping and preparation: In this phase, data scientists and product owners analyze the task and the data and choose the technology to adopt;
  2. Model training and evaluation (R&D): In this phase, data scientists iteratively perform training-validation-test steps to assess the solution;
  3. Final delivery and production: In this final phase, the selected model is released and used in a production environment.

The energy consumption comparison reported in Table 3 refers only to the second of the above phases. As we previously pointed out, in a company this phase is part of the initial setup of the system. It is often repeated several times, and the number of trials depends on the characteristics and nuances of the project. In many cases, the number of experiments needed during step 2 can be significant, making an upfront estimate of the effort unreliable. Thus, the high energy savings achieved in each trial can translate into significant savings on the total costs for the company.

In addition, in our experience, the effort involved in step 2 represents just a fraction of the overall costs. The full operational cost of these analysis systems is also deeply influenced by the production lifecycle (i.e., step 3 above), where the model continuously serves prediction requests. Therefore, a further interesting analysis concerns the comparison of the energy needs of the different models once they are brought into production (i.e., in the prediction phase).

So, we evaluated the energy consumption and the processing time with respect to a fixed number of contracts. In the LEDGAR dataset, the data consists of legal clauses, and in our experience common commercial contracts contain an average of 100 to 300 legal clauses. Therefore, we assumed 200 clauses per contract. Table 4 shows the comparison in terms of reduction in energy consumption in prediction, evaluated over 50 processed contracts.
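As a back-of-the-envelope check on the prediction volume behind Table 4 (the per-clause energy figure below is purely illustrative, not a value from our logs):

```python
# Sketch only: the prediction volume assumed for Table 4 and how a
# per-clause energy measurement would scale to it.
clauses_per_contract = 200   # assumed average (contracts span roughly 100-300 clauses)
contracts = 50               # number of contracts used for the comparison
total_clauses = clauses_per_contract * contracts
print(total_clauses)         # 10,000 clause predictions

energy_per_clause_kwh = 1e-5  # placeholder value, not a measurement from this study
print(f"{total_clauses * energy_per_clause_kwh:.3f} kWh")
```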

Table 4 — The Energy Consumption comparison of the different models in the prediction phase.

While the reduction in energy consumption is not as extreme as in the previous case, it remains significant for both the basic model (SVM) and the symbolic expert.ai approach. In fact, these models reported energy savings of approximately 75% compared to the BERT-based models, which translates into similar savings in terms of cost (€) and carbon footprint (CO2). Even spaCy's BOW model showed some energy savings, although by a smaller percentage. With spaCy's CNN, on the other hand, the energy consumption more than doubled compared to the BERT-based models.

One positive for the BERT-based approaches is that they showed better performance in terms of speed. In fact, the SVM, the expert.ai symbolic approach and both spaCy models (BOW and CNN) reported higher analysis times than the BERT-based models. However, they still remained within acceptable time limits.

Furthermore, as explained in a previous post, BERT-based approaches require large amounts of supervised data to best exploit their potential. This implies that, for an accurate cost analysis, in addition to the high energy consumption of BERT-based approaches, we must also add the equally high cost of preparing and annotating the large amounts of data these models usually require to reach a good level of performance. Conversely, symbolic approaches perform at a high level even in cases of data scarcity, a situation that is very common in real-world scenarios.

This makes these “lighter” approaches less expensive and much more convenient from several points of view.

Conclusion

In this post, we compared several NLP approaches (from the simplest baselines to complex BERT-based models) when used in a specific, vertical domain such as LEGAL. The experimental assessment focused on two main aspects: (1) performance and (2) carbon footprint (and therefore the related costs).

In such a scenario, the classification results showed that lighter models can reach a level of performance similar to that achieved by very complex and heavy models based on Transformer architectures. This suggests that, in specific linguistic domains, LLM-based approaches do not necessarily lead to better performance and, often, a light symbolic approach based on NLP analysis returns results comparable to BERT.

Moreover, the carbon footprint study outlines a clear scenario in which the energy consumption of BERT-based models is considerably higher than that of lighter models. While LLMs may “only” consume about four times as much energy in the prediction phase, they require up to 100 times more energy, i.e., two orders of magnitude, in a typical training-validation-test scenario: not good news for research laboratories!

Finally, the time needed for a typical training-validation-test run is remarkably high for the BERT-based approaches compared to the others. For example, a simple SVM model (BoW+TF-IDF) was 31 times faster (25 minutes vs. 13 hours), while the expert.ai approach was 19.5 times faster (40 minutes vs. 13 hours). In the prediction phase, the BERT approaches were slightly faster, taking about 2.5 minutes compared with 3 minutes for the SVM and nearly 4 minutes for expert.ai.

In general, this assessment showed that, in this specific linguistic domain, light approaches can achieve a level of performance comparable to the more complex BERT-based models, with significant savings of energy (up to two orders of magnitude in a typical training-validation-test run) and time.

Furthermore, taking into account the considerations above on performance in data-scarcity situations, an accurate cost analysis must add, on top of the higher energy consumption, the cost of annotating the large amounts of supervised data that BERT-based approaches require to achieve good performance.

In conclusion, the comparison between LLM-based approaches (BERT and similar) and some classical methods showed that, in a specific domain, they achieve similar classification performance but with a large gap in terms of energy and cost (€). The LLM-based approaches turned out to be only slightly faster in the prediction phase.

Green, fast and still high-performing NLP systems are possible.

A huge thanks to Achille Globo who performed the experiments reported in the post.

References

  1. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  2. https://huggingface.co/docs/transformers/bertology
  3. Rogers, A., Kovaleva, O., & Rumshisky, A. (2020). A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics, 8, 842–866.
  4. Chalkidis, I., Jana, A., Hartung, D., Bommarito, M., Androutsopoulos, I., Katz, D. M., & Aletras, N. (2021). LexGLUE: A benchmark dataset for legal language understanding in English. arXiv preprint arXiv:2110.00976.
  5. Tuggener, D., von Däniken, P., Peetz, T., & Cieliebak, M. (2020). LEDGAR: A large-scale multi-label corpus for text classification of legal provisions in contracts. In 12th Language Resources and Evaluation Conference (LREC 2020) (pp. 1228–1234). European Language Resources Association.
  6. Chalkidis, I., Fergadiotis, M., Malakasiotis, P., Aletras, N., & Androutsopoulos, I. (2020). LEGAL-BERT: The muppets straight out of law school. arXiv preprint arXiv:2010.02559.
  7. Clavié, B., et al. (2021). LegaLMFiT: Efficient short legal text classification with LSTM language model pre-training. arXiv preprint arXiv:2109.00993.
  8. Schweighofer, E. (2021). The unreasonable effectiveness of the baseline: Discussing SVMs in legal text classification.
  9. https://github.com/mlco2/codecarbon
  10. https://github.com/coastalcph/lex-glue
  11. https://www.sec.gov/edgar/search/
  12. https://www.expert.ai/products/expert-ai-platform/
  13. https://aclanthology.org/2020.lrec-1.155/
  14. https://www.sec.gov/
  15. https://lrec2020.lrec-conf.org/en/
  16. LEDGAR License info
  17. LEDGAR Licence Readme
