Open Language Models

As open source language models become more readily available, getting lost in all the options is easy.
How do we determine their performance and compare them? And how can we confidently say that one model is better than another?
This article provides some answers by presenting training and evaluation metrics, and general and specific benchmarks to have a clear picture of your model’s performance.
Perplexity
Language models define a probability distribution over a vocabulary to select the most likely next word in a sequence. Given a text, a language model assigns a probability to every word in its vocabulary as a candidate for the next word, and the most likely one is selected.
Perplexity measures how well a language model can predict the next word in a given sequence. As a training metric, it shows how well the model has learned its training set.
We won’t go into the mathematical details but intuitively, minimizing perplexity means maximizing the predicted probability.
In other words, the best model is the one that is not surprised when it sees the new text because it’s expecting it – meaning it already predicted well what words are coming next in the sequence.
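To make this concrete: perplexity is the exponential of the average negative log-probability the model assigns to each actual next token. Below is a minimal, self-contained sketch using made-up probabilities (not taken from any real model) that shows how a confident model gets a low perplexity and a "surprised" one gets a high perplexity.

```python
import math

def perplexity(token_probs):
    # Perplexity = exp of the average negative log-probability
    # the model assigned to each observed next token.
    neg_log_likelihoods = [-math.log(p) for p in token_probs]
    return math.exp(sum(neg_log_likelihoods) / len(neg_log_likelihoods))

# Made-up probabilities a model might assign to the true next token
# at each position of a short sequence.
print(perplexity([0.9, 0.8, 0.85, 0.95]))  # ~1.15: confident, low perplexity
print(perplexity([0.1, 0.2, 0.05, 0.15]))  # ~9.0: "surprised", high perplexity
```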
While perplexity is helpful, it doesn’t consider the meaning behind the words or the context in which they are used, and it’s influenced by how we tokenize our data – different language models with varying vocabularies and tokenization techniques can produce varying perplexity scores, making direct comparisons less meaningful.
Perplexity is a useful but limited metric. We use it primarily to track progress during a model’s training or to compare different versions of the same model. For instance, after applying quantization – a technique that reduces a model’s computational demands – we often use perplexity to assess any changes in the model’s quality.
Perplexity is just one part of the equation – it offers valuable insights but doesn’t tell the whole story.¹

BLEU and ROUGE
If you’re into Natural Language Processing, you may have heard about the ROUGE and BLEU scores.
Introduced in the early 2000s for machine translation, they quantify how close the machine text is to a human reference.
At its core, BLEU is a precision-style score: it counts how many words (and short n-grams) of the machine-generated text also appear in the human reference, divided by the total number of words generated. Like precision, it takes values between zero and one, where values closer to one indicate more similar texts.
ROUGE works on similar principles but is a bit more complex, since it analyzes overlap through several aspects, such as n-grams (ROUGE-N), longest common subsequences (ROUGE-L) and skip-bigrams (ROUGE-S).
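As a rough illustration, here is one way to compute both scores for a single candidate/reference pair, using the third-party nltk and rouge-score packages; treat it as a minimal sketch of one possible setup rather than the only way to do it.

```python
# pip install nltk rouge-score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "the cat sat on the mat"   # human-written reference
candidate = "the cat is on the mat"    # machine-generated output

# BLEU: n-gram precision of the candidate against the reference
bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE: unigram overlap (ROUGE-1) and longest common subsequence (ROUGE-L)
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

print(f"BLEU: {bleu:.2f}")
print(f"ROUGE-1 F1: {rouge['rouge1'].fmeasure:.2f}")
print(f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.2f}")
```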
When it comes to Large Language Models, BLEU and ROUGE are used to evaluate how closely the output aligns with a human-written reference that is considered correct. But they are not enough for every generative task. As you can imagine, **producing the reference text** can be expensive, time-consuming, and for some domains or languages not even feasible.
Sometimes there isn’t just one correct way to summarize or translate a text, yet these scores can only account for the few reference options we provide.
Also, they don’t take into consideration the context – a text that works for a news article might not be the best fit for a social media post, and what’s suitable for a formal setting might not be appropriate for a casual one.
The need for benchmarks
Open source models are usually smaller and fine-tuned to be more specialized for a particular task.
Meta’s founder, Mark Zuckerberg, thinks we’ll interact with different AI entities for different needs instead of relying on a general-purpose AI assistant.² To really understand which model best suits a particular task, we need a way to compare them.
Specific benchmarks assess a particular aspect of a language model. For example, if you want to evaluate how truthful your model’s answers are, or quantify how well it performs on a task after fine-tuning, use a specific benchmark.
Four of them are used in _Hugging Face’s Open LLM Leaderboard_.
The Abstraction and Reasoning Corpus (ARC) is an abstract reasoning test. It applies to humans and AIs and tries to measure a human-like form of fluid intelligence. Given an input grid, the user needs to choose the correct output.

HellaSwag is a test where the user needs to pick the best ending to a given context, a task called commonsense inference. While easy for humans, many LLMs struggle with this test. The only one able to reach almost human-level performance is GPT-4.

Massive Multitask Language Understanding (MMLU) measures a text model’s multitask accuracy on 57 tasks, including mathematics, US history, computer science, law, and more. The test consists of multiple-choice questions on a wide range of subjects and assesses both understanding of the world and general knowledge.

TruthfulQA consists of two tasks: generation and multiple-choice. The generation task requires models to produce truthful and informative answers to questions, while the multiple-choice task requires models to select or assign probabilities to true and false answer choices. The benchmark spans 38 categories and uses various metrics to measure a model’s ability to recognize false information. Interestingly, the paper shows that larger models tend to be less truthful.⁶
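Under the hood, multiple-choice benchmarks like ARC, HellaSwag and MMLU are typically scored by asking the model how likely each candidate answer is and picking the highest-scoring option. The sketch below illustrates that loop; `option_loglikelihood` is a hypothetical placeholder for a real model call, not an actual API.

```python
def option_loglikelihood(question: str, option: str) -> float:
    """Hypothetical placeholder: return the model's log-probability of
    `option` as a continuation of `question` (e.g. summed token log-probs)."""
    raise NotImplementedError

def predict(question: str, options: list[str]) -> int:
    # Score every candidate answer and return the index of the most likely one.
    scores = [option_loglikelihood(question, opt) for opt in options]
    return max(range(len(options)), key=lambda i: scores[i])

def accuracy(dataset: list[tuple[str, list[str], int]]) -> float:
    # dataset: (question, options, index_of_correct_option) triples.
    correct = sum(predict(q, opts) == gold for q, opts, gold in dataset)
    return correct / len(dataset)
```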

Measuring code generation abilities
When ChatGPT came out, asking it to write some code was probably the first thing we all tried. The ability to code is one of the most useful and time-saving skills LLMs can offer us.
In the open source landscape, there are many models specialized in code generation, like WizardCoder or the more recent Code Llama.
To show the impressive coding abilities of their new Code Llama model, Meta chose two code-specific benchmarks: HumanEval and Mostly Basic Python Programming (MBPP), complemented by a human evaluation. In the first, models must generate code from a docstring, while in the second they start from a text prompt. Every prompt comes with one or more unit tests to evaluate the correctness of the output.

After collecting a sample of k solutions generated by the model, the pass@k metric is computed: a problem counts as solved if at least one of the k samples passes its unit tests. For example, a pass@1 score of 67.0 means the model solves 67% of the problems on the first try.
When computing this metric, you can use any value of k, but in practice we care most about pass@1 – if you have to keep retrying to get a correct solution, how much can you trust that model?
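For completeness, here is a sketch of the unbiased pass@k estimator popularized by the HumanEval paper: generate n samples per problem, count how many of them (c) pass the unit tests, and estimate the probability that at least one of k randomly drawn samples would pass. The numbers in the example are purely illustrative.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimator of pass@k: 1 - C(n - c, k) / C(n, k),
    # computed in a numerically stable way.
    # n: samples generated per problem, c: samples that pass the unit tests.
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Illustrative numbers: 200 samples for one problem, 60 pass the tests.
print(pass_at_k(n=200, c=60, k=1))   # 0.30 -> expected success on a single try
print(pass_at_k(n=200, c=60, k=10))  # much higher when 10 tries are allowed
```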
The evaluation results for Code Llama are the following.

Their results show that GPT-4 is the best model, able to solve 67% of the tasks in HumanEval at the first try. However, Code Llama is the best open source code-specific model, with just 34B parameters.
Measuring general intelligence
Evaluation systems must cover numerous scenarios, especially for larger language models designed to be general purpose, given their impressive ability to generalize to diverse tasks.
While for classic machine learning models you are used to evaluating on a held-out test set, LLMs enable zero-shot and few-shot learning – an LLM can perform a task it hasn’t been explicitly trained for. Under these circumstances, a single test set or a single metric is not enough to benchmark an LLM’s capabilities.
General benchmarks are extensive collections of tests in diverse scenarios and tasks. They’re like the ultimate test for your model, aiming to gauge every aspect of intelligence.
One example is the Holistic Evaluation of Language Models (HELM), built to evaluate models on seven key metrics – accuracy, calibration and uncertainty, robustness, fairness, bias and stereotypes, toxicity, and efficiency – computed across 16 core scenarios.

SuperGLUE, introduced in 2019, is an advanced version of the General Language Understanding Evaluation (GLUE) test. The GLUE benchmark comprises nine tasks related to sentence or sentence-pair language understanding, all built on pre-existing datasets. SuperGLUE offers a more challenging set of tasks and a public leaderboard.
BIG-bench, from Google, expands GLUE and SuperGLUE with a more extensive collection of natural language understanding tasks. It is a massive collaborative project with contributions from 444 authors from 132 institutions worldwide. It assesses LLMs based on their accuracy, fluency, creativity, and generalization abilities on over 200 tasks! Since running BIG-bench can be very time-consuming, the authors also provide a lite version with a subset of 24 tasks called BIG-bench lite. Their GitHub repo is open for contributions and new ideas.

Another way of evaluating language models is manual human evaluation. As the name suggests, it measures the quality and performance of large language models by asking human judges to rate or compare their outputs, as in Chatbot Arena. This platform benchmarks LLMs using the Elo rating system – the same one used in chess – where users chat with two anonymized LLMs side by side and vote for the one they think is better. The votes are then used to compute Elo ratings and rank the LLMs on a leaderboard. You can visit the website and chat with different LLMs yourself.
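To give a feel for how those ratings move, here is the textbook Elo update for a single head-to-head vote. It’s a simplified sketch – Chatbot Arena’s actual pipeline aggregates many votes and handles ties and uncertainty more carefully – and the K-factor here is an arbitrary choice.

```python
def elo_update(rating_a: float, rating_b: float, winner: str, k: float = 32.0):
    # Expected score of A against B under the Elo model.
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    # Actual score: 1 if A wins the vote, 0 if B wins, 0.5 for a tie.
    score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# An underdog model (1000) wins a vote against a higher-rated one (1200):
print(elo_update(1000, 1200, winner="a"))  # the underdog gains roughly 24 points
```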

A case from research: Llama 2
Llama 2 is the successor to Llama. It was released in July 2023 in 7B, 13B, 34B and 70B sizes, including fine-tuned versions called Llama 2 Chat. In the paper, we can find two main evaluation procedures: a general evaluation and a safety evaluation.

The evaluation criteria in the authors’ work suggest that they prioritized two main objectives.
First, to compare Llama 2 with its predecessor and with open source competitors. To achieve that, they used a comprehensive general evaluation, where the models are evaluated on five dimensions: Code, Commonsense Reasoning, World Knowledge, Reading Comprehension and Math. Each dimension is an average of multiple benchmarks.
The results are complemented by the MMLU, BBH (BigBench Hard), and AGI Eval benchmarks, shown in separate columns.
The second objective evident in the authors’ work was to show that their fine-tuning method led to a more truthful and less toxic model.


The safety evaluation aims to assess truthfulness and toxicity using the TruthfulQA and ToxiGen benchmarks.
It shows that thanks to the fine-tuning process, Llama 2 is less toxic than other models but less truthful than ChatGPT.
Conclusion
Language models have a multifaceted and flexible nature. Open-source models offer tailored solutions, and specialization might be the way forward.
When comparing models, look for benchmarks relevant to your needs. The best one isn’t necessarily the one with the lowest perplexity or highest BLEU score, but the one that truly adds value to your life.
If you enjoyed this article, join Text Generation – our newsletter has two weekly posts with the latest insights on Generative AI and Large Language Models.
Also, you can find me on LinkedIn.
References
¹ Shoeybi, M. and Caruana, R., Language Model Evaluation Beyond Perplexity (2023), arXiv.org
² Lex Fridman Podcast, Mark Zuckerberg: The Future of AI (2023), YouTube
³ Xu, Y., Li, W., Vaezipoor, P., Sanner, S. and Khalil, E. B., LLMs and the Abstraction and Reasoning Corpus: Successes, Failures, and the Importance of Object-based Representations (2023), arXiv.org
⁴ Zellers, R. et al., HellaSwag: Can a Machine Really Finish Your Sentence? (2022), arXiv.org
⁵ Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J., Measuring Massive Multitask Language Understanding (2021), arXiv.org
⁶ Lin, S., Hilton, J. and Evans, O., TruthfulQA: Measuring How Models Mimic Human Falsehoods (2021), arXiv.org
⁷ Meta AI, Introducing Code Llama, a state-of-the-art large language model for coding (2023), meta.com
⁸ Liang, P. et al., Holistic Evaluation of Language Models (2022), arXiv.org
⁹ Srivastava, A. et al., Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models (2022), arXiv.org
¹⁰ Touvron, H. et al., Llama 2: Open Foundation and Fine-Tuned Chat Models (2023), arXiv.org