
A Gentle Introduction to Open Source Large Language Models

Why everyone is talking about Llamas, Alpacas, Falcons and other animals

Open Language Models

Image by the author (generated with Midjourney)

Unless you’ve been living under a rock for the last year, you’ve witnessed the ChatGPT revolution and how everyone seems unable to stop using it. In this article, we’ll explore its alternatives, jumping into the world of open source models. This first article of the series Open Language Models is for people looking to get started with Open Source Large Language Models and to understand how and why to use them.

Table of contents

What is a Large Language Model?
Why do we need Open Source Models?
The bigger the better? Training Large Language Models
Fine-tuning Large Language Models
The Best Open Source Large Language Models
Hardware requirements and optimizations
Running a Large Language Model on your computer
Limitations
Conclusion

What is a Large Language Model?

A Large Language Model (LLM) is an AI capable of understanding and generating human language. At its heart is a type of neural network called a transformer, which works by predicting what word comes next in a sentence. The word large describes these models’ extensive nature, since they can have billions or even trillions of parameters. What differentiates them is their ability to specialize in particular tasks, such as code generation or translation, or to be applied to general instruction-following chatbots. One of the groundbreaking aspects of these models is that they enable zero-shot and few-shot learning, as they exhibit an unprecedented ability to learn tasks they have not been explicitly trained for. [1]
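To make next-word prediction concrete, here is a minimal sketch using the HuggingFace transformers library. It uses GPT-2 only because it is small enough to run anywhere, and the prompt is purely illustrative:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a small causal language model and its tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# The model assigns a score (logit) to every token in its vocabulary
inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Turn the scores for the last position into next-token probabilities
probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(idx)!r}: {p:.3f}")

Sampling from this distribution, appending the chosen token, and repeating is all it takes to generate text. LLMs do exactly this, just at a much larger scale.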

Why do we need Open Source Models?

Suppose you use the GPT API to create an innovative app that quickly gains traction. Everything goes smoothly until OpenAI changes their course of action. They might halt the service, raise the cost, or even decrease the capability of their models – which is already happening. [2] Currently, your only solution would be to adjust to their new policies. Moreover, relying on a third-party API means your data is transmitted to their servers. While OpenAI may not use data from GPT APIs for model training, [3] deploying your own language model guarantees you total control over these operations. Even if this may seem like an ideal plan, deploying your private LLM also comes with its own set of limitations and challenges, which will be addressed in this article.

(Left) A Llama. Photo by Sébastien Goldberg | (Middle) A Vicuña. Photo by Dušan veverkolog | (Right) An Alpaca. Photo by Adrian Dascal. All from Unsplash.

The bigger the better? Training Large Language Models

If you’ve stumbled upon something like LLaMA 65B, you’re likely wondering what 65B means. Well, it simply refers to the number of parameters in the model. As the size of the model increases, it requires more time to train and uses more memory for inference. Unlike what you’re used to hearing in machine learning, these very complex models can generalize more easily to different tasks. Some have an enormous number of parameters: GPT-3 has 175 billion, while GPT-4 is rumored to have more than 1 trillion. Training them from scratch has been estimated to cost millions of dollars. For example, Google’s PaLM 540B was trained on 6,144 TPU v4 chips [4]. For comparison, EfficientNet-B7, one of the most popular deep learning models for image classification, has only 66M parameters.

Clearly, not something you can train on your laptop. [5]

In 2022, Google claimed:

As the scale of the model increases, the performance improves across tasks while also unlocking new capabilities.

In the current state of LLMs, more parameters are usually better [4]. Companies focus on building bigger models, but the current trend in open source is to create smaller, more efficient models. While the most popular open source models usually have up to 70B parameters, it’s not uncommon to see smaller, fine-tuned models performing better than bigger models on a specific task. Also, larger models require more resources for training and inference, making them challenging to deploy.

In fact, within a year, even Google changed their perspective.

Open-source models are faster, more customizable, more private, and pound-for-pound more capable. They are doing things with $100 and 13B params that we struggle with at $10M and 540B. And they are doing so in weeks, not months.

This quote comes from the leaked document titled We Have No Moat, and Neither Does OpenAI [6], in which Google acknowledges the incredible evolution of open source models, which are quickly catching up using smaller and cheaper models.

ChatGPT took it well.

Thanks to the outstanding work of the open source community over the last year, there are now accessible and free alternatives. Less than a year after Google’s PaLM, LLaMA was released; in the paper, the authors claim that their biggest model, LLaMA 65B, surpasses GPT-3 (175B) and PaLM (540B) on many tasks [5].

Fine-tuning Large Language Models

LLMs have earned their reputation for their versatility in handling many language tasks within a single framework. Yet, your specific application might require the model to excel at a single task. To achieve this, you can fine-tune a pre-trained model using a dataset specific to your task – say, text summarization. The fascinating part is that good results are attainable even with a small dataset: you might need just 500 to 1,000 examples to improve performance significantly, despite the model initially being trained on billions of pieces of text.

A sample from the Alpaca Dataset. The model is fine-tuned to follow an instruction given by the user.

A popular technique is instruction fine-tuning. This method involves training the model using examples that illustrate how it should respond to a particular instruction. The outcome of this process is an instruct model – an enhanced version of the base model that excels at following instructions, rather than just completing text. Examples of instruct models are Alpaca and Vicuna.
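As a concrete example, here is a slightly simplified version of the prompt template used by Alpaca (the variant without an input field). Each training example wraps an instruction and the desired response in a fixed structure, and the model learns to produce the text that follows "### Response:":

def format_example(instruction: str, response: str) -> str:
    # Simplified Alpaca-style instruction template
    return (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        f"### Response:\n{response}"
    )

print(format_example(
    "Give three tips for staying healthy.",
    "1. Eat a balanced diet. 2. Exercise regularly. 3. Sleep well.",
))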

The Best Open Source Large Language Models

In February 2023, Meta’s LLaMA model hit the open-source market in various sizes, including 7B, 13B, 33B, and 65B. Initially, the model was only available to researchers under a non-commercial license, but in less than a week its weights were leaked. This event sparked a revolution in the open-source LLM world, as the training code was freely accessible under the GPLv3 license. Consequently, several powerful fine-tuned variations have been released.

The first was Alpaca, released by Stanford University. This model was fine-tuned on 52K instruction-following examples generated with GPT. It was soon followed by Vicuna, which surprisingly outperformed Alpaca, reportedly reaching 90% of ChatGPT’s quality on many tasks. Its distinguishing feature is that it was fine-tuned on ShareGPT data.

Adding to these powerful models is GPT4All. Inspired by the vision of making LLMs easily accessible, it features a range of consumer CPU-friendly models along with an interactive GUI application.

WizardLM also joined these remarkable LLaMA-based models. Through a new method named Evol-Instruct, it was fine-tuned on complex instruction data and has shown performance similar to ChatGPT, at an average of 97.8%.

Not all recent models are based on LLaMA, though. Models like MPT are known for their impressive context lengths, with variants capable of handling a staggering 65k tokens of context – a whole book at once!

Falcon also joined the bandwagon, in both 7B and 40B variants. Surprisingly, it outperformed LLaMA on the OpenLLM leaderboard, thanks to its high-quality training dataset called RefinedWeb.

However, Falcon’s reign at the top of HuggingFace’s OpenLLM leaderboard was short-lived. In July 2023, Meta unveiled the successor to their famous model: LLaMA 2. This next-generation model was trained on 40% more tokens than its predecessor and doubled the context length to 4K tokens. It also comes with llama-2-chat, a version optimized for dialogue applications, which made a significant impact. At the time of writing, fine-tuned versions of LLaMA 2 such as StableBeluga2, Airoboros, and Guanaco are the most powerful open source large language models, dominating the OpenLLM Leaderboard.

The top 10 models on the OpenLLM Leaderboard are mostly LLaMA 2 based.

If you’re curious about a model and want to try it, the easiest way is probably to visit its HuggingFace model page and open a Space that uses it. Spaces are simple Gradio interfaces that allow you to send an input to a model and receive back an output. Since the models run on shared servers that handle many requests, you’ll often have to wait in a queue for a while. Nonetheless, it’s not a big deal to wait a couple of minutes sometimes, given how great the service is.

HuggingFace Spaces makes it very easy to try language models. Image by author.

Hardware requirements and optimizations

Last year, you needed a high-end GPU to run an LLM locally: even small models required many gigabytes of memory, and you had to load them into your graphics card’s memory. The situation changed with the release of llama.cpp. Initially a port of Facebook’s LLaMA model in C/C++, it now supports many other models. It allows you to convert an LLM from PyTorch to GGML, a new format that enables fast inference on CPU. And since it’s compiled C++, inference is multithreaded.

Thanks to this amazing work, it’s now possible to run many LLMs on your desktop computer, or even on a MacBook.

The main limitation is the memory they consume. For example, a 7B model weighs around 14GB. A 65B model would require around 130GB of memory (RAM and disk space), which is more than most of us have on our computers. Fortunately, there is a way to compress them: enter quantization.

In every machine learning model, a parameter is a number, usually represented as a float32, a 32-bit (4-byte) representation. Since every parameter takes four bytes of space, a model with 65B parameters would take approximately 4 × 65 × 10⁹ bytes = 260GB of memory. Model quantization refers to the idea of reducing the model’s size by representing its parameters with lower-precision numbers, using rounding. For the mathematical details of quantization, there is a great article on HuggingFace’s blog. [7]
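To build some intuition, here is a minimal sketch of absmax quantization to 8-bit integers, the scheme illustrated in [7]. Real implementations work block-by-block and handle outliers carefully, but the core idea fits in a few lines:

import numpy as np

def quantize_int8(weights: np.ndarray):
    # Absmax quantization: map the largest absolute value to 127
    scale = np.abs(weights).max() / 127
    q = np.round(weights / scale).astype(np.int8)  # 1 byte per parameter
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover an approximation of the original float32 weights
    return q.astype(np.float32) * scale

w = np.float32([0.12, -1.31, 0.87, -0.04, 2.05])
q, scale = quantize_int8(w)
print(q)                     # int8 values, e.g. [  7 -81  54  -2 127]
print(dequantize(q, scale))  # close to w, at a quarter of the memory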

8-bit integer quantization. [7]

Being a compression process, quantization involves some loss in performance. With the recent introduction of LLM.int8() and newer techniques, this loss has been greatly reduced, making quantization a must for LLMs.

Llama.cpp supports up to 4-bit integer quantization. Using Q4_0, one of the available options, it’s possible to reduce the model size by up to a factor of four.

Quantization effect on model perplexity and file size. Image by the author. Data from llama.cpp GitHub repo.

Q4_0 reduced the file size from 25.8 to 6.8 GB, while increasing perplexity by only about 2%. With quantization, it’s possible to load a 13B model on a computer with 16GB of RAM, and if you have 64GB, you can run even the biggest 70B models.

LLaMA 2 models with different quantization techniques. TheBloke's HuggingFace profile has many pre-trained models.

Another quantization technique worth mentioning is GPTQ, which can reduce VRAM usage by up to 75% while preserving accuracy. [8] Released in March 2023, it made it possible for the first time to run a 175B model on a single GPU. As for consumer hardware, you can run up to a 30B model on a single RTX 3090 card.

Running a Large Language Model on your computer

The most similar experience to ChatGPT is the GPT4All app, a chat interface that allows you to converse with your favorite model. It’s not limited to GPT4All models but supports many of the most popular ones. The app runs on Windows, macOS, and Linux, and it’s completely open source.

LLaMA-2 7B inference on a Ryzen 5600 CPU. Image by the author.

Alternatively, you can use a GUI tool like text-generation-webui or openplayground. While both offer a graphical interface for generating text, the first is probably the most complete tool available: it offers many features, such as chat, training, GPU support, HTML/Markdown output, and more. The second is very similar to OpenAI’s Playground, and it’s a great tool for quickly testing and comparing LLMs with different parameters.

Text generation web UI is an advanced web interface built with Gradio.

If you want to use LLMs in Python, there are several options, such as llama-cpp-python or the HuggingFace Transformers library, which offers a high-level syntax to interact with any HF model. It supports the PyTorch, TensorFlow, and JAX backends, and it can be used with any pre-trained model you find on the HF Hub, for text, images, or audio. Using transformers, generating text with your favorite LLM is as simple as writing two lines of code:

from transformers import pipeline

# Download the weights (if needed) and build a text-generation pipeline
pipe = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")
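From there, generating a completion is a single call (the prompt and parameters here are just an example; note that the official LLaMA 2 weights require requesting access on HuggingFace):

output = pipe("Explain what an open source LLM is.", max_new_tokens=64)
print(output[0]["generated_text"])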

Among the thousands of models freely available on the HF Hub, there are different formats: the Transformers library requires a model in HF format, while llama.cpp requires a GGML model.
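For GGML models, llama-cpp-python is just as compact. Here is a minimal sketch, assuming you have already downloaded a quantized model file (the path below is illustrative):

from llama_cpp import Llama

# Load a quantized GGML model from disk
llm = Llama(model_path="./models/llama-2-7b-chat.ggmlv3.q4_0.bin")

output = llm("Q: Why are open source LLMs important? A:", max_tokens=64)
print(output["choices"][0]["text"])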

Limitations

While we have shown that you don’t need powerful computers for most models, scalability is the first concern if you are planning to build a system that allows hundreds or thousands of users to interact with your model. An advantage of using GPT is that OpenAI offers cheap, high rate-limit APIs that can scale your app easily.

Another limitation is safety and moderation. Since you can be held responsible for the model’s output, you must be extra careful about what the model generates. Commercial LLMs have advanced moderation filters, built using Reinforcement Learning from Human Feedback (RLHF), to limit harmful content.

Remember to check the model’s license before using it. Open Source doesn’t always mean that the model can be used commercially.

Conclusion

The article has shown that open source models are viable and free alternatives that are getting better quickly. This year has witnessed incredible advancements in the development of these models, and it is now even possible to run them on your laptop.

I hope you found this article useful if you are considering an open source LLM for your next project. The following articles in the series will go more in-depth on the different aspects and challenges of Open Language Models.

See you in the next one!


If you enjoyed this article, join Text Generation – our newsletter has two weekly posts with the latest insights on Generative AI and Large Language Models.

Also, you can find me on LinkedIn.


References

[1] T. Brown et al., Language Models are Few-Shot Learners (2020), arXiv.org

[2] Stanford Scientists Find That Yes, ChatGPT Is Getting Stupider (2023), futurism.com

[3] M. Schade, How your data is used to improve model performance (2023), OpenAI Help Center

[4] Google AI Blog, Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance (2022), ai.googleblog.com

[5] H. Touvron et al., LLaMA: Open and Efficient Foundation Language Models (2023), arXiv.org

[6] D. Patel and A. Ahmad, Google: We Have No Moat (And Neither Does OpenAI) (2023), semianalysis.com

[7] Y. Belkada and T. Dettmers, A Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using Hugging Face Transformers, Accelerate and bitsandbytes (2022), Hugging Face Blog

[8] E. Frantar et al., GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (2023), arXiv:2210.17323v2 [cs.LG]

