
Fine-Tuning LLMs on a Single Consumer Graphics Card

Learnings from fine-tuning a large language model on a single consumer GPU


Image by Author (Midjourney).

Background

When we think about large language models or any other generative models, the first piece of hardware that comes to mind is the GPU. Without GPUs, many advancements in generative AI, machine learning, deep learning, and data science would have been impossible. Fifteen years ago, gamers were the ones most enthusiastic about the latest GPU technology; today, data scientists and machine learning engineers follow the news in this field just as closely. That said, gamers and ML practitioners are usually looking at two different kinds of graphics cards.

Gamers typically use consumer graphics cards (such as the NVIDIA GeForce RTX series), while ML and AI developers follow news about data center and cloud computing GPUs (such as the V100, A100, or H100). Gaming graphics cards generally have much less GPU memory (at most 24GB as of January 2024) than data center GPUs (usually in the range of 40GB to 80GB). Price is another significant difference: while most consumer graphics cards cost up to $3,000, most data center cards start around that price and can easily reach tens of thousands of dollars.

Many people, including myself, already own a consumer graphics card for gaming or daily use, and are naturally curious whether the same card can be used for training, fine-tuning, or running inference with LLMs. In 2020, I wrote a comprehensive article about whether consumer graphics cards are suitable for data science projects (link to the article). At that time, models were mostly small ML or deep learning models, and even a card with 6GB of memory could handle many training projects. In this article, however, I am going to use such a card with large language models that have billions of parameters.

For this article, I used my GeForce RTX 3090, which has 24GB of GPU memory. For reference, data center graphics cards such as the A100 and H100 have 40GB and 80GB of memory respectively, and a typical AWS EC2 p4d.24xlarge instance has 8 A100 GPUs with a total of 320GB of GPU memory. As you can see, the gap between a single consumer GPU and a typical cloud ML instance is significant. So the question is: can we train large models on a single consumer graphics card, and if so, what are the tips and lessons learned? Read the rest of this article to find out.

Hardware and Software Setup

Before loading any LLM or training dataset, we need to figure out what hardware and software such a process requires.

As mentioned, I used the NVIDIA GeForce RTX 3090 because it has one of the largest memory capacities (24GB) among consumer GPUs (the RTX 4090 has the same amount). It is based on the Ampere architecture, the same architecture behind the well-known A100 GPUs. You can see more about the GeForce RTX 3090 specifications here.

After all my tests, I believe 24GB is the minimum amount of GPU memory needed to work with LLMs that have billions of parameters.

Image by Author

In addition to the graphics card, we need to make sure our PC has good ventilation. During fine-tuning, the GPU temperature rises quickly, and its own fans may not be enough to keep it cool. A higher GPU temperature can throttle performance, and the process will take much longer.

In addition to the hardware, there are some software considerations worth mentioning. First of all, if you are a Windows user, I have bad news: some libraries and tools only work on Linux. In particular, bitsandbytes, which is used frequently for model quantization, is not Windows-friendly. Some people have built Windows wrappers (for example, here), but they have their pros and cons. So my advice is to either use Linux through WSL or, like me, set up a dual-boot system and switch fully to Linux while working on LLMs.

You also need to install PyTorch and a compatible CUDA version. My recommendation is CUDA 12.3 (link). Then go to the PyTorch installation page (https://pytorch.org/) and, based on your system, CUDA version, and package manager, download and install the correct PyTorch build.

Note: If you are using CUDA 12.3, you might need to add or configure BNB_CUDA_VERSION and LD_LIBRARY_PATH env variables in your .bashrc file. Here is an example for your reference.

export BNB_CUDA_VERSION=123
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/<YOUR-USER-DIR>/local/cuda-12.3

Finally, you need to install the following packages. I recommend creating a new virtual environment (venv) to avoid conflicts with other packages installed on your system. For reference, here are the package versions I used successfully:

torch==2.1.2
transformers==4.36.2
datasets==2.16.1
bitsandbytes==0.42.0
peft==0.7.1
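
Once everything is installed, it is worth running a quick sanity check to confirm that PyTorch can actually see the GPU before loading any model. A minimal sketch:

import torch

# Confirm that PyTorch was built with CUDA support and can see the GPU
print(torch.__version__)              # e.g. 2.1.2
print(torch.cuda.is_available())      # should print True
print(torch.cuda.get_device_name(0))  # e.g. NVIDIA GeForce RTX 3090

# Total memory of the first GPU, in GB
total_memory = torch.cuda.get_device_properties(0).total_memory
print(f"{total_memory / 1024**3:.1f} GB")  # ~24 GB on an RTX 3090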

Technical Background

Now that you have the hardware and software ready for working with LLMs, let's briefly review the technical concepts you will encounter in the next sections.

Large language models comprise millions or billions of parameters. We usually use pre-trained models, which have been trained on billions, sometimes trillions, of tokens through a long training process that typically costs millions of dollars. Each model parameter takes 32 bits (4 bytes) of memory to load, so as a rule of thumb, every 1 billion parameters requires about 4GB of memory just to load the model. One technique for using less memory to load (and later run inference with or train) a model is "quantization". With this technique, we reduce the precision of the model weights from 32-bit full precision (fp32) to 16-bit (fp16 or bfloat16), 8-bit (int8), or even less (read more). As you can imagine, by reducing the precision of the model weights we can load larger models into a limited amount of memory, but this comes at the cost of some model performance. However, some studies suggest that the performance difference between fp32 and bfloat16 is insignificant, and many famous models (including Llama 2) were pre-trained in bfloat16.

Quantization is practically a must when fine-tuning or running inference with a large language model on a single GPU with 24GB of memory. Later, you will see how the bitsandbytes library helps us quantize the model.

Even with the most aggressive quantization techniques, pre-training even a small LLM from scratch remains out of reach on a single consumer GPU. Chris Fregly et al. give a good rule of thumb in their newly published book, Generative AI on AWS, for the memory required to train a model: for each 1 billion parameters, we need about 6GB of memory (in 16-bit half precision) to load and train the model. And memory size is only part of the training story; the time needed to complete pre-training matters just as much. As an example, the smallest Llama 2 model (Llama 2 7B) has 7 billion parameters, and it took 184,320 GPU hours to complete its training (read more).
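
To make these rules of thumb concrete, here is a small back-of-the-envelope calculation (a rough sketch that uses the approximate figures quoted above and ignores activation memory and other overhead):

# Approximate memory (in GB) needed just to load a model's weights:
# parameters (in billions) x bytes per parameter
def load_memory_gb(params_in_billions, bits_per_param):
    return params_in_billions * bits_per_param / 8

print(load_memory_gb(7, 32))  # ~28 GB in full precision (fp32)
print(load_memory_gb(7, 16))  # ~14 GB in half precision (fp16/bfloat16)
print(load_memory_gb(7, 4))   # ~3.5 GB with 4-bit quantization

# Rule of thumb for *training* in 16-bit: ~6 GB per billion parameters
# (weights plus gradients and optimizer states), so for a 7B model:
print(7 * 6)                  # ~42 GB -- far beyond a single 24 GB consumer GPU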

Therefore, most people (even those with significant hardware resources and budget) prefer to start from a pre-trained model and only fine-tune it for their specific use case. Still, full fine-tuning can be overwhelming with limited resources (such as a single GPU). This is why Parameter-Efficient Fine-Tuning (PEFT), which updates only a limited subset of model parameters, is a more realistic option when compute is limited.

Among the different PEFT techniques, LoRA (Low-Rank Adaptation) is very popular because of its computational efficiency. In this technique, we freeze all the original model weights and instead train small low-rank matrices that are added to specific layers of the Transformer architecture (read more). In many cases, fine-tuning an LLM with LoRA updates only around 0.5% of the model's weights.
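
As a toy illustration of why this is so parameter-efficient (this is just the underlying idea, not the actual peft implementation): instead of updating a full weight matrix W, LoRA trains two small matrices whose product has rank r.

import torch

d = 4096  # hidden size of a typical 7B-scale transformer layer
r = 16    # LoRA rank

W = torch.randn(d, d)  # frozen pre-trained weight (~16.8M parameters)
A = torch.randn(r, d)  # trainable low-rank factor
B = torch.zeros(d, r)  # trainable low-rank factor, initialized to zero

# During fine-tuning the effective weight is W + B @ A; only A and B are trained
delta_W = B @ A

trainable = A.numel() + B.numel()
print(trainable / W.numel())  # ~0.008, i.e. under 1% of the original matrix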

QLoRA is a variation of LoRA that combines LoRA with the quantization concept explained above. Specifically, in our QLoRA setup we will use nf4 (4-bit NormalFloat) as the storage data type for the quantized weights during fine-tuning. QLoRA is what makes our case study, fine-tuning a large model on a single consumer GPU, feasible.

Coding Time

Finally, it’s coding time!

You can find the working Jupyter notebook here, on my GitHub repo. For many parts of this code, I was inspired by, and followed the instructions in, this neat article by Mathieu Busquet.

I will not discuss the code line by line, but I will emphasize parts of the code that are important for fine-tuning a large model on a single GPU.

Transformer Model

First of all, I chose the Mistral 7B model (mistralai/Mistral-7B-v0.1) for this test. Mistral 7B, developed by Mistral AI, is an open-source LLM released in September 2023 (link to the paper). In many respects, this model outperforms well-known models such as Llama 2 (see the following charts).

Image from Mistral Release Blog Post (https://mistral.ai/news/announcing-mistral-7b/)

Dataset

For fine-tuning, I used the Databricks databricks-dolly-15k dataset (under the CC BY-SA 3.0 license) (read more). I used only a small subset (1,000 rows) of this data to keep the fine-tuning time short, since this is a proof of concept.
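
For reference, a subset like this can be pulled directly from the Hugging Face Hub with the datasets library. A minimal sketch (the exact preprocessing and prompt formatting live in the notebook):

from datasets import load_dataset

# Load the instruction-following dataset released by Databricks
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

# Keep only a small subset (1,000 rows) to keep the fine-tuning time reasonable
small_dataset = dataset.shuffle(seed=42).select(range(1000))
print(small_dataset)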

Configurations

At model loading time, I used the following quantization config to work around the GPU memory limitations I was facing.

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

This quantization config is vital for fine-tuning a model on a single GPU: it specifies a low-precision storage data type, nf4 (4-bit NormalFloat), and a computation data type, bfloat16. In practice, this means that whenever a QLoRA weight tensor is used, we dequantize the tensor to bfloat16 and then perform the matrix multiplication in 16-bit (read more in the original paper).
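
For reference, this is roughly how the config above is passed when loading the model with the standard transformers API (a sketch; the complete version is in the notebook):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,  # the BitsAndBytesConfig defined above
    device_map="auto",                        # place the layers on the available GPU
)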

Also, as mentioned before, we are using LoRA in conjunction with quantization, i.e., QLoRA, to overcome the memory limitations. Here is my LoRA configuration:

lora_config = LoraConfig(
    r=16,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj"], 
    bias="none",
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

For my LoRA configuration, I used a rank of 16. It is advised to set the rank between 4 and 16 to get a good trade-off between reducing the number of trainable parameters and preserving model performance. Finally, LoRA is applied to a subset of linear layers of the Mistral 7B model (the attention projections and the gate projection listed in target_modules).
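
Applying this configuration to the quantized model then takes only a few lines with the peft library. A sketch of how the pieces fit together (the exact trainable-parameter count depends on the target modules you pick):

from peft import get_peft_model, prepare_model_for_kbit_training

# Prepare the 4-bit quantized model for training
# (casts some layers to higher precision and enables gradient checkpointing)
model = prepare_model_for_kbit_training(model)

# Inject the trainable low-rank adapters described by lora_config
model = get_peft_model(model, lora_config)

# Prints something like: trainable params: ... || all params: ... || trainable%: ...
model.print_trainable_parameters()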

Training and Monitoring

Using my single graphics card, I was able to complete 4 epochs of training (1,000 steps). For me, one of the purposes of such a test, training an LLM on a single local GPU, is to be able to monitor the hardware resources without any restrictions. One of the simplest tools for monitoring the GPU during training is the NVIDIA System Management Interface (nvidia-smi). Simply open a terminal and type:

nvidia-smi

or, for continuous monitoring that refreshes every second, use:

nvidia-smi -l 1

As a result, you will see the memory usage of each process on your GPU. In the following nvidia-smi view, I have just loaded the model, and it takes only about 5GB of memory (thanks to quantization). As you can see, the model is loaded by the Anaconda Python process (a Jupyter notebook).

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        On  | 00000000:29:00.0  On |                  N/A |
| 30%   37C    P8              33W / 350W |   5346MiB / 24576MiB |      5%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1610      G   /usr/lib/xorg/Xorg                          179MiB |
|    0   N/A  N/A      1820      G   /usr/bin/gnome-shell                         41MiB |
|    0   N/A  N/A    108004      G   ...2023.3.3/host-linux-x64/nsys-ui.bin        8MiB |
|    0   N/A  N/A    168032      G   ...seed-version=20240110-180219.406000      117MiB |
|    0   N/A  N/A    327503      C   /home/***/anaconda3/bin/python             4880MiB |
+---------------------------------------------------------------------------------------+

And here is the memory state about 30 steps into the training process. As you can see, GPU memory usage is now about 15GB.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        On  | 00000000:29:00.0  On |                  N/A |
| 30%   57C    P2             341W / 350W |  15054MiB / 24576MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1610      G   /usr/lib/xorg/Xorg                          179MiB |
|    0   N/A  N/A      1820      G   /usr/bin/gnome-shell                         40MiB |
|    0   N/A  N/A    108004      G   ...2023.3.3/host-linux-x64/nsys-ui.bin        8MiB |
|    0   N/A  N/A    168032      G   ...seed-version=20240110-180219.406000      182MiB |
|    0   N/A  N/A    327503      C   /home/***/anaconda3/bin/python            14524MiB |
+---------------------------------------------------------------------------------------+

Although nvidia-smi is a simple tool for monitoring GPU memory usage, there are more advanced monitoring tools that provide more detailed information. One of them is the PyTorch memory snapshot, which you can read more about in this interesting article.
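
If you prefer to track memory from inside the training script rather than from a separate terminal, PyTorch also exposes simple counters. A minimal sketch:

import torch

# Memory currently allocated by tensors on GPU 0, and the peak since the process started
print(f"allocated: {torch.cuda.memory_allocated(0) / 1024**3:.2f} GB")
print(f"peak:      {torch.cuda.max_memory_allocated(0) / 1024**3:.2f} GB")

# A more detailed breakdown of allocated and reserved memory
print(torch.cuda.memory_summary(device=0, abbreviated=True))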

Summary

In this article, I showed that it is possible to fine-tune a large language model such as Mistral 7B on a single 24GB GPU (such as the NVIDIA GeForce RTX 3090). However, as discussed in detail, PEFT techniques like QLoRA are necessary, the batch size matters, and we may need much longer training times simply because of our limited resources.

