
State-of-the-art large language models (LLMs) are pre-trained with billions of parameters. While pre-trained LLMs can perform many tasks, they can become much better once fine-tuned.
Thanks to LoRA, fine-tuning costs can be dramatically reduced. LoRA adds low-rank tensors, i.e., a small number of parameters (millions), on top of the frozen original parameters. Only the parameters in the added tensors are trained during fine-tuning.
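To make this concrete, here is a minimal, illustrative sketch of a LoRA layer in PyTorch (the class name and hyperparameters are my own, not taken from any particular library):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Illustrative sketch: a frozen linear layer plus a trainable low-rank update
    def __init__(self, base_linear: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad_(False)  # the original parameters stay frozen
        # A and B are the only trainable parameters: a few million instead of billions
        self.lora_A = nn.Linear(base_linear.in_features, rank, bias=False)
        self.lora_B = nn.Linear(rank, base_linear.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)  # the adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_B(self.lora_A(x))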
LoRA still requires the model to be loaded in memory. To reduce the memory cost and speed up fine-tuning, a new approach proposes quantization-aware LoRA (QA-LoRA) fine-tuning.
In this article, I explain QA-LoRA and review its performance compared with previous work (especially QLoRA). I also show how to use QA-LoRA to fine-tune your own quantization-aware LoRA for Llama 2.
What’s Wrong with QLoRA?
Fine-tuning LoRA on top of a quantized LLM is something that can already be done with QLoRA. In my previous articles, I used it many times to fine-tune LLMs, for instance, Llama 2 and GPT-NeoX, on my desktop computer or using the free instance of Google Colab.
Before delving into QA-LoRA, it is worth understanding the current limits of QLoRA.
The NormalFloat4 (NF4) Quantization
LLM quantization algorithms usually quantize parameters to 4-bit precision using the INT4 data type. Computation with this data type is increasingly well optimized on recent GPUs.
QLoRA doesn’t use INT4 by default but another data type called NormalFloat4 (NF4). You can see it as a compressed float number. According to the authors of QLoRA, NF4 is superior to INT4. LLMs quantized with NF4 achieve a lower perplexity.
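For reference, this is roughly how a base model is loaded with NF4 quantization for QLoRA-style fine-tuning, using the bitsandbytes integration in Transformers (the model name is just an example):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 quantization configuration, as used by QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NormalFloat4 rather than plain INT4
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)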
However, NF4 computation is not optimal for fast inference. This is one of the reasons why models quantized with GPTQ are faster than models quantized with bitsandbytes NF4. In previous articles, I confirmed that GPTQ models are indeed faster.
GPTQ or bitsandbytes: Which Quantization Method to Use for LLMs – Examples with Llama 2
NF4 is also one of the weaknesses pointed out by the authors of QA-LoRA.
NF4 Base Model, But FP16 LoRA
While the base model is quantized with NF4, the trained LoRA parameters remain at a higher precision, usually FP16, as illustrated in the figure below.

This is key to QLoRA's performance: naively training quantized parameters would lead to poor results.
Consequently, for inference, we have two different ways to use the LoRA adapters trained with QLoRA:
- Loading them on top of the base LLM, as we do during QLoRA fine-tuning
- Merging them into the base LLM
Loading them is optimal for preserving performance. LoRA's parameters remain at 16-bit precision but, since there are only a few million of them, they don't consume much VRAM relative to the quantized base LLM.
The other alternative is to merge the LoRA’s parameters with the base model. I explored several merging recipes in a previous article.
Ideally, we have to dequantize the base model to the same precision used by LoRA’s parameters, and then merge LoRA’s parameters with the dequantized base model.
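As a rough sketch with the Hugging Face PEFT API (the adapter path and model name are placeholders), the two options look like this:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

adapter_path = "path/to/qlora-adapter"  # placeholder

# Option 1: keep the base model quantized (NF4) and load the FP16 adapter on top of it
quantized_base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4"),
    device_map="auto",
)
model = PeftModel.from_pretrained(quantized_base, adapter_path)

# Option 2: merge the adapter into a dequantized (FP16) copy of the base model
fp16_base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16)
merged = PeftModel.from_pretrained(fp16_base, adapter_path).merge_and_unload()
merged.save_pretrained("llama-2-7b-qlora-merged-fp16")  # no longer quantized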
As a result, the merged model is not quantized anymore: it is a big FP16 model. We could quantize the entire merged model, but quantization always loses information. We would obtain a model that performs below what we originally had at the end of the QLoRA fine-tuning.
Here are the results I obtained for different configurations:

Note: Lower perplexity is better.
We can see that quantizing the merged model leads to a significantly higher perplexity. We can’t merge the QLoRA adapters, while preserving the quantization, without a significant performance drop. QLoRA adapters are not "quantization-aware".
Quantization-Aware Fine-tuning with QA-LoRA
QA-LoRA is presented in this arXiv paper:
QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models (Xu et al., 2023)
This isn’t an easy paper to read. QA-LoRA is well-motivated and most of the results/experiments are convincing. However, understanding why it works requires being familiar with the mechanics behind quantization.
I won’t go deep into the mathematical theory and demonstrations here. I think the easiest way to understand QA-LoRA is to see it as the process of jointly quantizing and fine-tuning LoRA’s parameters. The adapter’s parameters and quantization parameters are both learned, and applied, during the fine-tuning process.
To highlight the difference with QLoRA, we can refer to this paragraph from the paper:
We introduce group-wise operations, increasing the number of parameters of quantization from D_out to L × D_out, meanwhile decreasing that of adaptation from D_in × D_int + D_int × D_out to L × D_int + D_int × D_out. As we shall see in experiments, a moderate L can achieve satisfying accuracy of language understanding meanwhile preserving computational efficiency.
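To illustrate what this group-wise operation means in practice, here is a minimal, unofficial sketch (not the authors' code): the LoRA branch sees the input averaged within each quantization group, so lora_A only needs L = D_in / group_size input features instead of D_in.

import torch
import torch.nn as nn

class GroupWiseLoRALinear(nn.Module):
    # Unofficial sketch of a QA-LoRA-style, group-wise LoRA layer
    def __init__(self, base_linear: nn.Linear, rank: int = 16, group_size: int = 32, alpha: float = 16.0):
        super().__init__()
        assert base_linear.in_features % group_size == 0
        self.base = base_linear                    # frozen; INT4-quantized in the real implementation
        self.group_size = group_size
        num_groups = base_linear.in_features // group_size   # L in the paper
        self.lora_A = nn.Linear(num_groups, rank, bias=False)                # L x D_int parameters
        self.lora_B = nn.Linear(rank, base_linear.out_features, bias=False)  # D_int x D_out parameters
        self.scale = alpha / rank

    def forward(self, x):
        # average-pool the input over each quantization group before the low-rank path
        pooled = x.view(*x.shape[:-1], -1, self.group_size).mean(dim=-1)
        return self.base(x) + self.scale * self.lora_B(self.lora_A(pooled))

Because the LoRA update now varies only per group along the input dimension, it can later be folded into the group-wise quantization parameters (the zero-points) of the base model, as we will see in the merging code below.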
In addition, QA-LoRA uses the standard INT4 data type while QLoRA uses NF4.
QA-LoRA Performance
Let’s have a look at the performance reported by the authors of QA-LoRA. They report on many experiments but I think the following table is the one that gives the best overview of QA-LoRA performance, compared to QLoRA, and for various quantization precisions:

In this table, we have the performance of the original LLaMA 7B (16-bit) compared to:
- Standard QLoRA with an NF4-quantized base LLM and FP16 LoRA (denoted "QLoRA")
- LLaMA 7B quantized with GPTQ to INT4 (denoted "LLaMA-7B w/ GPTQ")
- Merged QLoRA adapter quantized with GPTQ (denoted "QLoRA w/ GPTQ")
- QA-LoRA
The standard QLoRA performs the best. This is expected since it uses a very good data type for quantization (NF4) while LoRA’s parameters remain FP16.
We can see that when we want to merge QLoRA adapters and then quantize the merged models (QLoRA w/ GPTQ), the performance significantly drops. Again, as we discussed in the previous section of this article, this is expected.
QA-LoRA on the other hand performs almost as well as the standard QLoRA while the LLM is entirely quantized with INT4. In other words, QA-LoRA works.
QA-LoRA is also more flexible than QLoRA since it allows fine-tuning with LLMs quantized to lower precisions. QA-LoRA with 3-bit precision is superior to QLoRA merged and quantized to 4-bit (60.1% accuracy for QA-LoRA 3-bit against 59.8% for QLoRA w/ GPTQ 4-bit).
Overall, QA-LoRA results look very impressive.
Overview of the Implementation of QA-LoRA
The authors of QA-LoRA released their implementation on GitHub (MIT license). Note: The original implementation is no longer available. I made a fork that you can find here.
The QA-LoRA implementation heavily relies on AutoGPTQ (MIT license). It exploits a specific branch of AutoGPTQ and replaces several functions.
The LLM that you wish to fine-tune must be already quantized with this specific branch of AutoGPTQ. You may try with quantized LLMs from the Hugging Face Hub but since AutoGPTQ often changes, these quantized LLMs might not be all compatible with QA-LoRA. For instance, the Llama 2 7B that I quantized with AutoGPTQ for a previous article has a quantization configuration that is not supported by this branch of AutoGPTQ.
If you wish to see what QA-LoRA changes in AutoGPTQ, have a closer look at the file "peft_utils.py", where the class "GPTQLoraLinear" is defined. The main innovation behind QA-LoRA boils down to two lines of code:
# The frozen linear layer keeps its original input dimension
torch.nn.Linear.__init__(self, linear_module.in_features, linear_module.out_features)
# The LoRA layer's input dimension is divided by the quantization group size (group-wise adaptation)
LoraLayer.__init__(self, linear_module.in_features//group_size, linear_module.out_features)
Adaptation of the Code for Llama 2 Support
I had to patch the implementation of QA-LoRA to make it work for Llama 2.
If QA-LoRA still doesn’t run with Llama 2 when you read this article, replace the file "qalora.py" with this one:
https://about.benjaminmarie.com/data/py/qalora/qalora.py
I only made two modifications:
- Replace "model.config.torch_dtype=(torch.float32 if args.fp16 else (torch.bfloat16 if args.bf16 else torch.float32))" with "model.config.torch_dtype=torch.float16" (line 300 of the current version)
- Replace "module = module.to(torch.float32)" with "module = module.to(torch.float16)" (line 340 of the current version)
The current implementation only works for models using a pad token. Llama 2 doesn’t use one. I had to manually modify the config.json of the quantized Llama 2 to add this line:
"pad_token_id": 0,
It simply specifies the "unk_token", whose id is 0, for padding.
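If you need to do the same for your own model, a quick way is to patch config.json programmatically (the path is an assumption based on the model used in this tutorial):

import json

# Add a pad token id to the quantized model's config.json
config_path = "Llama-2-7b-4bit-32g-autogptq/config.json"
with open(config_path) as f:
    config = json.load(f)
config["pad_token_id"] = 0  # reuse the unk token (id 0) for padding
with open(config_path, "w") as f:
    json.dump(config, f, indent=2)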
Requirements for Fine-tuning Llama 2 with QA-LoRA
I implemented all the following sections in a notebook that you can find here:
QA-LoRA Dependencies
I recommend creating a sandbox, for instance using conda, before setting up QA-LoRA. The current implementation uses outdated versions of several packages.
AutoGPTQ must be compiled from source since we have to replace a source file in AutoGPTQ to add QA-LoRA support.
Because of this replacement, we must first clone both repositories, AutoGPTQ and QA-LoRA, and then replace the file in AutoGPTQ:
git clone -b v0.3.0 https://github.com/PanQiWei/AutoGPTQ.git
git clone https://github.com/yuhuixu1993/qa-lora.git
cp qa-lora/peft_utils.py ./AutoGPTQ/auto_gptq/utils/
Apply my patch (if necessary):
wget https://about.benjaminmarie.com/data/py/qalora/qalora.py
cp qalora.py qa-lora/
AutoGPTQ is now ready to be installed:
cd AutoGPTQ
pip install .[triton]
cd ..
This can take up to 10 minutes.
Install QA-LoRA dependencies:
cd qa-lora
pip install -r requirements.txt
cd ..
I also installed bitsandbytes from source, following the recommendations of QA-LoRA’s documentation, but I don’t think it’s necessary. You may try a "pip install bitsandbytes" instead (it’s much faster).
git clone https://github.com/timdettmers/bitsandbytes.git
cd bitsandbytes
# CUDA_VERSIONS in {110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121}
# make argument in {cuda110, cuda11x, cuda12x}
# if you do not know what CUDA you have, try looking at the output of: python -m bitsandbytes
CUDA_VERSION=118 make cuda11x
python setup.py install
pip install -r requirements.txt
pip install protobuf==3.20.*
cd ..
Pre-Quantized LLM
For now, QA-LoRA only fine-tunes LLMs already quantized with AutoGPTQ (INT4, INT3, and INT2 are all supported). If you want to fine-tune an FP16/FP32 LLM, you must first quantize it with the version of AutoGPTQ that we have installed.
You can also look at the end of the notebook, where I wrote the code for quantization. For this article, I quantized Llama 2 7B and uploaded it to the Hugging Face Hub. You can use it to run this tutorial:
Note: The safetensors format is not supported yet. QA-LoRA expects to find a pytorch_model.bin file in the repository.
The group size for quantization was set to 32 (32g). I chose this group size because it’s hardcoded to 32 in QA-LoRA.
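For reference, the quantization step with this version of AutoGPTQ roughly looks like the sketch below (the calibration data and output path are assumptions; the full code is in the notebook):

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)

# 4-bit quantization with a group size of 32, as expected by QA-LoRA
quantize_config = BaseQuantizeConfig(bits=4, group_size=32, desc_act=False)
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)

# A real calibration set should contain several hundred samples; one is shown for brevity
examples = [tokenizer("auto-gptq is an easy-to-use model quantization library.")]
model.quantize(examples)

# Save a .bin checkpoint since QA-LoRA does not support safetensors
model.save_quantized("Llama-2-7b-4bit-32g-autogptq", use_safetensors=False)
tokenizer.save_pretrained("Llama-2-7b-4bit-32g-autogptq")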
Hardware Requirements
The following sections can run on the free instance of Google Colab or a GPU with at least 10 GB of VRAM.
QA-LoRA Fine-tuning
I fine-tuned with the default dataset, Alpaca, for only 100 steps, with a batch size of 1 and gradient_accumulation_steps set to 16. It consumes around 7 GB of VRAM.
If your fine-tuning appears unstable, changing the learning rate and/or LoRA alpha/rank may also improve the stability.
cd /content/qa-lora/
python qalora.py --model_path kaitchup/Llama-2-7b-4bit-32g-autogptq \
  --save_steps 10 \
  --output_dir output \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 16 \
  --max_steps 100 \
  --lora_r 16
With the T4 GPU of Google Colab, this fine-tuning took 28 minutes.
I uploaded the final checkpoint on the Hugging Face Hub:
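If you want to push your own adapter, something like this with the huggingface_hub library should work (the repository name is a placeholder):

from huggingface_hub import HfApi

api = HfApi()
repo_id = "your-username/llama-2-7b-qalora-adapter"  # placeholder
api.create_repo(repo_id, exist_ok=True)
# Upload the QA-LoRA output directory containing the adapter checkpoint
api.upload_folder(folder_path="output", repo_id=repo_id)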
Merge QA-LoRA Adapters
In contrast with QLoRA adapters, QA-LoRA adapters can be merged into the quantized base LLM without any performance loss.
Here is the code for merging:
import torch

# Path to the quantized base model
model_path = 'Llama-2-7b-4bit-32g-autogptq/gptq_model-4bit-32g.bin'
# Path to the adapter fine-tuned with QA-LoRA
lora_path = 'output/adapter_model.bin'
# Where the merged model will be saved
merged_path = 'output_model'
# The scale is the LoRA alpha divided by the LoRA rank. I trained with LoRA_alpha = LoRA_rank = 16
scale = 16 / 16
# The group size of the quantized base LLM
group_size = 32

# We merge using the CPU
model = torch.load(model_path, map_location='cpu')
lora = torch.load(lora_path, map_location='cpu')

# Strip the 'base_model.model.' prefix and the '.lora_A.weight' suffix to get the module names
tmp_keys = [key[17:-14] for key in lora.keys() if 'lora_A' in key]

for tmp_key in tmp_keys:
    # Fold the LoRA update (B @ A) into the quantization zero-points of the base model
    model[tmp_key + '.qzeros'] -= (lora['base_model.model.' + tmp_key + '.lora_B.weight'] @ lora['base_model.model.' + tmp_key + '.lora_A.weight']).t() * scale / group_size / model[tmp_key + '.scales']

torch.save(model, merged_path)
This code runs on the CPU so you only need enough CPU RAM to load the model to be merged. Note that the base LLM and the QA-LoRA adapter that we fine-tuned must be accessible locally.
Once merged, the model is ready for inference. It’s a standard GPTQ model.
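For example, assuming you put the merged .bin back into the quantized model's directory (replacing gptq_model-4bit-32g.bin), you can load it like any other GPTQ model. The snippet below is a sketch, not code from the QA-LoRA repository:

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_dir = "Llama-2-7b-4bit-32g-autogptq"  # directory containing the merged .bin and its config files
model = AutoGPTQForCausalLM.from_quantized(model_dir, device="cuda:0", use_safetensors=False)
tokenizer = AutoTokenizer.from_pretrained(model_dir)

prompt = "### Instruction: Give three tips for staying healthy.\n### Response:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))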
Conclusion
QA-LoRA works. We fine-tuned a quantization-aware LoRA for Llama 2.
We saw that quantization-aware fine-tuning has two significant advantages over QLoRA:
- It’s faster
- It fine-tunes an adapter that can be perfectly merged with the base LLM
The current implementation is not very flexible. This is still a very young project. The main issue is that it relies on an older version of AutoGPTQ. The authors of QA-LoRA plan to support the most recent version of AutoGPTQ later.
Note that in my experiment I used a 4-bit quantization. QA-LoRA already supports 2-bit and 3-bit quantization. You may try these lower precisions to further reduce memory consumption.
To support my work, consider subscribing to my newsletter:
The Kaitchup – AI on a Budget | Benjamin Marie, PhD | Substack