
GPTQ or bitsandbytes: Which Quantization Method to Use for LLMs – Examples with Llama 2

Large language model quantization for affordable fine-tuning and inference on your computer

Image by the author – Made with an illustration from Pixabay

As Large Language Models (LLMs) have grown bigger, with more and more parameters, new techniques to reduce their memory usage have also been proposed.

One of the most effective methods to reduce the model size in memory is quantization. You can see quantization as a compression technique for LLMs. In practice, the main goal of quantization is to lower the precision of the LLM’s weights, typically from 16-bit to 8-bit, 4-bit, or even 3-bit, with minimal performance degradation.
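For illustration, here is a rough back-of-the-envelope estimate of what these precisions mean for the weights of a 7-billion-parameter model. It ignores the small overhead added by the quantization constants, so treat the numbers as approximations.

def weight_memory_gb(n_params: float, bits: int) -> float:
    # Memory needed to store only the model weights, in gigabytes
    return n_params * bits / 8 / 1e9

n_params = 7e9  # e.g., Llama 2 7B
for bits in (16, 8, 4, 3):
    print(f"{bits}-bit: ~{weight_memory_gb(n_params, bits):.1f} GB")
# 16-bit: ~14.0 GB, 8-bit: ~7.0 GB, 4-bit: ~3.5 GB, 3-bit: ~2.6 GB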

There are two popular Quantization methods for LLMs: GPTQ and bitsandbytes.

In this article, I discuss the main differences between these two approaches. Both have their own advantages and disadvantages that make them suitable for different use cases. I present a comparison of their memory usage and inference speed using Llama 2, and I also discuss their performance based on experiments from previous work.

Note: If you want to know more about quantization, I recommend reading this excellent introduction by Maxime Labonne:

Introduction to Weight Quantization

GPTQ: Post-training quantization for lightweight storage and fast inference

GPTQ (Frantar et al., 2023) was first applied to models ready to deploy. In other words, once the model is fully fine-tuned, GPTQ will be applied to reduce its size.

GPTQ can lower the weight precision to 4-bit or 3-bit. In practice, GPTQ is mainly used for 4-bit quantization. 3-bit quantization has been shown to be very unstable (Dettmers and Zettlemoyer, 2023).

GPTQ doesn’t need to load the entire model into memory to quantize it. Instead, it loads and quantizes the LLM module by module. Quantization also requires a small sample of data for calibration, and the whole process can take more than one hour on a consumer GPU.

In a previous article, I did experiments to quantize Llama 2 7B with 4-bit precision using GPTQ. The original model weighs 13.5 GB on the hard drive but after quantization, the model size is reduced to 3.9 GB (28.9% of its original size).

Quantization of Llama 2 with GPTQ for Fast Inference on Your Computer

Once quantized, the model can run on a much smaller GPU. For instance, the original Llama 2 7B wouldn’t run with 12 GB of VRAM (which is about what you get on a free Google Colab instance), but it would easily run once quantized. Not only would it run, but it would also leave a significant amount of VRAM unused, allowing inference with bigger batches.

The quantization of bitsandbytes is also optimized for inference as I will show you in the next sections of this article.

bitsandbytes: On-the-fly quantization for super simple fine-tuning and efficient inference

bitsandbytes also supports quantization but with a different kind of 4-bit precision denoted NormalFloat (NF).

It was proposed at the same time as QLoRa for fine-tuning quantized LLMs with adapters, which at that time was not possible with GPTQ LLMs (this is possible since June 2023). The convenient integration of 4-bit NF (nf4) in QLoRa is the main advantage of bitsandbytes over GPTQ.

I showed how to fine-tune Llama 2 7B with nf4 in a previous article.

Fine-tune Llama 2 on Your Computer with QLoRa and TRL

With bitsandbytes, the model is transparently quantized on the fly when it is loaded. If you use Hugging Face transformers, it only requires setting "load_in_4bit" to "True" when you call the "from_pretrained" method.
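For illustration, here is a minimal loading sketch with transformers. The model name is just an example (Llama 2 is gated on the Hugging Face Hub), and device_map="auto" assumes the accelerate library is installed. Note that load_in_4bit=True alone uses the default 4-bit settings; the nf4 data type and double quantization can be selected explicitly with a BitsAndBytesConfig, as shown later in this article.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # example checkpoint, gated on the Hugging Face Hub

# The weights are quantized to 4-bit on the fly while the model is loaded
model = AutoModelForCausalLM.from_pretrained(
    model_id, load_in_4bit=True, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)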

A significant difference with GPTQ is that at the time of writing this article, bitsandbytes can’t serialize nf4 quantized models. The model has to be quantized at loading time, every time.

Software and hardware requirements for QLoRa and GPTQ

Hardware

NVIDIA introduced support for 4-bit precision with its Turing GPUs in 2018. Free Google Colab instances run on T4 GPUs, which are based on the Turing architecture. I confirmed that you can run 4-bit quantization on Google Colab.

As for consumer GPUs, I can only say with certainty that it is supported by the RTX 30xx GPUs (I tried it on my RTX 3060) and more recent ones. In theory, it should also work with the GTX 16xx and RTX 20xx since they also use the Turing architecture, but I didn’t try it and couldn’t find any evidence that GPTQ or bitsandbytes nf4 would work on these GPUs. Note: If you know of such work, please drop a link in the comments and I’ll update this paragraph.

Software

GPTQ’s official repository is on GitHub (Apache 2.0 License). It can be directly used to quantize OPT, BLOOM, or LLaMa, with 4-bit and 3-bit precision. However, you will find that most quantized LLMs available online, for instance, on the Hugging Face Hub, were quantized with AutoGPTQ (Apache 2.0 License).

AutoGPTQ is user-friendly and uses interfaces from Hugging Face transformers. It is also well-documented and maintained.

To install AutoGPTQ, I recommend following the instructions given in its GitHub repository:

git clone https://github.com/PanQiWei/AutoGPTQ.git && cd AutoGPTQ
pip install .
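For reference, here is a minimal quantization sketch following the pattern shown in the AutoGPTQ README. The model name, the calibration text, and the output directory are placeholders; a real run should use a few hundred calibration samples (for instance, taken from C4).

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "meta-llama/Llama-2-7b-hf"   # placeholder: any causal LM supported by AutoGPTQ
quantized_dir = "llama-2-7b-gptq-4bit"  # placeholder output directory

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)

# Calibration data: use a few hundred samples in practice
examples = [tokenizer("Quantization reduces the precision of the weights of a language model.")]

quantize_config = BaseQuantizeConfig(
    bits=4,          # 4-bit precision (3-bit is also supported)
    group_size=128,  # the common "128g" setting
    desc_act=False,
)

model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
model.quantize(examples)  # quantizes the model module by module using the calibration samples
model.save_quantized(quantized_dir, use_safetensors=True)

# The quantized model can later be reloaded for inference:
# model = AutoGPTQForCausalLM.from_quantized(quantized_dir, device="cuda:0")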

The bitsandbytes library (MIT license) is described by its repository as a "wrapper around CUDA custom functions". In practice, it is directly supported by Hugging Face transformers, very much like Accelerate and Datasets, even though bitsandbytes is not an official Hugging Face library. For instance, if you try to quantize a model when loading it, transformers will tell you to install bitsandbytes.

You can get it with pip:

pip install bitsandbytes

Note: One day after I submitted this article to Towards Data Science, Hugging Face added support for AutoGPTQ models in the transformers library. Using GPTQ inside transformers may enable even better performance than what I report in the following sections.

Comparison between GPTQ and bitsandbytes

In this section, I present comparisons between models quantized using both quantization methods, according to:

  • The perplexity they reach on some datasets
  • Their GPU VRAM consumption
  • Their inference speed

For the perplexity evaluation, I rely on numbers already published. For VRAM consumption, I rely on my own experiments, also supported by numbers already published. For the inference speed, I couldn’t easily find results already published online, so I present my own results obtained with Llama 2 7B.

You can reproduce my results using my notebook published in The Kaitchup (notebook #11), my substack newsletter.

GPTQ vs. bitsandbytes: Perplexity

The main author of AutoGPTQ evaluated LLaMa (the first version) quantized with GPTQ and bitsandbytes by computing the perplexity on the C4 dataset.

The results are released in the "GPTQ-for-LLaMA" repository (Apache 2.0 License):

Results provided by GPTQ-for-LLaMA

There is a lot of information to unpack here. Each table presents the results for a different size of LLaMA. Let’s focus on the last column, c4(ppl), for each table.

We will compare GPTQ-128g, which is GPTQ 4-bit, with nf4-double_quant and nf4, which are both bitsandbytes quantization algorithms. "nf4-double_quant" is a variant that quantizes the quantization constants themselves.
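To reproduce these two bitsandbytes variants with transformers, the only difference is the bnb_4bit_use_double_quant flag of BitsAndBytesConfig. Here is a sketch, again with an example model name:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # example checkpoint

# nf4-double_quant: the quantization constants are themselves quantized,
# which saves a fraction of a bit per parameter
nf4_dq_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,   # set to False for plain nf4
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=nf4_dq_config, device_map="auto"
)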

For the 7B version, they all perform the same, at 5.30 ppl.

We see a difference for the 13B and 33B versions where GPTQ yields a lower perplexity. The results suggest that GPTQ seems better, compared to nf4, as the model gets bigger.

GPTQ seems to have a small advantage here over bitsandbytes’ nf4.

GPTQ vs. bitsandbytes: VRAM Usage

In the table above, the author also reports on VRAM usage. We can see that nf4-double_quant and GPTQ use almost the same amount of memory.

nf4 without double quantization uses significantly more memory than GPTQ. For LLaMA 33B, the difference is greater than 1 GB. Double quantization is necessary to match GPTQ’s memory usage.

In my own experiments with Llama 2 7B and using 3 different GPUs, I also observed that GPTQ and nf4-double_quant consume a very similar amount of VRAM. Note: I used the T4, V100, and A100 40 GB GPUs that are all available on Google Colab PRO.

GPTQ vs. bitsandbytes: Inference Speed

To get an idea of the inference speed, I ran 5 different prompts with both quantized models, without batching, generating outputs of up to 1,000 tokens (see the notebook). For each prompt, I counted how many tokens per second were generated. Then, I averaged over the 5 prompts.
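The measurement itself is simple. Below is a minimal sketch of how such a number can be obtained with transformers, assuming model and tokenizer are already loaded on a GPU and using hypothetical prompts:

import time
import torch

prompts = ["Tell me about gravity.", "Write a short poem about the sea."]  # hypothetical prompts
speeds = []

for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    torch.cuda.synchronize()
    start = time.time()
    outputs = model.generate(**inputs, max_new_tokens=1000, do_sample=True)
    torch.cuda.synchronize()
    elapsed = time.time() - start
    new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
    speeds.append(new_tokens / elapsed)

print(f"Average decoding speed: {sum(speeds) / len(speeds):.1f} tokens/second")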

Table by the author.

The clear winner here is GPTQ. It is twice as fast as bitsandbytes nf4 with double quantization.

The results also show that the 3 GPUs achieve almost the same speed. Getting a more expensive GPU won’t make inference faster if you don’t do batching. However, even with small batches, I would expect to see some significant differences between the T4 and the A100.

Conclusion: GPTQ vs. bitsandbytes

To sum up, GPTQ 4-bit is better than bitsandbytes nf4 if you are looking for better performance. It achieves a lower perplexity and a faster inference for similar VRAM consumption.

However, if you are looking for fine-tuning quantized models, bitsandbytes is a better and more convenient alternative thanks to its support by Hugging Face libraries. bitsandbytes quantization is the algorithm behind QLoRA which allows fine-tuning of quantized models with adapters.

If you are developing/deploying LLMs on consumer hardware, I would recommend the following:

  • Fine-tune the LLM with bitsandbytes nf4 and QLoRa
  • Merge the adapter into the LLM (see the merging sketch after this list)
  • Quantize the resulting model with GPTQ 4-bit
  • Deploy
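Below is a minimal sketch of the merging step with the PEFT library, assuming the QLoRa adapter was saved to a hypothetical ./qlora-adapter directory. The base model is reloaded in 16-bit, since adapters can’t be merged directly into 4-bit bitsandbytes weights.

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model_id = "meta-llama/Llama-2-7b-hf"  # example base model
adapter_dir = "./qlora-adapter"             # hypothetical path to the fine-tuned adapter

# Reload the base model in 16-bit before merging
base_model = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype=torch.float16)

# Attach the adapter and fold its weights into the base model
model = PeftModel.from_pretrained(base_model, adapter_dir)
model = model.merge_and_unload()
model.save_pretrained("./llama-2-7b-merged")  # ready to be quantized with GPTQ (see the earlier sketch)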

Here is a summary of the pros and cons of both quantization methods:

bitsandbytes pros:

  • Supports QLoRa
  • On-the-fly quantization

bitsandbytes cons:

  • Slow inference
  • Quantized models can’t be serialized

GPTQ pros:

  • Serialization
  • Supports 3-bit precision
  • Fast

GPTQ cons:

  • Model quantization is slow
  • Fine-tuning GPTQ models is possible but understudied
