
Quantize Llama 3 8B with Bitsandbytes to Preserve Its Accuracy

Llama 2 vs. Llama 3 vs. Mistral 7B, quantized with GPTQ and Bitsandbytes

Generated with DALL-E

With quantization, we can reduce the size of large language models (LLMs). Quantized LLMs are easier to run on GPUs with smaller memory, effectively serving as a compression method for LLMs.

According to Meta’s own evaluation, Llama 3 8B is better than Llama 2 7B and Mistral 7B. However, the question remains: does Llama 3 8B maintain its superiority after quantization?

In other words, if Llama 3 is better than Mistral 7B and Llama 2 (Llama 3 > Mistral 7B > Llama 2 7B), is the quantized version also better than these models quantized (quantized Llama 3 > quantized Mistral 7B > quantized Llama 2 7B)?

In this article, we will answer this question. I quantized all the models to 8-bit and 4-bit with bitsandbytes, and to 8-bit, 4-bit, 3-bit, and 2-bit with GPTQ, and checked their performance on 3 different tasks. We will see that 8-bit quantization works reasonably well for Llama 3 with both quantization algorithms. I also found that while GPTQ 4-bit significantly degrades the model, bitsandbytes 4-bit quantization seems to work well.

GPTQ Quantization for Llama 3

GPTQ is a very popular quantization scheme that supports many neural architectures. With AutoGPTQ (MIT license), we can quantize LLMs to 8-bit, 4-bit, 3-bit, and 2-bit.

Quantize and Fine-tune LLMs with GPTQ Using Transformers and TRL

Llama 3 8B has 8.03 billion bfloat16 parameters. bfloat16 is a 16-bit data type. Quantizing a 16-bit parameter to 4-bit divides its size by 4. However, if we quantize Llama 3 8B to 4-bit, it doesn’t mean that the size of the model will be divided by 4. GPTQ only quantizes the linear layers, i.e., the attention and MLP modules for the Llama architecture.

Since Llama 3 has a very large vocabulary, the token embeddings account for a significant share of the 8.03 billion parameters, and they won’t be quantized by GPTQ. In fact, Llama 3 8B quantized to 4-bit occupies 5.74 GB, against 16.07 GB for the original 16-bit model.
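As a rough sanity check, these sizes can be estimated from Llama 3 8B’s configuration (a 128,256-token vocabulary and a hidden size of 4,096). The back-of-the-envelope sketch below ignores the group-wise scales and zero-points that GPTQ stores alongside the 4-bit weights, which explains the small gap with the measured size:

total_params = 8.03e9                        # all parameters of Llama 3 8B
embed_params = 2 * 128256 * 4096             # token embeddings + LM head, kept in 16-bit
linear_params = total_params - embed_params  # layers that GPTQ actually quantizes

size_16bit = total_params * 2 / 1e9                         # 2 bytes per 16-bit parameter
size_4bit = (linear_params * 0.5 + embed_params * 2) / 1e9  # 0.5 byte per 4-bit weight

print(f"16-bit: ~{size_16bit:.1f} GB, 4-bit: ~{size_4bit:.1f} GB")
# prints roughly 16.1 GB and 5.6 GB, close to the measured 16.07 GB and 5.74 GB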

Here is the code I used to quantize Llama 3:

from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.gptq import GPTQQuantizer
import torch

model_path = 'meta-llama/Meta-Llama-3-8B'
w = [8, 4, 3, 2]  # bit widths to quantize to

for b in w:
  quant_path = 'Meta-Llama-3-8B-gptq-' + str(b) + 'bit'

  # Load the original 16-bit model and its tokenizer
  tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
  model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, device_map="auto")

  # Quantize with GPTQ, using C4 as the calibration dataset
  quantizer = GPTQQuantizer(bits=b, dataset="c4", model_seqlen=2048)
  quantized_model = quantizer.quantize_model(model, tokenizer)

  # Save the quantized model and tokenizer (safetensors format)
  quantized_model.save_pretrained("./" + quant_path, safe_serialization=True)
  tokenizer.save_pretrained("./" + quant_path)
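Once saved, such a GPTQ checkpoint should load back like any other Transformers model, provided AutoGPTQ and Optimum are installed (a minimal sketch, using the 4-bit directory produced by the loop above):

from transformers import AutoModelForCausalLM, AutoTokenizer

quant_path = "./Meta-Llama-3-8B-gptq-4bit"

# The quantization config saved with the checkpoint tells Transformers to load the GPTQ weights
model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(quant_path)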

I also quantized Llama 2 7B, Llama 2 13B, and Mistral 7B with the same methods.

Here are the sizes of each model:

Figure by the author

As expected, due to its large vocabulary, Llama 3 8B remains larger than Llama 2 7B and Mistral 7B despite having a similar number of parameters. Its 2-bit version is as large as the 2-bit version of Llama 2 13B.

I pushed all the quantized models to the Hugging Face Hub.

Bitsandbytes Quantization for Llama 3

Bitsandbytes (MIT license) quantizes LLMs on the fly when they are loaded. In Hugging Face Transformers, we simply need to load the model as follows to quantize it to 8-bit:

AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", load_in_8bit=True)

or for 4-bit quantization:

AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", load_in_4bit=True)

Bitsandbytes usually yields slightly better results in 4-bit than GPTQ thanks to a superior data type: bitsandbytes uses NormalFloat4 (NF4) while GPTQ uses INT4.
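For finer control, the same 4-bit loading can be configured explicitly with BitsAndBytesConfig, which is where the NF4 data type is selected; a minimal sketch, assuming a recent version of Transformers with bitsandbytes installed:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4, the data type mentioned above
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for the matrix multiplications
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    device_map="auto",
    quantization_config=bnb_config,
)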

The downside of bitsandbytes is that it only supports quantization to 8-bit and 4-bit. It also creates models that are slower for inference.

Is Quantization Worse for Llama 3 8B than for Other LLMs?

Note: The full code I used to evaluate the models is in this notebook: Get the notebook (#70)

Now, we want to know whether quantization degrades Llama 3 8B more than it degrades the other models. To check this, I measured the zero-shot accuracy of all the models on 3 different tasks:

  • Winogrande
  • Arc Challenge
  • HellaSwag

For these evaluations, I used the Evaluation Harness (MIT license).
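For reference, a zero-shot run on these three tasks can also be scripted through the harness’s Python API; a minimal sketch, assuming lm-evaluation-harness v0.4 or later, shown here with the 4-bit bitsandbytes configuration:

from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Meta-Llama-3-8B,load_in_4bit=True",
    tasks=["winogrande", "arc_challenge", "hellaswag"],
    num_fewshot=0,  # zero-shot accuracy, as reported below
    batch_size=8,
)

# Print the metrics reported for each task
for task, metrics in results["results"].items():
    print(task, metrics)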

The results:

Note: Ignore the results for Llama 2 7B 8-bit. Clearly, something went wrong. I don’t know what happened with this particular configuration, but I couldn’t obtain better results.

Winogrande – Image by the author
HellaSwag – Image by the author
Arc Challenge – Image by the author

Note: BnB denotes bitsandbytes quantization.

With these results, we are more interested in how the accuracy decreases with the quantization precision than in the absolute performance of the models.

In a nutshell:

  • GPTQ performs poorly at quantizing Llama 3 8B to 4-bit.
  • bitsandbytes 4-bit maintains the accuracy of Llama 3, except on Arc Challenge, but even on this task Llama 3 8B 4-bit remains better than Llama 2 13B 4-bit.

The results with GPTQ are particularly interesting since GPTQ 4-bit usually doesn’t degrade the performance of a model much.

On these particular tasks, Mistral 7B and Llama 3 8B, not quantized, perform similarly. However, while GPTQ 4-bit quantization doesn’t have much effect on Mistral 7B, it significantly degrades the performance of Llama 3 8B.

Llama 2 7B quantized to 4-bit with GPTQ is actually better than Llama 3 8B 4-bit according to these benchmarks. For lower precisions, 3-bit quantization seems a bit better for Llama 2 7B, while for 2-bit precision all the models seem to perform the same, except for Llama 2 13B on HellaSwag.

These results tend to confirm observations made by others with other quantization algorithms. For instance, on Reddit, experiments with ExLlamaV2 show that Llama 2 7B yields a lower (better) perplexity than Llama 3 8B.

We don’t know why Llama 3 8B seems so difficult to quantize accurately with GPTQ and ExLlama. One assumption would be that since the model has been trained much longer than Mistral 7B and Llama 2 7B, its weights "contain" more information, i.e., they are less sparse. Consequently, reducing the precision of these "denser" weights deteriorates the quality of the model faster than for a model with sparser weights.

Conclusion

Should we quantize Llama 3 8B?

Clearly, we should avoid 4-bit (and lower) quantization with GPTQ, as it seems to make the model worse than Llama 2 7B. However, 8-bit quantization yields reasonably good results, as it doesn’t deteriorate the accuracy of Llama 3 8B much.

Using bitsandbytes for 4-bit quantization seems to be a good alternative. The main disadvantage is that bitsandbytes-quantized models are much slower than GPTQ models for inference.


To support my work, consider subscribing to my newsletter for more articles/tutorials on recent advances in AI:

The Kaitchup – AI on a Budget

