
AutoRound: Accurate Low-bit Quantization for LLMs

Between quantization-aware training and post-training quantization

Generated with DALL-E

There are many quantization methods to reduce the size of large language models (LLMs). Recently, better low-bit quantization methods have been proposed. For instance, AQLM achieves 2-bit quantization while preserving most of the model’s accuracy.

The main drawback of AQLM is that the quantization of large models is extremely costly. HQQ is another good alternative for low-bit quantization but requires further fine-tuning to preserve accuracy.

Intel is also very active in the research of better quantization algorithms. They propose AutoRound, a new quantization method adopting signed gradient descent (SignSGD). AutoRound is especially accurate for low-bit quantization and quantizes faster than most other methods.

In this article, I review AutoRound. We will see how it works and how to quantize LLMs, such as Llama 3, with minimal accuracy drop. I found AutoRound to be a very good alternative to GPTQ and HQQ, as it yields more accurate models.

I implemented the following notebook showing how to quantize LLMs with AutoRound, and evaluate/benchmark the resulting models:

Get the notebook (#82)

AutoRound: Optimized Signed Rounding for Accurate Low-bit Quantization

AutoRound is described in this paper by Intel:

Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs

To quantize LLMs, we have two main categories of methods:

  • Quantization-Aware Training (QAT)
  • Post-Training Quantization (PTQ)

QAT methods involve training LLMs with a simulated lower precision. This way, the model learns to compensate for the effects of quantization. This approach performs well in practice but is difficult to adopt since it requires retraining the model. QAT is very costly, especially for large models.

On the other hand, PTQ methods don’t need to retrain the model. They are directly applied to the model’s weights. While PTQ is more convenient, it usually doesn’t perform as well as QAT.

Since they are much easier to apply, the most used quantization methods for LLMs are PTQ methods: GPTQ, AWQ, AQLM, BnB, SqueezeLLM, etc. Moreover, they are all weight-only quantization methods.
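
For intuition, the simplest weight-only PTQ baseline is round-to-nearest (RTN): weights are rescaled and rounded after training, with no gradient updates. Below is a minimal sketch of asymmetric RTN quantization of a single weight group; it only illustrates the rounding that methods like AutoRound try to improve, and is not the code of any particular library:

import torch

def rtn_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    # Round-to-nearest (RTN) asymmetric quantization of one weight group.
    # Real PTQ methods keep the integer weights plus (scale, zero-point);
    # here we dequantize immediately to inspect the error.
    qmax = 2**bits - 1
    scale = (w.max() - w.min()) / qmax      # one scale for the whole group
    zero = torch.round(-w.min() / scale)    # zero-point
    q = torch.clamp(torch.round(w / scale + zero), 0, qmax)
    return (q - zero) * scale               # dequantized approximation

w = torch.randn(128)                        # a toy weight group
w_q = rtn_quantize(w, bits=4)
print((w - w_q).abs().max())                # maximum quantization error

Naive RTN holds up reasonably well at 8-bit but degrades quickly at lower bitwidths, which is precisely the regime AutoRound targets.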

AutoRound is also a weight-only quantization method.

For AutoRound, Intel exploits SignSGD, an optimization method, to find the optimal rounding of the weights, i.e., from float16 (usually) to integer (INT2, INT3, INT4, or INT8). This method requires training, but SignSGD is efficient enough to make AutoRound as fast as GPTQ or AWQ, and several orders of magnitude faster than AQLM.

SignSGD is also intuitive, makes step-size adjustments easy, and is robust to hyperparameter changes. For instance, in the paper presenting AutoRound, Intel used the same hyperparameters across (almost) all experiments: 200 steps and a learning rate of 5e-3.
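
The intuition is that SignSGD keeps only the sign of each gradient, so every update has a fixed magnitude set by the learning rate. The snippet below is a schematic illustration of SignSGD on a toy problem; roughly speaking, AutoRound applies this kind of update to learnable rounding offsets, block by block, using calibration data, which is not reproduced here:

import torch

def signsgd_step(params, lr=5e-3):
    # SignSGD: move each parameter by a fixed step opposite to the sign of
    # its gradient; the gradient magnitude is ignored.
    with torch.no_grad():
        for p in params:
            if p.grad is not None:
                p -= lr * p.grad.sign()

# Toy usage: minimize ||x - target||^2 with sign updates
x = torch.zeros(4, requires_grad=True)
target = torch.tensor([0.3, -0.7, 0.2, 0.05])
for _ in range(200):                      # 200 steps, as in the paper's recipe
    loss = ((x - target) ** 2).sum()
    loss.backward()
    signsgd_step([x], lr=5e-3)
    x.grad = None
print(x)                                  # close to target, within one step size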

Moreover, AutoRound can produce models in the same format as GPTQ. This makes AutoRound models directly usable with all the frameworks supporting the GPTQ format: vLLM, Hugging Face Transformers, etc.
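
For instance, loading an AutoRound model serialized in the GPTQ format should be no different from loading any other GPTQ checkpoint with Transformers. Here is a minimal sketch, assuming a hypothetical local checkpoint directory and a GPTQ backend (e.g., AutoGPTQ) installed:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical path to an AutoRound model serialized in the GPTQ format
model_id = "./llama-3-8b-autoround-gptq"

# Transformers picks up the GPTQ quantization config stored in the checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "Quantization reduces the memory footprint of LLMs by"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=30)[0]))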

How well does AutoRound perform?

According to the evaluation conducted by Intel, AutoRound is state-of-the-art. It significantly outperforms GPTQ, HQQ, and AWQ, especially for low-bit quantization.

Accuracy results averaged across 11 tasks. The "V2" models are Llama 2 – source (CC-BY)

Unfortunately, Intel didn’t compare AutoRound with AQLM which is the only other method, to the best of my knowledge, that can achieve this level of accuracy for 2-bit quantization.

By looking at these numbers, I guess that AQLM and AutoRound perform similarly. However, as we will see in the next section, AutoRound quantizes models much faster.

Quantize Llama 3 with AutoRound

Quantizing LLMs with AutoRound is straightforward since it supports models loaded with Hugging Face Transformers.

The implementation of AutoRound released by Intel is available here:

It can be installed with pip:

pip install auto-round

Then, import the following:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from auto_round import AutoRound

Next, we load the model (Llama 3 8B) and its tokenizer. The tokenizer will be used to tokenize the training data for SignSGD.

model_name = "meta-llama/Meta-Llama-3-8B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

The next step is the quantization. We have to set the following hyperparameters:

  • bits: Higher bitwidth means a more accurate but larger model. As we will see in the next section, 4-bit quantization for an 8B model is very accurate while 2-bit quantization can significantly degrade it.
  • group_size: Similar to "bits", it impacts the size and accuracy of the model but to a much lower extent. A lower group size will yield a more accurate but larger model. I suggest using a value of 128 or lower.
  • iters: The number of training steps for SignSGD. The default is 200 but in its recipes, Intel uses 1,000 for Llama 2 and Llama 3. More iterations should yield a more accurate model but increase the quantization time. As we will see in the next section, 200 iterations seem enough.
  • batch_size and gradient_accumulate_steps: The batch size and the number of gradient accumulation steps used for SignSGD. I use 2 and 4, respectively, which yields an effective batch size of 8 (2*4). Increase the batch size if you have enough memory on your GPU. I didn’t test it, but a larger batch size may yield better results.
  • seqlen: The sequence length of the training examples used for SignSGD. Longer sequences use more GPU memory. I use 512 but, ideally, we should set it to the maximum sequence length supported by the model, i.e., 8,192.
  • sym: Use symmetric quantization or not. The default is not using symmetric quantization. Symmetric quantization yields much faster models at the cost of a slight accuracy drop.
  • device: The device on which the quantization will be performed. It supports CPU but I can’t recommend it. It’s too slow compared to "CUDA" (GPUs).

autoround = AutoRound(model, tokenizer, bits=4, group_size=128, batch_size=2,
                      seqlen=512, sym=False, iters=1000,
                      gradient_accumulate_steps=4, device='cuda')
autoround.quantize()

Finally, we can save the quantized model. By default, it uses the GPTQ format. However, Intel recommends the "auto_round" format as they identified some bugs with the GPTQ serialization.

output_dir = "./tmp_autoround"
autoround.save_quantized(output_dir)  # or save_quantized(output_dir, format="auto_round") to use the auto_round format

Note: The following quantization time and memory consumption have been computed using the L4 instance of Google Colab and an 8 billion parameter model (Llama 3 8B).

I released Llama 3 8B quantized with AutoRound here:

Quantization Time for AutoRound

Table by the author

Overall, AutoRound is fast. With the default parameters (first row) and a sequence length of 512, it takes 1.5 hours. Intuitively, multiplying the number of SignSGD iterations by 5 also multiplies the quantization time by roughly 5 (third row).

It’s fast enough, but I could not confirm that it is faster than AutoGPTQ, as claimed in the paper.

Memory Consumption for AutoRound

Screenshot by the author

AutoRound consumes a very reasonable amount of GPU RAM. An 8 GB GPU would be enough. However, it requires 32 GB of CPU RAM which I found surprisingly high for an 8B model. Quantizing a 70B model might not be feasible with less than 128 GB of CPU RAM.
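
For reference, here is a minimal sketch of how such measurements can be collected around the quantization call, using torch’s CUDA allocator statistics and psutil for the CPU resident memory (an illustration, not the exact script used for the numbers above):

import os
import psutil
import torch

def report_memory(tag: str):
    # Peak GPU memory allocated by PyTorch (GB) and current CPU RSS (GB)
    gpu_gb = torch.cuda.max_memory_allocated() / 1e9 if torch.cuda.is_available() else 0.0
    cpu_gb = psutil.Process(os.getpid()).memory_info().rss / 1e9
    print(f"[{tag}] GPU peak: {gpu_gb:.1f} GB | CPU RSS: {cpu_gb:.1f} GB")

report_memory("before quantization")
# autoround.quantize()   # run the quantization here
report_memory("after quantization")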

Benchmarking Llama 3 8B Quantized with AutoRound

Accuracy

Table by the author

My first observation is that AutoRound 4-bit (4th row) significantly outperforms both bitsandbytes (BnB) and AutoGPTQ.

However, I couldn’t obtain good results for Llama 3 8B quantized to 2-bit. Trying different sets of hyperparameters didn’t improve the results.

Inference Throughput for an AutoRound Model with vLLM

To benchmark the throughput, I used the vLLM benchmarking script (Apache 2.0 license).
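
To give a rough idea of what such a benchmark measures, here is a minimal throughput sketch with vLLM’s Python API, assuming a hypothetical local path to the quantized model (the actual benchmarking script is more thorough):

import time
from vllm import LLM, SamplingParams

# Hypothetical path to the AutoRound model saved in the GPTQ format
llm = LLM(model="./llama-3-8b-autoround-gptq", quantization="gptq")
params = SamplingParams(max_tokens=256, temperature=0.8)

prompts = ["Explain quantization in one paragraph."] * 64   # a small batch
start = time.time()
outputs = llm.generate(prompts, params)
elapsed = time.time() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.0f} output tokens/sec")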

Table by the author

Note: Don’t pay much attention to the numbers as they depend on many other inference hyperparameters such as the batch size. These throughputs might be difficult to obtain for real use cases. The main goal of this table is to compare the methods.

As expected, AutoRound models are as fast as AutoGPTQ models since they use the same format. With asymmetric quantization, the model is significantly slower.

Note also that the models can be made faster by converting them to the Marlin format (which only supports symmetric quantization).

Marlin: Nearly Ideal Inference Speed for 4-bit Models with vLLM (1k+ tokens/sec)

Conclusion

AutoRound is a fast quantization framework that is easy to use and already supports most LLMs.

For 4-bit quantization, using AutoRound seems to be a good alternative to AutoGPTQ, especially for Llama 3.

Quantize Llama 3 8B with Bitsandbytes to Preserve Its Accuracy

Moreover, since AutoRound uses the same format as GPTQ, CUDA kernels optimized for fast inference with GPTQ models will also work with AutoRound models.

Reference

Cheng, W., Zhang, W., Shen, H., Cai, Y., He, X. and Lv, K., 2023. Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs. arXiv preprint arXiv:2309.05516.


To support my work, consider subscribing to my newsletter for more articles/tutorials on recent advances in AI:

The Kaitchup – AI on a Budget
