
There are many quantization methods to reduce the size of large language models (LLMs). Recently, better low-bit quantization methods have been proposed. For instance, AQLM achieves 2-bit quantization while preserving most of the model’s accuracy.
The main drawback of AQLM is that the quantization of large models is extremely costly. HQQ is another good alternative for low-bit quantization but requires further fine-tuning to preserve accuracy.
Intel is also very active in the research of better quantization algorithms. They propose AutoRound, a new quantization method based on signed gradient descent (SignSGD). AutoRound is especially accurate for low-bit quantization and quantizes faster than most other methods.
In this article, I review AutoRound. We will see how it works and how to quantize LLMs, such as Llama 3, with minimal accuracy drop. I found AutoRound to be a very good alternative to GPTQ and HQQ. It yields more accurate models.
I implemented the following notebook showing how to quantize LLMs with AutoRound and how to evaluate/benchmark the resulting models:
AutoRound: Optimized Signed Rounding for Accurate Low-bit Quantization
AutoRound is described in this paper by Intel:
Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs
To quantize LLMs, we have two main categories of methods:
- Quantization-Aware Training (QAT)
- Post-Training Quantization (PTQ)
QAT methods involve training LLMs with a simulated lower precision so that the model learns the effects of quantization. This approach performs well in practice but is difficult to adopt since it requires retraining the model. QAT is very costly, especially for large models.
On the other hand, PTQ methods don’t need to retrain the model. They are directly applied to the model’s weights. While PTQ is more convenient, it usually doesn’t perform as well as QAT.
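Both families rely on the same basic operation: rounding weights to a low-precision grid and mapping them back to floats. QAT inserts this round-trip into the forward pass during training so the model adapts to it, while PTQ applies it once to the trained weights. Here is a minimal, illustrative sketch of that round-trip (per-tensor symmetric quantization for simplicity; this is not Intel's implementation):
import torch

def fake_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    # Round-trip the weights through a low-precision integer grid and back to float.
    # QAT runs this inside the forward pass during training; PTQ applies it once
    # to the trained weights.
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax                       # one scale for the whole tensor
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale                                   # dequantized ("fake-quantized") weights

w = torch.randn(4096, 4096)
w_q = fake_quantize(w, bits=4)
print((w - w_q).abs().mean())                          # average rounding error introduced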
Since they are much easier to apply, the most widely used quantization methods for LLMs are PTQ methods: GPTQ, AWQ, AQLM, BnB, SqueezeLLM, etc. Moreover, they are all weight-only quantization methods.
AutoRound is also a weight-only quantization method.
For AutoRound, Intel uses SignSGD, an optimization method, to find the optimal rounding of the weights when converting them from float16 (usually) to integers (INT2, INT3, INT4, or INT8). This optimization requires training, but SignSGD is efficient enough that AutoRound is about as fast as GPTQ or AWQ, and several orders of magnitude faster than AQLM.
SignSGD is also intuitive, facilitating easy step-size adjustments, and robust to hyperparameter changes. For instance, in the paper presenting AutoRound, Intel used the same hyperparameters across (almost) all experiments: 200 steps and a learning rate of 5e-3.
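Concretely, AutoRound learns a small rounding offset for each weight and updates it with SignSGD: instead of stepping along the raw gradient, each update steps along its sign, so the step size is controlled entirely by the learning rate. The toy sketch below illustrates the idea on a single weight matrix; the setup and variable names are my own, and for simplicity the loss is a plain weight-reconstruction error, whereas AutoRound minimizes the error of the layer's output on calibration data.
import torch

torch.manual_seed(0)
w = torch.randn(256, 256)                    # weights of one layer (toy example)
scale = w.abs().max() / 7                    # scale so weights map roughly onto a 4-bit grid
V = torch.zeros_like(w, requires_grad=True)  # learnable rounding offsets in [-0.5, 0.5]
lr = 5e-3                                    # learning rate used in the AutoRound paper

for step in range(200):                      # 200 steps, the paper's default
    x = w / scale + V
    q = (torch.round(x) - x).detach() + x    # straight-through estimator for rounding
    loss = ((q * scale - w) ** 2).mean()     # reconstruction error of the quantized weights
    loss.backward()
    with torch.no_grad():
        V -= lr * V.grad.sign()              # SignSGD: step along the sign of the gradient
        V.clamp_(-0.5, 0.5)                  # keep the offsets within their allowed range
    V.grad.zero_()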
Moreover, AutoRound can produce models in the same format as GPTQ. This makes AutoRound models directly usable by all the frameworks supporting GPTQ: vLLM, Hugging Face Transformers, etc.
How well does AutoRound perform?
According to the evaluation conducted by Intel, AutoRound is state-of-the-art. It significantly outperforms GPTQ, HQQ, and AWQ, especially for low-bit quantization.

Unfortunately, Intel didn’t compare AutoRound with AQLM which is the only other method, to the best of my knowledge, that can achieve this level of accuracy for 2-bit quantization.
Looking at these numbers, I would guess that AQLM and AutoRound perform similarly. However, as we will see in the next section, AutoRound quantizes models much faster.
Quantize Llama 3 with AutoRound
Quantizing LLMs with AutoRound is straightforward since it supports models loaded with Hugging Face Transformers.
The implementation of AutoRound released by Intel is available here:
- GitHub: intel/auto-round (Apache 2.0 license)
It can be installed with pip:
pip install auto-round
Then, import the following:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from auto_round import AutoRound
Next, we load the model (Llama 3 8B) and its tokenizer. The tokenizer will be used to tokenize the training data for SignSGD.
model_name = "meta-llama/Meta-Llama-3-8B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_name)
The next step is the quantization. We have to set the following hyperparameters:
- bits: Higher bitwidth means a more accurate but larger model. As we will see in the next section, 4-bit quantization for an 8B model is very accurate while 2-bit quantization can significantly degrade it.
- group_size: Similar to "bits", it impacts the size and accuracy of the model but to a much lower extent. A lower group size will yield a more accurate but larger model (see the rough size estimate after this list). I suggest using a value of 128 or lower.
- iters: The number of training steps for SignSGD. The default is 200 but in its recipes, Intel uses 1,000 for Llama 2 and Llama 3. More iterations should yield a more accurate model but increase the quantization time. As we will see in the next section, 200 iterations seem enough.
- batch_size and gradient_accumulate_steps: The batch size used for SignSGD. I use 2 and 4, respectively, which yields a training batch size of 8 (2*4). Increase the batch size if you have enough memory on your GPU. I didn’t test it but a larger batch size may yield better results.
- seqlen: The sequence length of the training examples used for SignSGD. Longer sequences use more GPU memory. I use 512 but, ideally, we should set it to the maximum sequence length supported by the model, i.e., 8,192.
- sym: Whether to use symmetric quantization. The default is asymmetric quantization. Symmetric quantization yields much faster models at the cost of a slight accuracy drop.
- device: The device on which the quantization will be performed. CPU is supported, but I can't recommend it since it is far too slow compared to "cuda" (GPU).
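To get a feel for how bits and group_size translate into model size, note that every group of group_size weights stores its own scale and zero point, so this per-group metadata is amortized over the group. Here is a rough back-of-the-envelope estimate (assuming, for simplicity, 16 bits each for the scale and zero point; real formats may pack the zero point more compactly):
def effective_bits_per_weight(bits: int, group_size: int,
                              scale_bits: int = 16, zero_bits: int = 16) -> float:
    # Each group of `group_size` weights shares one scale and one zero point.
    return bits + (scale_bits + zero_bits) / group_size

for bits, group_size in [(4, 128), (4, 32), (2, 128), (2, 32)]:
    print(f"{bits}-bit, group_size={group_size}: "
          f"{effective_bits_per_weight(bits, group_size):.2f} bits/weight")
# 4-bit with group_size=128 costs about 4.25 bits per weight;
# shrinking the group to 32 improves accuracy but raises the cost to 5 bits per weight.
Once these hyperparameters are set, the quantization itself takes just two lines: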
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, batch_size=2, seqlen=512, sym=False, iters=1000, gradient_accumulate_steps=4, device='cuda')
autoround.quantize()
Finally, we can save the quantized model. By default, it uses the GPTQ format. However, Intel recommends the "auto_round" format as they identified some bugs with the GPTQ serialization.
output_dir = "./tmp_autoround"
autoround.save_quantized(output_dir)  # or save_quantized(output_dir, format="auto_round") to use the auto_round format
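If you keep the default GPTQ format, the saved checkpoint can then be loaded for inference like any other GPTQ model with Transformers. A minimal sketch, assuming accelerate and a GPTQ-capable backend (e.g., auto-gptq) are installed, and reusing the tokenizer loaded earlier:
from transformers import AutoModelForCausalLM

# Load the quantized checkpoint saved above (GPTQ format by default).
quantized_model = AutoModelForCausalLM.from_pretrained(
    "./tmp_autoround",
    device_map="auto",
)

prompt = "The main advantage of low-bit quantization is"
inputs = tokenizer(prompt, return_tensors="pt").to(quantized_model.device)
outputs = quantized_model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))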
Note: The following quantization time and memory consumption have been computed using the L4 instance of Google Colab and an 8 billion parameter model (Llama 3 8B).
I released Llama 3 8B quantized with AutoRound here:
- AutoRound: Llama 3 (CC-BY)
Quantization Time for AutoRound

Overall, AutoRound is fast. With the default parameters (first row) and a sequence length of 512, it takes 1.5 hours. Intuitively, multiplying the number of SignSGD iterations by 5 also multiplies the quantization time by 5 (third row).
It’s fast enough, but I could not confirm that it is faster than AutoGPTQ, as claimed in the paper.
Memory Consumption for AutoRound

AutoRound consumes a very reasonable amount of GPU RAM: an 8 GB GPU would be enough. However, it requires 32 GB of CPU RAM, which I found surprisingly high for an 8B model. Quantizing a 70B model might not be feasible with less than 128 GB of CPU RAM.
Benchmarking Llama 3 8B Quantized with AutoRound
Accuracy

My first observation is that AutoRound 4-bit (4th row) significantly outperforms both bitsandbytes (BnB) and AutoGPTQ.
However, I couldn’t obtain good results for Llama 3 8B quantized to 2-bit. Trying different sets of hyperparameters didn’t improve the results.
Inference Throughput for an AutoRound Model with vLLM
To benchmark the throughput, I used vLLM's benchmarking script (Apache 2.0 license).

Note: Don’t pay much attention to the numbers as they depend on many other inference hyperparameters such as the batch size. These throughputs might be difficult to obtain for real use cases. The main goal of this table is to compare the methods.
As expected, AutoRound models are as fast as AutoGPTQ models since they use the same format. With asymmetric quantization, the model is significantly slower.
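For reference, serving such a model with vLLM for offline inference is straightforward since the checkpoint is in GPTQ format. A minimal sketch (the path is the local output directory from earlier; vLLM can usually infer the quantization method from the checkpoint config, so the explicit quantization argument is optional):
from vllm import LLM, SamplingParams

# Load the AutoRound model saved earlier in GPTQ format.
llm = LLM(model="./tmp_autoround", quantization="gptq")
sampling_params = SamplingParams(temperature=0.7, max_tokens=100)

outputs = llm.generate(["Low-bit quantization is useful because"], sampling_params)
print(outputs[0].outputs[0].text)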
Note also that the models can be made faster by converting them to the Marlin format (which only supports models with symmetric quantization).
Marlin: Nearly Ideal Inference Speed for 4-bit Models with vLLM (1k+ tokens/sec)
Conclusion
AutoRound is a fast quantization framework that is easy to use and already supports most LLMs.
For 4-bit quantization, using AutoRound seems to be a good alternative to AutoGPTQ, especially for Llama 3.
Quantize Llama 3 8B with Bitsandbytes to Preserve Its Accuracy
Moreover, since AutoRound uses the same format as GPTQ, CUDA kernels optimized for fast inference with GPTQ models will also work with AutoRound models.
Reference
Cheng, W., Zhang, W., Shen, H., Cai, Y., He, X. and Lv, K., 2023. Optimize weight rounding via signed gradient descent for the quantization of LLMs. arXiv preprint arXiv:2309.05516.
To support my work, consider subscribing to my newsletter for more articles/tutorials on recent advances in AI: