
Recent developments in low-bit quantization for LLMs, like AQLM and AutoRound, now show acceptable levels of degradation on downstream tasks, especially for large models. That said, 2-bit quantization still introduces a noticeable accuracy loss in most cases.
One promising algorithm for low-bit quantization is VPTQ (MIT license), proposed by Microsoft. It was introduced in October 2024 and has since shown excellent performance and efficiency in quantizing large models.
In this article, we will:
- Review the VPTQ quantization algorithm.
- Demonstrate how to use VPTQ models, many of which are already available. For instance, we can easily find low-bit variants of Llama 3.3 70B, Llama 3.1 405B, and Qwen2.5 72B.
- Evaluate these models and discuss the results to understand when VPTQ models can be a good choice for LLMs in production.
Remarkably, 2-bit quantization with VPTQ comes close to matching the original 16-bit model on tasks such as MMLU. Moreover, it makes it possible to run Llama 3.1 405B on a single GPU, while using less memory than a 70B model!
All the steps to run and evaluate VPTQ models are explained in this article and implemented in this notebook:
Vector Post-Training Quantization
VPTQ is presented in this paper:
VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models
It leverages vector quantization (VQ), a technique that represents groups of weights as vectors instead of individual scalars.
The core idea is to reshape the weight matrix of the LLM into smaller vectors. Each vector is then mapped to the closest centroid from a predefined codebook. Note: A codebook is a set of learnable candidate vectors (centroids) that can be used to encode data.
This mapping minimizes the Euclidean distance between the centroid and the vector, replacing the vector with an index pointing to the centroid. This transformation allows significant compression while maintaining accuracy.
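To make this concrete, here is a minimal NumPy sketch of plain vector quantization. It is an illustration, not the actual VPTQ implementation: the weight matrix is reshaped into vectors of length v, and each vector is replaced by the index of its nearest centroid in a codebook.
import numpy as np

# Toy vector quantization (illustration only, not the VPTQ code)
v, k = 8, 256                          # vector length and number of centroids
W = np.random.randn(256, 256)          # a weight matrix to quantize
vectors = W.reshape(-1, v)             # group the weights into vectors of length v
codebook = np.random.randn(k, v)       # in VPTQ, centroids come from Hessian-weighted k-means

# Assign each vector to its nearest centroid (Euclidean distance)
distances = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
indices = distances.argmin(axis=1)     # only these indices and the codebook need to be stored

# Dequantization: simple lookup, then reshape back to the original weight matrix
W_hat = codebook[indices].reshape(W.shape)
print("Mean absolute reconstruction error:", np.abs(W - W_hat).mean())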

The quantization process is guided by a second-order optimization framework: rather than minimizing the raw quantization error, VPTQ minimizes the error weighted by its impact on the model's output, as captured by the Hessian matrix, which represents the second-order sensitivity of the loss to weight changes. This is similar to what GPTQ does.
Among the hyperparameters of VPTQ, we have:
- The vector length (v)
- The number of centroids (k)
v and k control the trade-off between accuracy and compression ratio. For instance, longer vectors reduce the number of centroids needed, improving memory efficiency, but may increase computational cost during dequantization. The size of the codebook is determined by k, where larger k provides a better representation of weight distributions but consumes more memory for storing centroids.
Residual vector quantization (RVQ) further refines the process. This multi-stage refinement uses a secondary codebook to minimize residual errors, enabling high accuracy with minimal bit overhead. Another refinement addresses outliers, which are rare but disproportionately large weights that can distort quantization accuracy. Outliers are treated separately using dedicated codebooks, minimizing their impact on the overall error.
The initialization of centroids is critical and is done using Hessian-weighted K-means clustering. This approach ensures that the centroids align well with the weight distribution, considering the relative importance of weights as indicated by the Hessian diagonal. This weighted initialization reduces the quantization error significantly compared to naive clustering.
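Below is a rough sketch of what Hessian-weighted k-means could look like, assuming the Hessian diagonal is available as a per-weight importance score. The function name and the details are mine, for illustration only, not the authors' code.
import numpy as np

def hessian_weighted_kmeans(vectors, importance, k, iters=20, seed=0):
    # Toy Hessian-weighted k-means (illustration only, not the VPTQ implementation).
    # vectors:    (n, v) weight vectors
    # importance: (n, v) per-weight importance, e.g. the Hessian diagonal
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), k, replace=False)].copy()
    for _ in range(iters):
        # Weighted squared distances: errors on important weights cost more
        d = ((vectors[:, None, :] - centroids[None, :, :]) ** 2 * importance[:, None, :]).sum(-1)
        assignments = d.argmin(axis=1)
        for c in range(k):
            mask = assignments == c
            if mask.any():
                w = importance[mask]
                # Weighted mean: centroids are pulled toward high-importance weights
                centroids[c] = (vectors[mask] * w).sum(axis=0) / w.sum(axis=0)
    return centroids, assignments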
During inference, the model reconstructs the weights by looking up centroids from the codebook based on indices, combining residual corrections if RVQ is used. This makes the inference process lightweight, as it only involves simple lookups and additions.
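Schematically, dequantizing a layer amounts to something like the following (hypothetical variable names, for illustration only):
import numpy as np

def dequantize(codebook, indices, shape, res_codebook=None, res_indices=None):
    # Illustration of VPTQ-style dequantization: table lookups plus an optional residual correction
    vectors = codebook[indices]                        # main centroid lookup
    if res_codebook is not None:
        vectors = vectors + res_codebook[res_indices]  # add the residual centroid (RVQ)
    return vectors.reshape(shape)                      # back to the original weight matrix shape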
Estimating the Average Bitwidth of a VPTQ Model
Microsoft proposes a method to estimate the bitwidth, i.e., the quantization precision, of a model quantized with VPTQ. Here is how they do it.
The model’s naming convention used by the VPTQ models released by Microsoft includes details about the vector length (v), the codebook size (k), and the residual codebook size. For example, the name "Meta-Llama-3.1-70B-Instruct-v8-k65536-256-woft" corresponds to the model "Meta-Llama-3.1-70B-Instruct" with the following parameters:
- Vector Length: v=8
- Number of Centroids: k=65536
- Number of Residual Centroids: k(res)=256
The equivalent bitwidth of the model can be calculated as follows:
Index Bitwidth:
Each vector is represented by the centroid index. For k=65536, the index bitwidth is log2(65536) = 16 bits. Dividing by the vector length (v=8) gives 16/8 = 2 bits per weight.
Residual Index Bitwidth:
For the residual centroids, with k(res) = 256, the index bitwidth is log2(256) = 8 bits. Dividing by the vector length (v=8) gives 8/8 = 1 bit per weight.
Total Bitwidth:
The combined bitwidth is: 2+1=3 bits per weight.
To estimate the model size, multiply the total number of parameters by the bitwidth and convert the result from bits to bytes (divide by 8).
Note: This estimation is fairly accurate but excludes the size of the codebook (lookup table), additional parameter overheads, and padding overhead for storing indices.
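The calculation above is easy to script. Here is a small helper that reproduces the estimate; the function is mine and not part of the VPTQ package, and the parameter count is approximate.
import math

def estimate_vptq_size(num_params, v, k, k_res=0):
    # Rough size estimate, ignoring codebooks, extra parameters, and padding overhead
    bits_per_weight = math.log2(k) / v
    if k_res > 0:
        bits_per_weight += math.log2(k_res) / v
    size_gb = num_params * bits_per_weight / 8 / 1e9   # bits -> bytes -> GB
    return bits_per_weight, size_gb

# Meta-Llama-3.1-70B-Instruct-v8-k65536-256-woft (~70.6B parameters)
bits, size = estimate_vptq_size(70.6e9, v=8, k=65536, k_res=256)
print(f"{bits:.1f} bits per weight, ~{size:.1f} GB")   # 3.0 bits per weight, ~26.5 GB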
Running VPTQ Models
VPTQ is integrated into Hugging Face Transformers. To run the models, you need to install the following:
pip install --upgrade transformers vptq
Transformers automatically detects that the model is a VPTQ model. We don’t have to do anything special to run the models. It is as simple as this:
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load VPTQ-quantized model directly from HuggingFace Hub
model = AutoModelForCausalLM.from_pretrained("VPTQ-community/Meta-Llama-3.1-8B-Instruct-v8-k65536-256-woft", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("VPTQ-community/Meta-Llama-3.1-8B-Instruct-v8-k65536-256-woft")
# Simple inference
prompt = "Explain: Do not go gentle into that good night."
output = model.generate(**tokenizer(prompt, return_tensors="pt").to(model.device), max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
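Once the model is loaded, you can also check how much GPU memory its weights take. This is a quick sanity check on the compression ratio, using Transformers' built-in footprint method:
# Approximate memory used by the quantized weights (excludes activations and KV cache)
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")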
Many models were released by the "VPTQ-community" on Hugging Face. In the following section, we will evaluate the following ones:
- VPTQ-community/Qwen2.5-72B-Instruct-v16-k65536-65536-woft
- VPTQ-community/Qwen2.5-72B-Instruct-v8-k65536-0-woft
- VPTQ-community/Qwen2.5-72B-Instruct-v16-k65536-32768-woft
- VPTQ-community/Meta-Llama-3.3-70B-Instruct-v8-k65536-0-woft
- VPTQ-community/Meta-Llama-3.3-70B-Instruct-v16-k65536-1024-woft
- VPTQ-community/Meta-Llama-3.1-405B-Instruct-v16-k65536-65536-woft
- VPTQ-community/Meta-Llama-3.1-405B-Instruct-v16-k32768-32768-woft
- VPTQ-community/Meta-Llama-3.1-405B-Instruct-v16-k65536-1024-woft
- VPTQ-community/Meta-Llama-3.1-405B-Instruct-v16-k65536-256-woft
- VPTQ-community/Meta-Llama-3.1-405B-Instruct-v16-k65536-64-woft
Note: These Qwen and Llama models are published under the Qwen and Llama licenses, respectively.
All these Llama 3.3 and Qwen2.5 models can run on a single 24 GB GPU! The Llama 3.1 405B models, while significantly smaller than the original, still require a lot of GPU memory. However, I could run all of them on a single H200 SXM provided by RunPod (referral link), so a single GPU is enough.
I haven’t tested it yet, but it should also be possible to fine-tune a LoRA adapter on top of VPTQ models, using the QLoRA method. At worst, adding VPTQ support to the Hugging Face PEFT library should only require a few lines of code.
Evaluating VPTQ Models
To evaluate how good a quantization algorithm is, I usually rely on MMLU. I compute the accuracy of the quantized model and compare it with the accuracy of the original model. If both accuracies are close, the quantization is good.
Note however that this is far from enough to assess a quantized model. Quantized models may be accurate for some tasks while being completely broken for others. For instance, with AutoRound quantization I have seen 2-bit models performing very well on MMLU while generating gibberish. That’s why I recommend also evaluating the model on some generative tasks, like IFEval or MATH, to make sure the model is not broken.
For this article, I evaluate all the models on MMLU, plus MMLU-PRO, MuSR, and GPQA for Llama 3.3 and Qwen2.5. I didn’t run generative benchmarks, and I evaluated Llama 3.1 405B only on MMLU because of the high cost of the evaluation. I plan to do it when VPTQ support is added to vLLM (which should be soon, as a PR is open).
I used the Evaluation Harness to run this evaluation.
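For reference, a command along these lines runs MMLU with the LM Evaluation Harness (lm-evaluation-harness), assuming the vptq package is installed so that Transformers can load the model; adjust the tasks, number of shots, and batch size to match your own protocol.
pip install lm-eval
lm_eval --model hf \
  --model_args pretrained=VPTQ-community/Meta-Llama-3.1-8B-Instruct-v8-k65536-256-woft \
  --tasks mmlu \
  --num_fewshot 5 \
  --batch_size 4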

These results are very impressive.
For reference, the best accuracy I could achieve with AutoRound 2-bit quantization, under the same evaluation settings, was 73.71, also with Qwen2.5 72B Instruct.
The Recipe for Extremely Accurate and Cheap Quantization of 70B+ LLMs
The performance of the 2-bit Qwen2.5 models is only 5 points below the original model. We are getting very close to having 2-bit models as good as their original 16-bit models.
AutoRound 2-bit quantization completely fails with Llama 3.3 when I try it. With VPTQ, it works very well, with an MMLU accuracy close to 75.0.
As expected, a larger model like Llama 3.1 405B is more robust to low-bit quantization. Even with an average bitwidth of 1.5, the model still achieves an accuracy close to 70.0, which is only slightly below what I get when I evaluate Qwen2.5 7B Instruct. In practice, however, this model is not a good choice, since it remains much larger than other models that perform much better.
Additionally, I ran MMLU-PRO, MuSR, and GPQA to evaluate Llama 3.3 and Qwen2.5:

All these scores are excellent for 2-bit models. For reference, the performance of the original (16-bit) Qwen2.5 72B Instruct on MMLU-PRO is 51.4 according to the OpenLLM Leaderboard. This is not far from the performance of the 2-bit VPTQ models. Note: The OpenLLM Leaderboard runs MMLU-PRO as a 5-shot task while I ran it as a 0-shot task, which disadvantages the VPTQ models here.
Conclusion
The accuracy of VPTQ is really impressive for low-bit quantization. Moreover, unlike previous algorithms which could completely break the models at low precision, VPTQ seems to be more robust. Finally, we have a 2-bit Llama 3.3 70B that doesn’t generate gibberish.
With low-bit quantization getting better, quantizing larger models might be preferable to using smaller 16-bit models. The main downside is that inference remains slower than with smaller models, since the quantized large models still have many more parameters.
In this article, I only explored the quantized models officially released by the authors of VPTQ. How easy is it to quantize your own model with VPTQ?
The algorithm seems as costly as AutoRound. In other words, it should be possible to quantize 70B models for less than $10 in the cloud. However, the quantization algorithm is not fully published in the GitHub repository. It seems that the authors want to polish it more. As far as I understand, it is almost entirely available in the "algorithm" branch of VPTQ:
The part computing the Hessian matrices is still missing (as of the 25th of January, 2025). We have to rely on pre-computed matrices or compute them with third-party software.
To support my work, consider subscribing to my newsletter for more articles/tutorials on recent advances in AI: