
While Mixtral-8x7B is one of the best open Large Language Models (LLMs), it is also a huge model with 46.7B parameters. Even when quantized to 4-bit, the model can’t be fully loaded on a consumer GPU (e.g., an RTX 3090 with 24 GB of VRAM is not enough).
Mixtral-8x7B is a mixture of experts (MoE). It is made of 8 expert sub-networks of 6 billion parameters each.
Since only 2 of the 8 experts are active for each token during decoding, the 6 remaining experts can be moved, or offloaded, to another device, e.g., the CPU RAM, to free up some of the GPU VRAM. In practice, this offloading is complicated.
Choosing which experts to activate is a decision made at inference time, for each input token and at each layer of the model. Naively moving parts of the model to the CPU RAM, as with Accelerate’s device_map, creates a communication bottleneck between the CPU and the GPU.
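To make this decision concrete, here is a minimal sketch of top-2 gating in a MoE layer. It is illustrative only: "gate" and "experts" are placeholder modules, and the routing is simplified compared to Mixtral’s actual implementation.

import torch
import torch.nn.functional as F

def moe_layer(hidden, gate, experts, top_k=2):
    # hidden: (num_tokens, hidden_dim); gate: nn.Linear(hidden_dim, num_experts)
    logits = gate(hidden)                               # router scores, one per expert
    weights, ids = torch.topk(logits, k=top_k, dim=-1)  # keep the 2 best experts per token
    weights = F.softmax(weights, dim=-1)
    out = torch.zeros_like(hidden)
    for t in range(hidden.shape[0]):
        for w, e in zip(weights[t], ids[t]):
            # Only at this point do we know which expert weights must be on the GPU.
            out[t] += w * experts[e](hidden[t])
    return out

# Toy usage: 8 small experts, 5 tokens
gate = torch.nn.Linear(16, 8)
experts = torch.nn.ModuleList(torch.nn.Linear(16, 16) for _ in range(8))
print(moe_layer(torch.randn(5, 16), gate, experts).shape)  # torch.Size([5, 16])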
Mixtral-offloading (MIT license) is a project that proposes a much more efficient solution to reduce VRAM consumption while preserving a reasonable inference speed.
In this article, I explain how mixtral-offloading implements expert-aware quantization and expert offloading to save memory and maintain a good inference speed. Using this framework, we will see how to run Mixtral-8x7B on consumer hardware and benchmark its inference speed.
The tutorial section is also available as a notebook that you can find here:
Caching & Speculative Loading of Experts
MoE language models often allocate distinct experts to sub-tasks, but not consistently across long token sequences. Some experts are active in short 2–4 token sequences, while others have intermittent "gaps" in their activation. This is well illustrated by the following figure:

To capitalize on this pattern, the authors of mixtral-offloading suggest keeping active experts in GPU memory as a "cache" for future tokens. This ensures quick availability if the same experts are needed again. Since GPU memory limits the number of experts that can be stored, a simple Least Recently Used (LRU) cache is employed, keeping the k most recently used experts of each layer in GPU memory, with the same k for all layers.
Despite its simplicity, the LRU cache strategy significantly speeds up inference for MoE models like Mixtral-8x7B.
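The caching idea can be illustrated with a toy LRU cache. This is a hypothetical sketch, not mixtral-offloading’s actual implementation; "load_fn" and "evict_fn" stand in for the real routines that copy expert weights between CPU RAM and the GPU.

from collections import OrderedDict

class ExpertLRUCache:
    """Toy per-layer cache: keeps the k most recently used experts on the GPU."""
    def __init__(self, capacity):
        self.capacity = capacity      # k experts kept in GPU memory for this layer
        self.cache = OrderedDict()    # expert_id -> expert weights (on the GPU)

    def get(self, expert_id, load_fn, evict_fn):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)  # cache hit: mark as most recently used
            return self.cache[expert_id]
        if len(self.cache) >= self.capacity:
            lru_id, lru_expert = self.cache.popitem(last=False)  # evict the LRU expert
            evict_fn(lru_id, lru_expert)       # e.g., move its weights back to CPU RAM
        self.cache[expert_id] = load_fn(expert_id)  # cache miss: copy the expert to the GPU
        return self.cache[expert_id]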
However, while LRU caching improves the average expert loading time, a significant portion of inference time still involves waiting for the next expert to load. MoE offloading lacks effective overlap between expert loading and computation.
In standard (non-MoE) models, efficient offloading schedules pre-load the next layer while the previous one runs. This isn’t directly possible for MoE models, since experts are selected just in time for computation: the system can’t pre-fetch the next layer until it knows which experts to load. Despite the inability to reliably pre-fetch, the authors found that speculative loading can be used to guess the next experts while processing the previous layer, accelerating the next layer’s inference when the guess is correct.
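The speculative-loading idea can be sketched as follows. The names "gates", "expert_cache", and "prefetch" are hypothetical, and the real implementation overlaps the transfer with computation asynchronously; this only shows the logic of guessing the next layer’s experts with its own gating function.

import torch

def forward_with_speculative_loading(hidden, layers, gates, expert_cache, top_k=2):
    for i, layer in enumerate(layers):
        if i + 1 < len(layers):
            # Guess the next layer's experts by applying its gating function to the
            # current hidden states, then start copying them to the GPU in the background.
            with torch.no_grad():
                guessed = torch.topk(gates[i + 1](hidden), k=top_k, dim=-1).indices
            expert_cache.prefetch(layer_idx=i + 1, expert_ids=guessed.unique())
        # Run the current layer: its experts are either already cached (LRU) or were
        # prefetched during the previous layer; wrong guesses are loaded on demand.
        hidden = layer(hidden)
    return hidden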
To sum up, offloading the experts that are the least likely to be needed saves VRAM, while the LRU cache and speculative loading keep inference reasonably fast.
Expert-Aware Aggressive Quantization
In addition to expert offloading, we need to quantize the model to make it run on consumer hardware. Naive 4-bit quantization with bitsandbytes’ NF4 reduces the size of the model to 23.5 GB. That is still too much for a consumer-grade GPU with at most 24 GB of VRAM, once activations and the KV cache are accounted for.
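For reference, a naive NF4 load with Transformers and bitsandbytes looks like the following; device_map="auto" spills whatever doesn’t fit in VRAM to the CPU RAM, which is exactly the slow offloading discussed earlier.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    quantization_config=bnb_config,
    device_map="auto",  # layers that don't fit on the GPU are placed in CPU RAM
)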
Previous studies showed that the experts of an MoE can be aggressively quantized to lower precision without much impact on the model performance. However, the authors of mixtral-offloading mention in their technical report that they tried 1-bit quantization methods such as the ones proposed by QMoE but observed a significant drop in performance.
Instead, they applied mixed-precision quantization, keeping the non-expert parameters at 4-bit and quantizing the experts more aggressively.
Among the 46.7 billion parameters in the Mixtral-8x7B, 96.6% (45.1 billion) are for the experts, while the remainder is allocated to token embeddings, self-attention layers, MoE gates, and other minor layers such as LayerNorm.
For quantization, they chose to apply Half Quadratic Quantization (HQQ) (Badri & Shaji, 2023), a data-free algorithm accommodating various bit rates.
They have tried various quantization configurations: FP16 (no quantization), HQQ 4-bit with group size 64 and scale group size 256, HQQ 3-bit with group size 64 and scale group size 128, and HQQ 2-bit with group size 16 and scale group size 128.
As shown in the following table, it appears advantageous to quantize experts to 3 or 2 bits while maintaining attention layers at a higher bitwidth (16 or 4 bits).

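A quick back-of-the-envelope estimate (my own rough numbers, ignoring quantization metadata such as scales and zero-points) shows why 2-bit experts combined with 4-bit attention fit on a 16 GB GPU:

expert_params = 45.1e9                    # ~96.6% of the 46.7B parameters
other_params = 46.7e9 - expert_params     # embeddings, attention, gates, etc.

experts_gb = expert_params * 2 / 8 / 1e9  # experts quantized to 2-bit
others_gb = other_params * 4 / 8 / 1e9    # the rest quantized to 4-bit
print(f"{experts_gb:.1f} GB + {others_gb:.1f} GB = ~{experts_gb + others_gb:.1f} GB of weights")
# Roughly 11.3 GB + 0.8 GB, i.e. about 12 GB before activations, KV cache, and buffers.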
After applying quantization and expert offloading, inference is between 2 and 3 times faster than with the offloading implemented by Accelerate (device_map):

Running Mixtral-8x7B with 16 GB of GPU VRAM
For this tutorial, I used the T4 GPU of Google Colab, which is old and has only 15 GB of VRAM available. It’s a good baseline configuration to test the generation speed with offloaded experts.
First, we need to install mixtral-offloading and all its requirements:
git clone https://github.com/dvmazur/mixtral-offloading.git --quiet
cd mixtral-offloading && pip install -q -r requirements.txt
Once installed, we need to set the CUDA library paths and rerun ldconfig:
export LD_LIBRARY_PATH="/usr/lib64-nvidia"
export LIBRARY_PATH="/usr/local/cuda/lib64/stubs"
ldconfig /usr/lib64-nvidia
Since quantization is very costly, we are going to use the model already quantized by mixtral-offloading’s authors. Note that the model can also be quantized with AWQ or GPTQ.
Moreover, mixtral-offloading doesn’t yet support loading the model directly from the Hugging Face Hub; the model needs to be saved locally. This is the model lavawolfiee/Mixtral-8x7B-Instruct-v0.1-offloading-demo (MIT license), and we are going to download it with huggingface-cli.
huggingface-cli download lavawolfiee/Mixtral-8x7B-Instruct-v0.1-offloading-demo --quiet --local-dir Mixtral-8x7B-Instruct-v0.1-offloading-demo
Then, import the following:
import sys
sys.path.append("mixtral-offloading")
import torch
from hqq.core.quantize import BaseQuantizeConfig
from transformers import AutoConfig, AutoTokenizer
from src.build_model import OffloadConfig, QuantConfig, build_model
Set up the model name and get the configuration of the quantized model:
model_name = "mistralai/Mixtral-8x7B-Instruct-v0.1"
quantized_model_name = "lavawolfiee/Mixtral-8x7B-Instruct-v0.1-offloading-demo"
state_path = "Mixtral-8x7B-Instruct-v0.1-offloading-demo"
config = AutoConfig.from_pretrained(quantized_model_name)
device = torch.device("cuda:0")
num_experts = config.num_local_experts
The variable that will have the most influence on the decoding performance is the number of experts you offload to the CPU:
offload_per_layer = 3
If you lower this number, decoding will be faster but will also consume more VRAM. "3" works fine for a GPU with 16 GB of VRAM.
For the offloading configuration, I used the default one suggested by mixtral-offloading:
offload_config = OffloadConfig(
    main_size=config.num_hidden_layers * (num_experts - offload_per_layer),
    offload_size=config.num_hidden_layers * offload_per_layer,
    buffer_size=4,
    offload_per_layer=offload_per_layer,
)
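With Mixtral-8x7B’s 32 layers and 8 experts per layer, offloading 3 experts per layer means 32 × (8 - 3) = 160 experts stay in GPU memory while 32 × 3 = 96 experts are kept in CPU RAM.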
I also didn’t change the quantization configuration since I used the same model quantized by the authors.
Note that the model is quantized with a mixed-precision, as discussed in the previous section.
The attention modules are quantized to a higher precision of 4-bit since they are shared across all tokens, whichever experts are activated:
attn_config = BaseQuantizeConfig(
    nbits=4,
    group_size=64,
    quant_zero=True,
    quant_scale=True,
)
attn_config["scale_quant_params"]["group_size"] = 256
On the other hand, the experts are quantized to 2-bit:
ffn_config = BaseQuantizeConfig(
    nbits=2,
    group_size=16,
    quant_zero=True,
    quant_scale=True,
)
quant_config = QuantConfig(ffn_config=ffn_config, attn_config=attn_config)
Then, finally, we can build the model with mixtral-offloading given the quantization and offloading configurations:
model = build_model(
    device=device,
    quant_config=quant_config,
    offload_config=offload_config,
    state_path=state_path,
)
From this point, "model" can be used with Hugging Face Transformers. It has the same "generate" function for inference.
I tried it with 4 different prompts to benchmark the inference speed with the following code:
import time
tokenizer = AutoTokenizer.from_pretrained(model_name)
duration = 0.0
total_length = 0
prompt = []
prompt.append("Write the recipe for a chicken curry with coconut milk.")
prompt.append("Translate into French the following sentence: I love bread and cheese!")
prompt.append("Cite 20 famous people.")
prompt.append("Where is the moon right now?")
for i in range(len(prompt)):
    model_inputs = tokenizer(prompt[i], return_tensors="pt").to("cuda:0")
    start_time = time.time()
    with torch.autocast(model.device.type, dtype=torch.float16, enabled=True):
        output = model.generate(**model_inputs, max_length=500)[0]
    duration += float(time.time() - start_time)
    total_length += len(output)
    # Tokens per second for this prompt (the output length includes the prompt tokens)
    tok_sec_prompt = round(len(output) / float(time.time() - start_time), 3)
    print("Prompt --- %s tokens/second ---" % (tok_sec_prompt))
    print(tokenizer.decode(output, skip_special_tokens=True))

tok_sec = round(total_length / duration, 3)
print("Average --- %s tokens/second ---" % (tok_sec))
Note: I won’t discuss the outputs generated by the model since it’s out of the scope of this article. If you want to see them, they are in the notebook.
On average, it consumes 13 GB of VRAM and generates 1.7 tokens/second. In my opinion, this is quite fast for the T4 GPU. If you offload 4 experts per layer, instead of 3, the VRAM consumption decreases to 11.7 GB and the inference speed to 1.4 tokens/second.
I also ran the benchmark on an A100 GPU to measure the inference speed with a faster GPU. Note that the A100 could have loaded the entire quantized model, but I kept 3 experts offloaded per layer to compare with the speed obtained on the T4. On average, with the A100, the model generates 2.6 tokens/second. This is also the speed you can hope to achieve with a recent high-end consumer GPU, e.g., an RTX 4080.
Conclusion
mixtral-offloading is a young project but it’s already working very well. It combines two ideas to significantly reduce memory usage while preserving inference speed: mixed-precision quantization and expert offloading.
While 1.4 tokens per second might seem slow, keep in mind that I obtained this speed with an old T4 GPU. With a recent RTX GPU, such as an NVIDIA RTX 4060 Ti, which also comes in a 16 GB version, you can expect a speed much closer to the 2.6 tokens/second I obtained with the A100.
Following the success of Mixtral-8x7b, I expect MoE models to become more popular in the future. Frameworks optimizing inference for consumer hardware like mixtral-offloading will be essential to make MoEs more accessible.
To support my work, consider subscribing to my newsletter: