
Not so long ago, all IT news channels reported on the new open Mixtral 8x22B model, which outperforms ChatGPT 3.5 on benchmarks like MMLU (massive multitask language understanding) and WinoGrande (commonsense reasoning). This is a great achievement for the world of open models. Naturally, academic benchmarks are interesting, but how does this model work in practice? What are its system requirements, and is it really better than previous language models? In this article, I will test four different models (7B, 8x7B, 22B, and 8x22B, with and without a "Mixture of Experts" architecture), and we will see the results.
Let’s get started!
As a side note, I have no business relationship with Mistral AI, and all tests here were done on my own.
Sparse Mixture of Experts (SMoE)
Already at the beginning of the LLM era, it became clear that larger models are, generally speaking, smarter, have more knowledge, and can achieve better results. But larger models are also more computationally expensive, and nobody will wait five minutes for a chatbot's response. The intuitive idea behind the "Mixture of Experts" is simple: take several models and add a special routing layer that forwards different questions to different models:

As we can see, the idea is not new: a paper about MoE was published back in 2017, but it became important only 5–7 years later, when model use shifted from academic research to commercial applications. The MoE approach gives us a significant improvement: we can have a large language model that holds a lot of knowledge but, at the same time, runs about as fast as a smaller one. For example, a Mixtral 8x7B model has 47B parameters, but only 13B of them are active at any given time. A Mixtral 8x22B model has 141B parameters, but only 39B are active.
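To make the routing idea concrete, here is a toy NumPy sketch of sparse top-2 gating. It is only an illustration of the principle, not Mixtral's actual implementation (in Mixtral, such a router sits inside the feed-forward block of every transformer layer, and each "expert" is a feed-forward sub-network rather than a separate full model); all names and shapes here are my own, purely for illustration:
import numpy as np

def moe_layer(x, experts, gate_weights, top_k=2):
    """ A toy sparse MoE layer: route the input to the top-k experts only """
    logits = x @ gate_weights              # one gating score per expert
    top = np.argsort(logits)[-top_k:]      # indices of the k best-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()               # softmax over the selected experts only
    # Only the selected experts are evaluated; the others are skipped entirely
    return sum(w * experts[i](x) for i, w in zip(top, weights))

# Toy usage: 8 "experts", each a simple linear map over a 16-dimensional input
rng = np.random.default_rng(42)
hidden, n_experts = 16, 8
experts = [lambda v, W=rng.normal(size=(hidden, hidden)): v @ W for _ in range(n_experts)]
gate_weights = rng.normal(size=(hidden, n_experts))
print(moe_layer(rng.normal(size=hidden), experts, gate_weights).shape)  # (16,)
Only two of the eight expert functions are called per input, which is exactly why the number of "active" parameters is much smaller than the total.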
Now that we have a general idea, it's time to see how it works in practice. Here, I will test four models:
- A Mistral 7B model, which was released in October 2023.
- A Mixtral 8x7B, which was released in January 2024.
- A Mixtral 8x22B, which was released in April 2024.
- An "unofficial" Mistral 22B model, which was made by enthusiasts from an 8x22B model. I will use this model only to compare the speed and RAM requirements.
I will run all tests in Google Colab Pro, which, at the time of writing, costs me 11.2 euros per month. This subscription gives me 100 so-called "compute units" per month. For this test, I will use the "TPU v2" instance, which has 334 GB of RAM; it is the only VM available in Colab capable of running the 8x22B model. This VM costs 1.76 "compute units" per hour, so the monthly quota buys roughly 56 hours of runtime, which is more than enough for this kind of test. As a bonus, I will also test the 7B and 8x7B models on a 40 GB A100 GPU.
Code
Before testing the models, let’s prepare some code. First, we need to download all models:
!pip3 install huggingface-hub hf-transfer
# 7B Model
!export HF_HUB_ENABLE_HF_TRANSFER="1" && huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.1-GGUF mistral-7b-instruct-v0.1.Q4_K_M.gguf --local-dir /content --local-dir-use-symlinks False
# 8x7B Model
!export HF_HUB_ENABLE_HF_TRANSFER="1" && huggingface-cli download TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf --local-dir /content --local-dir-use-symlinks False
# 22B Model
!export HF_HUB_ENABLE_HF_TRANSFER="1" && huggingface-cli download bartowski/Mistral-22B-v0.2-GGUF Mistral-22B-v0.2-Q4_K_M.gguf --local-dir /content --local-dir-use-symlinks False
# 8x22B Model
!export HF_HUB_ENABLE_HF_TRANSFER="1" && huggingface-cli download MaziyarPanahi/Mixtral-8x22B-Instruct-v0.1-GGUF Mixtral-8x22B-Instruct-v0.1.Q4_K_M-00001-of-00002.gguf --local-dir /content --local-dir-use-symlinks False
!export HF_HUB_ENABLE_HF_TRANSFER="1" && huggingface-cli download MaziyarPanahi/Mixtral-8x22B-Instruct-v0.1-GGUF Mixtral-8x22B-Instruct-v0.1.Q4_K_M-00002-of-00002.gguf --local-dir /content --local-dir-use-symlinks False
The download is not fast and may take about an hour. The smallest 7B model is about 4 GB in size, and the largest 8x22B model is about 90 GB. Fortunately, these models are already converted into a 4-bit GGUF format; otherwise, the files would be roughly 4x bigger.
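A quick back-of-the-envelope estimate explains these file sizes. Assuming roughly 4.8 bits per weight for Q4_K_M (a commonly quoted average; the exact value varies by model) and the parameter counts mentioned above (about 7.2B, 47B, and 141B), the numbers line up with what we downloaded:
def gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """ Rough file size estimate in GB: parameters * bits per weight / 8 """
    return n_params * bits_per_weight / 8 / 1e9

for name, params in [("7B", 7.2e9), ("8x7B", 47e9), ("8x22B", 141e9)]:
    print(name, f"~{gguf_size_gb(params, 4.8):.0f} GB at 4-bit,",
          f"~{gguf_size_gb(params, 16):.0f} GB at 16-bit")
This gives about 4 GB for the 7B model and about 85 GB for the 8x22B model at 4-bit, versus roughly 14 GB and 282 GB at 16-bit.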
As a next step, let’s make a helper function to load the model. I will be using llama-cpp:
Python">from llama_cpp import Llama
def load_model(model_file: str):
""" Load the model from file """
return Llama(
model_path=model_file,
n_ctx=4096, # The max sequence length to use
use_mmap=False, # Don't use nmap (load full model in RAM)
)
Here, I set "use_mmap" to False. By default, llama-cpp uses a memory-mapped file to load a model. This is great and allows us to use models that are larger than the available RAM (as a fun test, I was even able to run a 70B model on my smartphone this way). But it is not good for benchmarking purposes: with a memory-mapped model, it is hard to know the real RAM use, and we would effectively be measuring the speed of the disk instead of the speed of the actual inference.
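As a small illustration of why this matters for measurements, here is a sketch (using the 7B file downloaded above) that loads the same model with mmap enabled and prints the process's resident memory; with a memory-mapped file, pages are pulled in from disk only when they are touched, so the reported figure stays misleadingly low until inference actually starts:
import psutil
from llama_cpp import Llama

proc = psutil.Process()
print("RSS before load:", proc.memory_info().rss)

# With use_mmap=True (the default), the file is mapped rather than fully read into RAM
llm_mmap = Llama(
    model_path="/content/mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    n_ctx=4096,
    use_mmap=True,
)
print("RSS after load:", proc.memory_info().rss)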
Now, let’s prepare a test. In theory, we could run a full GSM8K or TriviaQA benchmark using the lm-eval Python library, but those tests contain hundreds of questions. That would take hours and would not make much practical sense: the "official" benchmark results of all public models are already published online. Just to see how the models work, I created a small list of six questions and a helper method to run the test:
questions = [
    "Who was the next British Prime Minister after Arthur Balfour?",
    # A: Sir Henry Campbell-Bannerman
    "In which decade did stereo records first go on sale?",
    # A: 1950s
    "Which Lloyd Webber musical premiered in the US on 10th December 1993?",
    # A: Sunset Boulevard
    "James decides to run 3 sprints 3 times a week. He runs 60 meters each sprint. How many total meters does he run a week?",
    # A: 540 meters
    "In a dance class of 40 students, 20% enrolled in contemporary dance, 25% of the remaining enrolled in jazz dance, "
    "and the rest enrolled in hip-hop dance. What percentage of the entire students enrolled in hip-hop dance?",
    # A: 60%
    "John has 110 coins. There are 30 more gold coins than silver coins. How many gold coins does John have?",
    # A: 70 coins
]
import sys

def make_request(model: Llama, question: str) -> None:
    """ Make a request to a model and stream the answer to stdout """
    prompt = f"You are an expert. Make an answer to the following question: {question}"
    output = model(
        f"<s>[INST] {prompt} [/INST]",  # Prompt in the Mistral instruction format
        max_tokens=4096,   # Generate up to 4096 tokens
        temperature=0.0,   # We don't need randomness
        echo=False,        # Don't echo the prompt
        stream=True        # Use streaming
    )
    for chunk in output:
        print(chunk["choices"][0]["text"], end='')
        sys.stdout.flush()
    print("\n------")
Now, we are ready for the fun part!
Experiment
At this point, we have the code to load our models and to ask several questions. Let’s combine all the pieces:
import time
import psutil

ram_start = psutil.virtual_memory().used

model_file = "..."
llm = load_model(model_file)
# A warm-up question to initialize the model
make_request(llm, "Hello")

ram_end = psutil.virtual_memory().used
print("RAM:", ram_end - ram_start)

# Start the benchmark
t_start = time.monotonic()
for q_str in questions:
    make_request(llm, q_str)
t_diff = time.monotonic() - t_start
print(f"Total time: {t_diff:.2f}s ({t_diff/len(questions):.2f}s/request)")
Here, I also used the "psutil" library to get the memory consumption; the other parts of the code are self-explanatory.
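For reference, the model_file placeholder points at one of the GGUF files downloaded earlier into /content; for the 7B run, for example, it would look like this (the path matches the --local-dir used in the download commands):
model_file = "/content/mistral-7b-instruct-v0.1.Q4_K_M.gguf"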
Now, let’s run the tests.
Mistral 7B
Model’s RAM use: 4,930,879,488 bytes.
Total time: 133.79s (22.30s/request). According to the llama-cpp log, the average speed is about 7.87 tokens/s (as a reminder, the model runs on a CPU, which is not that fast).
Correct answers: 1: -, 2: +, 3: -, 4: -, 5: +, 6: +.
As we can see, the 7B model works, but it answered only half of the questions correctly. Still, the answers it produces are well structured; for example, this is the model's answer to the last question:
Let's use algebra to solve this problem:
Let G be the number of gold coins and S be the number of silver coins.
We know that the total number of coins is 110, so we can write our first equation as:
G + S = 110
We also know that there are 30 more gold coins than silver coins, so we can write our second equation as:
G = S + 30
Substitute this expression for G into the first equation:
(S + 30) + S = 110
Simplify and solve for S:
2S + 30 = 110
2S = 80
S = 40
Now that we know there are 40 silver coins, we can use the second equation to find the number of gold coins:
G = S + 30
G = 40 + 30
G = 70
So John has 70 gold coins.
It’s a good response for a model that is only 4 GB in size and can run locally on almost any hardware. But obviously, a model of that size has its limitations.
Mixtral 8x7B
Model’s RAM use: 26,997,256,192 bytes (5.5x compared to a 7B model).
Total time: 338.44s (56.41s/request). According to the llama-cpp log, the generation speed is about 5.05 tokens per second, which is only a bit slower than the 7B model. But the total processing time is almost twice as long. What is the reason? Honestly, I don't know; the number of generated tokens may differ between the models, so a better (more constrained) prompt may be required for a more accurate comparison.
Correct answers: 1: -, 2: +, 3: +, 4: +, 5: +, 6: +.
As we can see, the 8x7B model is clearly better than the smaller one.
"Mistral" (unofficial) 22B
Model’s RAM use: 14,307,201,024 bytes.
Correct answers: none.
The generation speed for this model is 3.1 tokens per second, but the total time is unknown because I stopped the test early. The model simply did not work properly; it produced hallucinated gibberish like this:
mermaid quadrantChart code syntax example. DONT USE QUOTO IN CODE.
title Reach and engagement of campaigns,
"Campaign: A": [0.3, 0.6]
"Campaign B": [0.7, 0.4]
## Product Goals
```python
As was mentioned at the beginning of the article, this model was used only to compare the RAM and processing speed.
Mixtral 8x22B
Model’s RAM use: 86,716,272,640 bytes.
Total time: 710.54s (118.42s/request). According to the log, the generation speed is 1.82 tokens per second.
Correct answers: 1: +, 2: +, 3: +, 4: +, 5: +, 6: +.
Indeed, the model did a good job, and all the answers are correct. But this has its cost: as we can see from the test, the 8x22B model was 5.3 times slower than the 7B model and 2.1 times slower than the 8x7B. As for RAM requirements, the 8x22B model needs 3.2x more RAM than the 8x7B model and 17.6x more than the 7B.
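As a quick sanity check, these ratios follow directly from the measurements above:
# Measured values from the runs above (6 questions each)
total_time_s = {"7B": 133.79, "8x7B": 338.44, "8x22B": 710.54}
ram_bytes = {"7B": 4_930_879_488, "8x7B": 26_997_256_192, "8x22B": 86_716_272_640}

for name in ("8x7B", "8x22B"):
    print(f"{name}: {total_time_s[name] / total_time_s['7B']:.1f}x slower, "
          f"{ram_bytes[name] / ram_bytes['7B']:.1f}x more RAM than 7B")
This prints about 2.5x/5.5x for the 8x7B model and about 5.3x/17.6x for the 8x22B model.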
Bonus
In the previous tests, I used CPU inference because it was the only option for running an 8x22B model in Google Colab. But as a small bonus for the readers, I can compare the 7B and 8x7B models on the 40 GB A100 GPU. This allows us to see a more realistic and "production-ready" performance comparison.
The machine with an A100 GPU is available in Google Colab, but it is several times more expensive (11.7 "compute units" per hour), and the number of these machines is limited. In practice, the A100 is rarely available, and in about 90% of my attempts, I only got the message "The selected GPU is not available." Anyway, after 5–10 retries during the day, I was able to connect.
The results look like this:
- For a 7B model, the total test time was 10.70s (1.78s/request).
- For an 8x7B model on the same hardware, the total time was 24.2s (4.03s/request).
As we can see, even on a top A100 GPU, the 8x7B model is about 2.3x slower than the 7B model. Interestingly, this speed ratio is roughly the same for both the CPU and GPU runs.
Conclusion
In this article, I tested an 8x22B Mixtral model, based on the Sparse Mixture of Experts (SMoE) architecture, and the results are interesting.
As we can see from the tests, SMoE improves RAM efficiency only slightly: the full model, with all of its "expert" sub-models, still needs to be loaded into RAM. In bar chart form, it looks like this:

The speed improvement is more significant: the 8x7B model is only 2.5 times slower than the 7B, and the 8x22B model is 5.3 times slower:

Anyway, SMoE is not a "magic bullet," and a larger model still needs more resources. But the fact that a 141B-parameter model is "only" 5.3x slower than a small 7B model is a nice improvement. Last but not least, being able to locally run a model that outperforms GPT 3.5 is just fun.
From a practical perspective, an 8x22B model is still "heavy" for many applications. At least in 2024, not that many people have access to machines with 90 GB of GPU RAM. According to the published benchmark results, the 8x7B model has a 78.4% TriviaQA (text-based question answering) score, and the 8x22B model has 82.2%. Does this roughly 4% gain make sense from a business perspective, considering the difference in hardware costs? It obviously depends on the use case; at least, it is good that there is a choice.
Thanks for reading. If you enjoyed this story, feel free to subscribe to Medium, and you will get notifications when my new articles are published, as well as full access to thousands of stories from other authors. You are also welcome to connect via LinkedIn. If you want to get the full source code for this and other posts, feel free to visit my Patreon page.
Those who are interested in using language models and natural language processing are also welcome to read other articles:
- GPT Model: How Does it Work?
- 16, 8, and 4-bit Floating Point Formats – How Does it Work?
- Process Pandas DataFrames with a Large Language Model
- A Weekend AI Project (Part 1): Running Speech Recognition and a LLaMA-2 GPT on a Raspberry Pi
- A Weekend AI Project (Part 2): Using Speech Recognition, PTT, and a Large Action Model on a Raspberry Pi
- A Weekend AI Project (Part 3): Making a Visual Assistant for People with Vision Impairments