
Running large language models (LLMs) on consumer hardware can be challenging. If the LLM doesn’t fit in GPU memory, quantization is usually applied to reduce its size. However, even after quantization, the model might still be too large to fit on the GPU. An alternative is to run it from CPU RAM, using a framework optimized for CPU inference such as llama.cpp.
Intel is also working on accelerating inference on the CPU. It proposes Intel Extension for Transformers, a framework built on top of Hugging Face Transformers that makes it easy to exploit the CPU.
With Neural Speed (Apache 2.0 license), which is used through Intel Extension for Transformers, Intel further accelerates inference for 4-bit LLMs on CPUs. According to Intel, this framework can make inference up to 40x faster than llama.cpp.
In this article, I review the main optimizations Neural Speed brings. I show how to use it and benchmark the inference throughput. I also compare it with llama.cpp.
Neural Speed’s Inference Optimizations for 4-bit LLMs
At NeurIPS 2023, Intel presented the main optimizations for inference on CPUs:
Efficient LLM Inference on CPUs
In the following figure, the components in green are the main additions brought by Neural Speed for efficient inference:

The CPU tensor library provides several kernels optimized for inference with 4-bit models. They support x86 CPUs, including AMD CPUs.
These kernels are optimized for models quantized with the INT4 data type. GPTQ, AWQ, and GGUF models are supported and accelerated by Neural Speed. Intel also has its own quantization library, Neural Compressor, which is called if the model is not already quantized.
As for the "LLM Optimizations" highlighted in the figure above, Intel doesn’t write much about them in the NeurIPS paper.
They only mention and illustrate the preallocation of memory for the KV cache, implying that current frameworks don’t do this preallocation.
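To illustrate the idea, here is a minimal sketch of KV-cache preallocation (my own illustration, not Intel’s implementation): the cache is allocated once for the maximum sequence length and filled in place at each decoding step, instead of concatenating new key/value tensors at every step.

import torch

# Illustration only: preallocate the KV cache for one layer.
# Shapes: (batch, n_heads, max_seq_len, head_dim) -- values chosen for a 7B-like model.
batch, n_heads, max_seq_len, head_dim = 1, 32, 2048, 128
k_cache = torch.empty(batch, n_heads, max_seq_len, head_dim)
v_cache = torch.empty(batch, n_heads, max_seq_len, head_dim)

def append_kv(step, k_new, v_new):
    # Write the new key/value in place instead of concatenating tensors,
    # which would reallocate and copy the whole cache at every step.
    k_cache[:, :, step, :] = k_new
    v_cache[:, :, step, :] = v_new

for step in range(2):  # simulate two decoding steps
    append_kv(step, torch.randn(batch, n_heads, head_dim), torch.randn(batch, n_heads, head_dim))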

The paper shows one benchmark related to inference efficiency. They measured the latency of next-token prediction and compared it with that of ggml-based models (i.e., models quantized by llama.cpp in the GGUF format).

They don’t mention the hardware configuration used to obtain these results.
According to these experiments, Neural Speed’s latency for next-token prediction is 1.6x lower than llama.cpp’s.
Accelerate Inference with Neural Speed
Note: All the steps described in this section are also implemented in this notebook: Get the notebook (#60)
Neural Speed is available here:
- intel/neural-speed (Apache 2.0 license)
We can install it with pip along with all the other libraries that we will need:
pip install neural-speed intel-extension-for-transformers accelerate datasets
Neural Speed is called directly by Intel Extension for Transformers, so we don’t need to import it ourselves; we only need to import the extension. Instead of importing from "transformers", we import from "intel_extension_for_transformers.transformers":
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
Then, load the model in 4-bit:
# Hugging Face model ID of the model to quantize (the Mayonnaise repository used later in this article)
model_name = "kaitchup/Mayonnaise-4in1-02"
model = AutoModelForCausalLM.from_pretrained(
    model_name, load_in_4bit=True
)
Intel Extension for Transformers will quantize and serialize the model in 4-bit. This can take time: using the 16 CPUs of Google Colab’s L4 instance, it took approximately 12 minutes for a 7B model.
I benchmarked the inference throughput of this model, still using the L4 instance of Google Colab. As the LLM, I used my Mayonnaise (Apache 2.0 license), a model based on Mistral 7B.
The Mayonnaise: Rank First on the Open LLM Leaderboard with TIES-Merging
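Concretely, such a throughput number can be obtained with a timed generate call. Below is a minimal sketch; the prompt and the number of new tokens are placeholders I chose, not the exact benchmark configuration:

import time
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("Tell me about gravity.", return_tensors="pt").input_ids  # placeholder prompt

max_new_tokens = 256
start = time.time()
outputs = model.generate(inputs, max_new_tokens=max_new_tokens)
elapsed = time.time() - start

# Approximate throughput; assumes generation does not stop early at an EOS token.
print(f"{max_new_tokens / elapsed:.1f} tokens/second")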
Using Neural Speed and Intel’s quantization, it yields, on average:
32.5 tokens per second
It is fast, indeed. But is it faster than llama.cpp?
I quantized the model with llama.cpp to 4-bit (type 0). The GGUF file is in the same repository:
Then, I ran inference with this model using llama.cpp. It yields:
9.8 tokens per second
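For reference, a comparable llama.cpp run can be reproduced with the llama-cpp-python bindings (my substitution for convenience; the number above was measured with llama.cpp itself):

from huggingface_hub import hf_hub_download
from llama_cpp import Llama  # pip install llama-cpp-python

# Repository and file name as used elsewhere in this article.
gguf_path = hf_hub_download("kaitchup/Mayonnaise-4in1-02", "Q4_0.gguf")

llm = Llama(model_path=gguf_path, n_threads=16)  # 16 threads, matching the L4 instance's CPUs
output = llm("Tell me about gravity.", max_tokens=256)  # placeholder prompt
print(output["choices"][0]["text"])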
llama.cpp is thus much slower than Neural Speed, but we are far from the "up to 40x faster than llama.cpp" mentioned on Neural Speed’s GitHub page. Note that, for both frameworks, I didn’t try different hyperparameters or hardware.
My understanding is that the speedup depends on the number of CPU cores: the gap between llama.cpp and Neural Speed should widen as the number of cores increases.
In their blog post, Intel reports experiments with an "Intel® Xeon® Platinum 8480+ system; The system details: 3.8GHz, 56 cores/socket, HT On, Turbo On" and an "Intel® Core™ i9-12900; The system details: 2.4GHz, 24 cores/socket, HT On, Turbo On". I found that my results are closer to the second set of experiments, which shows inference 3.1x faster than with llama.cpp.
Neural Speed also supports the GGUF format. I ran the benchmark once more using the GGUF version of my Mayonnaise, loaded as follows:
model_name = "kaitchup/Mayonnaise-4in1-02"
model = AutoModelForCausalLM.from_pretrained(
    model_name, model_file="Q4_0.gguf"
)
I found it to be about 36% faster than using the quantized model created by Neural Speed:
44.2 tokens per second
4.5x faster than llama.cpp.
Conclusion
Intel Neural Speed is very fast, and certainly faster than llama.cpp. While I couldn’t confirm the claim that it is "up to 40x" faster than llama.cpp, I think a 4.5x speedup is already convincing enough.
Should you use the CPU or the GPU for inference?
The performance of inference on CPUs is improving. When processing individual instances (batch size of one), the speed difference between CPUs and GPUs is marginal. However, for batch sizes of two or more, GPUs typically deliver significantly faster performance, provided there is sufficient memory available. This speed advantage is largely due to GPUs’ superior parallel processing capabilities.
To support my work, consider subscribing to my newsletter for more articles/tutorials on recent advances in AI: