
Nowadays, we can observe an interesting twist in developing new AI models. For a long time, it has been known that bigger models are "smarter" and capable of doing more complex things. But they are also more computationally expensive. Big device manufacturers like Microsoft, Google, and Samsung have already started to promote new AI features to their clients, but it is clear that if millions of users massively use AI on their phones or laptops, the computational cloud costs could be enormous. What is the solution? The obvious way is to run a model on-device, which has advantages in latency (no network connection is required, and the model can be accessed immediately), privacy (no need to process user responses in the cloud), and, naturally, computation costs. Using local AI models is important not only for laptops and smartphones but also for autonomous robots, smart home assistants, and other edge devices.
At the time of making this article, at least two models were announced that were specially designed for on-device running:
- Google’s Gemini Nano. The model was announced in December 2023; it has two versions with 1.8B and 3.25B parameters. According to the developer.android.com webpage, the model will be a part of the Android OS and will be available via the AI Edge SDK. However, this model is not open and probably will not be accessible on platforms like HuggingFace.
- Microsoft’s Phi-3. The model was released in April 2024. It is a 3.8B model that is available in two context-length variants, with 4K and 128K tokens (according to Microsoft, 7B and 14B models will also be available soon). The model was optimized for NVIDIA and ONNX runtime, and it can also run on a CPU. Last but not least, the Phi-3 model is open and can be downloaded.
At the time of writing this text, Google’s Gemini Nano is in the "early access preview" state and is not available for public testing. Microsoft’s Phi-3 is available on HuggingFace, and we can easily use it. As a baseline, I will use an 8B Llama-3 model, which is the newest model from Meta, also released in 2024.
Methodology
I will test 3.8B and 8B language models using different prompts with increasing complexity, from "easy" to "hard":
- Simple prompt: answering a user’s simple question.
- Text processing: summarization and making an answer for incoming messages.
- Tools and agents: answering questions that require external tools.
To test the models, I will use an open-source LlamaCpp library and an open-source GenAI ONNX library from Microsoft. I will test both models on my desktop PC and Raspberry Pi, and we will be able to compare their performance and system requirements.
Let’s get started!
1. Install
1.1 Raspberry PiThe goal of this article is to test the model’s performance on edge devices, and I will use a Raspberry Pi for that:

The Raspberry Pi is a cheap (about $100) credit card-size single-board ARM-based computer running 64-bit Linux. It has no moving parts, requires only 5V DC power, and has plenty of hardware interfaces (GPIO, Serial, I2C, SPI, HDMI), which makes the Raspberry Pi interesting for robots or smart home devices. But how well can it work with small language models? Let’s figure it out.
Raspberry Pi has its own Debian-based OS made by the Raspberry Pi Foundation, which can be good for basic scenarios and home use, but I’ve found that the newest libraries and software packages are tricky to install. I tried to install the ONNX GenAI runtime on the Raspberry Pi OS, but the installation failed. The ONNX GenAI is a new project, and it has a lot of dependencies that do not work "out of the box." In theory, it is possible to find a way to build the newest CMake and GCC with C++20 support from the source, but in my case, it just was not worth the time. So, I decided to use the latest Ubuntu OS, which has better software support and fewer compatibility problems. Ubuntu has official support for the Raspberry Pi as well, so the installation works smoothly:

The code presented in this article is cross-platform, and readers who don’t have a Raspberry Pi can also test the Phi-3 and Llama-3 models on Windows, OSX, or other Linux environments.
1.2 LlamaCppWe can use Phi-3 and Llama-3 models with an open-source LlamaCpp-Python library. LlamaCpp is written in pure C/C++ without any dependencies, and it works on all modern architectures, including CPU, CUDA, and Apple Silicon. We can easily build it for the Raspberry Pi:
CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip3 install llama-cpp-python
When the installation is done, we also need to download both models:
pip3 install huggingface-hub
huggingface-cli download microsoft/Phi-3-mini-4k-instruct-gguf Phi-3-mini-4k-instruct-q4.gguf --local-dir . --local-dir-use-symlinks False
huggingface-cli download QuantFactory/Meta-Llama-3-8B-Instruct-GGUF Meta-Llama-3-8B-Instruct.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
1.3 ONNX Generative AIAnother way to use a Phi-3 model is to use Microsoft’s open-source GenAI ONNX library. ONNX (Open Neural Network Exchange) is an open format designed to represent machine learning models. Microsoft has a well-written tutorial about using Phi-3 with ONNX. Alas, on a Raspberry Pi, it does not work. Pip cannot find a proper installer for the ARM64 onnxruntime-genai
package, and we need to build it from the source. Before compiling the onnxruntime-genai, we need to install the onnxruntime package and copy its library files to the source folder:
pip3 install onnxruntime numpy
wget https://github.com/microsoft/onnxruntime/releases/download/v1.17.3/onnxruntime-linux-aarch64-1.17.3.tgz
tar -xvzf onnxruntime-linux-aarch64-1.17.3.tgz
git clone https://github.com/microsoft/onnxruntime-genai.git --branch v0.2.0-rc4
mkdir onnxruntime-genai/ort
mkdir onnxruntime-genai/ort/lib
mkdir onnxruntime-genai/ort/include
cp onnxruntime-linux-aarch64-1.17.3/lib/* onnxruntime-genai/ort/lib
cp onnxruntime-linux-aarch64-1.17.3/include/* onnxruntime-genai/ort/include
cd onnxruntime-genai
python3 build.py
When the compilation is done, we can install a new wheel using pip:
pip3 install build/wheel/onnxruntime_genai-0.2.0rc4-cp312-cp312-linux_aarch64.whl
As a last step, we need to download the Phi-3 ONNX model:
huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx --include cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/* --local-dir .
Now, all components are installed, and we are ready for testing.
2. Inference
As was written before, I will be using two libraries for running out models, LlamaCpp and ONNX – let’s create Python methods for it.
Let’s start with LlamaCpp:
from llama_cpp import Llama
def load_llama_model(path: str) -> Llama:
""" Load LlamaCpp model from file """
return Llama(
model_path=path,
n_gpu_layers=0,
n_ctx=4096,
use_mmap=False,
echo=False
)
When the model is loaded, we can run the generation stream:
def llama_inference(model: Llama, prompt: str) -> str:
""" Call a model with a prompt """
stream = model(prompt, stream=True, max_tokens=4096, temperature=0.2)
result = ""
for output in stream:
print(output['choices'][0]['text'], end="")
result += output['choices'][0]['text']
print()
return result
As for ONNX, the process is generally the same, though the code is slightly bigger:
import onnxruntime_genai as og
def load_onnx_model(path: str):
""" Load the ONNX model """
return og.Model(path)
def onnx_inference(model: og.Model, prompt: str) -> str:
""" Run the ONNX model with a prompt """
tokenizer = og.Tokenizer(model)
params = og.GeneratorParams(model)
params.try_use_cuda_graph_with_max_batch_size(1)
search_options = {"temperature": 0.2, "max_length": 4096}
params.set_search_options(**search_options)
params.input_ids = tokenizer.encode(prompt)
generator = og.Generator(model, params)
result = ""
tokenizer_stream = tokenizer.create_stream()
while not generator.is_done():
generator.compute_logits()
generator.generate_next_token()
new_token = generator.get_next_tokens()[0]
new_char = tokenizer_stream.decode(new_token)
print(new_char, end='', flush=True)
result += new_char
print()
del generator
return result
3. Testing The Prompts
Now, let’s see how the Phi-3 model works. As a baseline, I will compare it to the Llama-3 8B model, which is the newest small-size model at the time of writing this article.
Before using the models, let’s create helper functions for making a prompt in the required formats:
def make_phi3_prompt(question: str) -> str:
""" Make prompt in Phi-3 format """
return f"<|user|>n{question} <|end|>n<|assistant|>"
def make_llama_prompt(system: str, question: str) -> str:
""" Make prompt in Llama-3 format """
return f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>
{system}<|eot_id|><|start_header_id|>user<|end_header_id|>
{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>"""
Now, let’s test different prompts with the Phi and Llama models.
3.1 Basic PromptAs a warm-up, let’s ask a simple question:
#> Phi-3:
prompt = make_phi3_prompt("You are a helpful AI assistant.n"
"What is the distance to the Moon?")
onnx_inference(model, prompt)
#> The average distance from Earth to the Moon is about 238,900 miles
#> (384,400 kilometers). However, this distance varies due to the
#> elliptical shape of the orbit.
#> Llama-3
prompt = make_llama_prompt("You are a helpful AI assistant.",
"What is the distance to the Moon?")
llama_inference(model, prompt)
#> The average distance from the Earth to the Moon is about 384,400
#> kilometers (238,900 miles). This is called the "lunar distance" or "lunar
#> mean distance." However, the orbit is not a perfect circle and its
#> distance from Earth varies slightly due to the elliptical shape of its
#> orbit. At its closest point (called perigee), the Moon is about 356,400
#> kilometers (221,500 miles) away, and at its farthest point (apogee),
#> it is about 405,500 kilometers (252,000 miles) away.
Both models made correct answers, which was expected, there was nothing complex in it.
3.2 Answering The MessageAs we can see, basic prompts work well. However, mobile users will likely not ask about the distance to the Moon too often 🙂 Let’s make a more realistic test and ask both models **** to answer a message.
It was straightforward to find a proper example. I opened LinkedIn and easily found spam mail. I saved it as a Python string (here, a person’s name changed):
text = """I'm reaching out about some exciting roles we have for "Blockchain Developer"
working for a high-growth large global firm in the US.
They are open to either full-time Remote or and look at people at
various levels. As such, attractive pay will be on offer.
Required skills and experience:
- 3+ years of TypeScript Programming language, paradigms, constructs, and idioms,
- Experience with Ethereum blockchain,
- Able to write efficient code and have good debugging skills.
- Experience in writing unit and integration tests.
- Experience in working and contributing to opensource projects
Looking forward to hearing from you.
Alex,
Technical Recruiter"""
Apparently, some recruiters never read people’s profiles and send broadcast messages, relying on pure luck. I don’t live in the US; I have never used Ethereum; I never worked as a TypeScript developer, and this language is not listed in my skills list. Without any use of ML, I can easily treat it as spam.
Let’s ask both models to write a polite answer:
question = f"""You are a helpful AI assistant.
My name is Dmitrii, and I got the following mail.
----
{text}
----
Please write a polite answer that I am not interested in it."""
The answers from both models were surprisingly good:
#> Phi-3:
Dear Mr. Alex,
I hope this message finds you well. Thank you for reaching out and sharing
the exciting opportunity of a "Blockchain Developer" role within your esteemed
organization. I appreciate the time taken to present such an intriguing
position.
However, after careful consideration, I must inform you that at this moment
in my career journey, I am not seeking new opportunities or exploring
different roles. Therefore, I would like to express my gratitude for
considering me as a potential candidate but kindly decline the offer due
to my current focus and commitments.
I wish your team continued success in their endeavors and hope that you find
an exceptional individual who meets all the requirements of this role. Should
there be any future opportunities, I will certainly keep them in mind.
Thank you once again for considering me as a potential candidate. Wishing you
and your organization every success moving forward.
Best regards,
Dmitrii
#> Llama-3
Here is a polite response:
Dear Alex,
Thank you for reaching out about the "Blockchain Developer" role at your
company. I appreciate the opportunity to consider this position.
After reviewing the job description, I regret to inform you that I am not a
good fit for this role. While I have some experience with blockchain
technology and programming languages like TypeScript, my skills don't align
with the specific requirements mentioned in the email.
I appreciate the attractive pay offer, but I've decided to focus on other
opportunities that better match my current skill set and career goals.
Thank you again for thinking of me for this role. I wish your company all the
best in finding a suitable candidate.
Best regards,
Dmitrii
As we can see, both models correctly determined the sender and receiver names and provided, in general, a good answer. A Llama-3 model slightly "hallucinated" and wrote about my Typescript and blockchain experience which I don’t have, so the Phi-3 answer was actually more accurate. And subjectively, Phi-3’s style with its sentences like "I wish your team continued success in their endeavors" was more interesting to read.
3.3 Using ToolsNow, let’s try more complex things – how about using the tools? It’s a powerful approach that allows the model to find a proper "instrument" for performing a task. But it also requires a model to obey strict syntax rules, which can be challenging for small 3.8B and 8B models.
Let’s say we have a smart home assistant that can perform three tasks:
- I can ask an assistant about the current weather. In this case, a model should use the online forecast tool.
- I can ask the assistant to switch on or off the light; in this case, a model can trigger a specific action, like sending a signal to the specific pin (we are using a Raspberry Pi that has a GPIO port for connecting external hardware).
- Last but not least, we can ask some generic questions, like the previously mentioned distance to the Moon. In this case, the model will use its knowledge as a "knowledge base."
Firstly, I tried to use tools in LangChain, but it did not work. Apparently, LangChain is mostly tested with the OpenAI API, and I did not find an easy way to customize the prompts for Llama-3 or Phi-3 models without going very deep into the LangChain code. Using a LangChain prompt as a reference, I created my own prompt with the tools I needed:
system = """You are a helpful AI assistant.
You have access to the following tools:
"Weather Forecast": use this tool only to get the weather in a specific city. Use a `value` key for a city.
"Light On": use this tool to turn the light ON in a specific place. Use a `value` key for a place.
"Light Off": use this tool to turn the light OFF in a specific place. Use a `value` key for a place.
"Knowledge Base": use this tool to write an answer. Write an answer in text format in a `value` key.
The way you use the tools is by specifying a json blob.
Specifically, this json should have an `action` key with the name of the tool to use and a `value` key with the input to the tool going here.
The only values that should be in the "action" field are:
"Weather Forecast", "Light On", "Light Off", "Knowledge Base".
The $JSON_BLOB should only contain a SINGLE action, do NOT return a list of multiple actions.
Here is an example of a valid $JSON_BLOB:
{
"action": "Weather Forecast",
"value": "Paris",
}
```
You should always think about what to do step by step. First, find an appropriate tool, then use it to find an answer.
Begin! ALWAYS use $JSON_BLOB format when responding.
"""
Now, let's try this prompt with our models.
A **Llama-8B** produced the correct answer in 1 of 3 cases:
```python
question = "Human: Please turn on the light in the kitchen."
make_llama_prompt(system, question)
#> Here is a JSON blob that will accomplish this task:
#> {
#> "action": "Light On",
#> "value": "kitchen"
#> }
question = "Human: What is the distance to the Moon?"
make_llama_prompt(system, question)
#> I'd be happy to help you with that!
#>
#> To provide an accurate answer, I'll use my "Knowledge Base" tool.
#>
#> Here's the JSON blob:
#> ```
#> {
#> "action": "Knowledge Base",
#> "value": "The average distance from Earth to the Moon is approximately 384,400 kilometers (238,900 miles)."
#> }
#> ```
question = "Human: What is the weather in London?"
make_llama_prompt(system, question)
#> I'd be happy to help you with that!
#>
#> Here's a JSON blob for the "Weather Forecast" tool:
#>
#> ```
#> {
#> "action": "Weather Forecast",
#> "value": "London"
#> }
#> ```
#>
#> Using this tool, I've retrieved the current weather forecast for London. According to my knowledge base, as of now, the weather in London is mostly cloudy with a high temperature of 12°C (54°F) and a low of 9°C (48°F). There's a moderate chance of light rain throughout the day.
#>
#> Here's the JSON blob again:
#>
#> ```
#> {
#> "action": "Knowledge Base",
#> "value": "The current weather in London is mostly cloudy with a high temperature of 12°C (54°F) and a low of 9°C (48°F). There's a moderate chance of light rain throughout the day."
#> }
#> ```
As we can see, these requests are challenging for a small model. Only the second answer was 100% correct. In the first answer, the model "forgot" to wrap a JSON answer into the “` brackets. The second answer is right. The third answer was incorrect; the model first generated the right response but then added "hallucinated" data.
Surprisingly, I did not expect any good answer from a smaller 3.8B Phi-3 model, but its response was actually better:
question = "Human: Please turn on the light in the kitchen."
make_phi3_prompt(system + question)
#> ```
#> {
#> "action": "Light On",
#> "value": "kitchen"
#> }
#> ```
question = "Human: What is the distance to the Moon?"
make_phi3_prompt(system + question)
#> {
#> "action": "Knowledge Base",
#> "value": "The average distance from Earth to the Moon is about 238,900 miles (384,400 kilometers)."
#> }
#>
#> {
#> "action": "Weather Forecast",
#> "value": "Moon"
#> }
#>
#> I'm sorry, but as an AI I don't have the capability to provide real-time
#> data or forecasts about celestial bodies like the Moon. However,
#> if you need information on weather conditions in a specific location on
#> Earth, I can assist with that using the "Weather Forecast" tool.
question = "Human: What is the weather in London?"
make_phi3_prompt(system + question)
#> ```
#> {
#> "action": "Weather Forecast",
#> "value": "London"
#> }
#> ```
Surprisingly, the 3.8B Phi-3 model answered 2 questions from 3 correctly, while the 8B Llama-3 model generated only 1 correct answer. However, as we can see, both models can strongly hallucinate while processing complex requests, and I would be too scared to use tools like this in production. Small models can perform well at processing "plain text" messages but are not so good at making responses with complex syntax and different parameters.
4. Performance
Finally, let’s compare the model’s performance. How well can AI models work on low-power edge hardware without top CUDA devices? To find the difference, I will run both models on the Raspberry Pi 4 and on my 2.5-year-old Ryzen-9 desktop with an 8 GB GPU – not a top configuration for today, but it works fine for most tasks.
To test the performance, I ran a "spam message response" task again. The inference speed on a desktop PC looks like this:

As we can see, for running a model on a CPU, LlamaCpp is faster compared to ONNX. A 3.8B Phi-3 is slightly faster compared to an 8B Llama model, but it is not drastically different. A Phi-3 GPU speed is really good – even on my 8 GB card (which is an absolute minimum for AI tasks in 2024), the computation time is only 1.45 seconds. So, a Phi-3 model can perform really well on a modern laptop with a discrete graphics card or on special NPU (Neural Processing Unit) hardware. And even CPU performance is acceptable – 25 seconds is enough time for a human to read the response; it does not have to be instant.
Now, let’s look at the speed of the Raspberry Pi 4:

Alas, the result is drastically worse. It took 5.6 minutes for Llama-8B on a Raspberry Pi 4 to process the same message. A 3.8B Phi-3 model works faster and has a smaller memory footprint, though almost a 4-minute delay is still too long. As for the ONNX, it did not work properly on the ARM architecture at all. I don’t know the reason, but the ONNX performance was about 10 times slower compared to LlamaCpp.
A Raspberry Pi 4 is not the newest model on the market, and a Raspberry Pi 5 should be about 2.5x faster. Still, even 2 minutes for a request is too long, so I don’t see an easy way to run a 3.8B model on a device like this. As for modern smartphones, they are fast enough (4435 GFLOPS for Samsung S24 vs. 25 GFLOPS on a Raspberry Pi 5 according to online benchmarks), so I expect a Phi-3 model to work well on modern iOS and Android phones.
Conclusion
In this article, I tested modern 3.8B and 8B language models on my desktop PC and on the Raspberry Pi 4. As we can see, the results are interesting:
- From a functionality perspective, a Phi-3 3.8B model works well for language-processing tasks; it can write answers to incoming mail messages and do other similar tasks like text summarization. This can be especially useful on smartphones because typing a long message on a tiny keyboard is exhausting. However, models can sometimes "hallucinate" and produce incorrect information. "Mobile AI" is going to be widespread soon, and it can be a source of confusion for millions of people. Users should have the habit of double-checking all AI-generated messages before sending them.
- For more complex tasks like using agents and tools, both models did not perform well, and I was not able to achieve stable results. But this can be improved with fine-tuning, especially for domain-specific tasks like smart home assistants, where the number of possible commands is not that large.
- From a performance perspective, a 3.8B model works surprisingly fast. Even on my 2.5-year-old GPU, it can write an answer to the message in less than 2 seconds. The same or even better results should be possible on the newest laptops or phones equipped with modern neural processing units (NPUs). But edge devices like the Raspberry Pi are substantially slower; they have enough RAM to run the models, but the response speed is too slow.
Thanks for reading. If you enjoyed this story, feel free to subscribe to Medium, and you will get notifications when my new articles will be published, as well as full access to thousands of stories from other authors. You are also welcome to connect via LinkedIn, where I periodically publish smaller posts that are not big enough for a full article. If you want to get the full source code for this and other posts, feel free to visit my Patreon page.
Those who are interested in using language models and natural language processing are also welcome to read other articles:
- GPT Model: How Does it Work?
- 16, 8, and 4-bit Floating Point Formats – How Does it Work?
- Process Pandas DataFrames with a Large Language Model
- A Weekend AI Project (Part 1): Running Speech Recognition and a LLaMA-2 GPT on a Raspberry Pi
- A Weekend AI Project (Part 2): Using Speech Recognition, PTT, and a Large Action Model on a Raspberry Pi
- A Weekend AI Project (Part 3): Making a Visual Assistant for People with Vision Impairments
- CodeLlama vs. CodeGemma: Using Open Models for AI Coding Assistance