
At the beginning of 2024, almost all tech reviewers wrote about the Rabbit R1 – the first portable "AI assistant," priced at $199, which, according to its authors, uses "neuro-symbolic programming" and a LAM ("Large Action Model") to perform different tasks. But how does it work? Well, the best way to find out is to build a prototype of our own!
Readers who have never heard of the Rabbit R1 before can easily find plenty of reviews on YouTube.
This article was also inspired by a post by Nabil Alouani, who did an interesting analysis of how the Rabbit R1 could have been made:
Rabbit’s New AI Device Can "Do Anything" for You by Using Apps – But How Exactly Does It Work?
I will implement similar ideas in Python code, and we will see how it works on real Raspberry Pi hardware and what kind of challenges need to be solved.
Before we begin, a small note: I have no affiliation with the Rabbit team or its sales.
Components
In this article, we will make an AI assistant containing several components:
- A microphone and a Push-to-Talk (PTT) button.
- Automatic Speech Recognition (ASR), which can convert recorded audio data into text.
- A small language model that runs locally on the device. This model will parse actions from the text recognized by the ASR.
- If the action is unknown to the local model, the device will call a public API. Here, two options will be available: the OpenAI API (for those who have a key) and a LLaMA model (for those who want a free solution).
- The result (an action from the local model or a text response from the "big" model) will be displayed on the device screen.
The code in this article is written for the Raspberry Pi, but it can also be tested on a regular PC. And now, let’s get started!
Hardware
For this project, I will use a Raspberry Pi 4, a single-board computer running Linux. The Raspberry Pi has plenty of GPIO (general-purpose input/output) pins, which allow us to connect different hardware. It’s portable and needs only 5V DC power. I will also connect a 128×64 OLED display and a button; the connection diagram looks like this:

At the time of this writing, the Raspberry Pi costs about $80–120, depending on the model (the RPi 5 is faster but more expensive) and RAM size (at least 4GB is required to run a language model). A display, a button, and a set of wires can be bought on Amazon for an extra $10–15. For sound recording, any USB microphone will do the job. The Raspberry Pi setup is straightforward; there are enough tutorials about it. It is only important to mention that Raspbian is available in 32- and 64-bit versions. We need the 64-bit version because most modern Python libraries are not available for 32-bit systems anymore.
Now, let’s talk about software parts.
Push-to-Talk (PTT)
Implementing push-to-talk mode on the Raspberry Pi is relatively straightforward. As we can see on the wiring diagram, the PTT button is connected to one of the pins (in our case, pin 21). To read its value, we first need to import a GPIO library and configure the pin:
try:
    import RPi.GPIO as gpio
except (RuntimeError, ImportError):
    gpio = None

button_pin = 21
if gpio is not None:
    gpio.setmode(gpio.BCM)  # use BCM pin numbering
    gpio.setup(button_pin, gpio.IN, pull_up_down=gpio.PUD_UP)
Here, I set pin 21 as an "input" and enabled the pull-up resistor. A "pull-up" means that when the button is not pressed, the input is connected via the internal resistor to the "power," and its value equals "1." When the button is pressed, the input value equals "0" (so the values in the Python code will be reversed: "1" if the button is not pressed, "0" otherwise).
When the input pin is configured, we need only one line of code to read its value:
value = gpio.input(button_pin)
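For example, taking this inverted logic into account, a "button pressed" check is just a comparison with zero (a tiny sketch):

```python
# The pull-up inverts the reading: 0 means pressed, 1 means released
is_pressed = gpio.input(button_pin) == 0
```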
To make the coding easier, I created a GPIOButton class, which remembers the last button state. By comparing the current and previous states, I can easily detect whether the button was pressed or released:
class GPIOButton:
    def __init__(self, pin_number: int):
        self.pin = pin_number
        self.is_pressed = False
        self.is_pressed_prev = False
        if gpio is not None:
            gpio.setup(self.pin, gpio.IN, pull_up_down=gpio.PUD_UP)

    def update_state(self):
        """ Update button state """
        self.is_pressed_prev = self.is_pressed
        self.is_pressed = self._pin_read(self.pin) == 0

    def is_button_pressed(self) -> bool:
        """ Button was pressed by user """
        return self.is_pressed and not self.is_pressed_prev

    def is_button_hold(self) -> bool:
        """ Button still pressed by user """
        return self.is_pressed and self.is_pressed_prev

    def is_button_released(self) -> bool:
        """ Button released by user """
        return not self.is_pressed and self.is_pressed_prev

    def reset_state(self):
        """ Clear the button state """
        self.is_pressed = False
        self.is_pressed_prev = False

    def _pin_read(self, pin: int) -> int:
        """ Read pin value """
        return gpio.input(pin) if gpio is not None else 0
This approach also allows us to create a "virtual button" for those who don’t have a Raspberry Pi. For example, this "button" is pressed for the first 5 seconds after the application starts:
class VirtualButton(GPIOButton):
    def __init__(self, delay_sec: int):
        super().__init__(pin_number=-1)
        self.start_time = time.monotonic()
        self.delay_sec = delay_sec

    def update_state(self):
        """ Update button state: button is pressed first N seconds """
        self.is_pressed_prev = self.is_pressed
        self.is_pressed = time.monotonic() - self.start_time < self.delay_sec
With a "virtual button," this code can be easily tested on a Windows, Mac, or Linux PC.
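As a quick sanity check, either button class can be polled in a simple loop; this little demo (not part of the final application) just prints state changes:

```python
import time

# Use the real GPIO button on a Raspberry Pi, or the virtual one on a PC
button = GPIOButton(pin_number=21) if gpio is not None else VirtualButton(delay_sec=5)

while True:
    button.update_state()
    if button.is_button_pressed():
        print("Button pressed")
    elif button.is_button_released():
        print("Button released")
        break
    time.sleep(0.1)  # poll ~10 times per second
```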
Sound Recording and Speech Recognition
With the help of the PTT button, we can record the sound. To do this, I will be using the Python soundcard library. I will record the audio in 0.5s chunks; this accuracy is good enough for our task:
import logging
from typing import Any, Optional

import numpy as np
import soundcard as sc


class SoundRecorder:
    """ Sound recorder class """
    SAMPLE_RATE = 16000
    BUF_LEN_SEC = 60
    CHUNK_SIZE_SEC = 0.5
    CHUNK_SIZE = int(SAMPLE_RATE*CHUNK_SIZE_SEC)

    def __init__(self):
        self.data_buf: np.array = None
        self.chunks_num = 0

    def get_microphone(self):
        """ Get a default microphone """
        mic = sc.default_microphone()
        logging.debug(f"Recording device: {mic}")
        return mic.recorder(samplerate=SoundRecorder.SAMPLE_RATE)

    def record_chunk(self, mic: Any) -> np.array:
        """ Record a new chunk of data """
        return mic.record(numframes=SoundRecorder.CHUNK_SIZE)

    def start_recording(self, chunk_data: np.array):
        """ Start recording a new phrase """
        self.chunks_num = 0
        self.data_buf = np.zeros(SoundRecorder.SAMPLE_RATE * SoundRecorder.BUF_LEN_SEC, dtype=np.float32)
        self._add_to_buffer(chunk_data)

    def continue_recording(self, chunk_data: np.array):
        """ Continue recording a phrase """
        self.chunks_num += 1
        self._add_to_buffer(chunk_data)

    def get_audio_buffer(self) -> Optional[np.array]:
        """ Get audio buffer """
        if self.chunks_num > 0:
            logging.debug(f"Audio length: {self.chunks_num*SoundRecorder.CHUNK_SIZE_SEC}s")
            return self.data_buf[:self.chunks_num*SoundRecorder.CHUNK_SIZE]
        return None

    def _add_to_buffer(self, chunk_data: np.array):
        """ Add new data to the buffer """
        ind_start = self.chunks_num*SoundRecorder.CHUNK_SIZE
        ind_end = (self.chunks_num + 1)*SoundRecorder.CHUNK_SIZE
        self.data_buf[ind_start:ind_end] = chunk_data.reshape(-1)
With a PTT button and a sound recorder, we can implement the first part of our "smart assistant" pipeline:
ptt = GPIOButton(pin_number=button_pin)
recorder = SoundRecorder()

with recorder.get_microphone() as mic:
    while True:
        new_chunk = recorder.record_chunk(mic)
        ptt.update_state()
        if ptt.is_button_pressed():
            # Recording started
            recorder.start_recording(new_chunk)
        elif ptt.is_button_hold():
            recorder.continue_recording(new_chunk)
        elif ptt.is_button_released():
            buffer = recorder.get_audio_buffer()
            if buffer is not None:
                # Recording is finished
                # ...
                pass
            # Ready for a new phrase
            ptt.reset_state()
The full code is presented at the end of the article, but this part is enough to get the idea. Here, we have an infinite "main" loop. The microphone is always active, but the recording starts only when the button is pressed. When the PTT button is released, the audio buffer can be used for speech recognition.
The ASR (Automatic Speech Recognition) part was already described in my previous article:
A Weekend AI Project: Running Speech Recognition and a LLaMA-2 GPT on a Raspberry Pi
To make this text shorter, I will not repeat the code again; readers are welcome to check the previous part on their own.
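Still, to keep this article self-contained, here is a minimal sketch of an ASR class with the same interface used later (asr.transcribe(audio_buffer)). It assumes the faster-whisper package purely for illustration; it is not the exact code from the previous article:

```python
import numpy as np
from faster_whisper import WhisperModel  # assumption: faster-whisper is installed


class ASR:
    """ Speech-to-text for 16 kHz float32 audio buffers (illustrative sketch) """
    def __init__(self):
        # A tiny English-only model keeps the latency reasonable on a Raspberry Pi
        self.model = WhisperModel("tiny.en", device="cpu", compute_type="int8")

    def transcribe(self, audio_buffer: np.array) -> str:
        """ Convert a recorded audio buffer into text """
        segments, _ = self.model.transcribe(audio_buffer, beam_size=1, language="en")
        return " ".join(segment.text.strip() for segment in segments)
```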
Display
In this project, I am using a small 1.4" 128×64 OLED display, which can be bought on Amazon for $3–5. The code was already presented in the previous article; I only did a small refactoring and put all the methods into an OLEDDisplay class:
class OLEDDisplay:
    """ Display info on the I2C OLED screen """
    def __init__(self):
        self.pixels_size = (128, 64)
        ...
        self.image_logo = Image.open("bunny.png").convert('1')
        if adafruit_ssd1306 is not None and i2c is not None:
            self.oled = adafruit_ssd1306.SSD1306_I2C(self.pixels_size[0],
                                                     self.pixels_size[1],
                                                     i2c)
        else:
            self.oled = None

    def add_line(self, text: str):
        """ Add new line with scrolling """

    def add_tokens(self, text: str):
        """ Add new tokens with or without extra line break """

    def draw_record_screen(self, text: str):
        """ Draw logo and text """
        logging.debug(f"Draw_record_screen: \033[0;31m{text}\033[0m")
        if self.oled is None:
            return
        image = Image.new("1", self.pixels_size)
        img_pos = (self.pixels_size[0] - self.image_logo.size[0])//2
        image.paste(self.image_logo, (img_pos, 0))
        draw = ImageDraw.Draw(image)
        text_size = self._get_text_size(text)
        txt_pos = (self.pixels_size[0]//2 - text_size[0]//2,
                   self.pixels_size[1] - text_size[1])
        draw.text(txt_pos, text, font=self.font, fill=255, align="center")
        self._draw_image(image)

    def _get_text_size(self, text):
        """ Get size of the text """
        _, descent = self.font.getmetrics()
        text_width = self.font.getmask(text).getbbox()[2]
        text_height = self.font.getmask(text).getbbox()[3] + descent
        return (text_width, text_height)

    def _draw_image(self, image: Image):
        """ Draw image on display """
I also added a draw_record_screen method, which shows the "rabbit" logo and a text line indicating whether or not the PTT button is pressed. The text is also useful for other status messages. The display, connected to the Raspberry Pi, looks like this:

The flickering is an artifact of the video recording; it’s not visible to the human eye. And I am not a visual artist; sorry for my drawing skills 😉
As in the previous article, this code can be tested on a regular PC without a Raspberry Pi. In that case, the oled variable is None, and only the standard logging.debug output will be used.
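By the way, the elided _draw_image body is just a thin wrapper around the SSD1306 driver; a minimal sketch, assuming the standard adafruit-circuitpython-ssd1306 API, could look like this:

```python
    def _draw_image(self, image: Image):
        """ Push a 1-bit PIL image to the OLED (sketch; does nothing without a display) """
        if self.oled is None:
            return
        self.oled.fill(0)       # clear the display buffer
        self.oled.image(image)  # copy the PIL image into the buffer
        self.oled.show()        # send the buffer to the screen over I2C
```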
Large Action Model
Now we are approaching the fun part – let’s play with LLMs. The logic of our AI assistant will be simple:
- We get the phrase from the microphone.
- We parse this phrase with a small language model, which runs locally on the device.
- If the phrase corresponds to a specific action, the assistant performs it (for example, it can turn on the lights by sending a command to a smart LED lamp). Only if the action is unknown does the assistant ask a "big" model for help.
Running the model locally on the device is crucial for a straightforward reason: the cloud API is not free. For example, at the time of this writing, the original Rabbit R1 costs $199, and as the website promises, "no subscription is required." To make that possible, it is important to handle as many actions locally as possible. For our smart assistant, we will use the same approach.
As a toy example, let’s assume we have only one action, and our smart assistant can only switch lights on and off. A possible LLM prompt to detect this action can look like this:
You are the user assistant.
If the user wants to switch on the light, write only LIGHT ON.
If the user wants to switch off the light, write only LIGHT OFF.
In any other case write only I DON'T KNOW.
Example.
User: Switch on the light. Assistant: LIGHT ON.
User: Switch off the light. Assistant: LIGHT OFF.
User: Open the door. Assistant: I DON'T KNOW.
Now read the following user text. Write only a short answer.
USER: {question}
ASSISTANT:
In a real use case, there may be many actions, and a small RAG (retrieval-augmented generation) database could be used to select the prompt best suited to the user’s request, but for our test, it’s good enough.
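To illustrate the idea, a prompt selector based on embedding similarity could look like the sketch below; it assumes the sentence-transformers package, and the action descriptions and prompt texts are hypothetical placeholders:

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical "database": a short description of each action and its full prompt
ACTION_PROMPTS = {
    "switch the light on or off": "You are the user assistant... LIGHT ON / LIGHT OFF ...",
    "play or stop the music":     "You are the user assistant... MUSIC ON / MUSIC OFF ...",
}

encoder = SentenceTransformer("all-MiniLM-L6-v2")
descriptions = list(ACTION_PROMPTS.keys())
desc_embeddings = encoder.encode(descriptions)

def select_prompt(question: str) -> str:
    """ Return the prompt whose action description is closest to the user request """
    q_embedding = encoder.encode(question)
    scores = util.cos_sim(q_embedding, desc_embeddings)[0]
    return ACTION_PROMPTS[descriptions[int(scores.argmax())]]
```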
To use the language model on a Raspberry Pi, let’s create an LLM class. I also created an LLMAction class with our 3 possible actions:
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.prompts import PromptTemplate
from langchain.schema.output_parser import StrOutputParser
from langchain_community.llms import LlamaCpp


class LLMAction:
    """ Available actions """
    UNKNOWN = 0
    LIGHTS_ON = 1
    LIGHTS_OFF = 2

    @staticmethod
    def get_action(response: str) -> int:
        """ Get action from the text response """
        actions = [(LLMAction.LIGHTS_ON, "LIGHT ON"),
                   (LLMAction.LIGHTS_OFF, "LIGHT OFF")]
        for action, action_text in actions:
            if action_text.lower() in response.lower():
                return action
        return LLMAction.UNKNOWN


class LLM:
    """ LLM interaction """
    def __init__(self):
        self.model_file = "..."
        self.llm = LlamaCpp(
            model_path=self.model_file,
            temperature=0.1,
            max_tokens=8,
            n_gpu_layers=0,
            n_batch=256,
            callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),
            verbose=True,
        )

    def get_action_code(self, question: str) -> int:
        """ Ask LLM a question and return the action code """
        res_str = self._inference(question)
        return LLMAction.get_action(res_str)

    def _inference(self, question: str) -> str:
        """ Ask LLM a question """
        template = self._get_prompt_template()
        prompt = PromptTemplate(template=template, input_variables=["question"])
        chain = prompt | self.llm | StrOutputParser()
        resp = chain.invoke({"question": question}, config={}).strip()
        return resp

    def _get_prompt_template(self) -> str:
        """ Get prompt for different models """
        if "tinyllama" in self.model_file:
            return """<|system|>
You are the user assistant. If the user wants to switch on the light, write only LIGHT ON.
If the user wants to switch off the light, write only LIGHT OFF. In any other case write only I DON'T KNOW.
Example.
User: Switch on the light. Assistant: "LIGHT ON".
User: Switch off the light. Assistant: "LIGHT OFF".
User: Open the door. Assistant: "I DON'T KNOW".
Now answer this user question. Please write only LIGHT ON, LIGHT OFF or I DON'T KNOW.
</s>
<|user|>
{question}</s>
<|assistant|>"""
        ...
Here, I loaded a language model using [LlamaCpp](https://python.langchain.com/docs/integrations/llms/llamacpp) and created the _inference method to get the response. It’s important to mention that different models have different prompt syntaxes, so depending on the model name, I select a different prompt. The LlamaCpp library is great for our task because it is written in plain C/C++ and can work without a CUDA GPU (which the Raspberry Pi does not have).
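For example, the elided part of _get_prompt_template could contain a branch for the Llama-2 chat format (a sketch; the exact template used in the full code may differ):

```python
        if "llama-2" in self.model_file:
            return """[INST] <<SYS>>
You are the user assistant. If the user wants to switch on the light, write only LIGHT ON.
If the user wants to switch off the light, write only LIGHT OFF. In any other case write only I DON'T KNOW.
<</SYS>>
{question} [/INST]"""
```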
Which model can we use? The answer is not so easy because the Raspberry Pi has pretty limited computational resources. There is no GPU, and the CPU inference is just slow. Practically, when running a model on a Raspberry Pi, we are limited to 1–2B models; otherwise, the inference takes too much time. A "small large language model" sounds like a paradox, but in our case, the choice is pretty limited. On HuggingFace, I’ve found 3 models that are more or less suitable for the Raspberry Pi: 1B Tiny Vicuna, 1.1B Tiny Llama, and 2.7B Phi-2.
To find the best tiny model, let’s make a tiny benchmark. To test our 3 actions, I created 12 phrases, 4 for each type (a possible measurement loop is sketched after the list):
qa_pairs = [("Switch on the light", LLMAction.LIGHTS_ON),
            ("Switch on the light please", LLMAction.LIGHTS_ON),
            ("Turn on the light", LLMAction.LIGHTS_ON),
            ("Turn on the light please", LLMAction.LIGHTS_ON),
            ("Switch off the light", LLMAction.LIGHTS_OFF),
            ("Switch off the light please", LLMAction.LIGHTS_OFF),
            ("Turn off the light", LLMAction.LIGHTS_OFF),
            ("Turn off the light please", LLMAction.LIGHTS_OFF),
            ("Buy me the ticket", LLMAction.UNKNOWN),
            ("Where is the nearest library?", LLMAction.UNKNOWN),
            ("What is the weather today?", LLMAction.UNKNOWN),
            ("Give me a recipe for an apple pie", LLMAction.UNKNOWN)]
Before running the code, we need to download the models with the huggingface-cli tool:
huggingface-cli download afrideva/Tiny-Vicuna-1B-GGUF tiny-vicuna-1b.q4_k_m.gguf --local-dir . --local-dir-use-symlinks False
huggingface-cli download TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
huggingface-cli download TheBloke/phi-2-GGUF phi-2.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
huggingface-cli download TheBloke/Llama-2-7b-Chat-GGUF llama-2-7b-chat.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
Here, I also downloaded the 7B Llama-2 model as a "reference" to see if it works better.
The test results on a Raspberry Pi 4 look like this:
- Tiny Vicuna 1B: accuracy 66%, avg. time per response 4.2s
- Tiny Llama 1.1B: accuracy 0%, avg. time per response 4.9s
- Phi-2 2.7B: accuracy 75%, avg. time per response 24.6s
- Llama-2 7B: accuracy 83%, avg. time per response 19.3s
The results are interesting. First, I was surprised by the 0% accuracy of the Tiny Llama model. It turned out that it could still provide an answer, but in an inaccurate format. For example, Tiny Llama can respond to the phrase "Turn on the light" with "User: LIGHT ON," which is "kind of" correct, and the key phrase can still be easily found. Second, it was interesting to see that the 7B Llama-2 works faster than the 2.7B Phi-2. Third, the most "difficult" group for the models was the last one; almost all the models tried to answer the question on their own instead of writing "I don’t know." Funny enough, not only the "tiny" models but also Google Bard made this mistake:

Anyway, we can use 1–2B models on the Raspberry Pi, but there are two challenges. First, as we can see, the Raspberry Pi 4 is not fast enough. The model works, but 4 seconds for inference is a bit slow. The Raspberry Pi 5 should be almost 2x faster, but even 2 seconds is a significant delay. Second, the tiny LLMs are often not accurate enough. It’s not a problem for our prototype, but for a production device, fine-tuning will be required.
After getting the test results, it is also interesting to think about how Rabbit R1 developers implemented their LLM. In the R1 description, I saw a "500 ms" response time. As we can see from the results, this time limit is indeed challenging. Naturally, I don’t know the real answer, though I can make some educated guesses.
- They made their own small 0.1–0.5B language model, focused only on user actions, and trained it on a synthetic dataset.
- They used a separate chip to make processing faster. Devices like the Intel Neural Compute Stick have been known for years, and maybe there are modern coprocessors capable of running LLM calculations (this approach is not new; people of my age may recall the Intel 8087 math coprocessor as well 😉).
- They use more than just an LLM. Text parsing can also be done using "classic" Python tools like regular expressions, coded rules, and so on; it may be profitable to combine all these methods (a minimal sketch of such a hybrid parser is shown below).
Last but not least, according to specs, the Rabbit R1 has a pretty good MediaTek MT6765 Octa-core processor, which is significantly (4–8x in some tests) faster than my Raspberry Pi 4. So, maybe even the 1B model works fast enough on this CPU.
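As an illustration of that last idea, a hybrid parser could try cheap regex rules first and fall back to the LLM only when nothing matches. The patterns below are my own examples, not part of the Rabbit R1 or of this project’s original code:

```python
import re

RULES = [(LLMAction.LIGHTS_ON,  re.compile(r"\b(turn|switch)\s+on\b.*\blight", re.IGNORECASE)),
         (LLMAction.LIGHTS_OFF, re.compile(r"\b(turn|switch)\s+off\b.*\blight", re.IGNORECASE))]

def get_action_fast(question: str, llm: LLM) -> int:
    """ Try the regex rules first; ask the local LLM only as a fallback """
    for action, pattern in RULES:
        if pattern.search(question):
            return action
    return llm.get_action_code(question)
```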
Cloud Model
As was discussed before, if the local model does not know the answer, it returns the "I don’t know" response, and in that case, we will forward the question to its "big brother." Let’s do it!
Using a cloud model on a Raspberry Pi is straightforward. I will create an OpenAILLM class, which, as its name suggests, uses the OpenAI API:
from typing import Callable

from langchain_openai import OpenAI

OPENAI_BASE_URL = "..."
OPENAI_API_KEY = "KEY_HERE"


class OpenAILLM:
    """ OpenAI API Handler """
    def __init__(self):
        self.llm = OpenAI(openai_api_key=OPENAI_API_KEY,
                          base_url=OPENAI_BASE_URL)
        self.template = """You are a helpful assistant.
Make a short answer to the question.
Question: {question}
Answer:"""

    def inference(self, question: str, callback: Callable):
        """ Ask OpenAI model a question """
        prompt = PromptTemplate(template=self.template,
                                input_variables=["question"])
        chain = prompt | self.llm | StrOutputParser()
        for token in chain.stream({"question": question}):
            callback(token)
Here, I use streaming mode and a callback handler to update the OLED display, so the answer appears immediately, token by token.
Readers can see an OPENAI_API_KEY variable in the code. What can we do if we don’t have an OpenAI subscription? Well, LlamaCpp is a great library, and it also provides a solution for that: it can mimic the OpenAI API with a local instance, which we can run on another PC. For example, I can run a 7B Llama-2 model on my desktop by using this command:
python3 -m llama_cpp.server --model llama-2-7b-chat.Q4_K_M.gguf --n_ctx 16192 --host 0.0.0.0 --port 8000
After that, I only need to adjust the OPENAI_BASE_URL in the code to something like "http://192.168.1.10:8000/v1". Interestingly, the OpenAI library still needs a key (there is an internal check that the key cannot be empty), but the universal number "42" is good enough 😉
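To make this concrete, with a local llama_cpp.server instance the only change is the pair of constants (the IP address below is just an example from my local network), and the class can be smoke-tested with a simple print callback:

```python
OPENAI_BASE_URL = "http://192.168.1.10:8000/v1"  # llama_cpp.server running on a desktop PC
OPENAI_API_KEY = "42"  # not verified by the local server, but it cannot be empty

# Quick test: stream a short answer to the console
openai_llm = OpenAILLM()
openai_llm.inference("What is the capital of France?",
                     callback=lambda token: print(token, end="", flush=True))
```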
By the way, in our prototype, we use a local model first to reduce the cloud costs, but those readers who have an OpenAI key and don’t care about the price can use API calls to parse local actions as well. It will not be free, but it will be faster and more accurate. In that case, the LLM class can be slightly modified to send the action prompts to OpenAI instead of the LlamaCpp model.
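One possible modification (a sketch, not part of the original code) is to reuse the same action prompt and the LLMAction parser with the OpenAI chain:

```python
class OpenAIActionLLM(OpenAILLM):
    """ Detect user actions via the OpenAI-compatible API (illustrative sketch) """
    ACTION_TEMPLATE = """You are the user assistant.
If the user wants to switch on the light, write only LIGHT ON.
If the user wants to switch off the light, write only LIGHT OFF.
In any other case write only I DON'T KNOW.
User: {question}
Assistant:"""

    def get_action_code(self, question: str) -> int:
        """ Ask the remote model and map its answer to an action code """
        prompt = PromptTemplate(template=self.ACTION_TEMPLATE,
                                input_variables=["question"])
        chain = prompt | self.llm | StrOutputParser()
        resp = chain.invoke({"question": question}).strip()
        return LLMAction.get_action(resp)
```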
Results
Finally, it’s time to combine all the parts! The final code looks like this:
if __name__ == "__main__":
    display = OLEDDisplay()
    display.add_line("Init automatic speech recognition...")
    asr = ASR()
    display.add_line("Init GPT model...")
    llm = LLM()
    ptt = GPIOButton(pin_number=button_pin)
    if gpio is None:
        ptt = VirtualButton(delay_sec=5)

    def on_recording_finished(audio_buffer: np.array):
        """ Recording is finished, process the audio data """
        question = asr.transcribe(audio_buffer)
        display.add_line(f"> {question}\n")
        # Process
        action = llm.get_action_code(question)
        if action != LLMAction.UNKNOWN:
            process_action(action)
        else:
            process_unknown(question)
        display.add_line("\nReturn back in 5s...\n")
        time.sleep(5)

    def process_action(action: int):
        """ Process dummy action """
        ...

    def process_unknown(question: str):
        """ Ask OpenAI the question """
        display.add_line("\n")
        # Stream an answer
        openai_llm = OpenAILLM()
        openai_llm.inference(question=question,
                             callback=display.add_tokens)

    # Main loop
    recorder = SoundRecorder()
    with recorder.get_microphone() as mic:
        ptt.reset_state()
        display.draw_record_screen("PTT READY")
        while True:
            new_chunk = recorder.record_chunk(mic)
            ptt.update_state()
            if ptt.is_button_pressed():
                display.draw_record_screen("PTT ON")
                recorder.start_recording(new_chunk)
            elif ptt.is_button_hold():
                recorder.continue_recording(new_chunk)
            elif ptt.is_button_released():
                display.draw_record_screen("PLEASE WAIT...")
                buffer = recorder.get_audio_buffer()
                if buffer is not None:
                    on_recording_finished(buffer)
                # Ready again
                ptt.reset_state()
                display.draw_record_screen("PTT READY")
In practice, it works like this.
Executing the local action:

Here, I’m using a dummy action; connecting to a smart lamp or using a relay shield on the Raspberry Pi is far beyond the scope of this article.
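For reference, a possible placeholder implementation of process_action (its real body is omitted in the listing above) could simply show the detected action on the display:

```python
def process_action(action: int):
    """ Dummy action: just show the result on the display (a possible placeholder) """
    text = "LIGHT ON" if action == LLMAction.LIGHTS_ON else "LIGHT OFF"
    display.draw_record_screen(text)
    time.sleep(2)
```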
Getting the answer from the remote LLM:

Here, I used an LLaMA-2 model running on my desktop as an OpenAI replacement. We can also see that the response is much faster compared to a model running locally on a Raspberry Pi.
Conclusion
In this "weekend" project, we made a smart assistant, based on a Raspberry Pi, that can perform different functions:
- It supports a push-to-talk button and speech recognition.
- It can run a local language model that is capable of detecting different user actions.
- It can "ask" a more powerful remote model if the user’s request is unknown to the local LLM. In our case, the OpenAI API or a compatible local server can be used.
As we can see, there are many challenges in making a project like this. Large language models need a lot of computation power, which is challenging for a portable device. Getting good results requires not only a fine-tuned model but also hardware that is powerful enough to run that model fast (nobody will be interested in a "smart assistant" response that arrives after a 20-second delay). A cloud API is fast but not free, and finding the balance between hardware costs, computation speed, sale price, and cloud costs can be tricky.
If you enjoyed this story, feel free to subscribe to Medium, and you will get notifications when my new articles are published, as well as full access to thousands of stories from other authors. You are also welcome to connect via LinkedIn. If you want the full source code for this and other posts, feel free to visit my Patreon page.
Those who are interested in using language models and natural language processing are also welcome to read other articles:
- A Weekend AI Project (Part 1): Running Speech Recognition and a LLaMA-2 GPT on a Raspberry Pi
- LLMs for Everyone: Running LangChain and a MistralAI 7B Model in Google Colab
- LLMs for Everyone: Running the LLaMA-13B model and LangChain in Google Colab
- LLMs for Everyone: Running the HuggingFace Text Generation Inference in Google Colab
- Natural Language Processing For Absolute Beginners
- 16, 8, and 4-bit Floating Point Formats – How Does it Work?
Thanks for reading.