
Rabbit’s New AI Device Can “Do Anything” for You by Using Apps – But How Exactly Does It Work?

Let's reverse-engineer "r1" and its Large Action Model

All images by the author via Midjourney and DALL-E.

All you have to do is push a button and speak your mind. Half a second later, r1 turns your instructions into actions. If you’re not into voice commands, you can shake r1 and a keyboard will appear on the screen.

r1 is a new AI-powered device introduced by a company called Rabbit.

It’s like a phone, but it’s not a phone. It’s more like a Pokédex with an AI assistant that "gets things done." r1 can book you a trip, update a spreadsheet, and suggest recipes based on what you have in the fridge.

"The R1 is surprisingly light and feels much nicer than it looks in pictures," David Pierce wrote for The Verge. "Its buttons are clicky and satisfying, which is no surprise from Teenage Engineering, and the whole thing fits nicely in my grip."

Okay cool, but how much?

$199. No subscription. No hidden fees.

Not bad for a device that can "do anything" based on English instructions – but how does it work? The answer is LAM, the model powering r1. LAM stands for Large Action Model, and it’s the main star of Rabbit’s AI show.


What the heck is a Large Action Model?

LAM turns human intentions into actions through imitation. You show LAM a given workflow, and it learns how to replicate it. No need for repetition; one shot is all it takes.

"We have developed a system that can infer and model human actions on computer applications, perform the actions reliably and quickly, and is well-suited for deployment in various AI assistants and operating systems […]

Enabled by recent advances in neuro-symbolic programming, the LAM allows for the direct modeling of the structure of various applications and user actions performed on them without a transitory representation, such as text. […]

We hope that our efforts could help shape the next generation of natural-language-driven consumer experiences."

No typing in search bars. No clicking. No switching back and forth between tabs. LAM does it on your behalf.

"Oh, this is just an app connected to multiple APIs," you may think, and you’d be wrong. LAM uses UIs just like we humans do. "Okay, but why is it a big deal?" you may ask.

Because if an AI model can adapt to any user interface, it can do anything.

Imagine you’re five years old. It’s bedtime, and your mother sits next to you. She suggests storytime. You have two options.

  1. You can read a book: in this case, the book is a UI that allows you to access a given story.
  2. Your mother can tell you a story: in this case, your mother is an API that connects you to the content of a given story.

If you opt for the API, you can access extra features, including a comforting voice, Q&A sessions, and special sound effects like "SPLASH" and "BOOM."

These extra features aren’t free though. You have to abide by certain rules set by the API. For instance, your mother can demand you eat your veggies to unlock storytime. She can also limit the number of "replays."

Different API? Different rules. Your grandfather may require you to clean your room, and your older sister can ask for a slice of your dessert.

If you pick the UI (the book) instead, you have fewer features, but more flexibility. You can set the pace and read the story on repeat. Plus, the "rules" are straightforward and barely change across books.

In short, each API offers more options but requires a unique set of rules. UIs have fewer features but offer more freedom and stability: they look similar across apps.

AI assistants like ChatGPT use APIs to access additional functions, like image generation with DALL-E and web browsing with Bing. If the plug-ins change their API rules, however, you lose the extra features.

LAM does things differently. It doesn’t use a mediator; it goes straight to the app.

And yes, UIs also evolve, but only on rare occasions, because apps don’t want to disturb their users. Besides, you can quickly guess UI updates through trial and error. In contrast, API updates require technical adaptations.
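To make the API-versus-UI distinction concrete, here's a minimal sketch in Python. The travel site, its API endpoint, and the page selectors are all made up for illustration, and the UI route assumes the `requests` and Playwright libraries are installed; it's only meant to show the shape of each approach, not any real integration.

```python
# Two ways to get the same result: ask an API, or drive the UI like a human would.
# The endpoint, site, and selectors below are hypothetical.
import requests
from playwright.sync_api import sync_playwright

def via_api(city: str, date: str) -> dict:
    # The API route: structured request, structured response,
    # but you must follow the provider's rules (keys, quotas, schema changes).
    resp = requests.get(
        "https://api.example-travel.com/v1/search",   # hypothetical API
        params={"city": city, "date": date},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

def via_ui(city: str, date: str) -> str:
    # The UI route: fewer guarantees, but the "rules" are whatever a human sees.
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://www.example-travel.com")   # hypothetical site
        page.fill("#destination", city)               # hypothetical selectors
        page.fill("#checkin", date)
        page.click("button#search")
        results = page.inner_text(".results")
        browser.close()
        return results
```

If the provider changes its API schema, `via_api` breaks until a developer adapts the code; if the site tweaks its layout, `via_ui` only needs updated selectors, which is closer to how a human copes with a redesign.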

Wait, how do you enable LAM to use an app?

You use a web portal called the Rabbit Hole.

The Rabbit Hole allows you to select which apps you want LAM to access. No worries, LAM doesn’t store your login data; your ID and password remain on the app side. Meanwhile, LAM acts on your behalf and asks for confirmation for every important step.

Rabbit has already trained their model on the most popular applications, but you can also teach LAM new actions.

For instance, during the r1 demo, Rabbit CEO Jesse Lyu showed LAM how to generate a picture of a cute dog using Midjourney on Discord. Jesse then asked LAM to mimic the same workflow, but instead of a dog, he asked for a picture of a bunny. LAM did just that.

The question is: How is all this possible on a $200 device?


Is LAM inside the r1 device?

Rabbit didn’t spill their secret sauce, but they left hints scattered across three places: their demo, their website, and a tech blog.

Here are the most significant clues so far:

  • A one-shot payment of $200 and no subscription;
  • Fast response time of 500 milliseconds (half a second);
  • Focus on neuro-symbolic programming (a fancy way to say algorithms);
  • Computer vision;
  • Speech-to-text and text-to-speech;
  • Natural-language interface (like LLMs).

The first two hints misled thousands of people. They thought the model runs locally on the r1 device because there’s no subscription.

This means the tiny orange Pokédex would have to host a powerful model, except powerful models require a significant amount of storage and computing power. In other words, it’s hard to squeeze "do anything" capabilities into a device half the size of a smartphone.

It’s easier to host LAM – or at least part of it – on the cloud, a hypothesis Rabbit themselves confirmed on their website.

"We believe that intelligence in the hands of the end-user is achievable without heavy client-side computing power.

By carefully and securely offloading the majority of computation to data centers, we open up opportunities for ample performance and cost optimizations, making cutting-edge interactive AI experiences extremely affordable.

While the neuro-symbolic LAM runs on the cloud, the hardware device interfacing with it does not require expensive and bulky processors, is extremely environmentally friendly, and consumes little power.

As the workloads related to LAM continue to consolidate, we envision a path towards purposefully built server-side and edge chips."

Rabbit’s website [Emphasis by the author].

Another clue is a quote from David Pierce mentioned earlier.

"I spent a few minutes with the R1 after Rabbit’s launch event, and it’s an impressive piece of hardware," David said. "Only one device (Lyu’s) was actually functional, and even that one couldn’t do much because of spotty hotel Wi-Fi." [Emphasis by the author].

We can deduce that most, if not all, of the heavy lifting happens in a cloud-based model. This means r1 accesses LAM through an API.

You could say r1 uses one API to replace them all.

Still, there are gray areas around how r1 works. To explain further, let’s break down the r1 stack into components and explore what each one does.


r1 = local UI + (LAM via API)

The messy equation means r1 is a device that has a user interface connected to LAM through the internet.
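Here's what that equation could look like in code: a "thin client" that only captures the instruction and forwards it to a cloud-hosted model. The endpoint URL, payload, and response format are assumptions on my part, not Rabbit's actual API.

```python
# A minimal "thin client" sketch: the device captures the instruction
# and forwards it to a cloud-hosted model behind an API.
# Endpoint, payload, and response format are hypothetical.
import requests

LAM_ENDPOINT = "https://cloud.example-lam.com/v1/act"   # hypothetical

def handle_instruction(instruction: str, device_id: str) -> str:
    """Send the user's instruction to the cloud LAM and return its reply."""
    response = requests.post(
        LAM_ENDPOINT,
        json={"device_id": device_id, "instruction": instruction},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["reply"]   # e.g. "Your ride is booked for 6 pm."

# print(handle_instruction("Book me a ride to the airport", "r1-demo-001"))
```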

This is why people argue that r1 could be "just an app" instead of a device. Theoretically, you can build an app that does everything r1 can do.

In this case, you don’t need a dedicated device and you don’t need the Rabbit OS either. Rabbit OS is a customized operating system built specifically to connect r1 to LAM.

[Embedded tweet by @WillHobick: a fake version of r1 that is not capable of carrying out actions on your other apps, just an illustration of the "r1 as an app" idea. https://twitter.com/WillHobick/status/1745569486055367138]

If r1 is just a "wrapper app" hosted on a cute gadget, it’d work much like ChatGPT. Every time you ask a question, you run an inference on a powerful model. Inference is a fancy way to say "make your model answer a question."

Each answer generated by your model requires computational resources and thus, money. You could argue inferences cost mere pennies nowadays, but with thousands of users, the expenses can spiral and drive a company toward bankruptcy.

This is for instance why OpenAI limits daily usage for each user, even for those who pay a subscription.

Similarly, every instruction you give r1 generates a small cost for Rabbit. Assuming the same pricing as GPT-4, $200 could cover one to three years of r1 usage.

The calculation assumes 10 usage sessions per day, each consisting of 3 interactions. The estimate is rough, but it gives a general idea of the financial tradeoff.

If you also include the hardware cost, you’re left with less than two years of usage before each user becomes a running cost for Rabbit.
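To make the estimate concrete, here's a back-of-envelope calculation. It uses the assumption above (10 sessions a day, 3 interactions each), GPT-4's list prices at the time (about $0.03 per 1K input tokens and $0.06 per 1K output tokens for the 8k model), and token counts per interaction that are pure guesses on my part.

```python
# Back-of-envelope cost check under the stated assumptions.
# Token counts per interaction are rough guesses, purely for illustration.
PRICE_IN = 0.03 / 1000    # $ per input token (GPT-4, 8k context)
PRICE_OUT = 0.06 / 1000   # $ per output token

INTERACTIONS_PER_DAY = 10 * 3             # 10 sessions x 3 interactions
TOKENS_IN, TOKENS_OUT = 200, 100          # assumed tokens per interaction

cost_per_interaction = TOKENS_IN * PRICE_IN + TOKENS_OUT * PRICE_OUT
cost_per_year = cost_per_interaction * INTERACTIONS_PER_DAY * 365

print(f"~${cost_per_interaction:.3f} per interaction")   # ~$0.012
print(f"~${cost_per_year:.0f} per year")                 # ~$131
print(f"$200 covers ~{200 / cost_per_year:.1f} years")   # ~1.5 years
```

Tweak the token counts or swap in cheaper model pricing and the runway stretches accordingly, which is exactly the lever Rabbit would be betting on.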

Rabbit may consider this a positive trade because they can use the money from initial sales to improve their product and sell new versions. They can also hope the cost of inference will go down over time.

The story doesn’t end there, however. Another technological trick can make the financial trade even better for Rabbit.


r1 = local (small LLM + speech models + UI) + (LAM via API)

There could be small models hosted on the r1 device – a small fine-tuned LLM, a text-to-speech model (TTS), and a speech-to-text model (STT).

These models formulate quick intermediary responses to the user.

When you ask LAM to order missing ingredients for a recipe, the model takes half a second to say "Let me take a look." That’s the local LLM and speech-conversion models formulating an immediate response.

Simultaneously, r1 sends a query to LAM via the online API. From there, LAM takes over, analyzes the situation, and orders the missing groceries via a delivery app.

Similarly, if you say, "Tell me the stock price of Apple," r1’s local models respond: "Searching for the stock price." At the same time, r1 runs a query through LAM to search the web for an answer.

The final response is then displayed on the screen and spelled out loud via the local text-to-speech model.

This would explain how r1 responds to prompts within 500 milliseconds. The local models can also optimize the queries sent to the online LAM model to reduce costs through prompt engineering.
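Here's a rough sketch of that "acknowledge locally, act in the cloud" pattern using Python's asyncio. Both the local acknowledgement and the cloud call are stand-ins; the latency is simulated with a sleep.

```python
# Answer fast locally, think slowly in the cloud.
# query_cloud_lam() and the local acknowledgement are hypothetical stand-ins.
import asyncio

async def local_acknowledgement(instruction: str) -> None:
    # A small on-device model could produce this almost instantly.
    print("r1 (local): Let me take a look.")

async def query_cloud_lam(instruction: str) -> str:
    await asyncio.sleep(2)   # simulate network + inference latency
    return f"Done: handled '{instruction}' via the relevant app."

async def handle(instruction: str) -> None:
    # Fire both at once: the user hears something within ~half a second,
    # while the real work happens in the cloud.
    ack = asyncio.create_task(local_acknowledgement(instruction))
    result = asyncio.create_task(query_cloud_lam(instruction))
    await ack
    print("r1 (cloud):", await result)

asyncio.run(handle("Order the missing ingredients for carbonara"))
```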

Another advantage of having a small local LLM running on r1 is to turn simple conversations into a "free feature."

If you ask r1 for jokes, generic definitions, or spelling checks, the local LLM can be enough. From Rabbit’s point of view, they’re delivering useful services without spending extra money.

Then whenever you ask complex questions, LAM takes over.

Speaking of which, we still haven’t explained LAM’s main capability.


How does LAM perform actions?

The answer could be a technological solution called Robotic Process Automation or RPA.

RPA is like recording a Macro. You open an app like Microsoft Excel, press a "record" button, and perform a set of actions. The program learns to mimic the same actions in the same order – and once you "save" a macro, you can "play" it whenever you want.

Note that RPA can mimic tasks across different apps.

For instance, my team uses RPA to extract information from a series of apps. Instead of doing individual search queries, we use an RPA bot that navigates different UIs to export the requested data in elegant tables.

RPA works like a charm with boring repetitive tasks, but it has next-to-zero flexibility. If one of the UIs changes its layout, the robot will fail: It’ll click on the wrong button or submit data in the wrong field.
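To see why, here's a toy version of a recorded macro, assuming a made-up UI with three named fields. It replays each step verbatim and breaks the moment the layout changes.

```python
# A toy "recorded macro": a fixed sequence of steps replayed verbatim.
# The fake UI below stands in for a real app; field names are made up.
RECORDED_MACRO = [
    ("fill", "destination", "Lisbon"),
    ("fill", "checkin", "2024-03-01"),
    ("click", "search_button", None),
]

def replay(macro, ui_fields):
    """Replay each step blindly; fail as soon as the UI no longer matches."""
    for action, target, value in macro:
        if target not in ui_fields:
            raise RuntimeError(f"UI changed: no element called '{target}'")
        print(f"{action} -> {target}" + (f" = {value}" if value else ""))

replay(RECORDED_MACRO, ui_fields={"destination", "checkin", "search_button"})  # works
# replay(RECORDED_MACRO, ui_fields={"destination", "date", "search_button"})   # breaks
```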

Rabbit may have invested in neuro-symbolic programming to overcome the flexibility issue. Put differently, they wrote code to design a flexible version of RPA.

See, with classic RPA, you can’t book a trip on different dates. This is because RPA strictly replicates the task exactly as you record it – the dates are hardcoded into the task.

With LAM, you can change the dates, location, and the number of visitors. You can also tick the breakfast box and write a message to the host.

For simplicity let’s call this technology "flexible RPA."

Flexible RPA is the main reason Rabbit can afford to build an AI assistant that doesn’t cost a fortune. Most of the intelligence is algorithmic – and algorithms are cheap to execute.
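As a rough sketch of what "flexible RPA" could mean, the recorded workflow becomes a template whose slots (city, dates, guests) are filled per request. The slot names and template format are my own illustration, not Rabbit's actual implementation; in their description, extracting those slots from your instruction would be the language model's job.

```python
# "Flexible RPA" sketch: a recorded workflow turned into a parameterized template.
WORKFLOW_TEMPLATE = [
    ("fill", "destination", "{city}"),
    ("fill", "checkin", "{checkin}"),
    ("fill", "guests", "{guests}"),
    ("check", "breakfast_included", "{breakfast}"),
    ("click", "book_button", None),
]

def run_workflow(template, slots):
    for action, target, value in template:
        rendered = value.format(**slots) if value else None
        print(f"{action} -> {target}" + (f" = {rendered}" if rendered else ""))

# Same workflow, different trip: no re-recording needed.
run_workflow(WORKFLOW_TEMPLATE, {"city": "Kyoto", "checkin": "2024-05-12",
                                 "guests": 2, "breakfast": True})
```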


There’s an LLM in "LAM"

Whether or not there’s a small LLM inside the r1 device is still up for discussion. But you can be 100% sure the cloud-hosted LAM has a capable Large Language Model attached to it. Think something like GPT-3.5.

The LLM inside LAM takes care of text generation and splits tasks across the rest of LAM’s components. Picture it as a manager who does all the talking and delegates work to different teams.
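A minimal sketch of that manager role, under my own assumptions about how the routing might work: the model's output is reduced here to a hard-coded `route()` function that picks a component and hands over the parameters, which is what an LLM call with a routing prompt would be asked to produce.

```python
# The "LLM as manager" idea: pick a component, extract its parameters,
# and delegate. route() mimics what a routing prompt might return.
def route(instruction: str) -> dict:
    text = instruction.lower()
    if "picture" in text or "photo" in text:
        return {"component": "vision", "args": {"query": instruction}}
    if any(w in text for w in ("book", "order", "buy")):
        return {"component": "flexible_rpa", "args": {"task": instruction}}
    return {"component": "retrieval", "args": {"query": instruction}}

print(route("Book a table for two tonight"))   # -> flexible_rpa
print(route("What's in this picture?"))        # -> vision
print(route("What's Apple's stock price?"))    # -> retrieval
```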

Even a not-very-capable model can do the trick because the "generative aspect" is not LAM’s top priority. In terms of language, LAM’s key feature is information retrieval.

That’s why LAM has to be equipped with a good RAG.

RAG is short for Retrieval Augmented Generation. It’s a framework that allows LLMs (and now LAMs) to search for relevant information and produce better answers. RAG increases accuracy and context awareness.

With RAG you can unlock features like "finding answers through a web search" and "extracting quotes from a research paper."

The RAG framework is particularly useful for LAM because:

  1. It allows the integration of external information.
  2. It helps with calculations by retrieving key figures and parameters from natural language instructions.
  3. It grounds generated answers in a relevant context.
  4. It enhances "memory" by using past conversations as a data source.
  5. It increases accuracy by cross-checking inputs with reliable sources.
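As an illustration only, here's a minimal retrieve-then-generate loop. The documents, the keyword-overlap scoring, and the generate() stub are placeholders; a production setup would rely on embeddings, a vector store, and a real LLM call.

```python
# Minimal RAG sketch: retrieve the most relevant snippets,
# then feed them to the generator as context.
DOCUMENTS = [
    "The r1 device was announced by Rabbit at CES 2024 for $199.",
    "LAM stands for Large Action Model.",
    "Retrieval Augmented Generation grounds answers in external sources.",
]

def retrieve(query: str, docs, k: int = 2):
    # Score each document by naive keyword overlap with the query.
    scores = [(sum(w in d.lower() for w in query.lower().split()), d) for d in docs]
    return [d for score, d in sorted(scores, reverse=True)[:k] if score > 0]

def generate(query: str, context):
    # Stand-in for an LLM call that answers using the retrieved context.
    return f"Answer to '{query}' using context: {context}"

query = "What does LAM stand for?"
print(generate(query, retrieve(query, DOCUMENTS)))
```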

Let’s not forget about Computer Vision

LAM can analyze images through a Computer Vision (CV) module. It’s still not clear whether it acts on static pictures or a continuous video feed.

Either way, vision involves heavy computation that is best done in the cloud. Therefore, it’s safe to assume you can only perform "image/video queries" when r1 is online.

Much like the stock price example, the local LLM responds within half a second, saying: "Let me take a look." Meanwhile, the cloud-hosted LAM analyzes the images and responds accordingly.
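A small sketch of that assumption: the device checks connectivity, gives the instant local acknowledgement, then ships the frame to a cloud vision module. Both helpers below are hypothetical.

```python
# Image queries assumed to work only online; helpers are hypothetical stand-ins.
def is_online() -> bool:
    return True   # pretend we have Wi-Fi

def send_to_cloud_vision(image_bytes: bytes, question: str) -> str:
    return f"Cloud vision answer about {len(image_bytes)} bytes: '{question}'"

def vision_query(image_bytes: bytes, question: str) -> str:
    if not is_online():
        return "I need an internet connection to look at that."
    print("r1 (local): Let me take a look.")   # instant local acknowledgement
    return send_to_cloud_vision(image_bytes, question)

print(vision_query(b"\x89PNG...", "What can I cook with these ingredients?"))
```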


LAM in short

LAM = LLM (with good RAG) + Computer Vision + flexible RPA (to mimic human users).

All of the Deep Learning models (LLMs, Text-Speech models, and Computer Vision) are merely interfaces. They translate natural language into parameters that get fed into the RPA-like software to perform actions.

The heavy computational lifting is done by "regular software" that’s much less expensive than Deep Learning models.
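Putting the pieces together, under the same assumptions as the sketches above: the deep-learning layers only translate language into parameters, and a cheap RPA-like layer does the actual clicking. Both functions below are stand-ins.

```python
# End-to-end shape of the pipeline: language in, parameters out, actions executed.
def understand(instruction: str) -> dict:
    # Stand-in for the LLM/RAG layer: turn language into structured parameters.
    return {"app": "travel", "city": "Lisbon", "checkin": "2024-03-01", "guests": 2}

def act(params: dict) -> str:
    # Stand-in for the flexible-RPA layer: drive the app's UI with those parameters.
    return f"Booked {params['city']} for {params['guests']} guests on {params['checkin']}."

print(act(understand("Book us a weekend in Lisbon starting March 1st")))
```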


Why did Rabbit get it right?

They had the everyday user in mind. They didn’t build r1 to impress business owners and research labs. They built both the model and the device for regular people, who don’t care about AI benchmarks or the number of parameters in your model.

"Our mission is to create the simplest computer," Rabbit CEO Jesse Lyu said. "Something so intuitive that you don’t need to learn how to use it."

In addition, Rabbit values both your time and privacy. The model doesn’t store your personal data. And unlike smartphones, r1 wants you to spend less time scrolling mindlessly.

Put differently, Rabbit isn’t after your attention.

Sure, the upcoming Pokédex won’t replace your smartphone, but it’s already pushing Big Tech to innovate. Expect Google, Apple, and Amazon to react with newer versions of their smart speakers and AI assistants.

Whatever happens next, users stand to win because the game is changing from time-wasting to time-saving.

Quick lessons to steal from Rabbit

Whether you work in data, software, or web design, you’re serving users – and all users seek the same things.

Simplicity, efficiency, and elegance.

Products are not only technical solutions. They’re also experiences that enhance and resonate with your users’ daily lives.

  • Build to impress your customers, not your colleagues and competitors.
  • Combine existing technologies to create new ones.
  • Minimize computing costs; maximize usefulness.
  • Challenge the status quo even if you know you’ll lose. That’s how you drive innovation.
  • Look for partnerships that align with your values.
  • Make it useful.
  • Make it cute.
  • Make it fun.

Keep in touch?

You can subscribe to get email notifications. Smiling also works.

Want my exclusive Medium articles in your inbox? Subscribe in one click: nabil-alouani.medium.com. I publish weekly stories on AI and LLMs.

I’m also active on LinkedIn and X and reply to every single message.

For Prompt Engineering inquiries, write me at: [email protected].

