Speech and Natural Language Input for Your Mobile App Using LLMs

How to leverage OpenAI GPT-4 Functions to navigate your GUI

Hans van Dam
Towards Data Science
14 min read · Jul 25, 2023


Photo by Kelly Sikkema on Unsplash

Introduction

A Large Language Model (LLM) is a machine learning system that can effectively process natural language. The most advanced LLM available at the moment is GPT-4, which powers the paid version of ChatGPT. In this article, you will learn how to give your app highly flexible speech interpretation using GPT-4 function calling, in full synergy with your app’s Graphical User Interface (GUI). It is intended for product owners, UX designers, and mobile developers.

OpenAI GPT-4 Functions to Navigate your Mobile App’s GUI

Background

Digital assistants on mobile phones (Android and iOS) have failed to catch on for several reasons, among them that they are unreliable, limited, and often tedious to use. LLMs, and now especially OpenAI GPT-4, hold the potential to make a difference here: they can grasp the user’s intention much more deeply instead of coarsely pattern-matching a spoken expression.

Android has Google Assistant’s ‘app actions’, and iOS has SiriKit intents. These provide simple templates to register speech requests that your app can handle. Google Assistant and Siri have already improved quite a bit over the past few years — even more than you probably realize. Their coverage is greatly determined, however, by which apps implement support for them. Nevertheless, you can, for instance, play your favorite song on Spotify using speech. The natural language interpretation of these OS-provided services, however, predates the huge advances in this field that LLMs have brought about — so it is time for the next step: to harness the power of LLMs to make speech input more reliable and flexible.

Although we can expect the operating system services (like Siri and Google Assistant) to adapt their strategies soon to take advantage of LLMs, we can already enable our apps to use speech without being limited by these services. Once you have adopted the concepts in this article, your app will also be ready to tap into new assistants as they become available.

The choice of LLM (GPT, PaLM, Llama 2, MPT, Falcon, etc.) does have an impact on reliability, but the core principles you will learn here can be applied to any of them. We will let the user access the entirety of the app’s functionality by saying what they want in a single expression. The LLM maps a natural language expression onto a function call over the navigation structure and functionality of our app. And it need not be a sentence spoken like a robot: the LLM’s interpretive power allows users to speak like a human, in their own words or language, and to hesitate, make mistakes, and correct themselves. Where users have rejected voice assistants because they so often fail to understand what they mean, the flexibility of an LLM can make the interaction feel much more natural and reliable, leading to higher user adoption.

Why speech input in your app, and why now?

Pros:

  • Navigate to a screen and provide all parameters in one speech expression
  • Shallow learning curve: no need for the user to find where in your app the data lives or how to operate the GUI
  • Hands-free operation
  • Complementary rather than disconnected (as in a pure voice user interface, or VUI): speech and GUI work in harmony
  • Accessibility for visually impaired users
  • Now: because interpretation of natural language has risen to a new level through LLMs, responses are much more reliable

Cons:

  • Privacy when speaking
  • Accuracy/misinterpretations
  • Still relatively slow
  • Knowledge in the head vs. in the world (What can I say?): the user does not know what spoken expressions the system understands and has answers to

Examples of apps that can benefit from speech input include those used for car or bicycle driving assistance. In general, users may not want to engage in the precision of navigating an app by touch when they cannot easily use their hands, for instance, when they are on the move, wearing gloves, or busy working with their hands.

Shopping apps may also benefit from this feature, as users can verbalize their desires in their own words rather than navigate through shopping screens and set filters.

When applying this approach to increase accessibility for visually impaired individuals, you might consider adding natural language output and text-to-speech features to the mix.

Your app

The following figure shows the navigation structure of a typical app, exemplified by a train trip planner you may be familiar with. At the top, you see the default navigation structure for touch navigation. This structure is governed by the Navigation Component. All navigation clicks are delegated to the Navigation Component, which then executes the navigation action. The bottom depicts how we can tap into this structure using speech input.

speech enabling your app using LLM function calling

The users say what they want; then a speech recognizer transforms the speech into text. The system constructs a prompt containing this text and sends it to the LLM. The LLM responds to the app with data, telling it which screen to activate with which parameters. This data object is turned into a deep link and given to the navigation component. The navigation component activates the right screen with the right parameters: in this example, the ‘Outings’ screen with ‘Amsterdam’ as a parameter. Please note that this is a simplification. We will elaborate on the details below.

Many modern apps have a centralized navigation component under the hood. Android has Jetpack Navigation, Flutter has the Router, and iOS has NavigationStack. Centralized navigation components allow deep linking, which is a technique that allows users to navigate directly to a specific screen within a mobile application rather than going through the app’s main screen or menu. For the concepts in this article to work, a navigation component and centralized deep linking are not necessary, but they make implementing the concepts easier.

Deep linking involves creating a unique URI path that points to a specific piece of content or a specific section within an app. Moreover, this path can contain parameters that control the state of the GUI elements on the screen the deep link points to.
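
To make the deep-link idea concrete, here is a minimal sketch of how such a parametrized route could be registered with Jetpack Compose Navigation on Android. The 'outings' route and its 'area' parameter mirror the example used later in this article; the screen composables are placeholders, not part of the original app.

import androidx.compose.runtime.Composable
import androidx.navigation.NavType
import androidx.navigation.compose.NavHost
import androidx.navigation.compose.composable
import androidx.navigation.compose.rememberNavController
import androidx.navigation.navArgument

@Composable
fun AppNavHost() {
    val navController = rememberNavController()
    NavHost(navController = navController, startDestination = "home") {
        composable("home") { /* HomeScreen(navController) */ }
        // Parametrized route: a deep link like "outings?area=Amsterdam" lands here.
        composable(
            route = "outings?area={area}",
            arguments = listOf(navArgument("area") {
                type = NavType.StringType
                defaultValue = ""
            })
        ) { backStackEntry ->
            val area = backStackEntry.arguments?.getString("area").orEmpty()
            /* OutingsScreen(area = area) */
        }
    }
}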

Function calling for your app

We tell the LLM to map a natural language expression to a navigation function call through prompt engineering techniques. The prompt reads something like: ‘Given the following function templates with parameters, map the following natural language question onto one of these function templates and return it’.

Most LLMs are capable of this. LangChain has leveraged it effectively through Zero-Shot ReAct agents, where the functions to be called are known as Tools. OpenAI has fine-tuned special versions of their GPT-3.5 and GPT-4 models (currently gpt-3.5-turbo-0613 and gpt-4-0613) that are very good at this, and they have made specific API entries for this purpose. In this article, we will use the OpenAI notation, but the concepts can be applied to any LLM, e.g. using the ReAct mechanism mentioned. Moreover, LangChain has a specific agent type (AgentType.OPENAI_FUNCTIONS) that translates Tools into OpenAI function templates under the hood. For Llama 2, you can use llama-api with the same syntax as OpenAI.

Function calling for LLMs works as follows (a minimal sketch of this round trip follows the list):

  1. You insert a JSON schema of function templates into your prompt along with the user’s natural language expression as a user message.
  2. The LLM attempts to map the user’s natural language expression onto one of these templates.
  3. The LLM returns the resulting JSON object so your code can make a function call.
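
As a minimal illustration of this round trip, the following Kotlin sketch builds the request with OkHttp and org.json and extracts the returned function call. The functions array is assumed to contain the function templates discussed below; error handling is omitted.

import okhttp3.MediaType.Companion.toMediaType
import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.RequestBody.Companion.toRequestBody
import org.json.JSONArray
import org.json.JSONObject

// Sends the user's expression plus the function templates to the LLM and
// returns the "function_call" object (name + JSON-encoded arguments),
// or null if the LLM answered in natural language instead.
fun mapExpressionToFunctionCall(userText: String, functions: JSONArray, apiKey: String): JSONObject? {
    val body = JSONObject()
        .put("model", "gpt-4-0613")
        .put("messages", JSONArray().put(JSONObject().put("role", "user").put("content", userText)))
        .put("functions", functions)

    val request = Request.Builder()
        .url("https://api.openai.com/v1/chat/completions")
        .header("Authorization", "Bearer $apiKey")
        .post(body.toString().toRequestBody("application/json".toMediaType()))
        .build()

    OkHttpClient().newCall(request).execute().use { response ->
        val message = JSONObject(response.body!!.string())
            .getJSONArray("choices").getJSONObject(0).getJSONObject("message")
        return message.optJSONObject("function_call")
    }
}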

In this article, the function definitions are direct mappings of the graphical user interface (GUI) of a (mobile) app, where each function corresponds to a screen and each parameter to a GUI element on that screen. When a natural language expression is sent to the LLM, it returns a JSON object containing a function name and its parameters, which you can use to navigate to the right screen and trigger the right function in your view model, so that the right data is fetched. The values of the relevant GUI elements on that screen are set according to the parameters.

This is illustrated in the following figure:

mapping LLM functions onto your mobile app’s GUI

It shows a stripped version of the function templates as added to the prompt for the LLM. To see the full-length prompt for the user message ‘What things can I do in Amsterdam?’, click here (GitHub Gist). It contains a full curl request that you can use from the command line or import into Postman. You need to put your own OpenAI key in the placeholder to run it.

Screens without parameters

Some screens in your app don’t have any parameters, or at least not the ones that the LLM needs to be aware of. To reduce token usage and clutter, we can combine a number of these screen triggers in a single function with one parameter: the screen to open.

{
  "name": "show_screen",
  "description": "Determine which screen the user wants to see",
  "parameters": {
    "type": "object",
    "properties": {
      "screen_to_show": {
        "description": "type of screen to show. Either 'account': 'all personal data of the user', 'settings': 'if the user wants to change the settings of the app'",
        "enum": ["account", "settings"],
        "type": "string"
      }
    },
    "required": ["screen_to_show"]
  }
}

The criterion for whether a triggering function needs parameters is whether the user has a choice: is there some form of search or navigation going on on the screen, i.e. are there any search(-like) fields or tabs to choose from?

If not, then the LLM does not need to know about it, and triggering the screen can be added to the generic screen-triggering function of your app. It is mostly a matter of experimenting with the descriptions of the screen’s purpose. If a screen needs a longer description, consider giving it its own function definition, which puts more emphasis on its description than the enum value of the generic parameter does.
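
On the app side, the result of this generic function can be handled by a small dispatcher. A sketch, with illustrative route names that match the enum above:

import androidx.navigation.NavController
import org.json.JSONObject

// Turns the show_screen function call into a navigation action.
// The "arguments" field arrives as a JSON-encoded string from the LLM.
fun dispatchShowScreen(navController: NavController, functionCall: JSONObject) {
    val args = JSONObject(functionCall.getString("arguments"))
    when (args.getString("screen_to_show")) {
        "account" -> navController.navigate("account")
        "settings" -> navController.navigate("settings")
    }
}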

Prompt instruction guidance and repair

In the system message of your prompt, you give generic steering information. In our example, it can be important for the LLM to know what date and time it is now, for instance, if you want to plan a trip for tomorrow. Another important thing is to steer its presumptiveness. Often, we would rather have the LLM be overconfident than bother the user with its uncertainty. A good system message for our example app is:

"messages": [
{
"role": "system",
"content": "The current date and time is 2023-07-13T08:21:16+02:00.
Be very presumptive when guessing the values of
function parameters."
},

Function parameter descriptions can require quite a bit of tuning. An example is the trip_date_time when planning a train trip. A reasonable parameter description is:

"trip_date_time": {
"description": "Requested DateTime for the departure or arrival of the
trip in 'YYYY-MM-DDTHH:MM:SS+02:00' format.
The user will use a time in a 12 hour system, make an
intelligent guess about what the user is most likely to
mean in terms of a 24 hour system, e.g. not planning
for the past.",
"type": "string"
},

So if it is now 15:00 and users say they want to leave at 8, they mean 20:00 unless they specifically mention the time of the day. The above instruction works reasonably well for GPT-4. But in some edge cases, it still fails. We can then e.g. add extra parameters to the function template that we can use to make further repairs in our own code. For instance, we can add:

"explicit_day_part_reference": {
"description": "Always prefer None! None if the request refers to
the current day, otherwise the part of the day the
request refers to."
"enum": ["none", "morning", "afternoon", "evening", "night"],
}

In your app, you will likely find parameters that require post-processing to enhance their success ratio.
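
As an illustration of such post-processing, the sketch below uses the explicit_day_part_reference value from above to repair a trip_date_time that would otherwise lie in the past. The heuristics are hypothetical and only meant to show the shape of this kind of repair code:

import java.time.OffsetDateTime

// Repairs a parsed trip_date_time that accidentally ended up in the past.
// Only applied when the user gave no explicit day-part reference.
fun repairTripDateTime(parsed: OffsetDateTime, dayPart: String, now: OffsetDateTime): OffsetDateTime {
    var repaired = parsed
    if (dayPart == "none" && repaired.isBefore(now)) {
        // E.g. "leave at 8" said at 15:00 should mean 20:00, not 08:00.
        repaired = repaired.plusHours(12)
        // If the result is still in the past, assume the user means tomorrow.
        if (repaired.isBefore(now)) repaired = repaired.plusDays(1)
    }
    return repaired
}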

System requests for clarification

Sometimes, the user’s request lacks the information needed to proceed, or there may be no function suitable to handle it. In that case, the LLM responds in natural language, which you can show to the user, e.g. by means of a Toast.

It may also be the case that the LLM recognizes a potential function to call, but information is lacking to fill all required function parameters. In that case, consider making parameters optional. If that is not possible, the LLM may respond with a natural-language request for the missing parameters, in the user’s language. Show this text to the user, e.g. through a Toast or text-to-speech, so they can supply the missing information (by speech). For instance, when the user says ‘I want to go to Amsterdam’ (and your app has not provided a default or current location through the system message), the LLM might respond with ‘I understand you want to make a train trip, from where do you want to depart?’.

This brings up the issue of conversational history. I recommend you always include the last 4 messages from the user in the prompt so a request for information can be spread over multiple turns. To simplify things, omit the system’s responses from the history because, in this use case, they tend to do more harm than good.
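
A sketch of such a history, keeping only the last four user messages and prepending the system message when the prompt is built (org.json, as in the earlier request sketch):

import org.json.JSONArray
import org.json.JSONObject

// Keeps the last few user turns so a clarification can span multiple turns;
// the system's responses are deliberately not stored.
class ConversationHistory(private val maxUserMessages: Int = 4) {
    private val userMessages = ArrayDeque<String>()

    fun addUserMessage(text: String) {
        userMessages.addLast(text)
        while (userMessages.size > maxUserMessages) userMessages.removeFirst()
    }

    // Builds the "messages" array for the prompt: system message first,
    // then the retained user turns in order.
    fun toMessages(systemContent: String): JSONArray {
        val messages = JSONArray()
            .put(JSONObject().put("role", "system").put("content", systemContent))
        userMessages.forEach { messages.put(JSONObject().put("role", "user").put("content", it)) }
        return messages
    }
}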

Speech recognition

Speech recognition is a crucial part of the transformation from speech to a parametrized navigation action in the app. When the quality of interpretation is high, poor speech recognition may very well be the weakest link. Mobile phones have onboard speech recognition of reasonable quality, but large speech models like Whisper, Google Chirp/USM, Meta MMS, or Deepgram tend to give better results, especially when you can tune them for your use case.

Architecture

It is probably best to store the function definitions on the server, but they can also be managed by the app and sent with every request. Both have their pros and cons. Having them sent with every request is more flexible, and the alignment of functions and screens may be easier to maintain. However, the function templates contain not only the function names and parameters but also their descriptions, which we may want to update more quickly than the app stores’ release flow allows. These descriptions are more or less LLM-dependent and crafted for what works. It is not unlikely that you will want to swap out the LLM for a better or cheaper one, or even swap dynamically, at some point. Having the function templates on the server also has the advantage of maintaining them in one place if your app is native on both iOS and Android. If you use OpenAI services for both speech recognition and natural language processing, the technical big picture of the flow looks as follows:

architecture for speech enabling your mobile app using Whisper and OpenAI function calling

The users speak their request; it is recorded into an m4a buffer/file (or mp3 if you like), which is sent to your server, which relays it to Whisper. Whisper responds with the transcription, and your server combines it with your system message and function templates into a prompt for the LLM. Your server receives back the raw function call JSON, which it then processes into a function call JSON object for your app.
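
A server-side sketch of the first of these two HTTP calls, relaying the recorded m4a file to OpenAI's transcription endpoint with an OkHttp multipart request (model whisper-1); error handling is omitted:

import java.io.File
import okhttp3.MediaType.Companion.toMediaType
import okhttp3.MultipartBody
import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.RequestBody.Companion.asRequestBody
import org.json.JSONObject

// Relays the recorded audio file to Whisper and returns the transcribed text.
fun transcribe(audio: File, apiKey: String): String {
    val body = MultipartBody.Builder()
        .setType(MultipartBody.FORM)
        .addFormDataPart("model", "whisper-1")
        .addFormDataPart("file", audio.name, audio.asRequestBody("audio/m4a".toMediaType()))
        .build()

    val request = Request.Builder()
        .url("https://api.openai.com/v1/audio/transcriptions")
        .header("Authorization", "Bearer $apiKey")
        .post(body)
        .build()

    OkHttpClient().newCall(request).execute().use { response ->
        return JSONObject(response.body!!.string()).getString("text")
    }
}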

From function call to deep link

To illustrate how a function call translates into a deep link, we take the function call response from the initial example:

"function_call": {
"name": "outings",
"arguments": "{\n \"area\": \"Amsterdam\"\n}"
}

On different platforms, this is handled quite differently, and over time, many different navigation mechanisms have been used and are often still in use. It is beyond the scope of this article to go into implementation details, but roughly speaking, the platforms in their most recent incarnation can employ deep linking as follows:

On Android:

navController.navigate("outings/?area=Amsterdam")

On Flutter:

Navigator.pushNamed(
  context,
  '/outings',
  arguments: ScreenArguments(
    area: 'Amsterdam',
  ),
);

On iOS, things are a little less standardized, but using NavigationStack:

NavigationStack(path: $router.path) {
...
}

And then issuing:

router.path.append("outing?area=Amsterdam")

More on deep linking can be found here: for Android, for Flutter, for iOS

Free text field for apps

There are two modes of free text input: voice and typing. We’ve mainly talked about speech, but a text field for typing input is also an option. Natural language is usually quite lengthy, so it may be difficult to compete with GUI interaction. However, GPT-4 tends to be quite good at guessing parameters from abbreviations, so even very short abbreviated typing can often be interpreted correctly.

The use of functions with parameters in the prompt often dramatically narrows the interpretation context for an LLM. It therefore needs very little input, and even less if you instruct it to be presumptive. This is a new phenomenon that holds promise for mobile interaction. In the case of the station-to-station trip planner, the LLM made the following interpretations when used with the exemplary prompt structure in this article. You can try these out for yourself using the prompt gist mentioned above.

Examples:

‘ams utr’: show me a list of train itineraries from Amsterdam Central Station to Utrecht Central Station departing from now

‘utr ams arr 9’: (given that it is 13:00 at the moment) show me a list of train itineraries from Utrecht Central Station to Amsterdam Central Station, arriving before 21:00

Follow-up interaction

Just like in ChatGPT, you can refine your query if you send a short piece of the interaction history along:

Using the history feature, the following also works very well (presume it is 9:00 in the morning now):

Type ‘ams utr’ and get the answer as above. Then type ‘arr 7’ in the next turn. And yes, it can actually translate that into a trip being planned from Amsterdam Central to Utrecht Central, arriving before 19:00.

I made an example web app demonstrating this; you can find a video about it here, and the link to the actual app is in the video description.

Update: a successor to this article incorporating text input can be found here and a demo video here.

Future

You can expect this deep-link structure for handling functions within your app to become an integral part of your phone’s OS (Android or iOS). A global assistant on the phone will handle speech requests, and apps can expose their functions to the OS so that they can be triggered in a deep-linking fashion, much like plugins are made available for ChatGPT. A coarse form of this is already available on Android through intents in the AndroidManifest and App Actions, and on iOS through SiriKit intents. The amount of control you have over these is limited, and the user has to speak like a robot to activate them reliably. Undoubtedly, this will improve over time as LLM-powered assistants take over.

VR and AR (XR) offer great opportunities for speech recognition because the user's hands are often engaged in other activities.

It will probably not take long before anyone can run their own high-quality LLM. Cost will decrease and speed will increase rapidly over the next year. Soon, LoRA-tuned LLMs will become available on smartphones, so inference can take place on your phone, reducing cost and latency. More and more competition will also arrive, both open source (like Llama 2) and closed source (like PaLM).

Finally, the synergy of modalities can be driven further than providing random access to the GUI of your entire app. It is the power of LLMs to combine multiple sources that holds the promise for better assistance to emerge. Some interesting articles: multimodal dialog, google blog on GUIs and LLMs, interpreting GUI interaction as language, LLM Powered Assistants.

Conclusion

In this article, you learned how to apply function calling to speech-enable your app. Using the provided Gist as a point of departure, you can experiment in Postman or from the command line to get an idea of how powerful function calling is. If you want to run a POC on speech-enabling your app, I recommend putting the server part from the architecture section directly into your app. It all boils down to two HTTP calls, some prompt construction, and implementing microphone recording. Depending on your skills and codebase, you will have your POC up and running within a few days.
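
For the microphone-recording part on Android, a minimal sketch with MediaRecorder might look like this. It assumes the RECORD_AUDIO permission has already been granted and records to an m4a file that can be sent to the server; on API 31+ you would pass a Context to the MediaRecorder constructor:

import android.media.MediaRecorder
import java.io.File

// Records microphone audio into an AAC/m4a file for transcription.
class SpeechRecorder(private val outputFile: File) {
    private var recorder: MediaRecorder? = null

    fun start() {
        recorder = MediaRecorder().apply {
            setAudioSource(MediaRecorder.AudioSource.MIC)
            setOutputFormat(MediaRecorder.OutputFormat.MPEG_4) // m4a container
            setAudioEncoder(MediaRecorder.AudioEncoder.AAC)
            setOutputFile(outputFile.absolutePath)
            prepare()
            start()
        }
    }

    // Stops and releases the recorder; the resulting file is what gets uploaded.
    fun stop(): File {
        recorder?.apply {
            stop()
            release()
        }
        recorder = null
        return outputFile
    }
}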

Happy coding!

Follow me on LinkedIn or UXX.AI

All images in this article, unless otherwise noted, are by the author.


I have a passion for UX (PhD in HCI) and over 12 years of experience in Android development. Currently deep-diving into LLMs and their use for UX.