Synergy of LLM and GUI, Beyond the Chatbot

Use OpenAI GPT function calling to drive your mobile app

Hans van Dam
Towards Data Science
10 min read · Oct 20, 2023


Synergy of LLM and GUI
Image created using Midjourney

Introduction

We introduce a radical UX approach to optimally blend Conversational AI and Graphical User Interface (GUI) interaction in the form of a Natural Language Bar. It sits at the bottom of every screen, allowing users to interact with your entire app from a single entry point. Users always have the choice between language and direct manipulation. They do not have to figure out where and how to accomplish tasks and can express their intentions in their own language, while the GUI's speed, compactness, and affordance are fully preserved. Definitions of the screens of a GUI are sent along with the user’s request to the Large Language Model (LLM), letting the LLM navigate the GUI toward the user’s intention. We build upon a concept introduced in a previous article, optimize it further, and provide a Flutter sample app, available here for you to try. The full Flutter code is available on GitHub, so you can explore the concept in your own context. A short video explaining the functionality is available here. This article is intended for product owners, UX designers, and mobile developers.

The Natural Language Bar, connecting natural language and GUI

Background

Natural language interfaces and Graphical User Interfaces (GUIs) connect the human user to the abilities of the computer system. Natural language allows humans to communicate about things beyond the immediate situation, while pointing allows communication about concrete items in the world around them. Pointing requires less cognitive effort from one's communicative counterpart than producing and processing natural language does. It also leaves less room for confusion. Natural language, however, can convey information about the entire world: concrete, abstract, past, present, future, and the meta-world, offering random access to everything.

With the rise of ChatGPT, the language-interpretation quality of NLP has reached a high level, and with ‘function calling’ it is now feasible to build complete natural language interfaces to computer systems with few misinterpretations. The current trend in the LLM community focuses on chat interfaces as the main conversational user interface. This stems from chat being the primary form of written human-to-human interaction, preserving the conversational history in a scrolling window. Many sorts of information, however, are better suited to graphical representation. A common approach is to weave GUI elements into the chat conversation. The cost of this is that the chat history becomes bulky, and the state management of GUI elements inside a chat history is non-trivial. Also, by fully adopting the chat paradigm, we lose the option of offering menu-driven interaction paths, so users are left more in the dark about the abilities of the app.

The approach taken here can be applied to a whole range of apps, such as banking, shopping, and travel apps. Mobile apps typically have their most important feature on the front screen, but features on other tabs or buried in menus can be hard for users to find. When users can express their requests in their own language, they can be taken directly to the screen that is most likely to satisfy their needs. Even for the core feature on the front screen, the number of available options may be overwhelming when they are all presented as GUI elements. Natural language approaches this from the other end: users take the initiative and express precisely what they want. Combining the two leads to an optimum, where both approaches complement each other and users can pick whichever suits their task or subtask best.

The Natural Language Bar

The Natural Language Bar (NLB) allows users to type or say what they want from the app. Along with their request, the definitions of all screens of the app are sent to the LLM using a technique coined ‘function calling’ by OpenAI. In our concept, we see a GUI screen as a function that can be called in our app, where the widgets for user input on the screen are regarded as parameters of that function.
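
To make this concrete, the sketch below shows what such a screen-as-function definition could look like in the OpenAI function-calling format, written out here as a Dart map. The 'transfer' screen and its 'destination' and 'amount' parameters are hypothetical stand-ins; in the sample app this schema is generated from route definitions like the one shown further below.

// Hypothetical illustration: a screen described as an OpenAI-style function.
// The names ('transfer', 'destination', 'amount') are made up for this example;
// the real schema is derived from the app's route definitions.
const Map<String, dynamic> transferScreenFunction = {
  'name': 'transfer',
  'description': 'Transfer money from the user\'s account to someone else',
  'parameters': {
    'type': 'object',
    'properties': {
      'destination': {
        'type': 'string',
        'description': 'Name of the person or account to transfer to',
      },
      'amount': {
        'type': 'number',
        'description': 'Amount to transfer, in euros',
      },
    },
  },
};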

We will take a banking app as an example to illustrate the concept. When the user issues a request in natural language, the LLM responds by telling the navigation component in our app which screen to open and which values to set. This is illustrated in the following figure:

Navigating a mobile app using natural language via an LLM and deep linking

Some interaction examples are given in the following images:

Example of navigating in a Mobile App’s GUI using natural language
Example of navigating in a Mobile App’s GUI using natural language

The following image shows a conclusion derived by the LLM. It decides that the best available way to help the user is to show the nearby banking offices:

Example of navigating in a Mobile App’s GUI using natural language

The following example shows that even significantly shortened expressions may lead to the desired result for the user:

Example of navigating in a Mobile App’s GUI using natural language

So free typing can also be a very fast interaction mode. The correct interpretation of such shorthand depends on the intention behind it being unambiguous. In this case, the transfer screen is the only screen in the app this could refer to, so the LLM can make an unambiguous decision.

Another bonus feature is that the interaction has a history, so users can continue to type to correct the previous intent:

Example of navigating in a Mobile App’s GUI using natural language

So the LLM can combine several messages, one correcting or enhancing the other, to produce the desired function call. This can be very convenient for a trip-planning app where users initially just mention the origin and destination and, in subsequent messages, refine it with extra requirements, like the date, the time, only direct connections, only first-class, etc.
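
One way this multi-turn behavior could be realized is sketched below, under the assumption that the accumulated conversation is simply resent with every request; ChatMessage, history, and sendToLlm are illustrative names, not the sample app's actual API.

// Hypothetical sketch of follow-up corrections: the whole conversation is
// resent with every request, so the LLM interprets a message like
// 'make that 200' against the earlier turns.
class ChatMessage {
  final String role; // 'user' or 'assistant'
  final String content;
  ChatMessage(this.role, this.content);
}

final List<ChatMessage> history = [];

// Placeholder for the actual LLM call, which would also include the
// function (screen) definitions in the request.
Future<String> sendToLlm(List<ChatMessage> messages) async =>
    throw UnimplementedError('call your LLM here');

Future<void> ask(String userText) async {
  history.add(ChatMessage('user', userText));
  final reply = await sendToLlm(history);
  history.add(ChatMessage('assistant', reply));
}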

Click here to try the sample app for yourself. Speech input works in the Chrome browser and natively on Android and iOS. The platform’s built-in speech recognition is used, so there is room for improvement if its quality is insufficient for your purpose.

How it works

When the user asks a question in the Natural Language Bar, a JSON schema is added to the prompt sent to the LLM. The JSON schema defines the structure and purpose of all screens and their input elements. The LLM attempts to map the user’s natural language expression onto one of these screen definitions and returns a JSON object so your code can make a ‘function call’ to activate the applicable screen.
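
As a rough illustration, the sketch below shows what such a returned function call could look like for a request like ‘credit card limit to 10000’, and how the screen name and argument values can be extracted from it. The exact response shape depends on the client library used; this follows the OpenAI convention of a function name plus JSON-encoded arguments.

import 'dart:convert';

// Hypothetical example of the LLM's function call for
// 'credit card limit to 10000'.
const String exampleCall =
    r'{"name": "creditcard", "arguments": "{\"limit\": 10000}"}';

void main() {
  final call = jsonDecode(exampleCall) as Map<String, dynamic>;
  final String screen = call['name'] as String;
  final Map<String, dynamic> args =
      jsonDecode(call['arguments'] as String) as Map<String, dynamic>;
  // The screen name and the extracted values are handed to the app's
  // navigation layer, which activates the matching route (an example route
  // definition is shown further below).
  print('activate $screen with $args'); // activate creditcard with {limit: 10000}
}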

The correspondence between functions and screens is illustrated in the following figure:

OpenAI function calling for the GUI screens of a mobile app

A full function specification is available for your inspection here.

The Flutter implementation of the Natural Language Bar is based on LangChain Dart, the Dart version of the LangChain ecosystem. All prompt engineering happens on the client side, since it makes more sense to keep screens, navigation logic, and function templates together. The function templates are knitted into the navigation structure because there is a one-to-one relationship between them. The following code activates and navigates to the credit card screen:

DocumentedGoRoute(
    name: 'creditcard',
    description: 'Show your credit card and maybe perform an action on it',
    parameters: [
      UIParameter(
        name: 'limit',
        description: 'New limit for the card',
        type: 'integer',
      ),
      UIParameter(
        name: 'action',
        description: 'Action to perform on the card',
        enumeration: ['replace', 'cancel'],
      ),
    ],
    pageBuilder: (context, state) {
      return MaterialPage(
          fullscreenDialog: true,
          child: LangBarWrapper(
              body: CreditCardScreen(
                  label: 'Credit Card',
                  action: ActionOnCard.fromString(
                      state.uri.queryParameters['action']),
                  limit: int.tryParse(
                      state.uri.queryParameters['limit'] ?? ''))));
    }),

At the top, we see that this is a route: a destination in the routing system of the app that can be activated through a hyperlink. The description is the part the LLM uses to match the screen to the user’s intent. The parameters below it (credit card limit and action to take) define the fields of the screen in natural language so the LLM can extract them from the user’s question. The pageBuilder item then determines how the screen is activated using the query parameters of the deep link. You can see this in action at https://langbar-1d3b9.web.app/home: type ‘credit card limit to 10000’ in the NLB, and the address bar of the browser will read https://langbar-1d3b9.web.app/creditcard?limit=10000.
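
Given such a function call, constructing and following the deep link is little more than building a URI from the arguments. The sketch below is an assumption about how this glue could look, not the sample app’s literal code; it presumes go_router (which the GoRoute-style route above suggests), and openScreen is a made-up helper name.

import 'package:flutter/widgets.dart';
import 'package:go_router/go_router.dart';

// Hypothetical glue code: turn the screen name and arguments extracted from
// the LLM's function call into a deep link and navigate to it.
void openScreen(
    BuildContext context, String screen, Map<String, dynamic> args) {
  final uri = Uri(
    path: '/$screen',
    queryParameters: args.map((k, v) => MapEntry(k, v.toString())),
  );
  context.go(uri.toString()); // e.g. /creditcard?limit=10000
}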

A LangChain agent was used, which makes this approach independent of GPT, so it can also be applied using other LLMs like Llama, Gemini, Falcon, etc. Moreover, it makes it easy to add LLM-based assistance.

History Panel

The Natural Language Bar offers a collapsible interaction history panel so the user can easily repeat previous statements. This way, the interaction history is preserved, similar to chat interfaces, but in a compact, collapsible form that saves screen real estate and prevents clutter. Previous statements are shown in the user’s own words. System responses are incorporated as a hyperlink on the user statement, so it can be clicked to reactivate the corresponding screen:

The history panel of the Natural Language Bar

When the LLM cannot fully determine the screen to activate, system responses are shown explicitly, in which case the history panel expands automatically. This can happen when the user has provided too little information, when the user’s request is outside of the scope of the app, or when an error occurs.
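
A minimal model for one entry in this panel could look like the sketch below: the user’s own wording, the deep link the LLM resolved it to, and an optional explicit system response for the cases just described. The class and field names are illustrative, not taken from the sample app.

// Illustrative model of one entry in the history panel (names are made up,
// not taken from the sample app).
class HistoryEntry {
  final String userText;     // the user's statement, in their own words
  final Uri? resolvedLink;   // deep link that reactivates the screen, if any
  final String? systemText;  // explicit system response, shown when the LLM
                             // could not determine a screen

  const HistoryEntry(this.userText, {this.resolvedLink, this.systemText});
}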

Future

The history panel is an excellent place to offer customer support and context-sensitive help in chatbot form. At the time of writing, there is a lively discussion and evolution of RAG (Retrieval Augmented Generation) techniques that let chatbots answer user questions based on a large body of text content provided by your organization. Besides that, the Natural Language Bar is a good starting point to imagine what more power and ease one can give to applications using natural language.

Customer Support

The history panel of interactions is a good place to embed customer-support conversations. Such conversations occupy more vertical space than most examples in this text. In a customer-support conversation, your organization’s answers are linguistic expressions, whether produced by a chatbot or a human service operator. They need to be displayed in full rather than embedded in a hyperlink, but that is fine, since otherwise this space would be consumed elsewhere. Your organization probably already has a chatbot on its website or in its app; it is logical to unify it with the history panel of the Natural Language Bar.

Context-sensitive Help

In the context described above, we maintain a history of the linguistic interaction with our app. In the future, we may (invisibly) add a trace of the user’s direct GUI interaction to this history sequence. Context-sensitive help could then be given by combining this interaction trace with RAG on the help documentation of the app. User questions would then be answered in the context of the current state of the app.

Beyond static assistance for Mobile Apps

The current proposal is an MVP. It offers a static template for interpreting a user’s linguistic requests in the context of an app. This technique opens a broad spectrum of future improvements:

  • When users pose a question while they are on a specific screen, we may be able to dynamically add more specific interpretation templates (functions) to the prompt, depending on the state of that screen, for example to answer ‘Why is the submit button greyed out/disabled?’.
  • Function calling through a Natural Language Bar can serve as an assistant for creative applications, e.g. to execute procedures on selections like ‘make the same size’ or ‘turn into a reusable component’. Microsoft Copilot 365 already offers similar features. The approach taken in this article can also enable your organization to take advantage of such functions.

Natural language interaction with every aspect of your system will rapidly become a major component of every UI. When using ‘function calling,’ you must include your system abilities in the prompt, but soon, more economical and powerful methods will hit the market. For instance, OpenAI has recently opened up model finetuning with function calling, allowing you to create an LLM version with the abilities of your system baked in. Even when those abilities are very extensive, the load on the prompt remains limited.

Conclusion

LLMs can be excellent glue for interacting with GUI-based apps in natural language through ‘function calling’. A Natural Language Bar was introduced that enables users to type or speak their intentions; the system responds by navigating to the right screen and prefilling the correct values. The sample app lets you feel what that is like, and the available source code makes it possible to quickly apply this to your own app if you use Flutter. The Natural Language Bar is not limited to Flutter or mobile apps; it can be applied to any application with a GUI. Its greatest strength is that it opens up the entire functionality of the app from a single access point, without the user having to know how to do things, where to find them, or even the jargon of the app. From an app development perspective, you can offer all this by simply documenting the purpose of your screens and the input widgets on them.

Please share your ideas in the comments. I’m really curious.

Follow me on LinkedIn or UXX.AI

Special thanks to David Miguel Lozano for helping me with LangChain Dart

Some interesting articles: multimodal dialog, google blog on GUIs and LLMs, interpreting GUI interaction as language, LLM Powered Assistants, Language and GUI, Chatbot and GUI

All images in this article, unless otherwise noted, are by the author


I have a passion for UX (PhD in HCI) and over 12 years of experience in Android development. Currently deep-diving into LLMs and their use for UX.