Poster: Enabling Agent-centric Interaction on Smartphones with LLM-based UI Reassembling

In this poster, we introduce a novel dynamic user interface (UI) specifically designed for mobile devices powered by large language model (LLM) agents. The advent of LLMs has led to a surge in deploying LLM-based agents on personal and Internet of Things (IoT) devices, with the aim of facilitating various daily tasks through device manipulation. However, this integration poses a significant challenge: how to intelligently and flexibly select and present information both during and after the execution of tasks, ensuring users are well informed about the operations and can access the desired results conveniently. To address this challenge, we propose a UI reassembling method. This method analyzes and strategically combines different mobile applications and their UI components, enabling the dynamic construction and adjustment of UIs tailored to user needs. Our prototype exhibits promising performance, with the UI selection module achieving an F1 score of 0.74. This approach opens up exciting possibilities for a new user-device interaction paradigm, leveraging the capabilities of LLMs to enhance the user experience in handling mobile and IoT devices.


INTRODUCTION
The rise of LLMs has led to an increasing deployment of LLM-based agents on personal and IoT devices. These agents can perform various daily tasks by understanding users' intentions, gathering information, making decisions, and taking autonomous actions [1]. On mobile devices such as smartphones, VR devices, and electric vehicles, these agents could autonomously operate mobile apps [3], call different APIs [2], and use various sensors [4] to perform tasks. However, this integration poses a challenge in effectively displaying information during task execution, ensuring users are informed about the operations and the outcomes of the tasks they desire.

Figure 1: An example of the UI reassembling method. Our system executes the user's task, selecting and displaying key UI elements from SMS Messenger and Google Maps.
Traditional static UIs struggle to display comprehensive information about the tasks completed by agents. For instance, when a user asks for a product comparison across two online shopping apps, an LLM-based agent should access these apps, search for the product, and show the relevant details. However, showing the product information from both apps on one screen is difficult for a traditional UI. A straightforward approach is to summarize the comparison in natural language within a single dialogue box, but this method might exclude crucial details, such as the product's image.
To address this, we develop a UI reassembling technique that combines all necessary UI components from different UI pages or apps, offering a more flexible way to display information. Our system strategically selects UI elements that meet the user's current needs and generates a clear UI from these elements, as shown in Figure 1. It leverages LLMs to select task-relevant UI elements and uses LLM-based agents to reassemble the UI by generating a structured layout description. To the best of our knowledge, this approach represents the first exploration into dynamic UIs designed to display information from LLM-based agents.

DESIGN
Our system consists of a task planning agent, a UI navigating agent, and a UI reassembling agent. Given a high-level task that may involve several apps, the task planning agent divides it into subtasks. The UI navigating agent then completes each subtask within one app and records the important UI elements for display. Finally, the UI reassembling agent constructs a user-friendly UI layout based on these recorded UI elements. The workflow of our system is shown in Figure 2.
Task Planning Agent is responsible for making plans to solve the high-level user task. Specifically, given a task T and a set of apps {A_1, A_2, ..., A_n}, it decomposes T into a series of subtasks {T_1, T_2, ..., T_m} on the mobile device and assigns each subtask T_i to the corresponding app A_i. We utilize the reasoning and planning abilities of LLMs to complete the task planning process.
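For illustration, the following is a minimal sketch of how such an LLM-driven decomposition could look. It is not the paper's implementation: the call_llm wrapper, the prompt wording, and the JSON reply format are assumptions.

```python
import json

def plan_task(task: str, apps: list[str], call_llm) -> list[dict]:
    """Decompose a high-level task into subtasks, each assigned to one installed app."""
    prompt = (
        "You are a task planner on a smartphone.\n"
        f"Installed apps: {', '.join(apps)}\n"
        f"User task: {task}\n"
        "Decompose the task into subtasks, each solvable inside a single app.\n"
        'Reply with a JSON list like [{"subtask": "...", "app": "..."}].'
    )
    reply = call_llm(prompt)   # any chat-completion backend can be plugged in here
    plan = json.loads(reply)   # the model is asked to reply with pure JSON
    return [step for step in plan if step.get("app") in apps]  # drop hallucinated apps

# Example with a stub LLM that returns a fixed plan:
if __name__ == "__main__":
    stub = lambda _prompt: (
        '[{"subtask": "Find the meeting address in the latest message", "app": "SMS Messenger"},'
        ' {"subtask": "Start navigation to that address", "app": "Google Maps"}]'
    )
    print(plan_task("Drive to the place my friend just texted me",
                    ["SMS Messenger", "Google Maps"], stub))
```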
UI Navigating Agent. For each subtask T_i, the UI navigating agent completes it through a series of UI actions on the corresponding app A_i. The available options for the UI action include: 1. Touch <Button ID>, 2. Input <EditBox ID> with <text>, 3. Swipe <Scroller ID> <direction>. During each UI navigation step, the UI navigating agent selects the appropriate UI element e^(t) and generates the UI action a_t based on the current UI state s_t and the action history {a_1, a_2, ..., a_{t-1}}.
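A minimal sketch of one navigation step, under the same assumptions (a hypothetical call_llm wrapper, the UI state simplified to a list of element descriptions, and an illustrative JSON action format):

```python
import json

# The three UI action types: Touch, Input, Swipe.
def navigate_step(subtask: str, ui_state: list[dict], history: list[str], call_llm) -> dict:
    """One navigation step: choose element e^(t) and action a_t from state s_t and history."""
    prompt = (
        f"Subtask: {subtask}\n"
        f"Current UI elements: {json.dumps(ui_state)}\n"
        f"Previous actions: {history}\n"
        "Choose exactly ONE next action and reply as JSON:\n"
        '  {"action": "Touch", "id": <element id>}\n'
        '  {"action": "Input", "id": <element id>, "text": "..."}\n'
        '  {"action": "Swipe", "id": <element id>, "direction": "up|down|left|right"}'
    )
    action = json.loads(call_llm(prompt))
    history.append(json.dumps(action))  # record a_t for the next step's history
    return action
```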
The UI navigating agent also identifies and records the key UI elements involved in the task, represented as {e^(1), e^(2), ..., e^(k)}. Given the subjective nature of the UI element selection process, we establish basic standards by categorizing subtasks into two types: 1) information display tasks, where the agent accesses UI pages, searches for items, checks states or results, etc.; and 2) action and execution tasks, where the agent performs specific actions, such as managing app-specific settings, creating contacts or notes, or organizing digital files. LLMs are first prompted to classify each subtask. For each category, LLMs are guided to focus differently: for information display tasks, the emphasis is on identifying the relevant UI pages; for action and execution tasks, the focus shifts to outlining the steps necessary for task execution.
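The subtask classification itself could be prompted along the following lines; again a sketch, with the prompt wording and label names chosen for illustration rather than taken from the paper:

```python
def classify_subtask(subtask: str, call_llm) -> str:
    """Classify a subtask as 'information_display' or 'action_execution'."""
    prompt = (
        f"Subtask: {subtask}\n"
        "Is this an information display task (viewing pages, searching, checking states "
        "or results) or an action/execution task (changing settings, creating contacts "
        "or notes, organizing files)?\n"
        "Answer with exactly one word: information_display or action_execution."
    )
    label = call_llm(prompt).strip().lower()
    return label if label in ("information_display", "action_execution") else "information_display"

# Category-specific guidance when recording key UI elements.
SELECTION_FOCUS = {
    "information_display": "Identify the UI pages and elements that show the requested information.",
    "action_execution": "Outline the execution steps; record only the elements needed to act.",
}
```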
UI Reassembling Agent. The UI reassembling agent is responsible for designing an engaging and clear UI layout based on the selected UI elements {e^(1), e^(2), ..., e^(k)} provided by the UI navigating agent. Given the complexity of UI layout XML, which often exceeds the LLMs' output token limits, we use LLMs to create a JSON dictionary detailing the UI elements' properties. This dictionary is then converted into functional code for UI rendering. The resulting UI is interactive, linking the displayed UI elements back to their original app counterparts, allowing users to navigate and interact directly with the elements in the original app.
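To make the idea concrete, the sketch below shows a hypothetical JSON layout description for the recorded elements and a minimal conversion into an Android-style layout XML string. The element properties, deep links, and the XML target are illustrative assumptions; the paper only specifies that a JSON dictionary of element properties is produced and then turned into renderable code.

```python
import json
from xml.sax.saxutils import escape

# Hypothetical JSON layout description the LLM might emit for the recorded
# elements {e^(1), ..., e^(k)}: a compact dictionary instead of full layout XML,
# so the output stays within the model's token limit.
layout_json = """
{
  "title": "Trip to the meeting place",
  "elements": [
    {"type": "TextView",  "text": "Address: 1-2-3 Shibakoen, Minato-ku",
     "source_app": "SMS Messenger", "deep_link": "sms://thread/42"},
    {"type": "ImageView", "content_desc": "Route preview",
     "source_app": "Google Maps",   "deep_link": "geo:35.6586,139.7454"}
  ]
}
"""

def render_android_xml(layout: dict) -> str:
    """Convert the JSON description into a minimal Android-style layout XML string."""
    views = []
    for e in layout["elements"]:
        attrs = 'android:layout_width="wrap_content" android:layout_height="wrap_content"'
        if e["type"] == "TextView":
            attrs += f' android:text="{escape(e["text"])}"'
        # The comment records which app the element links back to (interactivity).
        views.append(f'    <{e["type"]} {attrs} />  <!-- from {e["source_app"]} -->')
    return (
        '<LinearLayout xmlns:android="http://schemas.android.com/apk/res/android"\n'
        '    android:orientation="vertical"\n'
        '    android:layout_width="match_parent" android:layout_height="match_parent">\n'
        + "\n".join(views) + "\n</LinearLayout>"
    )

print(render_android_xml(json.loads(layout_json)))
```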

PRELIMINARY RESULTS
We tested the UI selection method using ChatGPT-3.5 and the DroidTask dataset [3], which contains 158 tasks from 13 well-known apps. In DroidTask, each task is straightforward and linked to a single app, so we treated them as subtasks. To evaluate the task planning agent, we randomly combined every two DroidTask subtasks to form complex tasks, then used ChatGPT-4.0 to elaborate these tasks, with volunteers verifying their accuracy. We randomly chose 50 of these tasks to evaluate the planning agent. These complex tasks simulate user-generated tasks, and our agent aims to break them down into the original subtasks and their associated apps. We asked volunteers to categorize each task and identify the UI elements necessary for display and for task completion. If a task was of the information display type, they selected both the UI elements for the final display and the UI elements needed to complete the task; otherwise, only elements essential for task completion were chosen. We focus on the UI element selection results, ignoring errors made by the task completion agent for clarity. The results, presented in Table 1, show that the UI selection method achieved an F1 score of 0.74.
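The F1 score over selected UI elements can be computed by set matching against the volunteer annotations, e.g. as in the sketch below; treating elements as ID sets is our assumption for illustration, and the 0.74 figure comes from the evaluation above, not from this snippet.

```python
def element_selection_f1(predicted: set[str], ground_truth: set[str]) -> float:
    """F1 between agent-selected UI elements and volunteer-annotated ground truth."""
    if not predicted or not ground_truth:
        return 0.0
    tp = len(predicted & ground_truth)  # correctly selected elements
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(ground_truth)
    return 2 * precision * recall / (precision + recall)

# e.g. element_selection_f1({"btn_send", "txt_address"}, {"txt_address", "img_map"}) == 0.5
```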


Table 1: Performance of the UI element selection process.