MUD: Towards a Large-Scale and Noise-Filtered UI Dataset for Modern Style UI Modeling

The importance of computational modeling of mobile user interfaces (UIs) is undeniable. However, such modeling requires a high-quality UI dataset. Existing datasets are often outdated, having been collected years ago, and are frequently noisy, with mismatches between the view hierarchy and the visual representation. This presents challenges in modeling UI understanding in the wild. This paper introduces a novel approach to automatically mine UI data from Android apps, leveraging Large Language Models (LLMs) to mimic human-like exploration. To ensure dataset quality, we employ the best practices in UI noise filtering and incorporate human annotation as a final validation step. Our results demonstrate the effectiveness of LLM-enhanced app exploration in mining more meaningful UIs, resulting in a large dataset, MUD, of 18k human-annotated UIs from 3.3k apps. We highlight the usefulness of MUD in two common UI modeling tasks: element detection and UI retrieval, showcasing its potential to establish a foundation for future research into high-quality, modern UIs.


INTRODUCTION
As mobile apps increasingly become integral to daily life, research interest in developing various applications based on mobile UI screens has surged. These applications include UI element detection [17], screen embedding [47], widget captioning [49], icon labeling [35], and screen summarization [70], all of which enhance the interactive capabilities and accessibility of mobile phones. These tasks typically depend on mobile UI data, specifically UI screenshot images and the view hierarchy, which represents the elements contained on the screen, their attributes (e.g., position, dimensions), and the structural relationships between them.
Towards that goal, several mobile UI datasets have been collected, including ERICA [25], Gallery DC [17], Guigle [13], VINS [15], AMP [75], and Swire [42]. However, these existing datasets are either inaccessible or insufficient for data-driven modeling purposes. Rico [24] is the largest publicly available mobile UI dataset, comprising 66k UIs from 9.7k Android apps. It has been a primary data source for much UI understanding research and has been expanded to support numerous downstream applications. However, Rico was collected in early 2017 and has not been updated since. We conducted a small empirical study comparing the UIs in the Rico dataset with the latest UIs. The latter are designed in a modern style, boasting appealing visual appearances, user-friendly visual hierarchies, simplified user interactions, and readable typography. As a result, models trained on the outdated UIs in the Rico dataset may exhibit degraded performance on newer apps with updated design aesthetics.
On the other hand, many studies have identified that view hierarchies are often noisy [48]. Our small-scale pilot study on 500 UIs from the Rico dataset (Section 3.2) reveals three major issues with these view hierarchies in relation to the UI screens. First, UIs are typically captured at runtime, but UI rendering may take time, leading to the capture of partially rendered UIs. Second, due to the nature of UI framing, developers may use views to cover previous elements for simplicity, resulting in overlaid view hierarchies. Third, there remains a significant percentage of duplicate UIs in the dataset. Such noisy data are not useful as either input signals or output labels and might even degrade UI modeling performance. While many studies [15,48] propose noise removal in Rico, this leaves a smaller dataset, which reduces the generality of data-driven UI modeling.
In this paper, we present MUD, a large-scale, high-quality mobile UI dataset collected from the most recent apps. MUD is constructed by mining Android apps at runtime using a novel automated app exploration approach. Drawing inspiration from the success of Large Language Models (LLMs) in conversing as a professional expert, we frame app exploration as a question-and-answer (Q&A) task, i.e., asking the LLMs to play the role of an app expert interacting with the target app. Specifically, we provide the LLMs with the context of the current UI information via the view hierarchy and prompt for potential interactions to operate on the screen, thereby automatically exploring the apps. During this automated app exploration, we collect a large number of UIs, many of which could be noisy. To address this, we adopt mature techniques with best practices to automatically remove the noisy data in advance. We then enlist human annotators as the final line of defense to audit the UIs and view hierarchies, ensuring the quality of our dataset.
Our MUD dataset comprises 18,132 unique UIs from 3.3k apps, spanning 33 app categories. Results indicate that our proposed LLM-enhanced approach improves activity coverage by 17 percentage points over the best of three state-of-the-art tools. Additionally, we qualitatively investigate the capabilities of LLMs in app exploration and reveal three key findings in our discussion: semantic text input, compound action, and language insensitivity. To evaluate MUD's utility, we apply it to the two most common tasks in the literature: UI element detection and UI retrieval. Preliminary results demonstrate its value in UI modeling, potentially paving the way for further improvements toward UI intelligence.
To summarize, our paper makes the following contributions:
• We carry out an empirical study to examine the data issues present in the widely-used Rico dataset. This investigation motivates us to collect a high-quality dataset in a more modern style for data-driven UI modeling.
• We introduce a novel approach that elicits the capabilities of LLMs to automatically mine UIs from apps in a manner that mimics human exploration.
• We collect, annotate, and open-source the MUD dataset, which comprises 18k unique UI screens, each accompanied by a high-quality view hierarchy. We demonstrate the usefulness of the MUD dataset through two applications drawn from the literature: element detection and UI retrieval. These applications highlight MUD's potential as a valuable data source for extensive UI modeling research.

RELATED WORK
We conduct a review of research in three primary areas related to MUD, including automated app exploration, datasets for UI modeling, and applications of UI datasets.

Automated App Exploration
Previous studies primarily relied on human exploration to mine UIs, a process that can be time-consuming and prohibitively expensive [15,24,65]. In the same vein, many software testing researchers have developed tools to automatically explore apps to detect bugs and generate test scripts. While the target differs, these tools can facilitate our task, reducing human effort in UI collection. One of the earliest initiatives is Monkey [10], Google's official automated app exploration tool, designed to generate random user actions such as clicks, touches, or gestures, as well as several system-level actions on the UI. Subsequent efforts have focused on improving randomness strategies [55,56]. However, the random-based testing strategy often fails to formulate a reasonable exploration path based on the characteristics of the app, resulting in low exploration coverage and excessive time consumption.
Recent tools [12,50,54] have leveraged dynamic and static analysis to reverse engineer a stochastic model from UI for more robust automated app exploration. Gu et al. [41] introduced a UI event-refinement model that uses UI runtime information to evolve an initial model, generating precise actions. Su et al. [67] proposed Stoat, which assigns different probabilities to UI runtime elements for selection, achieving effective app exploration. Degott et al. [23] adopted reinforcement learning to identify valid interactions for a UI element (e.g., a button can be clicked but not dragged) to guide exploration. Although model-based automated tools can improve exploration coverage, the coverage remains low as these tools do not take human behavior into account.
Researchers have further proposed human-like strategies and designed learning-based automated app exploration tools. Zhou et al. [76] introduced a deep learning model that predicts the next element a user may click, based on the user's click history, the structural information of the UI screen, and the current context, such as the time of day. However, this user information can be challenging to obtain in practice. Li et al. [51] introduced Humanoid, which uses a sequence of UIs captured from actions to learn a model that predicts human-like interactions on the app. Nevertheless, it still struggles to fully understand the semantic information of the UI and plan actions according to the dynamic situation of the app.
Following the success of LLMs in many natural language processing tasks, researchers have started exploring their potential for semantic understanding of UIs. Wang et al. [69] examined the use of LLMs in enabling more natural and intuitive conversations between users and UIs, such as fact-based question-and-answer (e.g., "what is the app version number on the UI"). Feng et al. [30] introduced innovative prompting techniques to adapt LLMs for replicating bug steps from bug reports on mobile UIs. Our study, however, employs the knowledge nested in LLMs to deduce potential human-like actions on mobile UIs for app exploration. These models are trained on an ultra-large-scale corpus, which includes tutorials or reports on the web with natural-language descriptions of how specific actions can lead to certain outcomes in software. Eliciting in-context learning from these models can potentially aid in exploring the apps more effectively. More recently, Liu et al. [53] proposed QTypist, a fine-tuned LLM designed to generate text inputs for exploring more UI states in apps. Unlike their exclusive focus on text input generation, our research provides a nuanced, systematic understanding of the capabilities offered by LLMs, including complex compound actions and multilingual comprehension, as discussed in Section 4.4.1. This makes our work more general for mobile app exploration in a variety of contexts.

Datasets for UI Modeling
There have been several datasets gathered to support web UI modeling. For instance, Webzeitgeist [44] utilized an automated crawler to mine 103k designs of web pages and associated web elements with extracted properties such as HTML tag, size, font, and color. WebUI [71] crawled 400k web pages with semantic information. Table 1 summarizes the key differences between these existing datasets. Shirazi et al. [65] presented the first mobile UI dataset, manually mining 29k UIs from 400 Android apps. The ERICA dataset [25] provided a user-friendly web interface that allows users to interact with apps installed on Android devices, collecting 18.6k UIs from 2.4k Android apps. The Gallery DC dataset [31], created for UI element detection, was compiled from 5k app introduction UI screenshots from 68k Android apps in the Google Play Store. The Guigle dataset [13] comprised a large corpus of 5k apps with 12k UI screenshots to facilitate design search. The AMP dataset [75] collected 77k UI screens from 4k iOS apps. However, these datasets are either not publicly available or insufficient for data-driven UI modeling.
The Rico dataset [24] is recognized as the largest publicly available mobile UI dataset, containing 66k UI screens from 9.7k Android apps. It has been the primary data source for numerous UI modeling research studies [38,47,48,70]. Nevertheless, Rico has several shortcomings. First, various studies have identified errors and noise in the Rico dataset, such as instances where nodes in the view hierarchy do not align with the screenshot. To this end, efforts have been made to repair and filter these examples. The Enrico dataset [45] initially randomly sampled 10k examples from Rico, then cleaned and provided additional annotations for 1,460 of them. The VINS dataset [15] manually labeled a highly accurate view hierarchy for 4k UIs from Rico. The Clay dataset [46] aimed to denoise Rico through a pipeline of automated machine-learning models and human annotators to provide accurate element labels. However, after cleaning, these datasets may not be suitable for data-driven modeling due to their small size. Second, Rico was collected in early 2017 and has not been updated since. Many of the latest popular apps adhere to newer design guidelines, like Material Design for Android [6]. Modeling based on an outdated UI dataset may result in decreased performance on new designs, thus limiting the effectiveness of modeling UI understanding in current apps.
In our study, we construct MUD, a large-scale, high-quality mobile UI dataset, collected from the most recent apps. We design a novel tool that automatically navigates through these apps, mining a large number of UIs in the process. Recent extensive studies [36,39] reveal that even the most recent apps can still harbor noise and errors. To ensure the dataset's quality, we utilize mature techniques to carefully filter out noise and errors in advance. We further enlist human effort to validate the dataset's quality. As a result, MUD contains 18k unique UIs from 3.3k recent popular Android apps. We publicly release MUD to encourage further research on modeling UIs in a modern design style.

Applications of UI Datasets
Originally, many studies relied on pixel-based or heuristic matching [26,74]. The advent of UI datasets, like those mentioned earlier, has opened up avenues for developing more robust computational models, particularly those derived from visual data. In this paper, our focus is on two prevalent uses of UI datasets, namely: (i) element detection at the element level, and (ii) UI retrieval at the screen level.
Element detection is a process that pinpoints the location and type of UI elements within a screenshot. This process forms the basis for many subsequent tasks, such as the repair of accessibility metadata [19,32,75], software testing [20,34,37,72,73], and code generation [18,57]. Zhang et al. [75] suggested an on-device method for element detection on the screen to support accessibility features by training a Faster R-CNN model. However, the robustness and generality of these models are significantly impacted by the noise present in the UI datasets [15]. We found that our large-scale clean UI dataset, MUD, could potentially enhance performance, as discussed in Section 5.1.
Other research involves modeling UI screen retrieval to aid designers in data-driven searches, creating design examples by stimulating inspiration, generating new ideas, and examining design decisions from UI repositories. Numerous attempts have been made to enhance retrieval algorithms to obtain more relevant UI designs, utilizing deep learning [16,33,35,42,58], reverse engineering [17,28,29,57], and app information [47]. However, the significance of UI repositories has often been neglected. Retrieving UI designs from outdated UI repositories, such as the Rico dataset from 2017, may not inspire modern designs. Our MUD dataset is driven by the desire to mine the most recent popular UI designs, thereby offering designers a chance to draw inspiration from new design trends.

EMPIRICAL STUDY
In this section, we report a small empirical study conducted to understand new UI designs and identify potential noise in existing datasets, thereby highlighting the need for appropriate dataset support. We chose the Rico dataset [24] as the subject of our study, given its widespread use as a UI dataset. We randomly gathered 500 UIs from 27 app categories to form our experimental dataset. To gain insights into this dataset, we recruited six students as annotators. These students were recruited via the university's internal Slack channel and were compensated at a rate of USD 12 per hour. To ensure accurate annotations, we began with an initial training that included reading instructions related to UI understanding, learning the rules of labeling, and passing an assessment test before the participants could begin the various UI annotation tasks.

Do the apps feature updated UI designs?
To determine whether the UIs in the Rico dataset have been updated, we compared the UIs to the corresponding latest versions of the apps. To do this, we first developed a web crawler to scrape the most recent versions of the apps in our experimental dataset from the Google Play Store. In total, we gathered 132 apps. We then asked the annotators to manually explore the apps to identify the corresponding UI designs. We opted for human exploration because 1) the previously recorded interaction trace to the UIs in Rico may differ in the new app due to new functionalities and usage scenarios; and 2) automated app exploration tools might not uncover the corresponding UIs. As such, we instructed the annotators to familiarize themselves with the UIs and the previous interaction traces in the apps. This contextual information could potentially help annotators navigate the apps more easily to find the corresponding UIs. After an average of 4 hours of human exploration, we obtained 82 pairs of UI designs from both the Rico dataset and the new app version.
To better comprehend the evolution of design styles from old to new, we instructed the annotators to independently categorize the UI pairs using existing UI/UX design knowledge, as documented in resources like The Design of Everyday Things [61] and Mobile Design Pattern Gallery [60]. Following the initial categorization, the annotators convened to discuss discrepancies and establish a set of new categories, continuing until a consensus was reached. We identified four primary types of UI design updates from Rico, as illustrated in Fig. 1.
First, the visual appearance of mobile UI is evolving, encompassing aspects such as color, contrast, and element design. For instance, as depicted in Fig. 1(a), an appealing visual appearance can guide users, optimize readability and accessibility, and ensure a superior user experience. Second, improvements in the visual hierarchy are being made to better accommodate human perception within the UI. Principles such as the Gestalt theory [40], derived from recent research in psychology and biological vision, provide guidelines for visual UI design cues like connectivity, similarity, proximity, and continuity. These principles, demonstrated in Fig. 1(b), are widely used to enhance user experience and usability. Third, interactions are being optimized to simplify the user experience, enabling users to complete tasks more quickly and efficiently, thereby boosting user engagement. For instance, the interaction for selecting a date is simplified in the new design shown in Fig. 1(c). Lastly, the importance of typography, underestimated in earlier years, is increasingly recognized. A well-designed typeface and style can effectively highlight words and capture users' attention. For example, in Fig. 1(d), the text in the UI is updated to the "Footlight" font, which is easy to read on small screens due to its well-spaced letters and clear strokes.

What are the prevalent noises in the UI?
To examine the extent of noise and errors in the UI datasets, we asked the annotators to label the noisy data in the sample of 500 experimental UIs. During this manual review process, we observed various types of UI noise. Categorizing these could help elucidate the issues and streamline our UI collection process. Based on previous studies [15,46], we asked the annotators to independently identify categories and discuss any discrepancies. Following the Card Sorting method [66], we identified three prevalent types of noise in the UI datasets:
1) Partially Rendered UIs (16%). UI rendering is the act of generating a frame and displaying it on the screen, including transitioning from the previous page, loading resources from the internet, and drawing UI objects (like buttons) into pixels. The duration of this process can vary based on the quality of the app code, device performance, and internet bandwidth. In our experimental dataset, 16% of the UIs are partially rendered, as shown in Fig. 2(a). This could be due to a short waiting delay during screenshot capturing.
2) Overlaid View Hierarchies (21%). Apps often incorporate numerous functionalities, which can lead to overlapped designs in the user interface. Some developers might oversimplify app development by overlaying new views on top of the existing UI. While this doesn't result in any visual differences, as the previous elements are entirely covered by the new view, it does lead to a misalignment in the view hierarchy, as demonstrated in Fig. 2(b).
3) Duplicated UIs (2%). We discovered that 2% of the UIs are duplicates. This could be attributed to two main factors. First, the process of exploring apps to collect UIs often involves navigating back and forth, which can result in duplicated states. Second, certain interactions may trigger a toast [9], a brief message displayed on the screen for users, which disappears automatically after a short period of time. Consequently, the UIs collected before and after the appearance of the toast are identical, leading to duplication in the dataset.
Summary: Upon analyzing 500 UIs from Rico, we found that the latest versions of the corresponding apps are now designed in a more contemporary style. Despite the considerable size of the Rico set, 39% of the sampled UIs are affected by noise, including partially rendered UIs (16%), overlaid view hierarchies (21%), and duplicated UIs (2%). These old-fashioned UI designs and the presence of noise in Rico could significantly impact data-driven modeling in current UI tasks. This underscores the need for a high-quality dataset featuring UIs designed in the new fashion.

THE MUD DATASET
Given that manually exploring apps can be both time-consuming and labor-intensive, we propose a novel automated app exploration method. This method encourages Large Language Models (LLMs) to act as app experts and facilitates interactions within the app. To address the potential noise outlined in Section 3.2, we employ mature techniques and best practices from prior works. Lastly, as a final safeguard, we implement manual validation to ensure the quality of the dataset. The overview of our dataset collection is shown in Fig. 3.

App Collection
The quality of apps can directly influence the quality of the UIs. The more popular the apps are, the higher the quality of the UIs, and the more beneficial they are to data-driven UI modeling. To achieve this, we develop a parallelizable, cloud-based web crawler to collect the most popular apps from the Google Play Store. Specifically, our crawler is composed of (i) a crawling coordinator server that monitors visited and queued URLs, (ii) a pool of crawler workers that scrape URLs using a headless browser, and (iii) a database service that stores artifacts uploaded by the workers. The coordinator server distributes seeds to the workers based on the Breadth-First search strategy [59], which involves collecting a queue of URLs from a seed list and using the queue as the seed in the second stage of iteration. The crawler worker is implemented using a headless framework [3] to interface with the Chrome browser. We also introduce simple heuristics to automatically dismiss certain types of pop-ups (e.g., GDPR cookie warnings) to facilitate access to page content.
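The breadth-first coordination described above can be illustrated with a minimal sketch. It is not the crawler used to build MUD: the function name `bfs_crawl`, the `fetch_links` callback (standing in for a worker that scrapes a page with a headless browser), and the toy link graph are all assumptions made for illustration.

```python
from collections import deque
from typing import Callable, Iterable, List


def bfs_crawl(seeds: Iterable[str],
              fetch_links: Callable[[str], Iterable[str]],
              max_pages: int = 1000) -> List[str]:
    """Visit URLs breadth-first and return them in visiting order."""
    visited = set()        # URLs already handed to a worker
    queue = deque(seeds)   # URLs waiting to be scraped (FIFO gives BFS order)
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        # A real worker would scrape the page here; the callback stands in for it.
        for link in fetch_links(url):
            if link not in visited:
                queue.append(link)
    return order


# Toy link graph standing in for store pages (hypothetical data).
toy_links = {"home": ["games", "tools"], "games": ["app1"], "tools": ["app1", "app2"]}
print(bfs_crawl(["home"], lambda url: toy_links.get(url, [])))
```

Keeping the visited set and the queue in one place mirrors the role of the coordinator server, while the callback could be replaced by a pool of workers without changing the traversal order.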

Automated App Exploration
Large Language Models (LLMs) [14,22,52,62] trained on ultra-large-scale corpora have shown promising performance in areas such as natural language understanding, logical reasoning, and question answering. The success of ChatGPT [5] exemplifies how an LLM can comprehend human knowledge and interact with humans as an informed expert. Inspired by ChatGPT, we reframe app exploration as a Question & Answering (Q&A) task, where we ask the LLMs to act as an app expert and interact with the screen to explore the app. An overview of our approach is depicted in Fig. 4. Specifically, we first prompt the LLMs to play the role of app expert with: "You are an app expert tasked with exploring apps for maximum coverage." Then, we dump the view hierarchy that represents the current UI screen. With the view hierarchy serving as the context for available interactive elements (i.e., those flagged with a "clickable" attribute) in the UI, we then prompt the LLMs with the following question: "How would you interact with the following UI <UI>?" To translate the LLMs' responses into actionable operation scripts for executing the app, we also equip the LLMs with a set of interaction primitives, like: "Please answer the interactions in the following format: ..."
Our approach is implemented as a fully automated app exploration tool. For the LLMs, we employ the gpt-3.5-turbo model [5] from OpenAI. Since LLMs may generate verbose output (like repeated questions or reasoning), we use "[]" to delimit the specific feedback for operations, such as action, target component resource_id, and input value. Given that LLMs may generate a sequence of potential actions, i.e., complex compound actions, to facilitate exploration as depicted in Fig. 4, we employ regular expressions to extract the numerical index, which gives the order of operations. We use Android UIAutomator [2] to dump the UI view hierarchy and Android Debug Bridge (ADB) [1] to execute the operations on the device. Additionally, we explore automatic login policies by activating the Google account authentication [7], which allows a successful login into the Google account to unlock apps and enable the exploration of features hidden behind login screens. To lessen the noise from partially rendered UIs, as detailed in Section 3.2, we establish a relatively lengthy waiting period between each operation to allow the UI to fully render. According to a small-scale pilot experiment, we have set this waiting time to 2 seconds.
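To make the prompting and parsing steps concrete, the following is a minimal sketch under stated assumptions. The role and question prompts are quoted from the text, but the exact answer format is not reproduced here, so the numbered, bracketed format in `FORMAT_HINT`, the `Operation` tuple, and both regular expressions are hypothetical choices for illustration only. Sending the prompt to the model and executing the parsed operations on a device are outside the scope of the sketch.

```python
import re
from typing import List, NamedTuple, Optional

ROLE_PROMPT = "You are an app expert tasked with exploring apps for maximum coverage."
QUESTION = "How would you interact with the following UI {ui}?"
# Hypothetical answer format: the paper's actual primitives are not shown in this text.
FORMAT_HINT = ("Please answer the interactions in the following format: "
               "<index>. [action] [resource_id] [input value, if any]")


class Operation(NamedTuple):
    index: int            # position in a compound action sequence
    action: str           # e.g. "click" or "input" (assumed vocabulary)
    resource_id: str      # target component
    value: Optional[str]  # text to type, when the action needs one


STEP_RE = re.compile(r"^\s*(\d+)\.\s*(.*)$")   # leading numerical index
FIELD_RE = re.compile(r"\[([^\[\]]*)\]")       # bracket-delimited fields


def build_prompt(view_hierarchy: str) -> str:
    """Combine role, question with UI context, and output-format hint."""
    return "\n".join([ROLE_PROMPT, QUESTION.format(ui=view_hierarchy), FORMAT_HINT])


def parse_operations(reply: str) -> List[Operation]:
    """Extract ordered operations, ignoring verbose lines without the format."""
    operations = []
    for line in reply.splitlines():
        step = STEP_RE.match(line)
        if not step:
            continue  # e.g. repeated questions or free-form reasoning
        fields = FIELD_RE.findall(step.group(2))
        if len(fields) < 2:
            continue  # need at least an action and a target
        value = fields[2] if len(fields) > 2 else None
        operations.append(Operation(int(step.group(1)), fields[0].strip().lower(),
                                    fields[1].strip(), value))
    return sorted(operations, key=lambda op: op.index)
```

Sorting by the extracted index keeps compound actions, such as ticking a checkbox before pressing an agree button, in the order the model intended even if the reply lists them out of order.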

Validation of Noise
During such automated app exploration, the tool automatically collects pairs of runtime UI screenshots and their corresponding runtime UI view hierarchies. Although we have implemented some best practices in the tool to minimize UI noise, such as allowing a long waiting time for partially rendered UIs, some residual noise remains, as discussed in Section 3.2, such as overlaid view hierarchies and duplicated UIs. To address this, we utilize mature techniques from previous studies to mitigate this noise in our UI datasets.
First, we utilize a heuristic method from [17] to discern duplicated UIs by encoding the underlying element hierarchy of a UI. Specifically, we assign a unique type_index to each type of UI element; for instance, Button is designated as 0, TextView as 1, RatingBar as 2, and so on. We then obtain the node_index (starting at 0) of each UI element in the UI and concatenate an element's node_index with its type_index. For example, if the first UI element is a TextView, it can be represented as "0_1". Consequently, we generate a representation string for a UI by concatenating the representation strings of all its UI elements. We then hash this representation string using the MD5 algorithm [64] to produce a fixed-length identifier for the UI. If two UIs share the same identifier, they are considered identical, and the duplicate is removed.
Second, we utilize a deep learning model proposed in [46] to automatically identify overlaid view hierarchies. Specifically, we frame the detection of overlaid view hierarchies as a binary classification task: determining whether the elements in the view hierarchy are overlaid or misaligned. We train a binary classification model using ResNet, with the input being a four-channel matrix. The first three channels correspond to the original pixels of the UI image, while the fourth, or mask channel, indicates the bounding box of the element under examination. With the mask channel, the model is able to recognize the element's location and concentrate on the element pixels to make a prediction. We then filter the overlaid view hierarchy based on the model's prediction of how likely it is that the element is overlaid.
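The hashing heuristic above can be sketched as follows. The type indices for Button, TextView, and RatingBar follow the example in the text; the comma separator between element tokens, the on-the-fly assignment of indices to unseen element types, and the function names are assumptions made for illustration.

```python
import hashlib
from typing import Dict, List


def ui_representation(element_types: List[str], type_index: Dict[str, int]) -> str:
    """Encode a UI as '<node_index>_<type_index>' tokens, one per element."""
    tokens = []
    for node_index, element_type in enumerate(element_types):
        if element_type not in type_index:
            # Assumption: unseen element types receive the next free index.
            type_index[element_type] = len(type_index)
        tokens.append(f"{node_index}_{type_index[element_type]}")
    # Assumption: a comma separator keeps adjacent tokens unambiguous.
    return ",".join(tokens)


def ui_identifier(element_types: List[str], type_index: Dict[str, int]) -> str:
    """Hash the representation string with MD5 into a fixed-length identifier."""
    representation = ui_representation(element_types, type_index)
    return hashlib.md5(representation.encode("utf-8")).hexdigest()


def deduplicate(uis: List[List[str]], type_index: Dict[str, int]) -> List[List[str]]:
    """Keep the first UI seen for each identifier and drop later duplicates."""
    seen = set()
    unique = []
    for ui in uis:
        identifier = ui_identifier(ui, type_index)
        if identifier not in seen:
            seen.add(identifier)
            unique.append(ui)
    return unique


type_index = {"Button": 0, "TextView": 1, "RatingBar": 2}
print(ui_representation(["TextView", "Button"], type_index))  # prints 0_1,1_0
```

Because only element types and their order enter the hash, two screens that differ solely in a transient toast or in text content would map to the same identifier under this sketch, which matches the intent of removing such duplicates.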
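The four-channel input described above can be assembled as in the sketch below, which assumes NumPy is available. Only the input construction is shown, not the ResNet classifier; the scaling of pixels to [0, 1], the use of 1.0 inside the bounding box, and the `(left, top, right, bottom)` bounds convention are assumptions rather than details given in the text.

```python
import numpy as np


def four_channel_input(screenshot: np.ndarray, bounds: tuple) -> np.ndarray:
    """Stack an RGB screenshot with a binary mask of one element's bounding box.

    screenshot: H x W x 3 uint8 array.
    bounds: (left, top, right, bottom) in pixels, right and bottom exclusive.
    """
    if screenshot.ndim != 3 or screenshot.shape[2] != 3:
        raise ValueError("expected an H x W x 3 RGB array")
    height, width, _ = screenshot.shape
    left, top, right, bottom = bounds
    # Clip the box to the image so off-screen bounds do not raise errors.
    left, right = max(0, left), min(width, right)
    top, bottom = max(0, top), min(height, bottom)

    rgb = screenshot.astype(np.float32) / 255.0            # channels 0-2: pixels
    mask = np.zeros((height, width, 1), dtype=np.float32)  # channel 3: element mask
    mask[top:bottom, left:right, 0] = 1.0
    return np.concatenate([rgb, mask], axis=-1)
```

A classifier consuming this tensor would need its first convolution to accept four input channels instead of three; one such tensor is built per element under examination.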
In addition to automated noise removal, we also employ manual validation as the last line of defense. We engage the six students mentioned in Section 3 to meticulously validate the dataset. We follow a strategy similar to that of [46] to ensure accurate annotations, creating a web interface that enables participants to annotate the UI dataset efficiently. The interface (Fig. 10 in Appendix) displays a screenshot of the mobile UI, along with the bounding boxes

Data Analysis
The collection process ran from February 27, 2023, to August 19, 2023, yielding 18k unique UIs from 3.3k Android apps. In this section, we present a comprehensive analysis of our dataset MUD, including exploration coverage and dataset statistics.

Exploration Coverage.
To measure the coverage benefits of our LLM-enhanced automated exploration approach, we compare it with three commonly used and state-of-the-art baselines: a random-based tool, Monkey [10]; a model-based tool, Droidbot [50]; and a learning-based tool, Humanoid [51]. To further demonstrate the advantage of our approach, we set up two ablation studies. Given that we propose a role instantiation prompt to define the task objective of app exploration, we consider a variant of our approach without the tailored prompt, Ours w/o role, to compare the performance of our approach with and without role instantiation. We further investigate the contribution of action primitives to the prompt with a second variant, Ours w/o primitives, to see the impact of output formatting. For this variant, we follow the previous work [48] to extract specific action operations from the natural language output.
We use the default configuration settings for each tool and record activity by running the apps for 15 minutes. We use activity coverage as the evaluation metric, i.e., collecting all the activities defined in each app from AndroidManifest.xml following previous studies [21,54], and measuring the percentage of the explored activities during runtime. However, extracting all the activities from the apps can be time-consuming, especially for closed-source and highly confidential apps. Therefore, we conduct experiments with 30 apps randomly selected from the dataset, each of which has a rating higher than 4 stars (out of 5) and has been downloaded more than a million times.
Fig. 5 displays the activity coverage of our approach in comparison to the baselines and ablations. Our approach achieves a median activity coverage of 0.60 across 30 mobile apps. In contrast, Monkey, which relies on random user actions to explore the app, can only achieve an average activity coverage of 0.32. Our approach even surpasses the best baseline (Humanoid) in activity coverage by 17 percentage points (0.43 vs. 0.60). This demonstrates the effectiveness of our approach in covering the majority of activities within the apps. We can also see that applying the prompt of action primitives can significantly improve the performance of our approach, leading to an improvement of 11% in activity coverage. This is due to the fact that the feedback generated by LLMs tends to be verbose (e.g., repeated questions and reasoning) and ambiguous (e.g., "create an account name of test" where "create" refers to the "input" action), resulting in incorrect action extraction. In addition, augmenting the prompt with role instantiation leads to a 5% increase in activity coverage, suggesting that LLMs can focus more effectively on app navigation to trigger additional activities.
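The activity coverage metric can be computed as in the sketch below. It assumes the manifest has already been decoded to plain-text XML (manifests inside an APK are stored in a binary format) and that the names of explored activities were logged during the run; the function names are hypothetical, and only leading-dot shorthand names are resolved against the package name.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Set

# Namespace prefix used for android:* attributes in the manifest.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"


def declared_activities(manifest_xml: str) -> Set[str]:
    """Collect fully qualified activity names declared in a decoded manifest."""
    root = ET.fromstring(manifest_xml)
    package = root.get("package", "")
    names = set()
    for node in root.iter("activity"):
        name = node.get(ANDROID_NS + "name")
        if not name:
            continue
        if name.startswith("."):
            name = package + name  # shorthand such as ".MainActivity"
        names.add(name)
    return names


def activity_coverage(declared: Set[str], explored: Iterable[str]) -> float:
    """Fraction of declared activities that were reached at runtime."""
    if not declared:
        return 0.0
    return len(declared & set(explored)) / len(declared)
```

For example, an app that declares five activities and has three of them reached during the 15-minute run would score 0.6 under this definition.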
To fully understand the capability of our approach, we carry out a qualitative study, examining cases where our LLM-enhanced tool outperforms the baselines. We identify three key capabilities. First, our tool can automatically populate valid text content into the input element, which is crucial for navigating through the page. This is primarily due to the LLMs' training on a large-scale corpus, which captures semantically correlated text for input, for example, the email input "example@gmail.com" and a valid 9-digit mobile number, as seen in Fig. 6(a). Second, our tool can generate complex compound operations. While the baselines focus on a single action on the UI, sometimes a series of actions in the right order is required to reach the next state. This is facilitated by the LLMs' ability to understand the causal relationships between the elements in the UI. As illustrated in Fig. 6(b), to get through the state, it first clicks on the checkbox to accept the policy, and then clicks on the agree button. Third, our approach is not sensitive to multilingual apps. As seen in Fig. 6(c), it allows for the exploration of international apps in various languages. Moreover, the app in this example requires that all multilingual questions be answered correctly before it proceeds to the next stage, which our approach is able to achieve. This is because the LLMs are trained on a diverse set of texts in various languages, e.g., at least 95 natural languages for the GPT model [4]. The capability of our LLM-enhanced approach results in a performance boost of 28%, 23%, and 23% in activity coverage compared to the baseline tools Monkey, Droidbot, and Humanoid, respectively, across the 10 (33%) mobile apps supporting multiple languages in our experimental dataset.

Dataset Statistics.
Our dataset is composed of 3.3k apps spread over 33 different categories. Fig. 7(a) illustrates the diverse distribution of app categories, covering education, finance, food & drink, etc. Fig. 7(b) presents the ratings of these apps, while Fig. 7(c) displays the download statistics of these apps. The apps in our dataset have an average download count of 71 million and hold an average rating of 4.17 out of 5.0 stars, emphasizing the popularity and high quality of the apps in our collection.
Our automated app exploration has amassed a collection of 46k UI screenshots, each paired with its corresponding view hierarchy. We automatically filter out duplicates and overlaid view hierarchies as discussed in Section 4.3, which results in the removal of 22k UIs (automatically eliminating 85.7% of the noise). To further ensure the quality of our dataset, we also carry out manual validation, leaving 18k UIs (39.1% of the collected UIs) in the MUD dataset.

DOWNSTREAM TASKS FROM MUD
We have showcased the effectiveness of our approach in collecting a high-quality UI dataset, MUD. In this section, we further highlight the potential value of MUD across two common UI modeling tasks, namely element detection and UI retrieval.

Element Detection
UI element detection is a specialized task within object detection, which involves identifying the locations and types of UI elements from a screenshot. It can be useful as a standalone task or as an initial step for more advanced UI modeling.
Experimental Setup. We conduct experiments using Faster-RCNN [63], one of the highest-performing object detection models evaluated on public datasets. The model employs a convolutional neural network (CNN) to extract image features from the input UI screenshot. It then uses a region proposal network (RPN) to generate region proposals that likely contain an object of interest, as opposed to just background. These region proposals are referred to as regions of interest. Finally, it utilizes an object detection network that predicts object classification scores for the region proposals and determines the object bounding box.
Regarding our training/testing data split, a simple random split would not effectively evaluate the model's generalizability, as screens within the same app may exhibit very similar visual appearances. To avoid this data leakage issue [43], we split the screens in the dataset by apps, ensuring that representations of app categories are similar in both the training and testing datasets. The resulting split includes 12,471 (70%) UIs in the training dataset, 2,771 (15%) in the validation dataset, and 2,890 (15%) in the testing dataset. We adopt the configuration from previous work [20] to train the model, employing early stopping on the validation metric to reduce the risk of overfitting. We also train a baseline model following the same configuration but using the entire Rico dataset for training and our testing dataset for evaluation. We assess our model's performance using Average Precision (AP), a standard evaluation metric for object detection. We select a threshold of > 0.5 IoU (Intersection over Union), a common benchmark used in object detection challenges [27], to pair a detection with a ground truth UI element.
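The app-level split described above can be sketched as follows. This is a minimal illustration, not the exact procedure used in the study: it assigns whole apps to splits at random and does not balance app categories, and the function name, ratios over apps (rather than screens), and seed are assumptions.

```python
import random
from collections import defaultdict

def split_by_app(screens, ratios=(0.70, 0.15, 0.15), seed=0):
    """Assign whole apps to train/val/test so that no app spans two splits.

    screens: list of (app_id, screen_id) pairs.
    ratios:  fractions of apps for train and val; the rest goes to test.
    """
    by_app = defaultdict(list)
    for app_id, screen_id in screens:
        by_app[app_id].append(screen_id)

    apps = sorted(by_app)                 # deterministic order before shuffling
    random.Random(seed).shuffle(apps)

    n_train = int(len(apps) * ratios[0])
    n_val = int(len(apps) * ratios[1])
    groups = {
        "train": apps[:n_train],
        "val": apps[n_train:n_train + n_val],
        "test": apps[n_train + n_val:],
    }
    return {
        name: [(app, screen) for app in members for screen in by_app[app]]
        for name, members in groups.items()
    }

# Hypothetical data: 20 apps with 3 screens each.
screens = [(f"app{i}", j) for i in range(20) for j in range(3)]
parts = split_by_app(screens)
print({name: len(items) for name, items in parts.items()})
```

Because apps contribute different numbers of screens, splitting by app yields screen-level proportions that only approximate the target ratios.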
Results. Table 2 presents the Average Precision (AP) at an Intersection over Union (IoU) of 0.5 for each of the 9 classes, for models trained on the Rico and MUD datasets. We can see that the model trained on MUD achieves a higher AP across all 9 classes and improves the average AP by 10.5% over the model trained on Rico. Fig. 8 provides examples of detection results from our MUD. This superior performance could be attributed to two potential factors. First, the elements in the Rico dataset may be outdated, so training on old-fashioned UIs hinders the model's capacity to detect more modern elements and poses challenges to modeling contemporary UIs. Second, the noisy view hierarchy in the Rico dataset may significantly impact the performance of element detection. In contrast, our MUD dataset features a clean view hierarchy, which enhances the detector's ability to recognize elements of interest, resulting in improved performance.
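The IoU criterion behind these AP figures can be illustrated with a small sketch. It handles a single class, represents boxes as (x1, y1, x2, y2) tuples, and marks a detection as a true positive only if its best-overlapping ground-truth box exceeds the threshold and has not already been claimed by a higher-scoring detection. The function names are hypothetical, and the matching rule is a common convention rather than a detail confirmed by the text.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def match_detections(detections, ground_truth, threshold=0.5):
    """Return true-positive flags for detections, ordered by descending score.

    detections:   list of (score, box) pairs.
    ground_truth: list of boxes; each can be matched at most once.
    """
    matched = set()
    flags = []
    for _score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best_idx, best_iou = None, 0.0
        for idx, gt in enumerate(ground_truth):
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_idx, best_iou = idx, overlap
        is_tp = (best_idx is not None
                 and best_iou > threshold
                 and best_idx not in matched)
        if is_tp:
            matched.add(best_idx)
        flags.append(is_tp)
    return flags
```

AP for a class is then obtained from the precision-recall curve traced out by these flags as the score threshold is lowered.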

UI Retrieval
Searching for relevant UI design examples can assist designers in drawing inspiration and comparing design alternatives [15]. This task emphasizes the similarity between UI screens, specifically, identifying the top-N most similar screens in the dataset.
Experimental Setup. We conduct experiments using the autoencoder model proposed in previous work [24], which supports unsupervised learning of lower-dimensional representations of the UI. The model first extracts the layout of a UI using the bounding boxes of all the UI elements in the view hierarchy, distinguishing between text and non-text elements using different colors. Then, the model employs a typical encoder-decoder architecture, where the encoder maps the layout image to a lower-dimensional vector, while the decoder is optimized to map this lower-dimensional vector back to the original image. In detail, the encoder has an input dimension of 11,200, followed by two hidden layers of sizes 2,048 and 256, with an output dimension of size 64. The decoder has the reverse architecture, i.e., from 64 to 256, 2,048, and then 11,200. We use the mean squared error (MSE) between the input of the encoder and the output of the decoder to train the model.
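The layer shapes and the reconstruction loss described above can be sketched with NumPy. Only the layer sizes and the MSE objective come from the text. The ReLU activations, the weight initialization, and all function names are assumptions, and the sketch covers the forward pass only, not training.

```python
import numpy as np

# Layer sizes from the text: 11,200 -> 2,048 -> 256 -> 64, mirrored in the decoder.
ENCODER_SIZES = [11200, 2048, 256, 64]

def init_layers(sizes, rng):
    """Create (weight, bias) pairs for each pair of consecutive layer sizes."""
    return [
        (rng.normal(0.0, 0.01, size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]

def forward(x, layers):
    """Apply dense layers, with ReLU between them (the activation is an assumption)."""
    for i, (w, b) in enumerate(layers):
        x = x @ w + b
        if i < len(layers) - 1:
            x = np.maximum(x, 0.0)
    return x

def encode_and_score(x, encoder, decoder):
    """Return the embedding and the MSE between the input and its reconstruction."""
    code = forward(x, encoder)
    recon = forward(code, decoder)
    return code, float(np.mean((x - recon) ** 2))

# Small stand-in sizes keep the demo light; pass ENCODER_SIZES for the full shapes.
rng = np.random.default_rng(0)
demo_sizes = [20, 8, 4]
encoder = init_layers(demo_sizes, rng)
decoder = init_layers(demo_sizes[::-1], rng)
batch = rng.random((3, demo_sizes[0]))   # three flattened layout images
code, mse = encode_and_score(batch, encoder, decoder)
print(code.shape, mse)
```

After training, the 64-dimensional output of the encoder serves as the embedding used to compare screens.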
We recruit three participants with backgrounds in UI/UX design and art practice to evaluate the performance of UI design retrieval. At the beginning of the experiment, we provide them with an introduction to our study and ask them to review previous studies [15,42] to better understand the principles of UI retrieval. Subsequently, we randomly select 5 query UIs in our dataset and, for each query, use our model to retrieve the top-5 most similar UIs. Similarly, we retrieve the top-5 most similar UIs from the Rico dataset as the baseline. The participants then individually rate the results on a five-point Likert scale (1: not related at all, 5: highly related) to assess their relevance to the query. Note that the participants do not know whether a result comes from the Rico or the MUD dataset.
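The top-5 retrieval step can be sketched as a nearest-neighbor search over screen embeddings. The use of Euclidean distance, the function name, and the toy two-dimensional vectors are assumptions made for illustration. The text does not state which distance is used.

```python
import math

def top_n_similar(query, gallery, n=5):
    """Return the ids of the n gallery embeddings closest to the query.

    query:   embedding vector of the query UI.
    gallery: dict mapping screen id -> embedding vector.
    Distance is Euclidean (an assumption, not confirmed by the text).
    """
    ranked = sorted(gallery, key=lambda sid: math.dist(query, gallery[sid]))
    return ranked[:n]

# Toy 2-d embeddings; real embeddings would be the 64-d encoder outputs.
gallery = {"a": [0.0, 0.0], "b": [1.0, 1.0], "c": [5.0, 5.0]}
print(top_n_similar([0.9, 0.9], gallery, n=2))
```

A linear scan like this is adequate for a dataset of this size. Larger collections would typically use an approximate nearest-neighbor index instead.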
Results. Fig. 9 shows several example query UIs and their retrieved designs from the Rico and MUD datasets. All participants note that the retrieved results are relevant to the input queries and would serve as valuable design examples. Overall, participants assign our dataset an average relevance score of 4.1 out of 5.0, compared to a score of 3.2 for the Rico dataset. This indicates the potential usefulness of our dataset in retrieving similar UI designs and inspiring creativity. One participant underscores the key contribution of our dataset, i.e., the inclusion of modern UI designs, stating: "As a designer, I frequently need to keep up with the latest trends in UI design. However, the UI designs on the existing design-sharing platforms like Dribbble are often more artistic than practical, which doesn't help in gaining 'practical' inspiration. On the other hand, the existing UI design repositories like Rico are outdated and often lack the updated features and functionalities required for contemporary user interface design. I appreciate this new UI design repository. And hope it could keep updating."

DISCUSSION
We analyze the effectiveness of the proposed automated app exploration approach in Section 4.4 and explore the usefulness of the collected UI dataset MUD in Section 5. In this section, we delve deeper into the implications of our research, the limitations of our approach, and how future studies can expand upon our work.

Improving Automated App Exploration
The advantages of LLMs in performing semantically-driven app explorations are discussed in Section 4.4.1, including valid text input, complex compound actions, and multilingual content understanding. These advantages might be due to the knowledge from extensive instructional resources like WikiHow [11] and PixelHelp [8], which provide step-by-step app instructions. While we use gpt-3.5-turbo as our model in the study, we believe that other LLMs trained on similar resources, such as PaLM [22] and the open-sourced LLaMA model [68], could also deliver comparable or even better performance. In our future work, we aim to conduct a comparative analysis of these LLMs' performance to gain insights into each model's unique strengths and limitations. To elicit exploration guidance, we prompt the LLMs with UI view hierarchy information as the context. However, mobile apps contain multiple sources of information, including app descriptions and functionality introductions. Such information could potentially assist our tool in prioritizing the exploration of the most crucial UIs in the apps, thereby enhancing the efficiency of UI collection. Another limitation of our investigation is our exclusive use of view hierarchy information as the context, while other information remains unused. For instance, a historical context of previously explored UI states could help LLMs reduce redundant exploration. We also hypothesize that providing the app name as context may improve our approach's performance, as LLMs may have been exposed, in their training corpus, to tutorials containing step-by-step instructions or descriptions of how to trigger certain features. Future research could incorporate these information sources as context for the LLMs, potentially leading to more meaningful and efficient exploration.
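One way to picture the additional context discussed above is a prompt builder that optionally prepends the app name and a short exploration history to the view hierarchy. This is a hypothetical sketch: the prompt wording, the element fields (`type`, `text`), and the function name are illustrative assumptions, not the prompt used in this work, and no LLM API is called.

```python
def build_prompt(view_hierarchy, app_name=None, history=None, max_history=5):
    """Assemble a text prompt from the current UI plus optional extra context.

    view_hierarchy: list of dicts with hypothetical keys "type" and "text".
    app_name:       optional app name to include as context.
    history:        optional list of previously explored screen names.
    """
    parts = []
    if app_name:
        parts.append(f"App: {app_name}")
    if history:
        recent = history[-max_history:]   # keep the prompt short
        parts.append("Previously explored screens: " + "; ".join(recent))
    parts.append("Current UI elements:")
    parts.extend(f"- {el['type']}: {el['text']}" for el in view_hierarchy)
    parts.append("Suggest the next interaction to reach an unexplored screen.")
    return "\n".join(parts)

# Hypothetical usage with a two-element screen.
ui = [{"type": "checkbox", "text": "Accept policy"},
      {"type": "button", "text": "Agree"}]
print(build_prompt(ui, app_name="Demo", history=["Splash", "Login"]))
```

Truncating the history to the most recent screens is one simple way to limit prompt length. Whether it actually reduces redundant exploration would need to be evaluated.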
Beyond improving the LLMs, automated app exploration may be blocked at the login activity. We propose a novel automatic login approach using Google account authentication, which ensures 100% successful login into Google accounts to unlock apps. However, some apps might only offer customized registration. The LLMs could simulate account details like usernames and passwords. While these details seem reasonable, apps often require account activation through third-party verification such as email, messaging, or CAPTCHA. This leads to registration failure, thereby obstructing the app exploration. In the future, we aim to systematically investigate these failures and design a comprehensive registration/login system. This would involve monitoring the incoming activation message, activating with proposed steps, and providing feedback to log into the app.

Increasing the Dataset
As data-driven computational methods for modeling UIs continue to rise, datasets have become a pivotal resource for understanding UI. In line with this, we have been collecting and annotating the MUD dataset, a large mobile UI dataset comprising modern UI screens and high-quality view hierarchies. One limitation of our dataset is that it only captures screenshots and view hierarchies. Other artifacts, such as interaction traces and UI animations, could also prove valuable for UI understanding. Although we captured these artifacts during our automated app exploration, they can appear noisy due to our process of UI removal, rather than their intrinsic characteristics. For instance, the corresponding UIs in the interaction trace could be significantly impacted by the state duplication. We plan to investigate this issue and propose an automated algorithm to refactor the trace, thereby broadening UI modeling opportunities for future research. Aside from multi-modal artifacts, our dataset only collects UIs from Android apps. Constructing a dataset from different platforms, such as iOS and the web, could yield similar benefits. We anticipate that our automated app exploration approach, which relies on view hierarchy for guidance, could be extended to these platforms, given that they expose view hierarchy data or an equivalent. In the future, we aim to broaden our approach to collecting UI datasets across various platforms to enhance the research on UI understanding in a more general context.

Improving UI Understanding
Our experiments have initially showcased the utility of MUD in two common downstream tasks as described in Section 5. A logical progression would be to evaluate the performance of our dataset in other UI modeling tasks, such as screen summarization [70], screen question-answering [69], etc. We expect that these models can attain relative performance improvements consistent with the findings in Section 5. The implications of these improvements may also shed light on the progression of UIs over the years, potentially serving as a proactive mechanism to remind researchers of advancements in UI understanding. Beyond that, a long-term objective of our data collection and high-quality human annotation efforts is to achieve a more generalized and modern understanding of UIs. This would involve developing an advanced model capable of comprehending the latest UI designs and their semantics, ultimately enabling an intelligent UI agent to tackle various UI problems.

CONCLUSION
In this paper, we present MUD, a dataset of 18k UIs paired with visual and hierarchy information to aid UI modeling. Unlike most existing datasets for UI research, which often lack sufficient data or contain noisy and erroneous data, MUD is collected using a novel approach that incorporates best practices. In detail, we employ Large Language Models (LLMs) to simulate an app expert for automatically exploring apps, thereby collecting an order of magnitude more UIs. Subsequently, we utilize the best practices of UI noise filtering techniques to eliminate noise and further implement human annotation to ensure the final quality of the dataset. We demonstrate the effectiveness of LLMs-enhanced automated app exploration in generating human-like actions, facilitating more thorough and efficient app exploration compared to three state-of-the-art tools. Moreover, we highlight the utility of our dataset MUD by modeling two common UI tasks: element detection and UI retrieval. The dataset MUD will be released to the public to encourage further research and lay the groundwork for UI modeling based on high-quality and modern UI designs.

Figure 1: Examples of four types of updates in the new UI design, including visual appearance, visual hierarchy, user interaction, and typography.

Figure 2: Examples of noises in the Rico dataset. The orange bounding box presents the elements in the view hierarchy.

Figure 3: The overview of our dataset collection process.

Figure 4: Illustration of our LLMs. We prompt the model to suggest potential interactions, based on the current UI view hierarchy, in order to achieve maximum coverage.

Figure 5: The performance of our LLMs-enhanced exploration method compared to three automated exploration tools, including Monkey, Droidbot, and Humanoid, and two ablation studies, including the prompt without role instantiation and action primitives.
Figure 6: Examples of the capability of our approach. (a) Valid text input, where the email is entered as "example@gmail.com" and the mobile number is a valid 9-digit number. (b) Compound actions, which involve first ticking the checkbox to accept the policy, followed by clicking on the agree button. (c) Multilingual understanding, where the app does not proceed to the next UI until the multilingual questions are answered correctly.

Figure 7: Summary of the app statistics of the MUD dataset: (a) category distribution, (b) average rating distribution, and (c) installation distribution.

Figure 8: Examples of the element detection in MUD dataset.

Figure 9: Examples of query results from Rico and MUD datasets.

Figure 10: In the interface, we highlight the UI screenshot along with the bounding boxes extracted from the view hierarchy. Annotators can flag the UI as invalid and select the reasons, such as partially rendered UI, overlaid view hierarchy, duplicate UI, or explain the other reason.

Table 1: A comparison of MUD with other open-source UI datasets.