CrashTranslator: Automatically Reproducing Mobile Application Crashes Directly from Stack Trace

Crash reports are vital for software maintenance since they allow developers to be informed of problems encountered in mobile applications. Before fixing a crash, developers need to reproduce it, which is an extremely time-consuming and tedious task. Existing studies conduct automatic crash reproduction from reproducing steps described in natural language. Yet we find that a non-negligible portion of crash reports contain only the stack trace when the crash occurs. Such stack-trace-only crashes merely reveal the last GUI page when the crash occurs and lack step-by-step guidance. Developers tend to spend more effort in understanding the problem and reproducing the crash, and existing techniques cannot work on such reports, calling for automatic support. This paper proposes an approach named CrashTranslator to automatically reproduce mobile application crashes directly from the stack trace. It accomplishes this by leveraging a pre-trained Large Language Model to predict the exploration steps for triggering the crash, and designing a reinforcement learning based technique to mitigate inaccurate predictions and guide the search holistically. We evaluate CrashTranslator on 75 crash reports involving 58 popular Android apps; it successfully reproduces 61.3% of the crashes, outperforming the state-of-the-art baselines by 109% to 206%. Besides, the average reproducing time is 68.7 seconds, outperforming the baselines by 302% to 1611%. We also evaluate the usefulness of CrashTranslator with promising results.


INTRODUCTION
New mobile applications are being developed and released continuously via app stores since the market for mobile devices is both growing and diversifying. Recent statistics show that more than 5 million mobile apps are available in popular marketplaces like the Apple App Store and Google Play Store, accounting for over 140 billion downloads in 2022, and 12 million mobile developers are maintaining them [55]. As developers add more features and capabilities to their apps to make them more competitive, the corresponding increase in app complexity has made testing and maintenance activities more challenging. The competitive app marketplace has also made these activities quite important for an app's success. As shown in a survey, 88% of app users would abandon an app if they repeatedly encountered a functionality issue [2]. This motivates developers to identify and resolve issues rapidly, or risk losing users otherwise.
An important mechanism for ensuring app quality is the online bug reporting system, e.g., GitHub Issue Tracker [21], Bugzilla [29], and Google Code Issue Tracker [9]. These systems enable users to create bug reports in which they describe the failure they observed; developers can then use this information to help debug their apps. These bug reports have become a non-negligible source of information for improving app quality and user satisfaction.
Once developers receive a bug report, one of the first steps in debugging the reported issue is to reproduce it by following the reproducing steps. Several existing studies focus on crash reproduction from reproducing steps described in natural language [72][73][74]. They typically apply natural language processing techniques to match the reproducing steps with the app's GUI events (i.e., operations on GUI widgets, e.g., clicking the search button of an app), and employ guided exploration strategies with the matched information for bug reproduction. However, not all crash submitters strictly follow the report template and provide the reproducing steps when reporting a crash.
Our motivational study (Section 2) reveals that a non-negligible portion (20.2%) of crash reports contain only the stack trace. Such stack-trace-only reports can be submitted by crash reporting tools such as Crashlytics [10], which automatically collects crash logs and uploads them; this is also common practice in large commercial software. These reports can also be submitted by app users who accidentally trigger the crash yet fail to figure out the reproducing steps. Due to the insufficient information provided in these reports, developers tend to spend extra effort in understanding and reproducing the issues, which results in a longer fixing duration, i.e., the average fixing duration of these stack-trace-only reports is 26% longer than that of crash reports with reproducing steps. Besides, the aforementioned existing approaches fail to work on stack-trace-only reports. This further implies the necessity of an automatic crash reproduction approach working directly from stack traces.
Existing studies on stack trace analysis address stack trace similarity [30,49,60], fault localization from stack traces [23,26,42,65,67], test code generation from stack traces [7,50,52-54,68], duplicate crash report detection with stack traces [11,48], etc. Although these approaches can facilitate the understanding and analysis of stack traces, none of them can tackle the problem of automatic crash reproduction from the stack trace. There are two challenges in the automatic reproduction of these crashes.
First, as mentioned above, existing studies generally utilize textually-described reproducing steps for crash reproduction, yet stack-trace-only crashes lack step-by-step guidance. In other words, the stack trace does not record the exploration sequence from the entry page to the crash-occurring page; the most useful information it reveals is often only the last GUI page when the crash occurs. However, there can be 1 to 8 exploration steps for reaching the last crash-occurring GUI page (based on our experimental data), which can be quite inefficient if explored randomly. Although static or dynamic analysis techniques [19,56,70] can infer the transitions between activities and plan the exploration path, they can be quite incomplete or inaccurate [69]. Second, even when the last crash-occurring page is reached, certain interactions with the app may still be needed to finally trigger the crash, e.g., clicking a certain button. Nevertheless, there can be an average of 6.6 interactive widgets on a GUI page (based on our experimental data), and the stack trace does not explicitly indicate which widget to interact with for crash triggering, which further complicates the automated crash reproduction problem.
Nevertheless, we also find two clues that facilitate the reproduction of stack-trace-only crashes. The first is the last crash-occurring GUI page before the crash is triggered, which offers the target for the planned exploration. The second is the set of APIs involved in the program code when conducting operations on the crash-occurring page, which can help find the widget with which to interact so that the crash can finally be triggered.
Motivated by these clues, we propose an approach named CrashTranslator to automatically reproduce mobile application crashes directly from the stack trace. It accomplishes this by leveraging a pre-trained Large Language Model (LLM) to predict the exploration steps for triggering the crash, and designing a reinforcement learning based technique to mitigate inaccurate predictions and guide the search holistically.
In detail, we first extract crash-related information from the stack trace, i.e., the crash-occurring GUI page and the crash-involved APIs. Second, we design three scorers to assign the exploration priority for each GUI widget on the current page: 1) the page reaching scorer, which leverages the LLM to choose the GUI widget that may lead to the crash-occurring page; 2) the widget hitting scorer, which utilizes a heuristic method to find the crash-triggering widget by matching the crash-involved APIs; 3) the exploration optimization scorer, which assigns scores based on previous interaction records via a reinforcement learning technique, in order to bridge the gap caused by inaccurate predictions of the first two scorers and plan the exploration holistically. CrashTranslator selects the GUI widget based on the three scorers and continues the process iteratively until the target crash is triggered. It finally generates a replay script (for direct replay) and textually-described reproducing steps with step-by-step image instructions (for facilitating understanding).
To evaluate the effectiveness and efficiency of our approach, we run CrashTranslator on 75 crash reports collected from 58 popular Android apps across three datasets. CrashTranslator successfully reproduces 61.3% (46/75) of the crashes, which outperforms the state-of-the-art baselines by 109% to 206%. Besides, the average reproducing time is 68.7 seconds, outperforming the baselines by 302% to 1611%. Furthermore, the results also show that both the designed page reaching scorer and the widget hitting scorer greatly contribute to the reproduction performance. The usefulness evaluation shows that the reproducing steps generated by CrashTranslator make the crashes easier to reproduce (215% faster).
The contributions of this paper are as follows:
• Dimension. The first work on automatic crash reproduction of mobile applications directly from the stack trace.

MOTIVATIONAL STUDY
We conduct a motivational study to investigate how common stack-trace-only crash reports are, their characteristics, and the challenges of reproducing them.
In detail, we choose GitHub as the data source since it contains a large number of publicly available valid bug reports. We use the web crawler provided by Wendland et al. [63] to automatically crawl the bug reports from Android projects, and focus on the reports created from Jan. 2015 to May 2022, resulting in 96,451 bug reports. We then filter the bug reports involving application crashes with keywords such as "crash" and "exception", following existing studies [63,74]. As a result, we acquire 10,843 Android crash reports for our motivational study.

Is it Common for Stack-trace-only Crash?
A well-formulated crash report usually contains a crash overview, textually-described reproducing steps, the stack trace when the crash occurs, and visual recordings (screenshots/GIFs) of how the crash occurs. However, not all issue submitters strictly follow the report template and provide all crash-related information when reporting a crash.
We utilize keywords and heuristic pattern matching 2 to automatically examine whether a crash report contains reproducing steps and stack traces, following existing studies [63,74]. Results reveal that 20.2% (2,187 / 10,843) of crash reports contain only stack traces, i.e., stack-trace-only crash reports. Such reports can be submitted by crash reporting tools such as Crashlytics [10], which automatically collect crash logs and upload them to the issue server when the app crashes, as shown in Figure 2 (a). Furthermore, commercial software can also have such auto-generated stack-trace-only crash reports [14,27], which further indicates the universality of such reports. Besides, these reports can also be submitted by app users or developers who accidentally trigger the crash yet fail to figure out the reproducing steps, as shown in Figure 2 (b). This also implies the necessity of an automatic crash reproduction approach.

Is it Difficult to Handle Such Crash?
We go a step further to investigate whether it is difficult to handle (e.g., reproduce, fix) such crashes. Since we can hardly obtain the time or effort for crash reproduction from GitHub, we turn to the commonly-used issue fixing duration, i.e., the duration between the issue creation time and the closing time [12], to indicate the difficulty of handling such crashes.
2 Details are on our website: https://github.com/wuchiuwong/CrashTranslator

For crash reports with reproducing steps, the average issue fixing duration is 57 days, while for stack-trace-only crash reports, the duration increases to 72 days (26% higher). We assume that, due to the insufficient information provided in stack-trace-only crash reports, developers need to spend extra effort in understanding the problem and reproducing the issue, and thus have a lower willingness and take more time to fix the issue. Therefore, it is highly desirable to automate the reproduction of stack-trace-only crash reports to save this effort and facilitate follow-up issue fixing.

Why is it Difficult?
Challenge 1: Lack of step-by-step guidance. Existing approaches take the textually-described reproducing steps as input and conduct bug reproduction guided by the steps [72][73][74]. However, for a stack-trace-only crash report, we cannot fetch the reproducing steps and thus lack step-by-step guidance to trigger the crash. By comparison, we can possibly derive the crash-occurring activity or fragment, i.e., the last interactive GUI page when the crash occurs, e.g., InstalledSearchEnginesSettingsFragment as demonstrated in Figure 1. But an automatic approach still needs to speculate the exploration steps for navigating to the crash-occurring page. In our experimental dataset (shown in Section 4.1), there can be 1 to 8 exploration steps for reaching the crash-occurring GUI page, which can be quite inefficient if explored randomly.
Challenge 2: Need for specific interactions even after reaching the last GUI page. The second challenge is that even if the crash-occurring GUI page is reached, certain interactions with the app may still be needed to finally trigger the crash. As shown in Figure 1, after reaching the GUI page, one still needs to click Search engine on the crash-occurring GUI page at step 5 to trigger the crash. In our experimental dataset (shown in Section 4.1), there can be an average of 6.6 interactive widgets on a GUI page, which further complicates the reproduction of stack-trace-only crashes.

Are There any Clues for Reproduction?
While facing the above challenges, stack traces do provide clues for automated reproduction. Specifically, as mentioned in Challenge 1, we can derive the crash-occurring GUI page, which offers the target for the planned exploration. Besides, when conducting operations on the crash-occurring GUI page, there can be corresponding invocations in the program code, and the stack trace outputs the involved APIs, e.g., refetchSearchEngines and onResume in Figure 1.
To summarize, the stack trace offers two clues, i.e., the crash-occurring GUI page and the APIs involved when interacting with the app on the crash-occurring page. We can utilize the first clue to predict the exploration steps for navigating to the crash-occurring GUI page, and then use the second clue to find the widget by interacting with which the crash can finally be triggered.
(Figure 1. Reproducing steps: Step 1: Click "SKIP" (Intro page) → Step 2: Click "More Options" (Main page) → Step 3: Click "Settings" (Menu of main page) → Step 4: Click "Search" (Settings page) → Step 5: Click "Search engine" (Settings page).)

Summary: Our analysis of 10,843 crash reports of Android apps from GitHub shows that 20.2% of crash reports only contain the stack trace, and these stack-trace-only reports consume 26% more time for issue fixing. This can be because they lack step-by-step guidance for planning the reproduction. Our findings confirm the necessity and challenges of crash reproduction directly from the stack trace. We also observe two clues that motivate our approach for automated crash reproduction.

APPROACH
Motivated by the above findings, we propose an automated approach named CrashTranslator to reproduce crash reports of mobile apps directly from the stack trace. It accomplishes this by leveraging a pre-trained Large Language Model (LLM) to predict the exploration steps for triggering the crash, and designing a reinforcement learning based technique to mitigate inaccurate predictions and guide the search holistically. As demonstrated in Figure 3, given a stack-trace-only crash report, CrashTranslator first derives the crash-occurring GUI page and the crash-involved APIs for triggering the crash. It designs three scorers to assign the exploration priority for each interactable GUI widget on the current page: 1) the page reaching scorer leverages the LLM to choose the next GUI page for reaching the crash-occurring GUI page, and the widget for transferring to the next page (Section 3.2); 2) the widget hitting scorer utilizes a heuristic method to find the crash-triggering widget by matching the crash-involved APIs (Section 3.3); and 3) the exploration optimization scorer assigns scores based on the previous interaction records of a reinforcement learning technique, aiming to bridge the gap caused by inaccurate predictions of the first two scorers and plan the exploration holistically (Section 3.4). CrashTranslator selects the GUI widget based on the three scorers and repeats the process iteratively until the target crash is triggered. It finally generates a replay script (for direct replay) and textually-described reproducing steps with step-by-step image instructions (for facilitating understanding).

Preprocessing
We first conduct preprocessing on the stack trace and the target app to prepare the information for reproducing the crash. Specifically, three types of information are used in the crash reproduction.
App's package name and the names of all activities in the app. We first decompile the target app and get its configuration file (AndroidManifest.xml), which records the package name of the app and the names of all activities it contains. In this paper, we consider the activity name as the GUI page name.
Crash-involved APIs. We extract the lines that contain the app's package name from the stack trace, which indicate the crash-involved APIs in the app, e.g., org.mozilla.focus.search.SearchEngineListPreference.refetchSearchEngines in blue color in Figure 1.
Crash-occurring page. From the crash-involved APIs extracted in the second step, we then check whether any term coincides with one of the app's activity names (extracted in the first step). If so, we treat that activity as the crash-occurring page; otherwise, the crash may occur in a fragment, and we simply extract the name containing the keyword Fragment from the stack trace, e.g., InstalledSearchEnginesSettingsFragment in red color in Figure 1.
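The page-derivation logic above can be sketched as follows (a minimal illustration; the function name and input format are ours, not from CrashTranslator's implementation):

```python
def crash_occurring_page(crash_apis, activity_names):
    """Pick the crash-occurring page: prefer an activity whose name appears
    in a crash-involved API; otherwise fall back to a 'Fragment' class name."""
    for api in crash_apis:
        for activity in activity_names:
            if activity in api:
                return activity
    for api in crash_apis:
        for part in api.split("."):
            if "Fragment" in part:
                return part
    return None

page = crash_occurring_page(
    ["org.mozilla.focus.fragment.InstalledSearchEnginesSettingsFragment.onResume"],
    ["MainActivity", "SettingsActivity"])
# -> 'InstalledSearchEnginesSettingsFragment'
```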
For better understanding and to reduce noise, we further tokenize the extracted page names (activity names), crash-involved APIs, and the crash-occurring page name by underscores and camel case, and remove stop words, for follow-up usage.
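The tokenization step can be sketched as follows (a minimal sketch; the stop-word list here is illustrative, not the one used by the tool):

```python
import re

STOP_WORDS = {"the", "of", "a", "an"}  # illustrative stop-word list

def tokenize(identifier):
    """Split an identifier by underscores and camel case, lowercase the parts,
    and drop stop words."""
    parts = []
    for chunk in identifier.split("_"):
        # camel-case split: "refetchSearchEngines" -> ["refetch", "Search", "Engines"]
        parts += re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", chunk)
    return [p.lower() for p in parts if p.lower() not in STOP_WORDS]

print(tokenize("InstalledSearchEnginesSettingsFragment"))
# ['installed', 'search', 'engines', 'settings', 'fragment']
```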

Page Reaching Scorer
To reach the crash-occurring GUI page, intuitively, we can interact with the GUI widgets that share similar names with the crash-occurring page, e.g., clicking Search in step 4 can reach the crash-occurring installed search engines settings fragment as shown in Figure 1. However, there can be a long sequence of exploration steps for reaching the crash-occurring page, and the widgets in the early and middle parts of the sequence tend to look irrelevant to the crash-occurring page, e.g., More options in step 2 looks totally different from the crash-occurring page as shown in Figure 1. To tackle this, we leverage the LLM to predict the exploration steps for reaching the crash-occurring page iteratively. Specifically, as shown in Figure 4, we first ask the LLM to predict the next page for reaching the crash-occurring page; we then ask the LLM to choose the widget for transferring to the predicted next page; and we iterate this process. This step assigns scores to the widgets, indicating their probabilities of reaching the crash-occurring GUI page, i.e., their priorities of being chosen. We provide the LLM with all of the app's pages and the current GUI page, and ask the LLM to speculate the next page toward the crash-occurring page, by which the whole exploration sequence is generated iteratively.
Input. There are three inputs, i.e., I1-I3 as shown in Table 1. 1) PageNames, i.e., the names of all activities, are extracted from the AndroidManifest.xml as described in Section 3.1; 2) CurrentPage is the activity name of the current GUI page during the iterative process. To distinguish between pages with the same activity name but different widgets on them (e.g., the pages in steps 2 and 3 in Figure 1), we divide the pages into three types: menu, dialog, and general pages. For a menu or dialog page, we name the CurrentPage as "page_type of activity_name" (e.g., menu of main in step 3 in Figure 1). 3) CrashPage, i.e., the name of the crash-occurring page, is extracted from the crash stack trace as described in Section 3.1.

Prompt generation. To design the prompt, we follow the regular prompt template [5,8,25]; each of the three authors is asked to write the prompt sentence for this task with 10 trial apps, and a discussion is conducted to derive the final prompt pattern, i.e., P1 as shown in Table 1. Take the last step in Figure 4 as an example: we first tell the LLM what pages are contained in the app (There are 8 pages in the app, named: intro, main, setting, ...), then tell the LLM what we want to do along with information about the current page and the crash-occurring page (I want to go from the menu of main page to the installed search engines settings page), and finally ask what the next page should be (What is the next page?). The LLM will predict the name of the next page (setting).
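Assembling the P1-style prompt from the three inputs can be sketched as below (the function name is ours, and the wording paraphrases the pattern described above rather than reproducing the exact template):

```python
def build_next_page_prompt(page_names, current_page, crash_page):
    """Assemble a P1-style prompt asking the LLM for the next page
    toward the crash-occurring page."""
    pages = ", ".join(page_names)
    return (
        f"There are {len(page_names)} pages in the app, named: {pages}. "
        f"I want to go from the {current_page} page "
        f"to the {crash_page} page. What is the next page?"
    )

prompt = build_next_page_prompt(
    ["intro", "main", "setting", "search"],
    "menu of main", "installed search engines settings")
print(prompt)
```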
Fine-tuning. To achieve better performance, we build a fine-tuning dataset and fine-tune the LLM to learn the transition relations between app pages. Specifically, we collect 1,000 apps of different categories from F-Droid [17] and extract their activity transition graphs (ATGs) with Gator [19], one of the state-of-the-art static analysis tools. We then utilize these transition relations as data for model fine-tuning. Although a static analysis tool like Gator cannot obtain the full ATG, which is why we do not directly use it for path planning, by inputting the incomplete ATGs from different apps, the LLM has the potential to combine the diversified viewpoints and speculate the desired transitions. Note that the fine-tuning process is a one-time requirement and does not necessitate individual fine-tuning for each app.

Predicting Transfer Widget for Reaching Next Page.
After knowing the next page, we need to know how to get there, e.g., which button to click. We provide the LLM with all interactable GUI widgets on the current page, then ask the LLM to choose the widget for transferring to the next page.
Input. Besides the three inputs used in the previous section, this step needs a fourth input, i.e., InteractableWidgets (I4 as shown in Table 1). It is the list of names of all widgets that can be interacted with on the current GUI page, and we obtain it from the view hierarchy file of the current page. Specifically, we first filter the interactive widgets from all widgets based on whether the clickable or long-clickable property is true. Then, we leverage heuristic rules to extract a representative name for each widget, following existing studies [38,39,74]. In detail, for text-like widgets (e.g., Button, TextView, etc.), we obtain their textual attributes (e.g., text, content-description, resource-id) and utilize the first non-empty one. For icon-like widgets (e.g., ImageButton, ImageView, etc.), we extract their name from their contextual text information (e.g., nearby text, sibling text, child text) and use the first non-empty one. Finally, we tokenize the extracted widget name by underscores and camel case, and group widgets according to their container widgets to express the current page's layout, e.g., What's New, Help, Settings in a list in step 3 of Figure 4.
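The name-extraction heuristic for text-like widgets can be sketched as follows (attribute keys follow the ones listed above; the function itself is our illustrative reconstruction, not the tool's code):

```python
def widget_name(attributes):
    """Return the first non-empty textual attribute of a text-like widget,
    in the priority order described above."""
    for key in ("text", "content-description", "resource-id"):
        value = attributes.get(key, "")
        if value:
            return value
    return ""

name = widget_name({"text": "", "content-description": "Settings",
                    "resource-id": "org.mozilla.focus:id/settings"})
# -> 'Settings'
```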
Prompt generation. Following the same procedure as in the previous section, we arrive at the prompt pattern, i.e., P2 in Table 1. Take the last step in Figure 4 as an example. Like the prompt in the previous section, we first tell the LLM what pages are contained in the app and our ultimate goal; we then tell it the predicted next page (The next page may be the setting page), provide the names of all widgets on the current page (Here are widgets I can click: what's new, help, settings in a list, ...), and finally ask the LLM to choose the optimal transfer widget (What should I click?). The LLM will output a widget which it thinks could lead to the target page (settings).
Since the LLM might not always output the correct transfer widget, we let it provide a ranked list of candidate widgets (i.e., top 5), and CrashTranslator considers these ranked widgets during the exploration optimization in Section 3.4. To realize this, we repeat the widget prediction with a new prompt in which we remove the interactable widgets that have already been predicted, and let the LLM provide a new answer. This process is repeated five times, and we obtain five distinct widgets in order. We then assign a numerical score to each widget as 1/(i + 2), where i is the widget's rank (e.g., the top-1 widget scores 1/3 ≈ 0.33, the top-2 widget 1/4 = 0.25).
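This rank-to-score mapping is a one-liner; a minimal sketch (the helper name is ours):

```python
def rank_scores(ranked_widgets):
    """Assign 1/(i + 2) to the i-th ranked widget (i starting at 1)."""
    return {w: 1 / (i + 2) for i, w in enumerate(ranked_widgets, start=1)}

scores = rank_scores(["settings", "help", "what's new", "history", "bookmarks"])
# top-1 'settings' -> 1/3, top-2 'help' -> 1/4, ..., top-5 'bookmarks' -> 1/7
```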
Fine-tuning. Similar to the prior section, we fine-tune the LLM for better performance. We use the commonly-used RICO dataset [13], which contains plenty of pairs of a target GUI page and the transfer widget for reaching it. We randomly sample 1,000 such data pairs (involving 629 apps) as the fine-tuning dataset. Note that the fine-tuning process here also only needs to be conducted once.

Widget Hitting Scorer
Some crashes can be triggered as soon as the crash-occurring page is reached, yet in most cases, specific events or event combinations need to be performed on the crash-occurring page to finally trigger the crash, e.g., although we reach the installed search engine settings page at step 4 in Figure 1, the crash is not triggered until we click Search engine at step 5. We utilize the crash-involved APIs to infer the crash-triggering widgets. Take Figure 1 as an example: the stack trace involves the API refetchSearchEngines, from which we can infer that clicking Search engine on the crash-occurring page would likely trigger the crash.
To automatically find these crash-triggering widgets, we propose a lightweight matching method between widgets and the crash-involved APIs extracted from the stack trace. Specifically, we first tokenize the names of widgets and crash-involved APIs by underscores and camel case. If there is an overlap at the token level between the name of a widget and an API, i.e., at least one token is the same after stemming or is an abbreviation of the other, we assume the widget is a candidate crash-triggering widget and assign a score based on the percentage of overlapping tokens.
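The matching can be sketched as below. Note this is a simplified sketch: the paper uses stemming and abbreviation checks, which we approximate here with a crude prefix test.

```python
def overlap_score(widget_tokens, api_tokens):
    """Score a widget by the fraction of its tokens matching API tokens.
    A token 'matches' if it is equal to, or a prefix of (or prefixed by),
    an API token -- a rough stand-in for stemming/abbreviation matching."""
    def match(a, b):
        return a == b or a.startswith(b) or b.startswith(a)
    hits = sum(1 for w in widget_tokens if any(match(w, t) for t in api_tokens))
    return hits / len(widget_tokens) if widget_tokens else 0.0

score = overlap_score(["search", "engine"],
                      ["refetch", "search", "engines"])
# 'search' matches exactly, 'engine' prefixes 'engines' -> 2/2 = 1.0
```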

Exploration Optimization Scorer
Ideally, the crash can be reproduced with the predicted transfer widgets for arriving at the crash-occurring page (Section 3.2) and the crash-triggering widget after reaching the crash-occurring page (Section 3.3). Yet, in practice, the predictions can be inaccurate, and there might be complex transitions in the app which the prediction model cannot capture. Therefore, we design the exploration optimization scorer to help plan the exploration holistically and bridge the gap caused by inaccurate predictions of the first two scorers.
We leverage Q-learning [62], a reinforcement learning method, to help conduct the exploration. The basic idea is to maintain a Q-table, which stores the values of all widgets, to record the crash-related and exploration information and memorize the valuable widgets during trial-and-error exploration. When we first reach a new page, the value of each widget on the page is initialized to 0 and updated by a reward or penalty after the widget is interacted with (details below).

Formulating Exploration as MDP.
In our approach, we define an instance of the Markov decision process (MDP) to describe the exploration process and adopt Q-learning to optimize the exploration. The MDP can be defined as a 4-tuple ⟨S, A, P, R⟩, where S refers to the set of states, A refers to the set of actions, P refers to the transition function, and R refers to the reward function. In the context of crash reproduction, each page in the app represents an individual state. We extract the interactable GUI widgets on each page, and interactions (click, long click, or type text) with widgets constitute the action set of the state. When we perform an action a_t (an interaction on a widget), the app's state changes from state s_t to state s_{t+1}. We first record the transition ⟨s_t, a_t, s_{t+1}⟩ to the transition function P, and then assign a reward r_t based on the reward function R. The reward function R generates a value indicating the quality of a performed action, e.g., whether interacting with a certain widget relates to the crash. Our reward function evaluates an action (i.e., the corresponding widget) as the sum of the following three aspects.
Crash triggering reward. When an action involves the crash-related elements, i.e., reaching the crash-occurring page or triggering crash-involved APIs, the corresponding widget receives a large positive reward since exploration through these widgets has a larger possibility of triggering the crash. The next time the approach reaches the page, it may choose the widget again, and a combination containing the widget might trigger the crash.
New state reward. When an action explores a new GUI page, the corresponding widget receives a small positive reward to encourage exploring new states, especially at the beginning of the exploration. As the exploration goes on, the new state reward of a widget can be balanced out by the duplicate state penalty.
Duplicate or failure state penalty. When a widget transfers to a known page, it receives a small penalty since it is less likely to trigger the crash. Besides, when an action transfers to a page outside the app (e.g., opening a browser) or triggers a non-target crash, the corresponding widget receives a large penalty; it will not be chosen again since it cannot trigger the target crash.
Next, we update the value of the widget interaction Q(s_t, a_t) recorded in the Q-table with the Bellman function:

Q(s_t, a_t) ← Q(s_t, a_t) + α · (r_t + γ · Q*(s_{t+1}, a_{t+1}) − Q(s_t, a_t))

where α refers to the learning rate and is set to 0.1, γ refers to the discount factor and is set to 0.9, and Q*(s_{t+1}, a_{t+1}) refers to the maximum value over all actions in state s_{t+1}.
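With these settings, one Q-table update can be sketched as below (a minimal dict-based sketch; the data structure is ours, not the tool's):

```python
ALPHA, GAMMA = 0.1, 0.9  # learning rate and discount factor from the paper

def q_update(q_table, state, action, reward, next_state):
    """One Bellman update on the Q-table (dict of dicts: state -> action -> value)."""
    q_sa = q_table.setdefault(state, {}).setdefault(action, 0.0)
    # Q*(s_{t+1}, a_{t+1}): maximum value over all known actions in the next state
    next_best = max(q_table.get(next_state, {None: 0.0}).values())
    q_table[state][action] = q_sa + ALPHA * (reward + GAMMA * next_best - q_sa)

q = {}
q_update(q, "main", "click settings", 10.0, "settings")
# value moves from 0 toward the reward: 0 + 0.1 * (10 + 0.9*0 - 0) = 1.0
```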

Widget selector.
To efficiently focus on the potentially correct planning while also breaking out of local optima, we leverage the ε-greedy policy [59] to select the widget to be performed, following existing studies [46,72]. Specifically, for each widget, we sum the scores assigned by the three scorers (i.e., the page reaching scorer, widget hitting scorer, and exploration optimization scorer); we then choose the widget with the highest sum score with a high probability 1 − ε, or randomly select another widget with a low probability ε. In practice, ε is initially set to a small number close to 0 to enable CrashTranslator to focus on the predicted optimal widget (i.e., the one with the highest sum score). During exploration, the optimal widget may be wrongly predicted and leave the exploration stuck in repetitive transitions between several pages; at this point, ε is changed to a large number close to 1, leading CrashTranslator to select a non-optimal widget to break out of the local optimum.
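A minimal sketch of this selection policy (scorer outputs are modeled as plain dicts; names and signatures are ours):

```python
import random

def select_widget(widgets, page_score, hit_score, q_score, epsilon=0.05):
    """ε-greedy choice over the summed scores of the three scorers.
    Each scorer dict maps widget name -> score (missing widgets count as 0)."""
    total = {w: page_score.get(w, 0) + hit_score.get(w, 0) + q_score.get(w, 0)
             for w in widgets}
    best = max(total, key=total.get)
    if random.random() < epsilon:
        # with probability ε, pick a non-optimal widget to escape local optima
        others = [w for w in widgets if w != best]
        return random.choice(others) if others else best
    return best

chosen = select_widget(
    ["settings", "help"],
    page_score={"settings": 0.33, "help": 0.25},
    hit_score={}, q_score={"settings": 0.1}, epsilon=0.0)
# with epsilon=0, always the highest-sum widget
```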

Step Replayer
During crash reproduction, CrashTranslator uses an interaction cache to record all interactions with widgets after the app is launched (the cache is cleared after an app restart). After the target crash is triggered, the interactions recorded in the cache constitute the crash-reproducing operations from app launch to the crash. However, the interactions recorded in the cache may not be the most straightforward way to trigger the crash and might contain redundant interactions. To make the generated reproducing steps more concise, we automatically eliminate the redundant interactions that lead to repeated or looped transitions. Finally, we convert the concise interaction history into an auto-replay script and human-readable reproducing steps, i.e., the "image + text" reproducing steps shown in Figure 1. For each interaction, we highlight the widget to be interacted with on the page's screenshot with a red box, and generate the textual reproducing step in the form of "event type + widget name".
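The loop-elimination step can be sketched as follows (a simplified reconstruction assuming the trace is a list of (page, action) pairs; CrashTranslator's actual pruning may differ):

```python
def remove_loops(trace):
    """Drop interactions forming loops: when a page is revisited, everything
    between its two visits is redundant and is cut out of the trace."""
    result, seen = [], {}
    for page, action in trace:
        if page in seen:
            # rewind to the first visit of this page
            result = result[:seen[page]]
            seen = {p: i for i, (p, _) in enumerate(result)}
        seen[page] = len(result)
        result.append((page, action))
    return result

trace = [("main", "open menu"), ("menu", "click help"),
         ("help", "back"), ("main", "open menu"), ("menu", "click settings")]
print(remove_loops(trace))
# [('main', 'open menu'), ('menu', 'click settings')]
```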

Implementation
We implement our approach in Python and extend functionalities from the following libraries: Appium [1] to interact with Android apps and obtain the view hierarchy of the current page; NLTK [43] to stem words, which is used in the widget hitting scorer (Section 3.3); Ella [16] to check whether crash-involved APIs are triggered, which is used in the exploration optimization scorer (Section 3.4); and OpenCV [45] to mark widgets that need to be interacted with on screenshots, which is used in generating reproducing steps (Section 3.5). We run CrashTranslator and perform experiments on a physical x86 Ubuntu 20.04 machine with Android emulators (Android 4.4-7.0).
For the LLM leveraged in the page reaching scorer (Section 3.2), we adopt the pre-trained GPT-3 [4] model from OpenAI 3 . We choose the Curie model as the base model and fine-tune the model through official APIs as described in Section 3.2. Developers only need to set up their OpenAI account, complete the fine-tuning process according to the instructions provided on our website 2 , and subsequently utilize our tool to automatically reproduce crashes. On average, reproducing one crash requires sending approximately 42.6 prompts (4,939.7 tokens), with an estimated cost of around 0.01 USD.
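To illustrate what each prompt carries, the following sketch assembles the four inputs listed in Table 1 (PageNames, CurrentPage, CrashPage, InteractableWidgets) into a query for the fine-tuned model. The template wording and the `build_prompt` helper are assumptions for illustration; the exact prompt format used by CrashTranslator is described in Section 3.2.

```python
def build_prompt(page_names, current_page, crash_page, widgets):
    """Assemble the Table 1 inputs (I1-I4) into a single prompt string.
    The phrasing here is an illustrative stand-in, not the actual template."""
    return (
        f"App pages: {', '.join(page_names)}\n"
        f"Current page: {current_page}\n"
        f"Crash-occurring page: {crash_page}\n"
        f"Interactable widgets: {', '.join(widgets)}\n"
        "Which widget should be interacted with next to reach the "
        "crash-occurring page?"
    )

# Example values taken from Table 1
prompt = build_prompt(
    ["intro", "main", "setting"],
    "menu of main",
    "installed search engines settings",
    ["what's new", "help", "settings"],
)
```

The resulting string would then be sent to the fine-tuned Curie model via OpenAI's completion API, and the returned widget name feeds the page reaching scorer.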

EXPERIMENT DESIGN
To evaluate CrashTranslator, we consider the following three research questions: RQ1: How effective and efficient is CrashTranslator in reproducing crashes from stack trace?
RQ2: What is the contribution of the designed scorers in CrashTranslator for reproduction?
RQ3: Can the reproducing steps generated by CrashTranslator help developers to reproduce crashes?

Experimental Dataset
In this work, we collect 75 crash reports involving 58 apps from three sources for evaluation, i.e., ReCDroid's dataset [74], the AndroR2 dataset [63], and GitHub. ReCDroid is an approach for crash replay based on textually described reproducing steps, and we utilize all 33 crash reports in its replication package. AndroR2 is a dataset of manually reproduced bug reports for Android apps, and we use all its 22 crash reports. Other reports, e.g., display bug reports, are out of the scope of this study. Note that the above 55 (33+22) reports do not necessarily contain stack traces. For those reports without a stack trace, we manually reproduce the crash following the reproducing steps and then extract the stack trace from the log.
We also collect a third dataset from GitHub to further prove the effectiveness of CrashTranslator. In detail, we first crawl and filter 3,566 crash reports with stack traces (which may also contain reproducing steps) from GitHub as described in Section 2, and randomly sample 300 crash reports for manual checking to retrieve the reproducible ones. The checking is performed independently by three graduate students with 2-4 years of Android development experience, and each report is manually reproduced by two of them. We exclude those that cannot be reproduced (e.g., missing APKs, failed-to-compile apps, environment issues) or that require special conditions (e.g., accounts, hardware). This results in 20 crash reports involving 15 apps, and we refer to this dataset as CrashTranslator's dataset.
Note that the crash reports used in the experiments may contain reproducing steps or screenshots, but CrashTranslator does not use such information and only uses the stack trace to reproduce crashes. Due to space limitations, the details of the dataset can be viewed on our website 2 .

Baselines
To the best of our knowledge, CrashTranslator is the first work to reproduce crashes directly from the stack trace. Existing studies that reproduce crashes from natural language described reproducing steps, e.g., ReCDroid [74] and ReproBot [72], cannot work for our task. Nevertheless, we forcefully apply ReCDroid as a baseline for our task and denote it as ReCDroid_s. Specifically, we provide the crash stack trace as its input rather than the reproducing steps, irrespective of whether ReCDroid comprehends the stack trace or not.
In addition, there are automated GUI testing approaches [15,24,34,36,37,40,46,57,61] which also explore the app and try to reveal crashes; hence we utilize these approaches as baselines to better prove our effectiveness. We choose the following four state-of-the-art approaches from different categories, i.e., Monkey, Humanoid, Ape, and Q-testing: Monkey [40] is a widely-used random-based GUI testing tool that tests the target app with purely random sequences of GUI events or system events. The advantages of Monkey are its ability to perform lots of GUI events quickly and its good compatibility.
Humanoid [34] is a novel deep learning-based GUI testing tool. It trains a deep neural network model on a large-scale crowdsourced dataset of human interactions to predict which GUI widgets on the current page are more likely to be interacted with by testers.
Ape [24] is one of the state-of-the-art model-based GUI testing tools. It models the app's behavior by dynamically building a finite state machine. The advantage of Ape is its ability to balance the size and precision of the modelling by using dynamic GUI abstraction.
Q-testing [46] uses reinforcement learning to guide testing toward new pages to find crashes. It rewards GUI events that reach a new page and penalizes events that transfer to an already-explored page.

Experimental Setup and Evaluation Metrics
For RQ1, we verify the effectiveness and efficiency of CrashTranslator in two aspects: (1) the percentage of reports that can be successfully reproduced within a given time (denoted as success rate); we set the time budget to one hour following the existing study [46]; (2) the time required for successful reproduction (denoted as reproducing time). To mitigate the bias from randomness, we run our approach and the baselines three times and record the average reproducing time.
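The two metrics can be computed as in the sketch below. The `evaluate` helper and its input format are hypothetical, and it assumes a report counts as reproduced only if all three runs finish within the one-hour budget (the paper does not spell out how partially successful runs are counted).

```python
def evaluate(times):
    """times: dict mapping report id -> list of per-run reproducing times
    in seconds, with None marking a run that exceeded the budget."""
    reproduced = {r: runs for r, runs in times.items()
                  if all(t is not None for t in runs)}
    success_rate = len(reproduced) / len(times)
    avg_time = {r: sum(runs) / len(runs) for r, runs in reproduced.items()}
    return success_rate, avg_time

# Hypothetical results for two reports over three runs each
rate, avg = evaluate({"R-10": [40.0, 50.0, 60.0],
                      "A-5": [None, 70.0, 80.0]})
```

Averaging over the three runs is what mitigates the randomness introduced by the ε-greedy widget selector.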
For RQ2, to investigate the contribution of the designed scorers, we remove each of them and evaluate the performance for crash reproduction. Note that since the exploration optimization scorer is responsible for the exploration, removing it might leave the exploration stuck in a local dilemma and unable to finish the reproduction. Therefore, we only evaluate the contribution of the other two scorers. In detail, we create two variants of CrashTranslator: 1) CT_P, the variant without the page reaching scorer described in Section 3.2; 2) CT_W, the variant without the widget hitting scorer described in Section 3.3. The relative contribution of each scorer is measured by comparing each variant with the original approach in terms of success rate and reproducing time, also based on the average of three runs.
For RQ3, we verify the usefulness of the reproducing steps generated by CrashTranslator. We invite 14 postgraduate students to participate in this experiment. All of them have experience in mobile application testing; 8 are Android developers with at least 3 years of development experience, and 7 work on a crowdtesting platform. For the 46 crash reports that can be reproduced by CrashTranslator, we ask participants to reproduce the crash manually based on the stack trace or the reproducing steps (generated by CrashTranslator). Specifically, each participant is assigned 20 different reports, 10 of which only contain the stack trace, while the other 10 only contain the reproducing steps, thus ensuring that each report is reproduced by 3 participants based on the stack trace and 3 others based on the steps. If a participant can reproduce the crash within 30 minutes, following the existing study [74], we record the reproducing time; otherwise, we mark it as failing to reproduce. Finally, we compare the success rate and reproducing time based on the stack trace and the CrashTranslator-generated reproducing steps.

RESULTS AND ANALYSIS 5.1 RQ1: Effectiveness and Efficiency
Table 2 shows the success rate of reproducing crash reports from the three datasets. Overall, CrashTranslator can reproduce 61.3% of them (46 out of 75), outperforming the baselines by a large margin, i.e., 171% (61.3% vs. 22.6%) higher than ReCDroid_s, 206% (61.3% vs. 20%) higher than Monkey, 142% (61.3% vs. 25.3%) higher than Humanoid, 109% (61.3% vs. 29.3%) higher than Ape, and 206% (61.3% vs. 20%) higher than Q-testing. This shows that our tool can effectively reproduce crashes based on the corresponding stack trace. Specifically, CrashTranslator successfully reproduces 28 (84.8%) reports on ReCDroid's dataset, which outperforms ReCDroid_s and the other automated GUI testing baselines (30.3%-51.5%). On the AndroR2 dataset and CrashTranslator's dataset, CrashTranslator achieves a success rate of 40.9%-45%. In contrast, ReCDroid_s and the other automated GUI testing baselines can only successfully reproduce a small portion of the reports, with a success rate of 5% to 13.6%. The difference in success rate across the three datasets is likely because crash reports in ReCDroid's dataset tend to involve fewer exploration steps for reproduction, while the other two datasets require a longer exploration sequence for reproducing the crash.
For the 29 reports that CrashTranslator fails to reproduce, there are three main reasons hindering the reproduction: 1) Some apps (3 from ReCDroid's dataset, 1 from the AndroR2 dataset) could not run in our environment, e.g., the server is down, or the app is incompatible with our emulator. 2) CrashTranslator does not cover all interactive GUI actions. It already supports GUI actions such as tapping, long pressing, typing, and rotating the screen like existing studies [73,74], yet some crashes require other types of actions to trigger (e.g., C-4 Alarmio-47 requires scrolling up and down the screen). 3) Other unsolved technical challenges, which are further discussed in Section 6.1.
Table 3 and Table 4 show the reproducing time of CrashTranslator and the baselines on the successfully reproduced reports. For the 46 crash reports that CrashTranslator can reproduce, its average reproducing time is 68.7 seconds, which indicates that CrashTranslator can automatically reproduce crashes based on the corresponding stack trace within an acceptable time cost. Compared with the baselines, CrashTranslator performs better than ReCDroid_s and the automated GUI testing techniques in most cases. Since different tools succeed on different crashes, for the average reproducing time, we compare CrashTranslator with each baseline on the crashes that both of them can reproduce. The results show that CrashTranslator is 1110% faster than ReCDroid_s (81.6s vs. 988s), 1611% faster than Monkey (39.1s vs. 669s), 620% faster than Humanoid (73.9s vs. 532s), 705% faster than Ape (66s vs. 531s), and 302% faster than Q-testing (89.6s vs. 360s).

RQ2: Contribution of Different Scorers
Columns CT_P and CT_W in Table 4 show the reproduction results for the variants of CrashTranslator without the page reaching scorer and without the widget hitting scorer, respectively. Overall, CT_P can only successfully reproduce 42 crash reports, 4 fewer than CrashTranslator. Meanwhile, the average reproducing time of CT_P is 127.1 seconds, which is 82% slower than CrashTranslator (69.7s vs. 127.1s) considering the reports both can reproduce. CT_W can successfully reproduce 44 crashes (2 fewer than CrashTranslator), and its average reproducing time is 113.2s (63% slower than CrashTranslator). The inferior performance of CT_P/CT_W indicates that the two scorers significantly improve the effectiveness and efficiency of crash reproducing.
We further examine the detailed differences between the results of CT_P/CT_W and CrashTranslator for a thorough understanding. For reports that involve shorter exploration sequences for reaching the crash-occurring page, e.g., R-10 FastAdapter-394, we find that excluding the page reaching scorer or the widget hitting scorer does not largely influence the reproduction. This might be because, in these cases, the widget transferring to the crash-occurring page can be found by one of the scorers or even by traversal. By comparison, for reports that involve longer exploration sequences for reproduction (e.g., A-5 andOTP-500), or whose entry page has dozens of candidate widgets (e.g., R-5 AnyMemo-18), it is difficult to find the reproducing steps solely by traversal or the widget hitting scorer. In these cases, the page reaching scorer, which is accomplished by the LLM, contributes significantly to the crash reproduction by providing step-by-step guidance to the crash-occurring page. Besides, for reports that have many candidate widgets on the crash-occurring page (e.g., R-14 SMSsync-464), the efficiency can be largely improved by the widget hitting scorer, which predicts which widgets should be interacted with to trigger the crash.
Still, we must admit that on some reports, such as R-11 LibreNews-22, removing a scorer would improve the crash reproduction efficiency. This is mainly because the scorers can occasionally fail to make an accurate prediction, and with the wrong guidance, the exploration might go in the wrong direction and waste time.

RQ3: Usefulness of CrashTranslator
Columns P_s and P_r of Table 4 show the results of participants' manual reproduction based on stack traces and on reproducing steps (generated by CrashTranslator), respectively. Following the reproducing steps generated by CrashTranslator, 100% of the reports are successfully reproduced by at least two participants. As a comparison, when we only provide the stack trace, this percentage drops to 63% (29/46), and there are 6 reports that none of the participants can reproduce. Besides, the average reproducing time with CrashTranslator's steps is 66.7 seconds, which is 215% faster than reproducing from stack traces (57.3s vs. 180.5s) considering the reports for which at least one participant can reproduce the crash in both settings. This indicates the usefulness of our proposed CrashTranslator, whose generated reproducing steps can supplement stack-trace-only crash reports and make the crashes easier to reproduce.
Furthermore, on 91.3% (42/46) of the reports, the reproduction is faster for the automatic approach CrashTranslator when compared with the average time of human reproduction directly from the stack trace. This further implies the usefulness of CrashTranslator, which provides an automatic solution and can be faster than humans. After practitioners manually reproduced the bug reports, we also conducted an unstructured interview about the challenges they encountered in reproducing crashes based on stack traces only. Most participants complain that the crash-related information in stack traces is too limited to infer how to reproduce crashes. They usually need to spend a long time trying to reach crash-occurring pages and find crash-triggering widgets, and after many attempts, they may lose interest and assume that crashes are not reproducible. By comparison, they agree that CrashTranslator can automatically conduct the tedious process of exploring the app and finding paths to crash-occurring pages and crash-triggering widgets.

Table 4: Reproduction details on three datasets. CT, R, M, H, A, Q in the table header refer to CrashTranslator, ReCDroid_s, Monkey, Humanoid, Ape and Q-testing respectively. CT_P and CT_W refer to variants of CrashTranslator without the page reaching scorer and without the widget hitting scorer, respectively. If an approach successfully reproduces the crash, we record the reproduction time (in seconds) in the table; if it fails, we record ×. Columns P_s and P_r show the average time for participants to manually reproduce the crash based on the stack trace or the reproducing steps generated by CrashTranslator, respectively, and the number in parentheses is the number of participants who reproduce the crash successfully (out of 3). Note that in order to save space, reports that cannot be reproduced by either approach are omitted.

DISCUSSION 6.1 Limitations
Besides the two engineering limitations discussed in Section 5.1, there are three other technical limitations of CrashTranslator which hinder it from reproducing all bug reports. This also indicates the challenges in crash reproduction from stack traces and calls for further research.
First, CrashTranslator may fail to reproduce crashes whose stack traces do not contain crash-occurring pages or crash-involved APIs. In some special cases, crashes do not occur in a specific activity or fragment, e.g., faults related to network communication. In these cases, crash-occurring pages and crash-involved APIs are not available from the stack trace, and CrashTranslator may degenerate into an automated GUI testing tool performing aimless exploration due to the lack of guidance from the stack trace.
Second, CrashTranslator cannot reproduce crashes requiring valid input contents, e.g., performing a crash-triggering log-in requires a valid username and password.Such crashes require human intervention and cannot be fully automated since input contents are not present in the stack trace.Nevertheless, by asking the users to provide the information in advance, CrashTranslator can still conduct the automatic reproduction for these cases.
Third, CrashTranslator cannot capture special preconditions that trigger crashes.For example, a display GUI page works fine with default settings, but if switching to a dark theme, a crash would occur when reaching the page.In this case, CrashTranslator can correctly understand that the display page is a crash-occurring page, but it could not successfully reproduce the crash even if the correct page is reached.This is because CrashTranslator can hardly capture the triggering precondition of switching to the dark theme from the stack trace.In the future, we will investigate incorporating the crash-triggering preconditions into CrashTranslator with the help of static code analysis and other techniques.

Threats to Validity
The first threat relates to the randomness of CrashTranslator. Our widget selector (Section 3.4.2) selects a non-optimal widget with a certain probability. To reduce this threat, we run CrashTranslator and its variants (CT_P and CT_W) three times and record the average reproducing time in the experiment results.
The second threat relates to the choice of parameter settings of CrashTranslator, which may affect the effectiveness and efficiency of crash reproduction. To mitigate this threat, we conduct small-scale experiments on several "crash" reports that are excluded from our experimental data to determine suitable settings before the evaluation. These reports come from the data collection (Section 2), where we found some "crash" reports with stack traces that we could not trigger ourselves due to incompatibility with our environment.
The third threat relates to the confounding effects of participants.Following the existing approach [74], we assume that students with Android programming experience can be substituted for testers, and their reproducing time and success rate are representative.

RELATED WORK
Mobile Bug Reports Analysis and Reproducing.There are several studies which utilize natural language processing techniques to extract critical information from mobile bug reports, such as summarizing and classifying bug reports [20,47], facilitating dynamic analysis [28,66], augmenting bug reports for mobile apps [35,41] and generating test cases [6,41].
Several studies focus on crash reproduction of mobile bug reports with step-by-step guidance, i.e., textually described reproducing steps [72][73][74] and visual recordings [18]. Specifically, ReCDroid [74] and ReCDroid+ [73] leveraged natural language described reproducing steps to perform reproduction. They designed a set of predefined grammar patterns to extract events and objects from textual reproducing steps and then adopted a greedy-based dynamic exploration to synthesize event sequences. ReproBot [72] went a step further by analyzing the reproducing steps more accurately and designing a new exploration strategy to find the best match between steps and GUI actions. GIFdroid [18] leveraged visual screen recordings to perform reproduction. It adopted image-processing techniques to map the keyframes in a recording to GUI states and generated reproducing traces based on the transition graph. Compared to the aforementioned crash reproduction tools, CrashTranslator achieves the following advances: 1) Our work enables the reproduction of crashes from stack traces without relying on step-by-step guidance. This is a more challenging task that existing tools cannot handle. 2) We propose a novel approach which leverages an LLM and reinforcement learning to predict and guide the exploration steps for triggering the crash, which is more effective and efficient than previous techniques.
There are also studies which record and replay bugs in mobile and web apps by using running information [22,33,44,64,71] and textual descriptions [3,31]. Among these studies, CrashDroid [64] generated reproducing steps by translating the call stack, which contains all method calls from app launch to app crash. It requires the call stack to be collected by a specific mechanism throughout the app's run, while our approach only needs the automatically generated stack trace when the crash occurs.
Stack Trace Analysis. Stack traces offer exception-related information about an app. Schroter et al. [51] empirically showed that stack trace information is very helpful to developers when debugging. Subsequently, several automatic approaches were proposed to recover the links between crashes and their cause functions and assist developers in locating crashing faults [23,26,42,65,67]. Going a step further, researchers started to explore how to utilize the located faulty functions to help developers fix bugs, e.g., generating test cases for faulty functions [7,50,[52][53][54]68], and finding the best developer to fix the bug [32,58]. Besides, some studies focused on calculating the similarity between stack traces, which can be used to distinguish duplicate crash reports [11,30,48,49,60]. Our study opens a new direction, i.e., crash reproduction directly from the stack trace.

CONCLUSION
Crash reports from open-source platforms are vital for ensuring mobile application quality. Still, the crash-related information they record is not always complete, e.g., they may only contain the crash stack trace but lack the reproducing steps, which hinders developers from fixing issues. This paper proposes a novel reproduction approach named CrashTranslator which automatically reproduces crashes from the stack traces of mobile bug reports. It adopts three scorers, i.e., the page reaching scorer, widget hitting scorer, and exploration optimization scorer, to iteratively select crash-related widgets until the target crash is triggered. We evaluate CrashTranslator on 75 bug reports, and it can successfully reproduce 46 (61.3%) of the crashes within an acceptable time cost, largely outperforming automated GUI testing baselines.

Figure 1 :
Figure 1: Examples of crash reproducing

Figure 2 :
Figure 2: Examples of stack-trace-only crash reports

Figure 3 :
Figure 3: Overview of CrashTranslator. Page reaching (Sec 3.2) means arriving at the last GUI page when the crash occurs, and widget hitting (Sec 3.3) means clicking the correct widget in the last GUI page for triggering the crash.

Figure 4 :

Table 1: The example of prompt generation rules
Input ID | Attribute | Description | Examples
I1 | PageNames | List of names of all activities in the app, extracted from the AndroidManifest.xml file | PageNames = ["intro", "main", "setting", ...]
I2 | CurrentPage | The activity name of the current page | CurrentPage = menu of main
I3 | CrashPage | The name of the crash-occurring page, obtained from the stack trace | CrashPage = installed search engines settings
I4 | InteractableWidgets | List of names of all interactable GUI widgets on the current page, obtained from parsing the view hierarchy of the current page | InteractableWidgets = ["what's new", "help", "settings", ...]

Table 2 :
Reproduction success rate (Effectiveness). CT, R, M, H, A, Q in the table header refer to CrashTranslator, ReCDroid_s, Monkey, Humanoid, Ape and Q-testing respectively.

Table 3 :
Reproducing time on successfully reproduced reports (Efficiency)