Semantic GUI Scene Learning and Video Alignment for Detecting Duplicate Video-based Bug Reports

Video-based bug reports are increasingly being used to document bugs for programs centered around a graphical user interface (GUI). However, developing automated techniques to manage video-based reports is challenging as it requires identifying and understanding often nuanced visual patterns that capture key information about a reported bug. In this paper, we aim to overcome these challenges by advancing the bug report management task of duplicate detection for video-based reports. To this end, we introduce a new approach, called Janus, that adapts the scene-learning capabilities of vision transformers to capture subtle visual and textual patterns that manifest on app UI screens, which is key to differentiating between similar screens for accurate duplicate report detection. Janus also makes use of a video alignment technique capable of adaptive weighting of video frames to account for typical bug manifestation patterns. In a comprehensive evaluation on a benchmark containing 7,290 duplicate detection tasks derived from 270 video-based bug reports from 90 Android app bugs, the best configuration of our approach achieves an overall mRR/mAP of 89.8%/84.7%, and for the large majority of duplicate detection tasks, outperforms prior work by around 9% to a statistically significant degree. Finally, we qualitatively illustrate how the scene-learning capabilities provided by Janus benefit its performance.


INTRODUCTION
Video-based bug reports are becoming increasingly popular for mobile applications [23,34,35,52]. As mobile app bugs typically manifest visually via the graphical user interface (GUI), recording videos depicting bugs is more natural compared to textual bug reports [23,35]. App users can easily record app bugs via the recording features of mobile operating systems (e.g., Android [5]) or via third-party recording apps [4]. Additionally, popular issue trackers, such as GitHub [8], offer easy-to-use features for users to submit these videos to app developers. The rapidly increasing use of videos in mobile app issue trackers has been documented by recent studies [34,52]. Feng et al. [34] studied open source apps hosted on FDroid [6], and reported the usage of over 13k video recordings in issue trackers between 2012 and 2020, with a significant increase in usage during 2018-2020 (i.e., a 15%-35% increase). Kuramoto et al. [52] reported a 13% increase in issues containing videos in 2017-2021 for 289k popular GitHub projects.
While video-based bug reporting offers various advantages (ease of recording and submission, and visual details about app bugs [23,26,34,35,52]), it also presents several challenges for developers during bug report management tasks, particularly in scenarios where a high volume of bug reports is encountered [23,34,35,52].
One of the most challenging tasks for developers is determining whether video-based bug reports depict the same app bug. This situation arises when multiple users independently report identical problems with the application (e.g., during crowd-sourced app testing [26,31,40]). In such scenarios, developers face the challenge of watching, understanding, and assessing incoming and previously submitted video-based bug reports. This task can be extremely challenging since these recordings typically show numerous steps executed rapidly, making it difficult to recognize the bug reproduction scenario from the videos [23,26,35]. Additionally, the depicted buggy app behavior may not be apparent in the videos for the various types of bugs that apps can show in their GUI [33]. Developers often need to pause and replay the videos multiple times in order to fully understand the reported problems [23,35]. The task of duplicate (video-based) bug report detection is crucial during the bug triage process, as it helps developers avoid excessive redundant effort in investigating and resolving identical issues [26,31,40,71].
This challenge is particularly prominent in crowd-sourced testing of mobile apps [31,40], wherein software vendors engage a large distributed user base to test applications across diverse operational environments, for example, encompassing various devices, locations, and mobile networks. Crowd-sourced app testing often leads to multiple users encountering and reporting the same app-related issues. In fact, previous research has found that a substantial proportion (80%+) of bug reports submitted by users during crowd-sourced app testing are duplicates [71]. Consequently, developers often spend considerable effort on duplicate detection, which can impede the overall bug resolution process [26,31,40,71].
In this paper, we propose Janus, a novel automated approach designed to assist developers in identifying duplicate video-based bug reports. Janus combines visual representation learning, information retrieval, and sequence-based algorithms to analyze the visual, textual, and sequential information present in video-based bug reports. Through these methods, Janus establishes the degree of similarity between videos in reporting the same bug, thus enabling the automated detection of duplicate reports.
To model the visual information within videos, Janus leverages the Vision Transformer (ViT) architecture [30] and the self-supervised training scheme DINO [18], which extract rich hierarchical features that explicitly capture scene layout information related to GUI screens. In addition, Janus analyzes the textual content of videos by leveraging the Efficient and Accurate Scene Text Detector (EAST) [81] and a Transformer-based Optical Character Recognition (TrOCR) model [53], which accurately localize and extract text from video frames. By encoding this textual content via an adapted vector space model (VSM) [37], Janus assesses the textual similarity between two videos. Finally, to encode the sequential aspect of videos, Janus incorporates an adapted version of the classical longest common substring algorithm, giving higher weight to later video frames that show the buggy app behaviors even if the videos show distinct bug reproduction scenarios.
We evaluate Janus using a comprehensive benchmark of 7,290 duplicate detection tasks, constructed from 270 video-based bug reports representing 90 unique bugs found in nine Android apps. We created this benchmark by extending an existing dataset that relied mostly on synthetic bugs [26]. Specifically, we extended it by incorporating 90 video-based bug reports pertaining to 30 real bugs of different kinds (e.g., crashes, incorrect app output, and cosmetic issues) from three additional apps, resulting in a more comprehensive, realistic, and diverse benchmark.
Through multiple ablation experiments, we systematically assess the performance of the individual components of Janus as well as various combinations of these components. Our evaluation demonstrates that the best configuration of Janus (when visual, textual, and sequential video information is combined) achieves an overall mRR/mAP of 89.8%/84.7%, surpassing the performance of an existing duplicate detector by ≈9% (with statistical significance). These results suggest that Janus can significantly reduce the effort required to identify duplicate video-based bug reports, as developers would need to review fewer video reports to assess whether an incoming report depicts a known bug. Furthermore, we conducted a qualitative analysis to understand the reasons behind Janus's performance compared to prior work. Notably, Janus exhibits an interpretable representation of video frames, effectively capturing nuanced patterns related to GUI component style, composition, and layout, which are crucial in accurately distinguishing duplicate video-based bug reports.
In summary, this paper makes the following main contributions:
• A new approach (Janus) that leverages visual representation learning for the graphical/lexical analysis of video-based bug reports. It also leverages sequential information within these reports for more accurate duplicate detection;

THE JANUS DUPLICATE DETECTOR
This section describes the architecture and design details behind Janus, our approach for duplicate video-based bug report detection.

Problem Formulation and Challenges
We formulate the problem of duplicate detection as an information retrieval (IR) problem, as is typically done for textual bug reports [47,54,78]. A newly-submitted video-based bug report (the query) is compared against the set of previously submitted video reports in the issue tracker (the corpus) via a retrieval engine (e.g., Janus), which retrieves and ranks the corpus reports by their similarity with the query. The higher a video-based report is ranked, the more likely it is to depict the same bug as the query. A developer then would watch the ranked videos in a top-down manner, marking the new video as duplicate if they find a video depicting the same bug.
Video-based bug reports depict the incorrect behavior of an application (e.g., GUI screens showing a crash, layout problems, or functional misbehavior), and the actions performed by the user on the GUI screens that lead to such misbehavior (i.e., the steps to reproduce the bug). Duplicate video-based bug reports are pairs of two reports, e.g., the query report and one corpus report, that depict the same buggy behavior, possibly showing different GUI steps, as multiple sequences of steps can lead to the manifestation of the same buggy behavior. An advantage of an IR formulation over other methods (e.g., binary classification as (non-)duplicate reports [19]) is the fact that a ranked list gives higher flexibility to developers, because multiple bug reports are recommended as possibly showing the same bug.
While the primary goal of a duplicate detector is to identify whether two distinct videos depict the same incorrect app behavior, there are multiple challenges that make this task particularly difficult. For instance, duplicate videos may vary in length and display different reproduction steps, stemming from diverse reproduction scenarios executed by the users or the omission of certain steps during recording. Even if the reproduction steps appear the same or highly similar across videos, users may execute them at varying speeds. Distinguishing between different videos displaying distinct yet similar unexpected app behavior and reproduction steps can pose challenges for detectors. Furthermore, certain applications may exhibit dynamic content. For example, a mobile web browser allows users to navigate websites with varying layouts/content.
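The retrieval formulation above can be sketched in a few lines of Python. This is only an illustration: `rank_corpus` and the toy similarity table are hypothetical names introduced here, and any scoring function (such as the similarity scores described later) could be plugged in.

```python
from typing import Callable, List, Tuple


def rank_corpus(
    query: str,
    corpus: List[str],
    similarity: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Score every corpus report against the query and sort best-first."""
    scored = [(report, similarity(query, report)) for report in corpus]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy similarity table standing in for a real video similarity function.
toy_scores = {("q", "a"): 0.2, ("q", "b"): 0.9, ("q", "c"): 0.5}
ranking = rank_corpus("q", ["a", "b", "c"], lambda q, r: toy_scores[(q, r)])
top_candidate = ranking[0][0]
```

A developer would then inspect the candidates in `ranking` from top to bottom until a video showing the same bug is found.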

Janus Overview
An overview of Janus is shown in Fig. 1. Janus receives as input two video-based bug reports and outputs a similarity score that indicates how similar they are at depicting the same app bug. Janus can be used to compute scores between a new video-based bug report and a corpus of videos representing previously-submitted bug reports. The scores allow for ranking the corpus videos as a list of potential duplicate candidates. The goal of Janus is to rank the actual duplicates of the new video higher in this list. Internally, Janus begins by sampling a number of frames from the two videos at a given rate (every sixth frame, following the findings of past work [26]) to reduce overhead, given that successive frames tend to be exact or near duplicates of each other. Next, Janus computes a vector representation of the videos by processing the visual and/or textual content of the frames. Janus's visual component, Janus_vis, vectorizes each video into a visual TF-IDF representation by discretizing the frames into a Bag of Visual Words (BoVW) [45], using a feature extractor based on a Vision Transformer (ViT) model [30] and the DINO self-supervised training scheme [18]. Janus's textual component, Janus_txt, vectorizes each video into a textual TF-IDF representation by extracting video frame text (via the EAST [81] and TrOCR [53] models) and constructing a document of the concatenated text, represented as a Bag of Words (BoW) [65]. Each pair of visual or textual TF-IDF representations is then compared via cosine similarity. The visual and textual similarities can be used individually to rank duplicate candidates, or they can be combined into a single similarity score to account for both modalities of information, ideally leading to more effective duplicate detection.
To account for the sequential nature of video-based bug reports, which typically show the reproduction steps first and the incorrect app behavior afterward, Janus can compute an alternative similarity score, based on a customized version of the longest common substring (LCS) algorithm, which matches the vector representation of video frames via cosine similarity and produces an overall similarity score that weights the later frames in the video more heavily than the earlier ones. This similarity is computed by Janus's sequential component, Janus_seq, which operates on the visual (Janus_seq-vis) and textual (Janus_seq-txt) vector representations of the frames.
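As a small illustration of the frame sampling step, the sketch below keeps every sixth frame of a decoded video. The function name `sample_frames` and the choice to start at the first frame are assumptions made for this example only.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")


def sample_frames(frames: Sequence[Frame], step: int = 6) -> List[Frame]:
    """Keep every `step`-th frame, starting from the first one (an assumption)."""
    if step < 1:
        raise ValueError("step must be a positive integer")
    return list(frames[::step])


# Frame indices stand in for decoded images in this toy example.
sampled = sample_frames(list(range(20)))
```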

Janus_vis: Visual Representation of Videos
Janus_vis obtains a visual representation of a video in two steps. First, the sampled video frames are resized to 224 × 224 (pixels) and encoded via visual representation learning [49]. Second, these frame embeddings are further processed into a Bag of Visual Words (BoVW) [50], which is used to represent a video as a TF-IDF vector [65]. The goal is to learn useful visual information from app GUI components and layouts shown in the videos to distinguish potential duplicates from non-duplicates.

Visual Representation of Video Frames.
Visual representation learning aims to obtain high-quality visual representations that are helpful for downstream tasks such as image classification [51], object detection [80], or image captioning [43]. This task is typically carried out in an unsupervised, self-supervised, or supervised manner [25,60]. Most recently, there has been a focus on contrastive [25,60] and distillation learning methods [18]. A promising technique, known as the Vision Transformer (ViT) [30], has recently been proposed to better learn visual representations. The performance of this architecture has been demonstrated to surpass or, at the very least, match previous models relying on Convolutional Neural Networks (CNNs) for image classification. However, the most significant advantage of ViT lies in its ability to excel beyond CNNs in capturing explicit information concerning the semantic segmentation of an image (i.e., layouts and object boundaries) [30].
We posit that learning object segmentation within an image is particularly useful for app GUI screens, given their structured, component-based nature. Hence, we adopted the ViT architecture for designing Janus_vis. The ViT architecture comprises a standard Transformer encoder model [29], but instead of lexical tokens, "patches" from images are fed into the network. These patches are treated the same way that tokens are in lexical transformers: they are linearly transformed and have added positional embeddings.
Given that image-level supervision requires labor-intensive annotations and limits the information that can be learned during training to a single concept with a few categories of objects (as is the case of app GUI screens, which contain components and layouts of well-defined kinds), we need to train our ViT model in a self-supervised manner. Janus_vis trains its ViT using the self-supervised training methodology DINO [18], which leverages a student-teacher knowledge distillation training scheme [42]. In this scheme, the student network is trained to match the distribution of the teacher network by minimizing the standard cross-entropy loss. Usually, the teacher network is larger than the student network in terms of the number of model parameters. However, the teacher network in DINO is built from the past iterations of the student network with an exponential moving average strategy, whose parameters are frozen over an epoch by applying a stop-gradient operator, given that direct replication of the student weights fails to converge. The outputs of both networks are normalized using a temperature softmax. To adapt the knowledge distillation architecture to self-supervised learning, two global views and several local views are constructed on the basis of data augmentations [38] and the multi-crop strategy [17], with all views passed through the student while only the global views are passed through the teacher network, to encourage local-to-global correspondence. By combining DINO with ViT, we aim to further improve the ability to capture global GUI layouts.
Through this self-supervised training process, the model learns a rich representation of images that emphasizes scene layouts and object boundaries. To further adapt the DINO model's capabilities to our domain of app GUI screens, we fine-tuned Janus's ViT model, which was pre-trained on ImageNet [28], on a collection of 66k mobile app screenshots from the popular RICO dataset [27]. We directly use the projected output of the [CLS] token, a special token that marks the aggregation of all image patch embeddings, from the last block of the ViT model as the representation of video frames.
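To make the patch mechanism concrete, the following minimal sketch splits a single-channel image into square patches, projects each flattened patch linearly, and adds a positional embedding. It is a simplification under stated assumptions: the patch size of 16 and the helper names are chosen for illustration, and the [CLS] token and the Transformer encoder itself are omitted.

```python
from typing import List

Image = List[List[float]]  # single-channel image stored as rows of pixel values


def split_into_patches(image: Image, patch: int) -> List[List[float]]:
    """Cut the image into non-overlapping patch x patch squares, each flattened."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches


def embed_patches(patches: List[List[float]],
                  weights: List[List[float]],
                  positions: List[List[float]]) -> List[List[float]]:
    """Linearly project every flattened patch and add its positional embedding."""
    tokens = []
    for patch_vec, pos in zip(patches, positions):
        projected = [sum(w * x for w, x in zip(row, patch_vec)) for row in weights]
        tokens.append([p + e for p, e in zip(projected, pos)])
    return tokens


# A 224 x 224 frame with an assumed patch size of 16 gives 14 x 14 patches.
frame = [[0.0] * 224 for _ in range(224)]
num_patches = len(split_into_patches(frame, 16))
```

In a real ViT the projection weights and positional embeddings are learned parameters; here they are plain lists supplied by the caller.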
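The two core ingredients of this training scheme, the cross-entropy between temperature-softmax outputs and the exponential moving average (EMA) teacher update, can be sketched as follows. This is a simplified illustration, not the DINO implementation: the temperature and momentum defaults are illustrative values, and other parts of the method (for example output centering and multi-crop view generation) are left out.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float) -> List[float]:
    """Temperature-scaled softmax, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits: List[float],
                      teacher_logits: List[float],
                      student_temp: float = 0.1,
                      teacher_temp: float = 0.04) -> float:
    """Cross-entropy between teacher and student output distributions.

    The teacher probabilities act as fixed targets (the stop-gradient idea).
    """
    teacher_probs = softmax(teacher_logits, teacher_temp)
    student_probs = softmax(student_logits, student_temp)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))


def ema_update(teacher_weights: List[float],
               student_weights: List[float],
               momentum: float = 0.996) -> List[float]:
    """Move each teacher weight slightly toward the current student weight."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_weights, student_weights)]


# A student that agrees with the teacher incurs a lower loss than one that does not.
loss_agree = distillation_loss([5.0, 0.0, 0.0], [5.0, 0.0, 0.0])
loss_disagree = distillation_loss([0.0, 5.0, 0.0], [5.0, 0.0, 0.0])
```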

Visual Representation of Videos.
To represent a video, Janus_vis implements a BoVW + TF-IDF approach since it has been shown to be more useful for video retrieval compared to other approaches [50] (e.g., directly using the frame representations for similarity computation or aggregating them into a single vector).
Janus discretizes the frame representations by leveraging a Codebook of visual words [50]. The Codebook represents a catalog of visual words, which are representative vectors found in a corpus of images (in our case, images of app GUIs). The Codebook is constructed via a trained $k$-Means model that clusters the corpus of image representations into $k$ clusters, the centroids being the visual words. Janus then assigns each video frame representation to its closest cluster centroid (i.e., a visual word) via Euclidean distance. The Codebook is trained by randomly sampling 15k mobile app screenshots from the RICO dataset [27], vectorizing them via our fine-tuned ViT model, and running the $k$-Means algorithm on the vectors, with $k$ = 1k, as recommended by prior work [50]. We take a sample rather than using the full RICO dataset due to computational constraints of the $k$-Means algorithm. The Codebook is trained only once before the TF-IDF representation approach is applied.
Once each frame representation is discretized to its corresponding visual word, Janus computes a TF-IDF vector representation of a video, as similarly done for text retrieval [65]. The term frequency (TF) is the count of each visual word in the video. The inverse document frequency (IDF) is derived from the number of BoVW representations of existing videos in which a visual word appears. Since a corpus of existing videos for a particular app may be small and may lack diversity, we consider the set of RICO images as the corpus of existing videos. By considering the diversity of apps in the RICO dataset, we aim to improve the generalization of the TF-IDF video representations.
Janus_vis compares the TF-IDF representation of two videos via cosine similarity to establish the likelihood of the videos showing the same app bug. This method is applied to the existing corpus of TF-IDF visual representations for an app to generate a ranked list of candidate duplicate videos for a new video-based bug report.
To address potential biases due to random sampling when creating the Codebook, we adapted Janus_vis to use four Codebooks (each trained on 15k RICO images, 60k in total). Specifically, Janus_vis uses each Codebook to produce similarity scores for a set of videos. These similarity scores are averaged to produce a final set of similarities and video ranking. More details are given in Section 3.3.2.
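The discretization step can be illustrated with a short sketch that maps frame vectors to their nearest centroid. It assumes the Codebook centroids have already been learned by a clustering algorithm; the function names and the tiny two-word Codebook are hypothetical.

```python
import math
from typing import List, Sequence


def nearest_visual_word(frame_vec: Sequence[float],
                        codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the centroid closest to the frame (Euclidean distance)."""
    distances = [math.dist(frame_vec, centroid) for centroid in codebook]
    return distances.index(min(distances))


def video_to_visual_words(frames: Sequence[Sequence[float]],
                          codebook: Sequence[Sequence[float]]) -> List[int]:
    """Discretize every frame embedding of a video into a visual-word id."""
    return [nearest_visual_word(frame, codebook) for frame in frames]


# Two centroids stand in for a Codebook learned beforehand.
toy_codebook = [[0.0, 0.0], [10.0, 10.0]]
toy_words = video_to_visual_words([[1.0, 0.0], [9.0, 9.0], [0.0, 2.0]], toy_codebook)
```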
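A minimal sketch of the TF-IDF and cosine similarity computation over visual words is shown below. The smoothed IDF formula is one common variant chosen for this example and is an assumption, as are the function names and the toy corpus.

```python
import math
from collections import Counter
from typing import List, Sequence


def idf_table(corpus_docs: Sequence[Sequence[int]], vocab_size: int) -> List[float]:
    """Smoothed inverse document frequency for each visual-word id (one common variant)."""
    n_docs = len(corpus_docs)
    doc_freq = Counter(word for doc in corpus_docs for word in set(doc))
    return [math.log((1 + n_docs) / (1 + doc_freq[w])) + 1.0 for w in range(vocab_size)]


def tfidf_vector(words: Sequence[int], idf: Sequence[float]) -> List[float]:
    """Term frequency (raw count) of each visual word, scaled by its IDF."""
    counts = Counter(words)
    return [counts[w] * idf[w] for w in range(len(idf))]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy corpus of three "documents" over a vocabulary of four visual words.
toy_corpus = [[0, 0, 1], [1, 2], [2, 2, 3]]
toy_idf = idf_table(toy_corpus, vocab_size=4)
video_a = tfidf_vector([0, 0, 1], toy_idf)
video_b = tfidf_vector([0, 1, 0], toy_idf)   # same words in a different order
video_c = tfidf_vector([3, 3], toy_idf)      # no visual word shared with video_a
```

Note that `video_a` and `video_b` receive the same vector even though their frames appear in a different order, which is the order-insensitivity that the sequential component is meant to address.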

Janus_txt: Textual Representation of Videos
Janus_txt creates a textual representation of a video in two steps: (1) it localizes and extracts the text present in video frames via neural text localization and Optical Character Recognition (OCR); and (2) it encodes the extracted text using a standard TF-IDF representation [65]. The goal is to leverage the text from labels, messages, and other sources shown in the frames to compute video similarity.
For the first step, Janus_txt has two components: (1) a text localization component that proposes image regions where text is rendered, and (2) a text recognition component that takes those regions and extracts any text present in them. The text localization component implements the Efficient and Accurate Scene Text Detector (EAST) model [81], which has been trained to directly derive region proposals. The text recognition component leverages the TrOCR Transformer model [53], which takes region proposals from EAST and directly predicts the text represented in the proposals. The combination of EAST and TrOCR was adopted over the popular TesseractOCR [1] approach because: (1) such a combination simplifies the overall OCR pipeline since it relies on neural models only, without needing heuristic-based approaches to filter out poor text region candidates (as TesseractOCR does); and (2) such a combination has shown strong performance improvements in detecting scene text as well as handwritten/printed text, which means it is less sensitive to noise in the images. Each video frame is put through this two-stage pipeline to extract its text.
For the second step, Janus_txt concatenates the text from all video frames and pre-processes it via tokenization, lemmatization, and removal of special characters, such as non-ASCII characters, punctuation, or stop words. The resulting text is used to build a Bag of Words (BoW) representation of the video, which is then encoded as a standard textual TF-IDF representation using the popular Lucene library [37], which implements the standard information retrieval Boolean model and the Vector Space Model (VSM) [65]. We use this textual representation approach over neural text encoding models because it is based on exact text matching, which could lead to more accurate similarity computation of duplicate videos (as they are likely to show the same text on the buggy app screens).
Finally, Janus_txt compares the TF-IDF representation of two videos using Lucene's similarity scoring function (based on cosine similarity and document length normalization) [11]. Similarity computation can be applied to a corpus of video-based bug reports to generate a ranked list of potential duplicate videos for the new video.
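The pre-processing and Bag of Words construction can be sketched as follows. This is a simplified stand-in rather than the Lucene-based pipeline: lemmatization is omitted, the stop-word list is a tiny illustrative set, and the function names are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List

# A deliberately tiny stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "has", "is", "of", "the", "to"}


def preprocess(text: str) -> List[str]:
    """Lowercase, keep ASCII alphanumeric tokens only, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def bag_of_words(frame_texts: Iterable[str]) -> Counter:
    """Concatenate the text of all frames and count the remaining tokens."""
    return Counter(preprocess(" ".join(frame_texts)))


# Text as it might be extracted from three consecutive frames.
bow = bag_of_words(["Settings", "The app has crashed!", "Settings"])
```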

Janus_seq: Sequential Similarity of Videos
Janus_vis and Janus_txt ignore the sequential order of the videos, as these components are based on Bags of (Visual) Words. However, the buggy app behavior is typically shown toward the end of a video-based bug report, after the bug reproduction steps have been rendered. To account for the sequential order of the videos, Janus employs a modified version of the longest common substring (LCS) algorithm to compute an alternative similarity score between videos. This approach is called Janus_seq and operates on both visual (Janus_seq-vis) and textual (Janus_seq-txt) representations of the videos.
Janus_seq treats a video as a sequence of visual/textual words, based on the vector representation of the video frames, and applies an LCS-based approach for similarity computation. Intuitively, the longer the LCS between videos is, the higher their similarity is. The textual representation of a video frame is the TF-IDF vector of the text extracted from the frame, using the approach described in Section 2.4. In the standard word-based LCS algorithm, words are compared using exact text matching. To account for similar, yet different video words (which might be common for textual video representations), we relaxed this matching scheme and instead used cosine similarity between video frame representations. Additionally, the similarity-based matching should weigh more heavily the frames that appear later in the videos, as they are more likely to show the buggy app behavior, and should give a normalized similarity score between zero and one.
Given these requirements, we defined the following similarity computation for Janus_seq: $S_{seq} = \frac{\text{w-LCS}}{\max \text{w-LCS}}$, where the numerator, w-LCS, represents the amount of overlap between two videos, given by our modified LCS algorithm, which uses the cosine similarity between frames (rather than exact matching) and a weighting scheme that favors later frames in the videos. The weighting scheme is $\frac{i}{m} \times \frac{j}{n}$, where $i$ is the $i$-th frame of a first video, with $m$ being its number of frames, and $j$ is the $j$-th frame of a second video, with $n$ being its number of frames. The denominator, max w-LCS, represents the maximum possible overlap if the videos were identical. Since the videos could be of different lengths, we align the end of the shorter video (with length $min$) to the end of the longer video (with length $max$), and calculate the maximum overlap as: $\sum_{i=1}^{min} \frac{i}{min} \times \frac{max - min + i}{max}$.
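The weighted, similarity-based LCS idea can be sketched with a small dynamic program. This is an illustrative interpretation, not the authors' implementation: the match threshold of 0.9, the way matched frames accumulate (cosine similarity times position weight along a contiguous run), and the end-aligned normalizer that pairs frame i of the shorter video with frame max − min + i of the longer one are all assumptions made for this example.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two frame vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def weighted_overlap(video_a: Sequence[Vector], video_b: Sequence[Vector],
                     threshold: float = 0.9) -> float:
    """Best weighted score over all contiguous runs of matching frames."""
    m, n = len(video_a), len(video_b)
    best = 0.0
    prev = [0.0] * (n + 1)
    for i in range(1, m + 1):
        curr = [0.0] * (n + 1)
        for j in range(1, n + 1):
            sim = cosine(video_a[i - 1], video_b[j - 1])
            if sim >= threshold:  # frames "match" when they are similar enough
                curr[j] = prev[j - 1] + sim * (i / m) * (j / n)
                best = max(best, curr[j])
        prev = curr
    return best


def max_overlap(m: int, n: int) -> float:
    """Overlap of two identical, end-aligned videos with lengths m and n."""
    short, long_ = min(m, n), max(m, n)
    return sum((i / short) * ((long_ - short + i) / long_)
               for i in range(1, short + 1))


def sequential_similarity(video_a: Sequence[Vector], video_b: Sequence[Vector],
                          threshold: float = 0.9) -> float:
    """Weighted overlap normalized by the maximum possible overlap."""
    if not video_a or not video_b:
        return 0.0
    return weighted_overlap(video_a, video_b, threshold) / max_overlap(
        len(video_a), len(video_b))


# One-hot vectors stand in for distinct frame representations.
x, y, z = [1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 1, 0, 0]
q, r = [0, 0, 0, 1, 0], [0, 0, 0, 0, 1]
same = sequential_similarity([x, y, z], [x, y, z])
early_match = sequential_similarity([x, y, z], [x, q, r])  # only the first frame matches
late_match = sequential_similarity([x, y, z], [q, r, z])   # only the last frame matches
```

In this toy example a match on the final frame scores higher than a match on the first frame, reflecting the intent that frames near the end of a video, where the bug usually appears, count more.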

Combining Janus's Components
To design Janus, we explore different combinations of its components. The similarity scores from Janus_vis and Janus_txt can be linearly combined as $(1 - w) \times S_{vis} + w \times S_{txt}$, with $w \in [0, 1]$; the higher $w$ is, the more weight it gives to textual information from the videos. We also explore various combinations that replace this similarity calculation with those given by Janus_seq (Janus_seq-vis & Janus_seq-txt), which consider the sequential video information.
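The linear combination is straightforward to express in code. In this small sketch the function name `combine_scores` and the example weight are illustrative choices only.

```python
def combine_scores(s_vis: float, s_txt: float, w: float) -> float:
    """Blend visual and textual similarity; a larger w favors textual information."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return (1.0 - w) * s_vis + w * s_txt


combined = combine_scores(s_vis=0.8, s_txt=0.4, w=0.25)
```

Setting `w` to 0 or 1 recovers the purely visual or purely textual score, respectively.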

EVALUATION METHODOLOGY
We investigate the performance of Janus's components (Janus_vis, Janus_txt, and Janus_seq), as well as the performance of various combinations of these components, and compare these to a baseline duplicate detection technique proposed in prior work [26]. Additionally, we aim to understand why we observe various trends in the overall performance of Janus, and qualitatively examine cases where Janus is able to outperform the baseline technique.
To that end, we formulate the following research questions (RQs):

Duplicate Detection Dataset
We constructed a comprehensive evaluation dataset by extending a prior dataset that mostly relied on synthetic app bugs [26]. The previous dataset collected 60 distinct bugs (35 crashes and 25 non-crashes) across six Android apps of different sizes and domains (e.g., podcast, finance, and weight management apps). The dataset contains ten confirmed real bugs and 50 bugs injected by the mutation testing tool MutAPK [33], which generates code mutations based on diverse mutant operators that affect various app features. The dataset includes three duplicate videos per bug, for a total of 180 video-based bug reports, and a set of 810 duplicate detection tasks per app, for a total of 4,860 tasks, created from the videos. We refer to this dataset as the original dataset. We next describe how we extended this dataset and detail the creation of video-based bug reports and duplicate detection tasks to evaluate Janus.
3.1.1 Extended Real Bug Dataset. We extended the prior dataset by constructing an evaluation dataset containing only real bugs. Wendland et al. [74] released the AndroR2 dataset containing 90 manually reproduced bug reports for Android apps. This dataset was later extended through the addition of another 90 reproduced bug reports in the AndroR2+ dataset [46], for a total of 180 real, reproducible reports. For each bug report, AndroR2+ provides a link to the original bug report in the issue tracker, an apk of the buggy app version, a reproduction script, and metadata for bug reproduction (device, OS version, etc.).
To construct our new dataset of real bugs, we chose the three apps with the largest number of bugs from AndroR2+, while also ensuring the diversity of app categories. We selected: Firefox Focus (FCS) [7], a web browser; PDF Converter (ITP) [10], an image-to-PDF converter; and GPSTest (GPS) [9], a GPS testing app. FCS is the only app that renders dynamic content on the screen. For these apps in AndroR2+, we found ten bug reports for FCS, nine reports for GPS, and eight reports for ITP. We further manually checked each app's issue tracker and collected one more bug for GPS and two more bugs for ITP to have the same number of bugs per app. To find the apk files of the correct buggy version of the apps for these three bugs, we chose the app version closest to the date the issue was created and confirmed that the apk allowed for the successful reproduction of the bug. Based on the AndroR2+ metadata and the three bug reports we collected, there are seven different OS versions used to reproduce the bugs, namely, Android versions 4.4.4, 6.0.1, 7, 7.1, 8, 8.1, and 9.
3.1.2 Duplicate Video Recording. The paper authors and external participants recorded videos replicating the collected 30 real bugs from the three AndroR2+ apps, following prior work [26].
We rewrote the descriptions of steps to reproduce (S2R), expected behaviors, and observed behaviors for these bugs to ensure they are clear and easy for participants to reproduce from an end-user perspective. Although the AndroR2+ bugs were reproducible on a Pixel 2 emulator, we chose Nexus 5X to maintain the same device configuration as the previous dataset [26], since the bugs were also reproducible on the Nexus 5X. This ensures a consistent resolution of the videos across the benchmarks. Additionally, we reduced the number of different OS versions to three (6, 8.1, and 9) to reduce participants' effort, by finding the closest OS versions to the original ones while ensuring the bugs were still reproducible. Also, having these additional OS versions in our video reproductions of these bugs has the added benefit of being more realistic, since the prior dataset only used Android 7.0. While AndroR2+ provides automated bug reproduction scripts, we avoided using them for two reasons: (i) we found that certain scripts led to errors that did not properly reproduce the bug, and (ii) we wanted to capture video-based reports depicting real human actions, to ensure the most realistic setting possible.
Video-based reports were created by the paper authors for all 30 bugs according to the S2R. To maintain three duplicate videos per bug, in line with the previous dataset, two authors (who previously did not record any videos) along with two Ph.D. students were asked to record the additional 60 videos, each responsible for reproducing 15 distinct bugs with only the descriptions of expected and observed behaviors, to ensure diversity of reproduction steps. Unlike the prior dataset, the recorded videos do not show the Android touch indicator when the user taps on the screen.
In total, our new dataset consists of 90 video-based bug reports corresponding to three duplicates of 30 real bugs from three apps. It contains two crashes and 28 non-crashes, comprising 270 reproduction steps in total (249 taps, six gestures, and 15 input entry actions) and ≈35-second videos, on average. There are six videos for Android 6, nine for Android 9, and 75 for Android 8.1.

Duplicate Detection Tasks.
In line with the prior dataset, we construct duplicate detection tasks for each app to be as realistic as possible. We define a duplicate detection task as having: (1) a query video that represents a newly reported video-based bug report, and (2) a corpus of 13 existing video-based reports. The query must be compared against the corpus in order to determine whether the incoming report is a duplicate of an existing report. Each task contains videos of the 10 bugs for an app. The corpus contains two duplicate videos of the query (i.e., they show the same bug). The remaining eleven videos are non-duplicates: three of them are duplicates of each other but not of the query (i.e., they show a bug different from the query bug), and eight videos show distinct bugs. Each task simulates a situation that is similar to crowd-sourced app testing, where duplicates of the query, of other bugs, and unique video-based reports exist together on the issue tracker for an app.
Using different combinations of bugs and videos, we created a total of 810 tasks per app or 2,430 tasks across all three apps. Combining both the prior and new datasets (4,860 + 2,430), there are 7,290 tasks in our extended evaluation benchmark to evaluate Janus.

Baseline Duplicate Detector
We compare Janus against the Tango duplicate detector introduced by Cooper et al. [26]. Tango also leverages multi-modal information to detect duplicate video-based reports, but uses less sophisticated methods than Janus. It extracts visual features from video frames using a contrastive learning method called SimCLR, which uses a ResNet-50 CNN to learn local features of app GUIs [25]. It also analyzes text displayed on GUI screens using an approach that combines LSTM-based language models and heuristics, relying on Tesseract OCR to extract video frame text [1]. Finally, Tango performs limited alignment of video frames: only for its extracted visual SimCLR features. Tango's evaluation found that its best-performing configuration combines the visual and textual components; hence, we compare Janus against this configuration while also performing ablation comparisons between their individual components.

Metrics and Experimental Settings
All metrics give a normalized score in [0, 1]: the higher the score, the better the duplicate detection performance. We executed different configurations of Janus and the baseline on the 7,290 tasks and compared the resulting metrics between these approaches.
To account for potential biases from random image selection when constructing Janus_vis's Codebook, we used four distinct Codebooks, each trained on 15k distinct RICO images (60k images in total). With each Codebook, Janus_vis generates similarity scores for a set of videos. These similarities are averaged across the four Codebooks to produce the final scores used for ranking. To perform a fair comparison with Tango's visual component, we implemented the same Codebook generation strategy on Tango, using its publicly released implementation [26]. The recomputed Tango results on the prior dataset are slightly higher than those reported in the original paper (76.2 vs 75.3 mRR and 69.8 vs 67.8 mAP).
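The averaging step can be sketched as follows. This is a simplified illustration, not the released implementation: the function name and the dictionary-based score layout are assumptions, and the similarity values are made up.

```python
def rank_by_average_similarity(scores_per_codebook):
    """Average per-video similarities across Codebooks and rank the corpus.

    scores_per_codebook: list of dicts {video_id: similarity to the query},
    one dict per Codebook (hypothetical layout).
    """
    n = len(scores_per_codebook)
    video_ids = scores_per_codebook[0].keys()
    averaged = {
        vid: sum(scores[vid] for scores in scores_per_codebook) / n
        for vid in video_ids
    }
    # Higher average similarity means a more likely duplicate.
    ranking = sorted(averaged, key=averaged.get, reverse=True)
    return ranking, averaged

# Made-up similarities from four Codebooks for a three-video corpus.
scores = [
    {"v1": 0.90, "v2": 0.40, "v3": 0.70},
    {"v1": 0.80, "v2": 0.50, "v3": 0.75},
    {"v1": 0.85, "v2": 0.45, "v3": 0.60},
    {"v1": 0.95, "v2": 0.35, "v3": 0.65},
]
ranking, averaged = rank_by_average_similarity(scores)
```

Averaging before ranking means a video that scores well under only one Codebook does not dominate the final candidate list.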
We compared Janus_txt against Tango's textual component by experimenting with different configurations for the EAST and TrOCR models. For EAST, we used three different resolution thresholds to filter out small text regions: 5 × 5 (EAST-5), 40 × 20 (EAST-40), and 80 × 40 (EAST-80). The 5 × 5 threshold is used by default in EAST. We did not test resolutions larger than 80 × 40 to ensure that each textual document created for a video has at least one valid detection. The 40 × 20 threshold was included as a middle ground to understand the impact of the threshold size on the video similarity calculation. For TrOCR, we used its large version with BEiT Large [14] as the encoder and RoBERTa Large [55] as the decoder. Two fine-tuned TrOCR-Large models are used, namely TrOCR-p (fine-tuned on the printed text dataset SROIE [44]) and TrOCR-s (fine-tuned on synthetic scene text datasets such as ICDAR15 [48] and SVT [72]).
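The effect of the three thresholds can be illustrated with a small sketch. It assumes that each threshold is given as width × height in pixels and that a region is kept only if it meets both dimensions; neither detail is stated above, and the box coordinates are made up.

```python
def filter_small_regions(boxes, min_width, min_height):
    """Keep boxes (x, y, width, height) that are at least min_width x min_height.

    Requiring both dimensions to meet the threshold is an assumption.
    """
    return [b for b in boxes if b[2] >= min_width and b[3] >= min_height]

# Threshold names follow the text; the (width, height) order is assumed.
THRESHOLDS = {"EAST-5": (5, 5), "EAST-40": (40, 20), "EAST-80": (80, 40)}

# Made-up text detections on one frame: (x, y, width, height).
boxes = [(10, 10, 6, 6), (0, 50, 45, 22), (20, 100, 120, 48)]
kept = {name: filter_small_regions(boxes, w, h)
        for name, (w, h) in THRESHOLDS.items()}
```

With these made-up boxes, EAST-5 keeps all three regions, EAST-40 keeps two, and EAST-80 keeps only the largest one, which shows how a stricter threshold shrinks the textual document built for a video.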

Model Training.
To compare fairly with Tango_vis's SimCLR model, all visual models were fine-tuned on the 66k mobile app screenshots from the RICO dataset [27] for 100 epochs using model checkpoints trained on ImageNet [28], except for DINO (ViT-B/16). After examining preliminary results showing the advantages of DINO with ViT, we decided to train DINO (ViT-B/16) for 400 epochs [18]. Fine-tuning was carried out on three NVIDIA Tesla T4 GPUs with 16GB of memory each. Because DINO does not use contrastive learning, we were able to use a much smaller batch size as compared to the SimCLR model used in Tango: 96 vs 1,792 for ViT-S/16 and ResNet-50. For the ViT-B/16 and ViT-S/8 models, we used batch sizes of 64 and 16, respectively, due to memory constraints. Table 1 shows the network configurations and three fine-tuning hyperparameters, where "dim" is the representation dimension of the output and "# params" is the total number of model parameters. "temp" and "wtemp" represent the teacher temperature and the warm-up teacher temperature, respectively, and the numbers in parentheses are the number of epochs used for warm-up. Model training was not required for Janus_txt, as we directly use pre-trained EAST and TrOCR models for GUI text localization and recognition [53,81].

EVALUATION RESULTS AND DISCUSSION
Table 2 shows Janus's duplicate detection performance compared to the baseline Tango, for their individual components: visual, textual, and sequential. Table 3 shows the performance of different combinations of Janus components, compared to the baseline.
Cells shaded green in these tables indicate a statistically significant (via Wilcoxon's paired test at the p < 0.05 level) higher effectiveness when comparing a given Janus configuration/component to a given Tango configuration/component. Yellow shaded cells indicate a higher performance, but without statistical significance. We present the results for each app of the original (mostly synthetic bugs) and extended (real bugs) datasets and the overall results accounting for all the apps in both sets, separately and combined.
While we computed the performance of four Janus_vis DINO models (i.e., DINO with ResNet, ViT-S/16, ViT-S/8, and ViT-B/16), we present (in Tables 2 and 3) the best-performing model for Janus_vis: DINO with ViT-B. Likewise, we report here the results of the best-performing model configuration for Janus_txt, namely EAST-80 (EAST that filters out region proposals smaller than 80 × 40) combined with TrOCR-s (TrOCR fine-tuned on real-world scenes, e.g., street scenes, instead of text found in printed and handwritten documents). The results for all the DINO, EAST, and TrOCR configurations can be found in our replication package [75].
Tables 2 and 3 show a consistent trend: the performance achieved by any duplicate detector (i.e., any configuration) is lower for the original dataset than for the extended dataset. After investigating the minimal set of ground-truth reproduction steps of the bugs used in the datasets, we found this trend is explained by the number of overlapping steps across distinct bugs in an app. We observed that distinct bugs for a given app in the original dataset have a larger step overlap than distinct bugs in the extended dataset. It is more challenging for a duplicate detector to distinguish between duplicate and non-duplicate videos if there is a larger step overlap across bugs (hence, across videos). Recall that in a duplicate detection task, the videos in the corpus are for distinct bugs; if there is a larger overlap among them, particularly between duplicates and non-duplicates, a detector would struggle to discern the differences.

RQ1: Janus_vis's Performance
Table 2 shows the duplicate detection effectiveness of Janus_vis (DINO with ViT-B) compared to visual Tango (SimCLR).
Before discussing the table results, we briefly discuss the results of comparing the training schemes (distillation via DINO vs contrastive via SimCLR, both using the same pre-trained ResNet weights). We found that SimCLR outperforms DINO for six of nine apps by a relatively small margin (by 3.5% mRR and 4.2% mAP, on average), but DINO outperforms SimCLR for the remaining three apps (APOD, GNU, DROID) by a larger margin (7.5% mRR and 6% mAP, on average). Overall, across all the apps, we found similar performance between these two approaches (a difference of less than 1.1% mRR/mAP), which indicates the training scheme does not have a large impact on duplicate detection performance.
Furthermore, both ViT-S/16 and ViT-S/8 used by Janus_vis's DINO exhibit superior performance compared to ResNet-50 used by visual Tango's SimCLR. Specifically, although ViT-S/16 and ViT-S/8 have a similar model size to ResNet-50, they outperform ResNet-50 by 2.91% and 2.92%, respectively, in terms of mRR on average, with statistical significance. This highlights the effectiveness of ViT over ResNet for duplicate video-based bug report detection.
We experimented with different weights (from 0 to 1 in 0.1 increments) using all duplicate detection tasks and selected the weights that lead to the highest mRR/mAP performance.
As mentioned earlier, the best Tango configuration is when its visual and textual components are combined (with weights of 0.8 and 0.2, respectively), as reported in the original paper [26]. Janus's visual and textual components (i.e., Janus_vis and Janus_txt) are combined using 0.9 and 0.1 as weights. This combination is denoted as "Visual + Textual" in Table 3. The table also shows the combination of Janus's visual/textual components and the sequential one: "Vis + Seq" denotes the average of the similarity scores produced by Janus_vis and Janus_seq-vis, while "Txt + Seq" denotes the average of the similarity scores produced by Janus_txt and Janus_seq-txt. An average combination means a weight of 0.5. Finally, we combine the similarities produced by the last two combinations using a weighted linear combination as follows: Sim(Vis + Seq) × 0.6 + Sim(Txt + Seq) × 0.4. This combination incorporates every information source from the videos and is denoted as "Vis + Txt + Seq".
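The combinations described above can be sketched in a few lines. This is an illustrative sketch only: the function names, the dictionary layout of the similarity scores, and the use of mRR alone as the selection criterion in the grid search are assumptions (the text selects weights on mRR/mAP), and the scores are made up.

```python
def combine(sim_a, sim_b, weight_a):
    """Weighted linear combination of two similarity dicts keyed by video ID."""
    return {v: weight_a * sim_a[v] + (1.0 - weight_a) * sim_b[v] for v in sim_a}

def vis_txt_seq(vis, txt, seq_vis, seq_txt):
    """Sim(Vis + Seq) x 0.6 + Sim(Txt + Seq) x 0.4, with averaged inner pairs."""
    vis_seq = combine(vis, seq_vis, 0.5)   # "Vis + Seq"
    txt_seq = combine(txt, seq_txt, 0.5)   # "Txt + Seq"
    return combine(vis_seq, txt_seq, 0.6)  # "Vis + Txt + Seq"

def reciprocal_rank(scores, duplicates):
    """1 / rank of the first true duplicate when ranking by descending score."""
    ranking = sorted(scores, key=scores.get, reverse=True)
    return 1.0 / min(ranking.index(d) + 1 for d in duplicates)

def best_weight(tasks):
    """Grid search over weights 0.0, 0.1, ..., 1.0, maximizing mRR only.

    tasks: list of (vis_scores, txt_scores, duplicate_ids) tuples.
    """
    best = None
    for step in range(11):
        w = step / 10
        mrr = sum(reciprocal_rank(combine(v, t, w), d)
                  for v, t, d in tasks) / len(tasks)
        if best is None or mrr > best[1]:
            best = (w, mrr)
    return best

# Made-up similarity scores for a two-video corpus.
vis = {"a": 0.9, "b": 0.5}
txt = {"a": 0.4, "b": 0.8}
seq_vis = {"a": 0.7, "b": 0.3}
seq_txt = {"a": 0.6, "b": 0.6}
combined = vis_txt_seq(vis, txt, seq_vis, seq_txt)
weight, mrr = best_weight([(vis, txt, ["a"])])
```

In this toy case the textual scores alone would rank the non-duplicate first, while the combined score ranks the duplicate "a" first, which mirrors why the weights favor the visual side.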
Table 3 shows that the best-performing Janus combinations are "Visual + Textual" and "Vis + Txt + Seq", outperforming the baseline by 4.6%/4.6% mRR/mAP and 8.7%/9.2% mRR/mAP overall, respectively (with statistical significance). The other two Janus combinations lead to mixed results: "Vis + Seq" leads to overall performance gains while "Txt + Seq" does not produce overall gains, due to its lower performance on the extended dataset.
When using "Visual + Textual", Janus significantly outperforms Tango on seven of nine apps and is only worse than Tango on FCS, considering both mRR and mAP. As previously mentioned, Janus's lower performance for FCS, compared to Tango, stems from the nature of the app itself. FCS is a web browser and the bugs used for this app were not dependent on a particular web page. When reproducing the bugs, the users navigated to different web pages, each one having different layouts and appearances. This means that the duplicate video-based bug reports appeared to be substantially different. Since Janus focuses more heavily on global GUI layout information, via its DINO+ViT model, Janus struggles to differentiate duplicates from non-duplicates. The local features learned by Tango seem to be useful for duplicate detection even when the duplicate videos show different layouts. The lower Janus mAP value on GNU is explained by the lower mAP values of Janus_vis and Janus_txt on that app (by 0.4% and 1.2%; see Table 2).
Janus's "Vis + Txt + Seq" configuration shows mRR/mAP improvements over the baseline Tango in eight of the nine apps (all except GNU). Across these apps, we observe improvements ranging from 6.8%/6.2% to 23.4%/26% mRR/mAP in the original dataset, and from 0.9%/1.8% to 7.5%/8.1% mRR/mAP in the extended dataset. This is interesting because the performance of the individual components of this configuration is substantially different across the apps. For instance, for TOK, the sequential aspect of the videos, individually combined with Janus_vis or Janus_txt, is less effective than Tango, but when Janus_vis and Janus_txt are combined together with Janus_seq, Janus leads to substantial improvement (by 6.8%/6.2% mRR/mAP). Another example is the FCS app, which seems to benefit from the visual and sequential information, as Janus_vis + Janus_seq-vis seems to contribute most to the overall performance of the "Vis + Txt + Seq" configuration. This suggests that the incorporation of sequential information enhances Janus's ability to handle dynamic content, resulting in improved performance in comparison to its "Visual + Textual" configuration and the baseline Tango.
Best Janus configuration: The best-performing Janus configuration combines visual (Janus_vis), textual (Janus_txt), and sequential (Janus_seq) information from video-based bug reports. This configuration outperforms the baseline duplicate detector for 8/9 mobile apps. It achieves an overall performance of 89.8%/84.7% mRR/mAP, outperforming the baseline by 8.7%/9.2% mRR/mAP. This means that Janus can reduce the effort that developers spend determining if a new video-based bug report shows a known bug (by (1.60 − 1.38)/1.38 ≈ 16%, based on avg. rank), since they would need to inspect only 1.38 videos on average (i.e., 1.38 avg. rank across all tasks) to find the first duplicate video in the candidate duplicates suggested by Janus.

Qualitative Analysis
We discuss two qualitative examples that illustrate the validity of our hypothesis that the richer representations learned by Janus's transformer-based visual representation and OCR models improve duplicate detection for video-based bug reports.

Example 1:
Transformer-based Representations Capture Subtle GUI patterns. To illustrate why we observed improvement in visual Janus as compared to visual Tango, we use interpretability techniques that generate saliency maps that help visualize the learned visual features. To visualize patterns learned by CNNs, we use a technique called AGF [39]. Although AGF can visualize self-supervised models such as SimCLR (used by the baseline), this requires training a supervised linear classifier after each layer and a dedicated algorithm to extract the segmentation information from their weights. Therefore, to simplify our comparison, instead of visualizing SimCLR directly, we visualize its main component, the ResNet-50 CNN, using AGF under supervision. We follow past work and use the pre-trained ResNet-50 (on ImageNet [28], the training dataset for ResNet) to generate the saliency map based on the class IDs with the highest probabilities for a given target GUI screen [39]. We further visualized the ViT-S/16 model (used by Janus_vis's DINO) by directly displaying the self-attention maps.
Textual Tango, which uses Tesseract OCR, is unable to distinguish between similar video reports for a number of bugs, including three bugs from the GNUCash (GNU) app [3]. Therefore, we visualize the detection bounding boxes of text for three keyframes of these three videos in Fig. 3 for both Tesseract (first row) and EAST [81] (used by Janus). The first report, for the GNU-CC6 bug, has a main trace that goes to the balance sheet screen and checks the sub-account; we show one keyframe for this report in Fig. 3-(a). The second video report, for the GNU-CC9 bug, navigates to the General Preferences screen, as shown in the keyframe in Fig. 3-(b). Finally, the report for GNU-CC7 changes the password under the General Preferences menu, as shown in Fig. 3-(c). While these bugs are different, they include many similar screens where keywords are important for differentiation.
As observed in Fig. 3, EAST is more accurate than Tesseract OCR for GUI component and text detection. In Fig. 3-(a), Tesseract OCR fails to localize the text on some buttons (e.g., sheet) and the text in brighter colors (e.g., Asset). Also, for the keyframe of GNU-CC9 (Fig. 3-(b)), Tesseract misses the text General Preferences, making it difficult to distinguish between reports GNU-CC9 and GNU-CC7, as they both access various parts of the settings menu. In addition, Tesseract fails to detect text in low-brightness and low-contrast regions, including the text on the dialing circles (Fig. 3-(c)), which also helps with differentiating between GNU-CC9 and GNU-CC7, since GNU-CC7 enters a passcode, but GNU-CC9 only accesses the passcode settings. Thus, the more accurate text localization of EAST aids in extracting the key text that helps differentiate between similar GUI screens.

THREATS TO VALIDITY

Internal and Construct Validity
Beyond the evaluation dataset, the implementation of Janus's models and experimental settings represent key validity threats. We controlled as many factors as possible for a fair comparison with the baseline. For instance, we implemented the 4-Codebook approach on both Janus and the baseline, used the same duplicate detection tasks, and measured their performance using well-known metrics in duplicate detection studies.

External Validity
To improve generalization, we created a new dataset to include 2,430 more duplicate detection tasks, for real bugs of different kinds, reported on mobile app issue trackers. These bugs were video recorded by multiple users on various mobile OS versions and did not include touch indicators. We ensured the recorded videos contained different reproduction scenarios for the same bugs. These decisions were made to make our dataset more comprehensive, realistic, and diverse. Our dataset could be improved by considering different app languages or other mobile platforms such as iOS.

RELATED WORK

Duplicate Textual Bug Report Detection
Many approaches have been proposed to detect duplicate textual bug reports to help developers avoid redundant effort during bug management. Most of the approaches leverage text retrieval techniques to obtain a ranked list of candidate duplicates for a query report [13,64,67,69]. Some approaches leverage extra information (fields [63,68,76], contexts [12,41], execution traces [73], etc.) and/or more effective similarity techniques (BM25F [68,70], topic-modeling [58], word-embedding [76], etc.) to improve the detection. Wang et al. [71] proposed SETU, which combines textual bug descriptions with screenshots to detect duplicates, rather than focusing on video reports (as we do).

Automated GUI Understanding for SE
Various GUI understanding approaches have been proposed to help software engineering tasks related to mobile apps, such as GUI reverse engineering [15,21,57,79], software testing [16,56,77], and GUI search [20,22]. Most of them detect GUI elements first to understand GUI information. Chen et al. [24] show that deep learning-based object detection models (Faster R-CNN [62], YOLOv3 [61], and CenterNet [32]) and the scene text detector EAST [81] outperform old-fashioned detection models [59] and the OCR tool Tesseract [66], respectively. However, these models, based on supervised learning, leverage GUI information limited to a few GUI element categories, and the relationships between different elements are not considered, thus lacking an understanding of the entire screen. Fu et al. [36] therefore attempt to understand the whole screen by considering these relationships, based on the Transformer architecture, to detect GUI elements more accurately.
The work most closely related to our own is that of Cooper et al. [26], which proposed the Tango duplicate detector and a dataset to evaluate it. While Janus leverages the same information from video-based bug reports as Tango does, there are key differences that set Janus apart. First, Janus learns visual features from app GUIs (via distillation and Vision Transformers) which capture GUI layouts more effectively for duplicate detection than Tango, which focuses on learning local GUI features (via contrastive learning and CNNs). Second, Janus learns textual representations of videos that are more useful for duplicate detection, by recognizing and extracting frame text more accurately (via fully neural models rather than the heuristic+neural approaches adopted by Tango). Third, Janus's sequential similarity computation, which attempts to align video frames, can be applied to both visual and textual representations, rather than to only visual representations as Tango does. Fourth, the best configuration of Janus combines all three modalities of video information (visual, textual, and sequential), and significantly outperforms the best Tango configuration on duplicate detection tasks that include both injected and real bugs for a diverse set of mobile apps. Notably, our evaluation dataset is more comprehensive, realistic, and diverse than the one used to evaluate Tango.

CONCLUSIONS
To assist developers in identifying video-based bug reports that show identical mobile app bugs, we propose Janus, a new approach for duplicate video-based bug report detection. Janus leverages visual, textual, and sequential information from videos via the combination of representation learning, information retrieval, and frame-alignment approaches.
We evaluated Janus and found that it significantly outperforms an existing duplicate detector. The evaluation considered a new benchmark of 7,290 duplicate detection tasks based on 270 video-based bug reports, drastically extending a prior dataset (with real bugs as opposed to injected bugs from prior work). We conducted ablation experiments and an in-depth qualitative analysis visually showing that Janus learns a more interpretable hierarchical visual representation and localizes text regions more accurately.

Figure 1: Overview of the Janus duplicate detector.

Figure 3: Bounding boxes localized by EAST and the Tesseract OCR library on keyframes of video-based bug reports.

... such as the difference between two similar pop-ups, or the difference between background and foreground elements when menus are displayed.
RQ1: What is Janus_vis's duplicate detection performance?
RQ2: What is Janus_txt's duplicate detection performance?
RQ3: What is Janus_seq's duplicate detection performance?
RQ4: What is the performance of Janus's component combinations?
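Metrics. We use standard metrics used in prior work on duplicate bug report detection evaluations [26,47,54,78]:
• Mean Reciprocal Rank (mRR): it gives a measure of the average ranking of the first duplicate video found in the candidate list of videos given by a duplicate detector. It is calculated as $\mathrm{mRR} = \frac{1}{T}\sum_{t=1}^{T}\frac{1}{r_t}$, where $T$ is the number of duplicate detection tasks and $r_t$ is the rank of the first duplicate video for task $t$.
• Mean Average Precision (mAP): it gives a measure of the average ranking of all the duplicate videos in the candidate list. It is calculated as $\mathrm{mAP} = \frac{1}{T}\sum_{t=1}^{T}\frac{1}{|D_t|}\sum_{d \in D_t} P_t(r_d)$, where $D_t$ is the set of duplicate videos for task $t$, $r_d$ is the rank of the duplicate video $d$, and $P_t(k) = \mathrm{dup}_t(k)/k$, where $\mathrm{dup}_t(k)$ is the number of duplicates in the top-$k$ candidates.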
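As a rough illustration of how these metrics behave, the sketch below computes mRR, mAP, and the average rank of the first duplicate under the standard definitions of these metrics. The function names and the toy rankings are assumptions made for illustration, not the evaluation scripts used in the paper.

```python
def ranks_of_duplicates(ranking, duplicates):
    """1-based ranks of the true duplicates in a ranked candidate list."""
    return sorted(ranking.index(d) + 1 for d in duplicates)

def reciprocal_rank(ranking, duplicates):
    """1 / rank of the first duplicate found in the candidate list."""
    return 1.0 / ranks_of_duplicates(ranking, duplicates)[0]

def average_precision(ranking, duplicates):
    """Mean of the precision values taken at the rank of each duplicate."""
    ranks = ranks_of_duplicates(ranking, duplicates)
    # At the rank of the i-th duplicate, i duplicates are in the top-rank list.
    return sum((i + 1) / r for i, r in enumerate(ranks)) / len(ranks)

def evaluate(tasks):
    """tasks: list of (ranked video IDs, set of true duplicate IDs)."""
    n = len(tasks)
    mrr = sum(reciprocal_rank(r, d) for r, d in tasks) / n
    mean_ap = sum(average_precision(r, d) for r, d in tasks) / n
    avg_rank = sum(ranks_of_duplicates(r, d)[0] for r, d in tasks) / n
    return mrr, mean_ap, avg_rank

# Two toy tasks with four candidates each (real tasks have 13 candidates).
tasks = [
    (["d1", "x", "d2", "y"], {"d1", "d2"}),  # duplicates at ranks 1 and 3
    (["x", "d3", "y", "d4"], {"d3", "d4"}),  # duplicates at ranks 2 and 4
]
mrr, mean_ap, avg_rank = evaluate(tasks)
```

In this toy case the first task contributes a reciprocal rank of 1 and the second 0.5, so mRR is 0.75, while the average rank of the first duplicate is 1.5.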

Table 1: The network configurations and fine-tuning hyperparameters for Janus_vis compared with SimCLR used by Tango