On Improving Management of Duplicate Video-Based Bug Reports

Video-based bug reports have become a promising alternative to text-based reports for programs centered around a graphical user interface (GUI), as they allow for seamless documentation of software faults by visually capturing buggy behavior on app screens. However, developing automated techniques to manage video-based reports is challenging as it requires identifying and understanding often nuanced visual patterns that capture key information about a reported bug. Therefore, my research endeavors to overcome these challenges by advancing the bug report management task of duplicate detection for video-based reports. The objectives of my research are fourfold: (i) investigate the benefits of tailoring recent advancements in the computer vision domain for learning both visual and textual patterns from video frames depicting GUI screens to detect duplicate reports; (ii) adapt the scene-learning capabilities of vision transformers to capture subtle visual and textual patterns that manifest on app UI screens; (iii) construct a more comprehensive and realistic benchmark which contains video-based bug reports derived from real bugs; (iv) conduct an empirical evaluation to potentially demonstrate state-of-the-art improvements achieved by the proposed approach.


INTRODUCTION
Due to the graphical nature of mobile applications, video-based bug reports are a natural fit for capturing buggy behavior, as they depict how a given fault manifests through the graphical user interface (GUI). Additionally, they are simple to create, since screen recording is now built into the Android operating system and no longer requires a third-party application. Coupled with the support for video attachments in popular issue tracking software such as GitHub Issues, this means video-based reports are quickly becoming a major modality of information in the mobile application domain. In fact, a recent study [15] of open source applications listed on FDroid [1], spanning the years 2012-2020, found that over 13k visual recordings were present in the issue trackers of mobile apps, with the vast majority uploaded between 2018 and 2020, indicating growing popularity.
The increasing prevalence of video-based reports brings with it a growing set of challenges related to their management. Specifically, because of the rich, pixel-based data captured in videos, automatically capturing the nuanced patterns depicted is a challenging proposition. This difficulty in the automated analysis of video contents drastically complicates the creation of techniques for automated bug report management, such as triaging, duplicate detection, and fault localization. However, given the size and complexity overhead associated with video files as compared to text files, duplicate detection may be one of the more important tasks for managing video-based bug reports. For instance, detecting duplicate videos can allow for new reports to be archived or deleted to save space in software repositories and can reduce the amount of valuable developer time spent (re)watching and comparing videos of bugs to identify, group, and manage duplicates [12].
However, only recently have approaches been explored for detecting duplicate video-based bug reports [12], due in part to the scarcity of data for training and evaluating such approaches. My research aims to build a new approach to improve the state of the art for duplicate video-based bug report detection [36]. Specifically, I will leverage recent advancements in the computer vision domain, namely the introduction of the Vision Transformer (ViT) and new training schemes [5,13] for scaling visual deep learning models. Our hypothesis is that the rich hierarchical features of self-supervised ViT models contain explicit scene layout information that helps to distinguish subtle visual patterns in video frames depicting GUI screens. Additionally, the proposed approach will leverage similar advancements in image representation learning and Optical Character Recognition (OCR) by adapting various models, such as the Efficient and Accurate Scene Text Detector (EAST) [40], CRAFT [2], and TrOCR [25], for improved localization and recognition of text in video frames, compared to the prior technique, which only combines learning-based methods and heuristics [12]. To evaluate the proposed approach, a new benchmark will be created for duplicate detection of video-based bug reports. It will contain duplicate detection tasks constructed from a larger number of real bugs, extending the evaluation dataset used in prior work, which relied mostly on synthetic bugs [12]. An empirical evaluation will be conducted on this comprehensive benchmark to potentially demonstrate state-of-the-art performance achieved by the proposed approach when compared to the prior baseline [12].

RELATED WORK
GUI Comprehension. GUI understanding can help many software engineering tasks related to mobile applications, such as GUI reverse engineering [3,7,27,38], software testing [4,26,29,37], and GUI search [6,8]. Most GUI understanding techniques first need to detect GUI elements in order to understand the information provided by the GUI. Chen et al. [9] show that deep learning-based object detection models [14,30,31] and the scene text detector EAST [40] outperform old-fashioned detection models [28] and the OCR tool Tesseract [32], respectively. Fu et al. [16] utilize the Transformer architecture for GUI element detection, but only based on limited pixel words. The most closely related work to our own [12] uses the self-supervised approach SimCLR [10], based on ResNet [18], to understand the visual GUI and uses OCR to obtain the textual information in order to detect duplicate video-based bug reports.
Duplicate Video Retrieval. To retrieve similar videos, traditional techniques in the computer vision domain first extract global and/or local features of video frames, then aggregate the extracted features to represent a whole video, and finally calculate similarity scores between videos. The visual features are extracted either by handcrafted image processing methods, such as Local Binary Patterns (LBP) [19,35] and the Scale Invariant Feature Transform (SIFT) [34,39], or by Convolutional Neural Networks (CNNs) [18,33]. The features can then be aggregated based on global vectors [34], bag-of-words [11,22], or deep metric learning [23]. Kordopatis-Zilos et al. [21] conducted a comprehensive experimental study comparing feature extraction methods, CNN architectures, and aggregation schemes, showing that CNN+BoVW is the best performing combination, which is why the most relevant work [12] chose this strategy to obtain video representations.

PROPOSED APPROACH
In this section, a brief overview of the proposed approach is provided. The approach will build upon the success of past techniques [12] and adapt a framework that combines visual and textual information modalities for duplicate video-based bug report detection. Specifically, the approach receives as input two video-based bug reports and outputs a similarity score that indicates how likely they are to depict the same app bug. Therefore, it can be used to compute scores between a new video-based bug report and a corpus of previously submitted bug reports. The scores allow for ranking the corpus videos as a list of potential duplicate candidates.
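The ranking step described above can be illustrated with a minimal sketch. The function name `rank_duplicate_candidates`, the report identifiers, and the Jaccard similarity over keyword sets are all hypothetical stand-ins chosen for illustration; the proposed approach would supply its own learned similarity function in place of `jaccard`.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple


def rank_duplicate_candidates(
    new_report: FrozenSet[str],
    corpus: Dict[str, FrozenSet[str]],
    similarity: Callable[[FrozenSet[str], FrozenSet[str]], float],
) -> List[Tuple[str, float]]:
    """Score every corpus report against the new report, highest score first."""
    scored = [
        (report_id, similarity(new_report, report))
        for report_id, report in corpus.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Placeholder similarity (assumption): overlap of two keyword sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical corpus: each report is reduced to a set of keywords.
corpus = {
    "report-1": frozenset({"login", "crash", "button"}),
    "report-2": frozenset({"settings", "theme", "toggle"}),
    "report-3": frozenset({"login", "crash", "keyboard"}),
}
new_report = frozenset({"login", "crash", "button", "tap"})

ranking = rank_duplicate_candidates(new_report, corpus, jaccard)
for report_id, score in ranking:
    print(report_id, round(score, 3))
```

Keeping the similarity function as a parameter means the same ranking code could be reused with a visual-only, textual-only, or combined score.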
Internally, the proposed approach will begin by taking the two videos and subsampling a number of video frames. Next, it vectorizes each video in two ways. For the visual component, it discretizes the frames using a ViT-based feature extractor (e.g., [5]) into a Bag of Visual Words (BoVW) representation. For the textual component, it extracts the text of each frame [2,25,40] and constructs a video document of the concatenated text, encoded as a Bag of Words (BoW). Sequential information can further be added to both the visual and textual TF-IDF representations. Each pair of visual or textual representations is then compared via cosine similarity. The visual and textual similarities can be used individually to rank duplicate candidates, or they can be combined into a single similarity score that accounts for both information modalities to enhance effectiveness [36]. In addition, I will further leverage multimodal Transformers (e.g., [17,24]) to fuse visual and textual information through mutually supervised objectives during training and inference for more accurate video representations.
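As a rough illustration of the comparison step, the sketch below weights bag-of-words vectors with TF-IDF, compares them with cosine similarity, and merges the two modality scores. Several parts are assumptions made for this example rather than details from the approach: the visual-word identifiers (`v1`, `v3`, ...) stand in for the output of quantizing ViT frame features, the smoothed IDF formula is one common variant, and the linear weighting in `combined_similarity` is only one possible way to combine the scores.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def tf_idf(docs: Sequence[Sequence[str]]) -> List[Dict[str, float]]:
    """Turn each bag of (visual or textual) words into a TF-IDF weighted vector."""
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            # Smoothed IDF (assumption): ln((1 + N) / (1 + df)) + 1
            word: (count / len(doc))
            * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1.0)
            for word, count in counts.items()
        })
    return vectors


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def combined_similarity(visual: float, textual: float, weight: float = 0.5) -> float:
    """Linear mix of both modalities; `weight` is a hypothetical tuning knob."""
    return (1.0 - weight) * visual + weight * textual


# Hypothetical videos A, B, C: visual-word ids per sampled frame and OCR tokens.
visual_docs = [["v3", "v7", "v7", "v1"], ["v3", "v7", "v1", "v1"], ["v9", "v4", "v4", "v2"]]
text_docs = [["settings", "save", "error"], ["settings", "save", "crash"], ["login", "password"]]

visual_vecs = tf_idf(visual_docs)
text_vecs = tf_idf(text_docs)

score_ab = combined_similarity(cosine(visual_vecs[0], visual_vecs[1]),
                               cosine(text_vecs[0], text_vecs[1]))
score_ac = combined_similarity(cosine(visual_vecs[0], visual_vecs[2]),
                               cosine(text_vecs[0], text_vecs[2]))
print("A vs B:", round(score_ab, 3), "A vs C:", round(score_ac, 3))
```

In this toy data, videos A and B share most visual words and OCR tokens while C shares none, so A and B receive the higher combined score.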

RESEARCH DESIGN
My research aims to investigate the performance of the two (visual and textual) components of the proposed approach, as well as the performance of the two components combined, when compared to the baseline technique [12]. To evaluate the performance of different models for duplicate video-based bug report detection, previous work [12] collected 60 distinct bugs across six Android apps. App users further recorded 180 videos, which were used to construct 4,860 tasks for evaluation. Given that most of the bugs collected in [12] are injected, synthetic faults, as opposed to real-world faults, we will extend this benchmark by constructing an evaluation dataset containing only real bugs. Recently, the AndroR2+ dataset [20] was released, which contains 180 manually reproduced bug reports for Android apps. For each bug report, AndroR2+ provides a link to the original bug report, an APK binary of the buggy version of the app, and a reproduction script. Therefore, we would be able to collect more real bugs from the AndroR2+ dataset and record more videos following the pipeline provided by [12] in order to create additional tasks for a comprehensive and realistic benchmark. Additionally, since duplicate bug report detection is modelled as an information retrieval task, standard information retrieval metrics, such as mean reciprocal rank and mean average precision, will be used to evaluate the performance of the proposed approach.
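For reference, the two metrics named above can be computed as in the following minimal sketch. The task data is hypothetical: each task pairs a ranked list of candidate report identifiers with the set of true duplicates for that query, and the identifiers do not come from any benchmark.

```python
from typing import List, Sequence, Set, Tuple


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / position of the first true duplicate, or 0 if none is retrieved."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Mean of precision@k taken at each position k that holds a true duplicate."""
    hits = 0
    precision_sum = 0.0
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / position
    return precision_sum / len(relevant) if relevant else 0.0


def mean(values: List[float]) -> float:
    return sum(values) / len(values) if values else 0.0


# Hypothetical duplicate detection tasks: (ranked candidates, true duplicates).
tasks: List[Tuple[List[str], Set[str]]] = [
    (["r2", "r1", "r3"], {"r1"}),
    (["r5", "r6", "r4"], {"r5", "r4"}),
]

mrr = mean([reciprocal_rank(ranked, relevant) for ranked, relevant in tasks])
mean_ap = mean([average_precision(ranked, relevant) for ranked, relevant in tasks])
print("MRR:", mrr, "mAP:", mean_ap)
```

Both metrics lie in [0, 1] and reward rankings that place true duplicates near the top, which matches the goal of reducing the number of videos a developer must watch.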

ANTICIPATED CONTRIBUTION
My research is intended to facilitate bug report management, with a particular focus on duplicate video-based bug report detection. The proposed approach will be able to analyze a new video-based bug report and a corpus of previously submitted ones, and generate a ranked list of potential duplicates among the corpus videos. By watching the videos in ranked order rather than in random order, developers can save time in identifying duplicate bug reports [12] during bug triaging. The anticipated contributions include a novel approach comprising various components that identify the visual, textual, and sequential information present in video-based bug reports. The visual and textual information can potentially be mutually supervised for more accurate video representations. To precisely evaluate the proposed approach, I plan to construct a more comprehensive and realistic benchmark that includes more video-based bug reports derived from real bugs. This dataset will be open to the community to foster future work that aims to advance duplicate video-based bug report detection. Ultimately, my research goal is to implement the proposed approach as a usable tool and release it to the community as open-source software.