Aiding Developer Understanding of Software Changes via Symbolic Execution-based Semantic Differencing

According to a recent observational study, developers spend an average of 48% of their development time on debugging tasks. Approaches such as equivalence checking and fault localization support developers during debugging tasks by providing information that enables them to more quickly identify and deal with unintended changes in program behavior. The accuracy and runtime performance of these approaches have improved continuously over the years. However, the outputs of existing tools are often difficult for developers to understand due to a lack of context information and result explanations. Our goal is to address this issue by developing a new equivalence checking approach that (i) is at least as accurate as existing approaches but (ii) provides more detailed descriptions of identified behavioral / semantic differences and (iii) presents these results in a way that is useful for developers, thus aiding developer understanding of equivalence checking results and corresponding software changes.


MOTIVATION
Developers spend a lot of time debugging. In fact, developers self-report that 20-60% of their time is spent on debugging tasks [6]. A recent observational study of developers showed similar results, finding that debugging constitutes an average of 48% of total development time [3]. Furthermore, debugging not only takes place during dedicated bug fixing sessions but also takes up 40% of developer time during programming sessions dedicated to the implementation of new features [3]. These findings suggest that reducing the time required to perform debugging tasks promises significant cost savings and would free up developer resources for other purposes.
Various approaches have been developed throughout the years that aim to support developers during debugging by helping them more quickly understand and deal with source code changes that cause unexpected (and perhaps unintended) changes in program behavior. For example, equivalence checking informs developers whether the input-output behavior of a program has changed across versions [23], fault localization reports lines of code that might be the cause of unintended behaviors [27], and automated program repair aims to automatically fix programs that are crashing or otherwise failing tests after a modification has taken place [18,20].
While the effectiveness and runtime performance of such approaches have seen continuous improvements throughout the years, developers criticize that existing tools do not provide enough context information to be useful to them [16,22]. For example, equivalence checking tools generally only output whether two programs are equivalent or non-equivalent [4,14], fault localization tools only provide line numbers of potentially fault-inducing lines of code [27], and program repair tools only output patches that fix identified crashes or test failures [19,28]. However, none of these approaches provides any further evidence or explanations to support its results, which commonly causes developers to disregard the provided outputs because they feel that they cannot understand or trust them [21,22,26].

PROBLEM STATEMENT
Based on the findings presented above, we hypothesize that developers' understanding of program changes and trust in corresponding tool outputs can be improved if these outputs are enriched with additional context information and result explanations. More specifically, our goal is to develop a new equivalence checking approach that (i) is at least as accurate as existing approaches but (ii) provides more detailed descriptions of identified behavioral / semantic differences and (iii) presents these results in a way that is useful for developers, thus aiding developers' change understanding. Toward this goal, we aim to answer the following research questions:
RQ1 How can we exploit symbolic execution to create an equivalence checking approach that is at least as accurate as existing approaches while providing more detailed context information and result explanations?
RQ2 How should the collected information about software changes (i.e., equivalence checking results, context information, result explanations) be presented to developers in order to be most useful for them?
RQ3 To what degree does the provided information affect the speed, reliability, etc. with which developers are able to reason about software changes?
In the remaining sections, we describe our research approach including preliminary results and finish with expected contributions.

RESEARCH APPROACH
To answer the stated research questions, we split the corresponding research work into five work packages (WP1-WP5): two for RQ1, two for RQ2, and one for RQ3. These work packages are as follows.

Developing and Evaluating the Semantic Differencing Approach (RQ1)
Semantic Differencing (WP1): The goal of this work package was to develop an approach for the computation of equivalence checking results and any additional information that is required to explain these results to developers. The approach that we have developed for this primarily relies on symbolic execution [15], which has proven to be applicable to a wide range of software analysis tasks [5]. Desirable properties of symbolic execution include its ability to provide summaries of program behavior, which can serve as the basis of result explanations, and its underapproximation of program behavior, which avoids false-positive results that are often a source of developer frustration [22,23]. The prototype implementation of our approach is publicly available on GitHub.

Benchmarking (WP2): In this work package, we evaluated the effectiveness and efficiency of the equivalence checking / semantic differencing approach developed in WP1. To ensure an unbiased evaluation of our approach, we applied it to the ARDiff [4] benchmark to compare its runtime performance and the correctness of the reported equivalence checking results to three state-of-the-art equivalence checking approaches: ARDiff [4], DSE [23], and PRV [7]. The results of this evaluation, which are currently undergoing peer review, demonstrate that our approach correctly classifies more cases in the ARDiff benchmark than the three existing approaches, albeit at the cost of moderate runtime increases. A preprint of our results is available on arXiv [11] and a replication package that contains all evaluation scripts as well as the raw and processed evaluation data is available on Zenodo [12].
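
To illustrate how summaries of program behavior can serve as the basis of result explanations, the following minimal Python sketch compares two versions of a toy function via hand-written path summaries. It is not the prototype's implementation: the names `Path` and `semantic_diff` and both example versions are hypothetical, the summaries are written by hand rather than produced by a symbolic execution engine, and a bounded enumeration of integer inputs stands in for the constraint solver that a real tool would use to check whether two path conditions overlap.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class Path:
    """One entry of a (hand-written, hypothetical) program summary."""
    condition: str                  # human-readable path condition
    holds: Callable[[int], bool]    # the same condition as a predicate
    output: str                     # human-readable symbolic output
    compute: Callable[[int], int]   # the same output as a function


# Hypothetical old version: return x * 2 if x > 0 else 0
OLD = [
    Path("x > 0", lambda x: x > 0, "x * 2", lambda x: x * 2),
    Path("x <= 0", lambda x: x <= 0, "0", lambda x: 0),
]

# Hypothetical new version: return x + x if x > 1 else 0
NEW = [
    Path("x > 1", lambda x: x > 1, "x + x", lambda x: x + x),
    Path("x <= 1", lambda x: x <= 1, "0", lambda x: 0),
]


def semantic_diff(old: List[Path], new: List[Path],
                  inputs: Iterable[int]) -> List[Dict[str, object]]:
    """Report every pair of paths whose conditions overlap on some
    sampled input for which the two versions return different values."""
    samples = list(inputs)
    diffs = []
    for p in old:
        for q in new:
            # Assumption: bounded enumeration replaces a solver query.
            witnesses = [x for x in samples
                         if p.holds(x) and q.holds(x)
                         and p.compute(x) != q.compute(x)]
            if witnesses:
                diffs.append({
                    "condition": f"({p.condition}) and ({q.condition})",
                    "old_output": p.output,
                    "new_output": q.output,
                    "witness": witnesses[0],
                })
    return diffs


if __name__ == "__main__":
    for d in semantic_diff(OLD, NEW, range(-5, 6)):
        print(d)
```

Instead of a bare "not equivalent" verdict, each reported entry names the input partition in which the versions diverge, the output expression of each version in that partition, and a concrete witness input. Because only sampled inputs are examined, an empty result from this sketch does not establish equivalence.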

Developing Appropriate Visualizations of the Semantic Differencing Data (RQ2)
Data Visualization (WP3): Existing studies have shown that source code analysis results are most useful to developers when presented inside developers' IDEs [2,17]. Consequently, the goal of this work package is to develop an IDE plugin for a commonly used IDE such as IntelliJ or VS Code that presents the data computed by our approach in a way that is appropriate for developers. Initial representations of the data will be based on the findings of existing user studies in related domains such as general program understanding [13,24] as well as our own experiences and intuitions as developers. The representations will later be refined using developer feedback from the pilot study conducted in WP4.

Pilot Study (WP4): The goal of WP4 is to conduct a pilot study with around 5 developers to collect initial feedback for the IDE plugin developed in WP3. For each study participant, we will provide a demonstration of our plugin's features, have them complete a series of change understanding tasks with our plugin while thinking aloud [25], and then gather more in-depth feedback through a semi-structured interview [1]. Observations from the change understanding tasks as well as suggestions and criticism from the interview sessions will be used to improve our first IDE plugin prototype, thus making it more useful for a larger range of developers. Task completion data, interview guides, etc. will be made publicly available to the research community.

Evaluating the Usefulness of the Semantic Differencing Data (RQ3)
User Study (WP5): To evaluate the usefulness of the improved prototype created in WP4, we plan to conduct a user study with 20-30 developers. User study sessions will follow a structure similar to that of the pilot study, consisting of a tool demonstration followed by the completion of change understanding tasks and a concluding interview. However, for the change understanding tasks, we will ask developers to not only use our own tool, but to also complete some tasks with an existing tool for source code differencing such as git diff, IJM [9], GumTree [8], or ChangeDistiller [10]. By following this approach, we will be able to compare our tool to the state of the art both quantitatively (via task completion times, accuracy, etc.) and qualitatively (via interview feedback).

EXPECTED CONTRIBUTIONS
In our research, we will adhere to open science principles. As mentioned throughout the previous sections, we therefore plan to make all created research prototypes, interview guides, raw and processed data, etc. publicly available. In particular, we expect to make the following contributions throughout our research work:
C1 an approach for equivalence checking of software programs that provides more information about behavioral / semantic differences than existing approaches (WP1),
C2 an open source prototype that implements the computation of the raw semantic differencing data as a standalone Java application (WP1),
C3 an open source prototype that implements the processing and visualization of the semantic differencing data as an IDE plugin (WP3),
C4 benchmarking results that compare the runtime requirements and accuracy of our approach to state-of-the-art equivalence checking approaches (WP2),
C5 experimental results that compare how quickly and accurately developers are able to complete change understanding tasks when using our IDE prototype vs. (prototypes of) state-of-the-art tools (WP4 + WP5),
C6 developer feedback that compares the perceived usefulness of our IDE prototype to (prototypes of) state-of-the-art tools (WP4 + WP5).
Through these contributions, we expect to provide reproducible evidence for (or against) our hypothesis that developers' understanding of program changes as well as trust in equivalence checking outputs can be improved if this information is enriched with additional context information and result explanations.