GitBug-Actions: Building Reproducible Bug-Fix Benchmarks with GitHub Actions

Bug-fix benchmarks are fundamental in advancing various sub-fields of software engineering such as automatic program repair (APR) and fault localization (FL). A good benchmark must include recent examples that accurately reflect the technologies and development practices of today. To be executable in the long term, a benchmark must feature test suites that do not degrade over time due to, for example, dependencies that are no longer available. Existing benchmarks fail to meet both criteria. For instance, Defects4J, one of the foremost Java benchmarks, last received an update in 2020. Moreover, full reproducibility has been neglected by the majority of existing benchmarks. In this paper, we present GitBug-Actions: a novel tool for building bug-fix benchmarks with modern and fully-reproducible bug-fixes. GitBug-Actions relies on the most popular CI platform, GitHub Actions, to detect bug-fixes and to execute the CI pipeline locally in a controlled and reproducible environment. To the best of our knowledge, we are the first to rely on GitHub Actions to collect bug-fixes. To demonstrate our toolchain, we deploy GitBug-Actions to build a proof-of-concept Go bug-fix benchmark containing executable, fully-reproducible bug-fixes from different repositories. A video demonstrating GitBug-Actions is available at: https://youtu.be/aBWwa1sJYBs.


INTRODUCTION
Bug-fix benchmarks play a pivotal role in advancing the field of software engineering by providing essential resources for evaluating methodologies in various sub-fields, such as automatic program repair (APR) and fault localization (FL) [1,2]. A bug-fix is a software modification that fixes an existing defect, aligning the program's behavior with the intended specification. It is represented by a pair of commits: a buggy commit and a subsequent fixing commit. For example, Defects4J [3] is a widely adopted bug-fix benchmark that has greatly served software engineering research in the past decade. Good benchmarks must be representative and rigorous.
First, benchmarks must ensure that the research community studies relevant problems of today in modern software. Rather than relying on outdated examples, benchmarks must include bug-fixes that reflect modern development practices and that use modern programming languages and build tools. Moreover, benchmarks of recent bugs help reduce the risk of data leakage, that is, evaluating techniques based on large language models (LLMs) with data seen at training time [4,5].
Second, reproducible benchmarks ensure that such studies can be validated by third parties both today and in the future. While reproducibility is fundamental to the scientific method, bug-fix benchmarks have failed to retain it over time. For example, Zhu and Rubio-González [6] show that reproducibility in bug-fix benchmarks varies between 26.6% and 96.9%, with none achieving full reproducibility.
Continuous Integration (CI) systems have served as a valuable source of bug instances [7,8]. By automating the build and testing processes, CI systems precisely capture developer-specified environments from which bug-fix samples can be collected and reproduced. GitHub Actions is the most popular CI system [9]. It offers GitHub users an integrated platform for defining CI workflows to build and execute test suites. Our key insight is that GitHub Actions is a valuable resource for creating high-quality bug-fix benchmarks.
In this paper, we propose GitBug-Actions, a novel methodology for building bug-fix benchmarks based on GitHub Actions. GitBug-Actions relies on GitHub Actions to identify and run bug-fixes in the same environment as the one defined by developers. By using GitHub Actions, GitBug-Actions collects recent bug-fixes that reflect the variety of real-world cases, owing to the CI system's widespread adoption.
Moreover, GitBug-Actions preserves the collected bug-fixes in fully-reproducible formats. This is achieved by locally executing the bug-fixes in the environment specified by the developers in the GitHub Actions workflow, and then storing the necessary files, in particular all software dependencies, to re-execute the bug-fixes offline in the same environment. In this way, GitBug-Actions builds bug-fix benchmarks that uphold scientific standards with respect to reproducibility. The bug-fixes in GitBug-Actions are designed to remain executable indefinitely.
To validate the concept of GitBug-Actions, we build a proof-of-concept benchmark of Go bug-fixes from January 2023. GitBug-Actions successfully collects bug-fixes that are (1) executable, (2) fully-reproducible, and (3) drawn from different repositories.
To summarize, our contributions are:
• An original workflow for building bug-fix benchmarks using GitHub Actions, called GitBug-Actions. To the best of our knowledge, we are the first to use GitHub Actions to build bug-fix benchmarks.
• GitBug-Actions's implementation, made publicly available for researchers to build benchmarks in their programming language and stack of choice: https://github.com/gitbugactions/gitbugactions.
• A proof-of-concept benchmark of Go bug-fixes from January 2023, collected by GitBug-Actions. The benchmark contains 22 fully-reproducible bug-fix commits from 14 different repositories.

THE GITBUG-ACTIONS WORKFLOW
In this section, we present the design of GitBug-Actions, a novel tool for collecting fully-reproducible bug benchmarks.

Overall Workflow
GitBug-Actions builds on top of the ability to locally execute GitHub Actions (Section 2.2) for multiple programming languages and build systems (Section 2.3). Its pipeline, shown in Figure 1, is composed of three stages:
(1) Collect Repositories (Section 2.4)
(2) Collect Bug-Fix Commits (Section 2.5)
(3) Build and Run Offline Reproducible Environment (Section 2.6)
The final product of GitBug-Actions is a benchmark containing high-quality reproducible bug-fix commits.

Locally Execute GitHub Actions
GitHub Actions is a CI service provided by GitHub, and the most popular CI system according to a 2022 survey by JetBrains [9]. GitHub Actions builds are declared by workflows. Workflows are configurable YAML documents that define: (1) the events that trigger a run (e.g., a git push event); (2) the jobs that will run; (3) the environment in which the jobs will run, typically a VM in Azure; and (4) the list of steps to be run in each job, each of which either runs shell commands or invokes a reusable third-party action.
Executing bug-fixes is a fundamental aspect of a bug-fix benchmark, since one needs to execute test suites to verify the correctness or incorrectness of a program. To build a fully-reproducible benchmark, execution needs to be local so that it does not depend on third-party services. To locally execute GitHub Actions, GitBug-Actions relies on the well-established open-source tool Act. Act uses Docker images that imitate GitHub's proprietary execution environments. For each build, it initializes containers based on these images and on the environment setup defined in the workflow to run.
GitBug-Actions builds on top of Act. First, GitBug-Actions identifies the workflows that contain test execution commands (e.g., mvn test) by static analysis of the YAML build file. Then, the identified workflows are modified as follows: the operating system is set to ubuntu-latest to ensure compatibility with Act; matrix configurations are simplified to the first existing configuration; only jobs containing test commands and their dependencies (i.e., as stated by the needs operator) are kept; and test commands are instrumented to ensure the generation of test reports. After executing Act on the instrumented workflows, GitBug-Actions parses the test reports and returns the test execution result. Test execution results are used to verify whether the program respects the expected behavior defined by the test suite.
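The workflow modifications described above can be illustrated with a minimal Python sketch. It is not the tool's actual implementation: the function names, the keyword list, and the representation of a parsed workflow as a plain dictionary are all assumptions made for illustration, and the instrumentation of test commands is omitted. Matrix simplification is approximated by keeping the first value of each matrix axis.

```python
import copy

# Hypothetical keyword list; the real tool's detection logic may differ.
TEST_KEYWORDS = ("go test", "mvn test", "pytest")


def job_runs_tests(job):
    """Return True if any step's shell command contains a test keyword."""
    return any(
        keyword in step.get("run", "")
        for step in job.get("steps", [])
        for keyword in TEST_KEYWORDS
    )


def needed_jobs(jobs, name, seen=None):
    """Collect `name` and, transitively, every job it depends on via `needs`."""
    seen = set() if seen is None else seen
    if name in seen or name not in jobs:
        return seen
    seen.add(name)
    needs = jobs[name].get("needs", [])
    if isinstance(needs, str):  # `needs` may be one job name or a list of names
        needs = [needs]
    for dependency in needs:
        needed_jobs(jobs, dependency, seen)
    return seen


def simplify_workflow(workflow):
    """Return a modified copy of a parsed workflow, leaving the input untouched."""
    jobs = workflow.get("jobs", {})
    keep = set()
    for name, job in jobs.items():
        if job_runs_tests(job):
            needed_jobs(jobs, name, keep)

    result = copy.deepcopy(workflow)
    result["jobs"] = {}
    for name in jobs:
        if name not in keep:
            continue
        job = copy.deepcopy(jobs[name])
        job["runs-on"] = "ubuntu-latest"  # run every kept job on ubuntu-latest
        matrix = job.get("strategy", {}).get("matrix")
        if isinstance(matrix, dict):
            # Approximation: keep only the first value of each matrix axis.
            job["strategy"]["matrix"] = {
                axis: values[:1] if isinstance(values, list) else values
                for axis, values in matrix.items()
            }
        result["jobs"][name] = job
    return result


# Illustrative workflow, written as the dictionary a YAML parser would produce.
workflow = {
    "on": "push",
    "jobs": {
        "build": {"runs-on": "macos-latest", "steps": [{"run": "go build ./..."}]},
        "test": {
            "runs-on": "${{ matrix.os }}",
            "needs": "build",
            "strategy": {
                "matrix": {
                    "os": ["windows-latest", "ubuntu-latest"],
                    "go": ["1.19", "1.20"],
                }
            },
            "steps": [{"uses": "actions/checkout@v3"}, {"run": "go test ./..."}],
        },
        "lint": {"runs-on": "ubuntu-latest", "steps": [{"run": "golangci-lint run"}]},
    },
}
simplified = simplify_workflow(workflow)
```

In this example the lint job is dropped because it neither runs tests nor is needed by a job that does, while build is kept because the test job depends on it.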

Tailoring Data Collection per Programming Language and Build System
GitBug-Actions is designed to be programming language and build system agnostic. It builds a unified abstraction layer that standardizes the interaction with test execution workflows. This layer absorbs the variability that comes with supporting diverse languages and build tools. The information needed to support a given programming language and build system is: (1) how to distinguish source-code files from test files, (2) how to identify a test execution command, and (3) how to retrieve test execution results. GitBug-Actions already supports Java (Maven and Gradle), Python (pytest and unittest) and Go.
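As an illustration of such an abstraction layer, the sketch below models the three pieces of information as an abstract Python interface with one possible Go implementation. The class and method names are hypothetical and do not come from the GitBug-Actions code base. The Go result parser assumes the tests were run with `go test -json`, which emits one JSON event per line; the real tool may obtain test reports differently.

```python
import json
from abc import ABC, abstractmethod


class BuildSystemSupport(ABC):
    """Hypothetical interface: the three questions asked of each build system."""

    @abstractmethod
    def is_test_file(self, path):
        """(1) Distinguish test files from source-code files."""

    @abstractmethod
    def is_test_command(self, command):
        """(2) Identify a test execution command."""

    @abstractmethod
    def parse_results(self, report):
        """(3) Retrieve test execution results as {test name: outcome}."""


class GoSupport(BuildSystemSupport):
    def is_test_file(self, path):
        # Go's convention: test code lives in files ending with _test.go.
        return path.endswith("_test.go")

    def is_test_command(self, command):
        return "go test" in command

    def parse_results(self, report):
        # Assumption: `report` is the output of `go test -json`.
        results = {}
        for line in report.splitlines():
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip blank or non-JSON lines such as build output
            if not isinstance(event, dict):
                continue
            # Package-level events carry no "Test" field and are ignored here.
            if "Test" in event and event.get("Action") in ("pass", "fail", "skip"):
                results[event["Package"] + "." + event["Test"]] = event["Action"]
        return results


go = GoSupport()
report = "\n".join([
    '{"Action":"run","Package":"example.com/calc","Test":"TestAdd"}',
    '{"Action":"pass","Package":"example.com/calc","Test":"TestAdd","Elapsed":0.01}',
    '{"Action":"fail","Package":"example.com/calc","Test":"TestDiv","Elapsed":0.02}',
    '{"Action":"fail","Package":"example.com/calc","Elapsed":0.05}',
])
results = go.parse_results(report)
```

Supporting another build system under this design would amount to adding one more subclass, which matches the paper's later remark that the Go logic fits in a single file.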

Collect Repositories
GitBug-Actions employs a systematic approach to identify locally reproducible open-source repositories on GitHub.
First, it selects repositories that meet specific criteria expressed as a GitHub search query, for example the number of stars a repository has received from users. Second, it attempts to execute the repositories' GitHub Actions as described in Section 2.2. Repositories are retained if and only if GitBug-Actions is able to retrieve test execution results from the executed workflows. Successful execution and retrieval of test reports indicates that the executed actions run tests, which makes the repository a potential source of bug instances for further analysis. We recall that test execution results are fundamental for building a bug-fix benchmark since they are used to verify the correctness or incorrectness of a program.

Collect Bug-Fix Commits
GitBug-Actions searches the commit history of each of the locally executable repositories for bug-fix commits. A bug-fix is a pair of commits $(c_{n-1}, c_n)$ such that $c_{n-1}$ corresponds to the buggy version and $c_n$ corresponds to the fixed version. GitBug-Actions collects behavioral bug-fixes, so buggy programs must have at least one failing test case, and fixed programs must have a passing test suite.
To identify bug-fix commits, GitBug-Actions first splits each candidate commit's patch into three patches: (1) Source Patch: Contains changes in the source code under test, (2) Test Patch: Contains changes in test cases, and (3) Non-Code Patch: Contains changes in non-source files such as documentation files.
This split is important for two reasons. First, developers often bundle non-source-code changes into commits that fix behavioral bugs. As a result, bug-fix commits can become polluted with changes that are not relevant to the program's behavior. By isolating these changes, GitBug-Actions collects higher-quality bug-fix patches. Second, when fixing a bug, developers often introduce test changes to validate the fix. Isolating test changes is thus crucial in building a buggy version that has failing test cases.
GitBug-Actions then looks for the following bug-fix patterns:
(1) Passing Commit + Passing Commit with Source and Test Changes: In this scenario, the human developer introduces bug-fix changes with $c_n$, as well as test changes to validate them. We look for pairs of subsequent commits $(c_{n-1}, c_n)$ which have passing CI builds. $c_n$ must introduce both source code and test code changes, which must not be removal only. $c_{n-1}$'s build must fail only when the test changes from $c_n$ are applied.
(2) Failing Commit + Passing Commit with Only Source Changes: In this scenario, the human developer introduces bug-fix changes such that the program adheres to the entire pre-existing specification. We look for pairs of commits $(c_{n-1}, c_n)$ with the following CI status: $c_{n-1}$ has a failing CI build and $c_n$ has a passing CI build. $c_n$ must exclusively introduce source code changes.
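The patch split and the pattern check can be sketched in Python as follows. This is an illustrative simplification under several assumptions, not the tool's implementation: file classification uses Go naming conventions only, the patch is represented as a hypothetical mapping from file path to diff text, the "not removal only" condition is interpreted as requiring added lines in both the source and the test patch, and non-code changes are ignored when checking for source-only fixes. The three boolean inputs stand for test outcomes that the real pipeline would obtain by executing the workflows.

```python
def classify_file(path):
    """Assign a changed file to the source, test, or non-code patch (Go only)."""
    if path.endswith("_test.go"):
        return "test"
    if path.endswith(".go"):
        return "source"
    return "non_code"


def split_patch(changed_files):
    """Split {path: diff text} into source, test, and non-code patches."""
    patches = {"source": {}, "test": {}, "non_code": {}}
    for path, diff in changed_files.items():
        patches[classify_file(path)][path] = diff
    return patches


def adds_lines(patch):
    """True if some diff in the patch adds a line, i.e. it is not removal-only."""
    return any(
        line.startswith("+") and not line.startswith("+++")  # skip file headers
        for diff in patch.values()
        for line in diff.splitlines()
    )


def match_pattern(patches, prev_passes, curr_passes, prev_fails_with_new_tests):
    """Return the name of the matching bug-fix pattern, or None."""
    has_source = bool(patches["source"])
    has_test = bool(patches["test"])
    if (
        prev_passes
        and curr_passes
        and has_source
        and has_test
        and adds_lines(patches["source"])
        and adds_lines(patches["test"])
        and prev_fails_with_new_tests
    ):
        return "pass-pass with source and test changes"
    if not prev_passes and curr_passes and has_source and not has_test:
        return "fail-pass with only source changes"
    return None


source_diff = (
    "--- a/calc/div.go\n+++ b/calc/div.go\n"
    "-\treturn a / b\n+\tif b == 0 {\n+\t\treturn 0\n+\t}\n+\treturn a / b\n"
)
test_diff = "--- a/calc/div_test.go\n+++ b/calc/div_test.go\n+func TestDivByZero(t *testing.T) {}\n"

with_tests = split_patch({
    "calc/div.go": source_diff,
    "calc/div_test.go": test_diff,
    "README.md": "+Note on division by zero.\n",
})
source_only = split_patch({"calc/div.go": source_diff})

first = match_pattern(with_tests, True, True, True)
second = match_pattern(source_only, False, True, False)
rejected = match_pattern(with_tests, True, True, False)
```

The last call shows a rejected candidate: both commits pass and both source and tests change, but the new tests do not fail on the earlier commit, so there is no evidence of a behavioral bug.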

Build Offline Reproduction Environments
A major challenge in collecting bug benchmarks lies in ensuring that all bugs remain reproducible in the future [6]. Due to the prevalence of third-party code in complex applications, typically retrieved at build time, reproducibility often depends on outside actors (e.g., package maintainers and repositories) which may unpredictably become unavailable. Another issue lies in flaky tests, which introduce non-determinism in experimental reproductions.
GitBug-Actions builds offline reproduction environments to safeguard reproducibility as a key component of the collected benchmarks. This is achieved in two steps. First, GitBug-Actions stores the Docker container's state after locally executing the test workflow. The stored container state includes all installed software packages required to run the same test workflow again without access to the internet. Being able to reproduce test execution offline is essential for reproducibility, since required software packages might become unavailable in the future. Second, GitBug-Actions executes each collected bug-fix commit $N$ times. Only those bug-fixes which yield the exact same test results across all $N$ runs are kept in the benchmark, effectively removing cases with flaky tests that introduce non-determinism in the benchmark.
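The second step, filtering out flaky bug-fixes by repeated execution, can be sketched as below. The function name and the representation of a test run as a callable returning a dictionary of outcomes are assumptions for illustration; in the real pipeline each run would re-execute the stored container offline.

```python
import itertools


def is_deterministic(run_tests, runs):
    """Return True if up to `runs` executions all yield identical test results.

    Execution stops early at the first run that diverges from the first one.
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")
    first = run_tests()
    return all(run_tests() == first for _ in range(runs - 1))


def stable_suite():
    # A failing test is fine, as long as it fails on every run.
    return {"TestAdd": "pass", "TestDiv": "fail"}


_outcomes = itertools.cycle(["pass", "pass", "fail"])


def flaky_suite():
    # Simulates a test whose outcome changes between runs.
    return {"TestAdd": next(_outcomes)}


stable_kept = is_deterministic(stable_suite, runs=5)
flaky_kept = is_deterministic(flaky_suite, runs=5)
```

Note that the check compares complete result sets rather than only the overall pass or fail status, so a bug-fix whose build always fails but with a varying set of failing tests would also be discarded.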

GITBUG-ACTIONS ON THE GO
We instantiate GitBug-Actions to create a benchmark containing bug-fix commits from January 2023 written in the Go programming language. Thanks to GitBug-Actions's workflow abstraction, the logic required to handle Go workflows and extract test execution results is implemented in a single file. The Go benchmark is available on Zenodo: https://zenodo.org/records/10034612.
We run the entire GitBug-Actions pipeline. In the Collect Repositories step, the following criteria are used to select repositories from GitHub: (1) Programming Language: Main programming language must be Go per the GitHub metadata. (2) Popularity: At least 50 stars. The star count serves as an indicator of popularity and community engagement, ensuring that selected repositories have a certain level of relevance and activity. (3) Size: Repository is less than 200MB. This criterion keeps storage requirements manageable. Collect Bug-Fix Commits is configured to only consider commits from January 2023. GitBug-Actions is deployed with 32 parallel workers on a machine with an AMD EPYC 7742 64-Core Processor and 512GB of memory. The pipeline run starts on October 10th, 2023 and takes approximately 56 hours to complete.
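The three selection criteria can be expressed as a GitHub repository search query. The sketch below builds such a query string and applies the same criteria to repository metadata; it is an assumption about how the criteria could be encoded, not the query actually used in the experiment. It relies on GitHub's `language`, `stars`, and `size` search qualifiers, with `size` expressed in kilobytes, and on the `language`, `stargazers_count`, and `size` fields of the REST API repository object. Treating 1 MB as 1000 KB is also an assumption.

```python
def build_search_query(language, min_stars, max_size_mb):
    """Build a GitHub repository search query from the selection criteria."""
    max_size_kb = max_size_mb * 1000  # assumption: 1 MB = 1000 KB
    return f"language:{language} stars:>={min_stars} size:<{max_size_kb}"


def meets_criteria(repo, language, min_stars, max_size_mb):
    """Apply the same criteria to a repository metadata dictionary."""
    repo_language = (repo.get("language") or "").lower()  # may be null in the API
    return (
        repo_language == language.lower()
        and repo["stargazers_count"] >= min_stars
        and repo["size"] < max_size_mb * 1000
    )


query = build_search_query("go", min_stars=50, max_size_mb=200)

# Hypothetical metadata records, shaped like GitHub REST API repository objects.
small_popular = {"language": "Go", "stargazers_count": 120, "size": 15000}
too_large = {"language": "Go", "stargazers_count": 900, "size": 350000}
selected = [r for r in (small_popular, too_large) if meets_criteria(r, "go", 50, 200)]
```

Because the GitHub search API caps the number of results returned per query, a crawler collecting tens of thousands of repositories would need to partition the search space, for instance by creation date or star range; that partitioning is omitted here.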
Collect Repositories. GitBug-Actions finds 21,891 GitHub repositories in Go that meet the aforementioned criteria. For these, if there exist GitHub Actions workflows that execute tests on the latest commit of the default branch, GitBug-Actions attempts to locally execute the Go test suite. GitBug-Actions locally executes the test workflow of 3,465 repositories which have a single test GitHub Action. In total, GitBug-Actions is able to execute and obtain test execution results for 1,626/3,465 (46.9%) repositories.
Collect Bug-Fix Commits. Across the 1,626 repositories, GitBug-Actions finds 9,567 commits from January 2023. Recall that, for each commit, GitBug-Actions locally executes the associated test workflow. A pair of subsequent commits is considered a bug-fix if it matches one of the bug-fix patterns explained in Section 2.5. In total, Collect Bug-Fix Commits identifies 39 bug-fix commits from 31 different repositories per the considered patterns.
Build Offline Reproduction Environments. This step builds and runs an offline reproducible environment for each of the 39 identified bug-fix commits. Recall that, for a bug-fix to be included in the benchmark, it must be fully-reproducible, meaning it must run in offline isolation and not have flaky tests. This rules out 17 commits. We build the fully-reproducible images that contain all dependencies. In total, Build Offline Reproduction Environments finds 22 fully-reproducible bug-fixes from 14 different repositories. Researchers could run the system over a timespan of several years to collect the number of bug-fixes they need. This proof-of-concept benchmark demonstrates that GitBug-Actions can be configured for any programming language with little effort.
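The counts reported for the three stages can be cross-checked with a few lines of arithmetic. The sketch below recomputes the 46.9% execution rate and the final count of 22 from the figures above; the other percentages it derives are not reported in the text and are shown only as a reading aid.

```python
def retention(kept, total):
    """Percentage of `total` that is `kept`, rounded to one decimal place."""
    return round(100 * kept / total, 1)


# Counts taken from the Collect Repositories stage.
matching_repositories = 21891
single_test_workflow = 3465
locally_executable = 1626

# Counts taken from the last two stages.
identified_bug_fixes = 39
ruled_out = 17

workflow_rate = retention(single_test_workflow, matching_repositories)
executable_rate = retention(locally_executable, single_test_workflow)
reproducible_bug_fixes = identified_bug_fixes - ruled_out
reproducible_rate = retention(reproducible_bug_fixes, identified_bug_fixes)

print(f"repositories with a single test workflow: {workflow_rate}%")
print(f"of those, executed with test results:     {executable_rate}%")
print(f"fully-reproducible bug-fixes:             {reproducible_bug_fixes}")
print(f"share of identified bug-fixes retained:   {reproducible_rate}%")
```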

RELATED WORK
Several bug-fix benchmarks suitable for different purposes and with a diverse range of properties have been proposed in the literature. QuixBugs [10], Codeflaws [11], Code4Bench [12], RunBugRun [13], and EvalGPTFix [5] are benchmarks constructed from coding competition websites. Such problems, while real, are not representative of those that developers face in complex software systems. Others, like HumanEval-Java [14], rely on artificially injected bugs which are, by nature, not real.
Benchmarks such as FixJS [15] and Minecraft [16] do not include test suites for each bug-fix instance, rendering them unsuitable for studies reliant on execution.
Defects4J [3], Bugs.jar [17], Bears [8] and BugSwarm [7] contain executable bugs from real-world repositories. However, Bugs.jar and Bears face serious execution and reproducibility challenges due to missing dependencies and incomplete configuration environments [6]. Also, Durieux and Abreu state that 96.4% of the BugSwarm benchmark is not suitable for APR and FL, for reasons that include duplicate samples, lack of failing tests, and changes to non-source code files [18].
Finally, Defects4J is a milestone of benchmark research, containing bugs that are reproducible and adequate for APR and FL. Yet, Defects4J mostly contains old bugs: at the time of writing, it was last updated in 2020. Indeed, Silva et al. [19] find that the majority of Defects4J bugs require Java 6 or earlier bytecode while Java 6 is no longer supported by Oracle as of December 2018. Moreover, given that the cutoff date of most datasets used for LLMs is beyond 2020, there is a significant risk that Defects4J examples are included in them, thus threatening the validity of evaluations of LLM-based techniques on Defects4J.
To the best of our knowledge, GitBug-Actions is the first tool for collecting bug-fix benchmarks that are sourced from GitHub Actions, fully-reproducible, and appropriate for building new benchmarks containing recent bug-fixes.
ActionsRemaker [20] is a tool for reproducing GitHub Actions builds. We favor Act over ActionsRemaker because Act is a popular and mature open-source tool, with significantly higher reliability than academic prototypes. To the best of our knowledge, we are the first to leverage GitHub Actions to collect bug-fix commits.

CONCLUSION
We present GitBug-Actions, a tool for building bug-fix benchmarks based on GitHub Actions. GitBug-Actions builds fully-reproducible bug-fix benchmarks by using GitHub Actions to both collect recent examples and create fully-reproducible environments for the collected bug-fixes. To the best of our knowledge, we are the first to leverage GitHub Actions to build bug-fix benchmarks.

Figure 1: Overview of GitBug-Actions, a novel methodology to collect bug-fixes based on GitHub Actions.