Can GPT-4 Replicate Empirical Software Engineering Research?

Empirical software engineering research on production systems has brought forth a better understanding of the software engineering process for practitioners and researchers alike. However, only a small subset of production systems is studied, limiting the impact of this research. While software engineering practitioners could benefit from replicating research on their own data, this poses its own set of challenges, since performing replications requires a deep understanding of research methodologies and subtle nuances in software engineering data. Given that large language models (LLMs), such as GPT-4, show promise in tackling both software engineering- and science-related tasks, these models could help replicate and thus democratize empirical software engineering research. In this paper, we examine GPT-4's abilities to perform replications of empirical software engineering research on new data. We study its ability to surface assumptions made in empirical software engineering research methodologies, as well as its ability to plan and generate code for analysis pipelines on seven empirical software engineering papers. We perform a user study with 14 participants with software engineering research expertise, who evaluate GPT-4-generated assumptions and analysis plans (i.e., a list of module specifications) from the papers. We find that GPT-4 is able to surface correct assumptions, but struggles to generate ones that apply common knowledge about software engineering data. In a manual analysis of the generated code, we find that the GPT-4-generated code contains correct high-level logic, given a subset of the methodology. However, the code contains many small implementation-level errors, reflecting a lack of software engineering knowledge. Our findings have implications for leveraging LLMs for software engineering research as well as for practitioner data scientists in software teams.


INTRODUCTION
Empirical software engineering research on production systems has introduced a better understanding of the software engineering process for practitioners and researchers. Empirical studies on Microsoft code bases have studied how bugs get fixed [25] and the effects of distributed teams on code quality [9], while studies from Bloomberg [39], Meta [16], and Google [55] have revealed insights on implementing automated program repair systems and static analysis tools in practice.
Yet, only a small subset of production systems is rigorously studied. This limits the impact of empirical software engineering research, as software engineering practitioners are rarely able to reap the benefits of running analyses on their own data, since they often lack the expertise and time to replicate empirical research. Yet, software engineers report an interest in obtaining answers to questions related to software development, spanning topics such as bug measurements, development practices, and testing practices [8,32]. For software engineers to obtain such insights, data scientists have played an increasingly important role in software teams [37,38] by running analyses to help teams understand their productivity and code quality. While data scientists do not replicate empirical software engineering papers, these papers contain methodological knowledge for generating insights on software development. Thus, replicating empirical software engineering research could be a potential avenue for software teams to gain insights from their own software artifacts and data.
However, performing replications poses its own set of challenges, as it requires a deep understanding of research methodologies and subtle nuances in software engineering data [37]. While prior work has begun addressing this issue by creating domain-specific languages [34,35] or programming environments [15] to help automate statistical analyses, these approaches do not directly address study replication, especially for software engineering contexts. Given that large language models (LLMs), such as GPT-4, show promise in tackling software engineering- [31,71] and science-related [62] tasks, these models may help replicate software engineering research studies. Studying an LLM's ability to replicate empirical software engineering research has the potential to broaden the impact of this research and democratize data science expertise for teams that do not have the resources for a dedicated data scientist. This could allow developers to learn insights about their code bases and work habits, potentially helping to increase developer productivity.
In this paper, we examine GPT-4's abilities to perform replications on empirical software engineering research papers, specifically those involving quantitative analyses. This is because LLMs have shown promise in generating code [12], which could be used to help replicate analyses. According to the SIGSOFT Empirical Standards [2], replication is applying the same research methodologies from a given research paper on a different set of data.
These standards state an essential attribute for replications is identifying and reporting the context variables (i.e., assumptions) of the original study [2]. This is because if assumptions of an empirical study are not met, the validity of the results can be compromised [11,52]. For instance, a study could assume the number of pull requests is an accurate measure of developer productivity, but this may not apply to all repositories, such as one with a single contributor. Questionable assumptions are a major barrier to adopting research in practice [47]. Therefore, our first research question is:

RQ1: Can GPT-4 identify assumptions from research methodology?

Further, replications of quantitative analyses require creating an analysis pipeline with code that can be run to replicate the research methodology. Thus, our second research question is:

RQ2: Can GPT-4 generate an analysis pipeline to replicate research methodology?
To answer these questions, we evaluated GPT-4's ability to generate assumptions, analysis plans (i.e., a list of module specifications), and code on seven empirical software engineering papers.

We ran a user study with 14 software engineering researchers, who evaluated GPT-4-generated assumptions and analysis plans. We then performed a manual analysis of the GPT-4-generated code. We find that GPT-4 surfaces mostly correct assumptions, but struggles to generate ones that apply common but implicit knowledge about software engineering data (e.g., pull requests showing the lines of code changed). We also observe that GPT-4 can generate analysis plans that correctly outline the modules for replication, but is limited by the quality and detail of the methodology as written in the original research paper. Finally, we find that the GPT-4-generated replication code contains the correct high-level logic, but has many small implementation-level errors (e.g., using incorrect tables in a database). Our findings have implications for leveraging LLMs for software engineering research, such as teaching GPT-4 software engineering domain knowledge.

RELATED WORK
We discuss prior research on LLMs for software engineering (Section 2.1) and science (Section 2.2). Since the field develops quickly, our discussion offers a snapshot of the field as of September 2023.

Language Models for Software Engineering
Language models have been applied to many software engineering tasks. The most prominent task is code generation; models like Codex [12] have strong performance in providing developers with code suggestions [68]. With the emergence of publicly accessible LLMs such as GPT-4 [50], LLMs have been applied to a wide variety of software engineering tasks. Zheng et al. [71] surveyed 123 papers and identified seven software engineering tasks LLMs have been applied to: code generation, code summarization (i.e., generating comments for code), code translation (i.e., converting code from one programming language to another), vulnerability detection (i.e., identifying and fixing defects in programs), code evaluation and testing, code management (i.e., code maintenance activities such as version control), and Q&A interaction (i.e., using Q&A platforms such as StackOverflow). In a literature review of 229 research papers on language models for software engineering, Hou et al. [31] found that LLMs were used on a variety of software engineering datasets, including data on source code, bugs, patches, code changes, test suites, Stack Overflow, API documentation, code comments, and project issues. The papers spanned themes such as software development (e.g., API recommendation [65]), software maintenance (e.g., merge conflict repair [70]), software quality assurance (e.g., flaky test prediction [18]), requirements engineering (e.g., requirements classification [30]), and software design (e.g., software specification synthesis [48]). OpenAI recently released Code Interpreter [1] for ChatGPT to write and execute Python code, which could be used to generate analyses. Our work extends our understanding of code-generating tools like Code Interpreter by observing how LLMs generate code for analysis plans.
Based on this literature, LLMs can handle a wide variety of software engineering tasks and data. Yet, these approaches require fine-tuned models or specialized techniques. Our study extends this literature by examining whether pre-trained LLMs like GPT-4 reflect this software engineering domain expertise off-the-shelf, without additional training.

Language Models for Science
Other work has studied using language models for science. Language models such as MultiVerS [62] can validate claims against scientific literature in the domains of COVID-19, public health, and climate change. However, Auer et al. [4] found that ChatGPT struggled to answer challenging questions from research papers across topics like computer science, engineering, chemistry, geology, immunology, economics, and urban studies.
Similar studies have been performed in computer science. In natural language processing (NLP), Gao et al. [23] investigated whether LLMs could generate a survey of knowledge for NLP concepts, such as A* search. In an evaluation with NLP experts, the authors found GPT-4 generated reasonable explanations of these concepts, but sometimes generated factually incorrect knowledge. Researchers have also applied LLMs to research in human-computer interaction. Wu et al. [66] replicated seminal crowdsourcing papers using LLMs, while other work found that GPT-3 could generate synthetic data for both open- [28] and closed-ended [58] questions in interviews and surveys. Lastly, Xiao et al. [67] found that GPT-3 could perform deductive qualitative coding on datasets. Our work builds upon this literature by studying whether LLMs can analyze methodology and write code pipelines to repeat analyses, rather than returning factual knowledge or generating research data. Compared to prior work, our study specifically focuses on quantitative empirical research methods rather than qualitative ones. Finally, we extend our understanding of LLMs' scientific knowledge by studying their performance in software engineering research.

Table 1. A summary of the papers selected for GPT-4 to generate assumptions, analysis plans, and code. We report each paper's venue, number of citations in the ACM Digital Library, and a brief description of the paper's analysis. We also report the number of assumptions and modules generated by GPT-4.

METHODOLOGY
To answer the research questions, we selected seven empirical software engineering papers (Section 3.1). We used LLMs to automate the tasks that data scientists would perform in today's practice to replicate an empirical study in their own context: analyze the assumptions of the methodology, plan the analysis pipeline, and implement the code. We then prompted GPT-4 to generate assumptions, analysis plans, and code for each paper (Section 3.2). To evaluate the assumptions and analysis plans generated by the model, we performed a user study (Section 3.3) with 14 participants with software engineering research expertise. We then performed a manual evaluation of the generated code (Section 3.4). Finally, we performed quantitative and qualitative analysis on the collected data (Section 3.5). Materials used in this study, such as the protocols, GPT-4-generated data, and exact prompts, are available in the supplemental materials [43].

Paper Selection
We describe the process to select empirical software engineering papers for GPT-4-generated assumptions, analysis plans, and code. This process yielded seven research papers (see Table 1).
First, we developed selection criteria for the research papers. For consistency, each of the seven papers met the selection criteria. The selection criteria were:
• Has a quantitative empirical analysis on software engineering data. We focused on analyses that could be replicated through code rather than through manual means and that could be derived from Git and GitHub data via a database.
• Is after 2010. GitHub was created in 2010. Since the generated code relies on a Git and GitHub database, we identified research papers that utilized a similar type of data.
• Has an approximately 1-page methodology section. We ensured the methodology was short enough for two papers to be read through and evaluated in a 1-hour long user study.
To obtain a diverse set of empirical software engineering papers to evaluate with GPT-4, the first author applied the selection criteria to three different sets of papers. As much as possible, we used the ACM Digital Library to retrieve papers, as its search function allowed keyword searches across specific metadata, making the selection process repeatable. Further, the ACM Digital Library provided access to work from premier empirical software engineering venues (e.g., IEEE/ACM ICSE, ACM ESEC/FSE, MSR). The three sets of papers were: (1) Empirical papers in software engineering venues. Software engineering conferences contain numerous empirical studies on software engineering data. Thus, we searched the ACM Digital Library for papers that had "empirical" in the title, had the term "software engineering" in the publication venue, and whose content type was "Research Article". We then sorted by citation count and selected the first 20 results to limit the search results.

Fig. 1. To answer RQ1, we used Prompt 1 to generate assumptions for each paper and evaluated them in a user study. To answer RQ2, we used Prompt 2 to create an analysis plan and evaluated it in a user study. We also used Prompt 3 to create the code modules of the analysis plan and evaluated it in a code review workshop. The prompts in the figure are only summaries of the actual ones; for the complete prompts, see the supplemental materials [43].

Prompting GPT-4
We present an overview of the prompting strategy applied to GPT-4 in Figure 1. To answer the research questions, we prompted GPT-4 to generate assumptions (RQ1). We also prompted GPT-4 to generate analysis plans (i.e., a list of module specifications) as well as code to implement the analysis plan (RQ2). We separated the analysis pipeline into two steps, analysis plans and code, to distinguish the higher-level abstraction of creating modules from the lower-level details of writing implementations. This is because code implementation is also influenced by higher-level software design, such as modules [42]; thus, studying both abstractions and implementations could reveal a more holistic understanding of GPT-4's ability to generate analysis pipelines. We generated the outputs with GPT-4 via the OpenAI Python API in September 2023 with the default model parameters, except for temperature, which was set to 0 to have deterministic outputs.

Prompt Design.
We designed the prompt to provide the task structure in the beginning, relevant information in the middle, and instructions at the end. This structure leveraged LLMs' primacy and recency biases in input contexts [46] to emphasize the task structure and instructions.
The prompts started with one to two example input-output pairs for few-shot learning of the task and output format, following Brown et al. [10]. Next, we provided text-based instructions for GPT-4 (see Figure 1). The instructions included an explanation of GPT-4's role as a software engineering researcher; skills it has, such as in software engineering replication or in MySQL; and additional information about the given task, such as a description of the inputs. Finally, the prompt included a request to extract, gather, or derive specific data, following the format provided in the example input-output pairs. The exact prompts and the generated assumptions, analysis plans, and code are available in the supplemental materials [43].
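To make this layout concrete, the assembly of such a prompt might be sketched as follows. This is our own minimal illustration, not the paper's actual prompt-construction code; `build_prompt` and all placeholder strings are hypothetical, and the real prompts are in the supplemental materials [43].

```python
def build_prompt(examples, instructions, request):
    """Assemble a prompt: few-shot input-output examples first,
    task instructions next, and the specific request last."""
    parts = []
    for example_input, example_output in examples:
        parts.append(f"Input:\n{example_input}\nOutput:\n{example_output}")
    parts.append(instructions)  # role, skills, task information
    parts.append(request)       # the final extraction request
    return "\n\n".join(parts)

prompt = build_prompt(
    examples=[("<methodology text>", "<assumption name and description>")],
    instructions=("You are a software engineering researcher skilled in "
                  "replicating empirical studies."),
    request="Extract the assumptions underlying the methodology above.",
)
```

Under this layout, the request sits at the end of the context, where the recency bias noted above gives it the most weight; the assembled prompt would then be sent to the model with temperature set to 0 for deterministic output.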

Name: Confounding Factors
Description: The assumption that the number of lines added/deleted, whether the file is a test, and the number of commenters are the main confounding factors influencing the number of comments a file receives. There may be other unconsidered factors, such as the complexity of the changes, the expertise of the reviewers, or the clarity of the code.

Fig. 2. Example GPT-4-generated assumption, given the methodology from Fregnan et al. [22]. Assumptions contain a name and a description.
Name: Data Extraction
Description: This module is responsible for extracting the required data from the dataset provided by Gousios. The data includes commit comments from 90 of the top-starred software projects for the top programming languages on GitHub.

Methodology text: We extracted our analysis data from the dataset provided by Gousios [2]; this dataset includes 90 of the top-starred software projects for the top programming languages on GitHub. We analyzed a total of 60,425 commit comments.

Fig. 3. Example GPT-4-generated module, given the methodology from Guzman et al. [27]. Modules contain a name, description, inputs, a description of outputs, and corresponding methodology text.

Data Output Format. Below, we further elaborate on each of the generation types. We report the number of each data type generated by GPT-4 using the multiplication symbol (×).
Assumptions (111×). The GPT-4-generated assumptions contain two pieces of information: a name and a description (see Figure 2). We prompted GPT-4 to generate assumptions that underlie the given research methodology. Next, we prompted GPT-4 to generate assumptions about applying the methodology to a different dataset.
Analysis plan (7×). Given a research paper methodology, GPT-4 generates an analysis plan that is represented as a list of code modules. Each code module has a title, input (which may include one or more outputs from other modules in the analysis plan), output, description, and corresponding methodology text (see Figure 3). We prompted GPT-4 to divide the methodology text into a set of code modules and generate specifications following the above metadata.
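The shape of such a specification might be rendered as the following sketch. The class, field names, and example values are our own hypothetical illustration of the metadata listed above, not GPT-4's literal output format.

```python
from dataclasses import dataclass

@dataclass
class ModuleSpec:
    """One entry in an analysis plan: a code module specification."""
    title: str
    inputs: list   # may include outputs of other modules in the plan
    output: str
    description: str
    methodology_text: str

# A two-module plan where the second module consumes the first's output:
plan = [
    ModuleSpec(
        title="Data Extraction",
        inputs=["Git/GitHub database"],
        output="commit_comments.json",
        description="Extract commit comments from the dataset.",
        methodology_text="We extracted our analysis data from ...",
    ),
    ModuleSpec(
        title="Sentiment Analysis",
        inputs=["commit_comments.json"],
        output="sentiment_scores.json",
        description="Score the sentiment of each commit comment.",
        methodology_text="We analyzed a total of 60,425 commit comments ...",
    ),
]
```

Representing inputs as references to other modules' outputs is what lets the plan describe a pipeline rather than a bag of independent scripts.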
Code (23×). Given a module specification, GPT-4 generates a piece of Python code that implements the specification (see Figure 4). We instructed GPT-4 to generate code that outputs data as a JSON object into a file, so that it may be reused by other modules reading in the file. However, the code may also query an existing database filled with Git and GitHub data with a predefined schema; the schema is available in the supplemental materials [43]. To assist with code generation of downstream modules, we also instructed GPT-4 to return, alongside the generated code, an example JSON object and a natural language description of the JSON object. The example object and description are provided in the prompt for downstream modules in case they depend on it. This way, modules that rely on earlier modules in the pipeline can update the input in their specifications to match this data representation.
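As a rough sketch of this file-based hand-off (the function names, JSON shape, and field names are our own illustration, not GPT-4's actual output), an upstream module writes its results as a JSON object that a downstream module then reads:

```python
import json
import os
import tempfile

def extraction_module(records, out_path):
    """Upstream module: write results as a JSON object into a file."""
    with open(out_path, "w") as f:
        json.dump({"commit_comments": records}, f)
    # An example object like this, plus a natural-language description,
    # would accompany the generated code in downstream prompts.
    return {"commit_comments": [{"project": "example/repo", "comment": "LGTM"}]}

def analysis_module(in_path):
    """Downstream module: read the upstream module's JSON file."""
    with open(in_path) as f:
        data = json.load(f)
    return len(data["commit_comments"])

# Minimal usage demo:
path = os.path.join(tempfile.mkdtemp(), "extraction_output.json")
extraction_module([{"project": "a/b", "comment": "nit: rename"}], path)
n_comments = analysis_module(path)
```

The example object serves the same role as a type signature here: it lets a downstream module be generated against the upstream data shape without executing the upstream code.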

User Study
We performed a user study with software engineering experts to validate the GPT-4-generated assumptions (RQ1) and analysis plans (RQ2) for each paper. The user study was reviewed and approved by our institution's Institutional Review Board. The survey and interview instruments used in the user study protocol (Section 3.3.2) are included in the supplemental material [43].

Participants.
Since our user study targeted a niche population, we used snowball sampling to recruit participants with software engineering research experience at Microsoft (see Table 2). Our inclusion criteria were people who had obtained a Ph.D. studying software engineering-related topics and people who were data scientists for software engineering teams.
Recruitment. We compiled a list of potential participants in the coauthors' network who met the inclusion criteria. We then sent them invitations to participate in the study. Potential participants suggested other individuals who met the inclusion criteria, who were also sent study invitations. In total, 24 invitations were sent, with 14 participants participating in the user study. We note that having 14 participants allowed each paper to have three to four participants review the GPT-4-generated assumptions and analysis plans, reflecting common practices in academic peer review. Additionally, while performing open coding on the interview data (Section 3.5), no new codes were added after 8 interviews, indicating that 14 participants were sufficient to achieve saturation.
Demographics. Participants (7 women, 7 men) were mostly located in the United States and Canada (n = 12), with a few participants located in India (n = 2). Participants were experienced in software engineering-related research, with between 2 to 14 years of software engineering research experience (M = 7.1 years) and 5 to 20 years of programming experience (M = 8.4 years). Participants reported publishing between 0 to 50 publications in top-tier software engineering venues (M = 8.2 publications) and serving as a reviewer 0 to 20 times at these venues (M = 8.2 times).
Participants also reported being familiar with LLMs, with 86% of participants using these models at least on a weekly basis. Participants also reported an ability to analyze and evaluate AI applications based on a validated AI literacy instrument [63]. All participants reported being able to choose a proper solution when presented with multiple solutions from AI agents. Additionally, 93% of participants reported being able to evaluate the capabilities and limitations of AI applications after using them. Finally, 71% reported being able to select an appropriate AI for a particular task.
Table 2. An overview of the participants in the user study. We report each participant's number of publications in top-tier software engineering venues, number of times served as a reviewer in software engineering venues, and years of experience doing software engineering research as well as programming. We also report the frequency of large language model usage, location, gender, and job position.

Survey. We designed a 10-minute Microsoft Forms survey to collect participants' demographics, programming background, research background, and familiarity with AI. Example topics included: years of experience in software engineering research and programming, the number of software engineering venues participants had reviewed for, and how often they used LLMs.
We collected information about participants' gender following best practices from the HCI Guidelines for Gender Equity and Inclusivity [56]. To collect information on participants' AI literacy, we used the validated instrument from Wang et al. [63]. Since participants would evaluate GPT-4 outputs in the interview, we used questions related to the instrument's evaluation construct to understand the degree to which they could evaluate the strengths and weaknesses of AI models.
Interview. The first author conducted semi-structured interviews with participants. Participants assessed the GPT-4-generated assumptions and analysis plans for two research papers. The interview was 60 minutes long: 10 minutes for consent and instructions and 25 minutes for each paper. For each paper, the participant spent 5 minutes reading the paper abstract and methodology, 10 minutes on assumptions, and 10 minutes on analysis plans.
To reduce participant fatigue, the interview did not include an evaluation of the generated code; instead, the authors performed a manual analysis (see Section 3.4). Interviews were recorded and transcribed. Recordings were deleted after transcription. Also, the papers were paired by length so the interview would stay under the allotted time.
Participants were provided with a document containing instructions and all relevant materials for them to complete the study. In the document, participants were instructed to act as a software engineering research consultant applying the assigned research papers' methodology to company data. Their task was to evaluate outputs from an LLM tool that would assist them in the replication. Next, the instructions included grading rubrics to standardize the evaluation of the generated assumptions and analysis plans. Finally, the instructions included space for participants to grade the assumptions and analysis plans according to the aforementioned rubrics.

Table 3. An overview of the grading rubrics to score the GPT-4-generated assumptions, analysis plans, and code modules to replicate empirical software engineering papers. Correctness is graded on a scale of 1 to 3. Other constructs are graded on a scale of 1 to 5. The full rubrics are included in the supplemental materials [43].

Construct: Assumption correctness
1 — Does not make the following assumption.
2 — Partially makes the following assumption.
3 — Does make the following assumption.

Construct: Analysis plan correctness
3 — Can be used as-is to repeat the study on a new set of data.

Construct: Module correctness
1 — Describes a set of actions that should not be performed for the analysis pipeline.
2 — Describes a set of actions that can partially be performed for the analysis pipeline.
3 — Describes a set of actions that should be performed for the analysis pipeline.

Construct: Module descriptiveness
1 — Is unintelligible or too vague for someone to run the analysis on their own.
5 — Clearly describes the exact steps to follow for someone to run the analysis on their own.
During the interview, participants were presented with the instructions document and were debriefed on the task instructions. To evaluate the papers, participants read the abstract and methodology of the paper. After being provided the rubrics for the assumptions, they graded the assumptions by recording their scores in the instructions document and could refer back to the methodology to reduce memory biases. Next, the participant discussed their impressions of the assumptions and how to improve the output. The same process was repeated for the analysis plan. After the participant evaluated the first paper, the protocol was repeated for the second paper.
To keep the interview within the allotted time, data collection for a paper ended when the time limit was exceeded. Collection was resumed at the end if time allowed. This ensured two papers were evaluated while minimizing participant fatigue. To ensure all assumptions were evaluated in the user study, we shuffled the order of the assumptions for each participant.
Data. Participants rated assumptions and analysis plans on a 3-point Likert scale for correctness (i.e., not correct, partially correct, correct), with all scale points having defined criteria. For all other constructs, we used a 5-point scale, with only the extremes having defined criteria. An overview of the scoring criteria is in Table 3.
Assumptions were evaluated based on correctness, relevance (i.e., whether the assumption was necessary to consider in the replication), and insightfulness (i.e., whether the assumption reflected a deep understanding of software engineering research methodology or data).
Meanwhile, analysis plans were graded both in their entirety as well as at the individual module level. Analysis plans in their entirety were evaluated based on correctness. At the module level, the plans were evaluated based on correctness and descriptiveness (i.e., whether the instructions provided were descriptive enough for another person to replicate the paper).
Piloting. Following best practices in human subjects experiments in software engineering [41], we piloted both the survey and interview with two software engineering researchers. This validated the overall interview structure, time frame, and rubric design, as well as helped clarify the wording of interview questions. Between each pilot, the survey and interview were updated based on the participants' feedback. Pilot participants' data were not included in the final analysis.

Manual Code Review
Three authors also performed a manual review of the 23 GPT-4-generated code snippets (RQ2). The three authors convened in a series of code review workshops to review each of the GPT-4-generated code modules for all the papers. For each code module, the authors reviewed the module specification corresponding to the generated code. The authors then read through the generated code. Together, the authors identified any instances of incorrect and correct code logic. Each incorrect and correct logic instance was noted down only upon consensus following discussion. This process generated a dataset of examples of correct and incorrect logic from the code generated by GPT-4. In total, the analysis with the authors took 7 hours (i.e., 21 person-hours).
To understand whether generated code was executable, the first author manually ran each program. For each file, any missing dependencies were installed with pip. The code was run with the Python interpreter, and the result of the execution (i.e., pass, fail) was recorded.
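This check can be sketched as follows. The harness below is a minimal illustration under our own assumptions, not the study's actual procedure, and the stand-in "generated" files are hypothetical.

```python
import os
import subprocess
import sys
import tempfile

def record_execution(path):
    """Run a generated file with the Python interpreter; record pass/fail."""
    result = subprocess.run([sys.executable, path], capture_output=True)
    return "pass" if result.returncode == 0 else "fail"

# Minimal usage demo with two stand-in "generated" files:
workdir = tempfile.mkdtemp()
ok_file = os.path.join(workdir, "ok.py")
bad_file = os.path.join(workdir, "bad.py")
with open(ok_file, "w") as f:
    f.write("print('analysis complete')\n")
with open(bad_file, "w") as f:
    f.write("import a_module_that_does_not_exist\n")
results = [record_execution(ok_file), record_execution(bad_file)]
```

Note this records only whether the file runs without raising; it says nothing about whether the analysis logic is correct, which is why the manual code review above is still needed.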

Analysis
To analyze the data, we used both quantitative and qualitative analysis techniques.
Quantitative analysis. For the quantitative analysis of survey data, we followed the best practices outlined by Kitchenham and Pfleeger [40] by reporting how often participants agreed or strongly agreed with statements, as well as rated outputs as partially correct or fully correct, relevant or very relevant, insightful or very insightful, and descriptive or very descriptive. For the quantitative analysis of the participant-graded data, we calculated descriptive statistics across all the ratings from all seven papers for the assumptions, analysis plans, and module grading constructs. Finally, we report the percentage of files that executed as-is without any errors in Python.
Qualitative analysis. While performing qualitative analysis, we followed best practices from Hammer and Berland [29] by interpreting codes as tabulated claims about the data to be investigated in future research. To qualitatively analyze the interview data, the first author performed open coding. Only the first author performed open coding since she also conducted the interviews, and therefore had the best understanding of participants' statements within the study team. Further, the qualitative data was focused on obtaining participants' impressions of the GPT-4 output. Since the data was narrowly constrained, it was feasible for one researcher to perform open coding.
To open code the interview data, the first author first read through the interview transcripts. She identified statements that described strengths and limitations of the GPT-4-generated assumptions and analysis plans and inductively generated codes. These statements were assigned one or more codes, where each code had a name and a description. After open coding, the first author performed axial coding on the resulting set of codes to extract broader themes about GPT-4's performance on replicating empirical software engineering papers. The same process was followed for the dataset of examples of incorrect and correct logic from the code generated by GPT-4.
After the first author generated codes from the interview data, the last author independently validated the codes. A random subset of three interview transcripts was selected. The transcripts' original codes were removed, but the highlighted spans of text corresponding to the codes remained intact. The last author then applied the generated codes to each highlighted span of text. Based on this procedure, the inter-rater reliability was 80.6%. This inter-rater reliability score is comparable to other studies in empirical software engineering research [21,33,45].
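The paper does not spell out the agreement formula; one simple computation consistent with this procedure is percent agreement over the coded spans, sketched below. The ratings shown are illustrative placeholders, not the study's data.

```python
def percent_agreement(codes_a, codes_b):
    """Share of highlighted spans where both raters applied the same code."""
    assert len(codes_a) == len(codes_b), "raters must code the same spans"
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return matches / len(codes_a)

# Illustrative codes applied to five spans by two raters:
first_author = ["strength", "limitation", "strength", "strength", "limitation"]
last_author = ["strength", "limitation", "limitation", "strength", "limitation"]
agreement = percent_agreement(first_author, last_author)  # 4/5 = 0.8
```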

RESULTS
In this section, we report our results for the two research questions. Overall, we find that GPT-4 has some understanding of software engineering research methodology and data. However, we also observe that GPT-4 exhibits many limitations in its knowledge, such as lacking basic knowledge of software engineering data. We elaborate further on our findings below.

Can GPT-4 identify assumptions from research methodology? (RQ1)
We report our findings on participants' ratings and impressions of the assumptions generated by GPT-4 on the seven research papers.
Quantitative Results. Figure 5 shows the distributions of participants' ratings of the assumptions by correctness, relevance, and insightfulness across all the papers. Overall, we observe that participants rated the assumptions (367 scores) as high in correctness (2.5 out of 3), as 86% of assumptions were graded as partially correct or fully correct. The assumptions were also rated moderately high in terms of relevance (3.2 out of 5), as 46% of assumptions were rated as relevant to the replication. Finally, we observe that participants rated the assumptions low in terms of insightfulness (2.8 out of 5), as only 31% of assumptions were rated as insightful.
Qualitative Results. We elicited 12 codes related to GPT-4's capabilities and limitations for generating assumptions, from which three main themes emerged: reasons for positive ratings, reasons for negative ratings, and ways of improving outputs. We report the number of participants who mentioned each code with the multiplication symbol (×).

Correct (13×).
A majority of participants found the assumptions to be correct, as they "seemed to match the assumptions that were swimming in my head" (P14). Additionally, "[assumptions] that weren't [fully correct]...were usually partial correctness" (P2). As a result, some participants were surprised at GPT-4's capability: "if an LLM generated these, I'm very impressed" (P1). "Some of the insights were similar to [what] researchers reading the paper could generate." (P6)

Comprehensive (12×). Many participants found the assumptions to be comprehensive in the breadth of topics covered. When asked to generate any assumptions that GPT-4 had missed, 7 participants were not able to suggest any. Even while taking notes of assumptions while reading the paper, participants still described the set of assumptions as comprehensive: "Based on the notes I took, the assumptions are pretty comprehensive." (P4)

Insightful (8×). Less frequently, participants also mentioned the assumptions to be insightful, as GPT-4 "[brought] up some great points about the paper and the assumptions made" (P1). Others were impressed that GPT-4 generated assumptions "not specifically mentioned in the paper" (P12).

Relevant (6×). Some participants felt the generated assumptions were relevant to consider for replication, and thus could be helpful to "someone [who] is not a researcher" (P6). "And relevance-wise, it is able to extract all the relevant assumptions for this." (P5)

Lack of software engineering knowledge (9×). The most frequent reason for negative ratings was that GPT-4 lacked knowledge of software engineering data or technologies that participants felt were obvious, as it made assumptions that were "not representative of the real world" (P1). This often lowered the relevance and insightfulness scores. Participants noted that GPT-4 surfaced assumptions about the availability of certain software engineering artifacts, such as "logs" (P12) and "code commits" (P3). P13 noted, "What kind of pull request wouldn't show you...[the lines of code changed]?...That's like saying they're assuming a computer is an electronic with binary values." Participants also noted GPT-4's lack of knowledge of technologies: "The assumption that the GitHub project is compatible with...V8 Spider Monkey. I mean, these are super hugely popular JavaScript engines...so it's almost like some domain knowledge that's not explicit in the paper that the model is missing here." (P4)

Not correct (7×). Participants often noted that assumptions generated by GPT-4 were "incorrect" (P5), such as not being an assumption made in the paper. Some assumptions were not assumptions "because [the authors] tested their data and they were like, 'This is why we're applying [a methodology], because we observed this'" (P2). Others were "facts" (P8) rather than assumptions. Participants also noted instances where GPT-4 would "couch its answers...It always gives a second opinion of 'Well, I'm going to tell you something, but I'm also going to tell you it might not be that thing.'" (P4) "The ones that are off were where it wasn't an assumption. It was like something they actually said they were doing." (P13)

Missing assumptions (4×). Some participants identified valid assumptions from the paper methodology that GPT-4 did not generate. Identifying additional assumptions usually required additional effort from the participant, such as taking physical notes on assumptions while reading the paper methodology. Participants mentioned that the missed assumptions were usually minor: "One of them I don't remember seeing is the time window that commits in a 30-minute window are merged and treated as one commit...I don't think that's a super important one, but I did write it down." (P7)

Not relevant (3×). Participants also noted that some of the assumptions were not very relevant for the replication, limiting how "useful" (P1) the assumption was. In one case, a participant noted it was because GPT-4 did not reflect a proper understanding of what a mining challenge was: "I think [considering data completeness] is irrelevant. The data set is...the same for everyone, so for all intents and purposes, for the challenge, it is considered the complete data set." (P14) "I don't think all of them...are very relevant to the work itself." (P1)

Not insightful (2×).
A few participants noted that some assumptions generated by GPT-4 were not insightful, as they were "generic or...simple" (P6) or were expected, given the domain: "They were just par for the course of any...sentiment analysis that you do. So they were less useful and more obvious." (P7)

Reducing repetition (9×). Participants often noted that the set of assumptions was repetitive. Some participants identified word-for-word duplicated assumptions generated by GPT-4: "There's a couple weird duplicates, which I don't know if it's a data copying error or if the model actually repeated itself" (P4). Assumptions were also repetitive by topic, "that syntactically might be slightly different, but semantically if you look at them, it's...the same." (P12) "Security-related labeling came up in several different forms. This is something that clued me in...it...was [LLM-generated] because that's something that LLMs tend to do." (P3)

Explaining how to work around assumptions (8×). Participants noted that GPT-4 often generated text about when an assumption may not hold "that were counter to what was proposed in the paper" (P9), after presenting the main assumption. However, participants noted that no suggestions were provided on how to address the assumption if it did not hold: "It would be helpful if there was a suggestion to...handle that assumption." (P8)

Providing sources (6×). Participants also mentioned that they would like a link to the source of the assumption, such as by "[giving] line numbers" (P13). This could explain "why these assumptions are important or have been made" (P5) or build confidence in GPT-4's responses: "If there's an AI, it should have a confidence interval telling me, 'I'm 99% confident this assumption is correct...or is explicitly there.'" (P11)

○ Key findings: 86% of assumption ratings rated the assumptions as partially or fully correct, 46% as relevant, and 31% as insightful. Participants also noted that the assumptions were correct and comprehensive, but reflected a lack of software engineering knowledge.

Can GPT-4 generate an analysis pipeline to replicate research methodology? (RQ2)
We report our findings on participants' ratings and impressions of the analysis plans generated by GPT-4 on the seven research papers (Section 4.2.1). We also report the results from the manual evaluation performed by three authors on the GPT-4-generated code (Section 4.2.2).

4.2.1 Analysis Plan. We report our findings on participants' ratings and impressions of the analysis plans and modules generated by GPT-4 on the seven research papers.
Quantitative Results. Figure 6 shows the distributions of participants' ratings of the analysis plans by correctness, as well as of individual modules by correctness and descriptiveness, across all the papers. We observe that participants rated the analysis plans (27 scores) as moderate in terms of correctness (2.1 out of 3), as 89% of analysis plans were graded as partially or fully correct. Meanwhile, the individual modules (87 scores) performed comparably in terms of correctness (2.4 out of 3), with 72% of modules rated as partially or fully correct. However, the distribution for the analysis plans is more skewed towards partially correct than the distribution for the modules. Finally, we observe that participants rated the modules low in terms of descriptiveness (3.2 out of 5), as only 38% of modules were rated as descriptive.
Qualitative Results. We identified 6 codes related to GPT-4's capabilities and limitations for generating analysis plans. Similar to the assumptions, three themes emerged from these codes: reasons for positive ratings, reasons for negative ratings, and ways of improving outputs.
High-level structure (11×). Participants noted that the analysis plans were correct in their "high level steps" (P1), which was the main reason for positive correctness scores: "[The modules] broke up how I thought it would." (P3) Participants mentioned this could be a "useful [starting point] for somebody else if you want to replicate the analysis" (P10). "I thought it was able to chunk [the methodology] well into the different pieces, which were outlined in the paper." (P13)

Not descriptive (13×). Participants frequently noted that the description and methodology text corresponding to the analysis plan was not descriptive. Some participants felt the "detail was superfluous...which was a distraction to actually what's being done" (P9). Further, the descriptions were written largely for software engineering experts: "I would know what I would do in general...but I'm from the area" (P11). This is because smaller details were often not "super clear in the methodology" (P4) and were not elaborated upon: "Everything is really close to being almost exactly as I'd want it. But everything is missing, just like a little bit crisper description" (P4). "[The analysis plan] contains methodology text...but I would have wanted it to be a little bit more explicit." (P13)

Not correct (9×). Participants noted that some of the analysis plans contained details that were incorrect. One participant noted incorrect module ordering: "The last step makes no sense in this order." (P11) Another participant noted that some of the modules created were useless, as they did not consider the replication context: "I would completely throw away the data set loading [module] because I'm going to load a different data set [for replication]." (P3) Other participants noted that the inputs of the module specification could also be incorrect: "[The module] says for the refactoring revision identification, the input is two program versions. But that's not necessarily true because we need the entire change history and then we look at a pair of revisions [and] see if there was a refactoring revision." (P5)

Providing additional context (6×). Participants noted that one way to improve the analysis plan and module outputs was to provide context on related research artifacts, such as a "Git repository of the replication package" (P2) and "paper references" (P12), or on the replication setting, such as information about the "new dataset" (P3). "So [let's say] I'm not even using GitHub. I'm using SVN. Can I tell the model I got this project and guide me to the next step in a...way that adapts to the user's scenario?" (P6)

Improving modularization (3×). A few participants said that the modularization could also be improved. In particular, participants noted that the responsibilities divided between modules were often uneven and could be split into additional modules: "I think the Automatic Identification module is very big. I would have chunked it up into two smaller steps." (P1) "It's funny because all the work is actually in one of the modules." (P7)

Providing sources (3×). Some participants wanted a link to the source of the modules' scope, such as through specific "reference[s] to [the] methodology section that it references" (P9). "Maybe giving some references would be better. Yeah, to the methodology text." (P10)

4.2.2 Code. We report our findings from the manual analysis of the GPT-4-generated code from the module specifications. We found that 7 of the 23 generated code modules (30%) were executable with the Python interpreter without any modifications.
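A minimal sketch of how such an executability check could be automated, assuming each generated module is a standalone Python file; the module sources below are toy examples, not the study's generated code:

```python
import pathlib
import subprocess
import sys
import tempfile

def runs_as_is(source: str, timeout: int = 30) -> bool:
    """Return True if the Python interpreter runs `source` without error."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(source)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, timeout=timeout)
        return proc.returncode == 0
    finally:
        pathlib.Path(path).unlink()

# Toy stand-ins for generated modules; real modules would be read from disk.
modules = {
    "module_ok.py": "print('hello')",
    "module_broken.py": "import no_such_module_xyz",
}
executable = sum(runs_as_is(src) for src in modules.values())
print(f"{executable} of {len(modules)} modules ran as-is")
```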
Below, we describe the 10 codes based on correct and incorrect behaviors within the code generated by the model. Across the codes, we identified two themes: data and logic. We report the number of Python files that contained each code with the multiplication symbol (×).
Correct behaviors.We elicited 3 types of correct behaviors within the code generated by GPT-4.
High-level structure (15×). The overall logic of the code modules often followed the correct sequence of steps described in the methodology text and description. This aligns with previous work [7,44], which has noted that code generation models can help developers scaffold solutions to open-ended problems. For example, the second code module in the Pletea et al. [51] analysis plan described steps to instantiate a list of keywords, Porter stem the keywords, and filter GitHub comments based on the Porter-stemmed keywords. GPT-4 generated code corresponding to each one of those steps that contained generally correct logic to accomplish the step.
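As a sketch of those three steps, the following uses an illustrative keyword list and a naive suffix-stripper in place of a real Porter stemmer (e.g., nltk.stem.PorterStemmer); it is not the code GPT-4 produced:

```python
# Sketch of the scaffolded steps for Pletea et al. [51]:
# (1) instantiate keywords, (2) stem them, (3) filter comments on the stems.
# The keyword list is illustrative and naive_stem is a toy stand-in
# for a real Porter stemmer (e.g., nltk.stem.PorterStemmer).

def naive_stem(word: str) -> str:
    for suffix in ("ing", "ity", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

keywords = ["security", "vulnerability", "exploits", "attacking"]  # step 1
stems = {naive_stem(k.lower()) for k in keywords}                  # step 2

def is_security_related(comment: str) -> bool:
    # Step 3: a comment matches if any of its tokens shares a keyword stem.
    return any(naive_stem(token) in stems for token in comment.lower().split())

comments = [
    "fix buffer overflow exploit in parser",
    "rename variable for clarity",
]
flagged = [c for c in comments if is_security_related(c)]
```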
Correct data source (2×). We infrequently observed GPT-4 selecting the correct data source (i.e., a SQL table or loading from a pre-defined JSON file). For instance, the first module in the Pletea et al. [51] analysis plan provided instructions to "[extract] the relevant tables containing comments on commits and pull requests", which GPT-4 successfully did.
Incorrect behaviors.We identified 7 codes for the incorrect behaviors in the generated code.
Incorrect data source (15×). GPT-4 also frequently generated code that did not identify the correct data sources (i.e., a SQL table or a pre-defined JSON file) to load from. This aligns with prior work, which has found that selecting from complex database schemas is challenging for language models [69]. In the second module of the Selakovic and Pradel [57] paper, the generated code incorrectly tried to identify tests in the issue body. Additionally, there sometimes were slight errors in how the data was handled or queried: in the first module of the Fregnan et al. [22] analysis plan, only open pull requests are considered, even though this is not specified in the text. We also noticed that more complex SQL queries tended to contain incorrect logic.
Missing methodology steps (13×). GPT-4 missed code for steps in detailed or lengthy methodology text. For instance, the first module of the Tian et al. [59] analysis plan said to filter non-English commits, which was not implemented by GPT-4. Other times, a comment was left for a human programmer to complete. For example, the second module in the Selakovic and Pradel [57] analysis plan detailed instructions to "measure the execution times [of performance fixes] and keep only issues where the optimization leads to a statistically significant performance improvement." In this case, GPT-4 generated a code comment: # Check if the issue leads to a statistically significant performance improvement.
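For illustration, the step that GPT-4 left as a comment could be completed along these lines; the timing data below is hypothetical, and a permutation test stands in for whatever significance test the original authors used:

```python
import random
from statistics import mean

def significant_speedup(before, after, n_perm=2000, alpha=0.05, seed=0):
    """One-sided permutation test: is mean(after) significantly lower
    than mean(before)? A sketch of the step left as a bare comment."""
    rng = random.Random(seed)
    observed = mean(before) - mean(after)
    pooled = list(before) + list(after)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm = mean(pooled[:len(before)]) - mean(pooled[len(before):])
        if perm >= observed:
            hits += 1
    return hits / n_perm < alpha  # True: statistically significant improvement

# Hypothetical execution times (seconds) before and after a performance fix.
before = [1.02, 0.98, 1.05, 1.01, 0.99, 1.03]
after = [0.71, 0.69, 0.73, 0.70, 0.72, 0.68]
improved = significant_speedup(before, after)
```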
Incorrect logic (12×). We also observed errors in how GPT-4 generated code to implement the methodology. Errors included small logic errors, such as instantiating global variables inside loops or using the wrong type of data structure. Other times, there were errors in the general approach to implementing a part of the methodology, as in the second module of the Eyolfson et al. analysis plan, where the original paper left certain methodological details (e.g., the procedure used to identify tests on pull requests) ambiguous. One way GPT-4's generated code could be improved is if the paper methodology itself were written in a more detailed and systematic way by the human authors.
Our exploration in replicating empirical software engineering papers with GPT-4 sheds light on one factor of the replication crisis in empirical studies in computer science [14]: the lack of detail in methodologies. A majority of participants noted that these sections were not descriptive enough for individuals unfamiliar with software engineering research, such as software engineers: "[If] I hand [the methodology] off to a [software engineering] intern...and be like 'Hey write this [in code]!', I would have wanted to be a little bit more explicit." (P13)

5.1.3 Design Implications for GPT-4-Powered Tools in Software Research. Our findings also produce design implications for how to apply GPT-4 in tools for software engineering research. Overall, humans still have a vital role in GPT-4-aided replications of software engineering research papers. GPT-4 could be useful to brainstorm assumptions or provide starting points for replication code pipelines. However, humans should provide oversight to validate its outputs.
Rely on GPT-4 to scaffold analysis code pipelines. Given that GPT-4 produces a correct high-level structure for the analysis plan and code, it could assist with writing analysis pipelines for replication. However, given GPT-4's tendency to propagate errors, such as hallucinating data and APIs, it is currently better suited for scaffolding analyses in code rather than writing full implementations autonomously. Tools could leave gaps for users to write implementations for the parts of the generated code the model has low confidence in. Verifying code is the most time-consuming activity for users of code generation tools [49]; thus, developers could be more productive by filling in their own implementations.
Rely on GPT-4 to brainstorm assumptions. GPT-4 could help brainstorm assumptions, as the generated assumptions were often comprehensive and correct. However, these assumptions could lack relevance, making them less useful. This could be addressed by explaining how to work around assumptions, as suggested by a majority of our participants.
Rely on humans to validate and correct GPT-4 output. Since GPT-4 is prone to error in generating assumptions and code analysis pipelines, human oversight is necessary at all stages to validate the correctness of GPT-4 output for software engineering research. This is especially important for code generation, as only 30% of the code was executable, and the code contained many errors.
Build trust between GPT-4 and users. Given the vital role of humans in validating GPT-4 outputs, it is important for GPT-4-powered tools to build trust with users for AI-generated analyses [24]. One way is by tying each assumption, analysis plan, and line of code to methodology text, as participants wanted GPT-4 to provide sources to increase confidence in the outputs.

Takeaways
Below, we describe takeaways of our research to software engineering researchers and practitioners.

Software Engineering Researchers.
To overcome the limitations of using GPT-4 in software engineering research, future work is needed to investigate techniques that improve GPT-4's performance in this domain, such as methods for teaching LLMs like GPT-4 software engineering expertise. Theories of developer expertise have noted the importance of domain expertise [6,45] in software development. While GPT-4 demonstrated some knowledge of software engineering, it was unable to apply software engineering domain expertise in the generated code (e.g., which database table to query). Domain expertise could be taught by exposing LLMs to software engineering knowledge via fine-tuning code LLMs like Code LLaMA [54], pre-training specialized language models [26], or providing additional context while prompting about a user's replication setting (e.g., the data available) or the study's materials (e.g., datasets, scripts, or external references).
To improve GPT-4's ability to generate code for replicating research methodologies, future work could also develop datasets of research code, such as by extracting scripts from replication packages and pairing them with the corresponding research methodologies. Popular code generation benchmarks, such as HumanEval [12], represent beginner-level coding problems. However, replicating research methodologies is distinct from basic coding problems, as the former requires following an elaborate set of steps in natural language. While GPT-4 performs decently on current benchmarks [54], this performance may not generalize. Thus, specialized datasets for generating research code could improve language models' performance on this task.
Finally, since GPT-4-generated code was constrained by the quality of the methodology, future work could investigate using GPT-4 to signal when a methodology is sufficiently detailed, as the degree of correctness of the generated code could serve as a sign of how well-written the methodology is. If the methodology is too vague, GPT-4 could also be used to generate fine-grained, step-by-step implementation details to increase the accuracy of the generated code.

5.2.2 Software Practitioners. For software practitioners, our results suggest ways of using GPT-4 to replicate empirical software engineering papers. Practitioners could provide GPT-4 with the schema of their data and use the generated assumptions to assess whether they can replicate the study. GPT-4 could also help provide the high-level structure of the code analysis pipeline. This is because the assumptions and analysis plans generated by GPT-4 were found to be generally correct. GPT-4-generated assumptions could also help practitioners evaluate methodologies for replication. Participants reported that reading the assumptions allowed them to think critically about the methodology: "I mean, these assumptions are more leading me to kind of critique the paper." (P1) Practitioners should exercise caution when using GPT-4 to write code for the analysis pipeline, as we found the vast majority of the generated code to be unexecutable. However, research papers with extremely detailed research methodology could be better candidates for generating research code.

THREATS TO VALIDITY
External validity. Empirical studies have difficulty generalizing [20]. In our study, the assumptions, analysis plans, and code were generated by GPT-4 during August 2023, rather than by other LLMs such as PaLM [13], PaLM 2 [3], LLaMA [60], and LLaMA 2 [61]. GPT-4 has been shown to outperform these other models on diverse tasks [61], which suggests that these models (at least "off-the-shelf", with no fine-tuning) would perform no better. Nonetheless, it is unclear to what extent our results may generalize to other LLMs or future versions of GPT models that may behave differently.
The paper selection strategy could introduce biases and result in a small sample that is incomplete and not representative of all research papers. Certain methodologies, research topics, or writing styles may not be included in our sample due to the limited selection criteria and the sets of papers examined. Thus, the results may not generalize to all studies in empirical software engineering.
For the user study, using snowball sampling could introduce sampling bias. Our smaller sample size may also not be representative of all empirical software engineering researchers or data scientists on software engineering teams, limiting the generalizability of the results. We reduced this threat by ensuring saturation of the qualitative codes by the time data collection ended.
Internal validity. Having only a single author perform qualitative coding could introduce biases. We minimized this threat by having a second author independently validate the codes, achieving an inter-rater reliability of 80.6%. How a prompt is crafted can affect the model's performance on a task [5,10,53,64]. While we eliminated the non-determinism of model outputs by setting the temperature to 0, the results from this study could be influenced by how prompts were constructed, as a more effective prompt could elicit different outputs and thus alter the ratings and reactions from study participants. Some user study participants could skip evaluating GPT-4-generated data due to time constraints, potentially causing some collected data to be incomplete. We reduced this effect by randomizing the order of the assumptions for each participant. Additionally, user study participants could misunderstand the wording of the survey and interview questions. To reduce this threat, we piloted the survey and interview with two software engineering researchers.
Construct validity. User study participants graded the assumptions and analysis plans along subjective criteria, such as insightfulness, relevance, and descriptiveness. Thus, participants' ratings may be inconsistent or may not accurately reflect these constructs. To reduce this threat, we developed grading rubrics and piloted them with two software engineering researchers.

CONCLUSION
In this study, we investigated GPT-4's ability to replicate empirical software engineering papers. We prompted GPT-4 to generate assumptions, analysis plans (i.e., a sequence of module specifications), and code, given a research paper's methodology, for seven empirical software engineering papers.
To evaluate the assumptions and analysis plans, we ran a user study with 14 software engineering researchers and data scientists. We also manually reviewed the code generated by GPT-4 with three coauthors. We find that GPT-4 is able to generate assumptions that are correct, but that lack domain knowledge in software engineering. We also find that the code generated by GPT-4 is correct in its high-level structure, but can contain errors in its lower-level implementation. This is also reflected in the analysis plans generated by GPT-4, which are correct in their high-level structure, but lack detail in their descriptions. Our findings have implications for leveraging LLMs for software engineering research, such as teaching LLMs like GPT-4 more domain knowledge in software engineering. We have made the study's prompts; GPT-4-generated assumptions, analysis plans, and code; qualitative analysis codebooks; and survey and interview protocols available in the supplemental material [43].

DATA AVAILABILITY
Our supplemental materials are available on Figshare [43]. We include the list of assumptions, analysis plans, and code generated by GPT-4 on the seven papers that were evaluated in this study; participants' ratings of the assumptions and analysis plans; example prompts used to generate the assumptions, analysis plans, and code; the schema of the predefined SQL database; the qualitative analysis codebooks; and the survey and interview protocols from the user study.

Fig. 1. An overview of the prompts used in the study. To answer RQ1, we used Prompt 1 to generate assumptions for each paper and evaluated them in a user study. To answer RQ2, we used Prompt 2 to create an analysis plan and evaluated it in a user study. We also used Prompt 3 to create the code modules of the analysis plan and evaluated them in a code review workshop. The prompts in the figure are only summaries of the actual ones; for the complete prompts, see the supplemental materials [43].

Grading rubric excerpts:
Relevance: 1 = Does not need to be considered at all to successfully repeat the analysis on a new set of data; 5 = Must be considered to successfully repeat the analysis on a new set of data.
Insightfulness: 1 = Does not reflect an understanding of software engineering research methods or data; 5 = Reflects a deep understanding of software engineering research methods or data.
Analysis plan correctness: 1 = Cannot be used to repeat the study at all on a new set of data; 2 = Can be partially used to repeat parts of the study on a new set of data.

"The coolest [assumption] by far...was the Confounding Factors one...That was not explicitly caught out in the paper. I even had to go back and look." (P3)

Fig. 6. The distribution of participants' scoring of the GPT-4-generated analysis plans by correctness (left), as well as individual modules by correctness (middle) and descriptiveness (right), for all papers.
Fig. 4. Example GPT-4-generated code, given the module specification from the analysis plan for Pletea et al. [51]. The generated code reads data from a JSON file or database (1), runs additional logic, then outputs the result as a JSON object (2).