Do Automatic Test Generation Tools Generate Flaky Tests?

Non-deterministic test behavior, or flakiness, is common and dreaded among developers. Researchers have studied the issue and proposed approaches to mitigate it. However, the vast majority of previous work has only considered developer-written tests. The prevalence and nature of flaky tests produced by test generation tools remain largely unknown. We ask whether such tools also produce flaky tests and how these differ from developer-written ones. Furthermore, we evaluate mechanisms that suppress flaky test generation. We sample 6 356 projects written in Java or Python. For each project, we generate tests using EvoSuite (Java) and Pynguin (Python), and execute each test 200 times, looking for inconsistent outcomes. Our results show that flakiness is at least as common in generated tests as in developer-written tests. Nevertheless, existing flakiness suppression mechanisms implemented in EvoSuite are effective at alleviating this issue (71.7 % fewer flaky tests). Compared to developer-written flaky tests, the causes of generated flaky tests are distributed differently: their non-deterministic behavior is more frequently caused by randomness than by networking and concurrency. With flakiness suppression enabled, the remaining flaky tests differ significantly from any flakiness previously reported; most are attributable to runtime optimizations and EvoSuite-internal resource thresholds. These insights, together with the accompanying dataset, can help maintainers improve test generation tools, give recommendations for developers using these tools, and serve as a foundation for future research on test flakiness and test generation.


INTRODUCTION
A flaky test is a test case that produces inconsistent results, meaning that the same test can pass or fail for no apparent reason, even when the system being tested has not changed [51]. They are a major problem for software developers because they limit the efficiency of testing, complicate continuous integration, and reduce productivity [19,39,48]. The negative effects of flaky tests are ubiquitous, experienced by large companies such as Google, Microsoft, and Facebook, as well as the developers of smaller open-source projects [19,36,45,47]. Indeed, recent surveys found that a majority of developers observe flaky tests on at least a monthly basis [29,52]. As well as being a burden on developers, flaky tests are also a persistent problem in research, limiting the deployment of several state-of-the-art techniques for test selection and prioritization [45,54,68].
Increasing research interest in the area of flaky tests has produced a range of empirical studies regarding the causes, origins, and impacts of developer-written flaky tests [21,37,44,64]. However, far less attention has been paid to flaky tests produced by automatic test generation tools [53,60]. This research gap is problematic for several reasons. Firstly, there is minimal guidance for developers regarding the sorts of flaky tests they might expect to receive from test generation tools and, more crucially, how to avoid them. This threatens to detract from the previously established positive effects of such tools on the software development lifecycle [61]. Similarly, the maintainers of test generation tools have only limited information on the prevalence of automatically generated flaky tests and what causes them. These insights are crucial for maintainers to prevent their tools from producing flaky tests. Furthermore, researchers in the field of flaky tests would benefit from an analysis of how the root causes of developer-written flaky tests compare to those of automatically generated ones. Such an investigation would inform researchers on whether generated flaky tests are representative of their developer-written counterparts, which would be useful for augmenting existing datasets of developer-written flaky tests with generated flaky tests [35].
In this study, we used the popular search-based test generation tools EvoSuite [25] and Pynguin [42] to generate test suites for 1 902 Java projects and 4 454 Python projects, respectively. We repeatedly executed both the developer-written and the automatically generated test suites of all projects, consisting of nearly a million individual test cases, 200 times each to detect flaky tests. We compared the prevalence of flakiness between both types of test suites and went one step further by comparing root causes, based on our manual analysis of a random sample of 481 non-order-dependent flaky tests. Furthermore, we performed the first scientific evaluation of the effectiveness of EvoSuite's built-in flaky test suppression feature. Chief among our findings: flaky tests are at least as common in generated tests as they are in developer-written tests, EvoSuite's flaky test suppression feature can reduce the number of generated flaky tests by 71.7 %, and the distribution of the root causes of generated flaky tests differs from that of developer-written tests.
The main contributions of this study are as follows:

Contribution 1 (Empirical study): Our empirical study involving 6 356 open-source projects is the largest among all previous studies on both flaky tests and search-based test generation. It is also the first to analyze the root causes of automatically generated flaky tests. See Section 3 for more information.

Contribution 2 (Recommendations): The results of our study have important implications for software developers, maintainers of test generation tools, and researchers in the area of flaky tests. From these, we offer insights and recommendations that are actionable by these stakeholders. See Sections 4 and 5 for more information.

Contribution 3 (Dataset): The dataset we collected for this study is the first publicly available dataset of flaky tests that contains automatically generated flaky tests and features a large manually annotated sample. See our replication package for more information [11].

BACKGROUND

2.1 Flaky Tests
Luo et al. [44] performed one of the earliest empirical studies of test flakiness. They categorized the causes of the flaky tests repaired by developers in 201 commits across 51 open-source projects using the following ten categories:

1. Async Wait. Test makes an asynchronous call but does not properly wait for it to finish, leading to intermittent failures.
2. Concurrency. Test spawns multiple threads that behave in an unsafe or unanticipated manner, such as a race condition.
3. Floating Point. Test uses floating-point arithmetic and is flaky due to unexpected results such as non-associative addition.
4. Input/Output (I/O). Test uses the filesystem and is flaky due to intermittent issues such as storage space limitations.
5. Network. Test depends on the availability of a network and is flaky when the network is unavailable or busy.
6. Order Dependency. Test depends on a shared value or resource that is modified by other test cases as a side effect.
7. Randomness. Test involves random number generators and is flaky due to not setting seeds, for example.
8. Resource Leak. Test does not release acquired resources (e.g., a database connection), inducing flaky failures for itself or for other tests that require the same resources.
9. Time. Test relies on measurements of date and/or time. Flakiness is caused by, for instance, discrepancies in precision and representation of time across libraries and platforms.
10. Unordered Collection. Test assumes a deterministic iteration order for an unordered collection-type object, such as a set, leading to intermittent failures.

Eck et al. [21] asked Mozilla developers to categorize the causes of 200 flaky tests they had previously repaired. The developers used the categories introduced by Luo et al., but with the option to create new categories if needed. Following this, Eck et al. identified four additional categories:

11. Too Restrictive Range. Test includes a range-based assertion that excludes a portion of the valid output values.
12. Test Case Timeout. Test intermittently exceeds a pre-defined upper limit on its execution time.
13. Platform Dependency. Test outcome varies across the project's target platforms.
14. Test Suite Timeout. Test intermittently exceeds a pre-defined upper limit on the execution time of the test suite.

While studying flakiness in Python tests, Gruber et al. [32] identified another root cause:

15. Infrastructure. Test fails intermittently due to issues outside the project code, but inside the execution environment (the container or the local host), for example, permission errors or lack of disk space.
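To make two of these categories concrete, the following minimal JUnit 4 sketch shows a test that is flaky due to Randomness (category 7) and one that is flaky due to an Unordered Collection (category 10). The code is hypothetical and not taken from our subject projects.

import static org.junit.Assert.assertEquals;

import java.util.HashSet;
import java.util.Iterator;
import java.util.Random;
import java.util.Set;

import org.junit.Test;

public class FlakyExamplesTest {

    // Randomness (7): no seed is set, so the assertion only holds for
    // the rare executions in which nextInt happens to return 42.
    @Test
    public void flakyDueToRandomness() {
        int value = new Random().nextInt(100);
        assertEquals(42, value);
    }

    // Unordered Collection (10): the iteration order of a HashSet is
    // unspecified, so the first element returned by the iterator may
    // differ across JDK versions or runtime environments.
    @Test
    public void flakyDueToUnorderedCollection() {
        Set<String> set = new HashSet<>();
        set.add("a");
        set.add("b");
        Iterator<String> it = set.iterator();
        assertEquals("a", it.next());
    }
}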

2.2 Automatic Test Generation
EvoSuite is an automatic test generation tool for Java projects [25,26]. It uses a search-based approach to generate test suites that cover as much of the code under test as possible. The tool is based on evolutionary algorithms, meaning that it uses principles from biological evolution to search for test cases [46]. The search is guided by a fitness function that can be configured to optimize test suite generation for high line, branch, or mutation coverage. The tool applies techniques from mutation testing to minimize the number of assertions in the generated tests [33], which makes them more easily understandable to developers and not overly brittle. A large-scale empirical study demonstrated that EvoSuite can achieve an average of 71 % branch coverage per class [27]. EvoSuite also provides ways to suppress "unstable" (flaky) tests, addressing this issue both during the evolutionary search process and after its completion. EvoSuite detects and suppresses unstable tests by controlling environmental dependencies using bytecode instrumentation, resetting the state of static variables before executing each test, using mocks to replace non-deterministic calls, and compiling and executing the generated tests, removing those that fail [14].
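The following hand-written sketch illustrates the mocking idea with a deterministic stand-in for java.util.Random; the class names are hypothetical, and EvoSuite achieves the same effect transparently via bytecode instrumentation rather than through such an explicit dependency seam.

import java.util.Random;

public class MockingSketch {

    // Deterministic stand-in for java.util.Random: a fixed seed makes
    // every execution observe the same sequence, so assertions on the
    // result become stable across runs.
    static class DeterministicRandom extends Random {
        DeterministicRandom() {
            super(42L);
        }
    }

    // The code under test receives its Random as a dependency, so a
    // generated test can substitute the deterministic version.
    static int rollDie(Random random) {
        return random.nextInt(6) + 1;
    }

    public static void main(String[] args) {
        // Prints the same value on every run.
        System.out.println(rollDie(new DeterministicRandom()));
    }
}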
Pynguin is an automatic test generation tool for Python projects [41-43]. Target languages aside, Pynguin and EvoSuite share several similarities: both tools use a search-based approach to generate test cases, and both apply mutation testing to generate assertions. Pynguin faces the additional challenge that Python is a dynamically typed language, meaning that generating test inputs of the appropriate type is much harder. Therefore, Pynguin relies on existing type hints in function and class definitions in the code under test. From these, Pynguin can apply type inference to attempt to determine the types of variables without hints. A previous empirical evaluation of Pynguin found that it was able to achieve a mean branch coverage of 71.6 % on 163 Python modules from 20 open-source projects [43].

METHODOLOGY
With our study, we aim to answer the following research questions:

RQ1 (Prevalence): How prevalent is flakiness in tests that were generated without flakiness suppression mechanisms?
RQ2 (Flakiness Suppression): How many flaky tests can EvoSuite's flakiness suppression mechanism prevent?
RQ3 (Root Causes): How do the root causes of generated flaky tests differ from those of developer-written tests?
Fig. 1 depicts an overview of our study setup.

3.1 Project Sampling
To collect subjects for our empirical study, we randomly sampled open-source projects written in Java and Python.These are two of the most popular programming languages, which have also been the main targets for both test flakiness and test generation research.
3.1.1 Java. To collect Java projects, we used the index of the Maven Central Repository [4], one of the official software repositories for Java. The index is updated weekly with newly added projects and patches, and consists of roughly 520 000 unique packages (as of 2022-10-26). We iterated over the entire index and each project's Project Object Model (POM) to fetch the URL of the project repository. We limited our project sampling to projects whose source code is available on GitHub and which use Maven as a build tool. Since a project's POM in the Maven Central Repository does not include details of the build automation tool it uses (the index also includes projects built with Gradle or Ant), we crawled the GitHub URLs to filter for projects that include a pom.xml file in the root of their repository, confirming that the project uses Maven as its build automation tool. In total, we found 38 841 Maven projects that include a link to the project repository on GitHub. While some projects may lack developer-written tests, we did not exclude them during this crawling process (but we did so later, during the test outcome analysis). We decided against sampling projects from existing flakiness databases, such as IDoFT [35], because we did not want to limit our study to projects that already contain developer-written flaky tests.

3.1.2 Python. To collect Python projects, we used the dataset from Gruber et al. [32], who studied flakiness in Python. It consists of 22 352 projects that were randomly sampled from the Python Package Index (PyPI) [10], the official third-party software repository for Python. Each project contains at least one test that can be executed using pytest [8], and its source code is available on GitHub. Unlike IDoFT [35], these projects do not all contain flaky tests: the original study found 7 571 flaky tests among 1 006 projects.

3.2 Test Generation
To generate tests for the sampled projects, we use state-of-the-art test generation tools for the respective language. For Java, we use EvoSuite [25], an automated test generation tool that utilizes metaheuristic techniques to generate JUnit test suites. For Python, we use Pynguin [41-43], a test generation tool that produces unit tests for Python programs. The tests are generated from scratch, not relying on any existing developer-written tests as input [56].

3.2.1 Java. We use the latest release of EvoSuite at the time (v1.2.0) to generate tests for each Java project. Since EvoSuite applies multiple techniques to avoid creating flaky tests, and we want to measure the impact of these flakiness suppression mechanisms, we apply EvoSuite under two configurations: with and without flakiness suppression, which we refer to as EvoSuite FSOn and EvoSuite FSOff. The flakiness suppression parameters are turned on by default, which means that for EvoSuite FSOn, we do not change any of the EvoSuite parameters. To generate tests without the flakiness suppression mechanisms (EvoSuite FSOff), we update several parameters of EvoSuite that we extracted from previous studies [14,24,26], and which we confirmed via in-depth discussions with one of the EvoSuite maintainers on our team. Table 1 shows the parameters that we changed to deactivate EvoSuite's flakiness suppression. These parameters consist of actions carried out during the test generation process to mitigate non-determinism. They address factors related to managing environmental dependencies, establishing a virtual file system, mocking non-deterministic output such as the Random [2] and Calendar [1] classes, and resetting the state of static and final fields to avoid creating dependencies on other tests.

For each of the Java projects, both EvoSuite FSOn and EvoSuite FSOff generate tests for every testable class in the system under test (SUT). A class is considered testable if it has at least one public method, and EvoSuite aims to generate a test suite that covers all public methods. We set the search budget for generating tests to two minutes per class for both configurations, which is what previous tool competitions have used [58,65].

3.2.2 Python. To generate tests for the Python projects, we use version 0.27.0 of Pynguin, the latest release at the time (2022-11-14). Pynguin operates on module level, and we apply it to each module contained in each of our sample projects. Like Lukasczyk et al. [43], we use a maximum search budget of ten minutes. Unlike EvoSuite, Pynguin does not offer optional parameters for flakiness suppression. Instead, it applies a re-execute-once strategy to "filter out trivially flaky assertions, e.g., strings that include memory locations" [7]: After generating a test that contains passing assertions, the test is executed again. Any assertion that does not hold in this execution is excluded, as it was made on an apparently flaky value. This behavior is inspired by EvoSuite's JUnit Check; however, since it is not optional and Pynguin has no further flakiness-relevant parameter, we execute Pynguin using only one configuration: the default settings.

Both EvoSuite and Pynguin generate tests non-deterministically, meaning that they generate different test suites every time the tool is used. Most previous studies investigating automatic test generation generated more than one test suite per project to take the random nature of the evolutionary search [42,60] into account. Unlike these, we generate only one test suite per class/module, since our study does not draw any conclusions by comparing individual projects or components. Instead, we accommodate for the randomness in the test generation by sampling a large corpus of projects, which also contributes to the generalizability of our findings.

3.3 Test Execution
To detect flaky tests, we execute all generated tests and all developer-written tests 100 times in the same order and 100 times in random orders. This procedure follows other studies [32,38] and allows us to distinguish order-dependent (OD) from non-order-dependent (NOD) flaky tests. The test executions are conducted either directly through FlaPy [31] or in a setup inspired by it. FlaPy is a tool that allows researchers to mine flaky tests from a given set of projects by repeatedly executing their test suites. It ensures a fresh and isolated environment for each test execution and handles dependency installation. Furthermore, it splits the runs into iterations, where each iteration is executed in a separate Docker container, which helps to avoid timeouts and to detect environment issues, such as infrastructure flakiness [32]. In our case, we split the 200 runs into at least five iterations per project and test type. Generated and developer-written tests are not executed together, but in separate iterations, to avoid side effects.
To install third-party dependencies of the project under evaluation, we use language-specific pipelines: For Java, this can easily be accomplished by using 'mvn dependency:copy-dependencies' to copy all of a project's dependencies to our local machine. We then update the environment variables to include all the dependencies that the project needs when executing the tests.
For Python, this process is more complicated, since the general landscape of build systems is more heterogeneous. As we cannot rely on a standardized solution, we use FlaPy's built-in dependency installation heuristic, which searches for requirements.txt (or similarly named) files and installs them via pip. To execute the projects' test suites, we use the JUnit Runner [3] for Java and pytest [8] for Python. When conducting test executions in random orders, we shuffle the tests on class level, which first randomly sorts the classes and then the tests within each class. For Python, this can easily be accomplished using pytest's random-order plugin [9]. For Java, we had to create a custom test runner, since Maven's Surefire plugin [5] currently does not support this form of shuffling; a sketch of the shuffling logic follows below.
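The following minimal sketch shows the class-level shuffling logic; it is a simplified, hypothetical helper rather than our actual runner.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class ClassLevelShuffle {

    // Produces a randomized execution order: first the test classes are
    // shuffled, then the test methods within each class, so that tests
    // of the same class stay adjacent (mirroring pytest's random-order
    // plugin at class level).
    static List<String> shuffledOrder(Map<String, List<String>> testsByClass) {
        List<String> order = new ArrayList<>();
        List<String> classes = new ArrayList<>(testsByClass.keySet());
        Collections.shuffle(classes);
        for (String testClass : classes) {
            List<String> tests = new ArrayList<>(testsByClass.get(testClass));
            Collections.shuffle(tests);
            for (String test : tests) {
                order.add(testClass + "#" + test);
            }
        }
        return order;
    }
}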

3.4 Test Outcome Analysis
Table 2 depicts the number of projects for which we were able to successfully execute the developer-written tests and to successfully generate tests using EvoSuite or Pynguin. We consider the execution of the developer-written test suite to be successful if at least one test case was executed without producing an Error or Skip outcome. We consider the test generation to be successful if at least one executable test was generated. Since we use the same generic setup processes for all projects of the same language, we were not able to successfully execute the developer-written test suite and the test generation for every sampled project. Reasons for erroring test executions or test generation include project-specific requirements that go beyond standard third-party dependencies, such as setting global variables or installing system software.
To assess whether we still derived a sufficiently large and diverse set of projects, we inspect two quantitative metrics: the lines of source code (SLOC) of the projects, measured via CLOC [16], and the number of developer-written tests they possess (Fig. 2). To assess whether we applied the test generation tools properly, we look at the coverage that the generated tests achieved and compare it with the coverage reported by previous studies applying these tools.
3.4.1 Java. On average (mean), the Java projects possess 4 948 lines of source code (median 1 395). Only 1.5 % of all projects contain fewer than 100 SLOC, whereas the largest project (cosmos-sdk-java) has more than 500 000 lines of Java code. The total number of SLOC is around 9.4 million for all projects combined. Fig. 2a shows the histogram of the SLOC distribution for the Java projects. Fig. 2b shows the number of developer-written test cases per Java project. Multiple parametrizations of the same test case are treated as separate test cases, following a previous study [32]. In total, the projects possess 163 305 developer-written tests, and each project contains between 1 and 6 315 test cases. The median number of tests per project is 18, and the mean is 85.9. As these figures about SLOC and test cases show, we have indeed derived a large and diverse sample of Java projects. Both EvoSuite FSOn and EvoSuite FSOff generated test suites with high branch (over 81 % mean) and line (over 84 % mean) coverage (Fig. 3). This is similar to the code coverage reported by previous studies on EvoSuite [27,28].
3.4.2 Python. Fig. 2c depicts the size of the 4 454 Python projects in terms of SLOC. The average (mean) project has 1 755 SLOC (median 549.5). The largest project (kuber) has more than 500 000 lines of source code, and only 5.2 % of projects contain fewer than 100 SLOC. Combined, the projects feature 7.8 million lines of Python code. Fig. 2d shows the size of the Python projects in terms of the number of developer-written tests they contain. As for the Java projects, the mean number of test cases per project (68.2) is substantially greater than the median (14), which is caused by a small number of very large projects. In total, the Python projects contain 303 711 developer-written test cases. After inspecting the projects both in terms of SLOC and number of test cases, we find no obvious bias towards overly small or large projects and conclude that we have derived a large and diverse sample of Python projects. The Python tests generated by Pynguin yielded a mean branch coverage of 66.0 % (Fig. 3) with a standard deviation of 31.7. This performance is very similar to the one reached by Lukasczyk et al. [43], the creators of Pynguin, who achieved a mean branch coverage of 71.6 % with a standard deviation of 30.5.

3.4.3 RQ1 (Prevalence). To study the prevalence of flakiness in generated tests, we compare the number of flaky tests that were created by the test generation tools (without using flakiness suppression) to the number of flaky tests found in the developer-written tests of the respective language. We regard a test as flaky if it yielded at least one passed and one failed or errored outcome [17,32]. Tests that switch between failing and erroring verdicts are therefore not considered flaky, since both verdicts lead to a build failure, so such a test does not contribute to the typical developer experience caused by flakiness (sporadically failing builds). Furthermore, we look at the ratio between order-dependent flaky tests (which only show flaky behavior when run in random orders) and non-order-dependent flaky tests (which also show flaky behavior when run in the same order). Lastly, we also compare the projects containing at least one developer-written or generated flaky test to investigate whether generated and developer-written flakiness tends to appear in the same projects.
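This classification rule can be stated compactly as follows; the sketch below uses hypothetical helper names and simplifies our actual analysis scripts.

import java.util.ArrayList;
import java.util.List;

public class FlakinessCriterion {

    enum Verdict { PASS, FAIL, ERROR, SKIP }

    // Flaky: at least one Pass and at least one Fail or Error across
    // all 200 recorded executions. Tests alternating only between Fail
    // and Error are not flaky, as both verdicts break the build anyway.
    static boolean isFlaky(List<Verdict> sameOrder, List<Verdict> randomOrder) {
        List<Verdict> all = new ArrayList<>(sameOrder);
        all.addAll(randomOrder);
        return hasPassAndFailure(all);
    }

    // Order-dependent (OD): flaky overall, but stable in the fixed
    // order; non-order-dependent (NOD) tests are already flaky there.
    static boolean isOrderDependent(List<Verdict> sameOrder, List<Verdict> randomOrder) {
        return isFlaky(sameOrder, randomOrder) && !hasPassAndFailure(sameOrder);
    }

    private static boolean hasPassAndFailure(List<Verdict> verdicts) {
        boolean pass = verdicts.contains(Verdict.PASS);
        boolean failOrError = verdicts.contains(Verdict.FAIL)
                || verdicts.contains(Verdict.ERROR);
        return pass && failOrError;
    }
}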

3.4.4 RQ2 (Flakiness Suppression). To assess the effectiveness of EvoSuite's flakiness suppression, we compare the number of flaky tests generated by EvoSuite FSOn to those generated by EvoSuite FSOff, and to developer-written tests. For each Java project and each of the three test types, we compute the ratio of flaky to non-flaky tests and use a Wilcoxon signed-rank test [67], a commonly used, non-parametric paired difference test, to check for statistically significant differences. We refrain from using a parametric test, as we found our data not to be normally distributed according to a Shapiro-Wilk test [62]. Since Pynguin does not offer any optional flakiness suppression mechanisms, we limit our comparison to EvoSuite.
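The comparison can be sketched with the WilcoxonSignedRankTest implementation from Apache Commons Math 3; the text does not prescribe a particular statistics library, and the per-project ratios below are made-up placeholders.

import org.apache.commons.math3.stat.inference.WilcoxonSignedRankTest;

public class FlakinessRatioComparison {

    public static void main(String[] args) {
        // Paired per-project ratios of flaky to non-flaky tests
        // (placeholder values; one entry per project).
        double[] ratiosFsOff = {0.020, 0.015, 0.005, 0.031, 0.012, 0.008};
        double[] ratiosFsOn  = {0.004, 0.000, 0.001, 0.010, 0.003, 0.000};

        WilcoxonSignedRankTest test = new WilcoxonSignedRankTest();
        // exactPValue = false selects the normal approximation, which is
        // the usual choice for larger samples such as our project set.
        double p = test.wilcoxonSignedRankTest(ratiosFsOff, ratiosFsOn, false);
        System.out.printf("p-value: %.4f%n", p);
    }
}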

3.5 Root Cause Analysis
In our last research question, we investigate the similarity of developer-written and generated flaky tests regarding their root causes, which are seen as a core property of flakiness [44,51]. Like other studies [32,44], we categorize the flaky tests' root causes by labeling them manually along established categories. Namely, we use the amalgamation of root causes collected by Parry et al. in 2021 [51] as our initial set of pre-defined categories (items 1 to 14 in Section 2.1). Since we can automatically detect order dependency (OD) through test executions in random orders, we only consider non-order-dependent (NOD) flaky tests for this step.
3.5.1 Sampling. As we found a total of 1 740 NOD flaky tests, we have to take a representative sample to keep the labeling feasible. To avoid creating a bias towards projects with only a few flaky tests (e.g., by randomly selecting projects), or towards tests from only a few large projects (e.g., by randomly selecting flaky tests), we combine two sampling strategies: First, we randomly select one NOD flaky test from each affected project, regardless of the test type (generated or developer-written), resulting in the breadth sample. Second, we randomly choose 21 Java and 9 Python projects and sample all their flaky tests (depth sample). These projects are evenly distributed regarding the type, or combination, of flaky tests they contain (Java developer-written, EvoSuite FSOn, EvoSuite FSOff, Python developer-written, Pynguin). Using this technique, we sampled a total of 481 flaky tests (340 Java, 141 Python): 329 from the breadth sample, 122 from the depth sample, and 30 selected by both strategies.

3.5.2 Labeling. The manual labeling itself is carried out by four of the authors using the project code, the test code, and the test failures (stack trace and error message). To create a common understanding of what constitutes a certain root cause, and to assess whether the existing root cause categories are applicable to generated flaky tests, we precede the actual labeling with an alignment step: We randomly choose 50 flaky tests from our sample, which are then classified by all four researchers. According to Fleiss' Kappa [23] (computed as shown after this list), the four labeling authors reached an inter-rater reliability of 0.41, which is considered 'good' according to Regier et al. [55]. For cases in which the authors disagree, discussions are held, which have resulted in the following adjustments to the set of root causes:

• Broaden the category unordered collection to also include unspecified behavior in general.
• Broaden the category resource leak to also include resource unavailability.
• Add the category performance, which describes tests that fail intermittently due to varying durations of (sequential) processes (example: Fig. 5).
• Add the category non-idempotent-outcome (NIO), which covers self-polluting and self-state-setting tests. This cause was described by Wei et al. [66] shortly after the literature survey on which we based our root cause categories [51].
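For reference, Fleiss' Kappa is computed as follows, where N is the number of items (here 50), n the number of raters (here 4), k the number of categories, and n_{ij} the number of raters assigning item i to category j:

\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e},
\qquad
\bar{P} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{n(n-1)} \Big( \sum_{j=1}^{k} n_{ij}^{2} - n \Big),
\qquad
\bar{P}_e = \sum_{j=1}^{k} \Big( \frac{1}{Nn} \sum_{i=1}^{N} n_{ij} \Big)^{2}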
After finishing the alignment, the remaining flaky tests in the sample are labeled each by one of the four researchers.

3.5.3 RQ3 (Root Causes). We use the labeled root causes to make three main comparisons: First, the root causes we found in developer-written tests against those found by previous studies. Second, the root causes of flaky tests generated without flakiness suppression (EvoSuite FSOff, Pynguin) against those of developer-written tests of the respective language. Third, the root causes of tests generated with and without flakiness suppression (EvoSuite FSOff vs. EvoSuite FSOn).

3.6 Threats to Validity

3.6.1 External Validity. To sample projects for our study, we relied on the Maven Central Repository [4] and PyPI [10], which are the largest official software repositories for Java and Python. However, we had to make certain assumptions to keep our setup feasible: For Java, we only considered projects using Maven, and we excluded projects using JUnit 3 due to compatibility issues with more recent versions. While these design decisions might potentially influence our results, we tried to mitigate this threat by assuming the usage of the predominantly used build automation and testing technologies (Maven and JUnit), on which other studies on test flakiness also rely [13,38,63]. For Python, we used an existing dataset of Python projects [32], which was also used by other researchers to evaluate flakiness detection and debugging techniques [12,30,57,66]. Nevertheless, we inherit any potentially existing bias in this dataset.

3.6.2 Construct Validity. To detect test flakiness, we executed each test 100 times in a fixed order and 100 times in shuffled orders. However, some flaky tests have very low failure rates, which might have caused us to underestimate the number of flaky tests in our dataset. Another potential threat to the construct validity of our study is the search budget used for test generation. We gave EvoSuite two minutes per Java class and Pynguin ten minutes per Python module; allowing more time might have yielded different results. Choosing a meaningful search budget is a non-trivial issue, especially when setting up experiments that include thousands of projects of widely varying sizes [15]. To achieve a balance between feasibility, resembling a practical use case, and giving sufficient resources to the tools, we chose our search budgets according to the most commonly used configurations in tool competitions [58,65] and in evaluations by the maintainers [43]. We also measured the coverage of the generated tests and found that they yielded a high branch coverage, which indicates that our search budgets were sufficient.

3.6.3 Internal Validity. As we found almost 1 800 NOD flaky tests, we had to take a sample before manually labeling their root causes, which might pose a potential threat to the validity of our findings. To avoid favoring overly large or small projects, we applied a twofold sampling strategy (see Section 3.5.1). Each flaky test was then manually labeled by one of four authors. This might pose a further threat, as the authors have different backgrounds and experiences when it comes to root-causing flaky tests. To mitigate this issue, we created an alignment sample of 50 flaky tests that were labeled by all four researchers, and we held discussions about cases in which we disagreed. In our alignment sample, we reached a 'good' inter-rater reliability, meaning that we agreed in most verdicts even before starting the alignment. Furthermore, the root causes we found for developer-written flaky tests match those of previous studies [21,44], which increases our confidence in the validity of our other findings.

RESULTS

4.1 RQ1: Prevalence
Table 3 depicts the number of flaky tests we discovered for each language and configuration. Overall, we executed almost 1.2 million tests and discovered 9 568 flaky tests, roughly two-thirds of them generated ones. For developer-written tests, we found roughly 0.5 % to 1 % of all tests to be flaky, which is similar to previous studies on flakiness in Java [38] and Python [32]. Like them, we also found the ratio between OD and NOD flaky tests to be almost even for Java projects, whereas it tilts strongly towards order dependency for Python projects.
Looking at the flaky tests generated by EvoSuite FSOff (3 832) and Pynguin (1 013), we see that for both languages/tools, flakiness is more prevalent in generated than in developer-written tests, relative to the total number of generated/developer-written tests. In the case of Java, we even see an increase of 54 % (0.94 % to 1.45 %). When distinguishing between order- and non-order-dependent flakiness, we see a strong tendency towards OD flaky tests (91 % of flaky Pynguin tests are OD, 84 % of flaky EvoSuite FSOff tests are OD).
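The 54 % figure follows from the relative change of the per-test flakiness rates; the same formula also reproduces the reductions reported for RQ2 below:

\text{relative change} = \frac{r_{\text{new}} - r_{\text{old}}}{r_{\text{old}}},
\qquad
\frac{1.45 - 0.94}{0.94} \approx +54\,\%,
\qquad
\frac{0.41 - 1.45}{1.45} \approx -71.7\,\%,
\qquad
\frac{0.41 - 0.94}{0.94} \approx -56.4\,\%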
To check whether generated flaky tests tend to appear more frequently in projects that already contain developer-written flaky tests, we looked at the sets of projects containing at least one flaky test. We found 224 Python projects containing generated flaky tests and 341 projects having developer-written ones. For Java, EvoSuite FSOff produced flaky tests for 228 projects, while 161 projects contained developer-written flaky tests. Fig. 4 depicts the overlap between these sets, which is notably small: Only 17.1 % of Java and 19.6 % of Python projects that contain generated flaky tests also contain developer-written flaky tests.
Summary (RQ1: Prevalence): For both Java and Python projects, flakiness is at least as common in generated tests as in developer-written tests. However, it does not tend to appear in the same projects. Similar to developer-written tests in Python (but unlike Java tests), the ratio between order-dependent and non-order-dependent flaky tests leans strongly towards order dependency for generated tests.

4.2 RQ2: Flakiness Suppression
The second row in Table 3 (Java, EvoSuite FSOn) depicts the amount of flakiness we found among tests generated while using flakiness suppression. We observe a significant (p-value of the Wilcoxon test < 0.001) reduction in flakiness of 71.7 % (1.45 % to 0.41 %) compared to EvoSuite FSOff, and of 56.4 % compared to the developer-written tests. Among the remaining flaky tests, order dependency is again far more common than non-order-dependent causes (86.4 % of EvoSuite FSOn flaky tests are OD). Looking at projects containing flaky tests (Fig. 4a), we see only a minor overlap between EvoSuite FSOn (133 projects) and the developer-written tests (161 projects), just as we found for EvoSuite FSOff (228 projects). When comparing the two EvoSuite configurations, we are surprised to find only a moderate overlap as well. To investigate this observation more deeply, we look at the flaky tests' root causes in Section 4.3.

Summary (RQ2: Flakiness Suppression): EvoSuite's flakiness suppression mechanism is effective: It reduced the number of flaky tests by 71.7 %, to a rate considerably lower than that of developer-written flaky tests (56.4 % fewer flaky tests). The ratio of NOD to OD flaky tests remains strongly tilted towards OD.

4.3 RQ3: Root Causes
Table 4 depicts the distribution of flakiness root causes that we identified in our sample via manual labeling (Section 3.5).
For Java projects, we found asynchronous waiting to be a major cause (21.2 %) of flakiness in developer-written tests, which corroborates previous studies [21,44]. However, we also found many flaky tests to be caused by brittle assumptions about the performance (i.e., duration) of sequential processes (19.4 %), a root cause that has not previously been described. Fig. 5 shows an example of such a test: the assertion on line 113 is flaky as it assumes that the execution time (line 106) is within a certain range, which is not guaranteed. For Python projects, the main causes of developer-written flakiness are networking (30.1 %) and randomness (17.2 %), which was also found by the study from which we sampled our projects [32].

Among the flaky tests generated with flakiness suppression (EvoSuite FSOn), two causes dominate. Firstly, Verifying Expected Exceptions describes issues occurring when a test case expects a certain exception to be thrown and makes assertions about where (i.e., by which class) the exception was thrown. In other words, the test case asserts that the top of the stack trace of an expected exception has a certain value. Such tests can be flaky since a stack trace can change intermittently, even for the same exception. This is caused by runtime optimizations, namely just-in-time (JIT) compilation, which might decide at any point during program execution to compile a frequently executed area of a class to native code, causing it to no longer appear in the stack trace [34,50]. Fig. 10a shows an example of such a case: The test expects an IndexOutOfBoundsException thrown by java.nio.Buffer. Sometimes, however, this exception is instead thrown by java.nio.HeapByteBuffer. The test is flaky due to the JVM's default compilation behavior, where the JIT compiler may compile certain parts of the java.nio.Buffer class to native code, causing them to no longer appear on top of the stack trace (as shown in Fig. 10b). Such optimization-based flakiness does not happen in tests generated by EvoSuite FSOff, as we set the 'No Runtime Dependency' parameter to true, which prevents EvoSuite from generating tests that depend on its runtime library.

Secondly, EvoSuite FSOn generates flaky tests that produce intermittent StackOverflowErrors, which was also discovered by a previous study [22]. These errors occur consistently when the flakiness suppression is turned off. EvoSuite FSOn includes an internal resource threshold (limiting the stack size) to prevent a test case from producing a StackOverflowError; however, the resource checking is non-deterministic, and some errors manage to slip through. Such issues do not occur in EvoSuite FSOff because we have disabled the generation of test scaffolding files (first row of Table 1), which include a check to prevent infinite loops in recursive methods. However, not generating scaffolding files for test classes makes the generated tests more susceptible to traditional causes of flaky tests [24].
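The Verifying Expected Exceptions pattern can be reproduced with a minimal hand-written test; the following is an illustrative reconstruction, not the verbatim generated test from Fig. 10, and the observed top frame depends on the JDK version and JIT state.

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.fail;

import java.nio.ByteBuffer;

import org.junit.Test;

public class VerifyExpectedExceptionTest {

    @Test
    public void expectsExceptionThrownByBuffer() {
        ByteBuffer buffer = ByteBuffer.allocate(1);
        try {
            buffer.get(5); // index beyond the limit
            fail("expected an IndexOutOfBoundsException");
        } catch (IndexOutOfBoundsException e) {
            // Flaky assertion: while the bounds check runs interpreted,
            // java.nio.Buffer typically sits on top of the stack trace;
            // once the JIT compiles (and inlines) the check, the top
            // frame may become java.nio.HeapByteBuffer instead.
            assertEquals("java.nio.Buffer",
                    e.getStackTrace()[0].getClassName());
        }
    }
}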
Summary (RQ3: Root Causes): Generated tests are flaky for the same reasons as developer-written ones; however, the distribution among those reasons differs: While developer-written flaky tests are often caused by concurrency and networking operations, generated flaky tests tend to be the result of randomness and unspecified behavior. When using flakiness suppression, the picture changes vastly, as the majority of the remaining flaky tests do not fit any previously described category of flakiness. Instead, they are caused by runtime optimizations and EvoSuite-internal resource thresholds. Notably, both only take effect when certain flakiness suppression mechanisms are activated!

RECOMMENDATIONS

5.1 Maintainers of Test Generation Tools
We found EvoSuite's flakiness suppression mechanisms to be very effective and can recommend them for other tools, such as Pynguin, whose rate of flaky tests is currently still higher than that of developer-written tests. Nevertheless, EvoSuite still produces flaky tests, which are mainly caused by (1) Verifying Expected Exceptions, related to the 'No Runtime Dependency' option, and (2) StackOverflowErrors, caused by scaffolding. Most notably, both mechanisms are meant to prevent traditional causes of flakiness, and they do accomplish this. We therefore recommend revisiting EvoSuite's implementation of the 'No Runtime Dependency' option and the scaffolding mechanisms to eradicate the flakiness they tend to introduce. Furthermore, we recommend studying and addressing order dependency in generated tests, as we found high numbers of OD flaky tests for both EvoSuite and Pynguin.
Our dataset [11] should provide the maintainers of both EvoSuite and Pynguin with a large number of real-world examples that can help to reproduce flakiness in generated tests and serve as an evaluation sample for improved versions.

5.2 Developers Using Test Generation
For developers using EvoSuite, we can highly recommend its flakiness suppression options, which we found to be very effective. The remaining flaky tests are mostly caused by Verifying Expected Exceptions and StackOverflowErrors. The former can be mitigated by disabling tiered compilation (via the -XX:-TieredCompilation flag) when executing the tests. This prevents the JVM from compiling frequently executed parts of the bytecode into native code (JIT compilation), which, however, adversely affects the performance and execution time of other tests. The flaky StackOverflowErrors can be mitigated by removing the NonFunctionalRequirementRule in the test scaffolding file, which, however, causes the test to fail consistently. For developers using Pynguin, we recommend setting seeds for random number generators. This should eliminate randomness-related flakiness, which we found to be the most common individual root cause of flaky Pynguin tests. However, there is known criticism of using seeding to avoid flakiness [20]. Developers should also consider order dependencies between generated tests as a possibility: in our evaluation, about 4 % to 6 % of projects were affected by generated order-dependent flaky tests.

5.3 Researchers Studying Flaky Tests
Since generating tests automatically can be done quickly and efficiently, there is potential for using tools such as EvoSuite and Pynguin to support research on flaky tests. For example, test generation tools could be used to create training data for machine learning models or to systematically expose non-determinism in individual target projects. While we found generated flaky tests to have similar root causes to developer-written flaky tests as long as flakiness suppression mechanisms are turned off (Table 1), we also found several differences: First, the root cause distribution differs, as flakiness in generated tests is less likely to be caused by concurrency or networking issues, and more often the result of randomness and unspecified behavior (Table 4). Second, we found that projects containing developer-written flaky tests are not particularly likely to also produce generated flaky tests, and vice versa (Fig. 4).

RELATED WORK
Shamshiri et al. [60] studied the effectiveness of automatically generated test suites in Java projects. They applied three automatic test generation tools, Randoop, EvoSuite, and AgitarOne, to a dataset of over 300 faults across five open-source projects, assessing how many bugs the automatically generated tests could detect. Through this process, they also identified the number of flaky tests that were generated by each of the three tools. Of the tests generated by Randoop, which uses feedback-directed random test generation, an average of 21 % exhibited non-determinism in their outcomes. EvoSuite produced flaky tests at an average rate of 3 %. Only 1 % of the tests generated by the commercial, proprietary tool AgitarOne were flaky. While our study and theirs both demonstrate that automatic test generation tools are capable of producing flaky tests, there are major differences. The main objective of our study was to investigate the prevalence and root causes of flaky tests generated by automatic test generation tools, whereas the main objective of the study performed by Shamshiri et al. was to assess the bug-finding capability of automatically generated tests. As such, we analyzed the prevalence of generated flaky tests in much more detail (see Section 4.1) and went on to categorize their root causes (see Section 4.3). Furthermore, our subject set of 1 902 Java projects and 4 454 Python projects is significantly larger than the five Java projects used by Shamshiri et al. in their empirical evaluation.
Paydar et al. [53] examined the prevalence of flaky tests, and other types of problematic tests, generated specifically by Randoop. They took between 11 and 20 versions of five open-source Java projects and used Randoop to generate regression test suites, which were the main objects of analysis. Overall, they found that 5 % of the automatically generated test classes were flaky, and on average, 54 % of the test cases within each of these were flaky. As before, since Paydar et al. were not solely investigating automatically generated flaky tests, they did not examine them to the same level of detail as in our study (for example, they did not consider root causes). Furthermore, while they are all automatic test generation tools, Randoop is significantly different from EvoSuite and Pynguin in that Randoop is based entirely on random search.
Li et al. [40] applied automatic test generation to the repair of order-dependent flaky tests. Their work builds upon the iFixFlakies tool introduced by Shi et al. [63], which uses the statements of existing "cleaner" tests to remove the state pollution left behind by "polluter" tests that induce order dependency in the "victim" tests that the tool aims to repair. A weakness of iFixFlakies is that if no such cleaner test exists in the test suite, the tool is unable to repair the victim. The technique introduced by Li et al. addresses this weakness by applying automatic test generation to generate cleaners such that the victim can be repaired. Beyond the intersection of flaky tests and automatic test generation, there are no significant similarities between their study and ours.
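To illustrate the polluter/cleaner/victim terminology, consider the following minimal, hypothetical JUnit 4 example:

import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class OrderDependencyExample {

    static int counter = 0; // shared state leaking across tests

    // Polluter: leaves a side effect behind for later tests.
    @Test
    public void polluter() {
        counter = 1;
    }

    // Cleaner: resets the shared state, undoing the pollution.
    @Test
    public void cleaner() {
        counter = 0;
    }

    // Victim: passes when run first or after cleaner,
    // but fails when run directly after polluter.
    @Test
    public void victim() {
        assertEquals(0, counter);
    }
}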
There have been several previous studies in which the authors manually classified the root causes of flaky tests. Luo et al. [44] categorized the causes of the flaky tests repaired by developers in 201 commits across 51 open-source projects; we summarize their categories, together with later extensions, in Section 2.1.

Figure 3: Branch coverage of generated tests

Figure 10: Flakiness due to Verifying Expected Exceptions (project named-data-jndn)

Table 1: EvoSuite parameters changed to deactivate the flakiness suppression mechanisms (EvoSuite FSOff)
Figure 1: Overview of our study setup: A. Project Sampling, B. Test Generation, C. Test Execution, D. Test Outcome Analysis (RQ1, RQ2), E. Root Cause Analysis (RQ3), with steps (i) Alignment and (ii) Manual Classification

Table 2: Size of our dataset

Table 3: Number of flaky tests found in developer-written and automatically generated tests

Table 4: Root causes for NOD flaky tests. Cells: number of tests (number of projects)