Artificial Intelligence Applied to Software Testing: A Tertiary Study

Context: Artificial intelligence (AI) methods and models have extensively been applied to support different phases of the software development lifecycle, including software testing (ST). Several secondary studies investigated the interplay between AI and ST but restricted the scope of the research to specific domains or sub-domains within either area. Objective: This research aims to explore the overall contribution of AI to ST, while identifying the most popular applications and potential paths for future research directions. Method: We executed a tertiary study following well-established guidelines for conducting systematic literature mappings in software engineering, to answer nine research questions. Results: We identified and analyzed 20 relevant secondary studies. The analysis was performed by drawing from well-recognized AI and ST taxonomies and mapping the selected studies according to them. The resulting mapping and discussions provide extensive and detailed information on the interplay between AI and ST. Conclusion: The application of AI to support ST is a well-consolidated research topic of growing interest. The mapping resulting from our study can be used by researchers to identify opportunities for future research, and by practitioners looking for evidence-based information on which AI-supported technologies to possibly adopt in their testing processes.


INTRODUCTION
Software testing (ST) and artificial intelligence (AI) are two research areas with a long and ripe history in computing. AI methodologies and techniques have been around for more than 50 years [38] and, in the current century, with the advances in computational resources and the abundance of data, their potential has vastly increased. As a consequence, AI has been applied to fields as diverse as healthcare [39], project management [54], finance [66], law [93], and many more. Both the academic research community and the industry have injected AI paradigms to provide solutions to traditional engineering problems. Similarly, AI has evidently been useful to software engineering (SE) [7,13,47]. ST has always been an intrinsic part of the software development lifecycle [85]. Yet, as software has become more and more pervasive, it has also grown in size and complexity [16], bringing new challenges to software testing practices [64]. Therefore, with AI poised to enhance knowledge work, there is interest in analyzing how it has been used to improve testing practices. Several studies have explored the interplay between AI and ST [51]. Yet, given the breadth and depth of each of these disciplines, high-quality review studies tend to focus their scope on orthogonal selections in each of these areas. For instance, the use of evolutionary algorithms for regression test case prioritization has been investigated in [76], while the application of natural language processing techniques in ST has been analyzed in [35]. Alternatively, unstructured review papers or position papers have proposed how these two fields could merge.
The goal of this work is to uncover evidence on how AI has been applied to support ST, to reveal established hinge points between the two research areas and future trends. In particular, in this study, we focus on dynamic testing that, according to the ISO 29119 standard [34], comprises the activities that are performed to assess whether a software product works as expected when it is executed. To achieve this goal, we conducted a tertiary systematic mapping study. In this work, we adhere to the definitions by Kitchenham et al. [52]. A primary study is an empirical investigation into specific research questions, while a secondary study is a review of primary studies related to specific research questions with the aim of synthesising evidence. Finally, a tertiary study, of which this research is an example, is a review of secondary studies related to the same research questions with the aim of uncovering mappings and/or trends. The need for a tertiary study, particularly a systematic mapping study, on the interplay of AI and ST is motivated by the following considerations:
• although there are already several secondary studies investigating the application of AI to ST, to manage the vastness of the two research areas, most of these studies limit their scope with an orthogonal division of one or both areas;
• there is a wealth of primary studies that makes it unfeasible to approach our research goal with a secondary study, if not by limiting the scope of the research, as the identified secondary studies have done;
• a systematic process makes the work reproducible, provides internal consistency to the results, and focuses the discussion on available evidence in existing secondary studies;
• a systematic mapping study is suited to structure a research area [77], and as such is more suitable than a systematic literature review in our research context because of the size and scope of the bodies of knowledge (AI and ST).
Furthermore, after an initial investigation, we noted that secondary studies that applied systematic literature reviews as their research method were able to do so only by limiting the scope to a sub-domain (for instance, search-based techniques for ST). Therefore, to observe the whole possible interplay between AI and ST, a systematic mapping is the suitable research method for this tertiary study.
As a main contribution, this article provides a broad view of the AI applications in support of ST. An additional novel contribution of this work is a fine-grained mapping showing how specific testing fields have been supported by specific AI sub-domains and methodologies. This mapping can be used by researchers to identify open topics for future research on new applications of AI for ST and by practitioners to make decisions on the most suitable AI-supported technologies that can be introduced in their testing processes. To the best of our knowledge, this is the first tertiary study that attempts to depict a comprehensive picture of how AI is used to support ST, and how the two research domains are woven. The remainder of the article is organized as follows. Section 2 introduces the key concepts and terminology related to the areas of interest of our study. Section 3 describes the protocol we designed to support the process of selecting secondary studies of interest and for extracting evidence from them. Section 4 provides insights about the process execution. Section 5 analyzes extracted data and answers our research questions. Section 6 presents overall considerations on the results of our study and provides a focus on testing activities whose automation has been supported by different AI techniques. Section 7 discusses threats to the validity of our study. Finally, Section 8 concludes the article and provides final remarks.

BACKGROUND
AI and ST are two large and complex research areas for which there are no universally agreed upon taxonomies nor bodies of knowledge. As a way to define the language and vocabulary that has been used throughout the article, we built two taxonomies, one for each research area. The taxonomy shown in Figure 6 reports the AI key concepts that have been used to support ST, whereas the one in Figure 7 refers to the ST key concepts that have been supported by AI. In the following sections, we provide a short description of the two research areas, the domains, and sub-domains of each taxonomy along with a short definition of related key concepts that are relevant to our study. For each key concept, we also provide a proper literature reference from which it is possible to access more detailed and complete definitions.

Artificial Intelligence
Although there exist many definitions of AI, for the aims of this study, we mention the one given in the European Commission JRC report on AI [59]: "AI is a generic term that refers to any machine or algorithm that is capable of observing its environment, learning, and based on the knowledge and experience gained, taking intelligent action or proposing decisions. There are many technologies that fall under this broad AI definition. At the moment, ML techniques are the most widely used." This definition was adopted by the AI Watch in [88] as the starting point for the specification of an operational definition and a taxonomy of AI aimed at supporting the mapping of the AI landscape and at detecting AI applications in a wide range of technological contexts. The taxonomy provided by the AI Watch report includes five core scientific domains, namely, Reasoning, Planning, Learning, Communication, and Perception, and three transversal domains, namely, Integration and Interaction, Services, and Ethics and Philosophy. The overall taxonomy is depicted in Figure 6, where: (i) white boxes represent domains and key concepts drawn from the AI Watch report [88], while (ii) gray boxes are additional key concepts extracted, during the mapping process, from the analyzed secondary studies.

Reasoning.
The AI domain studying methodologies to transform data into knowledge and infer facts from them. This domain includes three sub-domains: knowledge representation, automated reasoning, and common sense reasoning. Knowledge representation is the area of AI addressing the problem of representing, maintaining, and manipulating knowledge [56]. Automated reasoning is concerned with the study of using algorithms that allow machines to reason automatically [14]. Finally, as described in [27], common sense reasoning is the field of science studying the human-like ability to make presumptions about the type and essence of ordinary situations. Key concepts related to our study and belonging to this domain are: (i) fuzzy logic, a form of logic in which the truth value of variables may be any real number (between 0 and 1) [72], (ii) knowledge representation and reasoning, the use of symbolic rules to represent and infer knowledge [56], (iii) ontologies, forms of knowledge representation facilitating knowledge sharing and reuse [31], and (iv) semantic web, an extension of the World Wide Web through standards set by the World Wide Web Consortium, which "...enables people to create data stores on the Web, build vocabularies, and write rules for handling data" [75].

Planning.
The AI domain whose main purpose concerns the design and execution of strategies to carry out an activity, typically performed by intelligent agents, autonomous robots, and unmanned vehicles. In this domain, strategies are identified by complex solutions that must be discovered and optimized in a multidimensional space. This domain includes three highly related sub-domains dealing with the problem of optimizing the search for solutions to planning and scheduling problems, namely, planning and scheduling, searching, and optimization. Key concepts related to our study and belonging to this domain are: (i) constraint satisfaction, the process of finding a solution to a set of constraints on a set of variables [99], (ii) evolutionary algorithms, a subset of metaheuristic optimization algorithms based on mechanisms inspired by biological evolution, such as reproduction, mutation, recombination, and selection [6], (iii) genetic algorithms, a branch of evolutionary algorithms inspired by the process of natural selection relying on biologically inspired operators such as mutation, crossover, and selection [69], (iv) graph plan algorithms, a family of planning algorithms based on the expansion of compact structures known as planning graphs [17], (v) hyper-heuristics, the field dealing with the problem of automating the design of heuristic methods to solve hard computational search problems [21], and (vi) metaheuristic optimization, the research field dealing with optimization problems using metaheuristic algorithms [19].
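To make the search-based flavor of this domain concrete, the following sketch shows a minimal genetic algorithm evolving integer test inputs toward a boundary value. It is a hypothetical illustration only: the fitness function, the boundary value 100, and all parameters are invented for the example and are not drawn from the surveyed studies.

```python
import random

def fitness(x):
    # Hypothetical objective: reward inputs close to the branch
    # boundary `x == 100` (boundary-seeking test data generation).
    return -abs(x - 100)

def evolve(pop_size=20, generations=50, seed=0):
    rng = random.Random(seed)
    population = [rng.randint(0, 1000) for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the fitter half of the population.
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        # Crossover (averaging two parents) plus a small random mutation.
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            children.append((a + b) // 2 + rng.randint(-5, 5))
        population = parents + children
    return max(population, key=fitness)

best = evolve()
print(best)  # an input at or near the boundary value 100
```

Real search-based testing tools replace this toy fitness with coverage- or fault-based objectives, but the select/crossover/mutate loop is the same.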

Learning.
The AI domain dealing with the ability of systems to automatically learn, decide, predict, adapt and react to changes, and improve from experience, without being explicitly programmed. The corresponding branch of the resulting taxonomy is mainly constructed with machine learning (ML)-related concepts. Key concepts related to our study and belonging to this domain are: (i) artificial neural networks, a family of supervised algorithms inspired by the biological neural networks that constitute animal brains [41]; the training of a neural network consists in observing the data regarding the inputs and the expected output, and in forming probability-weighted associations between the two, which are stored within the data structure of the network itself, designed as a sequence of layers of connected perceptrons [86], (ii) boosting, an ensemble meta-algorithm for the reduction of the bias and variance components of error [20], (iii) classification, a supervised task where a model is trained on a population of instances labeled with a discrete set of labels and the outcome is a set of predicted labels for a given collection of unobserved instances [55], (iv) clustering, an unsupervised task where, given a similarity function, objects are grouped into clusters so that objects in the same cluster are more similar to each other than to objects in other clusters [105], (v) convolutional neural networks, a specialized type of neural networks that use convolution in place of general matrix multiplication in at least one of their layers [37], (vi) decision trees, a family of classification and regression algorithms that learn hierarchical structures of simple decision rules from data and whose resulting models can be depicted as trees where nodes represent decision rules and leaf nodes are the outcomes [70], (vii) ensemble methods, algorithms leveraging a set of individually trained classifiers (such as decision trees) whose predictions are combined to produce more accurate predictions than any of the single classifiers [73], (viii) probabilistic models, a family of classifiers that are able to predict, given an observation of an input, a probability distribution over a set of classes [40], (ix) recurrent neural networks, neural networks with recurrent connections, which can be used to map input sequences to output sequences [15], (x) reinforcement learning, one of the fundamental machine learning paradigms, where algorithms address the "problem faced by an agent that must learn behavior through trial-and-error interactions with a dynamic environment" [46], (xi) regression, a set of mathematical methods that allow data scientists to predict a continuous outcome based on the value of one or more predictor variables [106], (xii) supervised learning, a machine learning paradigm for problems where the available data consists of labelled examples [87], (xiii) support vector machines, supervised learning algorithms where input features are non-linearly mapped to a very high-dimension feature space and a linear decision surface is constructed to generate classification and regression analysis models [24], and (xiv) unsupervised learning, one of the fundamental machine learning paradigms, where algorithms try to learn patterns from unlabelled data [87].
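As a minimal illustration of the classification task defined above, the sketch below trains a 1-nearest-neighbour classifier that labels a module as fault-prone from two code metrics. The training data, metric choice, and labels are entirely invented for the example; a stand-in for the supervised learners the surveyed studies apply to much richer feature sets.

```python
import math

# Hypothetical training instances: (lines of code, cyclomatic complexity) -> label.
train = [
    ((120, 3), "clean"),
    ((90, 2), "clean"),
    ((800, 25), "faulty"),
    ((650, 18), "faulty"),
]

def predict(metrics):
    # Classify by the label of the closest training instance
    # (Euclidean distance in metric space).
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])
    return min(train, key=lambda item: dist(item[0], metrics))[1]

print(predict((700, 20)))  # "faulty"
print(predict((100, 2)))   # "clean"
```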

Communication.
The AI domain referring to the abilities of identifying, processing, understanding, and generating information from written and spoken human communications. This domain is mainly covered by natural language processing (NLP) [45,62]. Key concepts related to our study and belonging to this domain are: (i) information extraction, the automatic extraction of structured information, such as entities, relationships, and attributes describing entities, from unstructured sources [90], (ii) information retrieval, which deals with the problem of "finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers)" [61], (iii) natural language generation, which refers to "the process of constructing natural language outputs from non-linguistic inputs" [80], (iv) natural language understanding, which refers to "computer understanding of human language, which includes spoken as well as typed communication" [103], (v) text mining, the semi-automated process of extracting knowledge from a large number of unstructured texts [29], and (vi) word embedding, "a word representation involving the mathematical embedding from a space with many dimensions per word to a continuous vector space with a much lower dimension" [45].
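The word embedding idea can be sketched in a few lines: related words sit closer in the vector space under cosine similarity, which is how, for instance, requirement sentences can be matched against test case descriptions. The three-dimensional vectors below are invented, not trained embeddings.

```python
import math

# Toy, hand-crafted "embeddings" (real ones are trained and much larger).
vectors = {
    "bug":    [0.9, 0.1, 0.0],
    "defect": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

def cosine(u, v):
    # Cosine similarity: dot product normalized by vector magnitudes.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Semantically related words score higher than unrelated ones.
print(cosine(vectors["bug"], vectors["defect"]) >
      cosine(vectors["bug"], vectors["banana"]))  # True
```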

Perception.
Refers to the ability of a system to become aware of the environment through the senses of vision and hearing. Although this is a broad domain of AI with many AI applications, the only key concept coming from this domain (particularly, from the sub-domain of computer vision) related to our study is image processing, which is the field dealing with the use of machines to process digital images through algorithms [78].

Integration and Interaction.
A transversal AI domain comprising, among others, the multiagent systems sub-domain. It can be described as the domain that addresses the combination of perception, reasoning, action, learning, and interaction with the environment, as well as characteristics such as distribution, coordination, cooperation, autonomy, interaction, and integration. Key concepts related to our study and belonging to this domain are: (i) intelligent agent, an entity equipped with sensors and actuators that exhibits some form of intelligence in its action and thought [87], (ii) q-learning, a reinforcement learning algorithm (see Section 2.1.3), which provides agents with the capability of learning to act optimally in Markovian domains by experiencing the consequences of actions, without requiring them to build maps of the domains [102], and (iii) swarm intelligence, which refers to the algorithms typically based on a population of simple agents interacting locally with one another and with their environment [42].
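A minimal Q-learning sketch can make the "learning from consequences of actions" idea concrete. The hypothetical environment below is a five-state chain where only the rightmost state yields a reward, loosely analogous to an agent learning the action sequence that drives an application into a target state; the environment, rewards, and hyperparameters are invented for the example.

```python
import random

N_STATES = 5          # states 0..4; state 4 is terminal and rewarding
ACTIONS = [0, 1]      # 0 = move left, 1 = move right

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=1):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # Q-value table
    for _ in range(episodes):
        s = 0
        while s != N_STATES - 1:
            # Epsilon-greedy action selection.
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda act: q[s][act])
            s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
            r = 1.0 if s2 == N_STATES - 1 else 0.0
            # Standard Q-learning update rule.
            q[s][a] += alpha * (r + gamma * max(q[s2]) - q[s][a])
            s = s2
    return q

q = train()
policy = [max(ACTIONS, key=lambda act: q[s][act]) for s in range(N_STATES - 1)]
print(policy)  # the learned greedy policy: move right in every state
```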
For completeness, we remark that the reference AI Watch taxonomy includes two additional transversal domains, unrelated to this study, namely, Services and Ethics and Philosophy. The first domain includes all forms of infrastructure, software, and platforms provided as services or applications, while the second is related to important issues regarding the impact of AI technologies on our society.

Software Testing
ST is defined by the 29119-1-2013 ISO/IEC/IEEE International Standard as a process made by a set of interrelated or interacting activities aimed at providing two types of confirmations: verification and validation [34]. Verification is a confirmation that specified requirements have been fulfilled in a given software product (a.k.a. work item or test item), whereas validation demonstrates that the work item can be adopted by the users for their specific tasks. The main objective of ST is to assess the absence of faults, errors, or failures in the test items. Among the great number of taxonomies proposed in the literature for describing the different and heterogeneous aspects of the ST research area, in this work, we refer to the unified view proposed by the Software Engineering Body of Knowledge (SWEBOK) [18]. The SWEBOK is a guide to the broad scope of software engineering. Its core is a tested and proven knowledge base that has been developed and continues to be updated frequently, through practices that have been documented, reviewed, and discussed by the software engineering community. More precisely, in this article, we refer to dynamic testing, which comprises the activities that are performed to assess whether a software product works as expected when it is executed [34]. The ST taxonomy is shown in Figure 7, where the white boxes represent ST domains and key concepts drawn from the SWEBOK [18], while the gray boxes are additional key concepts extracted from the analyzed secondary studies during the mapping process and missing in the SWEBOK. In the remainder of this section, we provide a short description for each domain and key concept of the taxonomy. Moreover, we indicate one or more references for the key concepts that are not described in the SWEBOK.

Test Target.
The ST domain that defines the possible objects of the testing. The target can vary from a single module to an integration of such modules (related by purpose, use, behavior, or structure) and an entire system. In this domain, we recognized three relevant fields: (i) Unit Testing, which verifies the correct behavior, in isolation, of software elements that are separately testable; (ii) Integration Testing, which is intended to verify the correct interactions among software components; and (iii) System Testing, which is concerned with checking the expected behavior of an entire system.

Testing Objective.
The ST domain defining the purpose of a testing process. Test cases can be designed to check that the functional specifications are correctly implemented. This objective is also defined in the literature as conformance testing, correctness testing, functional testing, or feature testing [11]. In Non-functional Testing, instead, several other non-functional properties may be verified, including reliability, usability, safety, and security, among many other quality characteristics such as compatibility [30] and quality of service (QoS) [1]. Other possible testing objectives are the following ones: (i) Acceptance Testing, which determines whether a system satisfies its acceptance criteria, usually by checking desired system behaviors against the customer's requirements; (ii) Regression Testing, which, according to [33], is "...selective retesting of a system or component to verify that modifications have not caused unintended effects and that the system or component still complies with its specified requirements..."; (iii) Stress Testing, which exercises the software at the maximum design load, with the goal of determining the behavioral limits and of testing defense mechanisms in critical systems; (iv) Structural Testing, whose target is to cover the internal structure of the system source code or model [89]; and (v) GUI Testing, which focuses on detecting faults related to the Graphical User Interface (GUI) and its code [89].

Testing Technique.
The ST domain dealing with the detection of as many failures as possible. Testing techniques have the main goal of identifying inputs that produce representative program behaviors and assessing whether these behaviors are expected or not in comparison to specific oracles. [35]; (vii) Concolic Testing employs the symbolic execution of a program paired with its actual execution [11]; (viii) Metamorphic-based Testing, which uses metamorphic relationships for the definition of test oracles [32]; (ix) Concurrency Testing, where tests are generated for verifying the behavior of concurrent systems [2]; and (x) Statistical Testing, where the test cases are generated starting from statistical models such as Markov chains [12].
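The metamorphic testing idea can be illustrated with the classic textbook relation for the sine function (this example is a standard illustration, not drawn from a specific surveyed study): even without knowing the expected value of sin(x), the relation sin(x) = sin(π − x) can serve as the oracle across many random inputs.

```python
import math
import random

def metamorphic_sin_check(trials=100, seed=7):
    # Check the metamorphic relation sin(x) == sin(pi - x) on random
    # inputs, using the relation itself as the test oracle.
    rng = random.Random(seed)
    for _ in range(trials):
        x = rng.uniform(-10, 10)
        if not math.isclose(math.sin(x), math.sin(math.pi - x), abs_tol=1e-9):
            return False
    return True

print(metamorphic_sin_check())  # True
```

The same pattern generalizes to systems without a computable expected output, which is precisely the oracle problem that metamorphic-based testing targets.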

Testing Activity.
The ST domain that outlines the activities that can be performed by testers and testing teams in well-defined controlled processes. Such activities vary from test planning to test output evaluation, in such a way as to provide assurance that the test objectives are met in a cost-effective way. Well-known testing activities presented in the literature are the following. (i) Test Case Generation, whose goal is to generate executable test cases based on the level of testing to be performed and the particular testing techniques. (ii) Test Planning is a fundamental activity of the ST process; it includes the coordination of personnel, availability of test facilities and equipment, creation and maintenance of all test-related documentation, and planning for the execution of other testing activities. (iii) Test Logs Reporting is used to identify when a test was conducted, who performed the test, what software configuration was used, and other relevant information to identify or reproduce unexpected or incorrect test results. (iv) Defect Tracking is the activity where defects can be tracked and analyzed to determine when they were introduced into the software, why they were created (for example, poorly defined requirements, incorrect variable declaration, memory leak, programming syntax error), and when they could have been first observed in the software. (v) Test Results Evaluation is performed to determine whether the testing has been successfully executed. In most cases, "successful" means that the software performed as expected and did not have any major unexpected outcomes. Not all unexpected outcomes are necessarily faults; some are determined to be simply noise. Before a fault can be removed, an analysis and debugging effort is needed to isolate, identify, and describe it. (vi) Test Execution represents both the execution of test cases and the recording of the results of those test runs. Execution of tests should embody a basic principle of scientific experimentation: everything done during testing should be performed and documented clearly enough that another person could replicate the results. (vii) Test Environment Development regards the implementation of the environment that is used for testing. It should be guaranteed that the environment is compatible with the other software engineering tools adopted during the testing process. It should facilitate the development and control of test cases, as well as the logging and recovery of expected results, scripts, and other testing materials. (viii) Test Oracle Definition is the activity performed either to automatically generate test oracles or to support their creation [32]. (ix) Test Case Design and Specification is executed to design or to specify the test cases. This activity usually starts from the analysis of the requirements of the system under test [30,35]. (x) Test Case Optimization/Prioritization/Selection is performed for the optimized reduction, prioritization, or selection of the test cases to be executed [43]. (xi) Test Data Definition, a.k.a. test data generation, is the activity where the data for test cases are produced [98]. (xii) Test Repair is, in essence, a maintenance activity: within the course of this activity, test scripts are adjusted to changed conditions. The need for it lies in the fact that test scripts are fragile and vulnerable to the changes introduced by developers in a newer version of the tested software [98]. (xiii) Flaky Test Prediction is the activity where tests exhibiting flaky characteristics are identified and repaired; this activity significantly improves the overall stability and reliability of the tests [98]. A test is considered flaky when it reports a false positive or false negative result, or when an adjustment was made to the test scripts and/or to the code of the system under test. (xiv) Test Costs Estimation has the main goal of predicting the testing costs, mainly the testing time [30].
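To make the test case prioritization activity concrete, the sketch below implements the common greedy "additional coverage" ordering, which repeatedly picks the test covering the most not-yet-covered statements. The test names and coverage sets are invented for the example.

```python
# Hypothetical mapping: test name -> set of covered statement ids.
coverage = {
    "t1": {1, 2, 3},
    "t2": {3, 4},
    "t3": {5, 6, 7, 8},
    "t4": {1, 5},
}

def prioritize(coverage):
    remaining = dict(coverage)
    covered, order = set(), []
    while remaining:
        # Greedily pick the test adding the most new statements
        # (ties broken alphabetically for determinism).
        best = max(sorted(remaining), key=lambda t: len(remaining[t] - covered))
        order.append(best)
        covered |= remaining.pop(best)
    return order

print(prioritize(coverage))  # ['t3', 't1', 't2', 't4']
```

Search-based and ML-based prioritizers studied in the surveyed literature replace this greedy heuristic with learned or evolved orderings, but the goal, reaching coverage (or faults) earlier in the run, is the same.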

Software Testing Fundamentals.
The ST domain that covers the Testing Related Terminology: basic definitions, basic terminology and key issues, and the relationships between software testing and other software engineering activities.

TERTIARY SYSTEMATIC MAPPING PROTOCOL
In this section, we describe the research protocol adopted to conduct our tertiary systematic mapping study. The protocol was designed following the guidelines proposed by Petersen et al. [77] to fulfill the requirements of a structured process, whose execution details are provided in Section 4. Specifically, the protocol includes the following steps: (i) definition of goals and research questions, (ii) definition of the search string, (iii) selection of electronic databases, (iv) definition of inclusion and exclusion criteria, (v) definition of quality assessment criteria, and (vi) design of the data extraction form. We describe in detail each of these steps and their outcomes in the rest of this section.

Goal and Research Questions
The goal of our study is to understand how AI has been applied to support ST. To reach this goal, we defined nine research questions (RQs) grouped into two categories, publication space and research space questions (as suggested by [77]). Publication space questions aim at characterizing the bibliographic information (i.e., venue, year of publication, authors' affiliation, etc.) of the identified sources (i.e., secondary studies). Research space questions aim at providing the answers needed to achieve the research goal.
Publication Space (PS) RQs. We defined the following five publication space research questions:
PS-RQ1: How many secondary studies have been identified per publication year?
PS-RQ2: Which types of secondary studies have been executed?
PS-RQ3: What are the venues where the secondary studies have been published?
PS-RQ4: What are the authors' affiliation countries of the selected secondary studies?
PS-RQ5: What is the amount of primary studies analyzed by the selected secondary studies and how are they distributed over time?
Research Space (RS) RQs. We defined the following four research space research questions:
RS-RQ1: What AI domains have been applied to support ST?
RS-RQ2: What domains of ST have been supported by AI?
RS-RQ3: Which ST domains have been supported by which AI domains, and how?
RS-RQ4: What are the future research directions of AI in ST?

Search String Definition
To systematically define the search string to be used for finding secondary studies of interest, we adopted the PICOC (Population, Intervention, Comparison, Outcome, and Context) criteria, as suggested in Petersen et al. [77]. The main term of each of the PICOC viewpoints is described in the following:
• Population: We identified Software Testing as the main term of this viewpoint, since it is the domain of interest of our study.
• Intervention: We identified Artificial Intelligence as the main term of this viewpoint, since our research questions are aimed at investigating how this science has been applied to the population.
• Comparison: This viewpoint is not applicable in a systematic mapping study, since no effect of the intervention on the population can be expected.
• Outcome: This viewpoint is not applicable in a systematic mapping study, for the same reason.
• Context: We identified Secondary Study as the main term of this viewpoint, since it is the context where we expect to find sources.
To identify the keywords of the search string, we followed the general approach suggested by Kitchenham and Charters [52]. Hence, we broke our research questions (see Section 3.1) down into individual facets (one for each PICOC viewpoint). Then, we generated a list of synonyms, abbreviations, and alternative spellings. Additional terms were obtained by considering subject headings used in journals and scientific databases. The main terms and the synonyms we inferred for the PICOC viewpoints are shown in Table 1. Finally, the search string was obtained by the conjunction (AND) of disjunction (OR) predicates, each built on the main term and the corresponding synonyms of a PICOC viewpoint. Moreover, as suggested by the Kitchenham [52] and Petersen [77] guidelines, we checked our search string against four selected control papers (Garousi et al. [35], Trudova et al. [98], Catal [22], and Durelli et al. [30]). The resulting search string is shown in Box 1.
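The conjunction-of-disjunctions construction can be sketched programmatically. The synonym lists below are abbreviated placeholders invented for the illustration; the actual main terms and synonyms are those of Table 1, and the actual resulting string is the one in Box 1.

```python
# Placeholder facets: each PICOC viewpoint maps to its main term plus synonyms.
facets = {
    "Population":   ["software testing", "software verification"],
    "Intervention": ["artificial intelligence", "machine learning"],
    "Context":      ["secondary study", "systematic review", "mapping study"],
}

def build_query(facets):
    # One OR-disjunction per viewpoint, AND-conjoined across viewpoints.
    clauses = [
        "(" + " OR ".join(f'"{term}"' for term in terms) + ")"
        for terms in facets.values()
    ]
    return " AND ".join(clauses)

print(build_query(facets))
```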

Digital Libraries Selection
To retrieve candidate studies, we selected four of the most well-known digital libraries usually adopted for conducting literature review and mapping studies [4]. The digital libraries adopted in this study are: ACM Digital Library, IEEE Xplore, Web of Science, and Scopus. We adapted the search string to the syntax required by each digital library search engine; hence, we built four queries that apply our search string to the title and abstract attributes. Additionally, for Scopus and Web of Science, we limited the results to the computer science and computer engineering categories. Since the ACM Digital Library and IEEE Xplore already gather publications within computer science and computer engineering, no restrictions have been applied in the corresponding queries.

Inclusion and Exclusion Criteria Definition
To support the selection of retrieved secondary studies, we defined exclusion and inclusion criteria. When defining these criteria, we acknowledged our complementary skills in AI and ST. Therefore, as we will mention in Section 4, we defined these criteria with the explicit intention that their application would be carried out by classifiers with the skills required to properly apply the criteria within the expertise of each of our fields.

Exclusion Criteria (EC).
We excluded a publication if at least one of the following six ECs applies: (EC1) The study focuses on the testing of AI-based software systems rather than the application of AI to ST.
We remark that, to apply EC2, we paid special attention to confirming that the source shares our focus on dynamic testing. In particular, we aimed to exclude studies (systematic mappings or reviews) that are not focused on the design and execution of test cases.

Inclusion Criteria (IC).
We included a publication if and only if all of the following four ICs apply: (IC1) The study is a secondary study. (IC2) The study addresses the topic of AI applied to ST. (IC3) The study is a peer-reviewed paper. (IC4) The study is written in English.

Quality Assessment Criteria Definition (QC)
To filter out low-quality publications, we scored each candidate paper according to a list of six quality assessment criteria inspired by Kitchenham et al. [53]. Table 2 reports these QCs along with the rationale we adopted to assign a score ∈ {0.0, 0.5, 1.0} to each paper. We evaluated the overall quality of a candidate by summing up the six QC scores and excluded papers with an overall score lower than 3.0.
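The scoring rule above can be sketched as a small filter; the example scores are hypothetical, as the actual per-criterion rationales are those of Table 2.

```python
# Sketch of the quality-assessment filter: each paper gets a score in
# {0.0, 0.5, 1.0} for each of six QCs; papers whose total is below 3.0
# are excluded. Example score vectors are illustrative only.
ALLOWED_SCORES = {0.0, 0.5, 1.0}
THRESHOLD = 3.0

def passes_quality_assessment(qc_scores):
    """Return True if the six QC scores sum to at least the 3.0 threshold."""
    assert len(qc_scores) == 6 and all(s in ALLOWED_SCORES for s in qc_scores)
    return sum(qc_scores) >= THRESHOLD

# A paper scoring 1.0 on two criteria, 0.5 on two, and 0.0 on two reaches
# exactly 3.0 and is therefore kept.
print(passes_quality_assessment([1.0, 1.0, 0.5, 0.5, 0.0, 0.0]))  # True
```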

Data Extraction Form Design
To support the data extraction process, we designed the data extraction form reported in Table 3. This form was used to record the pieces of evidence, extracted from the selected papers, that were analyzed to answer the RQs. The form includes a list of fields organized in two sections, one dedicated to the publication space RQs and the other to the research space RQs. For each field, we provide a name, a brief description of the data that the field is meant to collect, and the RQ for which the field is used.

TERTIARY SYSTEMATIC MAPPING EXECUTION
In this section, we describe the execution of the tertiary systematic mapping study we conducted with the protocol that we introduced in Section 3. Specifically, in Section 4.1, we provide details about the selection process while, in Section 4.2, we provide details about the data extraction process.

Selection Process Execution
The process followed to select secondary studies is shown in Figure 1. The figure provides a representation of the executed steps and their outcomes. The selection process is based on the execution of the two stages described in the following. The full selection process was executed in June 2021 and repeated in May 2022 to ensure that we did not miss any recent secondary study on the investigated topic.

First Stage.
This stage was executed to select a preliminary set of secondary studies and relies on the sequential execution of the following four steps: (1) Secondary studies retrieval from the digital libraries: In this step, the queries (introduced in Section 3.3) were submitted to the four digital libraries reported in Section 3.3. As a result, 877 secondary studies were retrieved. (2) EC and IC application to title, abstract, and keywords: The 877 papers were divided into two groups. The title, abstract, and keywords of the secondary studies in each group were analyzed by two researchers, one AI expert and one ST specialist, to apply the IC and EC presented in Section 3.4. At the end of this step, 806 studies were excluded, since both researchers agreed to remove them. The remaining 71 papers were included and passed to the next step. (3) EC and IC application to full text: The 71 papers were divided into two groups and the full text of each paper was read by two researchers, an AI expert and an ST specialist, to apply the IC and EC again. In the end, 29 studies were excluded, 32 were included, and 10 papers were labeled as "doubt." All doubts came from studies for which no agreement was reached. (4) Dealing with secondary studies classified as "doubt": Two additional researchers, one AI expert and one ST specialist, were involved to reach an agreement on the "doubt" papers. To this aim, all four researchers held an additional discussion based on a full read and analysis of the papers. At the end of the discussion, 4 studies were excluded and the remaining 6 were selected. As a consequence, the final set of selected papers included 38 studies.

Second Stage.
This stage refers to a snowballing process [104] that was conducted as a complementary search strategy to mitigate the threat of missing literature. As shown in Figure 1, the stage relies on the execution of six sequential steps, three of which (i.e., steps 2, 3, and 4) follow the same procedure described for the first stage.
(1) Secondary studies retrieval by backward and forward snowballing: In this automatic step, 296 secondary studies were retrieved by applying backward and forward snowballing to the 63 papers selected in the first selection (title, abstract, and keywords) step of the first stage.
Specifically, for the backward snowballing, we collected all studies cited by the 63 papers using their references. For the forward snowballing, we used Google Scholar instead of the four digital libraries already exploited in the first stage, as it allowed us to fully automate the retrieval of papers citing one or more of the 63 papers. (2) EC and IC application to title, abstract, and keywords: As a result of this step, 277 secondary studies were excluded by applying the IC and EC. We remark that, to apply EC3, we needed, as input to this step, the 63 papers selected in the EC and IC application to title, abstract, and keywords step of the first stage of the process. As a result, 19 papers were included for full reading in the following selection steps. (3) EC and IC application to full text: In this step, among the 19 secondary studies, 11 were excluded, five were included, and the remaining three papers were labeled as "doubt" and needed extra analysis. (4) Solving "doubt" secondary studies: Among the three papers labeled as "doubt," two were excluded and the other one was included. As a consequence, six secondary studies were obtained by snowballing. (5) Merge: The papers selected in the two stages were merged to define a set of 44 candidate secondary studies. (6) Quality Assessment: In this step, each paper was analyzed by one of the researchers, who scored the source according to the six quality criteria described in Section 3.4. The studies with a quality score lower than 3 were excluded. Borderline papers, i.e., papers with a quality score between 2 and 4, were discussed by all the researchers. As a result of the quality assessment (see Table 4), 20 secondary studies were included in the final set of selected papers.
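As a sanity check, the funnel described across the two stages can be reconstructed from the counts in the text; the step labels below are ours, the numbers are those reported above.

```python
# Cross-checking the two-stage selection funnel against the counts in the text.
first_stage = [
    ("retrieved from digital libraries", 877),
    ("after EC/IC on title, abstract, keywords", 877 - 806),  # 71 remain
    ("after EC/IC on full text (incl. 10 doubts)", 32 + 10),  # 42 remain
    ("after resolving doubts", 32 + 6),                       # 38 selected
]
# Second stage (snowballing): 5 full-text inclusions plus 1 resolved doubt.
snowballing_included = 5 + 1
candidates = first_stage[-1][1] + snowballing_included        # merge: 44
selected = 20                                                 # after quality assessment

for step, count in first_stage:
    print(f"{step}: {count}")
print(f"candidates after merge: {candidates}; finally selected: {selected}")
```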

Data Extraction Execution
Since our RQs (see Section 3.1) cross two different research areas, we divided the authors into two groups, each containing an AI expert and an ST specialist. The two members of each group collaborated in extracting the pieces of evidence from each of the 20 selected studies using the extraction form shown in Table 3. Finally, to reach broad consensus, the two groups shared the extracted pieces of evidence and discussed the differences.

DATA ANALYSIS
In this section, we describe the results of the analysis performed on the extracted data (see Section 4.2) to answer our RQs (see Section 3.1). Specifically, in Section 5.1, we provide the answers to our PS-RQs, while, in Section 5.2, we answer our RS-RQs.

Publication Space Research Questions-Results
In the following sections, we answer the five PS-RQs of this study.

PS-RQ1. How many secondary studies have been identified per publication year?
To answer the first publication space question, Figure 2 reports the publication year of each selected secondary study. Most of the selected studies were published in recent years (2017 to 2022), thus showing an increasing interest in the research community in conducting secondary studies about the application of AI in ST.

PS-RQ2. Which types of secondary studies have been executed?
The second PS-RQ concerns the types of the selected secondary studies. For each study, we report in Figure 2 the corresponding type, i.e., either Systematic Literature Review (SLR), in light green, or Systematic Literature Mapping (SLM), in blue. We followed the guidelines defined by Kitchenham and Charters [52] to verify whether a secondary study was correctly classified as an SLR or an SLM by its authors, and changed the classification when needed. For instance, F14 and F15 were originally classified as SLRs by their authors. After a careful analysis, we opted to classify them as SLMs, since their authors: (i) did not perform a quality assessment of the selected primary studies; and (ii) summarized the selected works without executing a meta-analysis. From the data presented in Figure 2, we can observe that our selection includes 10 (50%) SLRs and 10 (50%) SLMs.

PS-RQ3. What are the venues where the secondary studies have been published?
The third PS-RQ aims to analyze the venues where the selected secondary studies have been published. Table 5 reports the type, name, and rank (Scimago Journal & Country Rank (SJR) quartile for journal papers and Computing Research and Education (CORE) rank for conference papers) of the venues of the 20 selected secondary studies. The table shows that 14 (70%) studies were published in journals and the remaining 6 studies (30%) appeared in the proceedings of conferences, workshops, symposiums, or seminars. It is worth observing that 13 of the 14 journal papers were published in top-ranked venues (according to the SJR quartile in which the venue is classified), with 6 of them published in the Information and Software Technology journal. Thus, from our selection, we can derive that the topic of AI applied to ST is largely covered by top-ranked journals.

PS-RQ4. What are the authors' affiliation countries of the selected secondary studies?
The fourth PS-RQ analyzes the countries of affiliation of the authors of the selected studies. Among the 20 selected studies, we found 68 different authors, with 5 authors (including Érica Ferreira de Souza, Juliana Marino Balera, Lionel C. Briand, and Nandamudi Lankalapalli Vijaykumar) involved in two studies each. We analyzed the countries of affiliation of these 68 authors, resulting in 16 unique affiliation countries. Since three authors reported a second affiliation country and five authors were involved in two different studies, we counted a total of 76 affiliations. Most of the selected studies have authors with affiliations from only one country, except studies F1 [11] (Brazil and UK), F2 [35] (Austria and Northern Ireland), and F3 [96] (Argentina and Uruguay), which each included authors with affiliations from two different countries. Figure 3 shows a world map of the authors' affiliation countries, with each color representing a different value for the number of affiliations. In particular, 27 (35.53%) affiliations are counted for Brazil, 10 (13.16%) for Malaysia, 6 (7.89%) for Sweden, 4 (5.26%) each for Argentina, Canada, Norway, and Pakistan, 3 (3.95%) each for the Czech Republic, India, and Iran, 2 (2.63%) each for Austria, the United Kingdom, and Uruguay, and 1 (1.32%) each for Luxembourg and Turkey. From the extracted data, we can observe that most of the affiliations (33 of 76) are located in South America (Brazil, Argentina, and Uruguay). Interestingly, affiliation countries that typically dominate in computer science or computer engineering (e.g., USA and China) do not occur in our observations.
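The affiliation tally above can be cross-checked directly from the counts given in the text; the per-country excerpt below reproduces only a few entries of Figure 3.

```python
# Cross-checking the affiliation arithmetic for PS-RQ4 against the text:
# 68 authors, 3 second affiliation countries, 5 authors in two studies.
authors = 68
second_affiliations = 3
dual_study_authors = 5
total_affiliations = authors + second_affiliations + dual_study_authors
print(total_affiliations)  # 76

# Excerpt of the per-country counts from Figure 3 (top three countries only).
per_country = {"Brazil": 27, "Malaysia": 10, "Sweden": 6}
for country, n in per_country.items():
    print(country, round(n / total_affiliations * 100, 2))
# Brazil 35.53, Malaysia 13.16, Sweden 7.89
```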

PS-RQ5. What is the number of primary studies analyzed by the selected secondary studies, and how are they distributed over time?
The goal of this RQ is twofold: (i) to compute the number of primary studies that have been reviewed by the selected secondary studies and (ii) to understand how these primary studies are distributed over the publication years. Figure 4 shows, for each selected secondary study, the number of reviewed primary studies and how many of these studies are unique, i.e., works that have not been reviewed by any other secondary study. The figure shows that the 20 selected secondary studies analyzed a total of 807 primary studies, of which 710 (87.98%) were unique. Figure 5 shows the distribution of the unique primary studies per publication year. The 710 unique primary studies cover a period of 27 years, going from 1995 to 2021; 444 (62.5%) of these studies were published in the past 10 years and 264 (37.18%) from 2015 to 2021. Primary studies that were "unique" in any of the older secondary studies could, in theory, have been found by the newer secondary studies. However, the publication date of a primary study is just one of the factors that may lead it to be included in only one of the secondary studies. In addition, the search protocols for secondary studies also vary greatly in the choice of search strings and inclusion and exclusion criteria, leading to a great diversity of selected primary studies. The large number of unique primary studies reviewed by the selected secondary studies and their distribution over time lead us to two interesting observations. First, we can state that our set of secondary studies is representative of the research conducted on the topic of AI applied to ST. Moreover, as will be confirmed by RS-RQ1 and RS-RQ2 (see Sections 5.2.1 and 5.2.2), the set of unique primary studies reviewed by these works is a significant sample covering broad aspects of AI in ST. Second, we can infer that the topic of AI applied to ST is of interest to the research community and that this interest has grown over the past decade. Finally, it is worth observing that the research topic of AI applied to ST is not new: the first 19 (2.67%) primary studies in this field date from the late 1990s.
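The percentages reported for PS-RQ5 follow directly from the counts in the text and can be reproduced as follows.

```python
# Reproducing the PS-RQ5 percentages from the counts given in the text.
total_reviewed = 807   # primary studies analyzed by the 20 secondary studies
unique = 710           # primary studies reviewed by exactly one secondary study
print(round(unique / total_reviewed * 100, 2))  # 87.98

past_decade = 444      # unique primary studies published in the past 10 years
since_2015 = 264       # unique primary studies published from 2015 to 2021
print(round(past_decade / unique * 100, 1))     # 62.5
print(round(since_2015 / unique * 100, 2))      # 37.18
```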

Research Space Research Questions-Results
In the following sections, we answer the four RS-RQs of our study.

RS-RQ1. What AI domains have been applied to support ST?
To identify the AI domains from which solutions were applied to support ST, we analyzed the list of sentences about the applied AI domains that were extracted from our sources during the data extraction process, with reference to the taxonomy introduced in Section 2.1. As a result of this analysis, in Figure 6, we report, for each AI domain concept, the list of secondary studies in which we found evidence of its application in ST. The most important findings we can derive from Figure 6 are the following: (1) most of the secondary studies (11 of 20) investigated the application to ST of AI solutions (i.e., algorithms, models, methods, techniques, etc.) belonging to the Planning and Scheduling/Searching/Optimization subdomains. Specifically, the most surveyed AI solutions belonging to this domain are evolutionary algorithms, genetic algorithms, and metaheuristic optimisation; (2) 9 of 20 secondary studies focused on the application of Machine learning solutions. In particular, F12 [30] and F5 [98] covered almost all the concepts in this AI domain; (3) 6 of 20 secondary studies surveyed the support provided by Knowledge representation/Automated reasoning/Common sense reasoning AI solutions. Specifically, most of these studies analyzed the use of ontologies, and F4 [25] is the study that surveyed the most concepts in this AI domain; (4) few secondary studies surveyed works on the application of Natural language processing (5 of 20), Multi-agent systems (3 of 20), and Computer vision (1 of 20). It is worth noticing that, within the Natural language processing domain, the most surveyed applications are based on text mining, while only one study surveyed applications of word embeddings. Notably, only one of the secondary studies analyzed the use of image processing techniques belonging to Computer vision.
By analyzing the publication years of the selected studies (see Figure 2), we can observe that most of the works (6 of 9) surveying the application of machine learning to ST have been published very recently (2020 or later). This indicates a growing interest in this AI domain. Similarly, 4 of 5 studies investigating the use of Natural language processing have been published after 2020, highlighting the timeliness of this research field, with a special focus on text mining and word embedding.

Fig. 6. The resulting excerpt taxonomy of AI supporting ST, built starting from the EU AI Watch report. Gray boxes represent key concepts not explicitly included in the AI Watch [88] report and added as a result of the data analysis process. Each concept is annotated with the labels of secondary studies in which it was surveyed. Original domain labels are reported in bold inside sub-domain boxes.
Secondary studies analyzing the use of Planning and Scheduling/Searching/Optimization cover a period spanning from 2009 to 2020, showing a consolidated research topic that is still of interest.

RS-RQ2. What domains of ST have been supported by AI?
Similar to what we did to answer RS-RQ1, we analyzed the sentences collected from each secondary study during the data extraction process and annotated each study with the ST domain concepts involved in them, according to the taxonomy introduced in Section 2.2. The result of this analysis is shown in Figure 7, where, for each ST domain, we report the list of secondary studies in which we found evidence of the application of an AI solution to that domain. From Figure 7, we can observe the following: (1) Almost all selected secondary studies (19 of 20) surveyed studies about the application of AI to the Testing Activity ST domain. In particular, the most recurrent ST concepts of this domain are Test Case Optimization/Prioritization/Selection (11 of 20), Test Data Definition (10 of 20), and Test Case Generation (8 of 20); (2) 12 secondary studies surveyed the use of AI in the Testing Technique domain. In this ST domain, Mutation Testing and Requirement-based Testing are the testing techniques for which the most secondary studies found evidence of AI applications (6 and 4 studies, respectively); (3) 11 secondary studies showed evidence of AI applied in the Testing Objective ST domain. In particular, 5 studies showed the use of AI to support Functional Testing, 5 studies analyzed primary sources where AI was applied to Non-functional Testing, 5 studies showed evidence of the application of AI to GUI Testing, and 4 surveyed the use of AI for Regression Testing; (4) few secondary studies (3 of 20) reported evidence of the use of AI in the Test Target ST domain. Among these 3 secondary studies, 1 covered the use of AI in Unit Testing, 1 the application of AI in Integration Testing, and 2 the AI support in System Testing; (5) the Software Testing Fundamentals ST domain was covered by 3 secondary studies. All these works reported evidence of the use of AI to support the introduction and standardization of terms and definitions in the field of Testing Related Terminology. Overall, the most important finding we can derive from Figure 7 is the evidence of an intense application of AI to: (1) the development of test cases, including their generation and the definition of test cases' input and expected output, i.e., test oracles. To aid the test oracle definition, AI has been applied to metamorphic-based testing and to GUI testing; (2) the management of test cases, particularly their prioritization and selection, which is confirmed by the use of AI for regression testing; (3) the generation of test cases from requirements using natural language processing and knowledge representation techniques; (4) the detection of equivalent mutants and the generation of new mutants in mutation testing techniques; and (5) the testing of both functional and non-functional requirements.

Fig. 7. The resulting excerpt AI-supported ST taxonomy, built starting from the SWEBOK [18]. Gray boxes represent key concepts not explicitly included in the SWEBOK and added as a result of the data analysis process. Each concept is annotated with the labels of secondary studies in which it was surveyed.

RS-RQ3. Which ST domains have been supported by which AI domains and how?
To answer RS-RQ3, we discuss the evidence collected from the selected secondary studies concerning which AI domains have been applied to support which ST domains. The bubble chart in Figure 8 reports the number of secondary studies that investigated the application of a given AI domain to a specific ST domain. From the chart, we can derive the following observations: (1) Testing Activity and Testing Objective are the only two ST domains for which we found evidence of the application of solutions from all the AI domains. Also, with the exception of Software Testing Fundamentals, the AI domains Planning, Communication, Learning, and Knowledge have been applied to all ST domains; (2) Knowledge is the only AI domain for which we found evidence of applications to Software Testing Fundamentals; moreover, it is the only AI domain involved in all the ST areas; (3) the majority of the selected secondary studies (10 of 20) analyzed the application of AI techniques belonging to the Planning domain to support the Testing Activity, making this the most surveyed interplay of AI and ST; (4) the second most surveyed interplay of AI and ST is Learning applied to Testing Activity. Moreover, we found that machine learning is the only key concept belonging to the Learning AI domain that has been exploited in this ST domain; (5) very few secondary studies surveyed the application of the Integration & Interaction and the Perception AI domains to ST. More precisely, Multi-agent systems and Computer vision are the only AI key concepts belonging to these domains for which we had evidence of application in ST; (6) the Software Testing Fundamentals domain characterizes the fundamental software testing terms and definitions. This justifies why Knowledge is the only AI domain for which we found evidence of application to this ST domain. Additionally, to deepen our discussion, we analyzed the pieces of evidence extracted from the selected secondary studies to identify in more detail which AI methodology has been applied to which specific ST fields. The results of this analysis are reported in Tables 6(a) and 6(b). In these tables, rows and columns represent ST and AI concepts, respectively; each cell lists the secondary studies in which we found evidence of the application of a specific AI domain/subdomain (column) to support a specific ST domain/field (row). Looking at the mapping from a bird's-eye view, we can observe that evolutionary algorithms, genetic algorithms, and metaheuristic optimisation have been applied to almost all the ST domains and fields. Furthermore, to deepen the understanding of RS-RQ3, we drilled down into the cells with several sources assigned to them (i.e., 3 or 4). Such cells indicate that the associated AI concepts have been extensively applied to support the related ST fields. For such cells, in the following list, we summarize: (1) the commonalities and differences in the application of the AI techniques identified by the analyzed works; (2) the difficulties and limitations of applying the AI techniques to the ST objective; and (3) practical insights.
(1) Evolutionary algorithms and genetic algorithms were found to be the AI techniques most used to support Mutation Testing, with a prevalence of genetic algorithms over evolutionary algorithms. As an example, F10 [94] reports an "...evolutionary strategy with Gaussian Distribution to identify subdomains from which test cases can be selected with higher mutation score," while F8 [43] states: "Genetic Algorithms was also used to address the problem of equivalent mutants in mutation testing." These two search-based techniques have been applied to Mutation Testing primarily for two purposes: mutant optimization or test case optimization. However, most of the proposed techniques are either presented in a general manner or are not sufficiently empirically evaluated, and cannot serve as a basis for enabling practitioners to choose a specific technique for a given software system. The major challenges include the effort and cost entailed by mutation testing, which limit its application to testing real-world programs. As stated by F1 [11] and F16 [57], very few works explored the application of hyper-heuristics to Mutation Testing, although this technique could bring the advantages of generating stronger mutants and reducing the number of mutants used. As highlighted by F10 [94], meta-heuristic search techniques, and genetic algorithms in particular, have also been effectively applied to the selection of mutant operators, the generation of mutants, and the generation of test data. From a more practical point of view, Genetic Algorithms, Ant Colony, Bacteriological Algorithms, Hill Climbing, and Simulated Annealing have been extensively used in search-based mutation testing. (2) Genetic algorithms have also been widely used for Test Case Generation and Test Data Definition, as can be drawn from F9 [89] ("Researchers apply SBTs for automatic test case generation based on a test objective-adequacy criteria-that is formulated as a fitness function...
") and F10 [94] ("The data are considered a pattern to be executed in the design. In the crossover phase, the Genetic Algorithm selects sub-patterns with overlapped inputs to cross and generate new ones..."). According to F10 [94], experiments conducted in this field showed unsatisfactory results, with the most important challenge being the time necessary to obtain a good solution, in terms of test cases and test data definition, when more than one solution must be found. However, preliminary results indicate that the use of meta-heuristic search techniques for reducing both the costs and efforts of test data generation in mutation testing is promising. In the Test Case Generation field, genetic algorithms and evolutionary algorithms have been widely applied as "global" search-based techniques (SBTs), i.e., for the effective search for globally optimal solutions to overcome the problem of getting stuck in local optima. Subsequently, "local" SBTs are used to efficiently obtain the optimal solution starting from the global ones. In particular, hill climbing and simulated annealing are the most common examples of local SBTs (F9 [89]). From a more practical point of view, genetic algorithms seem to outperform random search in Test Case Generation for structural coverage (F11 [3]). (3) Metaheuristic optimisation has been extensively used for Test Case Optimization/Prioritization/Selection, Functional Testing, Mutation Testing, Test Case Generation, and Test Data Definition. Examples of these applications are reported in F7 [1] ("Therefore, the initial results indicated SA as more effective than other approaches for finding smaller sized test suites"), F9 [89] ("...surveyed the past work and the current state-of-the-art of the applications of SBTs for structural testing, functional testing...
"), F1 [11] ("...how SBST has been explored in the context of Mutation Testing, how objective functions are defined, and the challenges and opportunities of research in the application of meta-heuristics as search techniques"), and F8 [43] ("Test case generation using mutation testing with Ant Colony Optimization," "...a set of test cases are extracted automatically from the textual requirements"). Although several secondary studies showed that metaheuristic-based techniques have been extensively used to provide solutions for automating testing tasks (such as test case selection and test order generation) and for implementing more cost-effective testing processes, some studies, in particular F5 [98] and F11 [3], also highlighted the need for additional empirical experimentation to demonstrate the applicability and usefulness of metaheuristics in more realistic industrial scenarios. (4) Text mining is the most widely used NLP technique for Test Case Generation. As an example, F2 [35] and F5 [98], respectively, report: "...a set of test cases are extracted automatically from the textual requirements," "...NLP techniques have been used to generate automated test cases from initial requirements documents...a tool named SpecNL generates a description of software test cases in natural language from test case specification...
" As pointed out by F2 [35] and F15 [2], the use of NLP-assisted software testing techniques and tools has been found highly beneficial for researchers and practitioners, as they reduce the cost of test case generation and the amount of human resources devoted to test activities. However, for wide industrial usage of NLP-based testing approaches, more work is required to increase their accuracy. Moreover, comparative studies should be performed to highlight the strengths and weaknesses of NLP tools and algorithms. (5) Ontologies have been mainly adopted to support the introduction and standardization of terminologies and definitions in ST. Several examples of such applications are reported in F3 [96], F4 [25], and F14 [28], respectively: "...a software testing ontology is designed to represent the necessary software testing knowledge within the software testers' context..."; "...the proposed ontology defines a shared vocabulary for testing domain which can be used in knowledge management systems to facilitate communication, integration, search, and representation of test knowledge..."; "...the authors presented an ontology, called Test Ontology Model (TOM), to model testing artifacts and relationships between them."
As highlighted by F3 [96], the main benefit of having a suitable software testing ontology is to minimize the heterogeneity, ambiguity, and incompleteness problems in terms, properties, and relationships. Another potential value of using ontologies and, more generally, semantic web technologies in software testing, highlighted by F4 [25], is that they can provide a more powerful mechanism for sharing test assets that are less application-dependent and hence more reusable. By analyzing the terminological coverage of the selected ontologies, in F4 [25] the authors observed that most ontologies cover terms related to dynamic and functional testing. Conversely, only a few ontologies consider terms related to static and non-functional testing. Similarly, the authors of F14 [28] highlighted that most ontologies have limited coverage and none of them is truly a reference ontology or is grounded in a foundational ontology. In conclusion, the software testing community should invest more effort in establishing a unique and well-established reference software testing ontology. (6) Artificial neural networks have been used for several testing activities such as oracle definition, test case generation, test case refinement, and test case evaluation. The following evidence of the application of artificial neural networks for Test Oracle Definition is reported in F12 [30] and F18 [32], respectively: "...ML algorithms generated test verdict, metamorphic relation, and-most commonly-expected output oracles. Almost all studies employ a supervised or semi-supervised approach, trained on labeled system executions or code metadata-including neural networks, support vector machines, adaptive boosting, and decision trees"; "...a trend we observed is that the oracle problem tends to be tackled by employing either ANN or decision tree-based approaches....
" Regarding the Test Oracle Definition activity, F2 [35] observed that test oracles obtained by using artificial neural networks are more efficient, effective, and reusable compared to those generated with existing traditional approaches.Additionally, F12 [30] identified the main advantages of using artificial neural networks and machine learning in their scalability and in the minimal need of human intervention.As for the main problem faced by researchers when trying to apply artificial neural networks and machine learning to solve software testing problems, both F12 [30] and F18 [32] identified the need for a substantial amount and high-quality training data, which is the key for machine learning algorithms to function as intended.( 7) Classification, clustering, and reinforcement learning AI methodologies have been widely adopted for Test Case Optimization/Prioritization/Selection, as highlighted by F19 [74]: "The main ML techniques used for Test case Selection and Prioritization are: supervised learning (ranking models), unsupervised learning (clustering), reinforcement learning, ...Supervised learning includes all ML techniques that rely on classification or ranking models .... " Similarly, F17 [50] states that: "...the publication trend of ML technique applied to Test Case Prioritization...shows that the classification technique category was the most popular followed by clustering then reinforcement learning come as the last preferred." F12 [30] also reports "...approaches that employ reinforcement learning to select and prioritize test cases according to their duration, previous execution and failure history." 
F17 [50] reported that classification is the most used ML technique, as it benefits from the availability of historic data, which results in a high average percentage of faults detected and code coverage effectiveness. F17 [50] also highlighted that reinforcement learning requires a more structured process and improvements before it is mature enough to be included in undergraduate taught programs. Interestingly, F19 [74] highlights that, although supervised learning, unsupervised learning, reinforcement learning, and natural language processing are the four main ML techniques used for test case selection and prioritization, some combinations of them have also been reported in the literature. For example, NLP-based techniques, which are often used for feature preprocessing, were combined with supervised or unsupervised learning to achieve better performance for test case prioritization. F19 [74] highlighted that the lack of standard evaluation procedures and appropriate publicly available datasets resulting from the execution of real-world case studies makes it very challenging to draw reliable conclusions concerning ML-based test case selection and prioritization performance. Thus, getting the research community to converge toward common evaluation procedures, metrics, and benchmarks is vital for building a strong body of knowledge we can rely on, without which advancing the state-of-the-art remains an elusive goal.
As a final consideration, we can highlight that the application of word embedding to Test Case Optimization/Prioritization/Selection has been observed only recently, in 2021 by F19 [74], which reports that "NLP-based techniques are used for processing textual data, such as topic modeling, Doc2Vec, and LSTM. NLP-based techniques can also be mixed with other ML or non-ML techniques." Moreover, word embedding and neural NLP models are becoming more and more pervasive in transdisciplinary studies and applications, and since foundation models are receiving much attention from both academic and industrial researchers, we expect that in the near future NLP will be more extensively applied also to support ST.
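To make the combination F19 describes concrete, NLP feature preprocessing followed by unsupervised ordering can be sketched in a few lines of Python. Everything below is illustrative: the bag-of-words vectors stand in for richer embeddings such as Doc2Vec, the test descriptions are invented, and the greedy farthest-point ordering is a simplified proxy for clustering-based prioritization.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words vector: a minimal stand-in for NLP embeddings."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def prioritize(test_descriptions):
    """Greedy farthest-point ordering: schedule first the tests that are
    most dissimilar to those already scheduled, so diverse behaviors are
    exercised early."""
    vecs = [bow(t) for t in test_descriptions]
    order = [0]  # start from the first test case
    while len(order) < len(vecs):
        rest = [i for i in range(len(vecs)) if i not in order]
        # pick the test least similar to everything already scheduled
        nxt = min(rest, key=lambda i: max(cosine(vecs[i], vecs[j]) for j in order))
        order.append(nxt)
    return order

tests = [
    "login with valid credentials",
    "login with invalid password",
    "export report as pdf",
    "login with empty username",
]
print(prioritize(tests))  # [0, 2, 1, 3]: the dissimilar export test runs second
```

In a real pipeline, the vectorizer would be replaced by a trained embedding model and the ordering by a clustering or ranking algorithm, as the surveyed studies report.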

RS-RQ4. What are the future research directions of AI in ST?
Table 7 summarizes the most recurrent future research directions in AI applied to ST emerging from the analysis of the selected secondary studies, and the list of studies mentioning them. The table was built by analyzing the sentences, extracted from each study, discussing future research directions and grouping sentences indicating similar research directions. Finally, for each group, we defined a category by means of a short summary of the research direction. The need for more rigorous experimental research is the most recurrent future research direction (8 of 20 studies). For instance, the authors in F12 [30] state that "most research efforts are not methodologically sound, and some issues remain unexplored," while in F11 [3] the authors report that empirical evidence is needed to assess how "AI-supported techniques (can outperform) current software testing techniques." Three studies identified the need to develop evidence with real systems, i.e., to fill the lack of studies investigating the application of AI to ST of larger and more complex software systems. As an example, the authors of F16 [57] observed that "the great majority of the conducted evaluations do not use real and large systems." Similarly, in F12 [30], the authors identified the lack of AI applications to "a wider range of software testing problems."
We believe that this future research direction might mitigate the current challenges in the applicability and transferability of AI applications to ST in industrial settings. The authors of F13 [81] identify the need to introduce new data type representations for test data generation in order to apply genetic algorithms to the automated definition of input/output test values. Another research direction emerging from the analysis is to apply ML to support automation. The authors of F12 [30] suggest more research be conducted to evaluate how machine learning approaches can be used to support ST automation, claiming: "We believe that the overarching motivation for research in this area should be automating most software-testing activities." Moreover, the authors of F14 [28] discuss the necessity to develop an ontology for ST, as they concluded that "operational versions" of ST taxonomies must be "designed and implemented." Finally, 9 of 20 studies do not propose any future research direction.

FURTHER DISCUSSION
In this section, we first provide additional general considerations on the results of our study (Section 6.1). Then, we focus on Testing Activities whose automation has been supported by different AI techniques and, for each AI technique, synthesize the main purpose it has been used for (Section 6.2).

Overall Considerations
Replicability of primary studies: As mentioned in Section 5.2.4, we found that 8 of 20 secondary studies have highlighted the need for rigorous empirical research to evaluate the outcomes presented by the primary studies. Drawing from this need, we believe that future secondary studies should devote more attention to this aspect by including specific research questions or quality assessment criteria aimed at evaluating the replicability of the surveyed studies.

Lack of benchmarks about the interplay between AI and ST: We observed the lack of benchmarks that practitioners and researchers can use to assess the outcomes of applying a specific AI technique to support ST. We feel that this could be an important line of research that can be underpinned by the mapping developed in this study. In particular, benchmarks could include datasets and case studies for which results are already known, and performance metrics against which the proposed AI-supported ST approaches could be compared. We also feel that the availability of these benchmarks could facilitate future research advancements by providing a common set of outcomes from which to outline new research questions and performance metrics.
Use of the mapping from the point of view of ST engineers: ST engineers can use Tables 6(a) and 6(b) to find secondary studies about the AI methodologies that have already been applied to support specific ST domains and concepts. Each non-empty cell indicates that a specific AI concept has already been applied to support a given ST activity or field. For instance, let us suppose we have a practitioner interested in "Test Data Definition." The practitioner can look at Tables 6(a) and 6(b) and find out which AI methodologies have been leveraged to support this activity. Moreover, each of the secondary studies reported in non-empty cells supplies pointers to primary studies providing additional details on the specific application of AI in ST. In this specific example, the practitioner interested in the application of "genetic algorithms" can explore this topic further by retrieving the primary studies surveyed by the four secondary studies listed in the corresponding cell, i.e., F5, F8, F10, and F11.
Empty cells as food for thought for researchers: Researchers can use the mapping (Tables 6(a) and 6(b)) to identify new research opportunities by inspecting empty cells. An empty cell in these tables means that we did not find evidence of the application of a specific AI concept to a given ST one. Possible explanations for empty cells that should be properly taken into account by researchers are: Explanation 1: There are not enough primary studies on the application of the specific AI concept to a given ST field of interest. As a result, such application has not permeated through the secondary sources and into the resulting mapping of this tertiary study. Explanation 2: It represents a greenfield opportunity for research, which can be in the form of novel primary studies or of secondary studies that address the mapping associated with the specific empty cell; we note that, similar to this explanation for empty cells, an opportunity to conduct a secondary study is also associated with cells of Table 7 that include only one, not recently published, study. As an example, the only secondary study that surveyed the application of AI to support Non-functional Testing is F7, which was published in 2009. As a result, an update of the F7 study could be of interest to the research community. Explanation 3: It is a false negative for our study. While we have taken great care with the analysis of our secondary sources, there is still the chance that we have missed a reported application. Explanation 4: It is not possible to apply the specific AI solution to the specific ST problem.
The cell might be empty because the application of AI to software testing might not be feasible, either temporarily, due to limitations in computing power, or by construct, where the specific mapping would not make sense. Researchers must be aware of this possibility when using the mapping as inspiration for research directions.
To exemplify how researchers can use empty cells, let us suppose we are interested in exploring the "New Data type representation for test data generation" future research direction reported in Table 7. This future research direction relates to the application of the knowledge representation reasoning AI concept to the Test Data Generation field; such application corresponds to an empty cell in Table 6(a). At this point, we can use the row and column labels as relevant keywords to perform an initial search in Scopus. To follow this example, we executed the search string ''( TITLE-ABS-KEY ( knowledge AND representation AND reasoning ) AND TITLE-ABS-KEY ( test AND data AND generation ) )'' in Scopus, and it returned 17 studies. We analyzed these papers and found that just one of them could potentially be related to the empty cell or considered useful for the future research direction we are interested in. As a result, we can argue that this empty cell is consistent with Explanation 1, Explanation 2, and Explanation 3, and clearly not supportive of Explanation 4. Had several primary studies related to the empty cell been returned by Scopus, only Explanation 2 and Explanation 3 would have applied.
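Since the query above follows a mechanical pattern, the row and column labels of the mapping can be turned into candidate search strings programmatically. The helper below is a hypothetical sketch of that idea (the function name and label handling are our own; only the TITLE-ABS-KEY query syntax comes from the example above):

```python
def build_search_string(ai_concept, st_field):
    """Combine a row (AI concept) and column (ST field) label of the mapping
    into a Scopus-style TITLE-ABS-KEY query, ANDing the individual words."""
    def clause(label):
        return "TITLE-ABS-KEY ( " + " AND ".join(label.lower().split()) + " )"
    return "( " + clause(ai_concept) + " AND " + clause(st_field) + " )"

query = build_search_string("Knowledge Representation Reasoning", "Test Data Generation")
print(query)
# ( TITLE-ABS-KEY ( knowledge AND representation AND reasoning ) AND TITLE-ABS-KEY ( test AND data AND generation ) )
```

Generating one such string per empty cell would let a researcher triage all cells of Tables 6(a) and 6(b) against a digital library with little manual effort.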

Use of standard or well-recognized terminologies and taxonomies: We value the use of standard or well-recognized taxonomies (i.e., AI Watch [88] and SWEBOK [18]) as sources of a common language for our domain area. As such, they have been adopted to guide the analysis process. However, our analysis process shows that this outlook is not shared by the community. This takes a toll on the analysis process (in terms of construct validity threats, which we discuss in Section 7), as an agreement has to be reached on the term used to describe a phenomenon before the analysis can move forward. Needless to say, we do not view standards or well-recognized taxonomies as static: not only do they evolve, but novel research proposals might require novel terminology. Yet, in general, we observed many variations for concepts that are (or are supposed to be) well understood. We are far from the first to highlight this issue (for instance, see [36, 84]); in particular, at the interplay of AI and software testing, Jöckel et al. [44] highlight how this issue becomes problematic for data analysis in our field and for extending and comparing research results.

AI Techniques Used to Support the Automation of Testing Activities
As Table 6 shows, several AI techniques have been applied to support ST. In this section, we focus on Testing Activities whose automation has been supported by different AI techniques and synthesize the main purpose for which each AI technique has been used.
AI for Test Case Generation: Secondary studies shared similar conclusions about how AI techniques have been applied to support the test case generation activity. Search-based AI techniques have been used to generate optimal test suites according to a given adequacy criterion, such as code coverage or fault detection. NLP-based techniques have mainly been used to reduce the manual effort of extracting test cases from requirements, specifications, and UML models [2, 35, 98]. ML is considered an emerging AI topic for the automation of test case generation. To be applied, these techniques have to learn a behavioral model of the application under test. Such a model is usually built starting from a dataset of inputs and outputs, or on the fly during the exploratory testing of the application under test. The latter approach is mostly used in GUI-based testing, where the user interface is explored and tested at the same time by triggering user events [30]. Ontologies have been used to build a vocabulary of terms that is specific for characterizing the application context of the software under test. The vocabulary can be used to build (1) abstract test cases, i.e., test cases that are not specific to a programming language, or (2) platform-specific executable test cases, i.e., test cases implemented in a specific programming language. Ontologies have also been used within NLP-assisted test case generation processes to impose restrictions on the context at hand and convert textual documents into an explicit system model for scenario-based test-case generation [25, 35].
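The on-the-fly exploratory approach described above for GUI-based testing can be illustrated with a minimal sketch. The `GUI` dictionary, the event names, and the greedy preference for unvisited states are hypothetical simplifications of the learned behavioral models discussed in the surveyed studies:

```python
# Hypothetical GUI model: each state maps the user events available in it
# to the state they lead to.
GUI = {
    "home":     {"open_settings": "settings", "open_profile": "profile"},
    "settings": {"back": "home", "toggle_dark": "settings"},
    "profile":  {"back": "home", "edit_name": "profile"},
}

def explore(start="home", steps=10):
    """Exploratory GUI testing: trigger user events while building a
    behavioral model (the set of observed transitions) on the fly,
    greedily preferring events whose target state is still unvisited."""
    state, visited, model = start, {start}, set()
    for _ in range(steps):
        events = GUI[state]
        novel = [e for e, target in events.items() if target not in visited]
        event = (novel or list(events))[0]
        nxt = events[event]
        model.add((state, event, nxt))   # record the observed transition
        visited.add(nxt)
        state = nxt
    return visited, model

visited, model = explore()
print(sorted(visited))  # ['home', 'profile', 'settings']: all states reached
```

An ML-based tool would replace the greedy choice with a learned policy (e.g., reinforcement learning over the same state/event abstraction) and would infer the state abstraction itself from screenshots or widget trees.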
AI for Test Case Optimization/Prioritization/Selection: The analyzed studies jointly observed that ML-based techniques are the most exploited and promising ones for selecting or prioritizing the test cases of a test suite in order to reduce testing resources, such as testing time and the use of expensive devices [12, 30, 50]. Possible interesting applications of ML show that specific models can be trained, from a dataset of test suites, to select test cases that minimize the testing time or to predict defects in the system under test. The reduction of the testing time also allows the introduction of ML-based test case optimization processes into modern continuous integration development processes. ML-based techniques may also be combined with NLP-based ones; NLP is needed to process textual data for building the dataset used to train the models [74]. Another common conclusion regards the use of ontologies. Semantic web-based techniques are the ontology-related approaches most used to define traceability links between test cases, test results, and requirements. Such links are exploited to profile the test cases and to select or prioritize the ones that guarantee specific testing adequacy criteria, such as coverage of requirements or failure discovery [25, 96].
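The traceability links mentioned above can be illustrated with a small sketch. The requirement and test-case identifiers are invented, and a plain Python dictionary stands in for the semantic-web knowledge base an ontology-backed tool would query:

```python
# Hypothetical traceability links: each test case is linked to the
# requirements it verifies (what an ontology would make queryable).
TRACE = {
    "TC1": {"REQ-LOGIN", "REQ-SESSION"},
    "TC2": {"REQ-LOGIN"},
    "TC3": {"REQ-EXPORT"},
    "TC4": {"REQ-SESSION", "REQ-EXPORT"},
}

def select_for_change(changed_reqs):
    """Selection: keep only the test cases tracing to a changed requirement."""
    return sorted(tc for tc, reqs in TRACE.items() if reqs & changed_reqs)

def prioritize_by_coverage():
    """Prioritization: greedily pick the test covering the most
    still-uncovered requirements (a requirements-coverage criterion)."""
    uncovered = set().union(*TRACE.values())
    remaining = dict(TRACE)
    order = []
    while uncovered and remaining:
        best = max(remaining, key=lambda tc: (len(remaining[tc] & uncovered), tc))
        order.append(best)
        uncovered -= remaining.pop(best)
    return order

print(select_for_change({"REQ-EXPORT"}))  # ['TC3', 'TC4']
print(prioritize_by_coverage())           # ['TC4', 'TC2'] covers all requirements
```

In the surveyed approaches, the links themselves are extracted and maintained via ontologies rather than hand-written dictionaries, but the selection and prioritization queries follow the same shape.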
AI for Test Data Definition: Most of the secondary studies reached a common consensus that considers ant colony optimization techniques and genetic algorithms (GAs) the most cost-effective for automatic test data definition in the context of mutation testing [3, 43, 81, 89]. GAs have been considered the most effective solution for the automatic generation of test data for structural, functional, and mutation testing, and they have also been successfully exploited to generate data for testing floating point computations and expert systems. NLP and ML approaches have been mainly used to generate test data for GUI testing and, in particular, for mobile GUI testing. NLP has also been exploited to generate input values expressed in natural language [2, 35], whereas ML techniques (such as Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and Unsupervised and Reinforcement Learning) are used in automated exploratory testing to generate inputs (i.e., user events on the application GUI) allowing the exploration of application states that were not previously visited [32, 98].
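A toy example of GA-based test data generation, in the spirit of the search-based approaches surveyed: the function under test, its target branch (`x > 100 and y < -50`), and all GA parameters are hypothetical, and the branch-distance fitness is a standard simplification.

```python
import random

def fitness(x, y):
    """Branch distance for a hypothetical function under test whose target
    branch is `if x > 100 and y < -50`; 0 means the branch is covered."""
    return max(0, 101 - x) + max(0, y + 51)

def evolve(pop_size=40, gens=1000, seed=1):
    """Evolve (x, y) test data toward covering the target branch."""
    rng = random.Random(seed)
    pop = [(rng.randint(-200, 200), rng.randint(-200, 200)) for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=lambda ind: fitness(*ind))
        if fitness(*pop[0]) == 0:
            return pop[0]                      # covering test datum found
        parents = pop[: pop_size // 2]         # elitist selection
        children = []
        while len(children) < pop_size - len(parents):
            p1, p2 = rng.sample(parents, 2)
            x, y = p1[0], p2[1]                # single-point crossover
            if rng.random() < 0.5:             # mutation: small perturbations
                x += rng.randint(-3, 3)
            if rng.random() < 0.5:
                y += rng.randint(-3, 3)
            children.append((x, y))
        pop = parents + children
    return min(pop, key=lambda ind: fitness(*ind))

x, y = evolve()
print((x, y), fitness(x, y))
```

Real search-based tools use the same fitness-guided loop, but with instrumented branch distances extracted from the program under test rather than a hand-written formula.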

AI for Test Oracle Definition: The studies concluded that ML has the potential to solve the "test oracle problem," i.e., the challenge of automatically generating oracles. ML algorithms have been used to generate test verdicts, metamorphic relations, and, most commonly, expected output oracles [30, 98]. In particular, ML-based predictive models are trained to serve either as a stand-in for an existing test oracle (used to predict a test verdict) or as a way to learn a function that defines expected outputs or metamorphic relationships and that can be used to issue a verdict. Supervised and semi-supervised ML approaches seem to be the most promising; the associated ML models are trained on labeled system executions or on source code metadata. Of these approaches, many use some type of neural network, such as Backpropagation NNs, Multilayer Perceptrons, RBF NNs, probabilistic NNs, and Deep NNs. Others apply support vector machines, decision trees, and adaptive boosting [32]. The studies showed great promise but also significant open challenges. The performance of the trained ML models is influenced by the quantity, quality, and content of the available training data [32]. Models should be retrained over time. The applied techniques may be insufficient for modeling complex functions with many possible outputs. Research is limited by the overuse of simplistic examples, the lack of common benchmarks, and the unavailability of code and data. A robust open benchmark should be created, and researchers should provide replication packages. Computer vision approaches are mainly used to support oracle definition in the context of GUI-based testing [98], where issuing a verdict requires recognizing specific regions or images of the graphical user interface to check properties such as color, position on the screen, or image quality.
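A minimal sketch of an ML-based oracle trained on labeled executions: a one-nearest-neighbor lookup stands in for the neural-network and decision-tree models the surveyed studies describe, and the example system, its data, and the tolerance are invented.

```python
def predict(executions, x):
    """Expected output = output of the nearest labeled execution;
    a minimal stand-in for a trained regression model."""
    return min(executions, key=lambda ex: abs(ex[0] - x))[1]

def verdict(executions, x, actual, tol=0.5):
    """Issue a test verdict by comparing the actual output of the system
    under test against the learned expected output."""
    return "pass" if abs(predict(executions, x) - actual) <= tol else "fail"

# Labeled executions (input -> observed correct output) of a hypothetical
# system computing roughly y = 2x; this is the oracle's training data.
executions = [(1, 2.0), (2, 4.1), (3, 5.9), (4, 8.0)]

print(verdict(executions, 2, 4.0))  # small deviation: pass
print(verdict(executions, 3, 9.0))  # large deviation: fail, likely a defect
```

The sketch also makes the surveyed challenges tangible: the verdict is only as good as the quantity and quality of the labeled executions, and the model must be retrained when the expected behavior evolves.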

THREATS TO VALIDITY
This section discusses the main possible threats to the validity of our tertiary study, classifying them according to Petersen et al. [77] and drawing suggestions from Zhou et al. [107]. Thus, we classified the threats to validity into (i) threats to construct validity, (ii) threats to internal validity, and (iii) threats to external validity.
Threats to Construct Validity. The use of different terminologies for AI and ST concepts in the selected secondary studies can lead to misclassification. As a strategy to mitigate this possible threat, we started from well-known taxonomies for both the AI [88] and ST [18] domains. In addition, the process of classifying the extracted data was performed iteratively and peer-reviewed by the authors. Furthermore, relevant concepts emerging from secondary studies were added to the adopted reference taxonomies, when missing.
Threats to Internal Validity. One of the major issues with systematic mappings is the risk of missing relevant studies. To mitigate this risk, we adopted a structured process to define and validate our search string, as suggested by Petersen et al. [77], and selected four major digital libraries to execute appropriate queries derived from it. In particular, our search string was designed to retrieve the largest number of published secondary studies by searching for the terms survey, mapping, review, secondary study, or literature analysis in the title or abstract of the papers. Furthermore, a snowball search process was performed to possibly find additional studies of interest. Another possible threat regards our decision to exclude gray literature papers, such as technical reports and graduate theses, which could lead to missing relevant secondary studies. However, since we reviewed secondary and not primary studies, the risk of excluding relevant but not peer-reviewed material is low. Biases or errors in the application of IC and EC, as well as in the quality assessment of papers, are another threat to the validity of our study. We mitigated this threat by having each selected paper examined by two groups of co-authors, each including an AI expert and an ST specialist, and having any disagreements resolved by face-to-face discussions between the members of the two groups.
Threats to External Validity. Publication bias is another common threat to the validity of secondary and tertiary studies [97]. In particular, the results of our study might have been biased by inaccurate results reported in the selected secondary studies. A common reason for this is that primary studies with negative results are less likely to be accepted for publication and, as a consequence, to be taken into account by secondary studies, and therefore do not permeate through to a tertiary study. Another external validity threat for our study relates to the risk of not extracting all the relevant information available in the selected studies or of incorrectly interpreting the extracted data. Both these risks may have caused an inaccurate mapping of some analyzed studies. We tried to mitigate this threat by having an AI expert and an ST specialist involved in the data extraction and mapping of each study and resolving any disagreements in a face-to-face discussion. Our data extraction could have missed emerging trends provided by recently published primary studies that were not yet surveyed by any secondary study. Also, since a tertiary study is based on data aggregated in secondary studies, it is possible that relevant information present in primary studies was omitted in the secondary studies and thus missed by our study. This threat is inherent to any tertiary study.

CONCLUSIONS
The goal of our tertiary study was to systematically understand how AI has been applied to support ST. As a result, we were able to uncover the interplay between the two domains and to reveal trends and possible future research directions. To achieve this goal, we defined nine RQs (five publication space RQs and four research space RQs) and conducted a systematic mapping study. We designed a strict research protocol and followed a systematic and peer-reviewed process to: (1) select our sources of information, (2) extract evidence from them, and (3) analyze the extracted data to answer our RQs. Starting from an initial set of 877 secondary studies retrieved from four major computer science digital libraries and an additional set of 296 studies retrieved by applying snowballing, the selection process led us to 20 relevant high-quality secondary studies. The analysis of the data extracted from the selected studies let us answer our RQs and derive the following main conclusions.
As for the publication space RQs: (1) the distribution of the selected secondary studies over the publication years (75% of them were published in the past six years), the large number of unique primary studies they surveyed (710), and the distribution of these primary studies over time (the first dating 1995 and almost two-thirds of them appearing in the past ten years) show a growing interest from the research community in a well-consolidated research topic; (2) most of the selected studies were published in journal venues and a large part of them appeared in top-ranked journals, indicating the high importance of the topic; and (3) most of the authors' affiliations are located in South America (Brazil, Argentina, and Uruguay), while affiliation countries that typically dominate in computer science or computer engineering publications (e.g., USA and China) do not occur in our observations.
Regarding the research space RQs: (1) several AI domains have been applied to support ST, with Planning being the most popular one, and machine learning and natural language processing the most trendy; (2) several ST domains have been supported by AI: almost all selected secondary studies surveyed the application of AI to the Testing Activity ST domain, and a majority of them surveyed the application of AI to the Testing Technique domain; overall, it emerges that, in recent years, AI has been pervasively introduced in ST; (3) the majority of selected secondary studies investigated the application of Planning to support the Testing Activity, making it the most surveyed pair of domains; (4) except for Software Testing Fundamentals, all ST domains have received support from more than one AI domain; in particular, Testing Activity and Testing Objective have seen applications from all AI domains; similarly, by analyzing our mapping at a finer-grained level, it emerges that most ST fields have received support from more than one AI concept, with some concepts having been applied only recently (e.g., word embedding); and (5) the most frequent future research directions emerging from the selected secondary studies are: (i) the need for more rigorous research, (ii) the evaluation of the proposals on larger or real-world software systems, (iii) more research to evaluate how machine learning approaches can be applied to support software testing automation, and (iv) the need for the development of new types of representations to apply genetic algorithms for test data generation.
To the best of our knowledge, this research is the first tertiary study investigating how AI is used to support ST. As a result of this research, we obtained a fine-grained mapping that describes the current interplay between AI and ST. Researchers can leverage this mapping to identify opportunities for future research, whether new secondary studies to be conducted or new applications of AI to ST to be developed. Practitioners can also use the mapping to make an informed decision on which AI technology to possibly adopt in support of their testing processes.

(EC2) The study focuses on the application of AI for either the prediction or the analysis, or the localization of: (i) errors; (ii) faults; (iii) bugs; or (iv) failures. (EC3) The study is a duplicate of another candidate paper. (EC4) The study does not provide a substantially different contribution compared to another candidate work written by the same authors. (EC5) The study has another candidate paper, written by the same authors, which is an extended version of it. (EC6) The study is a tertiary systematic mapping.

Fig. 1. Diagram of the secondary studies selection process execution.
Figure 2 shows the distribution of selected secondary studies per publication year. 2009 is the first year for which we selected a secondary study, and most of the selected secondary studies (75%) were published in the past six years.

For each paper, we report the corresponding quality scores.

ACM Computing Surveys, Vol. 56, No. 3, Article 58. Publication date: October 2023.

Fig. 2. Distribution of secondary studies per publication year and type.

Fig. 5. Distribution of unique primary studies per publication year.

Fig. 8. AI and ST domain pairs covered by the selected secondary studies. For each pair of domains, we report the count of distinct secondary studies surveying the corresponding applications of AI to ST.

Table 2. Quality Assessment Criteria.

Table 3. Data Extraction Form. In this table, we enumerate and describe the fields composing the data extraction forms for the PS and RS RQs (e.g., Future Directions Space: list of extracted sentences on future directions in AI applied to ST, mapped to RS-RQ4).

Table 4. Quality Assessment Results.

Table 5. Secondary Studies per Venues' Types and Names.

Table 6. (a) First Part of the Resulting Mapping.

Table 6. (b) Second Part of the Resulting Mapping. Test case generation, test oracle definition, test case optimization/prioritization/selection, test data definition, Requirement-based Testing, and Mutation Testing are the ST fields that have seen support from most of the AI domains.

Table 7. Future Research Directions Indicated by the Selected Secondary Studies.