Random Testing and Evolutionary Testing for Fuzzing GraphQL APIs

The Graph Query Language (GraphQL) is a powerful language for application programming interface (API) manipulation in web services. It has been recently introduced as an alternative solution for addressing the limitations of RESTful APIs. This article introduces an automated solution for GraphQL API testing. We present a full framework for automated API testing, from the schema extraction to test case generation. In addition, we consider two kinds of testing: white-box and black-box testing. The white-box testing is performed when the source code of the GraphQL API is available. Our approach is based on evolutionary search. Test cases are evolved to intelligently explore the solution space while maximizing code coverage and fault-finding criteria. The black-box testing does not require access to the source code of the GraphQL API. It is therefore of more general applicability, albeit it has worse performance. In this context, we use a random search to generate GraphQL data. The proposed framework is implemented and integrated into the open source EvoMaster tool. With enabled white-box heuristics (i.e., white-box mode), experiments on 7 open source GraphQL APIs and three search algorithms show statistically significant improvement of the evolutionary approach compared to the baseline random search. In addition, experiments on 31 online GraphQL APIs reveal the ability of the black-box mode to detect real faults.


INTRODUCTION
Web services are quite common in industry, especially in enterprise applications using microservice architectures [63]. They are also becoming more common with the emergence of smart city technologies, where microservices are largely exploited in Industrial Internet of Things settings [36,39]. The investigation of automating techniques for generating test cases for web service
The article is structured as follows. Section 2 provides background information on GraphQL APIs, EvoMaster, and the employed search algorithms. Section 3 discusses related work. Section 4 gives a detailed explanation of the main components of the proposed framework. The details of our empirical study are presented in Section 5, followed by a discussion of the main findings of our research work in Section 6. Section 7 discusses the possible threats to validity. Finally, Section 8 concludes the article.

BACKGROUND
This section provides important background information to better understand the rest of the article, particularly regarding GraphQL APIs (Section 2.1), the EvoMaster tool (Section 2.2), and the compared search algorithms RS (Section 2.3), WTS (Section 2.4), MOSA (Section 2.5), and MIO (Section 2.6).

GraphQL
GraphQL is a query language and server-side runtime for APIs [8]. Given a set of data represented as a graph of connected nodes, GraphQL makes it possible to query such a graph, specifying for each node which fields and connections to retrieve (and so on recursively for each retrieved connected node). Figure 1(a) shows a simplified/reduced example of a graph for a pet clinic, whereas Figure 1(b) shows a GraphQL query on it to retrieve the list of all pets that have been registered in the clinic with their owners and their total number of visits to the pet clinic.
A GraphQL web service typically listens on a TCP socket, expecting GraphQL queries as part of HTTP requests on a given HTTP endpoint (typically /graphql). However, GraphQL is not tied to HTTP, and queries could technically be sent via other communication mechanisms. One advantage of GraphQL is that a user can fetch all the data they need (and only that) in a single HTTP call.
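As an illustration of how a query travels over HTTP, the following minimal sketch wraps a GraphQL query in a POST request to a /graphql endpoint. The base URL and the field names in the query are assumptions made for this example (loosely following the pet clinic graph); they are not taken from an actual API. The request is only built, not sent.

```python
import json
import urllib.request


def build_graphql_request(base_url, query, variables=None, path="/graphql"):
    """Wrap a GraphQL query in an HTTP POST request (the request is not sent)."""
    payload = {"query": query}
    if variables is not None:
        payload["variables"] = variables
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical field names, loosely following the pet clinic example.
PETS_QUERY = "{ pets { id owner { id } visits { totalCount } } }"
request = build_graphql_request("http://localhost:8080", PETS_QUERY)
```

Passing `request` to `urllib.request.urlopen` would send it to a running server; all fields requested by the query would then come back in one HTTP response.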
A core concept in GraphQL is the schema, which is a collection of types and relationships between those types. It describes which kind of types and fields on those types a client can request. GraphQL is a strongly typed language and has its own language to write the schema.
Commonly used type definitions in a GraphQL schema are as follows:
(1) Object types: The most frequent elements of a GraphQL schema are object types. They indicate which kind of object (e.g., a node in the graph) you can fetch and what fields it has.
(2) Query type: A Query type is a special type in GraphQL that defines entry points (equivalent to remote procedure calls) for fetching data from the graph. It is the same as an object type, but its name is always Query. Each field of the Query type describes the name and return type of a different entry point. Note that a GraphQL server has to define a query type.
(3) Mutation type: A GraphQL operation can either be a read or a write operation. A Query type is used to read data, whereas the Mutation type is used to modify data. The Mutation type follows the same syntactical structure as queries; however, it defines entry points for writing operations. Note that a GraphQL server may or may not have a mutation type.
(4) Scalar types: Scalar types are primitive types that resolve to concrete data. GraphQL has five default scalar types: Int, Float, String, Boolean, and ID. Note that GraphQL allows creating custom scalar types for more specific usage.
(5) Input types: Input types are object types that allow passing complex objects as arguments to queries and mutations. An input type's definition is similar to that of an object type, but it starts with the keyword input instead of type. Note that input types can only have basic field types (input types or scalar types) and cannot have field arguments.
(6) Enum types: An Enum type is a special scalar type with a restricted set of allowed values specified in the schema.
(7) Interface types: An Interface type is an abstract type. An interface is composed of a set of fields held by multiple object types. When an object type implements an interface, it has to include all of that interface's fields. Thereby, interfaces enable returning any object type that implements that interface. Note that we can query an interface schema type for any fields defined in the interface itself, and we can also query it for fields that are not in the interface but in the object types implementing the interface.
(8) Union types: Like interface types, the union type belongs to the GraphQL abstract types. It allows defining a schema type that can be one of multiple types. In its definition, a union type determines which object types are included. In this case, the schema field can return any object type that is described by the union. Note that all the union's included types should be object types (e.g., not Input types).
(9) Non-nullable type: All types in GraphQL are nullable by default; that is, the server can return a null value for all the previous types. To override this default and specify that null is not a valid response, an exclamation mark (!) is added after the type, indicating that this field is required. The Non-Null type can also be used in arguments. Note that in queries, a field is always optional; that is, one can skip a non-nullable field and the query would still be valid. However, if a field is required (declared as non-nullable), the server must never return the null value if the query fetches such a field. With regard to input arguments, by default they are optional. However, if a type is declared as non-null, besides not taking the value null, it also does not accept omission (i.e., the input argument must be present).
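The rules for non-null input arguments can be made concrete with a small check. The sketch below is illustrative only: the function name and the dictionary-based representation of declared arguments are assumptions, and it deliberately ignores default values and nested or list types.

```python
def check_arguments(declared, provided):
    """Check provided arguments against declared GraphQL type strings.

    declared: {argument name: type string}, e.g. {"id": "Int!"} (assumed format).
    provided: {argument name: value}; a missing key means the argument is omitted.
    """
    problems = []
    for name, type_ref in declared.items():
        if type_ref.endswith("!"):  # non-null: must be present and not null
            if name not in provided:
                problems.append(f"{name}: required argument is missing")
            elif provided[name] is None:
                problems.append(f"{name}: null is not allowed")
    for name in provided:
        if name not in declared:
            problems.append(f"{name}: unknown argument")
    return problems
```

A nullable argument such as `"name": "String"` may be omitted or set to null without any problem being reported, whereas `"id": "Int!"` must be present with a non-null value.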
The following is a fragment of a GraphQL schema extracted from one of the SUTs used in our empirical study in Section 5 (i.e., petclinic [10]). The schema describes the entry point pets that returns a list of all pets that have been registered in the pet clinic. It is defined as a non-nullable array of the object type Pet. The object type Pet has as a field a non-nullable integer scalar type that represents the id of the pet. It also defines a non-nullable object type Owner that implements an interface named Person. The field VisitConnection is an object type specifying all the visits to the pet clinic of this pet. It has a non-nullable integer field totalCount that reports the total number of visits for this pet.
As a response to a query, a GraphQL API will return a JSON object with two fields: data, which contains the result of the query, and errors, if there was any error with the query (and in this case, the data field would not be present). Notice that GraphQL makes no distinction between user errors (e.g., a wrongly formatted query or an input that does not satisfy a business logic constraint) and server errors (e.g., an internal crash in the business logic of the API, like a null-pointer exception). However, it might support such a distinction in the future. Furthermore, as GraphQL is independent from HTTP, a query with errors could still have an HTTP response 200 (i.e., OK), and this is a common behavior among GraphQL framework implementations.
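Because an HTTP 200 status does not imply success, a client or test oracle has to inspect the response body. The following sketch shows one possible way to do so; the function name and the returned labels are assumptions chosen for illustration, not part of GraphQL or EvoMaster.

```python
import json


def classify_response(http_status, body_text):
    """Classify a GraphQL-over-HTTP response by status code and JSON body."""
    if http_status >= 500:
        return "http-server-error"
    try:
        body = json.loads(body_text)
    except json.JSONDecodeError:
        return "invalid-body"
    if not isinstance(body, dict):
        return "invalid-body"
    if body.get("errors"):
        # Reported even when the HTTP status is 200 (OK).
        return "graphql-error"
    if "data" in body:
        return "success"
    return "invalid-body"
```

Note that the label "graphql-error" covers both user errors and server errors, since the response format alone does not distinguish them.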
Another limitation of GraphQL is that currently it has no standardized way to express constraints on the fields of the graph (e.g., an integer within a specific numeric range, or a string that should satisfy a given regular expression). Constraints could be added with "directives," which are decorators used to extend the semantics of the schema. But those would be custom and unique for each different implemented API.

EvoMaster
EvoMaster [16,27] is an open source tool that aims at system test generation, currently targeting REST web services [19]. EvoMaster is a search-based approach that has been integrated with a set of search algorithms (i.e., MIO [15], MOSA [64], WTS [47], and an RS algorithm) to evolve test cases. Internally, by default it uses the MIO [15] evolutionary algorithm enhanced with Adaptive Hypermutation [79], and can handle both black-box [20] and white-box testing [25]. For white-box testing, it uses established heuristics like the branch distance [15,19,59], it employs testability transformations to smooth the search landscape [25], and can also analyze all interactions with SQL databases to improve the fitness function [24]. For black-box testing, EvoMaster employs random test generation. The tool produces random inputs that are nevertheless syntactically valid with respect to the OpenAPI/Swagger schema of the SUT.
Two different studies [57,80] compared EvoMaster with other fuzzers for RESTful APIs, showing that EvoMaster gives the best results. EvoMaster is open source, hosted on GitHub [5], with each release automatically published on Zenodo for long-term storage. The extension presented in this article is available to practitioners from EvoMaster version 1.5.0 [31].
EvoMaster currently targets REST APIs running on the JVM and NodeJS [84] (albeit for black-box testing, it can be applied on any kind of REST web service), and it outputs test suite files in the JUnit and Jest format. Each generated test case is composed of one or more HTTP calls, and SQL data to initialize the database (if any).
When targeting RESTful APIs, EvoMaster analyzes their schema and creates a chromosome representation with a rich gene system to represent all possible needed types (from integers and strings to full JSON objects). The resulting phenotype will represent complete HTTP requests, where each gene would represent the different decisions that need to be made in these HTTP requests (e.g., query/path parameters in the URLs and body payloads in POST/PUT/PATCH requests). EvoMaster evolves test cases based on different metrics, like statement and branch coverage, as well as coverage of the HTTP status codes for each endpoint in the API. To detect potential faults, it considers the 500 HTTP status code (i.e., server error) and possible mismatches between the schema and the concrete responses [61].

Random Search
RS is a method to generate solution(s) of an optimization problem at random. It is easy to implement, as it does not need a gradient to guide the search or any special heuristics. It simply keeps sampling new random solutions from the search space, checking if they give better fitness than the best solutions sampled so far.
EvoMaster implements RS in the context of test case generation; that is, it randomly samples a sequence of test actions with test data generated randomly according to the schema of the SUT. Each time a newly sampled test covers a new target, it is stored in an archive. At the end of the search, RS outputs the best set of test cases based on the defined fitness function.
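The archive-based random search described above can be sketched as follows. This is a simplified illustration, not EvoMaster's implementation: the callables `sample_test` and `covered_targets` are hypothetical stand-ins for schema-based sampling and for the coverage feedback of the SUT, and the toy usage treats tests as integers.

```python
import random


def random_search(sample_test, covered_targets, budget, seed=0):
    """Sample tests at random and archive the first test covering each target."""
    rng = random.Random(seed)
    archive = {}  # target -> test that covers it
    for _ in range(budget):
        test = sample_test(rng)
        for target in covered_targets(test):
            if target not in archive:
                archive[target] = test
    return archive


# Toy usage: a "test" is an integer, and it covers the target (test % 3).
archive = random_search(
    sample_test=lambda rng: rng.randint(0, 9),
    covered_targets=lambda test: {test % 3},
    budget=50,
)
```

The values of the returned archive form the final test suite. No information from previous samples guides the next sample, which is what distinguishes RS from the evolutionary algorithms discussed next.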
In the context of REST API testing, RS is the default algorithm for the black-box mode of EvoMaster. Regarding the white-box mode of EvoMaster, RS can be considered a gray-box testing approach, as it still employs white-box heuristics to produce the best test cases at the end of the search (i.e., tests that cover more code are saved in the archive).
Such an approach can be used as a baseline [28] to assess the effectiveness of white-box approaches [15,82,84], for example in terms of the effectiveness of evolving tests with the white-box heuristics (e.g., branch distance). In the literature, RS is often used as a baseline to evaluate novel techniques [13]. A novel technique might have a non-trivial computational cost, which could make it slower and provide worse results than a naive RS (an example is unconstrained combinatorial testing [22]).

WTS
WTS was introduced by Fraser and Arcuri [47] for unit test generation of Java classes. The approach reformulates the test suite generation as a search problem, and employs a Genetic Algorithm (GA) to solve it, with the main focus originally on maximizing branch coverage in the context of white-box testing.
With the GA, an individual in a population is a test suite that is composed of a set of test cases. The search starts with a random population. At each iteration, a new generation (i.e., Z) is evolved, with a subset (e.g., 10%) composed of the best individuals from the current population (known as elitism). Then, two individuals are selected from the population using rank selection, as parents for the next evolution (i.e., parents P1 and P2). The two parents are modified by crossover (applied with a given probability) and mutation to generate two offspring (i.e., offspring Q1 and Q2). The modification performed by crossover is at the test suite level, whereas the mutation can modify the test cases (like changing test data) or the test suite (like removing or inserting new tests). The four individuals (i.e., P1, P2, Q1, and Q2) are evaluated by the fitness function, which calculates the sum of all the heuristics/optimization targets achieved by the test suite as a single objective [68]. Based on the calculated fitness and constraints (e.g., maximum number of test cases), the best individuals from these four are added into the new generation (i.e., Z). Like in a GA, this process is repeated until the new population Z is full (i.e., it reaches its maximum size). Then, a new generation is created following the same process, and so on. The search terminates once the specified coverage criterion is satisfied or the specified search budget is used up, and then outputs the best test suite evolved so far in the final generation. WTS is further extended by using an archive to store the best tests [69]. More details on WTS can be found in other works [47,68,69].
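One generation of this process can be sketched as follows. This is a simplified illustration under stated assumptions rather than the algorithm as implemented in any tool: fitness is reduced to the number of distinct covered targets, the mutation only replaces or appends a test, and the helper names (`coverage`, `new_test`) and parameter values are hypothetical. Initial suites are assumed to respect `max_tests`.

```python
import random


def suite_fitness(suite, coverage):
    """Single objective: number of distinct targets covered by the whole suite."""
    covered = set()
    for test in suite:
        covered |= coverage(test)
    return len(covered)


def rank_select(population, fitness, rng):
    """Rank selection: better-ranked suites get proportionally higher weight."""
    ranked = sorted(population, key=fitness)  # worst first
    weights = list(range(1, len(ranked) + 1))
    return rng.choices(ranked, weights=weights, k=1)[0]


def crossover(p1, p2, rng):
    """Exchange tails of two suites (crossover at the test suite level)."""
    cut1, cut2 = rng.randint(0, len(p1)), rng.randint(0, len(p2))
    return p1[:cut1] + p2[cut2:], p2[:cut2] + p1[cut1:]


def mutate(suite, new_test, rng, max_tests):
    """Either replace one test or append a new one; returns a new list."""
    suite = list(suite)
    if suite and rng.random() < 0.5:
        suite[rng.randrange(len(suite))] = new_test(rng)
    elif len(suite) < max_tests:
        suite.append(new_test(rng))
    return suite


def next_generation(population, coverage, new_test, rng,
                    elite=1, p_cross=0.75, max_tests=5):
    def fitness(suite):
        return suite_fitness(suite, coverage)

    size = len(population)
    z = sorted(population, key=fitness, reverse=True)[:elite]  # elitism
    while len(z) < size:
        p1 = rank_select(population, fitness, rng)
        p2 = rank_select(population, fitness, rng)
        q1, q2 = crossover(p1, p2, rng) if rng.random() < p_cross else (p1, p2)
        q1 = mutate(q1, new_test, rng, max_tests)
        q2 = mutate(q2, new_test, rng, max_tests)
        # Keep the best two of {P1, P2, Q1, Q2} that satisfy the size constraint.
        candidates = [s for s in (p1, p2, q1, q2) if len(s) <= max_tests]
        z.extend(sorted(candidates, key=fitness, reverse=True)[:2])
    return z[:size]
```

Because the elite is copied unchanged into Z, the best fitness in the population never decreases from one generation to the next in this sketch.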

MOSA
MOSA was introduced by Panichella et al. [64]. Unlike the single objective handled by WTS, MOSA reformulates the test generation as many objectives. MOSA is also a white-box testing approach; that is, each coverage target (e.g., branch and line) is defined as an objective to optimize. An individual in MOSA is a test and not a test suite as in WTS. In MOSA, the population is composed of tests for an objective. Once the objective is achieved by a test, the test will be added into an archive that tracks the best tests during the search, and the objective will no longer be involved in the fitness function in the following optimization steps.
MOSA is inspired by Non-Dominated Sorting Genetic Algorithm II (NSGA-II) [40], which is a widely applied algorithm in the scientific literature. To better tackle the test case generation problem and make the search focus on uncovered targets, MOSA defines a preference criterion and a new rank algorithm based on such a criterion (i.e., preference sorting). Similar to NSGA-II, MOSA starts with a random population. At each iteration, crossover and mutation are used to generate new tests, and selection is used for the next generation based on ranks. In MOSA, the main difference from NSGA-II is that its selection employs preference sorting, which calculates ranks with a further consideration of the preference criterion (e.g., test cases for uncovered targets will be assigned the best rank so that they survive into the next generation with a higher chance). The search will terminate based on a specified search budget, and the best tests in the archive will be outputted at the end.
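The idea behind preference sorting can be illustrated with a simplified sketch: for each uncovered target, the test closest to covering it is placed in the first front. The function and parameter names are assumptions, and the remaining tests are returned unranked here, whereas the full algorithm ranks them further with non-dominated sorting.

```python
def preference_sort(tests, uncovered, distance):
    """Split tests into a preferred first front and the rest.

    distance(test, target) -> float, where lower is better (assumed convention).
    """
    first_front = []
    for target in uncovered:
        best = min(tests, key=lambda t: distance(t, target))
        if best not in first_front:
            first_front.append(best)
    rest = [t for t in tests if t not in first_front]
    return first_front, rest
```

With three tests and two uncovered targets, at most two tests enter the first front, and a test that is best for no target (even if it is mediocre on all of them) is left in the rest.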

MIO
MIO is the default algorithm in EvoMaster, introduced by Arcuri [15]. It is a genetic-based evolutionary algorithm inspired by the (1+1) evolutionary algorithm [44], designed specifically for handling system test case generation. Here, we briefly discuss how it works, but for the full details of MIO we refer to the work of Arcuri [15]. MIO is a multi-population algorithm with one population for each testing target. Similarly to MOSA, MIO evolves individuals that are test cases and outputs test suites. MIO has been employed for testing REST APIs [19,84,86] and RPC-based APIs [82].
At the beginning of the search, one single population is randomly initialized, based on the chromosome templates constructed from the schema. For testing RESTful APIs, several kinds of testing targets are taken into account, such as statement coverage in the SUT, branch coverage, and returned HTTP status codes. Next, at each step, MIO either samples a new test at random or selects one existing test from one population that includes yet to be covered targets, and mutates such a test. Different strategies are used to select which population to sample from. Individuals are manipulated through only one operator, which is the mutation operator (i.e., no crossover). Two types of mutation are applied: either a structure mutation or an internal mutation. A test case is composed of one or more "actions" (i.e., an HTTP call in the case of testing of web services). In contrast to the internal mutation, which affects only the values of the genes in the actions, such as flipping the value of a Boolean gene from True to False, the structure mutation acts on the structure itself, such as adding and removing actions (e.g., HTTP calls). Each time a new test is sampled/mutated, its fitness is calculated. If it achieves any improvement on any target (regardless of the population it was sampled from), it will be saved in the corresponding populations (and the worst individuals in such populations are deleted). In this context, if a target is covered by a test, it is saved in an archive, the corresponding population is shrunk to one single individual, and it will never expand again nor be used for sampling. If new targets are reached (but not fully covered) during the evaluation of a test, a new population is created for each such newly discovered target. At the end of the search, MIO outputs a test suite (i.e., a set of test cases) based on the best tests in the archive for each testing target.
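The main loop described above can be sketched as follows. This is a heavily simplified illustration and not the MIO implementation in EvoMaster: the callables `sample`, `mutate`, and `heuristics` are hypothetical, populations are chosen uniformly at random, and a covered target is simply moved to the archive instead of keeping a one-element population.

```python
import random


def mio(sample, mutate, heuristics, budget, p_random=0.5, pop_limit=3, seed=0):
    """heuristics(test) -> {target: score in [0, 1]}, with 1.0 meaning covered."""
    rng = random.Random(seed)
    populations = {}  # uncovered target -> list of (score, test), best first
    archive = {}      # covered target -> covering test
    for _ in range(budget):
        if populations and rng.random() >= p_random:
            # Mutate a test taken from a population of a not-yet-covered target.
            target = rng.choice(sorted(populations))
            _, parent = rng.choice(populations[target])
            test = mutate(parent, rng)
        else:
            test = sample(rng)
        for target, score in heuristics(test).items():
            if target in archive or score <= 0:
                continue
            if score >= 1.0:
                archive[target] = test          # covered: no longer sampled from
                populations.pop(target, None)
                continue
            pop = populations.setdefault(target, [])  # new population if needed
            pop.append((score, test))
            pop.sort(key=lambda pair: pair[0], reverse=True)
            del pop[pop_limit:]                 # drop the worst individuals
    return archive
```

The values of the returned archive play the role of the final test suite, with one best test per covered target.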

RELATED WORK

Testing of GraphQL APIs
Automated testing of GraphQL APIs is a topic that has been practically neglected in the research literature [66]. To the best of our knowledge, so far only three approaches have been investigated regarding the automated testing of GraphQL APIs [55,72,78] (besides the poster version of this extended work [35]).
Vargas et al. [72] proposed a technique called deviation testing. It consists of four steps. In the first step, an already existing test case is taken as input. This test serves as a seed and as a baseline against which the newly generated tests are compared. The second step is the test case variation, where variations of the initial seeded test case are generated using deviation rules (where a deviation consists of a small modification). Four types of deviation rules are defined: (1) field deviation consists of adding and deleting the selection of fields in the original query, (2) not null deviation consists of replacing a declared non-null argument with null, (3) type deviation consists of changing an argument type to another type, and (4) empty fields deviation consists of deleting all fields and sub-fields of the original query. The third step is the test case execution, where the input test and its variations are executed. The last step consists of comparing the results between the input test and its variations (e.g., wrong inputs should lead to a response containing an error message).
Karlsson et al. [55] proposed a black-box property-based testing method. The method consists of the following steps. First, all specifications of the types and their relations are extracted from the schema. Data is generated at random according to the schema, with customized "data generators" provided by the user. In addition, the authors suggest two strategies to use as automated oracles: the first one aims to check the returned HTTP status codes, and the second one verifies that the returned data conforms to the given schema.
Zetterlund et al. [78] presented a technique to capture the HTTP calls made by users in production (e.g., when interacting with a web frontend) and generate test cases for the GraphQL API in the backend. These generated tests can then be used for regression testing.
Our novel solution does not require any pre-existing test case (like in the work of Vargas et al. [72]), nor does it require the user to write customized input generators (like in the work of Karlsson et al. [55]), nor does it require users to interact with a frontend [78]. Due to those limitations, no empirical comparisons with those techniques could be performed on our case study. Our framework benefits from the EvoMaster tool, which is able to do advanced white-box testing (e.g., using testability transformations [25] and SQL interaction analysis [24]), and also is able to perform black-box testing when the source code of the SUT is not available. In this work, we provide a complete testing pipeline and an intelligent genetic-based exploration of the possible test case configurations.

Testing of REST APIs
In recent years, there has been an increasing interest in the research community about the automation of testing web services, where RESTful APIs are currently the most common type [50].
For example, Godefroid et al. [49] introduced differential regression testing for avoiding breaking changes in REST APIs. They analyzed two types of regressions in APIs: regressions in the contract rules between the client and the server, and regressions in the server itself (i.e., changes across server versions). Differential testing is performed to automatically identify abnormal behavior in both kinds of regressions. It consists of comparing different versions of the server and also different versions of the contracts with the client.
Viglianisi et al. [74] proposed an approach for automatically generating black-box test cases for REST APIs. It takes as input the Swagger/OpenAPI specification of the API and consists of three modules. It first analyzes the schema and computes its corresponding operation dependency graph, which is a graph that represents the data dependencies. It then automatically generates test cases of the REST API by reading both the graph created in the previous step and the schema to test the nominal scenarios. It finally applies mutation operators to the nominal tests which violate data constraints for testing error scenarios. To decide whether a test is successful or not, two oracles are established based on the returned status codes and on the compliance with the schema.
Martin-Lopez et al. [62] presented a new formulation of the automated API testing problem as a Constraint Satisfaction Problem (CSP). The Inter-parameter Dependency Language (IDL) is first introduced to formally describe the different relations among input parameters of the REST API. Then, the CSP is used to automatically analyze the IDL specification. Finally, a catalogue of analyses is constructed to extract helpful information, such as checking whether an API call is valid or not.
So far, all approaches presented in the literature for testing RESTful APIs are black-box [50]. EvoMaster (Section 2.2) is currently the only tool that can do both white-box and black-box testing. It uses evolutionary techniques for white-box testing and RS for black-box testing. Furthermore, recent comparisons of tools [58,80,81] show that EvoMaster gives the best results on the selected APIs used in those tool comparisons.
Note that empirical comparisons with existing fuzzers targeting RESTful APIs would not be particularly useful, as they would fail to generate any test case for GraphQL APIs (e.g., as the API schema definitions are different).Likewise, comparing with other testing tools targeting different domains such as mobile apps (e.g., Sapienz [60]) or data parsers (e.g., AFL [1]) would provide no useful information either.

GRAPHQL TEST GENERATION
This section presents our novel proposed framework for automated test case generation of GraphQL APIs, built on top of the EvoMaster tool. As sketched in Figure 2, the proposed framework targets both white-box and black-box testing. White-box testing is performed when the schema is provided and the source code of the GraphQL API is accessible. When only the schema is available, black-box testing is used instead.
Algorithm 1 presents the pseudo-code of the GraphQL test generation algorithm. The input data is the entry point to the GraphQL API (e.g., a URL). In the case of white-box testing, we also provide the source code as input. The results will be a set of generated tests (e.g., in JUnit or Jest format). The process starts by extracting the schema from the graph data in line 3. In the case of having the source code of the GraphQL API, the white-box solution is performed. It first determines the corresponding genes, creates the chromosome template, and applies an evolutionary algorithm process to generate the tests (from line 4 to line 7). In the case of missing the source code of the GraphQL API, the black-box solution is established by generating random tests based on the schema, illustrated in line 9. The algorithm returns the set of generated tests as shown in line 11. In the following, both kinds of testing are explained in more detail.
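The control flow described for Algorithm 1 can be paraphrased as the following sketch. It is not the pseudo-code of Algorithm 1 itself: the function name, the `steps` dictionary, and its keys are hypothetical placeholders for the components explained in the remainder of this section.

```python
def generate_tests(entry_point, steps, source_code=None):
    """Dispatch between white-box and black-box test generation.

    steps: dict of callables (hypothetical names) implementing each phase.
    """
    schema = steps["extract_schema"](entry_point)
    if source_code is not None:
        # White-box: genes -> chromosome template -> evolutionary search.
        genes = steps["determine_genes"](schema)
        template = steps["create_template"](genes)
        tests = steps["evolve"](template, source_code)
    else:
        # Black-box: random tests generated from the schema only.
        tests = steps["random_tests"](schema)
    return tests
```

In both modes the schema is extracted first; the availability of the source code only decides which generation strategy is applied afterwards.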

White-Box Testing
When carrying out scientific research to apply evolutionary computation to solve a new engineering or scientific problem for the first time in the literature, many decisions need to be made and empirically evaluated. This includes defining the problem representation for the evolving individuals (Section 4.3), the search operators (Section 4.4), the fitness function (Section 4.5), and the final output format (Section 4.6). All these decisions can have a major impact on the final results. Therefore, proper empirical evaluations with a large case study are needed.
Our process for automated search-based testing for GraphQL APIs starts by extracting the schema from the tested API (Section 4.2). The chromosome template is then constructed from the schema. Test cases are represented by a sequence of HTTP requests, instantiated from the chromosome template (Section 4.3). The test cases are evolved using a search algorithm, where each test case will contain genes representing how to build the GraphQL queries based on the given schema. The evolutionary search is performed by applying mutation operators to evolve the test cases, for example, one to change the queries/mutations in each HTTP call and the other to add/remove HTTP calls in a test case (Section 4.4). Test cases are rewarded based on their achieved code coverage and found faults (Section 4.5). From the final evolved solution, a self-contained test suite file (e.g., in JUnit format) is generated as output of the search (Section 4.6).
In the following, we describe the main components of the white-box testing in more detail.

API Schema
At a high level, a GraphQL API can be seen as a server opening a TCP socket and processing HTTP requests, with body payloads written in a specific format (i.e., using the GraphQL language) for the different functionalities provided by the API. Sending random bytes on such a TCP connection would be unlikely to produce any meaningful message, and such input would be immediately discarded by the API. Likewise, sending properly formatted GraphQL messages would result only in errors if those messages are not based on the actual entry points and expected input types of the API.
To send meaningful GraphQL messages that would execute the business logic of the API, such messages must be based on the schema of the API (recall Section 2.1). Each GraphQL API must have a schema definition, which can be retrieved online from the API itself (unless such an option is disabled for security reasons).
To fetch the whole schema from a GraphQL API, an introspective query is used. Given an entry point to the GraphQL API (e.g., typically a /graphql HTTP endpoint), GraphQL enables a standard way to fetch a schema description of the API itself. The schema specifies all the information about the available operation types, such as queries, mutations, and all available data types on each of them. As a result, the GraphQL schema is returned in JSON format.
Let us consider the following example of the introspective query in which we query one of the SUTs used in our empirical study in Section 5 (i.e., petclinic [10]) to obtain the resources that are available. In this introspective query, we query the field __schema, which provides all information about the schema of a GraphQL service. It is considered a meta-field used by GraphQL for the introspection system. Such a field is accessible from the root type of a query operation, and its type is defined next. By querying the fields queryType and mutationType, the GraphQL petclinic server will return all queries and mutations available from the schema. In this case, both query and mutation operations are available.
The field named types (of kind __Type) is at the core of the introspection system. It represents all types in the system: both named types (e.g., OBJECT kind) and type modifiers (e.g., NON_NULL kind). A reduced subset (due to its length) of the returned result of the introspective query for this example is shown next.
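To make the use of introspection more tangible, the sketch below defines a reduced introspection query and extracts the names of the available queries and mutations from its JSON result. The query uses standard introspection fields (__schema, queryType, mutationType, types), but it is a shortened illustration and not the query used by our framework; the helper name `list_operations` is an assumption.

```python
# A reduced introspection query built from standard introspection fields.
INTROSPECTION_QUERY = """
{
  __schema {
    queryType { name }
    mutationType { name }
    types { kind name fields { name } }
  }
}
"""


def list_operations(introspection_result):
    """Return the entry point names found in an introspection result (a dict)."""
    schema = introspection_result["data"]["__schema"]
    fields_by_type = {
        t["name"]: [f["name"] for f in (t.get("fields") or [])]
        for t in schema["types"]
    }

    def fields_of(root):
        # root is None when the API defines no such operation type.
        return fields_by_type.get(root["name"], []) if root else []

    return {
        "queries": fields_of(schema.get("queryType")),
        "mutations": fields_of(schema.get("mutationType")),
    }
```

Each returned name corresponds to one entry point of the API, for which a test generator can then prepare requests.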
Like the software of the API itself, the GraphQL schemas can also have faults. This, for example, is a common issue among RESTful APIs [61]. As the schema is the main source of information on how to prepare syntactically valid requests, issues in the schema can have negative impacts on the performance of the fuzzing sessions [80].
Most GraphQL frameworks (e.g., Apollo [3]) do validate the syntax of the API endpoints based on the defined schema, for example, to check the presence and format of each endpoint (i.e., query and mutation methods). In case of errors and mismatches, they would return a response with an error message at each incoming request or simply crash the server at start-up time. This kind of issue can be quickly identified and corrected. However, a schema could be underspecified. For example, the API could have implementations of endpoints that are not declared in the schema. But it would not be possible to call any of these endpoints, as the GraphQL frameworks running the API would not be aware of those endpoints. Therefore, even in these cases, such issues would be easily detected by users without the need of using any fuzzer.
This means that for GraphQL APIs, in contrast to RESTful APIs (where typically the framework servers do not validate the schemas), problems in the schemas do not seem to be as serious for testing purposes. However, more research will be needed to evaluate this potential issue in more detail.

Problem Representation
Once the schema of the tested API is fetched, it is parsed in our EvoMaster extension and used to create a set of action templates, one for each query and mutation operation. Each action contains information on the fields related to input arguments (if any are present) and return values. A chromosome template is defined for each action, composed of non-mutable information (e.g., the field names) and a set of mutable genes. In this context, each gene characterizes either an argument or a return value in the GraphQL query/mutation.
For objects as return values, a query/mutation must specify which fields should be returned (at least one must be selected), and so on recursively if any of the selected fields are objects as well.
To represent the fact that a field is always optional for queries, a return gene is modeled by an object gene where all its fields are optional. However, we had to extend the mutation operator in EvoMaster with a post-processing phase, to guarantee that at least one field gene is selected during the search. In other words, if after a mutation of a gene that represents a returned object value in the GraphQL query/mutation all fields are deselected, then the post-processing will force the selection of one of them (and so on recursively if the selected field is an object itself). However, if a return value is a primitive type, then there is no need to create any gene for it, as there is no selection to make. Furthermore, similar to function calls, fields in the returned value can have input arguments themselves. When a returned value for a parent field is executed, both input arguments and the returned value are recursively selected to generate a child field value until a scalar value is reached, whether in the input arguments or in the returned values. To model those function calls, we introduced a new special type of gene called Tuple, discussed next.
To fully represent what is available from the GraphQL specification, the following kinds of gene types from EvoMaster have been reused and adapted:
(1) String: This gene contains string variables that are defined by an array of characters. The minimum length of a string is zero, which represents the empty string. Each string gene cannot exceed a pre-defined maximum number of characters (e.g., 100).
(2) Enum: This gene represents the enumeration type, where a set of possible values is defined, and only one value is activated at a given time. The elements in the set can be in different formats (e.g., enumerations of numbers or enumerations of strings).
(3) Float/Integer/Boolean: These are genes representing variables with simple data types. Boolean genes represent variables with true or false values. Integer and float genes represent integer and real-value variables, respectively.
(4) Array: This gene represents a sequence of genes with the same type. This gene has variable length, where elements can be added and removed throughout the search. To avoid creating overly large test cases (e.g., with millions of genes), the size of an array gene should not exceed a given threshold.
(5) Object: This gene defines an object with a specific set of internal fields. Unlike the array gene, whose elements must all have the same type, an object gene may contain elements with different types. To do so, this gene is represented by a map, where each key in the map is determined by the field name of each element in the object.
(6) Optional: This is a gene containing another gene, whose presence in the phenotype is controlled by a Boolean value. This is needed, for example, to represent nullable types in arguments and the selection of fields in returned objects.
(7) CycleObject: This special gene is used as a placeholder to avoid infinite cycles when selecting object fields that are objects themselves, which could be references back to the starting queried object. Once a test case is sampled, its gene tree structure is scanned, and all CycleObject genes are forced to be excluded from the phenotype (e.g., if inside an Optional gene, that gene gets marked as non-selected, and the mutation operator is prevented from selecting it; if the CycleObject is the type for an array gene, such array gets a fixed size of 0, and the mutation operator is prevented from adding new elements to it).
(8) LimitObject: GraphQL schemas are often very large and complex, and the levels of nesting fields can be potentially huge. We use this special gene as a placeholder when a customized depth limit is reached. The depth is the number of nesting levels of the object fields.
(9) Tuple: This gene is needed, for example, when representing the inputs of function calls. It is composed of a list of elements of possibly different types, where the last element can be treated specially. For example, this is the case of function calls when the return type is an object, on which we need to select what to retrieve (and these selected elements could be function calls as well, and therefore this is handled recursively).
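As an illustration only, the post-processing step that guarantees at least one selected field in a returned object could be sketched in Python as follows. This is not EvoMaster's implementation (which is written in Kotlin): the classes BooleanGene, OptionalGene, and ObjectGene and the function repair_selection are hypothetical names introduced for this sketch, and array, tuple, cycle, and limit genes are deliberately omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass
class BooleanGene:
    """Leaf field of a returned object: True means the field is selected."""
    value: bool = False


@dataclass
class ObjectGene:
    """Returned object: maps each field name to the gene representing it."""
    fields: Dict[str, Union["BooleanGene", "OptionalGene"]] = field(default_factory=dict)


@dataclass
class OptionalGene:
    """Wraps an object-valued field whose presence is controlled by a flag."""
    gene: ObjectGene
    active: bool = False


def is_selected(gene) -> bool:
    """A leaf is selected if its value is True; an optional object if active."""
    if isinstance(gene, BooleanGene):
        return gene.value
    return gene.active


def repair_selection(obj: ObjectGene) -> None:
    """Force at least one selected field, recursively for nested objects."""
    if not obj.fields:
        return
    if not any(is_selected(g) for g in obj.fields.values()):
        # Assumption of this sketch: pick the first field deterministically;
        # a search-based tool could pick one at random instead.
        first = next(iter(obj.fields.values()))
        if isinstance(first, BooleanGene):
            first.value = True
        else:
            first.active = True
    for g in obj.fields.values():
        if isinstance(g, OptionalGene) and g.active:
            repair_selection(g.gene)


# Example: after a mutation, every field is deselected.
inner = ObjectGene({"id": BooleanGene(), "name": BooleanGene()})
outer = ObjectGene({"owner": OptionalGene(inner)})
repair_selection(outer)
```

After the call, the nested object owner is active and its first leaf field is selected, so the phenotype is again a valid selection.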
To make this discussion clearer, let us consider a small, simplified portion of the schema of GitLab (one of the SUTs used in our empirical study in Section 5). To send the query to the server, the user must follow the preceding representation of the schema. For instance, to query the field fullName, the user might send the following query. This query is syntactically valid, conforming to the schema represented previously. But it is not the only possible query conforming to such schema. A user could rather send a permissionScope with value TRANSFER_PROJECTS, or simply such optional input could be avoided altogether. So, a genotype representation needs to be able to express all possible queries that are valid for the given schema.
The tree in Figure 3 shows a genotype structure (seen as a tree of genes) for such schema, using the previously discussed gene system used in our framework. For example, the action representing the GraphQL query currentUser has the object gene UserCore, which contains an optional tuple field groups. When calling currentUser, one needs to specify which fields of the returned object UserCore to include in the response. In this particular example, there is only one field called groups. Since the choice of which fields to return is optional, each of these fields is placed inside an optional gene. If an optional gene is deactivated, none of its internal genes is used in the phenotype of the test case. The field groups is itself a remote function call. It is represented with a tuple gene, having an input argument permissionScope and return value GroupConnection. The argument is represented as an optional gene containing an enum gene (for the two possible values defined in the schema). The return value GroupConnection is represented with an object gene. For each field of an object, we need a gene to represent it. A field can be yet another object, or an array of them, like the case of nodes. So, this process is applied recursively. The non-method/non-object fields are represented with a Boolean gene (to check whether it will be part of the returned object or not), like the case of fullName. In this simplified example, the choices that need to be made are, for example, whether the optional genes should be active or not, the Boolean values of the Boolean genes, and the values for the enum genes. The evolutionary process will make modifications to these values throughout the search. From this genotype, the phenotype will then represent syntactically valid queries.
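The mapping from such a genotype to its phenotype can be sketched as follows. This is a simplified Python illustration under our own assumptions, not the mapping function used in EvoMaster: the genotype is encoded as nested dictionaries (a hypothetical encoding), and the functions render_selection and render_query are names introduced here. The field and argument names come from the GitLab example above.

```python
def render_selection(selection: dict) -> str:
    """Render the selected fields of an object as a GraphQL selection set.

    `selection` maps a field name either to a bool (leaf field, selected or
    not) or to a dict with keys "active", "args", and "fields" (a tuple-like
    field with optional arguments and an object as return value).
    """
    parts = []
    for name, gene in selection.items():
        if isinstance(gene, bool):
            if gene:
                parts.append(name)
        elif gene["active"]:
            # An argument set to None models a deactivated optional gene.
            args = {k: v for k, v in gene.get("args", {}).items() if v is not None}
            arg_text = ""
            if args:
                # Enum values are printed unquoted; string arguments would
                # need quoting, which this sketch does not handle.
                arg_text = "(" + ", ".join(f"{k}: {v}" for k, v in args.items()) + ")"
            parts.append(f"{name}{arg_text} {render_selection(gene['fields'])}")
    return "{ " + " ".join(parts) + " }"


def render_query(entry_point: str, selection: dict) -> str:
    """Build the full query text for one query entry point."""
    return "{ " + entry_point + " " + render_selection(selection) + " }"


genotype = {
    "groups": {
        "active": True,
        "args": {"permissionScope": "TRANSFER_PROJECTS"},
        "fields": {
            "nodes": {"active": True, "args": {}, "fields": {"fullName": True}},
        },
    }
}
query = render_query("currentUser", genotype)
```

Setting permissionScope to None in the same genotype yields the variant of the query without the optional argument, which shows how one gene tree covers several valid phenotypes.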
For this subset of the schema, the search space of all possible queries is small. However, it would increase exponentially when dealing with more inputs, particularly for strings.
To fully support the whole specification of GraphQL, there are several special cases that need to be handled, like the use of interfaces. To deal with GraphQL interfaces, we use an optional object gene for each type that implements the interface, together with an extra optional object gene (labeled with BASE) to specify the interface fields themselves. Consider the following portion of the schema from digitransit (one of the SUTs used in our empirical study in Section 5). Here, the interface Node has two possible implementations: Trip and TicketType. A user can query different fields based on the concrete types of the returned objects. For example, assume querying the fields id, tripHeadsign, and price, as described next. The tree in Figure 4 shows the genotype for this user query containing interfaces, based on the gene system used in our framework. Here, whether to query the different concrete types of the interfaces, and their fields, is optional. Indeed, a genotype representation needs to be able to express all these possible kinds of valid phenotypes.
After defining the possible types of genes supported by the proposed framework, we consider the solution space, where each solution is a set of test cases. A test case is composed of one or more HTTP requests. To represent an HTTP request, we typically need to deal with its components: HTTP verb, path and query parameters, body payloads (if any), and headers.
A GraphQL request can be sent via HTTP GET (used only for queries) or HTTP POST methods with a JSON body (used for queries and mutations). For simplicity, we only use the verb POST for both queries and mutations. A GraphQL server uses a single URL endpoint (typically /graphql), where the HTTP requests with the GraphQL queries/mutations will be sent. In the context of test generation for a GraphQL API, the main decisions to make are on how to create JSON body payloads to send. The genotype will contain genes (from the set defined previously) to represent and evolve such JSON objects.
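To make the request format concrete, the following Python sketch wraps a query string into an HTTP POST request with a JSON body. It uses only the standard library; the URL is a hypothetical local endpoint, the function build_graphql_request is a name introduced here, and the request is only constructed, not sent. Placing the query text under the JSON key "query" follows the common GraphQL-over-HTTP convention.

```python
import json
import urllib.request


def build_graphql_request(url: str, query: str) -> urllib.request.Request:
    """Wrap a GraphQL query/mutation string in an HTTP POST with a JSON body."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical endpoint of a locally running API; nothing is sent here.
request = build_graphql_request(
    "http://localhost:8080/graphql",
    "{ pets { id name } }",
)
```

Because the verb, URL, and headers are fixed, the only part of the request that a search algorithm needs to evolve is the query text carried in the body.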
In Figure 5, there is an example of a test case generated automatically by EvoMaster for the petclinic API, outputted in JUnit format. It is composed of two HTTP POST requests. The first call has a body payload querying the entry point specialities, and the second queries the entry point owner. When a test case is generated and evaluated, we also provide assertions on the returned responses.
The test cases are generated in a random way, but they are still syntactically valid. For instance, if we consider the example illustrated in Figure 1(a), the test cases are generated by exploring the fields of the pets node. Suppose the field "id" is an integer represented in 32 bits. The number of possible values for the field "id" is then 2^32. We also explore different combinations of two or more fields in each node. For instance, considering the same example illustrated in Figure 1(a), the test cases might be generated from both fields "id" and "name" of the node pets. If the length of the string is limited to 10, the number of possible values for the field "name" is 2^160 (assuming each character is 2 bytes, i.e., 10 × 16 = 160 bits). Therefore, the number of possible test cases when only exploring the fields "id" and "name" is 2^32 × 2^160 = 2^192, which results in an immense search space. Therefore, in our implementation, and to mitigate the combinatorial explosion, we use a threshold to limit the number of generated test cases that can be evaluated (i.e., we limit the number of test cases we sample during RS).

Search Operators
Once a chromosome representation is defined based on the GraphQL schema, test cases are evolved and evaluated in the same way as done for RESTful APIs in EvoMaster (recall Section 2.2), including testability transformations [25] and SQL database handling [24]. Internally, the search algorithms in EvoMaster are implemented in a generic way, independently of the addressed problem (e.g., REST and GraphQL APIs), and it is only a matter of defining an appropriate phenotype mapping function (e.g., how to create a valid HTTP request for a GraphQL API based on the evolved chromosome genotype).
As stated previously, an evolving individual will be a set of actions (i.e., calls to query and mutation endpoints) on the tested API. Each action is represented with a gene tree template (e.g., recall examples in Figures 3 and 4), which needs to be instantiated (i.e., set the values of the genes). As part of the search, there are three main search operators: (1) random sampling, (2) mutation on the structure of the tests, and (3) mutation on the content of an action. Note that the term mutation in the context of GraphQL APIs (used to represent an endpoint in the API that can modify its state) has nothing to do with the term mutation used in the evolutionary computation literature (used to represent search operators that do small changes in the evolving individuals).
When sampling a new individual at random (e.g., needed for RS, as well as for evolutionary algorithms when they need to initialize their first population of individuals to evolve), first there is the need to choose how many actions K it contains. For example, it can be randomly chosen between 1 and N (e.g., where N = 10). Given A, the set of possible action templates (one for each query/mutation endpoint in the API), each of these K actions in the sampled test will be chosen randomly from A. Then, the content of each gene in such trees is set at random (considering their types and constraints).
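The sampling procedure just described can be sketched in Python as follows. This is a minimal illustration, not EvoMaster code: the function sample_individual, the dictionary-based action encoding, and the two example templates are hypothetical and introduced only for this sketch.

```python
import random
from typing import Callable, Dict, List


def sample_individual(
    templates: Dict[str, Callable[[random.Random], dict]],
    max_actions: int,
    rng: random.Random,
) -> List[dict]:
    """Sample a test case: K actions, 1 <= K <= N, each drawn at random from A."""
    k = rng.randint(1, max_actions)  # randint is inclusive on both ends
    names = sorted(templates)
    individual = []
    for _ in range(k):
        name = rng.choice(names)
        # Instantiate the template by setting the value of each gene at random.
        individual.append({"action": name, "genes": templates[name](rng)})
    return individual


# Hypothetical action templates for two query endpoints. A complete
# implementation would also repair selections in which no field is chosen.
templates = {
    "pets": lambda rng: {"id": rng.random() < 0.5, "name": rng.random() < 0.5},
    "owner": lambda rng: {"idArgument": rng.randint(-(2**31), 2**31 - 1)},
}
test_case = sample_individual(templates, max_actions=10, rng=random.Random(42))
```

Passing an explicit random.Random instance makes runs reproducible for a given seed, which is convenient when experiments are repeated many times.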
Mutation operators are used to do small changes to an evolving individual. The structure of the test can be modified by removing an action from the current K (if K > 1) or by adding a new random action from A (if K < N). This can be applied with a given probability P.
The content of an action a can be modified by selecting one of the K actions and then randomly selecting genes from its tree. Given G_a, the set of genes in the selected action a, each gene could be mutated with probability 1/|G_a|. The type of mutation depends on the type of the gene. For example, a numeric gene could have its phenotype value increased or decreased by a certain small delta. A Boolean gene could be flipped from true to false and vice versa. A string gene could have some of its characters modified randomly. And so on (full details can be found in the source code of EvoMaster [31]).
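The two kinds of mutation operators can be sketched in Python as follows. This is a simplified illustration under our own assumptions, not the operators implemented in EvoMaster: genes are plain Python values in a flat dictionary, the delta for integers is fixed to 1, and the function names are hypothetical.

```python
import random


def mutate_structure(test, sample_action, max_actions, rng):
    """Remove one action (if K > 1) or add a random one (if K < N)."""
    test = list(test)
    can_remove = len(test) > 1
    can_add = len(test) < max_actions
    if can_remove and (not can_add or rng.random() < 0.5):
        del test[rng.randrange(len(test))]
    elif can_add:
        test.insert(rng.randint(0, len(test)), sample_action(rng))
    return test


def mutate_gene(value, rng):
    """Type-dependent mutation of a single gene value."""
    if isinstance(value, bool):  # checked before int: bool is a subclass of int
        return not value
    if isinstance(value, int):
        return value + rng.choice([-1, 1])  # small delta, fixed to 1 here
    if isinstance(value, str):
        if not value:
            return rng.choice("abc")
        i = rng.randrange(len(value))
        return value[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz") + value[i + 1:]
    return value


def mutate_content(test, rng):
    """Pick one action at random; mutate each gene with probability 1/|G_a|."""
    test = [dict(action) for action in test]
    action = rng.choice(test)
    genes = dict(action["genes"])
    action["genes"] = genes
    if not genes:
        return test
    p = 1.0 / len(genes)
    for name in genes:
        if rng.random() < p:
            genes[name] = mutate_gene(genes[name], rng)
    return test
```

With probability 1/|G_a| per gene, one gene is mutated on average per application, although a single application may change zero or several genes.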
Consider the last example in Figure 5, where the mutation operators are applied as shown next.

Fitness Function
The fitness function plays a critical role in an evolutionary algorithm, as it specifies which individuals will survive and reproduce. The main goal of our testing is to find faults in the tested API. A fault cannot manifest if the code in which it lies is not executed. Therefore, an indirect approach to try to find more faults is to maximize the code coverage achieved by the generated tests. However, generating high code coverage tests is a complex task, as the execution flow in the API might depend on complex constraints (e.g., complex predicates in if statements), which could be satisfied only with very specific inputs.
There is a large body of research literature on the topic of maximizing code coverage for software testing. In the case of search-based software testing [13], there are common techniques like branch distance [59] and testability transformations [51]. For the work in this article, we do not define any new white-box heuristics. We rather rely on the state-of-the-art white-box heuristics for system test generation provided by EvoMaster. This includes advanced testability transformations [26], as well as different types of branch distance heuristics. All evolutionary algorithms compared in this work use the same fitness function.
Besides testing targets based on the source code (e.g., line, statements, and branches), there are other metrics of interest for practitioners. For example, for Web APIs using HTTP, covering different returned HTTP status codes can provide a better coverage of the API. For example, you can make a correct query and receive a 200 status code in an HTTP response. For the same endpoint, you could send an invalid input (e.g., a number outside a specified range), which could lead the server to return a 400 status code (user error), although this depends on the server implementation (some GraphQL servers return 200 even in the case of errors). A request with no authentication information could return a 401. A request with authentication but no authorization (i.e., no right permissions) could return a 403. An input that leads to a crash (e.g., an exception thrown in the business logic of the API) could result in a 500 status code. And so on.
For each GraphQL endpoint, we create a different testing target for each returned HTTP status code. This prevents EvoMaster from discarding newly generated tests that cover endpoints returning different status codes (and thus showing different behaviors of the API).
When evaluating the fitness of an evolved test, besides considering testing targets related to code coverage and HTTP status coverage (for each different query/mutation operation), we also create new testing targets based on the returned responses. As discussed in Section 2.1, each response could contain either a data field or an errors field. For each query and mutation in the GraphQL schema, we consider two additional testing targets for those two possible outcomes. Note that a trivial way to get a response with errors is to send a syntactically invalid query. As such evolved test cases would be of little use, we explicitly avoid generating such test cases (unless there are faults in EvoMaster).
As automated oracles to detect faults once a test is executed, we consider two properties: returned HTTP status code 500 and responses with errors fields. The former is a common oracle used in fuzzing HTTP-based APIs (e.g., [58,61]). However, it is important to keep in mind that not all 500 responses are necessarily related to software faults. For example, an API could return a 500 when unable to communicate with its database because it is down (i.e., not reachable for some technical reason). As errors fields might be due to user errors besides server errors, the users would still need to check those generated tests to see if actual faults are detected.
As a given query/mutation might fail for different reasons, we keep track of the last executed line in the business logic of the SUT. We further create a separate testing target for each combination of errored query/mutation and last executed line. Having explicit testing targets for those cases enables the search algorithms to keep those test cases, although the fitness function would (currently) have no gradient to guide the generation of such test cases in the first place.
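The response-based testing targets and the two automated oracles can be summarized with the following Python sketch. It is an illustration under our own assumptions: the target identifiers are a hypothetical naming scheme, the functions covered_targets and is_potential_fault are not part of EvoMaster, and code coverage targets are left out.

```python
from typing import Optional, Set


def covered_targets(
    endpoint: str,
    status_code: int,
    response_body: dict,
    last_executed_line: Optional[str] = None,
) -> Set[str]:
    """Derive the response-based testing targets covered by one call."""
    # One target per endpoint and returned HTTP status code.
    targets = {f"{endpoint}:status:{status_code}"}
    if response_body.get("errors"):
        # One target for an errors outcome on this endpoint ...
        targets.add(f"{endpoint}:errors")
        # ... and one per combination with the last executed line, which is
        # only known in white-box mode.
        if last_executed_line is not None:
            targets.add(f"{endpoint}:errors:{last_executed_line}")
    elif "data" in response_body:
        targets.add(f"{endpoint}:data")
    return targets


def is_potential_fault(status_code: int, response_body: dict) -> bool:
    """Automated oracle: HTTP 500 or a response with an errors field."""
    return status_code == 500 or bool(response_body.get("errors"))
```

In black-box mode, the argument last_executed_line is simply left as None, so only the status code and data/errors targets remain.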

Output Test Suite
Depending on how long the search is left running, millions of test cases could be evolved and evaluated. Huge test suites would be of little use for practitioners, as they are not manageable. For this reason, at the end of the search, only a minimized test suite is given as output to the user. Each single test case contributes to the overall fitness of the output test suite (e.g., code and HTTP status coverage, detected faults). In particular, in EvoMaster, for each search algorithm we employ an archive, which is updated each time a new test case covers a new testing target. The output test suite is based on what is stored in such archive during the search.
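The role of the archive can be illustrated with the following deliberately simplified Python sketch. It is not the archive used by EvoMaster's search algorithms, which is more elaborate: here, the hypothetical class Archive just remembers, for each testing target, the first test case that covered it.

```python
from typing import Dict, List, Set


class Archive:
    """Keeps, for each testing target, the first test case that covered it."""

    def __init__(self) -> None:
        self._covered_by: Dict[str, str] = {}

    def update(self, test_id: str, targets: Set[str]) -> bool:
        """Store the test if it covers at least one new target."""
        new_targets = targets - set(self._covered_by)
        for target in new_targets:
            self._covered_by[target] = test_id
        return bool(new_targets)

    def minimized_suite(self) -> List[str]:
        """Return only the tests that contribute to the overall coverage."""
        return sorted(set(self._covered_by.values()))


archive = Archive()
archive.update("test_1", {"pets:status:200", "pets:data"})
archive.update("test_2", {"pets:status:200"})  # nothing new: not kept
archive.update("test_3", {"pets:status:200", "pets:errors"})
suite = archive.minimized_suite()
```

In this example, test_2 is dropped from the output because every target it covers was already covered by test_1.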
As the final output, we generate executable test cases in common formats such as JUnit (for Java/Kotlin) and Jest (for JavaScript), as shown in Figure 5. Each test case will have assertions on the obtained responses; that is, they capture the current behavior of the API. If, for example, a JSON object is returned, assertions will be recursively created for each of its fields. These assertions do not directly detect faults, as without a formal specification we cannot know what the expected outputs are for the used inputs. However, these assertions can be used for regression testing (i.e., to check if the behavior of the API changes with new updates). Furthermore, as the output test suites are minimized, these assertions could be useful as well to point out possible faults if the users manually review the generated tests.
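The recursive creation of assertions from a returned JSON object can be sketched in Python as follows. The sketch only computes (path, expected value) pairs; how such pairs are rendered as JUnit or Jest assertions is not shown. The function collect_assertions, the path syntax, and the sample response are assumptions made for this illustration.

```python
from typing import Any, List, Tuple


def collect_assertions(value: Any, path: str = "") -> List[Tuple[str, Any]]:
    """Flatten a JSON value into (path, expected) pairs, one per leaf.

    For arrays, an extra pair on the array size is emitted as well.
    """
    if isinstance(value, dict):
        pairs = []
        for key, child in value.items():
            child_path = f"{path}.{key}" if path else key
            pairs.extend(collect_assertions(child, child_path))
        return pairs
    if isinstance(value, list):
        pairs = [(f"{path}.size()", len(value))]
        for index, child in enumerate(value):
            pairs.extend(collect_assertions(child, f"{path}[{index}]"))
        return pairs
    return [(path, value)]


# Hypothetical response of a query on pets.
response = {"data": {"pets": [{"id": 1, "name": "Leo"}]}}
assertions = collect_assertions(response)
```

Each pair records the behavior observed when the test was generated, which is why such assertions serve regression testing rather than fault detection.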

Black-Box Testing
We use black-box testing when we do not have any knowledge about the source code of the GraphQL API or it is not available for instrumentation (e.g., to calculate the search-based heuristics like the branch distance). It is not straightforward to get a high coverage value for such tests [20], as little information from the SUT can be exploited. However, in some cases (e.g., when testing remote services), a black-box approach might be the only option available for automated testing.
In addition, as no code analysis is performed, black-box testing can be applied regardless of the programming language the API is written in, such as Python and Ruby. However, currently, for white-box testing with EvoMaster we are limited to languages running on the JVM (e.g., Java and Kotlin) and NodeJS (e.g., JavaScript and TypeScript).
We use the RS algorithm developed in EvoMaster. The main idea behind RS is to generate the test cases through a randomized process, where no code-based fitness function is employed. Search-based heuristics are not used because the source code of the GraphQL APIs is not available.
From a practical standpoint, our black-box testing is the same as RS but without code-based heuristics. Like for white-box testing, we start by fetching the schema (Section 4.2) and create a problem representation (Section 4.3) from which new test cases are randomly sampled (Section 4.4). No evolutionary mutation operator is applied here. We use the same fitness function to reward found faults (Section 4.5) but without any code metrics. At the end of the search, the final test suite is minimized, to contain only the test cases that contribute to the fitness (Section 4.6). In other words, for each query/mutation, we retain test cases that lead to different HTTP status codes, and at least one with a correct data response and at least one with an errors response.
Both black-box and white-box testing share the same goal of detecting faults in the tested APIs. They use the same automated oracles to detect faults. Both testing approaches are important, as they have their own strengths and weaknesses. For example, black-box testing is easier to use (e.g., it requires no setup to specify how to start the application with automated instrumentation), and it is of wider applicability (e.g., it is not restricted to any specific programming language). However, white-box testing can achieve better results (i.e., code coverage and fault finding), as it can exploit information about the source code of the API. Furthermore, its generated tests can be used for regression testing (as the generated tests can start, stop, and reset the API).
In this article, we provide and empirically evaluate both approaches, as both of them are useful for practitioners in the industry. Considering that, to the best of our knowledge, this is the first work in the literature addressing this problem, more can be done in future research. For example, our black-box approach is very basic, simply an RS on syntactically valid queries based on the schema.

Tool Support
All the novel techniques presented in this work have been implemented as part of our existing tool EvoMaster. EvoMaster is open source on GitHub, with each new release automatically uploaded to Zenodo for long-term storage (e.g., [31]).
When a practitioner uses EvoMaster, they need to specify with command-line options whether they are testing a REST or GraphQL API.For example, black-box testing of an online API such as GitLab can be done on the command line as shown next.
Here, one needs to specify that we are fuzzing a GraphQL API (using --problemType) and not, for example, a RESTful one, where the API is located (--bbTargetUrl), the type of testing (--blackBox), the format of the output tests (--outputFormat), for how long to run the fuzzing session (--maxTime), and a rate limiter (--ratePerMinute) to avoid overloading the tested API with requests (needed when testing APIs on the Internet to avoid denial of service). For doing white-box testing, some manual effort is needed, as there is the need to implement a driver class to specify how to start and stop the API.
Extending an existing fuzzer for a new problem domain not only requires scientific research but also significant engineering effort. What is presented in this article took 2 years of work. Considering the complexity of EvoMaster (which is currently more than 200,000 LOCs, not including tests), providing precise code metrics is not viable. Although modules specific for GraphQL can be identified (e.g., org.evomaster.core.problem.graphql with more than 4,000 lines of Kotlin code), changes were needed throughout the whole code base of EvoMaster to be able to support GraphQL. For example, the gene system of the evolutionary engine of EvoMaster needed to be extended with new genes like TupleGene. We can estimate around 10,000 to 15,000 LOCs needed to support GraphQL API testing.
To reduce the risk of publishing wrong results based on faulty software, this work has been carefully tested. For example, in unit tests (e.g., GraphQLUtilsTest), we parse 75 GraphQL schemas (having more than 860,000 lines), to make sure that our schema analysis algorithms do not crash and give the correct results (at least for those 75 schemas). Furthermore, EvoMaster has a sophisticated system of end-to-end tests [30]. We create several artificial APIs, run EvoMaster on them, compile the generated tests, run them, and verify properties on those tests. This is all done automatically from JUnit tests (including the compilation and dynamic loading and execution of the newly generated tests on the fly), and run in a Continuous Integration system (i.e., GitHub Actions) at each new Git commit (more details can be found in the work of Arcuri et al. [30]). Due to all these end-to-end tests, the current EvoMaster build takes more than 2 hours. For GraphQL, we currently have end-to-end tests for 39 artificial APIs (in the module spring-graphql), covering different aspects of GraphQL, for a total of more than 6,000 LOCs.

EMPIRICAL STUDY

Experimental Setup
In this section, we carry out several experiments to validate the applicability of the proposed framework for GraphQL test generation. This is achieved by answering the following three research questions:
RQ1: For white-box testing of GraphQL APIs, how effective are evolutionary algorithms at maximizing code coverage and fault detection compared to RS?
RQ2: How does black-box testing fare on existing APIs on the Internet?
RQ3: What kinds of faults are found by our novel technique?

White-Box.
GitHub [7], arguably the main repository for open source projects, was used to find SUTs for experimentation. JVM and NodeJS projects were scanned and filtered while excluding trivial projects. For example, we excluded APIs with fewer than 500 LOCs and student projects. For this study, seven GraphQL web services were selected, which we could compile and run with no problems:
• The Spring petclinic [10] API (4,567 LOCs) is an animal clinic where a pet owner can register their pet for an examination. The examination is carried out by a veterinarian who has one or more specialist areas.
• patio-api [9] (12,552 LOCs) is a web application that attempts to estimate the happiness of a given team periodically by asking for a level of happiness.
• graphql-ncs (548 LOCs) and graphql-scs (577 LOCs) are based on artificial RESTful APIs from an existing benchmark [6]. For this study, we adapted these two APIs into GraphQL APIs. graphql-ncs and graphql-scs are based on code that was designed for studying unit testing approaches on solving numerical [21] and string [14] problems.

• react-finland [11] (16,206 LOCs) is an API for a week-long developer conference focused on React.js and related technologies.
• timbuctoo [12] (85,365 LOCs) is an API that allows scientists to decide how data from different databases is shared.
• e-commerce [4] (1,791 LOCs) is an e-commerce API built on Phoenix and Elixir that can be utilized to create interactive e-commerce web applications.
To the best of our knowledge, there is no other existing white-box fuzzer that can be used to test GraphQL APIs. Therefore, in this article, we cannot compare with any existing technique, as none is available. White-box fuzzing GraphQL APIs is a novel contribution of this work. Still, it is important to verify whether a novel sophisticated technique is really warranted, and no simpler technique would be already as effective [13]. When nothing else is available, a common baseline in software testing research is random testing [28], in which an application is tested with random inputs. Still, sending random bytes on the TCP connection the SUT is listening on would be of little to no value, as the chances of generating a valid GraphQL query (or even simply a valid HTTP request) would be virtually non-existent. Therefore, for doing random testing, we still sample and send syntactically valid GraphQL queries based on the schema of the SUT.
Once a software engineering problem is modeled as an optimization/search problem (e.g., by specifying the problem representation and the fitness function), different search algorithms can be applied and evaluated. But no search algorithm is best on all problems [76]. To improve performance on a specific problem, research is needed to customize the algorithms to exploit as much domain knowledge of this problem as possible. In the specific case of white-box test suite generation, the most used algorithms are WTS (Section 2.4), MOSA (Section 2.5), and MIO (Section 2.6). MOSA [64] replaced WTS [47] as the default search algorithm in EvoSuite [46] (which is the most famous search-based tool for unit test generation), based on large empirical studies comparing many different search algorithms [37,65]. However, for the system test generation of RESTful APIs, MIO [18] provided the best results in search algorithm comparisons [18]. As the testing of GraphQL APIs with search algorithms is a novel contribution of this work that has not been done before in the research literature, in this work we apply and compare the three most common search algorithms for test suite generation (i.e., WTS, MOSA, and MIO).
For the experiments, we set 1 hour as the search budget for our white-box testing approach. To take into account the randomness of the algorithms, each experiment was repeated 30 times [23]. In total, these experiments took 7 × 4 × 30 = 840 hours (i.e., 35 days). Experiments were run in parallel (15 at a time) on the same hardware: an HP Z6 G4 Workstation with an Intel Xeon Gold 6240R processor (24 cores, 2.40 GHz), 192 GB of RAM, and 64-bit Windows 10.
To evaluate and compare the effectiveness of the employed algorithms, we selected covered testing targets (#Targets), line coverage (%Lines), and the number of detected faults (#Errors) as metrics for comparisons. The testing target (#Targets) is the default coverage criterion in EvoMaster. It comprises and aggregates different metrics, such as code coverage (including branch coverage), HTTP status code coverage, and fault findings. The line coverage (%Lines) is collected as part of our code instrumentation. Furthermore, we also report #Errors by identifying potential faults, that is, 500 HTTP status codes and responses with errors entries (recall Section 4.5).

Black-Box.
To evaluate the black-box testing, 31 online APIs with different domain applications and different numbers of endpoints were selected from apis.guru [2], a curated public listing of available web services on the Internet. These APIs are written in different programming languages, such as JavaScript and Python. Some APIs provide their implementation (e.g., open source), whereas others do not (e.g., commercial services). When an API required authentication, we created an account on these APIs and added the right authentication information to the HTTP requests.
We ran our extension of EvoMaster on all those APIs in black-box mode. The stopping criterion was set to 1,000 HTTP calls per run. Each experiment was run only three times, since sending thousands and thousands of HTTP calls to live services could be interpreted as a denial-of-service attack. For the same reason, we put a rate limiter of at most 10 HTTP requests per minute (i.e., EvoMaster would make HTTP calls only every 6 seconds). In total, these experiments took 3 × 31 × (1,000/10) = 9,300 minutes (i.e., approximately 6.5 days). They were run in parallel in the same way as for the white-box experiments.
As we do not have any control on these remote APIs, repeating the experiments more times would not add much more information, as such runs would not be fully independent.
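The budget arithmetic above can be checked with a short Java sketch. It is illustrative only: the class and method names are hypothetical, and it assumes the rate limiter spaces requests evenly, which is one simple way to respect a per-minute cap.

```java
/** Hypothetical helper reproducing the black-box budget arithmetic; not EvoMaster code. */
final class BlackBoxBudgetSketch {

    /** Minimum delay between two calls so that a per-minute cap is never exceeded. */
    static long minIntervalMillis(int maxRequestsPerMinute) {
        return 60_000L / maxRequestsPerMinute;
    }

    /** Total time, in minutes, spent sending calls at the maximum allowed rate. */
    static long totalMinutes(int repetitions, int apis, int callsPerRun, int maxRequestsPerMinute) {
        return (long) repetitions * apis * callsPerRun / maxRequestsPerMinute;
    }

    public static void main(String[] args) {
        System.out.println(minIntervalMillis(10));         // 6000 ms, i.e., one call every 6 seconds
        long minutes = totalMinutes(3, 31, 1000, 10);
        System.out.println(minutes);                       // 9300 minutes
        System.out.println(minutes / (60.0 * 24.0));       // the same budget expressed in days
    }
}
```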

Results for RQ1.
To compare MIO, MOSA, WTS, and RS, Table 2 reports the average #Targets, %Lines, and #Errors obtained by each search algorithm on each of the case studies. Overall, on average, MIO provides the best results. However, on two out of seven APIs (i.e., ecommerce-server and petclinic), it provides worse results. A detailed analysis of MIO is provided with pairwise comparisons using Mann-Whitney-Wilcoxon U-tests (p-value) and Vargha-Delaney effect sizes (Â12), when compared with RS (Table 3), WTS (Table 4), and MOSA (Table 5).
Note that no data was collected for WTS on react-finland. On this API, WTS crashed due to running out of memory in every single experiment.
When looking at the achieved target coverage, there is a clear improvement of white-box evolutionary search (e.g., MIO) compared to RS (see Table 3). On five out of seven APIs, the effect size is maximum (i.e., Â12 = 1). This means that in each of the 30 runs of MIO, the results were better than the best run of RS.
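To make the interpretation of Â12 concrete, the following Java sketch computes the standard Vargha-Delaney statistic: the fraction of pairs in which a value from the first sample exceeds a value from the second, with ties counted as one half. The class name and the sample data are hypothetical, and this is not the analysis script used for Tables 3 to 5.

```java
/** Hypothetical helper computing the Vargha-Delaney A12 effect size. */
final class VarghaDelaneySketch {

    /**
     * Estimates the probability that a value drawn from x is larger than a
     * value drawn from y, counting ties as one half.
     */
    static double a12(double[] x, double[] y) {
        if (x.length == 0 || y.length == 0) {
            throw new IllegalArgumentException("both samples must be non-empty");
        }
        double score = 0.0;
        for (double xi : x) {
            for (double yj : y) {
                if (xi > yj) {
                    score += 1.0;   // x wins this pair
                } else if (xi == yj) {
                    score += 0.5;   // tie
                }
            }
        }
        return score / ((double) x.length * y.length);
    }

    public static void main(String[] args) {
        // Made-up numbers: every run of the first technique beats every run of the second.
        double[] first = {5, 6, 7};
        double[] second = {1, 2, 3};
        System.out.println(a12(first, second)); // 1.0
    }
}
```

A value of 0.5 indicates no difference between the two samples, whereas 1 is reached only when the worst value of the first sample is still better than the best value of the second.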
When looking at only line coverage, we can see that MIO enables covering 90.3% of lines in graphql-ncs. This is a large improvement compared to the 59.5% of RS (i.e., +30.8%). On average, among the SUTs, the improvement is +7% (i.e., from 37.1% to 44.1%). When looking at the absolute values for line coverage, we need to point out an issue with collecting coverage for NodeJS applications (i.e., ecommerce-server and react-finland), as coverage computation does not consider what was achieved during boot time of the API. When looking at line coverage alone, there are statistically worse results for petclinic and ecommerce-server. However, the differences are minimal (i.e., at most -0.4%). For instance, petclinic is a simple API used for demonstration, where large parts of its code are not executed (e.g., it has three different implementations of its data layer, where only one can be active at a time, and this has to be specified in a configuration file when the API is started). On simple problems, RS can already be very good, whereas evolutionary search can have some small side effects (which would likely disappear when using a longer search budget). This is particularly the case if the fitness function does not provide a gradient to the search and the algorithm gets stuck in local optima. The better the fitness function is (e.g., more advanced white-box heuristics), the better the results that can be achieved with evolutionary search.
In Table 2, we also report the number of potential faults identified by the different search algorithms. The table shows clearly better results for MIO compared to RS, with a high average effect size of Â12 = 0.67 (see Table 3). On timbuctoo, MIO achieves the most, finding 89.7 errors on average compared to 49.5 for RS. This result is achieved thanks to the evolutionary operators adopted in the proposed framework to handle GraphQL APIs.
Considering a 1-hour search budget, an automatically achieved coverage of 39.8% for a complex API like patio-api could be considered a good, practical result, although more still needs to be done (e.g., better search heuristics). For example, there was practically no difference in code coverage between random and evolutionary search on timbuctoo. An in-depth analysis of this API would be needed to point out which branches were not covered, potentially pinpointing which kinds of constraints could not be solved with current search-based heuristics. This would be needed to design new heuristics to solve those kinds of constraints [80].
At any rate, in these experiments, not only are many potential faults found, but the generated tests can also be used for regression testing (i.e., they can be added to the test suites of the SUTs and run as part of continuous integration to check if any change is breaking any current functionality).
The choice of using 1 hour as the stopping criterion is technically arbitrary. It could have been more or less. This choice was based on what practitioners could use in practice [83]. Nevertheless, the chosen search budget can impact the conclusions drawn from the comparisons of search algorithms. For this reason, in Figure 6, we plot the performance of the compared techniques in terms of the number of covered targets throughout the search, collected at each 5% interval (i.e., at each 3-minute interval).
According to the reported results, MIO outperforms RS in all cases except for petclinic and ecommerce-server, where we observed better results for RS. On the other APIs, the improvements of MIO are visible throughout the entire search. In a few cases, already with small budgets (e.g., 3 minutes), MIO gets better results than all the other algorithms running for 1 hour.
In these experiments, MOSA shows some interesting behavior. For example, it has a "slow start" on timbuctoo and graphql-scs, but then with the passing of time it gets better results than RS and WTS. Depending on the chosen search budget, different conclusions could be drawn from these empirical comparisons.

RQ1: In terms of covered targets, MIO demonstrates consistent and significant improvements (+7% line coverage and 10.2 more faults found on average) compared with random testing. In these experiments, MIO provides better results than other evolutionary algorithms such as WTS and MOSA. This shows the effectiveness of MIO adapted for GraphQL testing for maximizing code coverage and fault detection.

For each of the 31 APIs described in Table 1, the results in Table 6 include the number of endpoints (#Endpoints) representing the number of queries and mutations present in the schema, the percentage of endpoints with generated tests without errors (%NoErrors), and the percentage of endpoints with generated tests with errors (%WithErrors). Tests with errors and others without errors could be generated for the same endpoint. However, the following formula would be satisfied: and
In other words, for each endpoint, we want to see if we can generate at least one test case in which the endpoint returns data correctly (i.e., data field), and at least one test case in which there are issues (i.e., errors field). The former type of test could be useful for regression testing, and to manually check if the API behaves correctly and gives the expected outputs. The latter type of test could potentially detect faults and thus would be a first step for debugging the API. As an example, if an API has 10 endpoints (i.e., #Endpoints = 10), then we want to see how many tests with data (0 ≤ #NoErrors ≤ 10) and with errors (0 ≤ #WithErrors ≤ 10) responses are generated. Generating a test with correct data responses is not necessarily trivial, as the inputs might have constraints that are not specified in the schema (e.g., a string input must match a specific regular expression). Likewise, if the API has no faults, or the testing tool is not able to create inputs that find any existing fault, or the API has no input validation, no test with a response containing errors would be generated. Ideally, a fuzzer should aim at being able to generate valid inputs for all endpoints (i.e., %NoErrors = 100%), whereas whether it is possible to find any fault strongly depends on whether there is any fault in the tested API.
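The two percentages can be illustrated with a small Java sketch. It assumes each generated test is summarized by the endpoint it targets and by whether its response contained an errors field. The class name, method name, and sample endpoints are hypothetical, and this is not the script used to produce Table 6.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper, not part of EvoMaster: endpoint statistics in the style of Table 6. */
final class EndpointOutcomeSketch {

    /**
     * Percentage of endpoints for which at least one generated test produced a
     * response of the requested kind. endpointOfTest[i] names the endpoint hit
     * by test i, and hasErrors[i] tells whether its response contained "errors".
     */
    static double percentOfEndpoints(String[] endpointOfTest, boolean[] hasErrors,
                                     boolean withErrors, int totalEndpoints) {
        if (endpointOfTest.length != hasErrors.length || totalEndpoints <= 0) {
            throw new IllegalArgumentException("inconsistent input");
        }
        Set<String> covered = new HashSet<>();
        for (int i = 0; i < endpointOfTest.length; i++) {
            if (hasErrors[i] == withErrors) {
                covered.add(endpointOfTest[i]); // each endpoint is counted once
            }
        }
        return 100.0 * covered.size() / totalEndpoints;
    }

    public static void main(String[] args) {
        // Made-up outcomes for an API with 4 endpoints; "pets" appears in both groups.
        String[] endpoints = {"pets", "pets", "owners"};
        boolean[] errors = {false, true, true};
        System.out.println(percentOfEndpoints(endpoints, errors, false, 4)); // %NoErrors = 25.0
        System.out.println(percentOfEndpoints(endpoints, errors, true, 4));  // %WithErrors = 50.0
    }
}
```

Because the same endpoint can appear in both groups, the two percentages are computed independently and do not need to sum to 100%.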
From Table 6, we observe that all endpoints were reached, and responses with either data or errors fields are effectively derived. From the table, we can see that we can generate tests that lead to responses with errors fields for many endpoints (77.8%). However, there are also many queries/mutations for which we could not get back any valid data (i.e., responses with a data field and no errors were less than 50%). This is likely due to input constraints that are unlikely to be satisfied with random data. Without code analysis (or constraints expressed directly in the schema), there is likely not much a black-box tool can do here (besides having the user provide some sets of valid inputs to fuzz).
Similarly to the fuzzing of RESTful APIs, black-box testing can find faults in GraphQL APIs by just sending random (but syntactically valid) inputs, as APIs are often not particularly robust when dealing with such kinds of random data [61]. However, without being able to analyze the source code, it can be hard to bypass their first layer of input validation and generate successful API requests [20].

RQ2:
The black-box testing implemented in our novel approach enables automated test generation that, on average, triggers responses with errors on up to 641 out of 825 endpoints (i.e., 77.8%).

Results for RQ3.
As discussed in Section 2.1, currently GraphQL makes no distinction between user and server errors. So, without an in-depth manual analysis of the generated tests, it is hard to tell which responses with error messages are due to actual software faults and not a simple misuse of the API. Furthermore, without knowing the full details of the expected business logic of the specific API under analysis, it might be hard for researchers (who are not the developers of the API) to determine if a returned error is indeed due to a software fault. This problem is further exacerbated for the external APIs used for black-box testing experiments, where the source code is not available and cannot be used to validate if an error is indeed likely due to a fault. Still, when evaluating a novel fuzzing technique like we do in this article, it is important to check if it can find any actual faults. For this reason, we did a manual analysis of hundreds of generated tests from our experiments. Here, we discuss some of the most interesting cases.
Let us start from the following generated test case for petclinic.
A similar case can be seen in the following test generated for react-finland.

However, there are a few cases of more serious faults, such as when the returned responses do not match the constraints of the GraphQL schema of the API. For example, consider the following case of an HTTP call in a generated test for Bahnql.

    given().accept("application/json")
        .contentType("application/json")
        .body(" { " +
            " \"query\": \"{ parkingSpace(id: 842) { name, label, responsibility, spaceType, location { latitude }, url, operator, distance, facilityType, openingHoursEn, isSpecialProductDb, isOutOfService, occupancy { validData, timestamp, timeSegment }, clearanceHeight, outOfService, isMonthSeason, tariffDiscount, tariffPaymentCustomerCards, tariffFreeParkingTimeEn, tariffPaymentOptionsEn, slogan } } \" " +
            " } ")
        .post(baseUrlOfSut)
        .then()
        .statusCode(200)
        .assertThat()
        // ...
        .body("'errors'[0].'path'", hasItems("parkingSpace", "location", "latitude"))
        .body("'data'.'parkingSpace'", nullValue());

Here a 200 status code is returned, which would imply a success from the point of view of HTTP. However, the error message is "Cannot return null for non-nullable field Location.latitude." This looks like a case of an internal server error, where a test case is asking for a non-nullable field named latitude, but the server tried to return a null value. All types in GraphQL are nullable by default, and the null value is a valid response. However, when looking into its schema definition, the field latitude is defined as a non-null scalar. This is a clear example showing an actual fault in the SUT, where the API tries to return a response that violates the schema:
Here a 200 status code is returned, which would imply a success from the point of view of HTTP. However, in the body of the response, the following error message appears: "Can't find property named "cursor" on mapped class Information->information in this Query."
It states that there is no field named cursor belonging to the root query information. We have extracted and analyzed the whole schema of the Catalysis-hub API by sending an introspection query to its endpoint. The schema reveals that the field information is of type InformationCountableConnection. The type InformationCountableConnection has the field named edges (that we have asked for) of type InformationCountableEdge. The latter, as shown in the following, has two fields, namely node and cursor, showing a clear fault in the SUT. The internal implementation of the server does not respect the defined GraphQL schema. Although schema violations might not always be easy to identify, there are other cases in which faults are very clear. For instance, consider the following test generated for the Buildkite API.
Another clear example of a major problem can be seen in the following test generated for the Catalysis-hub API. The test requests the resource called hasPreviousPage, but the actual call (done with the library RestAssured) throws an exception.
First, this could technically be a security issue if the API was in production and not just run locally for testing. Full stack-trace details are useful for debugging, but they expose internal details of the API that could be exploited by external attackers. Second, the exception seems to happen in an SQL SELECT query, which is malformed. The point here is that the id in the SQL database is of type numeric, and so a value like "Z" is invalid. However, the API does not check for such integer constraints as a first layer of input validation when a GraphQL query is executed, and rather fails afterward. This can be a serious problem if there are modifications to the internal state of the API before the thrown exception, as the API might be left in an inconsistent state.
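The kind of first-layer input validation that is missing here can be sketched as follows. This is a generic Java illustration, not code from the tested API: the class and method names are hypothetical, and it assumes the identifier must be a non-negative integer that fits in a 64-bit value.

```java
import java.util.OptionalLong;

/** Hypothetical input check run before any database access or state change. */
final class NumericIdValidationSketch {

    /**
     * Returns the parsed identifier, or an empty result if the raw argument is
     * not a plain sequence of decimal digits. At most 18 digits are accepted,
     * so the value always fits in a long.
     */
    static OptionalLong parseNumericId(String raw) {
        if (raw == null || raw.isEmpty() || raw.length() > 18) {
            return OptionalLong.empty();
        }
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c < '0' || c > '9') {
                return OptionalLong.empty(); // e.g., "Z" is rejected here
            }
        }
        return OptionalLong.of(Long.parseLong(raw));
    }

    public static void main(String[] args) {
        System.out.println(parseNumericId("842").isPresent()); // true
        System.out.println(parseNumericId("Z").isPresent());   // false
    }
}
```

Rejecting the request at this point would let the API return a user error before touching the database or modifying any internal state.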
RQ3: Different kinds of faults were automatically detected with our novel techniques, including wrong handling of requests for missing data, and generated responses that do not match the API schemas.

DISCUSSION AND FUTURE DIRECTIONS
This section discusses the main findings of the article, followed by possible future work. The main findings of using EvoMaster for automated GraphQL API testing can be summarized as follows:
(1) The first finding of this study is the difficulty of automatically identifying test cases for GraphQL APIs compared to RESTful APIs. Indeed, the graph representation of the actions is more complex than the traditional representation of RESTful APIs. This representation is rich and might be used in different application domains; however, it requires careful handling in the automated test generation process. Furthermore, whereas a RESTful API can have clear relations between resources based on hierarchical URIs, and that information can be successfully exploited by test generation tools [85], this does not seem to be the case for GraphQL APIs (e.g., there are no easy heuristics to determine which resources on the graph each mutation operation might manipulate).
(2) The second finding of this study is that the EvoMaster tool proved its applicability in handling other kinds of web service APIs, represented by GraphQL APIs. EvoMaster was designed and implemented from the start to be extensible and adaptable to other system test generation domains besides REST APIs [16]. The results obtained in this work show that EvoMaster is a generic enough framework for evolutionary-based system test generation, at least for applications where the entry point is a TCP connection. Being released as open source [27], EvoMaster can be further extended and used in other domains as well.
(3) To obtain better code coverage, white-box heuristics based on search-based techniques can help significantly. However, existing APIs can have many faults that can be easily detected by simply sending random inputs. This makes even simple approaches like black-box testing potentially useful for practitioners.
To improve the effectiveness of the automated GraphQL API testing, several directions may be investigated in the future:
(1) Test oracle problem: Given a test case, whether the result of its execution is correct or not can be determined with an automated oracle [34]. Without an automated oracle, the developer has to determine manually whether the observed test results are as expected or not. But having to manually check hundreds/thousands of generated test cases might not be viable. As discussed in the article, query responses with errors fields might not represent actual faults in the SUT but rather just the user sending wrong data. To mitigate the test oracle problem, an intelligent automated strategy is needed to differentiate actual faults from user errors for a given GraphQL API response. One approach is to use machine learning, particularly supervised classification, to automatically label whether a response with an errors field should be treated as a potential fault that the developer should investigate. When test suites are generated at the end of the search, the test cases could be ordered based on their probability of representing actual faults.
(2) Evolutionary computation: Evolutionary computation is an intelligent mechanism for exploring large solution spaces, inspired by the evolutionary process in nature. In this research work, we only used the evolutionary algorithms MIO, MOSA, and WTS, but others might be more fitting for the case of GraphQL API testing. To further improve the code coverage of the automated testing in this domain, further investigation should be carried out in this area. For example, other techniques such as Particle Swarm Optimization [54] and Ant Colony Optimization [77] could be considered. Combining other testing techniques (e.g., Symbolic Execution [33]) with evolutionary computation can also be considered a good direction to further address this problem [48].
(3) Knowledge discovery: Data mining and knowledge discovery is the process of extracting hidden patterns from a large data collection. Decomposition is a widely used technique in solving complex problems [42,43]. The aim is to create highly correlated clusters, where each cluster contains similar data. In our context, the idea is to apply the decomposition method to the GraphQL schema to derive sub-graphs of the schema. Each sub-graph might contain highly connected actions. Good decomposition methods make the sub-graphs as independent as possible, so that generating test cases on the sub-graphs yields the same results as generating them on the entire GraphQL schema.
(4) Industrial settings: Further investigations with more case studies will be essential to generalize the effectiveness of our novel technique. Of particular importance will be to apply our technique in industrial settings, to see and evaluate how engineers would use tools like EvoMaster in practice on their APIs.

THREATS TO VALIDITY
Threats to internal validity come from the fact that our experiments are derived from a software tool. Errors in such a tool could negatively affect the validity of our empirical results. Although our EvoMaster extension was carefully tested, we cannot provide any guarantee of not having software faults. However, as it is open source, anyone can review its source code. Another potential issue is that the implemented solution in this research work is based on randomized algorithms. This happens in particular for population initialization of the evolutionary algorithm, where different test cases may be generated. To deal with this issue, each experiment for white-box testing was repeated 30 times [23], with different random seeds, and the appropriate statistical tests were used to analyze the results. All the APIs used for the white-box experiments are collected in a GitHub repository called EMB [6], which is stored on Zenodo as well [32]. Furthermore, all of our scripts used to carry out our experiments are stored as part of the repository of EvoMaster. This is done to enable third parties to replicate and validate our experiments. However, experiments for black-box testing cannot be reliably replicated, as they rely on live services over which we do not have any control (e.g., they can be modified at any time by their owners).
Threats to external validity are due to the fact that only 7 GraphQL APIs for white-box testing and 31 GraphQL APIs for black-box testing were used in our empirical analysis. The generalization of such results to other APIs might not be possible at this stage. More APIs should be investigated in the future. However, as this is the first work on white-box testing of GraphQL APIs, already achieving good coverage and finding real faults on a complex GraphQL API provides a promising first step.

CONCLUSION
This article introduced a new approach for automated testing of GraphQL APIs. It is a complete solution, starting from the schema extraction and ending with automatically generated test cases output in JUnit and Jest format. Two testing modes are implemented and evaluated: white-box and black-box testing.
To intelligently explore the test case space, evolutionary computation techniques are used in the white-box testing. Two mutation operators (internal and structure mutation) are defined, where the goal is to maximize code coverage and fault finding. In addition, random testing is used for the black-box mode.
To validate the applicability of the proposed framework, it is integrated into the EvoMaster open source tool. Our empirical analysis was carried out on 7 GraphQL APIs for white-box testing, empirically comparing three different evolutionary algorithms, and 31 GraphQL APIs for black-box testing. The results show the clear improvement of using evolutionary computation compared

Fig. 1. Example of GraphQL data graph and a query on it.

Fig. 3. A gene representation of a user query.

Fig. 4. A gene representation of a user query using GraphQL interfaces.

Table 1. GraphQL APIs Used for Black-Box Experiments
The authentication header can be set in EvoMaster with the command-line argument --header. For our experiments, we considered all the GraphQL APIs listed on apis.guru, but we excluded APIs that are no longer available (but still listed on apis.guru) or that required payment to create an account. Table 1 gives a short description of the 31 APIs used in our experiments.

Table 2. Results for 1-Hour Budget for White-Box Testing. Best results for each metric on each API are highlighted in bold.

Table 3. Detailed Comparisons between MIO and RS. Statistically significant effect sizes (at α ≤ 0.05 level) are marked in bold.

Table 4. Detailed Comparisons between MIO and WTS. Statistically significant effect sizes (at α ≤ 0.05 level) are marked in bold.

Table 5. Detailed Comparisons between MIO and MOSA

Table 6. Results for Black-Box Testing
Results for RQ2. Table 6 presents the results of the black-box testing on the 31 APIs described in Table 1.