Leveraging Large Language Models to Improve REST API Testing

The widespread adoption of REST APIs, coupled with their growing complexity and size, has led to the need for automated REST API testing tools. Current tools focus on the structured data in REST API specifications but often neglect valuable insights available in unstructured natural-language descriptions in the specifications, which leads to suboptimal test coverage. Recently, to address this gap, researchers have developed techniques that extract rules from these human-readable descriptions and query knowledge bases to derive meaningful input values. However, these techniques are limited in the types of rules they can extract and prone to producing inaccurate results. This paper presents RESTGPT, an innovative approach that leverages the power and intrinsic context-awareness of Large Language Models (LLMs) to improve REST API testing. RESTGPT takes as input an API specification, extracts machine-interpretable rules, and generates example parameter values from natural-language descriptions in the specification. It then augments the original specification with these rules and values. Our evaluations indicate that RESTGPT outperforms existing techniques in both rule extraction and value generation. Given these promising results, we outline future research directions for advancing REST API testing through LLMs.


INTRODUCTION
In today's digital era, web applications and cloud-based systems have become ubiquitous, making REpresentational State Transfer (REST) Application Programming Interfaces (APIs) pivotal elements in software development [29]. REST APIs enable disparate systems to communicate and exchange data seamlessly, facilitating the integration of a wide range of services and functionalities [11]. As their intricacy and prevalence grow, effective testing of REST APIs has emerged as a significant challenge [12,19,40].
Automated REST API testing tools (e.g., [3, 5, 6, 9, 14-16, 18, 20, 22, 39]) primarily derive test cases from API specifications [2, 23, 25, 34]. Their struggle to achieve high code coverage [19] often stems from difficulties in comprehending the semantics and constraints present in parameter names and descriptions [1,17,19]. To address these issues, assistant tools have been developed. These tools leverage Natural Language Processing (NLP) to extract constraints from parameter descriptions [17] and query parameter names against databases [1], such as DBPedia [7]. However, attaining high accuracy remains a significant challenge for these tools. Moreover, they are limited in the types and complexity of rules they can extract.
This paper introduces RESTGPT, a new approach that harnesses Large Language Models (LLMs) to enhance REST API specifications by identifying constraints and generating relevant parameter values. Given an OpenAPI Specification [25], RESTGPT augments it by deriving constraints and example values. Existing approaches such as NLP2REST [17] require a validation process to improve precision, which involves not just the extraction of constraints but also executing requests against the APIs to dynamically check these constraints. Such a process demands significant engineering effort and a deployed service instance, making it cumbersome and time-consuming. In contrast, RESTGPT achieves higher accuracy without requiring expensive validation. Furthermore, unlike ARTE [1], RESTGPT excels in understanding the context of a parameter name based on an analysis of the parameter description, thus generating more contextually relevant values.
Our preliminary results demonstrate the significant advantage of our approach over existing tools. Compared to NLP2REST without the validation module, our method improves precision from 50% to 97%. Even when compared to NLP2REST equipped with its validation module, our approach still increases precision from 79% to 97%. Additionally, RESTGPT successfully generates both syntactically and semantically valid inputs for 73% of the parameters over the analyzed services and their operations, a considerable improvement over ARTE, which could generate valid inputs for only 17% of the parameters. Given these encouraging results, we outline a number of research directions for leveraging LLMs in other ways for further enhancing REST API testing.

BACKGROUND AND MOTIVATING EXAMPLE

REST APIs and OpenAPI Specification
REST APIs are interfaces built on the principles of Representational State Transfer (REST), a design paradigm for networked applications [11]. Designed for the web, REST APIs facilitate data exchange between clients and servers through predefined endpoints, primarily using the HTTP protocol [30,35]. A request typically includes headers and a payload, while the corresponding response typically contains headers, content, and an HTTP status code indicating the outcome. OpenAPI Specification (OAS) [25] is arguably the industry standard for defining RESTful API interfaces. It offers the advantage of machine-readability, supporting automation processes, while also presenting information in a clear, human-readable format. Key features of OAS include the definition of endpoints, the associated HTTP methods, expected input parameters, and potential responses. As an example, Figure 1 shows a portion of the FDIC Bank Data API's specification. This part of the specification illustrates how one might query information about institutions. It also details an expected response, such as the 200 status code, which indicates a successfully processed scenario.
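To make the distinction concrete, the machine-readable portion of such a specification can be represented as nested data, while only the `description` strings carry the natural-language hints that testing tools tend to miss. The fragment below is a simplified, hypothetical sketch modeled on the FDIC example in Figure 1 (field names follow the OpenAPI Specification, but the values are illustrative, not the actual FDIC spec):

```python
# Simplified, hypothetical OpenAPI fragment modeled on Figure 1.
# Keys such as "paths", "parameters", and "schema" are standard OAS
# keywords; the endpoint and descriptions are illustrative.
spec_fragment = {
    "paths": {
        "/institutions": {
            "get": {
                "parameters": [
                    {
                        "name": "filters",
                        "in": "query",
                        "schema": {"type": "string"},
                        # Human-readable part: only here do we learn
                        # what a valid filter looks like.
                        "description": 'Filter results, e.g. STNAME:"West Virginia"',
                    },
                    {
                        "name": "sort_order",
                        "in": "query",
                        "schema": {"type": "string"},
                        "description": "Order of the sort: ASC or DESC.",
                    },
                ],
                "responses": {"200": {"description": "Successful request"}},
            }
        }
    }
}

# The machine-readable parts (names, types) are trivially accessible...
params = spec_fragment["paths"]["/institutions"]["get"]["parameters"]
print([p["name"] for p in params])
# ...but the valid values hide inside free-text descriptions.
print(params[1]["description"])
```

Note how the schema says only `type: string` for both parameters; everything a tester actually needs is in prose.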

REST API Testing and Assistant Tools
Automated REST API testing tools [5, 6, 9, 14-16, 18, 20, 22, 39] derive test cases from widely-accepted specifications, primarily OpenAPI [25]. However, these tools often struggle to achieve comprehensive coverage [19]. A significant reason for this is their inability to interpret human-readable parts of the specification [17,19]. For parameters such as filters and sort_order shown in Figure 1, testing tools tend to generate random string values, which are often not valid inputs for such parameters.
In response to these challenges, assistant tools have been introduced to enhance the capabilities of these testing tools. For instance, ARTE [1] taps into DBPedia [7] to generate relevant parameter example values. Similarly, NLP2REST applies natural language processing to extract example values and constraints from descriptive text portions of the specifications [17].

Large Language Model
Large Language Models (LLMs) [13,24,36] represent a transformative leap in the domains of natural language processing (NLP) and Machine Learning. Characterized by their massive size, often containing billions of parameters, these models are trained on vast text corpora to generate, understand, and manipulate human-like text [28]. The architectures behind LLMs are primarily transformer-based designs [37]. Notable models based on this architecture include GPT (Generative Pre-trained Transformer) [27], designed mainly for text generation, and BERT (Bidirectional Encoder Representations from Transformers) [10], which excels in understanding context. These models capture intricate linguistic nuances and semantic contexts, making them adept at a wide range of tasks from text generation to answering questions.

Motivating Example
The OpenAPI specification for the Federal Deposit Insurance Corporation (FDIC) Bank Data API, shown in Figure 1, offers insights into banking data. Using this example, we highlight the challenges in parameter value generation faced by current REST API testing assistant tools and illustrate how RESTGPT addresses these challenges.
(1) Parameter filters: Although the description provides guidance on how the parameter should be used, ARTE's dependency on DBPedia results in no relevant value generation for filters. NLP2REST, with its keyword-driven extraction, identifies examples from the description, notably aided by the term "example". Consequently, patterns such as STNAME: "West Virginia" and STNAME: ("West Virginia", "Delaware") are accurately captured.
(2) Parameter sort_order: Here, both tools exhibit limitations. ARTE, while querying DBPedia, fetches unrelated values such as "List of colonial heads of Portuguese Timor", highlighting its contextual inadequacy. In the absence of identifiable keywords, NLP2REST fails to identify "ASC" or "DESC" as potential values.
In contrast to these tools, RESTGPT is much more effective: with a deeper semantic understanding, RESTGPT accurately discerns that the filters parameter is contextualized around state names tied to bank records, and generates test values such as STNAME: "California" and multi-state filters such as STNAME: ("California", "New York"). It also successfully identifies the values "ASC" and "DESC" from the description of the sort_order parameter. This example illustrates RESTGPT's superior contextual understanding, which enables it to outperform the constrained or context-blind methodologies of existing tools.
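The gap between the two behaviors can be sketched in a few lines. The snippet below contrasts a constraint-blind tool, which falls back to random strings for sort_order, with a description-aware generator. The keyword scan here is only a stand-in for RESTGPT's semantic extraction (RESTGPT uses an LLM, not string matching); both function names are ours, invented for illustration:

```python
import random
import string

def random_value(length=8):
    # What a constraint-blind testing tool might send for sort_order:
    # a random string that the server will almost surely reject.
    return "".join(random.choices(string.ascii_letters, k=length))

def description_aware_values(description):
    # Stand-in for LLM-based extraction: pick out the candidate enum
    # values that the description actually mentions. RESTGPT does this
    # semantically; this literal scan is only for illustration.
    candidates = ["ASC", "DESC"]
    return [c for c in candidates if c in description]

desc = "Order of the sort: ASC or DESC."
print(random_value())                  # e.g. an invalid random string
print(description_aware_values(desc))  # ['ASC', 'DESC']
```

The point of the sketch is that the valid values are recoverable only from the prose description, which the random generator never consults.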

OUR APPROACH

Overview
Figure 2 illustrates the RESTGPT workflow, which starts by parsing the input OpenAPI specification. During this phase, both machine-readable and human-readable sections of each parameter are identified. The human-readable sections provide insight into four constraint types: operational constraints, parameter constraints, parameter type and format, and parameter examples [17].
The Rule Generator, using a set of crafted prompts, extracts rules of these four types. We selected GPT-3.5 Turbo as the LLM for this work, given its accuracy and efficiency, as highlighted in a recent report by OpenAI [24]. The inclusion of few-shot learning further refines the model's output. By providing the LLM with concise, contextually rich instructions and examples, the few-shot prompts ensure the generated outputs are both relevant and precise [8,21]. Finally, RESTGPT combines the generated rules with the original specification to produce an enhanced specification.

Rule Generator
To best instruct the model on rule interpretation and output formatting, our prompts are designed around four core components: guidelines, cases, grammar highlights, and output configurations.

Guidelines
1. Identify the parameter using its name and description.
2. Extract logical constraints from the parameter description, adhering strictly to the provided format.
3. Interpret the description in the least constraining way.
The provided guidelines serve as the foundational instructions for the model, framing its perspective and clarifying its primary objectives. Using the guidelines as a basis, RESTGPT can then proceed with more specific prompting.
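Assembling such a few-shot prompt might look as follows. The guideline text is taken from this section, but the example pairs and the output format in this sketch are illustrative, not the exact prompts we used (those are available in our artifact):

```python
# Guideline text reproduced from the prompt design above.
GUIDELINES = (
    "1. Identify the parameter using its name and description.\n"
    "2. Extract logical constraints from the parameter description, "
    "adhering strictly to the provided format.\n"
    "3. Interpret the description in the least constraining way.\n"
)

# Few-shot examples: (parameter description, expected rule) pairs.
# These pairs and the rule syntax are hypothetical placeholders.
FEW_SHOT = [
    ("Order of the sort: ASC or DESC.", 'enum ["ASC", "DESC"]'),
    ("Maximum number of records to return. Default is 10.", "default [10]"),
]

def build_prompt(param_name, param_description):
    # Concatenate guidelines, worked examples, and the query parameter
    # into a single prompt, ending where the model should answer.
    shots = "\n".join(f"Description: {d}\nRule: {r}" for d, r in FEW_SHOT)
    return (
        f"{GUIDELINES}\n{shots}\n"
        f"Parameter: {param_name}\n"
        f"Description: {param_description}\nRule:"
    )

prompt = build_prompt("sort_order", "Order of the sort: ASC or DESC.")
print(prompt)
```

The prompt deliberately ends with "Rule:" so that the model's completion is exactly the machine-interpretable rule, with no surrounding prose to strip.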

Cases
Case 1: If the description is non-definitive about parameter requirements: Output "None".
...
Case 10: For complex relationships between parameters: Combine rules from the grammar.
The implementation of cases in model prompting plays a pivotal role in directing the model's behaviour, ensuring that it adheres to precise criteria as depicted in the example. Drawing inspiration from Chain-of-Thought prompting [38], we decompose rule extraction into specific, manageable pieces to mitigate ambiguity and, consequently, improve the model's processing abilities.
The Rule Generator also oversees the value-generation process, which is executed during the extraction of parameter example rules. Our artifact [31,32] provides details of all the prompts and their corresponding results.

Specification Enhancement
The primary objective of RESTGPT is to improve the effectiveness of REST API testing tools.We accomplish this by producing enhanced OpenAPI specifications, augmented with rules derived from the human-readable natural-language descriptions in conjunction with the machine-readable OpenAPI keywords [33].
As illustrated in Figure 2, the Specification Parsing stage extracts the machine-readable and human-readable components from the API specification. After rules from the natural language inputs have been identified by the Rule Generator, the Specification Building phase begins. During this phase, the outputs from the model are processed and combined with the machine-readable components, ensuring that there is no conflict between restrictions. For example, the resulting specification must have the style attribute only if the data type is array or object. The final result is an enriched API specification that contains constraints, examples, and rules extracted from the human-readable descriptions.
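A minimal sketch of that conflict check, assuming the single rule stated above (keep `style` only for `array` or `object` data types); the helper name and dict-merge strategy are ours, not RESTGPT's actual implementation, which handles many more keyword interactions:

```python
def merge_rules(machine_readable, extracted_rules):
    """Merge LLM-extracted rules into a parameter's schema,
    discarding combinations the specification grammar disallows.
    Illustrative helper; RESTGPT's builder checks more conflicts."""
    merged = {**machine_readable, **extracted_rules}
    # Per the constraint described in the text: keep `style` only
    # when the parameter's data type is array or object.
    if merged.get("type") not in ("array", "object") and "style" in merged:
        del merged["style"]
    return merged

# A string parameter: an extracted `style` rule conflicts and is dropped,
# while the extracted example value is kept.
result = merge_rules({"type": "string"}, {"style": "form", "example": "ASC"})
print(result)  # {'type': 'string', 'example': 'ASC'}
```

Resolving conflicts at build time, rather than trusting the model's raw output, keeps the enhanced specification valid for downstream testing tools that parse it strictly.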

Evaluation Methodology
We collected nine RESTful services from the NLP2REST study. The motivation behind this selection is the availability of a ground truth of extracted rules in the NLP2REST work [17]. This ground truth allowed us to compare our work directly with NLP2REST.
To establish a comprehensive benchmark, we incorporated a comparison with ARTE as well. Our approach was guided by the ARTE paper, from which we extracted the necessary metrics for comparison. Adhering to ARTE's categorization of input values as Syntactically Valid and Semantically Valid [1], two of the authors meticulously verified the input values generated by RESTGPT and ARTE. Notably, we emulated ARTE's approach in scenarios where more than ten values were generated by randomly selecting ten from the pool for analysis. NLP2REST, while effective, hinges on a validation process that involves evaluating server responses to filter out unsuccessful rules. This methodology demands engineering effort, and its efficacy is constrained by the validator's performance.

Results and Discussion
In contrast to NLP2REST, RESTGPT eliminates the need for response-based validation entirely, thanks to its high precision. Impressively, RESTGPT's precision of 97% surpasses even that of NLP2REST post-validation, which stands at 79%. RESTGPT thus delivers superior results without a validation stage, demonstrating an LLM's ability to detect nuanced rules, unlike conventional NLP techniques that rely heavily on specific keywords.
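The precision, recall, and F1 figures follow their standard definitions over true positives (correctly extracted rules), false positives (spurious rules), and false negatives (missed rules). For completeness, a small helper with made-up example counts (the TP/FP/FN numbers below are illustrative, not our measured data):

```python
def precision_recall_f1(tp, fp, fn):
    # Standard definitions, guarded against empty denominators.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Illustrative counts: 97 correct rules, 3 spurious, 10 missed.
p, r, f1 = precision_recall_f1(tp=97, fp=3, fn=10)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
# precision=0.97 recall=0.91 F1=0.94
```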
Furthermore, Table 2 presents data on the accuracy of ARTE and RESTGPT. The data paint a clear picture: RESTGPT consistently achieves higher accuracy than ARTE across all services. This can be attributed to the context-awareness capabilities of LLMs, as discussed in Section 2. For example, in the language-tool service, we found that, for the language parameter, ARTE generates values such as "Arabic", "Chinese", "English", and "Spanish". RESTGPT, however, understands the context of the language parameter and generates language codes such as "en-US" and "de-DE".

FUTURE PLANS
Given our encouraging results on LLM-based rule extraction, we next outline several research directions that we plan to pursue in leveraging LLMs for improving REST API testing more broadly.
Model Improvement. There are two ways in which we plan to create improved models for supporting REST API testing. First, we will perform task-specific fine-tuning of LLMs using data from APIs.guru [4] and RapidAPI [26], which contain thousands of real-world API specifications. We will fine-tune RESTGPT with these datasets, which should enhance the model's capability to comprehend diverse API contexts and nuances. We believe that this dataset-driven refinement will help RESTGPT understand a broader spectrum of specifications and generate even more precise testing suggestions. Second, we will focus on creating lightweight models for supporting REST API testing, such that the models do not require expensive computational resources and can be deployed on commodity CPUs. To this end, we will explore approaches for trimming the model, focusing on retaining the essential neurons and layers crucial for our task.
Improving fault detection. RESTGPT is currently restricted to detecting faults that manifest as 500 server response codes. By leveraging LLMs, we intend to expand the types of bugs that can be detected, such as bugs related to CRUD semantic errors or discrepancies in producer-consumer relationships. By enhancing RESTGPT's fault-finding ability in this way, we aim to make automated REST API testing more effective and useful in practice.
LLM-based Testing Approach. We aim to develop a REST API testing tool that leverages server messages. Although server messages often contain valuable information, current testing tools fail to leverage this information [17]. For instance, if a server hint suggests crafting a specific valid request, RESTGPT, with its semantic understanding, could autonomously generate relevant tests. This would not only enhance the testing process but also ensure that potential loopholes that the server messages may indicate would not be overlooked.

Figure 1: A part of FDIC Bank Data's OpenAPI specification.

Figure 2: Overview of our approach.
Grammar Highlights
The Grammar Highlights emphasize key operators and vocabulary that the model should recognize and employ during rule extraction. By providing the model with a fundamental context-specific language, RESTGPT identifies rules within text.

Output Configurations
Example Parameter Constraint: min [minimum], max [maximum], default [default]
Example Parameter Format: type [type], items [item type], format [format], collectionFormat [collectionFormat]

Table 1 presents a comparison of the rule-extraction capabilities of NLP2REST and RESTGPT. RESTGPT excels in precision, recall, and F1 score across a majority of the REST services.

Table 2: Accuracy of ARTE and RESTGPT.