Automated Framework to Extract Software Requirements from Source Code

Software maintenance and innovation are constant challenges across industries, especially as programming languages evolve with technology. Likewise, poor lexicon quality degrades program comprehension, increasing the effort developers must expend to improve existing software products. To address these challenges, we propose a novel automated framework that extracts software requirements directly from source code using a baseline AI language model applied to a Java code base. Leveraging natural language processing techniques, the framework validates programs and generates easily readable requirements by analyzing file contents. The framework enhances agility and flexibility by providing comprehensive documentation for existing software systems. It caters to both experienced and less-experienced developers, offering an intuitive graphical user interface and enabling efficient identification and resolution of errors. The resulting output supports natural-language interaction with the system. By automating the extraction process, the framework allows developers to better understand software systems, make informed decisions, and adapt to evolving needs.


INTRODUCTION
With the rapidly evolving environment of information-based systems, there is a pressing demand for the adoption of innovative tools and methodologies [7,23]. It is essential for industries to ensure that their tools and methodologies are agile and flexible enough to keep up with the rapid advancements in software development and technologies. Extracting functional requirements from source code is a crucial task in software extension and modernization [19]. It involves understanding the intended functionality of software by analyzing the developed code [16]. This process is vital in ensuring that software is built, or updated, in accordance with its original specifications and requirements [22].
The purpose of this research is to explore and develop a framework that can automate the extraction of requirements from source code, resulting in a clear roadmap to facilitate the development of new modern software products. This framework can be used by developers to ascertain whether a new product satisfies the same requirements as previously established legacy code. Additionally, it would reduce the time and effort spent manually exploring existing software without sacrificing the necessary knowledge of software implementation. This benefit would improve the efficiency of planning and validation in the development life cycle [2,18]. Overall, automated requirement extraction can result in a faster, more exact, and more efficient software development process.
Developers often need to spend a significant amount of time and resources reverse engineering legacy code or implementing extensions. Further, there may be limited documentation or knowledge transfer from the original developers, making the process even more challenging. Using our proposed framework, developers, including students, could reduce the time and cost spent on interpreting source code for the maintenance, enhancement, and modernization of software products.
The remainder of the paper is organized as follows: Section 2 provides a background for the work while Section 3 details the related work. Sections 4 and 5 outline the proposed methodology and describe our analysis of the results. Lastly, Section 6 presents our conclusion from this research.

BACKGROUND

Extracting Functional Requirements and Business Rules
Automated extraction of requirements or business rules from software refers to the process of automatically identifying and documenting the functional and non-functional requirements and business rules to which the software adheres [25]. This process is usually performed by specialized software tools or algorithms that analyze the software code or its associated documentation to identify and extract the relevant information. One common technique used in automated requirement extraction is Natural Language Processing (NLP) [4,27,28], which involves analyzing the text in the software documentation, such as use cases, functional specifications, and user manuals, to identify key phrases and keywords that indicate requirements and business rules. NLP methods may be applied to the extracted code to determine the most essential components and relationships. An NLP system can establish a high-level abstraction of the software's operation and help identify any possible bugs or areas for development by summarizing the actual functionality of the code under test, which can then be compared to the intentions outlined in the documentation. This data may then be utilized to improve the functional requirements of software and guarantee that it satisfies the needs and expectations of the end user.

Natural Language Processing (NLP)
NLP methods have typically been used to evaluate texts and extract structured data from unstructured data [1,10,12]. NLP algorithms attempt to extract a more comprehensive and meaningful representation from free text by determining the real value of each phrase in a given context. As a result, this improves the quality of the generated output. NLP algorithms employ linguistic concepts such as parts of speech, grammatical structure, ambiguity, and anaphora. This domain includes several knowledge representations, such as a lexicon of words and their meanings, grammatical attributes, and a set of grammar rules, as well as other resources such as an ontology of entities and activities or a thesaurus of synonyms or abbreviations.
NLP has become increasingly important in the age of big data, where large volumes of text data are generated every day. By automating the analysis and interpretation of natural language, NLP can help organizations extract valuable insights from this data and make more informed decisions. With a substantial amount of raw data, NLP can infer information that is not explicitly stated. AI language models such as ChatGPT-3Turbo [9] and AI21 Jurassic-2 [15] can be utilized to analyze particular programming languages. With this knowledge, we can take syntax queries and interpret the semantics inherent in the architecture to derive functional requirements that are readable to non-developers and software developers alike.

RELATED WORK
Software that is outdated and lacking thorough documentation poses a continual challenge for the industry. Here, we present work in the area of extracting requirements from source code. Various researchers [13,14] have investigated tools for the extraction of functional requirements from legacy code. They have proposed methods to measure the accuracy of derived requirements directly from the project documentation (SRS), following three categories of business rules, namely structural, behavioral, and constraint. Case studies [11,26] were performed on projects containing simple lines of code, using the technique of forward program slicing derived from these categories to infer the function and weighted value of each sliced line of code. This technique, however, transforms the codebase to extract business rules from expressions and variables, deriving context from the code to infer the purpose of the program.
Antoniol et al. [5] described a methodology to trace object-oriented code to functional requirements. This method posits the extraction of requirements to design specifications as a bridge between system architecture and functional requirements. They provide a framework to trace classes to functional requirements and explore reverse engineering to recover the intention of previous developers and the scope of the product.
Mirakhorli et al. [17] described a framework called ReqGen to automatically generate natural language requirements specifications based on given keywords. The approach includes selecting keywords from the domain ontology, injecting them into a pre-trained language model, and designing a requirements-syntax-constrained decoding. Experiments on two public data sets show that ReqGen outperforms six other natural language generation approaches with respect to keyword inclusion, BLEU, ROUGE, and syntax compliance.
Putrycz et al. [21] developed a novel systematic approach to reverse engineer poorly commented legacy code to extract business rules (functional requirements). The target system architecture observed is in COBOL, with the intention of building a new system around the extracted business rules. A tool named HotRod, developed by Netron, extracts business rules from COBOL and, paired with EvolveWare's S2T tool, automates the re-documentation of legacy software. The techniques used to design S2T are similar to those used in our proposed framework. However, each extraction phase carried out by S2T requires human verification to ensure precision. The business rules are translated from source code and processed in two main phases. These rules are defined as conditional expressions, which are parsed into an Abstract Syntax Tree (AST) to interpret the hierarchical order and functionality of the legacy source code and to manually analyze the pre/post conditions of the extracted business rules. This allows the development and interpretation of the functional requirements.
Allamanis et al. [3] proposed a framework using a Convolutional Attention Network (CAN) for the extreme summarization of source code. They argue that traditional text summarization techniques are not well-suited for source code due to its unique properties, such as the lack of natural language structure, high-level abstraction, and domain-specific vocabulary. The proposed CAN approach overcomes these challenges by leveraging the structural information of the source code and using a convolutional neural network (CNN) to extract relevant semantic information about the system. The model then uses an attention mechanism to highlight the most important parts of the source code and generates a summary.
AveriSource [6], Updraft [24], EvolveWare [8], and Cognizant offer application modernization services with a focus on legacy software and migration to cloud-based development stacks. They automate solutions for extracting requirements from legacy code to improve the efficiency and effectiveness of applications. Cognizant's AppLens platform incorporates an ML algorithm with predefined business rules to automate the modernization process and minimize maintenance and expenses. AveriSource's iSAT suite includes an extraction module that focuses on extracting business rules from legacy software and provides a tabular presentation of rules, routine flow, and identifiers. EvolveWare's Intellysis platform recently added Agile Business Rule Extraction to automate the identification, consolidation, cataloging, and annotation of business rules from legacy code. Their solution aims to improve rule extraction while avoiding code freeze during project re-engineering and provides documentation for reference.
The approaches, platforms, and features discussed above vary, providing businesses with options to choose from based on their specific needs and requirements. It is important to note that these companies offer proprietary systems; their tools are not open source and may come with restrictions. In contrast, we propose an open-source tool with the goal of free distribution for programmers and/or students pursuing system verification. This could make software development and legacy code modernization more accessible to a larger community of developers, particularly those who may not have advanced programming skills but still require system verification capabilities.

METHODOLOGY
The ever-evolving technological setting requires constant maintenance and updating of systems to stay competitive, work that is often performed by less-experienced programmers, including students. In the software space, students need tools for their studies and to ensure that their programs meet the expected requirements. At the same time, companies may prefer using legacy code rather than investing countless man-hours and other resources to update a software system to meet new requirements. However, the lack of open-source tools to facilitate this need presents a challenge.

Proposed Approach
To address the aforementioned challenges, this work proposes and develops a novel framework, which includes a streamlined user interface, based on the methodology captured in Figure 1. This framework utilizes Natural Language Processing (NLP), similar to ChatGPT, to validate programs and generate a set of easily readable requirements.

Figure 1: Automated Framework Methodology for Extracting Software Requirements
We present an implementation of our proposed framework in the form of a desktop application that works by parsing and extracting software requirements from source code files. It takes a file as input and iterates through its contents to identify and extract the necessary requirements. To accomplish this task, we employed an auto-regressive language model called Jurassic-2 [15], configured specifically to extract requirements from Java-based source code.

Technical Implementation
Our desktop application is driven by the Java Swing interface, making it applicable to any operating system that supports Java-based GUIs. The system can accept Java source code as text or file input. It then analyzes the source code and produces requirements based on its latent functionality.
The ingested data is processed in parallel by one of AI21's Jurassic-2 Instruct models and a subset of the JavaParser package created by Nembhard [20] to produce a summary of the purpose of the code, a representation of the relative Abstract Syntax Tree (AST) corresponding to the class and method hierarchy, and a set of requirement statements satisfied by the input.
The user interface and local processing are handled by the Java parser, which utilizes Swing, a popular Java library that extends the basic GUI components offered by the built-in AWT module, allowing greater customization and flexibility. Behind the user interface is our processing code, which accepts a directory or file path as input to discover Java source code. All files with the ".java" extension are collected using an instance of the FilePathResolver class within the JavaParser package. Each file is parsed as an individual CompilationUnit and then structurally analyzed to extract methods, parameters, and comments. These features are stored in a custom data structure and written to files to be displayed to the user later. The AST and commented-AST files represent the hierarchy of each class and its methods, with any present comments aligned to the corresponding feature in the commented file. The statements file contains requirement statements for each class, generated using our custom data structure, which maintains the structural relationship between classes, methods, and parameters. Together, these files represent the output produced locally after submitting input.
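To illustrate the structural-analysis step, the following stdlib-only sketch approximates method-signature extraction with a regular expression. This is a deliberate simplification: the actual implementation parses each file into a full CompilationUnit via the JavaParser package, and the class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified stand-in for the structural analysis step. A real parse builds a
// complete AST; this regex only recovers method names for illustration.
public class MethodExtractor {
    private static final Pattern METHOD =
        Pattern.compile("(public|private|protected)\\s+[\\w<>\\[\\]]+\\s+(\\w+)\\s*\\(");

    public static List<String> extractMethodNames(String javaSource) {
        List<String> names = new ArrayList<>();
        Matcher m = METHOD.matcher(javaSource);
        while (m.find()) {
            names.add(m.group(2)); // group 2 captures the method identifier
        }
        return names;
    }

    public static void main(String[] args) {
        String sample = "class Foo { public int add(int a, int b) { return a + b; } }";
        System.out.println(extractMethodNames(sample)); // [add]
    }
}
```

A regex cannot handle nested generics, annotations, or constructors reliably, which is why a proper parser is used in the framework itself.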
Alongside the internal processing, an API request is made to AI21's state-of-the-art Jurassic-2 Large Language Models (LLMs). These auto-regressive LLMs are intended for general use, as a custom model for this project's scope has not yet been trained. Despite this, the Instruct models have proven effective for our needs at this stage of the investigation. The input source code is inserted into a prompt asking the model to interpret its purpose, and the response is returned to the user interface for display. Figure 2 depicts the entire process flow of the framework.
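The parallel flow of local analysis and remote summarization can be sketched with CompletableFuture. The helper names and return values below are hypothetical placeholders, not the framework's actual API; the point is only that the two branches run concurrently and are joined before the UI is updated.

```java
import java.util.concurrent.CompletableFuture;

// Sketch of the concurrent pipeline: local structural analysis and the remote
// LLM summary request run in parallel, and both results are joined for display.
public class ParallelPipeline {
    static String analyzeLocally(String code) {
        // Placeholder for AST construction and requirement-statement generation.
        return "AST + requirement statements (" + code.length() + " chars)";
    }

    static String summarizeRemotely(String code) {
        // Placeholder for the AI21 completions API call.
        return "purpose summary";
    }

    public static String process(String code) {
        CompletableFuture<String> local =
            CompletableFuture.supplyAsync(() -> analyzeLocally(code));
        CompletableFuture<String> remote =
            CompletableFuture.supplyAsync(() -> summarizeRemotely(code));
        return local.join() + " | " + remote.join(); // block until both finish
    }

    public static void main(String[] args) {
        System.out.println(process("class A {}"));
    }
}
```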

Jurassic-2 Configurations
When using Large Language Models such as GPT-3 or, in our implementation, Jurassic-2, prompts must be delivered in a structure that describes the expected features of the response. These properties include attributes such as the number of responses generated, minimum and maximum tokens, prompt temperature, topP, and more. Temperature is a significant value because it controls the randomness of the generated output: lower values produce more deterministic responses. For the prompts submitted to Jurassic-2's completions API, we limited the number of responses per call to 1 in the interest of reducing the post-processing necessary before presenting the output. We chose a maxTokens value of 175 to prevent excessively verbose or repetitive responses. Additionally, the temperature of our prompts was reduced to 0; we chose to utilize the more deterministic responses generated by low-temperature prompts to produce concise and relevant purpose statements for the input code. This configuration is applied to all API calls sent by the current implementation of the software, with user customizability to come in future work.
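A minimal sketch of the request body implied by this configuration is shown below. The field names (numResults, maxTokens, temperature) follow AI21's completions API as we understand it and should be verified against the current API documentation; the class name is hypothetical.

```java
// Builds the JSON body for a completions request with the configuration
// described above: 1 response per call, 175 max tokens, temperature 0.
public class PromptConfig {
    public static String buildRequestBody(String prompt) {
        return "{"
            + "\"prompt\": \"" + prompt.replace("\"", "\\\"") + "\", "
            + "\"numResults\": 1, "
            + "\"maxTokens\": 175, "
            + "\"temperature\": 0"
            + "}";
    }

    public static void main(String[] args) {
        System.out.println(buildRequestBody("Describe the purpose of this Java code."));
    }
}
```

In a production implementation, a proper JSON library would be preferable to string concatenation, which is used here only to keep the sketch dependency-free.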

RESULTS AND DISCUSSION
In this section, we discuss the results obtained from applying our proposed framework to a Java project. Defining an objective metric to describe the accuracy of the results is nontrivial, and perhaps ineffective at this stage of research, because the language model used in the current implementation was trained primarily on conversational data. Due to the extensive and specialized training conducted by AI21 on this model [15], and their interest in providing accurate and correct responses to prompts, we have been able to produce responses that demonstrate the potential for highly relevant requirement generation with a supervised learning model trained on correct examples. This has proven beneficial, as AI language models such as ChatGPT-3Turbo [9] and AI21 Jurassic-2 [15] can also understand particular programming languages.
The Requirements Extractor, shown in Figure 4, has been tested so far on more than 20 Java files, most frequently on a project implementing a translator between two other programming languages. Figure 3 presents an example Java project file hierarchy that has been tested with our proposed framework. We have successfully generated a descriptive code hierarchy (class => method => parameter) and template-based requirement statements relating the methods and parameters to the parent class for all tested files. One difficulty in producing responses from the LLM is prompt length, as the API imposes a token limit and rejects code of extreme length (greater than approximately 32,000 characters). As such, a few files could not be represented by a summary from the Jurassic-2 model; however, among those that could be summarized, all responses correctly described features of the provided Java code, and 77% captured interesting features relevant to the intended purpose of the inputs.
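The length restriction described above can be expressed as a simple pre-submission guard; the 32,000-character threshold is approximate, per the observation above, and the class name is hypothetical. Files that fail the check still receive local AST and requirement-statement output.

```java
// Guard applied before sending a file's contents to the LLM: oversized
// sources are skipped for summarization but still processed locally.
public class PromptGuard {
    static final int MAX_PROMPT_CHARS = 32_000; // approximate API limit

    public static boolean canSummarize(String source) {
        return source.length() <= MAX_PROMPT_CHARS;
    }

    public static void main(String[] args) {
        System.out.println(canSummarize("class A {}"));
    }
}
```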

CONCLUSION
In this study, we have developed proof-of-concept software that accepts Java code as text or file input, then performs concurrent API communication and local processing to provide the user with information about the structure and functional purpose of the code, along with an estimation of the initial requirements that it satisfies. Automated extraction of requirements from source code offers significant benefits to the software development process. It reduces the time and labor required compared to manual extraction. Automation reduces human error and allows developers to allocate more time to other critical tasks such as design and testing. Manual extraction is subjective and prone to mistakes, while automated extraction follows consistent rules and procedures, ensuring objective analysis. This leads to an improved understanding of software functionality and minimizes errors or misunderstandings during development. Automated extraction streamlines the development process, enhancing efficiency and productivity. For future work, it is important to note the potential for bad actors to attempt to gain access to proprietary software through the proposed framework, and to design mitigations for this risk. Further, we will conduct user studies to ascertain the usefulness of the requirements extracted by our proposed framework.

Figure 2: Process Flow of the Automated Framework for Extracting Software Requirements

Figure 3: Example Project File Structure

Figure 4: Requirements Extractor Graphical User Interface