Towards Trustworthy AI Software Development Assistance

AI software development assistants are expected to play an important role in the software industry in the near future. However, current software development assistants tend to be unreliable, often producing incorrect, unsafe, or low-quality code. We seek to resolve these issues by introducing a holistic architecture for constructing, training, and using trustworthy AI software development assistants. At the center of the architecture is a foundational LLM trained on datasets representative of real-world coding scenarios and complex software architectures, and fine-tuned on code quality criteria beyond correctness. The LLM will make use of graph-based code representations for advanced semantic comprehension. We envision a knowledge graph integrated into the system to provide up-to-date background knowledge and to enable the assistant to provide appropriate explanations. Finally, a modular framework for constrained decoding will ensure that certain guarantees (e.g., for correctness and security) hold for the generated code.


INTRODUCTION
Software development (SD) is a complex, expensive process [3]. In times of ever-increasing digitalization of our society and scarcity of IT talent, it makes sense to (partially) automate this process; the popularity that AI-assisted code generation has gained since the appearance of large language models (LLMs) indicates this need. At the same time, given that software dominates our digitized society, guaranteeing the quality of the generated code is paramount. Moreover, for the acceptance of AI software development assistants, it is essential that they cover a broad range of software engineering tasks and faithfully mimic existing well-understood and proven SD methods and practices.
Although existing LLMs, when pre-trained or fine-tuned on code, have shown promise in handling coding-related queries and synthesizing coherent-looking code, they exhibit concerning issues. For instance, AI systems such as ChatGPT produce erroneous code suggestions: on Stack Overflow, the large number of inaccurate ChatGPT-generated answers led to a ban on its use. The moderators stated that "the posting of answers created by ChatGPT [...] is substantially harmful to the site and to users who are asking questions and looking for correct answers". A later study made similar observations [13].
Moreover, a study on GitHub Copilot revealed that 40% of the generated code contained critical security vulnerabilities [27]. Further emphasizing the dangers of such issues in the domain of software engineering is an investigation that discovered that "participants who had access to the AI assistant were more likely to introduce security vulnerabilities [...], yet also more likely to rate their insecure answers as secure" [28]. These studies highlight quality and security problems with current industrial AI software development assistants.
To remedy these problems, we seek to establish the foundations of a next-generation AI software development assistant system. Given the software engineer's natural language queries, the assistant should suggest high-quality code, i.e., code that is correct but also secure, readable, and otherwise compliant with best practices. So far, code model research has focused almost exclusively on correctness (e.g., [5,8,16,19,20,34]).
Beyond that, the assistant should be able to serve the role of a virtual pair programmer, explaining its suggestions, helping with debugging, etc., similar to how a human would. The positive effects of pair programming practices on code correctness and quality [25,41,42], as well as on collaborative learning [24], are well known. Despite this, pair programming has not been widely adopted because working closely with a teammate is challenging. A virtual pair programmer would not cause such problems. Moreover, we envision the system we build as an AI counterpart to the traditional DevOps pipeline. Therefore, our assistant should be able to support the software engineer in all phases, from software design to deployment.
The listed objectives are highly ambitious and require addressing various challenges. This paper is intended to outline a way forward for us and other researchers trying to develop better AI software development assistants. To this end, we start by identifying five key challenges: (1) lack of representative datasets, (2) difficulties capturing code structure and semantics, (3) low code quality, (4) insufficient explanations, and (5) no guarantees on the results. We then propose five solutions to address these challenges: (1) curated real-world datasets, (2) graph representations of code, (3) fine-tuning through feedback from code analyses, (4) enriched code knowledge graphs, and (5) constrained decoding. Our considerations lead to a holistic multi-component architecture for AI software development assistant systems built around a central code model. It is illustrated in Figure 1.

APPROACH
The first step towards improving current AI software development assistants is to understand their limitations. Here, we elaborate on the five challenges we found to be most relevant. For each of them, we propose a possible solution together with an evaluation plan. All the described ideas contribute to the overall goal of this vision: creating a trustworthy AI SD assistant that a software engineer can use in real-world software projects without having to worry about the correctness, quality, safety, and security of the generated code.

Representative Datasets
Challenge. Popular code datasets such as CodeSearchNet [12] and Py150K consist of individual code snippets with single functions. Such datasets do not represent real-world software, which typically exhibits complex software patterns, interdependent class hierarchies, project dependencies involving multiple interconnected files, and further complexities. A study by Hellendoorn et al. [11] revealed that the absence of these features in coding benchmarks contributed to many shortcomings of code completion systems. We, therefore, consider the lack of high-quality datasets that accurately represent real-world coding patterns and complex software architectures a key challenge for training code models. However, real-world code suitable for creating such datasets is scarce. Moreover, annotating and curating code datasets requires significant code quality and dataset management expertise.
Envisioned solution. To meet this challenge, we plan to compile and curate a comprehensive dataset of code that accurately reflects common real-world coding patterns and software structures. For this, we will build upon established state-of-the-art datasets. To make these datasets representative, we will devise heuristics rooted in general coding patterns observed in real-world coding practices. Hellendoorn et al. introduced initial heuristics for the C# programming language to distinguish real-world completions from synthetic ones, such as length, type, and origin of (syntactic) tokens. We will expand on these heuristics in a generalized manner for other popular languages, including Java, Python, and C.
We will annotate our dataset with qualitative metrics to ensure we train with high-quality code. We will rely on various techniques to accurately label the dataset, including static code analysis tools, Stack Overflow discussions, and GitHub issue trackers. While mining GitHub repositories, strategies must be adopted to circumvent known pitfalls [14]. When extracting labels from Stack Overflow, we will address challenges posed by incomplete and unstructured code suggestions in the discussions by exploring automated program repair [7], including pattern-based [15] and neural [9] approaches.
Evaluation. To evaluate the impact of the representative dataset, we will train code models like CodeBERT [5], AlphaCode [20], and TravTrans [16] on vanilla versions of CodeSearchNet. Subsequently, we will curate a variant of CodeSearchNet that represents real-world coding patterns and retrain the models on it. We will then compare the differently trained model variants with each other. Standard metrics such as exact match score, statement-level accuracy, or BLEU [26] are of limited value here, because programming tasks can be solved in multiple ways and different-looking programs can be semantically equivalent [31]. Therefore, we will evaluate the model performance using CodeBLEU [31], which also considers abstract syntax trees and data flow. Additionally, we will explore the use of techniques like semantic parsing and program analysis to estimate the semantic similarity of code segments.
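The limitation of purely textual metrics can be illustrated with a small sketch. It is not the CodeBLEU metric; it only contrasts an exact-match check with a comparison of Python abstract syntax trees, using the standard `ast` module. The helper names `exact_match` and `same_syntax_tree` and the toy snippets are assumptions made for this illustration.

```python
import ast


def exact_match(candidate: str, reference: str) -> bool:
    """Textual comparison: any difference in comments or parentheses counts as a miss."""
    return candidate.strip() == reference.strip()


def same_syntax_tree(candidate: str, reference: str) -> bool:
    """Structural comparison: parse both programs and compare their AST dumps.

    ast.dump omits line and column information by default, so comments,
    whitespace, and redundant parentheses do not affect the result.
    """
    return ast.dump(ast.parse(candidate)) == ast.dump(ast.parse(reference))


reference = "def add(a, b):\n    return a + b\n"
# Same program with a comment and redundant parentheses.
reformatted = "def add(a, b):  # sum two numbers\n    return (a + b)\n"
# Equivalent for numbers, but with a different syntax tree.
reordered = "def add(a, b):\n    return b + a\n"

print(exact_match(reformatted, reference))       # textual check fails
print(same_syntax_tree(reformatted, reference))  # structural check succeeds
print(same_syntax_tree(reordered, reference))    # structural check is still too strict
```

The last case shows why syntax-tree comparison alone is not enough either: recognizing that `b + a` and `a + b` compute the same value for numbers requires semantic information such as data flow or program analysis.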

Capturing Code Structure and Semantics
Challenge. In Section 1, we gave some examples of issues with state-of-the-art AI SD assistants, particularly their shortcomings in generating correct and secure code. Note that these are errors at the semantic level. How and to what extent code models can capture and respect program semantics remains an open research question. One conceivable factor is that current code models treat code as formatted text and discard crucial semantic information such as control or data flow [4], implicit usage constraints of third-party libraries, and domain-specific knowledge. It remains unclear what representations of code and SD knowledge should be used by these models.
Envisioned solution. To quantify the problems with current textual code representations, we are developing an analysis technique to assess how much of a program's semantics transformer models learn to capture. We do this by calculating their semantic precision and recall, i.e., how well the models' attention maps match the code's control and data flow graphs. We will also consider the graph edit distance [6]. Code models may benefit from native support of graph-based representations. To this end, we will explore different approaches and investigate their impact on the quality of suggested code: Firstly, we will work on flattening graph representations for code and using them as training data for sequence-based transformers. Positional encodings specialized for graph structures [34] may be instrumental in teaching the model how to interpret its input sequences correctly. Secondly, approaches related to graph attention networks [38] and attention masking [22] will be worth exploring, as they enable constraints on graph relations to be injected directly into the transformer's attention heads. Thirdly, we will explore the potential of graph neural networks [45] and graph transformers [43] to operate directly on graph-based code representations. We will investigate the trade-offs between all three approaches. Potentially, this effort will lead to a foundational code model with increased semantic awareness.
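One possible reading of semantic precision and recall is sketched below. The text does not specify how attention maps are compared with flow graphs, so this sketch assumes that an attention weight at or above a fixed threshold counts as a predicted edge between two tokens, and that the data flow graph is given as a set of token index pairs. The threshold value, the function names, and the toy data are all hypothetical.

```python
def attention_edges(attention, threshold):
    """Token pairs (i, j), i != j, whose attention weight reaches the threshold."""
    return {
        (i, j)
        for i, row in enumerate(attention)
        for j, weight in enumerate(row)
        if i != j and weight >= threshold
    }


def semantic_precision_recall(attention, flow_edges, threshold=0.3):
    """Compare thresholded attention edges with the edges of a flow graph.

    Precision: share of attention edges that are also flow edges.
    Recall: share of flow edges that the attention map recovers.
    """
    predicted = attention_edges(attention, threshold)
    true_positives = len(predicted & flow_edges)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(flow_edges) if flow_edges else 0.0
    return precision, recall


# Toy attention map for four tokens (each row sums to 1).
attention = [
    [0.70, 0.10, 0.10, 0.10],
    [0.50, 0.40, 0.05, 0.05],
    [0.10, 0.60, 0.20, 0.10],
    [0.40, 0.10, 0.40, 0.10],
]
# Toy data flow edges: token 1 depends on 0, tokens 2 and 3 depend on 1.
flow_edges = {(1, 0), (2, 1), (3, 1)}

precision, recall = semantic_precision_recall(attention, flow_edges)
print(precision, recall)
```

In this toy case the attention map recovers two of the three flow edges but also attends strongly to two unrelated token pairs. A real analysis would additionally have to decide how to aggregate over attention heads and layers and how to map subword tokens to graph nodes.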
Evaluation. To evaluate the semantic awareness of our graph-based code models, we will use the novel analysis technique described above. Additionally, to evaluate their robustness, we will randomly select code artifacts and make semantics-preserving changes to them, such as renaming variables or translating for-statements into while-statements. In a robust semantic-aware model, such semantics-preserving changes should typically not be reflected in large-scale changes to the attention maps. We will also compare the performance of the semantic-aware foundational code models we build with standard models like CodeBERT, TravTrans, and GraphCodeBERT [8] on downstream tasks using the metrics mentioned in Subsection 2.1.
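A semantics-preserving change of the kind mentioned above can be sketched with Python's standard `ast` module (`ast.unparse` requires Python 3.9 or later). The transformer below renames one variable; it is deliberately naive and assumes that the new name is unused and that the old name is not shadowed in nested scopes. The class and function names are hypothetical.

```python
import ast


class RenameVariable(ast.NodeTransformer):
    """Rename every occurrence of one identifier (naive: ignores scoping rules)."""

    def __init__(self, old: str, new: str):
        self.old = old
        self.new = new

    def visit_Name(self, node: ast.Name) -> ast.Name:
        # Covers reads and writes of local variables.
        if node.id == self.old:
            node.id = self.new
        return node

    def visit_arg(self, node: ast.arg) -> ast.arg:
        # Covers function parameters.
        self.generic_visit(node)
        if node.arg == self.old:
            node.arg = self.new
        return node


def rename_variable(source: str, old: str, new: str) -> str:
    """Return the source code with the identifier `old` replaced by `new`."""
    tree = RenameVariable(old, new).visit(ast.parse(source))
    return ast.unparse(tree)


def run_total(source: str, data):
    """Execute the snippet and call its `total` function on `data`."""
    namespace = {}
    exec(source, namespace)
    return namespace["total"](data)


original = (
    "def total(values):\n"
    "    acc = 0\n"
    "    for v in values:\n"
    "        acc += v\n"
    "    return acc\n"
)
renamed = rename_variable(original, "acc", "running_sum")
print(renamed)
print(run_total(original, [1, 2, 3]) == run_total(renamed, [1, 2, 3]))
```

Both variants compute the same result, so a robust model would be expected to produce similar attention patterns for them. A production-quality transformation would need proper scope analysis to guarantee that behavior is preserved.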

Code Quality
Challenge. Human software engineers receive feedback on their code as they write it through automated tools like compiler checks and linters, or from colleagues during pair programming and code reviews. Such feedback loops can have a very positive impact on code quality. For example, studies on pair programming report fewer bugs, better readability, higher test coverage and passing rates, and other benefits [25,41,42]. Code models, on the other hand, lack similar mechanisms. They are typically optimized for accuracy, not for qualitative criteria. We believe that a reliable AI SD assistant should provide code that not only accurately reflects existing code patterns, but is also of high quality according to well-established quality criteria and best practices.
Envisioned solution. Inspired by the feedback loop described above, we propose an approach based on reinforcement learning (RL) [35] to fine-tune code models for multiple code quality criteria. The approach shares ideas with actor-critic RL [17,36], where, in our case, the code model takes the actor's role, and one or more code analysis tools serve as critics. Each critic evaluates the code generated by the actor and returns a token-wise reward value. The rewards are then aggregated (and traded off against each other) using utility functions known from multi-objective RL [10]. Finally, the policy gradient is used to update the actor model, e.g., through proximal policy optimization (PPO) [33].
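The aggregation step can be sketched as follows. The text leaves the choice of utility function open; this sketch assumes the simplest option, a weighted sum (linear scalarization) of the token-wise rewards. The critic names, weights, and reward values are made up for the illustration and do not come from any real tool.

```python
def aggregate_rewards(critic_rewards, weights):
    """Combine token-wise rewards from several critics into one reward per token.

    critic_rewards maps a critic name to a list with one reward per generated token.
    weights maps the same critic names to their relative importance.
    A weighted sum is only one possible utility function.
    """
    names = list(critic_rewards)
    if not names:
        return []
    length = len(critic_rewards[names[0]])
    if any(len(critic_rewards[name]) != length for name in names):
        raise ValueError("all critics must return one reward per generated token")
    return [
        sum(weights[name] * critic_rewards[name][t] for name in names)
        for t in range(length)
    ]


# Hypothetical rewards for a program of four tokens.
critic_rewards = {
    "linter": [0.0, -1.0, 0.0, 0.5],    # style warning at token 1, praise at token 3
    "security": [0.0, 0.0, -2.0, 0.0],  # insecure API call at token 2
}
weights = {"linter": 0.5, "security": 1.0}

token_rewards = aggregate_rewards(critic_rewards, weights)
print(token_rewards)
```

The resulting per-token rewards could then feed a policy-gradient update such as PPO; that training step is outside the scope of this sketch.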
While ideas from actor-critic RL have recently been explored for program correctness [19], our setting has a number of special characteristics that have not been systematically explored: Since there is no universal metric for code quality, we resort to multiple critics, each specializing in one or more qualitative aspects. Possible critics for general best practices are linters and similar static analysis tools.
To optimize security aspects, specialized security checkers like CogniCrypt [18] or CryptoGuard [30] can be used. One could also consider more subjective criteria such as readability metrics [32]. In our setting, critics must provide individual rewards for each token. This enables reward shaping and, thus, more efficient learning, as it conveys to the actor which actions (i.e., generated tokens) were particularly good or bad. Despite this, our setting has the benefit that we can generate a complete program before applying the critics to it. This allows us to use a wide variety of critics, even if they cannot handle partial (incomplete) programs. Lastly, we must point out that multi-objective RL with many (i.e., four or more) objectives is considered a hard problem [10]. In our case, however, the different code quality objectives can be assumed to be mostly non-conflicting. Therefore, there is no need to find all Pareto-optimal policies (one is sufficient), which allows us to use much more efficient algorithms.
Evaluation. The effects of our code quality fine-tuning approach can be measured by a direct comparison of variants of a code model before and after fine-tuning. The same tools used as critics can be used to evaluate the code generated by the two variants. In addition, the code should be audited and rated by human experts. For most code quality criteria, any coding benchmark is suitable, while specialized benchmarks like LLMSecEval [37] can be used to evaluate safety and security criteria.

Explainability
Challenge. Current code models are not designed to base their answers on concrete background knowledge and use that knowledge to account for their answers to the user. While instruction fine-tuned (code) models [44] often add explanations to their answers, these answers are the result of purely statistical computations and may be hallucinated. For instance, a recent study by Kabir et al. [13] found that ChatGPT answered Stack Overflow questions incorrectly half of the time. In our view, the issue of how to enable LLMs to provide correct and appropriate explanations for code remains unresolved.
Envisioned solution. For this challenge, we plan to investigate how to integrate code knowledge graphs (CKGs) into our AI SD assistant. This way, we want to give the assistant access to a rich, easily updatable source of background knowledge that it can ground its generated code and explanations in. A potential subject for our investigation is IBM's GraphGen4Code [1]. This CKG is a set of triplets describing relationships (edges) between entities (nodes). For a given piece of code, the corresponding CKG contains nodes for each statement, with edges signifying control and data flow. Additionally, GraphGen4Code has established initial semantic links between code fragments and corresponding forum discussions on Stack Overflow. However, the expressiveness of these name-based links is rather superficial.
To improve the expressiveness of the relationships between code and knowledge sources, we propose to enrich GraphGen4Code with information about the task to which a particular code description or forum post relates. This information can be obtained using intent classification. Moreover, while Stack Overflow is a valuable source of discussions about and fixes to programming issues, the GitHub issue tracker can be another equally important source of such information. We thus propose to add links between code segments and GitHub issues to the CKG, along with additional information, such as whether a pull request successfully fixed the corresponding issue.
How to integrate the CKG into the assistant remains to be determined. One option is to augment the code model to query the CKG during the generation of its answer; several papers have already demonstrated how to teach LLMs to use external tools or APIs [23]. Alternatively, in a post-processing step, the CKG could be searched for code segments similar to the generated one. Then, the model could do a second pass over the generated code, using the retrieved information to "decorate" the generated code with explanations. In order to find relevant code samples in the CKG, a separate retriever model could be used [21].
Evaluation. To evaluate the extended CKG's ability to be integrated into the assistant and provide accurate and appropriate explanations to the software engineer, we will utilize evaluation protocols similar to those used to evaluate the original GraphGen4Code. We will involve human annotators and design an agreement metric to rank the quality of the assistant's suggestions and explanations. We will also evaluate the computing cost of enhancing large CKGs with complex semantic links and new sources of information such as GitHub issues.

Controlled Code Generation
Challenge. All measures presented so far serve to enhance the capabilities of the AI SD assistant. However, due to the statistical nature of LLMs, even the best model may occasionally produce bad (e.g., incorrect, insecure, or low-quality) results. This lack of guarantees is a major hindrance to the trustworthiness of AI SD assistants.
Envisioned solution. To provide the desired guarantees, we propose to equip the assistant with constrained (a.k.a. guided) decoding (e.g., [29]), a recently popular extension of the decoding algorithms of causal LLMs that enforces compliance with certain rules for the generated text. At each generation step, all tokens that would lead to a violation of the rules are prevented from being generated by setting their probability to zero before applying the regular decoding strategy. The underlying rules can, e.g., be defined through regular expressions or grammars, and the implementations can be very efficient [40]. This approach has already been applied to coding tasks, where it could, among other things, ensure syntactic correctness [29] and prevent hallucinations [2]. Models equipped with constrained decoding often outperform larger models [2,29,40].
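A single constrained decoding step, as described above, can be sketched in a few lines. The sketch assumes greedy decoding over a toy next-token distribution and one hypothetical rule (a closing parenthesis is only allowed if an opening one is still unmatched). It does not use the API of any real decoding library.

```python
def mask_and_renormalize(probabilities, is_allowed):
    """Set the probability of rule-violating tokens to zero and renormalize the rest."""
    masked = {
        token: (p if is_allowed(token) else 0.0)
        for token, p in probabilities.items()
    }
    total = sum(masked.values())
    if total == 0.0:
        raise ValueError("dead end: every candidate token violates the rules")
    return {token: p / total for token, p in masked.items()}


def greedy_pick(distribution):
    """Regular decoding strategy of this sketch: take the most probable token."""
    return max(distribution, key=distribution.get)


# Tokens generated so far and a toy model distribution over the next token.
prefix = ["print"]
next_token_probs = {")": 0.5, "(": 0.3, "x": 0.2}


def closes_only_open_parens(token):
    """Hypothetical rule: ')' needs an unmatched '(' somewhere in the prefix."""
    return not (token == ")" and prefix.count("(") <= prefix.count(")"))


unconstrained_choice = greedy_pick(next_token_probs)
constrained = mask_and_renormalize(next_token_probs, closes_only_open_parens)
constrained_choice = greedy_pick(constrained)
print(unconstrained_choice, constrained_choice)
```

Without the constraint, greedy decoding would emit an unmatched closing parenthesis; with it, the next most probable valid token is chosen. Real implementations typically apply the mask to the model's logits and track the rule state incrementally instead of rescanning the prefix.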
Since constrained decoding rules can be implemented and applied independently of each other, we envision a modular constrained decoding framework on top of the central code model. The modularity allows the user to select a ruleset appropriate for the current programming language and domain. There is a wide variety of rules that could be implemented. Constrained decoding is, however, no replacement for a capable code model for three reasons: rules must be checkable given only a program prefix; for efficiency reasons, they should be incremental, i.e., not require reanalysis of the program at each generation step; and, depending on the complexity of the rules, the constraints may be unsound and/or incomplete. Still, the lower bound on code quality, and thus the trustworthiness of the assistant, that constrained decoding provides is very valuable.
Evaluation. Measuring how effective constrained decoding is in preventing undesired outputs is easy: First, we perform unconstrained decoding on a series of prompts. Then we iterate over the output sequences using the constrained decoding framework and count in how many cases at least one of the rules was violated (and the constrained decoding would have intervened). This could also be insightful as it reveals typical errors made by unconstrained models. On the other hand, rules can be overly strict and guide the decoding into "dead ends" where the model can only terminate the generation or produce degenerate (i.e., valid but useless) results. How common this is compared to unconstrained decoding can be determined via standard benchmarks. The outcome of the described evaluations will of course highly depend on the used foundation model and ruleset.
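The counting procedure described above can be sketched as follows. The rule, the token sequences, and the function names are hypothetical stand-ins for a real ruleset and for real outputs of an unconstrained model.

```python
def violates(prefix_tokens, token):
    """Hypothetical rule: a closing parenthesis needs an unmatched opening one."""
    return token == ")" and prefix_tokens.count("(") <= prefix_tokens.count(")")


def first_violation(tokens):
    """Index of the first token that constrained decoding would have blocked, or None."""
    for index, token in enumerate(tokens):
        if violates(tokens[:index], token):
            return index
    return None


def intervention_rate(outputs):
    """Share of output sequences in which at least one rule violation occurs."""
    if not outputs:
        return 0.0
    flagged = sum(1 for tokens in outputs if first_violation(tokens) is not None)
    return flagged / len(outputs)


# Toy outputs standing in for sequences produced by unconstrained decoding.
unconstrained_outputs = [
    ["f", "(", "x", ")"],
    ["f", "x", ")"],
    ["(", "(", "x", ")", ")"],
    [")", "("],
]

rate = intervention_rate(unconstrained_outputs)
print(rate)
```

For simplicity, the check rescans the whole prefix for every token; this is acceptable for an offline evaluation, whereas rules used during decoding should be incremental. Recording the position returned by `first_violation` additionally shows where unconstrained models typically go wrong.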

FUTURE PLANS
The vision presented in this paper encompasses multiple ideas, all working towards creating a trustworthy AI software development assistant. Despite this, each idea is independent enough to form its own line of research. Therefore, they are being worked on in parallel by different subgroups of researchers and will lead to several full papers. At present, all components are conceptualized but still need to be implemented and evaluated as described above. Combining all outcomes to build the described programming assistant is a highly ambitious goal and will take several years to realize.

CONCLUSION
In this paper, we shared our understanding of the shortcomings of current code models and presented our vision of how they can be overcome. We considered aspects such as representative datasets, semantic code representations, code quality, explainability, and guarantees.

Figure 1: High-level architecture of our envisioned AI software development assistant. It consists of five main components: (1) A curated training dataset representing real-world coding patterns and software architectures. (2) A foundational code model that uses graph representations for better understanding of program semantics. (3) An RL-based feedback mechanism to fine-tune the model for improved code quality. (4) A semantically enriched code knowledge graph to help the assistant explain its code. (5) A modular constrained decoding framework on top of the model that prevents the generation of undesired code.