Interleaving Static Analysis and LLM Prompting

This paper presents a new approach for using Large Language Models (LLMs) to improve static program analysis. Specifically, during program analysis, we interleave calls to the static analyzer and queries to the LLM: the prompt used to query the LLM is constructed using intermediate results from the static analysis, and the result from the LLM query is used for subsequent analysis of the program. We apply this novel approach to the problem of error-specification inference of functions in systems code written in C; i.e., inferring the set of values returned by each function upon error, which can aid in program understanding as well as in finding error-handling bugs. We evaluate our approach on real-world C programs, such as MbedTLS and zlib, by incorporating LLMs into EESI, a state-of-the-art static analysis for error-specification inference. Compared to EESI, our approach achieves higher recall across all benchmarks (from an average of 52.55% to 77.83%) and a higher F1-score (from an average of 0.612 to 0.804) while maintaining precision (from an average of 86.67% to 85.12%).


Introduction
This paper presents a new approach for using Large Language Models (LLMs) to improve static program analysis. LLMs [18,25] have been shown to demonstrate impressive reasoning abilities in natural-language and programming-language tasks via few-shot [3] and chain-of-thought [28] prompting. The approach presented in this paper utilizes this reasoning ability of LLMs when the static analysis is unable to make progress; the results of the query to the LLM are used for subsequent analysis. Furthermore, the query (or prompt) to the LLM incorporates the current results of the static analysis, which enables it to provide more accurate results. In this way, our approach interleaves calls to the static analyzer and the LLM, with each utilizing the results of the other.
We apply this novel approach to the problem of error-specification inference of functions in systems code written in C, i.e., inferring the set of values returned by each function upon error (Section 2). The C language does not have built-in exception or error handling; thus, a common idiomatic practice for error handling is to check the return value of a function on error, i.e., the return code idiom. These return values indicate the functions' error specifications, which can aid in program understanding as well as in finding error-handling bugs. EESI [6] has shown higher effectiveness and performance at inferring error specifications compared to prior approaches [1,7,14]. Our approach interleaves calls to the EESI static analyzer and the LLM (Figure 1).
We evaluated our approach on six real-world C programs, such as MbedTLS and zlib (Section 5). Our approach improves recall and F1-score over EESI from 52.55% to 77.83% and from 0.612 to 0.804, respectively, while maintaining a high precision of 85.12% compared to 86.67% for EESI. Our evaluation demonstrates that by interleaving static analysis and LLM prompting, we can significantly improve upon the error-specification inference capabilities of a static analyzer alone.
The contributions of this paper are as follows:
• We propose a technique for interleaving a static analysis with LLM prompting.
• We design a tool for error-specification inference of C programs using our approach of combining the EESI static analyzer and LLM prompts.
• We evaluate our approach on six real-world C programs, comparing it with the prior state-of-the-art EESI. We provide an ablation study on the individual components of our approach.

Background
Error Specification Inference. The C language does not feature programming constructs for exception handling. Instead, developers often use the return code idiom to indicate errors. An error specification refers to the set of values returned by a function upon error. Because it is not possible to enforce compile-time rules regarding error-code propagation and checking, the return code idiom often leads to bugs; e.g., developers may miss or incorrectly check the error return values of functions.
A few approaches have presented techniques for inferring error specifications [1,6-8,14,30]. In this paper, we consider EESI [6], a state-of-the-art static program analysis that uses abstract interpretation for error-specification inference. As input, EESI takes multiple forms of optional user-supplied initial domain knowledge: (1) initial error specifications, (2) error codes, (3) success codes, and (4) error-only functions (functions only called along error paths). With this initial domain knowledge, EESI uses static analysis to infer new error specifications.
While EESI has demonstrated success in error-specification inference, it has two inherent limitations that affect its recall and precision: (1) incomplete program facts, and (2) third-party functions. As EESI is a static program analyzer using abstract interpretation to infer program semantics related to idiomatic practices, it provides approximations that may be insufficient for learning enough program facts for error-specification inference. One important source of incomplete knowledge is third-party functions. Third-party functions are called within a program but defined elsewhere. Because the analyzer does not have access to their source code, it cannot reason about their error specifications.
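To make the abstract-interpretation setting concrete, the sketch below models an error specification as a set of sign components with a join (least upper bound). This is a simplified illustration only: EESI's actual abstract domain is defined in the EESI paper [6], and all names here are ours.

```python
# Simplified sketch of a sign-based abstract domain for error
# specifications. Names (BOTTOM, TOP, join, pretty) are illustrative,
# not EESI's actual implementation.

BOTTOM = frozenset()                      # unknown: no information yet
LTZ, ZERO, GTZ = "<0", "0", ">0"          # sign components
TOP = frozenset({LTZ, ZERO, GTZ})         # any value may signal error

def join(a, b):
    """Least upper bound: union of the sign components."""
    return frozenset(a) | frozenset(b)

def pretty(spec):
    if not spec:
        return "bottom (unknown)"
    if spec == TOP:
        return "top (any value)"
    return ", ".join(sorted(spec))

# A function observed to return -1 on one error path and 0 on another
# would be joined to the specification {<0, 0}:
spec = join({LTZ}, {ZERO})
```

In this model, a query that cannot make progress leaves a function at `BOTTOM`, which is exactly the case where our approach turns to the LLM.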
Large Language Models (LLMs). LLMs are language models trained on large amounts of data for tasks such as text generation and language understanding. These models have been developed for both natural language [25] and programming languages [23], while some models are trained for both [18,22,26]. One of the key components of an LLM is the prompt, i.e., the input to the LLM. There has been considerable research in recent years on the generation of prompts that improve the performance of LLMs on various tasks [2,22,27,28]. These approaches include concepts such as chain-of-thought [28] prompting, where LLMs are given question-and-answer examples with the associated chain-of-thought reasoning, and self-consistency [27] prompting, where LLMs are prompted with the same question multiple times and the most consistent answer is used.

Overview Example
This section illustrates how our approach interleaves calls to the EESI static analyzer and the LLM to infer error specifications. Consider the function x509_get_attr_type_value in MbedTLS. EESI alone is unable to infer its error specification; EESI infers ⊥ as the function's error specification, as shown in Figure 2.
The LLM alone is also unable to infer its error specification. We can construct a prompt to the LLM that includes a general description of the error-specification inference problem (Common Context in Figure 2) as well as the source code of the function (Question in Figure 2). However, querying the LLM with just this information is not sufficient to give us the correct error specification. In particular, the LLM infers from the conditional check that the error condition for mbedtls_asn1_get_tag is ≠ 0. Even when the value of the error code MBEDTLS_ERR_X509_INV_NAME is included in the Common Context, the incorrect assumption about the called function leads the LLM to incorrectly infer that the return value on the error path is the negative error code added to any non-zero value; that is, the LLM infers that the error value could be anything, and the error specification is ⊤ instead of < 0. However, if we also include intermediate results from the EESI static analyzer in the LLM prompt, then the LLM is able to return the fact that x509_get_attr_type_value returns a value < 0 on error. In particular, the LLM prompt includes the error specification of the function mbedtls_asn1_get_tag that is called from x509_get_attr_type_value (Function Context in Figure 2); this error specification is inferred by the EESI static analyzer.
This example illustrates how our approach provides benefits over purely static analysis or purely LLM approaches by interleaving calls to the static analyzer and the LLM: the LLM is used only when the static analyzer is unable to make progress, and the LLM prompt includes intermediate information gleaned by the static analyzer. Furthermore, the output of the LLM is fed back into the EESI static analyzer. For example, the LLM's specification for x509_get_attr_type_value would allow EESI to subsequently find the error specification (< 0) for mbedtls_x509_get_name from analyzing its implementation:

    if ((ret = x509_get_attr_type_value(...)) != 0)
        return (ret);

The specifics of the LLM prompt construction, viz., Common Context, Function Context, and Question, are deferred to Section 4.1. Figure 3 illustrates another scenario demonstrating the benefits of incorporating calls to an LLM in the static analyzer. The function otrng_global_state_instance_tags_read_from is a third-party function called in Pidgin OTRv4. Because the source code of this function is not available, EESI is unable to infer its error specification and, consequently, it might not be able to infer the specifications of functions that call it. However, by constructing an LLM prompt that includes information from the user-provided domain knowledge, the LLM is able to correctly infer the error specification for otrng_global_state_instance_tags_read_from.

Approach
We illustrate our approach for interleaving static analysis and LLMs in Figure 1. The inputs are the program source code and optional domain knowledge, and the output is the set of function error specifications inferred by the analysis.

Building Prompts
When interacting with the LLM, we construct a prompt that consists of the Common Context, Function Context, and Question, as mentioned in Section 3.
Common Context. The Common Context of the prompt used for error-specification inference consists of a problem description and an explanation of the abstract domain used by the EESI static analyzer. We provide the explanation of the abstract domain because we want the LLM to output its learned error specifications using this domain. Relating to the program under analysis, the Common Context also contains any error codes, success codes, and error-only functions from the domain-knowledge input, along with additional observed idiomatic practices related to the return code idiom. We also provide multiple basic chain-of-thought examples that consist of a function definition and its associated error specification, with a chain-of-thought explanation. We do so to demonstrate the task of error-specification inference and so that the LLM generates parseable output. We do this, in addition to providing the explanation of the abstract domain, in order to limit the LLM from generating unexpected output. If the LLM output does not follow the expected format, then the related error specification will consist of the ⊥ element, i.e., unknown. For example, the expected output for malloc would be malloc: 0.

Function Context. The Function Context of the prompt contains any relevant function error specifications for the function that is being queried by the LLM. The Function Context that is generated depends on the selected LLM query function, which will be explained further when introducing our algorithm, Algorithm 1. In all cases, these error specifications are provided as few-shot examples to the LLM, with the aim of generating parseable output as well as providing demonstrative examples to the LLM. These error specifications provide additional context that can assist the LLM in understanding returned error values. This is especially true when there are functions in the same library, as demonstrated in Figure 2.
Question. The Question in all constructed prompts asks the LLM to return any error specification it is confident in, expressed in the abstract domain used by EESI.
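The three prompt sections above can be assembled as in the following sketch. The exact template wording used by our tool is not reproduced in the paper, so every string here is illustrative:

```python
def build_prompt(common_context, function_context, question):
    """Concatenate the three prompt sections described above.

    common_context:   problem description, abstract domain, domain knowledge.
    function_context: known error specifications given as few-shot examples.
    question:         the function being queried (name, or name plus source).
    All wording is illustrative, not the tool's actual template.
    """
    parts = [
        "## Common Context\n" + common_context,
        "## Function Context\n" + function_context,
        "## Question\n" + question,
    ]
    return "\n\n".join(parts)

prompt = build_prompt(
    "Infer the set of return values a C function returns on error, "
    "expressed in the abstract domain {<0, 0, >0}.",
    "mbedtls_asn1_get_tag: <0",
    "What is the error specification of x509_get_attr_type_value?\n"
    "<source code here>",
)
```

Note how the Function Context line carries an intermediate result from the static analysis (the specification for mbedtls_asn1_get_tag), which is the ingredient that corrected the LLM's answer in the Figure 2 example.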

Error Speci cation Inference
For the task of error-specification inference, we present Algorithm 1 to demonstrate how the static analyzer and LLM are used. Our algorithm takes in the domain knowledge as a map of program facts and the set of functions from the source code. The algorithm returns the updated facts after performing the analysis.
The analysis begins by iterating over the functions bottom-up in the call graph, as demonstrated on Line 2. This ensures that called functions are inferred before their callers, because called functions provide additional context for error-specification inference. Note that, for brevity, we do not show in the algorithm that we perform a fixpoint computation on the strongly connected components (SCCs) of the call graph, as recursion may exist in the call chains. The algorithm attempts to infer an error specification in one of three cases: (1) queryLLMThirdParty (Line 4), (2) runAnalysis (Line 6), or (3) queryLLMAnalysis (Line 8).
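The three-case structure of Algorithm 1 can be sketched as a short loop. The helper names and the toy "analysis" and "LLM" below are our illustrative assumptions; the paper gives the algorithm only as pseudocode, and a real deployment would invoke EESI and an actual LLM at the marked call sites.

```python
# Sketch of Algorithm 1's top-level interleaving loop (illustrative).

BOTTOM = None  # the "unknown" error specification

def infer_error_specifications(facts, functions, third_party,
                               run_analysis, query_llm):
    """Process functions bottom-up, falling back to the LLM whenever
    the static analysis is stuck or the source is unavailable."""
    for func in functions:  # assumed already in bottom-up call-graph order
        if func in third_party:
            spec = query_llm(func, facts)       # case 1: no source code
        else:
            spec = run_analysis(func, facts)    # case 2: static analysis
            if spec is BOTTOM:
                spec = query_llm(func, facts)   # case 3: analysis stuck
        if spec is not BOTTOM:
            facts[func] = spec                  # feed the result back
    return facts

# Toy stand-ins: the "analysis" can only propagate a callee's known
# specification through an `if (ret != 0) return ret;` style check.
CALLEES = {"mbedtls_x509_get_name": "x509_get_attr_type_value"}

def toy_analysis(func, facts):
    callee = CALLEES.get(func)
    return facts.get(callee) if callee else BOTTOM

def toy_llm(func, facts):
    return "<0"  # pretend the LLM answers "< 0" for this example

facts = infer_error_specifications(
    {}, ["x509_get_attr_type_value", "mbedtls_x509_get_name"],
    third_party=set(), run_analysis=toy_analysis, query_llm=toy_llm)
```

The toy run mirrors the overview example: the LLM supplies the specification for x509_get_attr_type_value, after which the static analysis alone can derive the specification for its caller, mbedtls_x509_get_name.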

Third-Party Function Error Specifications.
For each function, we first check whether it is a third-party function (Line 3), and if it is, we perform queryLLMThirdParty as demonstrated on Line 4. Because the source-code definition is not available for third-party functions, we cannot statically analyze them. As Function Context for the prompt, we provide the entire set of error specifications in the program facts (Line 22). The Question in this case simply lists the name of the function of interest (Line 21). The LLM is then queried, its output is parsed (Line 24), and if any error specification is learned, the program facts are updated (Line 10).

Error Specification Analysis.
If the function is not third-party, then the EESI static analyzer performs its own analysis. EESI determines whether the error specification of the function is infallible (∅), unknown (⊥), or any other value (e.g., < 0) via runAnalysis on Line 6. If this result is ⊥ (Line 7), then we query the LLM once for the function under analysis with queryLLMAnalysis on Line 8.
Unlike the Function Context provided in queryLLMThirdParty, we only provide the known error specifications of the called functions contained in the function definition (Line 15). We demonstrate an example of this in Figure 2, where the error specification of mbedtls_asn1_get_tag is learned by EESI and provided as Function Context to the LLM, which then correctly infers the specification of x509_get_attr_type_value.
The constructed Question as part of the prompt consists of the source code of the function being analyzed (Line 14).
The resulting output from the LLM is then parsed (Line 17) and any newly learned error speci cation is updated in the program facts on Line 10.

Validating the LLM Response.
We re-query the LLM for every generated prompt to limit the side effects of hallucination. Hallucination refers to when LLMs make up information to satisfy a prompt, even if the provided chain-of-thought reasoning is contradictory. We specifically ask the LLM to ensure that the error specifications it provides match its own chain-of-thought description. Additionally, we limit some of the imprecision by identifying two inconsistencies with formal reasoning. First, we do not infer an error specification if the resulting error value from the LLM includes a known success value. Second, we do not infer an error specification if the LLM states that the error specification is an improper superset of the return range of the function. As both of these program semantics are obtained via an approximation during the analysis of EESI, we cannot guarantee that these inconsistencies are removed entirely, but we can utilize these rules to catch the low-hanging fruit.
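The two consistency checks can be expressed compactly over set-valued specifications. This is a minimal sketch under our own naming; in the real tool, the success values and return ranges come from EESI's (approximate) analysis results.

```python
# Sketch of the two formal-reasoning checks on an LLM-proposed
# error specification (names are illustrative). Specifications,
# success values, and return ranges are modeled as sets of
# abstract sign components such as {"<0", "0", ">0"}.

def accept_llm_spec(spec, success_values, return_range):
    """Reject a proposed error specification that contains a known
    success value, or that is not a subset of the function's
    (approximated) return range."""
    if spec & success_values:
        return False        # check 1: overlaps a known success value
    if not spec <= return_range:
        return False        # check 2: improper superset of return range
    return True
```

For instance, if 0 is a known success value, a proposed specification of {<0} over a return range of {<0, 0} is accepted, while {<0, 0} is rejected by the first check and {<0, >0} by the second.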

Experimental Evaluation
For our experimental evaluation, we perform an ablation study. We propose three research questions, with one baseline, to target components of our approach:
RQ0 How well does the static analysis of EESI perform? This is our baseline.
RQ1 What is the impact of using the LLM to infer third-party error specifications, i.e., queryLLMThirdParty?
RQ2 What is the impact of using the direct LLM analysis, i.e., queryLLMAnalysis?
RQ3 What is the impact of interleaving EESI and the LLM?
Our code and data are publicly available at https://github.com/ucd-plse/eesi-llm.

Experimental Setup
Benchmarks. We consider a data set of six benchmark programs that represent a variety of error-handling patterns and system types, as listed in Table 1.

Domain Knowledge. For all approaches, we supply the same initial domain knowledge as input. Initial error specifications are identified via one of two strategies. The first is that we select applicable error specifications from a list of common and well-known standard-library functions. The second is that we manually inspect a small subset of functions based on the program's call graph, supplying functions that appear lower in the call graph as initial domain knowledge. Success and error codes are mined automatically by pattern-matching header files for patterns such as ERR, err, and SUCCESS. Error-only functions (only called on error paths) are selected via manual inspection. The manual effort involved in finding the above domain knowledge for all benchmarks took a total of one hour.

Evaluation metrics and ground truth. We measure precision, recall, and F1 (F1-score), where we only consider a true positive (TP) to be a learned error specification that matches the ground truth exactly; for example, ≤ 0 and < 0 are not equivalent, and such a mismatch would be considered a false positive (FP). If the analysis determines an error specification is unknown (⊥), then that is considered a false negative (FN).
As every function under analysis has an error specification, even infallible (∅) functions, we do not have true negatives (TN). All metrics are calculated against a manually inspected ground truth G, as depicted in Table 2. For smaller benchmarks, we inspected all functions, but for larger benchmarks we randomly sampled a subset; manual inspection of all functions is not feasible due to time constraints, as some functions consist of hundreds or thousands of lines. Note that the numbers in Table 2 do not count initial error specifications from the domain knowledge.
Precision, recall, and F1 are defined as: Precision = TP / (TP + FP), Recall = TP / (TP + FN), and F1 = 2 · Precision · Recall / (Precision + Recall). EESI is implemented using the LLVM infrastructure [15] to analyze bitcode, and our LLM error-specification inference uses GPT-4 [18] as the LLM. Our experiments were run on a 2.10 GHz Xeon Silver 4216 CPU with 384 GB of RAM.
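The scoring rules above (exact match as TP, mismatch as FP, ⊥ as FN, no TN) can be implemented in a few lines. The function and variable names below are ours, chosen for illustration:

```python
# Computing precision, recall, and F1 from inferred vs. ground-truth
# error specifications, following the paper's counting rules:
# an exact match is a TP, any other inferred value is an FP, and an
# "unknown" (bottom) result is an FN. There are no TNs.

def score(inferred, ground_truth, bottom="bottom"):
    tp = fp = fn = 0
    for func, truth in ground_truth.items():
        guess = inferred.get(func, bottom)
        if guess == bottom:
            fn += 1
        elif guess == truth:
            tp += 1
        else:
            fp += 1              # e.g. "<=0" vs. "<0" is a mismatch
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

For example, inferring {f: <0, g: ≤0} against a ground truth of {f: <0, g: <0, h: <0} yields one TP (f), one FP (g, inexact match), and one FN (h, never inferred), giving precision, recall, and F1 of 0.5 each.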

RQ0: How well does EESI perform in error-specification inference?
For this task, we simply supply the initial domain knowledge and source code to the static analyzer of EESI and receive its inferred error specifications. The numbers of inferred error specifications are reported in Table 3. From these, we can see that the most common error specification inferred across all benchmarks is < 0. Many standard-library functions return a negative error code on failure, and this convention has been adopted by many other software programs. However, it cannot be assumed for all functions, as indicated by benchmarks such as Apache HTTPD. EESI achieves a precision ranging from 64.60% to 97.14%, as seen in Table 3, averaging 86.67% per benchmark. The recall varies even more depending on the benchmark, ranging from 10.33% to 83.33% and averaging 52.55%. The benchmark with the lowest recall, Pidgin OTRv4 (10.33%), is also notably the benchmark with the highest percentage of third-party functions, at 72.2%, as listed in Table 2.

RQ1: What is the impact of queryLLMThirdParty?
We measure the impact of queryLLMThirdParty by running it in the first step of our interleaved error-specification inference. We then run the static analysis of EESI through runAnalysis; however, we do not call queryLLMAnalysis when EESI infers ⊥.
As demonstrated in Figure 4, we observe an average recall of 62.20% (Figure 4c) and an average increase of 29.17% (Figure 4a) in inferred error specifications over EESI. Our precision remained similar to EESI's (Figure 4b). We notice the largest impact for the benchmark Netdata, which increased by 70.50%. This benchmark was impacted significantly because it refers to many well-known libraries such as pthread. We do not see as much of an increase for Pidgin OTRv4, as many of its third-party libraries serve niche purposes, e.g., the GTK library. However, this is not the case for all library functions; for example, the error-specification inference demonstrated in Figure 3 occurs through queryLLMThirdParty.

RQ2: What is the impact of using queryLLMAnalysis?
To isolate the contributions of queryLLMAnalysis, we skip queryLLMThirdParty in the workflow. Instead, we proceed to running the static analysis of EESI, followed by querying the LLM if the result is ⊥.
The results depicted in Figure 4 demonstrate an average increase of 59.88% (Figure 4a) across all benchmarks, with an average recall of 70.26% (Figure 4c). The benchmark that saw the largest percentage increase was Apache HTTPD, at 183.33%, which contains the second-highest percentage of third-party functions (Table 2). In Figure 2, we can see that the direct LLM analysis allows the LLM to reason about function bodies even when the static analysis of EESI is insufficient.
RQ3: What is the impact of interleaving EESI and the LLM?
For our combined approach, we utilize the entire workflow, calling both queryLLMThirdParty and queryLLMAnalysis. We see in Table 4 that our combination of prompting strategies is extremely beneficial for applications such as Pidgin OTRv4, Netdata, and Apache HTTPD, significantly improving the recall and F1 over EESI (Table 3). In fact, we see an increase over the average F1 of EESI of +0.192 (Figure 4d). We also see the precision Δ on newly learned error specifications that were not inferred strictly via static analysis. With Netdata, we saw 144 new < 0 error specifications inferred, with our overall precision going up for the benchmark. We note that even where we lose some precision, as seen with Apache HTTPD, we have an increase of 188.07% in inferred specifications and still significantly improve our F1-score, to 0.752. In Figure 4, our combination of prompting strategies improved the total number of inferred error specifications (Figure 4a), obtaining the highest recall (Figure 4c) and F1 (Figure 4d) while maintaining a precision similar to the analysis of EESI (Figure 4b). We specifically highlight the advantages that each component demonstrated: queryLLMThirdParty showed great success in assisting the analysis of benchmarks with a significant majority of third-party functions, such as Pidgin OTRv4, while queryLLMAnalysis showed great success in analyzing function bodies directly, inferring error specifications from information such as their calling context.

Related Work
Error Specification Inference. Acharya and Xie [1] introduce techniques for mining error specifications for APIs using static traces. APEx [14] uses path-sensitive symbolic execution to find error paths on the assumption that error paths are shorter than normal paths. Several other works [10,20,21,24] find function error specifications via fault injection. MLPEx [30] is a machine-learning-based approach that uses path features to learn whether or not a program path is an error path. EESI [6] is a static analysis of C programs for error-specification inference that allows the use of domain knowledge to bootstrap the analysis. Our work improves EESI by interleaving it with LLM prompting.
Program Analysis and LLMs. Ahmed and Devanbu [2] demonstrate that when an LLM is provided semantic information produced by static analysis, tasks such as code summarization can be significantly improved. Li et al. [16] demonstrate that by carefully crafting questions using function-level behavior and summaries, LLMs can assist in removing false positives from a bug-finding tool. Li et al. [17] also introduce a technique for combining static analysis using symbolic execution with LLMs to find Use Before Initialization (UBI) bugs, demonstrating that the LLM can be used to extract some program semantics and filter out false positives caused by the imprecision of the static analysis. Wen et al. [29] also demonstrate success in removing false-positive warnings by using customized questions with domain knowledge from the Juliet [12] benchmark. LLMs have also recently been used to generate program invariants [19], including generating loop invariants [13] and subsequently ranking them using zero-shot prompting [4]. In contrast to all of the above, our work interleaves facts provided by both a static analysis and an LLM to improve an existing static analysis for error-specification inference.
Program Analysis and Machine Learning. Seldon [5] is a tool that uses semi-supervised learning, building and solving a constraint system from information-flow constraints, for taint-specification inference. InspectJS [9] is an approach for taint-specification inference that uses manual modeling from CodeQL [11], specifications inferred using an adaptation of Seldon, a ranking strategy using embeddings, and manual user feedback. As discussed previously in relation to error-specification inference, MLPEx [30] uses machine learning for error-specification inference. While these approaches combine machine learning and traditional program-analysis techniques to improve analysis results, our technique differs in that we use LLMs and that the static analysis and LLM-based inference results are interleaved throughout the entire analysis.

Conclusion
We have presented an approach for interleaving static program analysis and LLMs for the task of error-specification inference. We have demonstrated that by providing program facts from the analysis of EESI to the LLM, the LLM can infer error specifications correctly and, in turn, can assist EESI in learning further new error specifications. We show this in our evaluation (Section 5), where our average recall grows from 52.55% to 77.83% and our F1-score improves from 0.612 to 0.804. Our evaluation also demonstrates a precision similar to that of the original static analysis, with the average only decreasing from 86.67% to 85.12%.

Figure 1. Our approach infers error specifications by interleaving calls to the EESI static analyzer and the LLM.

Figure 2. Using EESI and the LLM to infer error specifications.

Figure 3. Using the LLM to infer the error specification of a third-party function.

1. Error specification values must be a subset of the returned values of a function.
2. Unknown error specifications are ⊥.
3. Success values are not part of the error specification.
4. The NULL return value is equal to 0.
5. Error codes from standard library functions are positive integers.
6. Macros may check return values and return if failing.

Algorithm 1: InferErrorSpecification
INPUT: Map of program facts, set of functions.
OUTPUT: Updated program facts with new error specifications.

Figure 4. Average increase, precision, recall, and F1-score for each approach. The minimum and maximum benchmark results are represented as error bars for their respective metrics.

Table 1. Selected benchmarks with their LOC and selected domain knowledge: initial error specifications, error-only (EO) functions, error codes, and success codes.

Table 2. Total number of functions, functions in G, and third-party functions in G.

Table 3. Specification counts, precision, recall, and F1-score for EESI.

Table 4. Specification counts, precision, recall, and F1-score for our framework interleaving static analysis and LLMs.