The Emergence of Large Language Models in Static Analysis: A First Look through Micro-Benchmarks

The application of Large Language Models (LLMs) in software engineering, particularly in static analysis tasks, represents a paradigm shift in the field. In this paper, we investigate the role that current LLMs can play in improving callgraph analysis and type inference for Python programs. Using the PyCG, HeaderGen, and TypeEvalPy micro-benchmarks, we evaluate 26 LLMs, including OpenAI's GPT series and open-source models such as LLaMA. Our study reveals that LLMs show promising results in type inference, demonstrating higher accuracy than traditional methods, yet they exhibit limitations in callgraph analysis. This contrast emphasizes the need for specialized fine-tuning of LLMs to better suit specific static analysis tasks. Our findings provide a foundation for further research towards integrating LLMs for static analysis tasks.


INTRODUCTION
In the dynamic field of Software Engineering (SE), the incorporation of advanced computational models, especially Large Language Models (LLMs), marks a significant shift in software development processes [7,8,21,22]. Static analysis (SA), an integral component of SE, involves examining source code without executing it to identify potential errors, code-quality issues, and security vulnerabilities. The emergence of LLMs such as BERT [6], T5 [15], and GPT [14] has transformed several diverse SE tasks, including SA tasks [21]. Recent works have shown how different SA tasks can benefit from LLMs, such as false-positive pruning [9], improved program-behavior summarization [10], type annotation [17], and general enhancements in the precision and scalability of SA tasks [10], both fundamental issues of SA.
This study situates itself at the intersection of SA and LLMs, specifically focusing on the effectiveness of LLMs in SA within SE. It aims to evaluate the accuracy of LLMs in performing two specific SA tasks on Python programs: callgraph analysis and type inference. Callgraph analysis helps in understanding the relationships and interactions between different components of a program, while type inference aids in identifying potential type errors and improving code reliability. To assess the performance of LLMs in these areas, we use the PyCG [16] and HeaderGen [20] micro-benchmarks for callgraph analysis, and TypeEvalPy [19] for type inference.
The use of micro-benchmarks in evaluating the performance of LLMs in our study is grounded in several key considerations. Firstly, micro-benchmarks are designed to target specific aspects of the features under test and various characteristics of the programming language involved. This helps highlight the models' strengths and weaknesses, allowing for a more nuanced understanding of their capabilities in SA tasks. Additionally, their development involves rigorous manual inspection and adherence to scientific methods, ensuring reliability and accuracy in evaluation. Conversely, obtaining large-scale, real-world data that can serve as ground truth is often a challenging endeavor. Where such data is available, it is susceptible to human errors, which can skew the results.
By testing a range of 26 different LLMs, our study provides a comprehensive analysis of their capabilities in the context of SA. Furthermore, the evaluation enables one to make direct comparisons with the existing capabilities of traditional approaches in SA. The insights from this study are intended to offer a preliminary understanding of the role LLMs can play in SA, and potentially guide future research and practical applications in the AI4SE and SE4AI fields.
The structure of the paper is as follows: Section 2 provides a motivating example to introduce the concepts of SA. In Section 3 we discuss the related work. The research questions are outlined in Section 4, while Section 5 describes our methodology. Results are presented in Section 6 and subsequently discussed in Section 7. Section 8 addresses the threats to validity. Finally, the paper is concluded by outlining future research directions in Section 9.
Availability. TypeEvalPy is published on GitHub as open-source software: https://github.com/secure-software-engineering/TypeEvalPy

BACKGROUND
In the following code, the create_str function returns a string, the variable func_ref is assigned function references at lines 4 and 8, and x is assigned the value result + 1 at lines 6 and 10.
1   def create_str(x):
2       return x.upper()
3
4   func_ref = create_str
5   result = func_ref("Hello!")
6   x = result + 1  # Type mismatch!
7
8   func_ref = len
9   result = func_ref("Hello!")
10  x = result + 1  # Works

Type Inference. A static analyzer with type inference capabilities can resolve that the variable result at line 5 is a string, while the variable result at line 9 is an integer. Using this, the static analyzer can raise a type error at line 6 even before executing the code.
Callgraph. The complete callgraph for the snippet is as follows:

main → create_str() → upper()
main → len()

A flow-sensitive analysis can further resolve exactly where these calls are made. For instance, it can resolve that at line 5 the variable func_ref points to the function create_str, while at line 9 func_ref points to the function len.
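For illustration only (this is our own sketch, not the representation used by PyCG or HeaderGen), the callgraph above can be encoded as a map from caller to callee sets, over which reachability is a simple traversal; the names `callgraph` and `reachable` are hypothetical:

```python
# Illustrative encoding of the snippet's callgraph as caller -> callee edges.
# "main" denotes the module-level scope.
callgraph = {
    "main": {"create_str", "len"},
    "create_str": {"str.upper"},
}

def reachable(cg, start):
    """Return all functions transitively callable from `start`."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for callee in cg.get(node, ()):
            if callee not in seen:
                seen.add(callee)
                stack.append(callee)
    return seen
```

For instance, reachable(callgraph, "main") yields create_str, len, and str.upper, matching the edges listed above.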

RELATED WORK
Ma et al. [11] and Sun et al. [18] explore the capabilities of LLMs when performing different program analysis tasks such as control-flow graph construction, callgraph analysis, and code summarization. They conclude that while LLMs can comprehend basic code syntax, they are somewhat limited in performing more sophisticated analyses, such as pointer analysis and code behavior summarization. In contrast, LLift, an LLM-based approach, showed successful results for program analysis tasks such as program behavior summarization [10], demonstrating how LLMs can be successfully integrated into an SA pipeline. Researchers conjecture that the reasons behind the difference in results were benchmark selection, prompt designs, and model versions. Li et al. [9] present a solution to prune SA false positives by asking carefully constructed questions about function-level behaviors or function summaries. Seidel et al. [17] propose CodeTIDAL5, a Transformer-based model trained to predict type annotations in TypeScript. In this study, we explore how different LLMs perform on callgraph analysis and type inference for Python programs.

RESEARCH QUESTIONS
We focus on the following research questions to evaluate the effectiveness of LLMs using micro-benchmarks in static analysis tasks:

RQ1: What is the accuracy of LLMs in performing callgraph analysis against micro-benchmarks?
RQ2: What is the accuracy of LLMs in performing type inference against micro-benchmarks?

METHODOLOGY
We next describe the experimental setup, the model selection criteria, prompt design, and metrics used to investigate these RQs.

Micro-benchmarks. To answer RQ1, we choose two benchmarks designed to evaluate callgraph analysis performance: PyCG [16] and HeaderGen [20]. PyCG is the first callgraph construction algorithm that uses a context-insensitive and flow-insensitive SA as its backend. PyCG includes a micro-benchmark containing 112 unique Python programs targeting various Python features organized into 16 categories. HeaderGen is a tool that uses SA to enhance comprehension in computational notebooks. HeaderGen improves PyCG's static analyzer with flow-sensitivity and type inference. HeaderGen includes a micro-benchmark with 121 code snippets with flow-sensitive call sites as ground truth. Note that for this study we have extended PyCG's micro-benchmark with additional snippets from the HeaderGen micro-benchmark. To answer RQ2, we choose the micro-benchmark from TypeEvalPy [19], a general framework for evaluating type-inference tools in Python. TypeEvalPy contains a micro-benchmark with 154 code snippets and 845 type annotations as ground truth.
Model Selection. In this study, we evaluate several state-of-the-art LLMs. First, we include two closed-source LLMs, GPT-3.5 Turbo and GPT-4 from OpenAI, as OpenAI is the leading provider of general-purpose LLMs. Furthermore, we include ten popular open-source models based on the download count on the Hugging Face [1] platform. This includes llama2, mistral, dolphin-mistral, codellama, codebooga, tinyllama, vicuna, wizardcoder, and orca. We include several variations of these models, such as different parameter counts (7B, 13B, etc.). Overall, we evaluate 24 open-source models and two closed-source models, totaling 26 LLMs.
Furthermore, we create a fine-tuned version of GPT-3.5 Turbo, refined with a training dataset. The dataset created for fine-tuning GPT-3.5 Turbo comprises 15 program categories. It serves as a representative collection of the PyCG, HeaderGen, and TypeEvalPy micro-benchmarks, emphasizing key Python features such as functions, classes, decorators, and exceptions. This approach seeks to enhance the model's adaptability, equipping it to effectively handle a diverse range of challenges.
Prompt Design. To optimize prompt design, we adopted an iterative and experimental approach [5]. Initial efforts focused on enhancing the prompt by incorporating detailed task descriptions and specifying the expected response format. Notably, we used a one-shot prompting technique, embedding an example question and answer within the prompt. Despite these refinements, we encountered challenges with the LLMs' ability to produce structured outputs. Our experiments revealed that even with explicit instructions to generate outputs in JSON format, models struggled to deliver results that could be reliably parsed. To address this, we explored a question-answer based method, querying the model and then translating its natural-language responses back into a structured JSON format. This offers a more flexible solution to the challenges of generating structured data outputs.
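As a sketch of this post-processing step (the helper below is hypothetical and not the exact code of our pipeline), one can scan a free-text model response for an embedded JSON object and fall back to None when parsing fails:

```python
import json
import re

def extract_json(response: str):
    """Best-effort extraction of a JSON object from a free-text LLM reply.
    Hypothetical helper; returns None when no parsable object is found."""
    # Greedy match from the first '{' to the last '}' in the reply.
    match = re.search(r"\{.*\}", response, re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
```

This tolerates surrounding chatter such as "Sure, here is the answer: {...}", which strict JSON parsing of the raw reply would reject.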
Evaluation Metrics. To assess both flow-insensitive callgraph construction and flow-sensitive call-site extraction, in this study we measured completeness, soundness, and exact matches. Completeness is the absence of false positives in the callgraph, ensuring that no call edges were included if they did not exist. Soundness, conversely, focuses on the inclusion of every call edge, thereby avoiding any false negatives. Exact matches are measured as the number of function calls that exactly match the ground truth. This evaluation approach mirrors the methodologies used in previous studies, specifically in PyCG [16] and HeaderGen [20]. Furthermore, aligning with the literature [4,12,13,19], for type-inference evaluation we use exact matches as the metric. Additionally, the total runtime of these tools for analyzing the respective micro-benchmark is also included by computing the mean over three runs.
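Per test case, the three callgraph metrics defined above can be computed over edge sets as sketched below (our own illustration; the actual harnesses of PyCG and HeaderGen may differ in detail):

```python
def callgraph_metrics(predicted, ground_truth):
    """Compute completeness, soundness, and exact matches for one test case.
    Edges are (caller, callee) pairs."""
    predicted, ground_truth = set(predicted), set(ground_truth)
    return {
        "complete": predicted <= ground_truth,   # no false-positive edges
        "sound": ground_truth <= predicted,      # no false-negative edges
        "exact_matches": len(predicted & ground_truth),
    }
```

A tool that misses an edge is incomplete in neither sense but unsound; one that invents an edge is sound only if it also found every real edge, but never complete.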
Implementation Details. In the implementation of our experiments with LLMs, we employed Ollama [3], an open-source platform that simplifies running LLMs by providing an efficient HTTP server for lifecycle management. This served as our backend infrastructure. In addition, to create a pipeline for efficient prompting and response handling, we used LangChain [2], a framework designed for building applications that interact with LLMs. Additionally, to implement the type-inference experiments, we extended the TypeEvalPy framework [19], due to its flexibility in adding support for new tools.
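To make the setup concrete, the sketch below prepares a non-streaming request against Ollama's default local HTTP endpoint (a simplified stand-in assuming the default port 11434; our actual pipeline issues such requests through LangChain):

```python
import json
import urllib.request

# Ollama's default local endpoint (assumed default port 11434).
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request asking a local Ollama server for a completion.
    Send it with urllib.request.urlopen(...) once a server is running."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        OLLAMA_URL,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Swapping the model name in the payload is all that is needed to run the same prompt against each of the open-source models under test.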

RESULTS
We next present the findings of our study, addressing the research questions and highlighting key results.

RQ1: Accuracy of Callgraph Analysis
Table 1 presents the outcomes of our experiments using LLMs on the flow-insensitive callgraph analysis micro-benchmark of PyCG and the flow-sensitive callgraph analysis micro-benchmark of HeaderGen.

Flow-insensitive callgraph analysis. The static analysis tool PyCG demonstrated superior performance over LLMs in terms of completeness, soundness, exact matches, and processing time. Specifically, on the 121 test cases in the benchmark, PyCG achieved 93.3% completeness and 86.7% soundness, significantly outperforming the closest LLM, ft:gpt-3.5-turbo, which only achieved 57.8% completeness and 61.9% soundness. Furthermore, PyCG obtained 250 exact matches (out of 284), which is 43 more than ft:gpt-3.5-turbo. This performance difference is further emphasized in running times: PyCG processed flow-insensitive callgraphs 190 times faster than ft:gpt-3.5-turbo. Among the LLMs, the best-performing one without fine-tuning is gpt-4; however, the fine-tuned gpt-3.5-turbo model surpasses the vanilla gpt-4, indicating the potential benefits of fine-tuning LLMs for specific applications. Yet, other open-source models lagged significantly in performance. Notably, due to their failure to produce structured outputs in line with our prompts, some LLMs like codellama:34b-instruct, vicuna:13b, llama2:70b, and llama2:7b experienced lengthy running times. Despite clear instructions regarding the output format and the instruction to avoid explanatory content, they sometimes continued to generate irrelevant content and consequently reached the timeout.
Flow-sensitive callgraph analysis. Here, HeaderGen demonstrated superior performance over LLMs across all evaluated metrics. In particular, HeaderGen achieved 91.7% completeness and 93.3% soundness, more than double the performance of its closest LLM competitor, ft:gpt-3.5-turbo, which managed only 38.8% completeness and 39.6% soundness. In terms of exact matches, HeaderGen identified 327 out of 355 call sites, surpassing the best-performing LLM by 178 matches. Moreover, HeaderGen's runtime is 15 times shorter than that of the fastest LLM in analyzing the entire benchmark. Note that LLMs fared considerably poorer in flow-sensitive analysis compared to flow-insensitive analysis, likely due to the increased complexity and the requirement for precise flow-sensitive pointer information, which may pose challenges to LLMs. This holds even though the prompt provided specific instructions to make the LLMs aware of the flow-sensitive aspects.

RQ2: Accuracy of Type Inference
Table 2 shows the exact-match performance of LLMs, HeaderGen, and HiTyper. In general, LLMs here significantly outperform the current state-of-the-art approaches for type inference, namely HeaderGen and HiTyper. Specifically, OpenAI's GPT-4 is the best-performing model, correctly inferring 775 of 845 type annotations in the micro-benchmark. This is expected, as GPT-4 is one of the most powerful LLMs available, though it can be slow and expensive to run. It is also interesting to see that the fine-tuned version of GPT-3.5 Turbo is the second-best-performing model, with 730 correctly inferred type annotations and an inference speed four times faster than that of GPT-4. Considering open-source LLMs, with 699 correctly inferred annotations, CodeLlama (13B-instruct) has performance comparable to GPT-4 and the fine-tuned GPT-3.5. LLMs specialized in code-related tasks like CodeLlama outperform general-purpose LLMs such as vanilla LLaMA. Another observation is that TinyLlama, a 1.1B-parameter model, performs poorly: it infers only 26 annotations correctly. It seems that models smaller than seven billion parameters, like TinyLlama, are insufficiently capable at the type-inference task.

DISCUSSION
Similar to findings in previous work [11,18], we observe that the construction of callgraphs does not yet significantly benefit from the use of LLMs. For this task, traditional SA methods remain more efficient than LLMs. However, fine-tuning GPT models showed promising improvements in callgraph analysis results, paving the way for future research in this direction.
In the type-inference tasks, LLMs such as gpt-4 and gpt-3.5 have demonstrated promising results, as evidenced in our study involving the TypeEvalPy framework. Nonetheless, using LLMs for type inference in extensive Python projects can be resource-intensive. Moreover, employing OpenAI's services incurs monetary costs and lacks privacy for proprietary projects. Open-source LLMs like CodeLlama avoid these problems, as they are free and also offer the advantage of local deployment.
The LLMs tested in this study are predominantly large, having over seven billion parameters. This renders them unsuitable for deployment on standard machines equipped with a single GPU. In contrast, PyCG and HeaderGen, both traditional SA methods, are capable of operating well within such hardware constraints. Consequently, for SA tasks, traditional SA methods still yield the best trade-off between accuracy and speed. Nonetheless, as indicated by our findings on type inference, where accuracy is paramount, LLMs can be used effectively, especially with fine-tuning.

THREATS TO VALIDITY
We list limitations and threats to the validity of our study as follows: (1) We only analyzed the source code of the main program, excluding the code of imported modules from the prompt. This decision was due to the complexities of constructing a prompt that accounts for the diverse variations of import statements. This particularly affects programs in the "imports" category of the TypeEvalPy, HeaderGen, and PyCG benchmarks. Despite this, the affected portion is relatively small (5.6% of the total facts), so the overall results are only marginally altered. For a more comprehensive analysis, future work should include imported files. (2) We used the same prompt for all models, which may not have extracted the best possible performance from each. (3) Open-source models often deviate from the required output formats. We addressed this by manually identifying response patterns and adding a preprocessing step for format standardization. However, this does not cover all possibilities. This issue further highlights the LLMs' inability to consistently produce structured data.

CONCLUSION
In this paper, we used micro-benchmarks to evaluate the application of LLMs in static analysis tasks on Python programs. Our findings reveal that LLMs, including OpenAI's GPT-3.5 Turbo, GPT-4, and open-source models like LLaMA and CodeLlama, demonstrate promising capabilities in type inference, often surpassing traditional static analyses. GPT-4 stood out as the most effective model without fine-tuning, while fine-tuning GPT-3.5 Turbo yielded significant improvements. However, in the area of callgraph analysis, traditional methods still outperform LLMs, indicating a need for more focused fine-tuning and task-specific model adaptation.
Notably, these advancements come with substantial computational and monetary requirements. To reduce LLM size and enhance inference speeds, future research should explore model compression techniques, such as quantization [23]. Further avenues of research include applying explainability methods to understand the challenges faced by LLMs in static analysis, expanding the scope to cover various static analysis tasks and programming languages, and evaluating the performance of fine-tuned open-source models. These efforts aim to optimize LLMs for broader utility and efficiency in various static analysis tasks.

Table 2 :
Exact-match comparison of LLMs in type inference. FRT: function return type, FPT: function parameter type, LVT: local variable type.

Table 1 :
Comparative analysis across LLMs for callgraph analysis on PyCG and HeaderGen