Vulnerability Root Cause Function Locating For Java Vulnerabilities

Software Composition Analysis has emerged as an essential solution for mitigating vulnerabilities in the dependencies of software projects. Reachability analysis is increasingly used to streamline vulnerability remediation by prioritizing reachable vulnerabilities, which requires the code-level root cause of each vulnerability. However, pinpointing the root cause that leads to exploitation is laborious and resource-intensive because it depends on manual review by specialists. To this end, we introduce Root Cause Function Finder (RCFer), a solution that automatically identifies root cause functions through semantic analysis of enriched vulnerability descriptions and source code. The top-10 results successfully pinpoint root cause functions for 73.81% of the assessed vulnerabilities.


INTRODUCTION
Software Composition Analysis (SCA) has become indispensable for safeguarding the security of dependencies in contemporary software projects. Reporting vulnerabilities based solely on dependency lists can produce numerous false alerts for disclosed vulnerabilities [14,19,20,25,28,38]. Some vulnerabilities cannot be triggered from the users' code because the affected functionality is not used, especially in transitive dependencies [26,39,43]. Hence, there is a growing trend toward prioritizing reachable vulnerabilities [44,45] based on code analysis for effective remediation, rather than relying only on the Bill of Materials [3,46].
Conducting reachability analysis for vulnerabilities requires pinpointing the root cause function (RCF) that could lead to exploitation. Unfortunately, popular databases such as NVD [7] and Google OSV [9] do not offer this vital information. Although the RCF could be deduced from patches, patches are available for only around 66% of vulnerabilities [35]. Worse, even when a patch is present, its location is not necessarily an RCF location, according to a study [21]. Hence, locating the RCF for reachability analysis without the aid of patches is a non-trivial task. Locating the RCF in source code differs significantly from bug or fault localization, primarily because it relies not on structured bug reports and stack traces but on free-text descriptions. This distinction is crucial because many CVE descriptions lack critical aspects, including root cause, attack vector, and impact, as highlighted in a study [17]. This lack of detail inherently complicates locating the RCF from CVE descriptions.
Because descriptions are free text and may lack some aspects, this task usually involves manually examining the descriptions, and the patches if available, to identify the location of the root cause as the RCF [10]. Given the labor-intensive nature of this manual procedure, we developed RCFer, an automated solution for Java that enriches CVE descriptions and matches the semantics derived from the source code against those of the vulnerability description.

BACKGROUND AND RELATED WORK
Given that Java is an object-oriented language, an RCF refers to a method signature together with its class and package. Numerous studies aim to identify vulnerability-related information, such as affected libraries [13,18], Common Weakness Enumeration (CWE) [15,42], Common Platform Enumeration [33,36], aspects [17,34], and patches [35,40]. For the similar task of bug localization, many approaches [22-24, 27, 29, 31, 37, 41, 47] have been proposed. However, most of them rely on stack traces and bug reports simultaneously. Only Blizzard [29] can work solely on bug reports. Thus, we included Blizzard in the comparison. To the best of our knowledge, no existing work focuses on automatically locating the RCF for CVEs based on descriptions without patches.

APPROACH
As shown in Figure 1, RCFer processes both the enriched vulnerability description and its associated source code repository to locate the method signatures of the RCF within the repository. In the Maven ecosystem [6] for Java, a source code repository can contain multiple artifacts. Each artifact, identified by its group and artifact, may have many class files that could contain RCF methods. Thus, an RCF method is identified by a coordinate (repository, group:artifact, package, class, signature). Specifically, RCFer extracts and summarizes the semantics of classes and methods from the source code and matches them with the vulnerability descriptions to deduce which methods the descriptions reference, as follows:
• Aspect Breakdown: According to Guo et al. [17]
lemmatization [8], a camel splitter, and the WordNet tokenizer [11] to preprocess the words. Then, a sentence-transformer [30] is used to vectorize the tokens for computing the cosine similarity between the code summaries of class-method pairs and the enriched descriptions. RCFer returns the top-10 methods with their corresponding coordinates as the output.
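The two similarity computations named in the approach, Jaccard similarity for Artifact Locating and cosine similarity for Semantics Matching, can be illustrated with a minimal Python sketch. It is not RCFer's implementation. A plain bag-of-words vector stands in for the sentence-transformer embeddings, lemmatization and WordNet handling are omitted, and every repository, artifact, class, and method name in the example data is hypothetical.

```python
import math
import re
from collections import Counter
from typing import NamedTuple


class Coordinate(NamedTuple):
    """(repository, group:artifact, package, class, signature)."""
    repository: str
    artifact: str
    package: str
    class_name: str
    signature: str


def tokenize(text: str) -> list[str]:
    """Split lower-to-upper camel-case boundaries, then lowercase the tokens."""
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", text)
    return re.findall(r"[a-z0-9]+", spaced.lower())


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of two token sets: |A & B| / |A | B|."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse term-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_methods(description: str, summaries: dict[Coordinate, str], k: int = 10):
    """Return the k coordinates whose summaries are most similar to the description."""
    query = Counter(tokenize(description))
    scored = [(coord, cosine(query, Counter(tokenize(text))))
              for coord, text in summaries.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical data, for illustration only.
REPO, ARTIFACT, PACKAGE = "example/xml-tools", "org.example:xml-core", "org.example.xml"
summaries = {
    Coordinate(REPO, ARTIFACT, PACKAGE, "XmlReader", "parse(java.io.InputStream)"):
        "parse xml input stream and resolve external entities",
    Coordinate(REPO, ARTIFACT, PACKAGE, "StringUtil", "padLeft(java.lang.String,int)"):
        "pad a string on the left to a fixed width",
}
description = "XML external entity injection when parsing untrusted XML input"
top = rank_methods(description, summaries, k=1)
print(top[0][0].class_name, round(top[0][1], 3))
print(jaccard("XML Tools", "xml-tools"))
```

Because the sketch compares surface tokens only, "parsing" and "parse" do not match. This is the kind of inflection mismatch that the lemmatization and embedding steps described above are meant to absorb.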

EVALUATION AND RESULTS
We collected 1100 Java CVEs from NVD and manually labeled their RCFs. The number of CVEs with successfully identified vulnerable methods is listed in Table 1. Note that the third to fifth columns denote the CVEs that remain correct after each step. For RCFer, 178 CVEs were falsely excluded after Pre-filter due to the lack of Root Cause and Attack Vector. For example, CVE-2023-35839 [4] only mentions that Solon allows Deserialization of Untrusted Data, and the enriched description was still ambiguous. After Artifact Locating, only 27 CVEs were falsely excluded due to insufficient descriptions from the Maven repository. After Semantics Matching, 113 vulnerabilities were ranked low in semantic similarity because many non-vulnerable classes with similar descriptions acted as noise. The RCFs of the remaining 782 vulnerabilities were successfully located in the top-10 results. Furthermore, after RCFer sorts the Java files, the files containing the real RCF are in the top 0.6% on average.
The Baseline skips the Description Enrichment step and keeps the rest the same as RCFer in order to verify the step's effectiveness. Its Top-10 result dropped by around 10%, which underscores the significance of the step.
Blizzard could find only 12.18% of the root cause methods. Because it relies heavily on Lucene [12] for file search, it matches keywords only and extracts no semantics from the source code files.
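The Top-10 figures compared in this section are hit rates over the labeled CVEs. The Python sketch below shows one way such a rate can be computed. It assumes that a CVE counts as located when at least one labeled RCF appears among its first k ranked methods, and that CVEs dropped by an earlier step count as misses. Both are assumptions of this sketch, and the CVE and method identifiers are placeholders.

```python
def top_k_hit_rate(ranked: dict[str, list[str]],
                   truth: dict[str, set[str]],
                   k: int = 10) -> float:
    """Fraction of labeled CVEs with at least one labeled RCF in the top k."""
    if not truth:
        return 0.0
    hits = 0
    for cve, rcf_methods in truth.items():
        # A CVE excluded by an earlier step has no ranking and counts as a miss.
        candidates = ranked.get(cve, [])[:k]
        if rcf_methods & set(candidates):
            hits += 1
    return hits / len(truth)


# Placeholder identifiers, for illustration only.
ranked = {
    "CVE-A": ["m1", "m2"],
    "CVE-B": ["m3"],
    "CVE-C": ["m9", "m4"],
}
truth = {
    "CVE-A": {"m2"},
    "CVE-B": {"m7"},
    "CVE-C": {"m4"},
    "CVE-D": {"m5"},  # never ranked, so it counts as a miss
}
print(top_k_hit_rate(ranked, truth, k=10))  # 0.5
print(top_k_hit_rate(ranked, truth, k=1))   # 0.0
```

Keeping the unranked CVEs in the denominator matters here, because the reported counts include CVEs that were falsely excluded before the ranking step.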

CONCLUSION AND CONTRIBUTIONS
We proposed RCFer, the first solution to locate the RCF for Java vulnerabilities without patches for SCA reachability analysis. In the evaluation, RCFer located the RCF for 73.81% of the vulnerabilities.
The design of RCFer could be extended to other languages by adapting the function summaries to those languages. In the future, we plan to optimize Description Enrichment and Semantics Matching with advanced natural language processing models to refine the semantics for better results.

Figure 1: Overview of RCFer

and 85% of CVEs had no root cause. To verify the description's completeness, RCFer first parses CVE descriptions to derive Part-of-Speech tags and excludes evidently wrong aspects. Then, based on sentence patterns such as [Vulnerability Type] in [Component] in [Affected Product] allows [Attacker Type] to [Impact] via [Attack Vector], RCFer extracts the aspects from the descriptions. RCFer then enriches the descriptions with additional information in the next step.
• Description Enrichment: The descriptions need to be enriched with supplementary information when the root cause and attack vector are absent. RCFer uses CWE descriptions directly, and descriptions of similar CVEs through Pseudo Relevance Feedback, to enrich the incomplete target description. First, RCFer concatenates the CWE descriptions with the original description. Second, based on the existing aspects, RCFer cross-references the missing aspects with other CVEs returned by an initial retrieval. If other CVEs with rich aspects exist, RCFer concatenates the additional aspects to the description. Finally, the affected product is extracted to narrow down the range of source code repositories.
• Artifact Locating: As CVEs may not be mapped to source code repositories, the affected product is used to locate potential repositories by calculating the Jaccard Similarity [1] between the affected product and the repository name. To identify the artifact, RCFer conducts Semantics Matching between the Maven artifact description and the enriched description of the CVE (elaborated later). Finally, the repository and artifact coordinate with the highest average of the two similarity scores is selected to pinpoint the RCF in the next step.
• Pre-filter: Because semantic analysis requires substantial computational resources, RCFer excludes unrelated Java files by TF-IDF [2] similarity matching. RCFer first tokenizes each Java file and the associated enriched descriptions, splitting camel case words into separate tokens. Following this, RCFer keeps the upper segment of the Java files, sorted by descending cosine similarity and split at the elbow point found with the L-Method [32].
• Code Summarization: Given the filtered files, RCFer extracts the classes and methods with a parser [5] and filters out interfaces and abstract classes, as they have no implementation and cannot be exploited. RCFer employs the CodeTrans model [16] to summarize the functionality of each class and method.
• Semantics Matching: This step calculates the semantic similarity between two pieces of text. For a robust alignment of code summaries and natural language descriptions, it is critical that the extraction and comparison of semantics remain unaffected by factors such as word order, inflections, verb tenses, camel casing, and synonymic variations. RCFer employs NLTK

Table 1: Count of CVEs with Correctly Located RCF