Understanding and Finding Java Decompiler Bugs

Java decompilers are programs that perform the reverse process of Java compilers, i.e., they translate Java bytecode to Java source code. They are essential for reverse engineering and have become more sophisticated and reliable over the years. However, it remains challenging for modern Java decompilers to reliably and correctly decompile real-world programs. To shed light on the key challenges of Java decompilation, this paper provides the first systematic study of the characteristics and causes of bugs in mature, widely used Java decompilers. We conduct the study by investigating 333 unique bugs from three popular Java decompilers. Our key findings and observations include: (1) although most of the reported bugs were found when decompiling large, real-world code, 40.2% of them have small test cases for bug reproduction; (2) over 80% of the bugs manifest as exceptions, syntactic errors, or semantic errors, and bugs reported with source code artifacts are very likely semantic errors; (3) 57.7%, 39.0%, and 41.1% of the bugs are attributed, respectively, to three stages of decompilers: loading structure entities from bytecode, optimizing these entities, and generating source code from these entities; (4) bugs in decompilers' type inference are the most complex to fix; and (5) region restoration for structures like loops, sugaring for special structures like switch, and type inference for variables of generic or indistinguishable types are the three most significant challenges in Java decompilation, which to some extent explains our findings in (3) and (4). Based on these findings, we present JD-Tester, a differential testing framework for Java decompilers, and report our experience of using it to test the three popular Java decompilers. JD-Tester utilizes different Java program generators to construct executable Java tests and, through compilation and execution, finds exceptions as well as syntactic and semantic inconsistencies (i.e., bugs) between a generated test and its compiled-decompiled version. In total, we have found 62 bugs in the three decompilers, demonstrating both the effectiveness of JD-Tester and the importance of testing and validating Java decompilers.


INTRODUCTION
Java decompilers convert Java bytecode into high-level Java source code and are important in reverse engineering; they are typically used to help developers comprehend and reuse third-party code such as Java/Android libraries. Java decompilers have become more reliable over the years and are increasingly used in program analysis, such as detecting malware [Alzahrani et al. 2019; Chen et al. 2018, 2019a; Enck et al. 2011; Li et al. 2017; Lu et al. 2019; Martín et al. 2017; Mathis et al. 2017; Moiz and Alal 2020] and constructing ground truth [Agrawal and Trivedi 2020; Luo et al. 2022].
Undoubtedly, whether for improving software comprehension, facilitating code reuse, or supporting program analysis, it is critical to ensure the correctness of Java decompilers, i.e., the decompiled code should be syntactically correct and semantically equivalent to the source code. Several efforts have analyzed the decompilation quality of current Java decompilers by applying them to code samples from previous work [Naeem et al. 2007], off-the-shelf Java projects [Harrand et al. 2020], or Android apps [Mauthe et al. 2021], with different metrics for measuring decompilation results. It is unsurprising that all these studies show Java decompilers are still far from generating syntactically and semantically correct code. However, they focus only on the manifestations of decompilation bugs; no extant research has examined the underlying causes of decompilation bugs and how they contribute to decompilation failures.
To bridge this research gap, we have undertaken the first in-depth empirical study to understand the characteristics and causes of real-world decompilation bugs, from the perspectives of both decompiler development and testing. Our study centers on a sample of 333 unique bugs, selected from three popular Java decompilers. First, we investigate three characteristics of the bugs from the perspective of testing:
• Provided artifacts: the artifacts provided in the bug-reporting issues to help developers reproduce the decompilation bugs. In analyzing the provided artifacts, we find that, besides decompiled code, source code and executable Java programs are the two most frequently provided artifacts. Although most of the bugs were found when decompiling real-world programs, 40.2% of them contain small test examples to help reproduce the bugs.
• Error symptoms: indicators of failures in the decompilation results. Here, we find that crashes with exceptions, incorrect syntax, and inconsistent semantics are the three most common types of error symptoms: over 80% of the bugs manifest as one of these three. We also find that, for a bug in a Java decompiler, there is a strong correlation between having source code in the bug-reporting issue and manifesting as a semantic error. We hypothesize that reporters who find more semantic errors are likely to perform more detailed, accurate semantic comparisons between the source code and the decompiled code.
• Fixing duration: the time developers spent addressing a bug. Here, we find that developers' responses to bugs differ across the studied decompilers. For CFR and Jadx, developers usually act on reported bugs promptly; over 80% were confirmed and fixed within 15 days. In contrast, it took the developers of FernFlower 6–8x longer to fix the reported bugs.
Next, to understand the causes of decompilation bugs, we investigate two features of the bugs in terms of their commits:
• Buggy files: files modified in a commit. Here, we find that bugs in Java decompilation mainly concentrate in three decompilation stages: (1) entity&instantiation (i.e., loading structure entities from bytecode), (2) optimization (i.e., refactoring entities for more readable code generation), and (3) generation (i.e., generating source code from entities). 57.7%, 39.0%, and 41.1% of the bugs have buggy files related to these three stages, respectively.
• Modified LOCs: lines of modified code (LOC) in a commit. One interesting observation is that the bugs whose fix commits have the most modified LOCs are related to type inference, rather than to the three aforementioned stages. The reason is that many type inference bugs are due to incomplete type inference support, so commits for these bugs usually contain significantly more LOCs to improve the type inference systems. Moreover, we provide several representative cases to illustrate the main challenges in three detailed stages of Java decompilation, namely sugaring (a sub-stage of optimization), region restoration (a sub-stage of entity&instantiation), and type inference. These cases offer another view of how decompilation bugs are introduced by these challenges. For instance, bugs in region restoration play a key role in entity&instantiation and contribute 42.6% of the entity&instantiation bugs. Similarly, sugaring bugs account for 40.0% of the optimization bugs. Both kinds of bugs are also generally more complex to fix than bugs in their sibling stages.
These results and findings guide our design and realization of JD-Tester, a differential testing framework for Java decompilers. JD-Tester utilizes Java test generators that produce executable Java tests to target exceptions as well as syntactic and semantic errors in decompilers. By revealing these three kinds of errors, JD-Tester helps Java decompilers avoid exceptions, enhance code clarity, and reduce ambiguity in decompiled code. As a consequence, JD-Tester facilitates human comprehension and promotes the reusability of the decompiled code. We evaluate JD-Tester on the three studied Java decompilers with 580 Java tests generated by two different generators. JD-Tester found decompilation failures on 52.7%, 34.3%, and 65.8% of these tests for the three decompilers, respectively. We analyzed these failures and identified 62 unique bugs, 15 of which manifest as semantic errors.
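The differential-testing oracle behind this kind of framework can be sketched as a small classifier over the outcomes of the compile–decompile–recompile–run pipeline. This is a minimal illustration of the three symptom classes, not JD-Tester's actual API; all names here are hypothetical:

```java
// Illustrative sketch of a differential-testing oracle for decompilers.
// Given the outcome of decompiling a compiled test, recompiling the
// decompiled code, and running both versions, classify any discrepancy
// into the three symptom classes used in this study.
public class DecompilerOracle {
    public enum Verdict { PASS, EXCEPTION, SYNTACTIC_ERROR, SEMANTIC_ERROR }

    public static Verdict classify(boolean decompilerCrashed,
                                   boolean recompiles,
                                   String originalOutput,
                                   String decompiledOutput) {
        if (decompilerCrashed) return Verdict.EXCEPTION;     // decompiler threw
        if (!recompiles) return Verdict.SYNTACTIC_ERROR;     // javac rejects the output
        if (!originalOutput.equals(decompiledOutput))
            return Verdict.SEMANTIC_ERROR;                   // behavior diverges
        return Verdict.PASS;
    }

    public static void main(String[] args) {
        // The generated test printed "42", the decompiled version printed "41".
        System.out.println(classify(false, true, "42", "41")); // SEMANTIC_ERROR
    }
}
```

The key design point is that each verdict needs a different observation: exceptions are caught from the decompiler itself, syntactic errors from the recompilation step, and semantic errors only from executing both versions and comparing their outputs.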
The rest of the paper is organized as follows. Section 2 introduces the methodology of our study. Section 3 presents the general statistics of the 333 unique bugs. Next, Section 4 and Section 6 detail the characteristics and causes of the 333 bugs in terms of their bug-reporting issues and bug-fixing commits, respectively. Section 5 describes and illustrates three types of decompilation bugs in Java decompilers with cases. Section 7 presents the JD-Tester framework and our experience of using it to test the three popular Java decompilers. Finally, we discuss related work (Section 8) and conclude (Section 9).

METHODOLOGY
This section introduces the Java decompilers used in our study and our steps to collect bugs. Possible threats to validity are also discussed.

Source of Bugs
In collecting representative bugs from Java/Android decompiler projects, we follow the practice of a previous empirical study [Zhang et al. 2018]. Initially, we conducted a search on GitHub using the keyword "Android/Java decompilers" and identified ten decompiler projects, each with over 500 stars indicating its popularity. To ensure an adequate number of bug samples for analysis, we only considered projects with more than 200 commits. We further restricted the selection to projects primarily coded in Java, mitigating the impact of language discrepancies when measuring bug complexity via modified LOCs. Eventually, we chose three projects, namely Jadx, FernFlower, and CFR, whose details are listed in Table 1. Table 1 includes the selected versions (Version) of the three decompilers, their stars (#Star), their size in lines of code (LOC), and the counts of their commits (#Commit) and issues (#Issue) selected as study subjects. Additionally, we provide the time period (Duration) from which we selected these commits and issues for analysis.
Next, we crawled issues of the three projects from their inception to March 2022. For CFR and Jadx, we obtained 248 and 1,111 issues from their issue trackers on GitHub. For FernFlower, which has its own issue tracker, we searched for issues with the keywords "Java Decompiler" and "FernFlower" using the embedded search engine, and obtained a total of 352 issues. Among these issues, 306 were associated with "Java Decompiler", 143 with "FernFlower", and 97 contained both keywords. Afterward, we selected only bug-reporting issues that had been fixed, collecting closed issues with tags such as "bug", "crash", or "exception". This resulted in 111, 171, and 223 bug-reporting issues in the three projects, respectively. Finally, we checked the associated bug-fixing commits of these bug-reporting issues in two different ways. Developers of CFR and Jadx tend to attach the bug-fixing commits to the bug-reporting issues once they fix the bugs; so for these two projects, we manually checked the attached commits in the issue discussions. Thus, in CFR and Jadx, we harvested 92 and 217 bug-fixing commits associated with the 111 and 223 bug-reporting issues. In contrast, FernFlower's developers tend to attach the IDs of the associated bug-reporting issues in the messages of bug-fixing commits; so for FernFlower, we checked their commit messages and harvested 90 bug-fixing commits associated with the 171 bug-reporting issues. In the end, our dataset contains 505 (111+171+223) bug-reporting issues and 399 (92+90+217) corresponding bug-fixing commits. This is a sizable, substantial dataset, as performing deep, manual inspection of these bug-reporting issues and bug-fixing commits is difficult and time-consuming, and requires solid knowledge of Java decompilers and how each part of these decompilers works. In the subsequent discussion, we use "issues" to denote bug-reporting issues and "commits" to denote bug-fixing commits for brevity.

Threats to Validity
Potential threats to internal validity mainly lie in our manual inspection of the bugs, such as the classifications of the bugs in terms of their error symptoms (Section 4.2) and decompilation stages (Section 6.1). To mitigate these threats, we provide explicit definitions and detailed steps for these classifications in this paper, enabling others to reproduce our manual inspections. Moreover, the results of our manual inspection of these bugs were carefully double-checked by both the paper's first and second authors, who are respectively a Ph.D. student and a master's student majoring in software engineering, both of whom have taken a course on principles of compilation. Consensus between the two students was required for the manual inspection, and any disparities were resolved with the involvement of a third Ph.D. student. Potential threats to external validity may lie in the representativeness of our datasets, including the chosen decompiler projects and the bugs used to conduct our study. For the decompiler projects, we selected CFR, FernFlower, and Jadx out of the 10 most popular Java decompilers on GitHub. All three projects have over 1,000 stars, showing their maturity and popularity. Moreover, FernFlower is the embedded Java decompiler of the well-known IntelliJ IDEA IDE, further underscoring its relevance. Our study involves a total of 505 real-world bug-reporting issues and 399 bug-fixing commits in these projects. These few hundred issues and commits, although not great in number, are difficult to expand significantly because our study concerns the detailed analysis of the causes of their bugs, thus necessitating substantial manual labor. We emphasize that it took the first and second authors of this paper approximately 200 hours to analyze the 505 issues and 399 commits, and an additional 150 hours to investigate the source code of the three projects.

STATISTICAL ANALYSES
In this section, we give the statistics of the 333 unique bugs from the aspects of both issues and commits.

Issues of Unique Bugs

Of the 505 collected bug-reporting issues, 53 are invalid and the remaining 452 are valid. Among the 53 invalid issues, 19 are invalid due to insufficient information provided by the issuers to locate the bugs (NoInfo), 16 are invalid because they were identified as false positives by developers (FP), and the other 18 are invalid because developers did not intend to fix them despite confirming the error symptoms (Won'tF). An intriguing finding from the analysis of these Won'tF issues is that 7 out of the 18 closely resembled feature requests aimed at enhancing comprehension. For instance, in an issue of FernFlower1, the issuer advocated for developers to restore constants into defined constant variables instead of literals in the decompiled code. However, the challenge arises from the typical loss of information associated with these constant variables during compilation, rendering such restoration almost impossible without heuristics. Consequently, developers chose not to act on this issue. These concerns highlight the desire among users of Java decompilers for decompiled code to be presented in a more comprehensible manner, in addition to being semantically equivalent and syntactically correct. While these issues fall outside the primary focus of our work, they could indicate potential areas for future work. Among the 452 valid issues, 114 are tagged as duplicates of 36 unique bugs (Duplicate; X→Y means X issues report Y unique bugs), 10 issues report multiple bugs (MBugs), and 328 issues are one-to-one mapped to 328 unique bugs (1-1). Notably, a higher number of duplicate issues were observed in FernFlower (78 out of the 114 duplicate issues) compared to CFR and Jadx. This is reasonable since FernFlower is now embedded in IntelliJ IDEA [JetBrains 2023], one of the most popular IDEs for Java, and is widely used along with IntelliJ IDEA. After the identification of the 328 1-1 issues and their corresponding unique bugs, the consolidation of the 114 duplicate issues into 36 unique bugs, and the separation of the 10 issues reporting multiple bugs into 21 unique bugs, a cumulative total of 385 (328+36+21) unique bugs were taken for further examination in terms of their commits.

Commits of Unique Bugs
Moving on now to consider the commits of the bugs, it is noteworthy that out of the total 385 bugs, 39 (comprising 5, 24, and 10 bugs from the three projects) are unsuitable for analysis. This determination stemmed from our inability to locate the corresponding commits for these bugs, though developers claimed that the bugs had already been fixed. The remaining 346 (385-39) issues have 399 corresponding commits (#Commit_T), as shown in Table 2. We excluded test cases when collecting buggy files and modified LOCs in our statistics, because modifications to test cases are not related to the functionalities of Java decompilers. This decision excluded 4 issues whose commits only modified test case files (Test). Furthermore, a bug associated with an "outlier" commit (Outlier) was excluded from the analysis. The reason for the exclusion is that this commit refactored the type inference system by modifying 55 files and 3,185 LOCs, making it an outlier in terms of its impact on the statistical analysis. We categorize the 394 valid commits and the corresponding 341 unique bugs (385-39-5) into three distinct groups: (1) 1-1, denoting that 303 commits correspond individually to 303 unique bugs; (2) MFixes, where 88 commits were submitted to fix 30 unique bugs (X→Y means X commits are used to fix Y unique bugs); and (3) MBugs, where 3 single commits were used to fix 8 bugs. In addition, among the 30 bugs fixed using multiple commits, the number of commits allocated to each unique bug ranges from 2 to 4.
Only two exceptions were observed: a bug in Jadx fixed with 5 commits and a bug in FernFlower fixed with 13 commits. It is noteworthy that both bugs are related to type inference, which is a complex and scattered functionality in the decompilation process, particularly in FernFlower. This complexity necessitates modifying multiple files across different modules to fix a type inference bug. In further analyses of buggy files and modified LOCs, for each bug fixed with multiple commits (i.e., MFixes), we used the sum over all its commits. However, bugs fixed in the same commits, i.e., the 8 bugs of MBugs, were excluded due to the difficulty of distinguishing the modifications for each issue in these commits. A total of 3 bugs in CFR, 2 in FernFlower, and 3 in Jadx were affected by this decision. Eventually, a total of 333 (341-8) unique bugs were selected as the main subjects of our study.

CHARACTERISTICS OF UNIQUE BUGS
This section presents our analyses of the 333 bugs from the aspect of their reporting issues.

Provided Artifacts
Provided artifacts are the artifacts issue reporters provide in the issues for locating or reproducing the bugs. Besides the essential reports demonstrating the error symptoms, which can be found in almost every issue, artifacts can be: executable files (Executable) in formats such as jar/class/apk/dex, and program snippets in the form of bytecode (Code_b), source code (Code_s), and decompiled code (Code_d). Some professional reporters even provide information about the Positions and possible Fixes of the bugs. Table 3 lists the top 5 most frequently provided artifacts (Single Artifacts) in the three projects. Since reporters often provide multiple artifacts in their issues, Table 3 also lists the top 3 most frequently used artifact combinations (Artifact Combinations).
Table 3 indicates that the three most frequently provided artifacts across all three projects are decompiled code, source code, and executable files, with a usage frequency ten times higher than that of the other three artifacts. Furthermore, more than 80% of the issues offer at least one of these three types of artifacts. However, there are significant differences in the usage frequency of the top three artifacts among the three projects, notably between Jadx and the other two. Specifically, reporters of issues in Jadx prefer decompiled code and executable files, while reporters of FernFlower and CFR prefer decompiled code and source code. This trend is also reflected in the statistics on artifact combinations: the combination of executable files and decompiled code is the most frequent in Jadx, while the combination of source code and decompiled code is the most common in CFR and FernFlower.
To better understand this phenomenon, we conducted further analyses of the sources of these provided artifacts, i.e., from which kinds of programs the code snippets come. From the 276 issues providing executable files, decompiled code, or source code in the three projects (77 from CFR, 49 from FernFlower, 150 from Jadx), we observe two primary sources: real-world programs and small test examples. We can recognize code snippets from real-world programs by the special meanings in the names of variables, classes, packages, etc. In contrast, for code snippets from small test examples, keywords like "test", "xyz", "abc", numeric IDs, and other meaningless words describing the decompilation errors in the names of variables, classes, packages, etc., are strong indicators.
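The naming heuristic above can be sketched as a small classifier over the identifiers in a snippet. The exact word list and rules here are an illustrative assumption, not the paper's actual tooling (the study applied this judgment manually):

```java
import java.util.List;
import java.util.Locale;

// Rough sketch of the heuristic for telling small test examples from
// real-world code: identifiers built from placeholder words ("test",
// "xyz", "abc") or ending in bare numeric IDs suggest a hand-made test
// case. The word list and regex are illustrative assumptions.
public class ArtifactSource {
    private static final List<String> PLACEHOLDERS = List.of("test", "xyz", "abc");

    static boolean looksLikeTestExample(List<String> identifiers) {
        for (String id : identifiers) {
            String lower = id.toLowerCase(Locale.ROOT);
            boolean placeholder = PLACEHOLDERS.stream().anyMatch(lower::contains);
            boolean numericId = lower.matches(".*\\d+$");  // e.g. "Class47"
            if (placeholder || numericId) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeTestExample(List.of("TestClass1")));      // true
        System.out.println(looksLikeTestExample(List.of("HttpRequestParser"))); // false
    }
}
```

As with any naming heuristic, this only approximates the manual judgment: real-world identifiers can contain placeholder-like substrings, which is why the study's classification was double-checked by two authors.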
In Jadx, among the 150 issues providing executable files, decompiled code, or source code, 129 (86.0%) have artifacts from real-world programs and only 13 (8.7%) have artifacts from small test examples. However, the picture is clearly different in CFR and FernFlower: among the 77 and 49 respective issues, 5 (6.5%) and 7 (14.3%) have artifacts from real-world programs, while 66 (85.7%) and 35 (71.4%) have artifacts from small test examples. In summary, among the 276 issues, 140 (50.9%) have artifacts from real-world programs, and it is highly likely that these issues were found when decompiling real-world programs with the three decompilers. The artifacts of another 111 (40.2%) issues are from test examples. However, these test examples could have been created in many ways. For instance, they may have been directly created by reporters from tricky code patterns, or simplified from code snippets of real-world programs. The only evidence we find is from Marcono, the leading issue reporter in CFR, who has contributed almost one-third of the issues with artifacts from small test examples. Marcono claims that he finds most of the issues when decompiling the popular Minecraft project [Marcono1234 2019]. This claim, along with the fact that over half of the issues use artifacts from real-world programs, indicates that most bugs are still found by decompiling real-world Java programs. Another finding worth noting is that, for the 111 issues with artifacts from test examples, their code size ranges from 3 to 30 LOCs with an average of 10 LOCs, indicating that, though small in size, test examples can also reveal interesting, relevant bugs in Java decompilers.

Error Symptoms
Error symptoms are the outward manifestations of decompilation bugs, which indicate to users that there are failures in the decompilation process. We collect this information from issue reporters' detailed descriptions of these bugs. Table 4 shows six kinds of error symptoms and the distribution of the 333 bugs across them. We summarize three common kinds of error symptoms following previous work [Naeem et al. 2007] based on software metrics [Halstead 1977]: (1) Exceptions, i.e., the decompiler fails with exceptions upon execution; (2) syntactic errors (Syntactic), i.e., the decompiler produces a syntactically incorrect program; and (3) semantic errors (Semantic), i.e., the decompiler produces a syntactically correct but semantically incorrect program. We add three additional kinds of error symptoms: (4) configuration errors (Configuration), i.e., the configurations or options set for the decompiler do not function as intended; (5) GUI errors (GUI), i.e., the decompiler does not show the produced program correctly in its GUI; and (6) performance errors (Performance), i.e., the decompiler fails to produce the program within the expected time/memory. Syntactic errors, semantic errors, and exceptions are the top 3 most common error symptoms, accounting for about 80% of all 333 bugs. Furthermore, in Table 4 we can observe a gap between syntactic and semantic errors in Jadx versus the other two projects, especially CFR: many more issues manifest as semantic errors in CFR, while many more issues manifest as syntactic errors and exceptions in Jadx. Given the difference in the provided artifacts of the three projects, we perform further analyses of the correlation between the different kinds of error symptoms and the provided artifacts. In particular, we find that providing source code as an artifact and manifesting as a semantic error are strongly associated. Table 5 shows the distributions of bugs that provide source code (Source) or no source code (¬Source) as their artifacts and manifest as semantic errors (Semantic) or not (¬Semantic), in the three projects respectively (CFR, FernFlower, Jadx) and in total (Sum). We use the simple matching coefficient (SMC) to evaluate the association between providing source code and manifesting as semantic errors. The SMC correlations between source code and semantic errors are 0.60, 0.66, and 0.83 in the three projects, with an overall correlation of 0.74. This finding indicates a strong connection between providing source code as an artifact and manifesting as a semantic error in the issues. In particular, providing source code enables issue reporters to compare the decompilation results with the original source code. This finding is in line with the cognitive process: without source code for comparison, it is challenging for reporters to detect semantic errors in the decompiled code. Consequently, we conjecture that reporters of Jadx are more likely to overlook bugs that manifest as semantic errors, as they tend to decompile real-world programs where the source code is usually not available for comparison. This conjecture may account for the fewer instances of semantic errors identified among the 185 bugs in Jadx.
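For reference, the simple matching coefficient over a 2x2 contingency table counts both kinds of agreement: issues with source code and a semantic error, and issues with neither. A minimal sketch (the counts in main are illustrative, not Table 5's actual values):

```java
// Simple matching coefficient (SMC) over a 2x2 contingency table:
//   a = issues with source code AND a semantic error (both present)
//   d = issues with neither (both absent)
//   b, c = the two mismatched cells
// SMC = (a + d) / (a + b + c + d), i.e., the fraction of agreeing issues.
public class Smc {
    public static double smc(int a, int b, int c, int d) {
        return (double) (a + d) / (a + b + c + d);
    }

    public static void main(String[] args) {
        // Illustrative counts only, not the paper's data.
        System.out.println(smc(30, 5, 10, 55)); // 0.85
    }
}
```

Unlike measures that only reward co-occurrence, SMC also credits the ¬Source/¬Semantic cell, which is appropriate here because "no source code and no semantic error" is itself evidence of the association.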

Fixing Duration
Fixing duration is the time interval between the creation of an issue and the moment developers attach all the commits. Figure 1 shows the distribution of the fixing duration of the 333 unique bugs in the three projects. The data indicate that developers of CFR and Jadx respond quickly to these issues, with approximately 80% of the bugs being resolved within 15 days. The median fixing durations for these two projects are three and four days, respectively. In contrast, developers of FernFlower take significantly longer to fix the bugs, with a median fixing duration of approximately 28 days. This is roughly eight and six times longer than that of CFR and Jadx, respectively. For the statistical analysis, medians, not means, are preferred, as several bugs were deferred for extended periods before being resolved. For example, in FernFlower, 21 bugs, i.e., approximately one-third of all its bugs, had a fixing duration exceeding three months. For 16 of these bugs, FernFlower's developers gave a quick response within a week, though they did not give any explanation for the delayed fixes. There are also 17 bugs with a fixing duration over three months in Jadx. Among these 17 bugs, Jadx's developers mentioned that three were too difficult to fix. To our surprise, Jadx's developers did not fix six bugs, i.e., around one-third of the 17, for several months until reminded by issue reporters; however, they were able to address these six bugs within a few days after being reminded. Notably, among the three projects, only FernFlower has implemented a dedicated code-review policy, aligned with other IntelliJ IDEA projects. In detail, each issue in FernFlower necessitates a comprehensive workflow encompassing confirmation, fixing, and verification processes, typically assigned to specific developers. Therefore, a delay may arise whenever any designated developer is unavailable. This policy may contribute to the number of long-time-to-fix issues in FernFlower.

CASE STUDIES
Before analyzing the causes of the 333 bugs, we present some interesting, representative examples from three decompilation stages to provide a better understanding of Java decompilation bugs.

Sugaring
Many specific structures in Java programs, like Enum, switch, String concatenation, anonymous classes, lambda expressions, etc., are usually expanded with multiple additions when compiled into bytecode. These additions refer to the extra bytecode instructions included to implement the specific structures. Decompilers face the challenge of recognizing these additions and transforming the expanded bytecode back into clean and concise source code, a process known as "sugaring". A typical example is the lambda expression, which we name Case S in Figure 2a. In source code, each lambda expression can be flexibly formatted as param→body, which is compiled into an additional synthetic method implementing the body and an INVOKEDYNAMIC instruction creating the handle of this method in bytecode. Java decompilers are strongly expected to restore these lambda-related instructions into their original form. However, this task is notably challenging due to the number of sub-steps involved in the restoration, including mapping the INVOKEDYNAMIC instructions to the corresponding synthetic methods, passing the parameters used between them, recovering their types, etc. Decompiler developers often find it difficult to comprehensively address all these steps, and therefore bugs can easily be introduced2. Moreover, it becomes even worse when corner cases in the utilization of lambda expressions are encountered, such as out-of-package invocation of lambda expressions3, nested lambda expressions4, etc.
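The desugaring a decompiler must undo can be seen in a minimal lambda. The desugared stand-in below is written by hand to mirror what javac emits; its name and visibility are illustrative (the real synthetic method is private and generated by the compiler):

```java
import java.util.function.Function;

public class LambdaSugar {
    // Source-level lambda: javac compiles the body into a synthetic method
    // (conceptually like lambda$inc$0 below) and replaces the lambda itself
    // with an INVOKEDYNAMIC instruction that bootstraps a Function handle
    // via LambdaMetafactory.
    static Function<Integer, Integer> inc = x -> x + 1;

    // Hand-written stand-in for the synthetic method a decompiler must fold
    // back into "x -> x + 1"; the name mimics javac's naming scheme.
    static Integer lambda$inc$0(Integer x) {
        return x + 1;
    }

    public static void main(String[] args) {
        System.out.println(inc.apply(41));      // 42
        System.out.println(lambda$inc$0(41));   // 42: same body, desugared form
    }
}
```

A naive decompiler that misses the mapping between the INVOKEDYNAMIC site and the synthetic method would emit both the raw `lambda$inc$0` method and an unreadable call site, rather than reconstructing the original `x -> x + 1`.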

Region Restoration
Another difficulty in decompiling Java programs lies in region restoration. Conditional (if and switch) and loop structures (for, while, and do-while), try-catch, throw, break, continue, synchronized, and even composite expressions (we denote all of the above as "GOTO structures") are compiled into bytecode containing one or multiple jumping instructions, of which the GOTO instruction is typical. Region restoration is to properly restore these GOTO instructions (and other instructions) to the above structures. It is a heuristic, experience-based task that requires a comprehensive understanding of how Java compilers work. The restoration process can become even more challenging for combinations of these GOTO structures. Here we take a labelled break inside a switch as an example, which we name Case R in Figure 2b. CFR's developers assumed that the sole GOTO in case blocks was restricted to jumping only to the block immediately after the default block. This assumption, however, does not hold in Case R, where the break is decompiled into a GOTO instruction that jumps to two blocks after the default block, breaking the assumption and raising an exception. This example highlights the potential for error in region restoration, which is typically based on developers' experience. Various bugs in region restoration for combinations of switches and other structures have been reported, like switch&throw5, switch&assert6, switch&loop7, and switch with multi-exits8, not only in CFR but also in Jadx. These bugs lead to exceptions, syntactic errors, and semantic errors. In particular, region restoration bugs manifesting as semantic errors can be extremely dangerous, as they may go unnoticed and developers may easily misunderstand the code.
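A labelled break inside a switch, in the spirit of Case R, can be reconstructed as follows (this is an illustrative example, not the paper's exact Figure 2b code):

```java
// Illustrative reconstruction of a labelled break inside a switch.
// The "break outer" compiles to a GOTO that jumps past more than just the
// block immediately after default, which is exactly the kind of jump target
// that violated CFR's single-target assumption for case blocks.
public class LabelledBreak {
    static String classify(int n) {
        String result;
        outer:
        {
            switch (n) {
                case 0:
                    result = "zero";
                    break outer;            // GOTO out of the labelled block
                default:
                    result = "other";       // falls out of the switch normally
            }
            result = result + "!";          // skipped entirely when n == 0
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(classify(0)); // zero
        System.out.println(classify(1)); // other!
    }
}
```

Note that a plain `break` and `break outer` produce structurally similar GOTOs with different targets; a decompiler that restores the wrong one silently changes whether the trailing statement runs, which is how region restoration bugs become semantic errors.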
Another typical case where developers can inadvertently introduce bugs in region restoration is multi-entry loops. Multi-entry loops arise when a loop, usually nested, has multiple connected cycles, so decompilers may fail to identify the correct main entry of the loop. Although academic communities have proposed many proven solutions to this problem, developers are usually unwilling to apply them. A real-world example can be found in Jadx9, with a decompilation failure due to a multi-entry loop. The issue reporter provided a detailed explanation of the failure and introduced an effective academic solution [Yakdan et al. 2015]. Despite this, the developers of Jadx rejected this solution and opted to retain a pattern-matching engine, citing the rarity of multi-entry loops and the difficulty of implementing such solutions.

Type Inference
There are considerable issues reporting bugs in the type inference of the three projects, which can be generally classified into two kinds (denoted as Case T-p1 and Case T-p2). First, a prominent reason for the prevalence of type inference bugs, surprisingly, can be attributed to the incomplete development of the type inference system, particularly with respect to generic types. Unlike in other languages such as C++, in Java bytecode the type parameters of generic types are typically replaced with their bounds or Object. Consider Figure 3a as an illustrative instance. Within the source code, the type parameter T serves to represent the generic type in class Box. However, during compilation, this type parameter T is transformed into Object across all its usages in the field, method, and class declarations. Restoring this generic type in declarations is usually straightforward, as the type information in the declarations' signatures directly indicates that the Objects in the declarations should be restored to T.
The main challenge, however, arises from inferring the concrete type parameter substituted for T during the instantiation of Box (L1) in the method instantiate(). Explicit generic type information is unavailable for these instances, necessitating that decompilers infer the concrete type parameters through "guesswork" from diverse sources, such as method invocations (in Figure 3a, the return type Integer of the method invocation Integer.valueOf()), cast expressions, field accesses, etc. These sources are language-specific and often exhibit significant variation. Decompilers do not consistently capture all of the type information embedded in these sources: bugs may arise whenever certain kinds of type information are overlooked. For instance, a bug-reporting issue documented an oversight by the developers of Jadx10, wherein the type information associated with outer generic class declarations was neglected during the inference of the concrete type for its inner generic classes, such as "Map<K,V>" out of "Entry<K,V>". This oversight subsequently resulted in a compilation error. This bug was fixed just three years ago, in the 8th year since the initial release of Jadx. Fixing these bugs can be a complex undertaking, requiring developers to advance and augment their type inference systems.
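As a hedged illustration of the erasure problem (our own sketch; the exact code in Figure 3a may differ), every use of T below becomes Object in bytecode, leaving only indirect clues such as the Integer.valueOf() argument for recovering the concrete type parameter:

```java
// Sketch of type erasure; identifiers are ours, not the paper's exact code.
public class Box<T> {
    private T value;                      // erased to: private Object value;

    public void set(T v) { value = v; }   // erased to: set(Object v)
    public T get() { return value; }      // erased to: Object get()

    static Box<Integer> instantiate() {
        Box<Integer> box = new Box<>();   // bytecode sees only: new Box()
        box.set(Integer.valueOf(42));     // clue: the argument is an Integer
        return box;                       // clue: the declared return type
    }

    public static void main(String[] args) {
        System.out.println(instantiate().get()); // prints 42
    }
}
```

A decompiler working only from the erased method body must reconstruct `Box<Integer>` at the instantiation site from such clues; missing any one source of evidence produces the kind of inference bug described above.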
An additional source of type inference bugs can be attributed to indistinguishable variables in bytecode, which we name Case T-p2. Take Figure 3b as an example. In the bytecode of Case T-p2, all the values are stored in a single local variable, i1, and all operations are performed on i1. It is typical for decompilers to map i1 to the variable b in the source code and inline all the operations on b, as the source code does. However, this may result in a type inference conflict: the instruction I2B indicates b is a byte, while a later method invocation with b as its input indicates it is an integer. To address this, developers can take at least three strategies for Case T-p2: (1) generating b as a byte, (2) generating b as an integer, or (3) splitting the lifespan of i1 and generating two different variables instead, b1 of type integer and b2 of type byte. The main challenge is that indistinguishable variables can have conflicting type information, and developers may easily adopt incorrect strategies for inferring their types. For instance, FernFlower's developers used to take the first strategy, and the variable b of type byte led to the invocation of the second function output, which was semantically nonequivalent to the source code. Most of the time, bugs in inferring the types of indistinguishable variables only cause "cast missing" errors, i.e., syntactic errors that manifest as missing type-casting operations in the source code. However, these bugs can also manifest as semantic errors, as in Case T-p2, which are silent yet dangerous. Therefore, developers of Java decompilers should carefully design type inference strategies for indistinguishable variables with conflicting type information.
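The conflict can be made concrete with a small sketch (our own code and names; Figure 3b and FernFlower's actual output may differ). A value is first narrowed with I2B and later used where overload resolution depends on whether it is typed as byte or int:

```java
// Hedged reconstruction of the Case T-p2 overload conflict; names are ours.
public class CaseT2Sketch {
    static String output(int v)  { return "int:" + v; }   // the intended call
    static String output(byte v) { return "byte:" + v; }  // the wrong pick

    static String run(int x) {
        byte b = (byte) x;      // compiles to an I2B narrowing instruction
        int widened = b;        // the same value then flows on as an int
        // A decompiler that types the shared slot as byte (strategy 1)
        // would select output(byte) here, silently changing the semantics.
        return output(widened); // original semantics: output(int)
    }

    public static void main(String[] args) {
        System.out.println(run(130)); // (byte)130 == -126, so prints int:-126
    }
}
```

Strategy (3), splitting the slot's lifespan into two variables of different types, is the only one that preserves both the I2B narrowing and the int-typed call site in this sketch.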

CAUSES OF UNIQUE BUGS
In this section, we demonstrate the causes of the 333 unique bugs from the perspective of commits. Table 6 depicts our statistics on the buggy files of the 333 bugs in the three projects, including the number of unique bugs (#BugU), all files (#FileT), all buggy files modified in bug fixing (#FileB), and their mean (Mean), median (Median), standard deviation (SD), minimum (Min), and maximum (Max). In the three projects, the bugs found in one single buggy file comprise 44.1%, 58.1%, and 25.9% of all bugs, respectively. Meanwhile, around 70% of bugs are fixed within five buggy files in CFR and Jadx, and the ratio even reaches 90% in FernFlower. Table 6 also reveals that fewer than half of the files have ever been modified in fixing the 333 bugs.
Table 7 lists the top 10 buggy files in the 333 bugs in the three projects, including their files (File), the number of bugs in these files (#BugU), the LOC of these files (LOC), and brief descriptions of these files (Description). We also include the bug density, i.e., the number of bugs per thousand lines of code (#BugU/KLOC), as an additional metric. A higher value in this context signifies a greater incidence of bugs, providing readers a better comprehension of the relative bug density across these files. To understand the descriptions, we first briefly introduce the main decompilation stages of modern Java decompilers. All three projects design the decompilation process with the composite pattern, in which components are code structures, from classes at the highest level to statements or even expressions at the lowest level. For each component, decompilers load the high-level representations of the target structures from bytecode and generate source code from the representations. Here we use Entity (Ent for short) to represent the high-level representation of these structures, Instantiation (Ins for short; the aforementioned region restoration is a sub-stage of instantiation) to represent the stage of loading entities from bytecode, and Generation (Gen for short) to represent the stage of generating source code from entities. In our statistical analysis, we amalgamated Ent and Ins into a unified category denoted as Ent&Ins, considering that these two stages are typically implemented within the same files, as observed in Java decompilers. There are also other stages, like Type Inference (Type for short) and Optimization (Opt for short). Type represents the type inference stage, while Opt represents the stage of refactoring entities to improve the readability of the generated source code, which is a common practice in today's decompilers [Brumley et al. 2013; Yakdan et al. 2015]. Besides Sugaring (Sug for short), Opt also contains other sub-stages, like constant&variable inlining, expression simplification, etc.
In all three projects, the top 10 buggy files are observed to include those related to Ent, Ins, and Gen of classes. This finding is justifiable considering that classes are typically regarded as the fundamental components of decompilation, containing a substantial amount of information and being susceptible to errors. The large size of these files may contribute to their high incidence of bugs. For instance, the files ClassFile and ClassWriter in CFR and FernFlower, identified as the buggiest files of the two projects, exhibit averages of 7.6 and 6.9 bugs per KLOC, respectively. These two figures are notably lower than the mean #BugU/KLOC of the top 10 buggy files within the respective projects. However, this trend does not hold universally, as indicated by our metric #BugU/KLOC. Files such as VariableFactory in CFR, VarProcess in FernFlower, and AFlag in Jadx among the top 10 buggy files, despite having comparable or even lower LOCs than other files, achieve the highest #BugU/KLOC of 53.2, 25.8, and even 180.1 in the three decompilers, respectively. In particular, AFlag is extensively used for instruction tagging to enhance region restoration, generation, and type inference, while both VariableFactory and VarProcess are utilized in variable instantiation.
Table 8 shows the distribution and density of the 333 unique bugs across the seven decompilation stages in the three projects, both individually and collectively. The columns #BugU and #LOC represent the number of unique bugs and the size of the code base (in LOC) of the three projects. The remaining columns, such as Ent&Ins and Opt, represent the distribution of bugs across the seven stages, expressed as the percentage of bugs associated with each stage. The numerical values within parentheses signify the bug density, i.e., the count of bugs per KLOC relevant to each stage. Note that Other is for bugs related to none of the four preceding stages, and Reg and Sug are for bugs related to region restoration and sugaring. We obtained the relationship between stages and bugs with the following steps: we reused the classification in Table 7, assigned the most appropriate stage(s) to each file in the three projects, and collected all the stages of the buggy files in all commits of each bug. Eventually, we identified these stages as the ones related to each bug.
Within the three projects under study, there exist 50, 37, and 115 bugs, respectively, containing modifications to multiple decompilation stages. This signifies that 60.6% ((50+37+115)/333) of the total bugs are related to multiple decompilation stages. In terms of the bug distribution, it is clear in Table 8 that Ent&Ins, Opt, and Gen are the top 3 most relevant stages to the bugs, and over 80% of bugs in all three projects are related to at least one of the three stages. Most bugs are related to Ent&Ins. A possible explanation is that there are many bugs related to Reg: in total, at least one-fifth of bugs are related to Reg, which occupy 42.6% (24.6%/57.7%) of the Ent&Ins bugs. In addition, around 40.0% (15.6%/39.0%) of the Opt bugs are also related to Sug. Regarding the bug density, while the majority of bugs are related to Ent&Ins, the number of bugs per KLOC related to this stage is notably lower compared to the other stages across the three projects, both individually and collectively. This is reasonable, since Ent&Ins occupies the largest fraction, i.e., around 58.4%, of the whole code base of the three projects. Conversely, Opt (including Sug), Gen, and Type exhibit comparatively higher bug densities, indicating a relatively elevated prevalence of bugs in these stages. In addition, although Reg is not as buggy as Opt and Type, we can easily observe that Reg is much buggier than its parent stage, Ent&Ins. Figure 4 presents the statistics of the LOCs modified in fixing the bugs, including their mean (Mean), median (Median), standard deviation (SD), minimum (Min), and maximum (Max). In Figure 4b, all the medians are smaller than 50 and are much smaller than the means. Figure 4a also shows the fraction of the 333 bugs, where the x-axis is the modified LOCs and the y-axis is the fraction of the bugs with that number of modified LOCs. The bugs in Figure 4a are systematically arranged based on their modified LOCs, progressing from low to high values. In particular, we highlight the bugs with LOCs closest to 100 in the three projects. It is clear that 79.0%, 76.7%,
and 69.7% of the bugs in the three projects are fixed within 100 LOCs. What is striking among the bugs with over 100 LOCs is that the bugs with the most LOCs in Jadx and FernFlower and the bug with the second most LOCs in CFR are all related to type inference. Based on the information gleaned from the commit fix messages, it can be postulated with high confidence that the three bugs under investigation are attributable to the incomplete type inference systems. Therefore, it is recommended that developers invest more effort in their type inference systems. Table 9 lists the average modified LOCs of bugs related to the seven decompilation stages. The average modified LOCs of all the bugs (#LOCA) are also provided. Table 9 shows that bugs related to Type have the most modified LOCs in all three projects. As we have mentioned before, this is because a considerable number of type inference bugs require a general improvement to the type inference systems and consume many more LOCs (see Section 5). Another finding in Table 9 is that bugs related to Reg and Sug consume more LOCs than their sibling stages, i.e., Ent&Ins and Opt. This finding holds for all three projects, which tells us that bugs related to Reg and Sug are generally more complex than other bugs related to Ent&Ins and Opt.

JD-TESTER&EXPERIMENT
This section presents JD-Tester, our testing framework for Java decompilers. We evaluate JD-Tester on the three Java decompilers and report the evaluation results to show JD-Tester's effectiveness in revealing decompilation bugs.

Methodology of JD-Tester
Our study begins with the manifestation of bugs in the Java decompilation process and gradually delves into their root causes. Our findings highlight the crucial requirements for uncovering bugs in Java decompilation, as well as the critical challenges faced during the decompilation process. Our results indicate that these findings can be utilized to test Java decompilers effectively and efficiently.
• Finding#1: although most of the bugs were found when decompiling large, real-world code, a considerable number of issues provide small test cases for bug reproduction. This suggests that the buggy code snippets may not be scattered throughout the entire program, but rather concentrated in specific lines of code. As a result, small test cases that exhibit different code structures can be used to detect bugs, eliminating the need for blindly collecting real-world programs.
• Finding#2: over one-fifth of bugs manifest as semantic errors, which come only after exceptions and syntactic errors. This highlights the need for Java decompiler testers to consider semantic errors, in addition to exceptions and syntactic errors, in the testing process. Thus, our JD-Tester should provide a thorough check of programs' semantics before and after decompilation.
• Finding#3: region restoration, sugaring, and type inference are the three most significant challenges in Java decompilation. This highlights the need for developers to be especially careful when working on these stages, as they are the most likely to introduce bugs in the decompilation process. Additionally, the more complex the code structures that go through these stages, the more likely bugs will be found in Java decompilers.
To take advantage of the three findings, we provide three solutions implemented in JD-Tester:
• Solution#1: according to Finding#1, JD-Tester distinguishes itself by utilizing the capabilities of advanced fuzzing techniques. Specifically, JD-Tester uses tests generated by these generators as input, making it a novel approach in comparison to previous research. Most of the previous works on Java decompilation testing have employed either real-world programs [Harrand et al. 2019, 2020; Mauthe et al. 2021] or sample programs [Kostelanskỳ and Dedera 2017]. However, none has demonstrated the effectiveness of these generators in testing Java decompilers, despite the fact that the generated test cases can be highly complex and nearly inexhaustible in number.
• Solution#2: according to Finding#2, JD-Tester engages in differential testing for Java decompilers. Unlike prior studies, which only reveal bugs in decompilers through the process of decompiling executable artifacts and compiling the decompiled code [Harrand et al. 2019, 2020; Mauthe et al. 2021], JD-Tester also verifies the semantic equivalence of the source code and the decompiled code by comparing their execution results. While Harrand et al. [Harrand et al. 2020] make use of existing test cases in real-world programs, not every code fragment has its own test, whereas tests generated by generators are all executable.
• Solution#3: according to Finding#3, we conducted an investigation into both previous studies [Bonnaventure et al. 2021; Chaliasos et al. 2022; Chen et al. 2019b, 2016; Yoshikawa et al. 2003; Zang et al. 2023] and open-source projects [Mohammad R. Haghighat 2016, 2018, 2023] on Java fuzzing. After careful evaluation, we selected JavaFuzzer [Mohammad R. Haghighat 2018] and Hephaestus [Chaliasos et al. 2022] to generate tests in the JD-Tester framework. JavaFuzzer stands out as a highly efficient Java test generator operating at the source-code level, notable for its self-sufficiency, requiring no additional input. JavaFuzzer incorporates a diverse set of code patterns that comprise combinations of loop, switch, and other GOTO structures, as well as patterns involving explicit and implicit type conversion, which align with Case R and Case T-p2 in Section 5, respectively. Conversely, Hephaestus adopts a distinctive approach by applying targeted transformations to an input program, aiming to expose compiler bugs associated with type inference. This is particularly relevant to generic types and lambda expressions, the latter being a significant and frequently-used feature introduced in Java 8. Notably, lambda expressions, while enhancing expressiveness, may introduce bugs during the sugaring process. In other words, Hephaestus aligns with Case S and Case T-p1. It is evident that JavaFuzzer and Hephaestus exhibit mutual complementarity, and these two test generators are adept at precisely addressing the critical concerns arising from our observations within Case S, Case R, and Case T.
Furthermore, tests generated by JavaFuzzer and Hephaestus can exhibit a remarkable breadth, exploring more complex combinations of these error-prone structures, including rare ones that may not have been encountered in real-world scenarios. Hence, we believe JavaFuzzer and Hephaestus can help JD-Tester reveal more bugs due to sugaring, region restoration, and type inference. Certainly, additional kinds of generators can be integrated into our JD-Tester to focus on bugs related to generic types and the characteristics of new Java versions, which we leave as future work.
It is important to clarify that our methodology does not aim for the precise regeneration of the original source code from bytecode. Rather, our purpose is more achievable, focusing on producing code that is both syntactically correct and semantically equivalent to the original source code. Given that the primary purpose of Java decompilation is to facilitate code comprehension and reuse, generating syntactically correct and semantically equivalent decompiled code serves this purpose well. Hence, the design philosophy of our JD-Tester aligns with this purpose, with a concentrated focus on the identification of decompilation exceptions, syntactic errors, and semantic errors. Specifically, by revealing these three kinds of errors, JD-Tester helps Java decompilers reduce the exceptions generated during decompilation. These exceptions typically lead to empty implementations of specific classes or methods, which are completely incomprehensible and nonreusable. JD-Tester also helps reduce semantically inconsistent code, a hard-to-detect blind spot without the support of source code, as detailed in Section 4.2. This reduction not only minimizes potential user misunderstandings but also mitigates the risk of harmful reuse of decompiled code. In addition, JD-Tester aims to solidify Java decompilers to improve their ability to produce syntactically correct decompiled code. This enhancement mitigates potential ambiguities in users' interpretation of the code semantics while facilitating the direct reuse of the decompiled code.
The workflow of JD-Tester is presented in Figure 5. JD-Tester starts by generating tests with JavaFuzzer and Hephaestus. For each test t, JD-Tester compiles t into a Java bytecode file and then records the execution result r by executing t. Subsequently, JD-Tester decompiles the bytecode with the decompilers under test and obtains the decompiled code t′. t′ is recompiled and re-executed, and its result r′ is compared with the original result r. Throughout the aforementioned procedure, JD-Tester records three different kinds of failures: Decompilation Failure, where exceptions instead of the decompiled code t′ are output by decompilers; Recompilation Failure, where the decompiled code t′ fails to recompile; and Comparison Failure, where r′ differs from r. The three kinds of failures can be roughly mapped to the exceptions, syntactic errors, and semantic errors explained in Section 4.2.
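The triage at the end of this workflow can be summarized as a small decision function. This is our own simplification of the pipeline's outcome handling, not JD-Tester's actual code:

```java
// Sketch: mapping pipeline outcomes to JD-Tester's three failure kinds.
public class FailureTriage {
    enum Failure { NONE, DECOMPILATION, RECOMPILATION, COMPARISON }

    static Failure classify(boolean decompiledOk, boolean recompiledOk,
                            String result, String roundTripResult) {
        if (!decompiledOk) return Failure.DECOMPILATION; // decompiler threw (exception)
        if (!recompiledOk) return Failure.RECOMPILATION; // t' does not compile (syntactic)
        if (!result.equals(roundTripResult))
            return Failure.COMPARISON;                   // r' differs from r (semantic)
        return Failure.NONE;
    }

    public static void main(String[] args) {
        System.out.println(classify(true, true, "42", "41")); // COMPARISON
    }
}
```

The ordering matters: a recompilation failure can only be observed if decompilation succeeded, and a comparison failure only if both earlier steps did.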
Note that JD-Tester also takes advantage of Perses [Sun et al. 2018], one of the state-of-the-art program reducers, to minimize the test cases for further manual analysis. For the source code t, Perses tries to minimize the code to a reduced form denoted as t′′. The minimized code snippet t′′ should accurately reproduce the decompilation/recompilation/comparison failures encountered in testing the original code t. The role played by Perses is noteworthy, as it significantly contributes to our ability to recognize the critical code structures associated with these failures. Based on these failure-related code structures, we identified unique bugs from failures with distinct code structures and eventually reported these bugs to developers.

Experimental Setup
This section illustrates the detailed setup of our experiment, especially the configurations of the used compilers, Java versions, and the test generation tools JavaFuzzer and Hephaestus. Compilers&Versions: JD-Tester supports both jar files and dex files, i.e., the bytecode files of Java programs and Android apps, respectively. This is because quite a few Java decompilers, like Jadx, mainly focus on Android apps. For compilation from the tests to jar files, we used the Oracle javac compiler version 11.0.3 in our JD-Tester, which is the most popular Java compiler and JDK version at present [Mohammad R. Haghighat 2022]. For compilation from jar files to dex files, we used the d8 compiler version 33.0.2, which is provided by the official Android SDK build tools.
Configurations of JD-Tester: JD-Tester took the three Java decompilers at their latest versions as of our empirical study as test subjects in the experiment. JD-Tester took a total of 1160 Java test cases as input, among which half were generated by JavaFuzzer after a 5-hour execution and the other half were generated by Hephaestus. In detail, the 580 cases generated by JavaFuzzer each contained one class with 89-414 LOCs (223 on average) and 5-7 methods (5.1 on average), whereas the 580 cases generated by Hephaestus each contained 5-46 classes (15 on average) with 31-706 LOCs (191 on average) and 5-87 methods (24 on average). In addition, given that the tests generated by JavaFuzzer do not contain any entry point for execution, we made modifications to JavaFuzzer. Specifically, we enhanced it to generate an additional Main class to serve as the entry point. Within this Main class, all other concrete classes are instantiated in its main method, and any exceptions thrown during instantiation are recorded for comparison.
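The generated entry point might look roughly like the following. This is a hypothetical shape of the added Main class (GeneratedA and GeneratedB are our stand-ins for fuzzer output; the real generated code may differ):

```java
// Stand-ins for classes JavaFuzzer would generate.
class GeneratedA { GeneratedA() { /* fuzzer-generated body */ } }
class GeneratedB { GeneratedB() { throw new IllegalStateException("boom"); } }

// Hypothetical shape of the added Main entry point.
public class Main {
    static String instantiateAll() {
        StringBuilder log = new StringBuilder();
        try {
            new GeneratedA();
            log.append("A:ok;");
        } catch (Throwable t) {
            log.append("A:").append(t.getClass().getSimpleName()).append(';');
        }
        try {
            new GeneratedB();
            log.append("B:ok;");
        } catch (Throwable t) {
            log.append("B:").append(t.getClass().getSimpleName()).append(';');
        }
        return log.toString();
    }

    public static void main(String[] args) {
        // The printed log is what gets compared before and after decompilation.
        System.out.println(instantiateAll()); // A:ok;B:IllegalStateException;
    }
}
```

Recording which constructors throw, rather than aborting on the first exception, keeps the output deterministic and comparable across the original and decompiled versions.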

Experimental Results
This section gives a detailed analysis of the failures and bugs revealed by JD-Tester.
7.3.1 Decompilation, Recompilation, and Comparison Failures. Figure 6 gives a profile of the decompilation (Dec), recompilation (Rec), and comparison (Com) failures (#Fail) revealed by JD-Tester with JavaFuzzer and Hephaestus, respectively. Note that Figure 6 also provides the number of unique bugs (#Bug). These unique bugs were manually identified from failures exhibiting distinct code structures and error symptoms, as we have mentioned in Section 7. Whether tested with tests generated by JavaFuzzer or Hephaestus, the failure rates of CFR and Jadx are far higher than those observed with real-world programs: in Mauthe et al.'s experiment using real-world Android apps and malware, Jadx achieved a very low failure rate of only 0.02% failed methods on average [Mauthe et al. 2021]; in Harrand et al.'s experiment using real-world Java programs, CFR and Jadx achieved recompilation failure rates of only 16.6% and 31.3%, and comparison failure rates of 0.9% and 3.2% [Harrand et al. 2019, 2020]. It is plausible that the numerous decompilation and recompilation failures observed in CFR and Jadx can be attributed to the decompilers' inability to decompile corner code patterns that recur in the 580 tests. For instance, in Jadx, a "cast missing" error (Bug #J-10 in Table 2) occurred repeatedly on the same code pattern, which is shared by 159 tests generated by JavaFuzzer. To illustrate this pattern, consider an assignment statement x+=y, where x is a long variable and y is a float constant such as 0.275f. In this case, Jadx inlined the constant y and simplified the assignment to x+=0.275f. This simplification resulted in a syntactic error, since a float constant cannot be directly added to a long variable, and the recompilation produced an error log of "incompatible types: possible lossy conversion from float to long". JD-Tester encountered this failure in 159 similar assignments across 159 tests. Similar cases can also be observed in the 580 tests generated by Hephaestus. For instance, given a class declaration of class T1<V,U,F>, CFR failed to infer the types used in the instantiation of such class types, e.g., new T1<Byte,Byte,Short>(b1); instead, CFR gave the decompilation result of new T1<V,U,F>((Boolean)bl) (Bug #C-12). Such class declarations occurred in 275 test cases generated by Hephaestus and, therefore, resulted in 275 failures.
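The essence of this "cast missing" pattern can be sketched as follows (our own reconstruction with our own identifiers). A compound assignment on a long with a float operand carries an implicit narrowing cast that a plain expansion lacks, so a decompiler that rewrites the expression must re-insert the cast or the decompiled code no longer compiles:

```java
// Sketch of the "cast missing" pattern behind Bug #J-10; identifiers are ours.
public class CastMissingSketch {
    static long original(long x) {
        float y = 0.275f;
        x += y;          // legal: compound assignment implicitly casts back to long
        return x;
    }

    static long faithfulDecompilation(long x) {
        // A decompiler that inlines the constant must keep an explicit cast.
        // Emitting `x = x + 0.275f` instead fails to recompile with
        // "incompatible types: possible lossy conversion from float to long".
        x = (long) (x + 0.275f);
        return x;
    }

    public static void main(String[] args) {
        System.out.println(original(10));              // 10
        System.out.println(faithfulDecompilation(10)); // 10
    }
}
```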
The only exception, however, lies in the testing of FernFlower with the test cases generated by JavaFuzzer, where FernFlower encountered the fewest failures with the fewest LOCs. We conjecture that a likely explanation for this superior performance of FernFlower lies in its utilization of established and effective algorithms from academic research, especially in challenging stages such as region restoration. Specifically, during region restoration, FernFlower capitalizes on Tarjan's algorithm [Tarjan 1972] to identify strongly connected components in the control-flow graph.
On the other hand, the developers of CFR and Jadx rely more on their personal insights when determining restoration patterns, as delineated in Section 5.
7.3.2 Revealed Bugs. Our JD-Tester is the first differential testing technique targeting Java decompilers. Our JD-Tester demonstrates considerable promise and efficacy, having successfully identified 62 unique decompilation failures with unique error symptoms observed from the decompilers'/compilers' error logs and the related code structures. We identified each kind of these failures as a unique bug and reported it to the developers. In other words, we have filed a total of 62 issues for 62 unique bugs thus far.
Table 10 lists all the 62 bugs we revealed, including their failure types (Fail, where Dec stands for decompilation failures, Rec stands for recompilation failures, and Com stands for comparison failures), the generators whose test cases revealed the bugs (Sou, where J stands for JavaFuzzer and H stands for Hephaestus), whether the bugs were revealed by our JD-Tester for the first time (New), the number of tests presenting the same bugs (#), brief descriptions or the error logs of the bugs (Description(Error Logs)), the IDs of the issues we reported for these bugs (Issue No.), and the states of these issues (State). In addition, the first letter of the ID indicates which project the bug belongs to: "C" for CFR, "F" for FernFlower, and "J" for Jadx.
Among the 62 bugs, 19 are in CFR, 10 in FernFlower, and 33 in Jadx. Our analysis of the failures in test cases generated by JavaFuzzer exposed 11, 6, and 20 bugs in CFR, FernFlower, and Jadx, respectively. In a parallel investigation with tests generated by Hephaestus, we revealed 8, 4, and 13 bugs in CFR, FernFlower, and Jadx, respectively. It is noteworthy that 60 out of the 62 bugs were revealed by JD-Tester for the first time. This fact underscores the imperfections in Java decompilers and highlights the efficacy of our JD-Tester in uncovering bugs deeply hidden in Java decompilers. Meanwhile, each of these bugs was manually summarized from decompilation failures associated with distinct code structures and error symptoms. In other words, meticulous efforts were made to prevent the redundant identification of unique bugs from similar decompilation failures, thereby contributing to our high confidence in their uniqueness. Furthermore, among all 13 confirmed or fixed bugs, no instances of duplication were reported by developers, further bolstering our confidence. The detection of these new and unique bugs leading to decompilation failures also underscores the dual benefits of our JD-Tester in enhancing both human comprehension and the reuse of Java decompiled code. Specifically, our JD-Tester aids Java decompilers in minimizing the exceptions thrown during the decompilation process. These exceptions often result in empty implementations of specific classes or methods, rendering them incomprehensible and non-reusable. Additionally, our JD-Tester helps reduce semantically inconsistent code, a hard-to-detect blind spot without source code support, as discussed in Section 4.2. This reduction not only minimizes potential user misunderstandings but also mitigates the risk of harmful reuse of decompiled code.
Upon thorough examination of these identified bugs, we have confirmed that there exists no overlap between the bugs revealed in tests generated by JavaFuzzer and Hephaestus. This is reasonable, since these two tools have their own priorities in test generation. This discernment is further underscored by an examination of the bugs revealed by the respective test cases of the two tools. Specifically, of the 21 bugs revealed through tests generated by JavaFuzzer, a notable correlation is observed with bugs related to type inference for primitive types (Case T-p2) or the restoration of regions associated with GOTO structures (Case R). In contrast, 17 bugs revealed with Hephaestus were highly related to generic types (Case T-p1) or lambda expressions (part of Case S).
In particular, among the 37 bugs revealed with JavaFuzzer, we found three bugs (#J-1, #J-2, and #C-1) related to multi-entry loops in Jadx and CFR. However, these failures did not occur in FernFlower, demonstrating that quite a number of the multi-entry loop bugs found by JD-Tester should be feasible to fix. We also found 2 bugs (#C-8 and #F-4) in expression simplification, a sub-stage of optimization. Consider a composite expression s+(103.596F+ iMeth()) in the source code. FernFlower tries to simplify it into s + 103.596F + (float)iMeth() by removing the brackets around (103.596F+ iMeth()). Although it seems harmless, this simplification changes the expression's execution order and leads to a floating-point error.
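Why removing the brackets is not harmless: float addition is not associative. The constants below are our own, chosen to make the difference stark (the reported bug involved 103.596F and iMeth(), where the effect is subtler):

```java
// Sketch of why re-associating float additions changes results.
public class ReorderSketch {
    static float one() { return 1.0f; }

    // Grouping as in the original source: s + (c + one())
    static float grouped()   { return -1e20f + (1e20f + one()); }

    // After the "simplification" drops the brackets: (s + c) + one()
    static float flattened() { return -1e20f + 1e20f + one(); }

    public static void main(String[] args) {
        System.out.println(grouped());   // 0.0: one() is absorbed by the huge 1e20f
        System.out.println(flattened()); // 1.0: the huge terms cancel first
    }
}
```

Because the two groupings round at different points, a decompiler that re-associates float expressions can silently change a program's numeric output.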
It is also interesting that four bugs (#C-18, #C-19, #F-9, #J-25) were related to the inference of primitive types, out of which three (#C-18, #C-19, and #F-9) were bound to ternary expressions, i.e., (condition) ? expression-if-true : expression-if-false. Through an examination of these bugs, we found that these cases were quite similar to our Case R: in bytecode, ternary expressions are usually compiled into IFEQ instructions, which, however, can also be used to compile if statements. Therefore, the heuristic decompilation patterns devised for beautifying specific if statements may inadvertently give rise to bugs when applied to these ternary expressions.
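A minimal pair illustrating the ambiguity (our own example): both methods below compile to closely similar IFEQ-based branch shapes, so a restoration heuristic tuned for if statements has little to distinguish them by.

```java
// Ternary vs. if: near-identical IFEQ-shaped bytecode (exact instructions
// depend on the javac version).
public class TernarySketch {
    static int viaTernary(boolean cond) {
        return cond ? 1 : 2;   // roughly: ifeq L; push 1; goto M; L: push 2; M: ireturn
    }

    static int viaIf(boolean cond) {
        if (cond) {            // also begins with an ifeq branch
            return 1;
        }
        return 2;
    }

    public static void main(String[] args) {
        System.out.println(viaTernary(true) + " " + viaIf(true));   // 1 1
        System.out.println(viaTernary(false) + " " + viaIf(false)); // 2 2
    }
}
```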
Note that there are a total of 28 type-inference-related bugs (10+13+5) revealed from the tests of both JavaFuzzer and Hephaestus. Moreover, type inference bugs arise from both primitive and generic types, and these issues are distributed across all three decompilers. This observation aligns with the finding in our empirical study that type inference should be the most complex stage in decompilation, and that quite a proportion of the studied bugs are related to this stage. Collectively, these findings underscore the non-trivial nature of the type inference task in decompilation. Despite having garnered substantial attention, it is evident that the type inference process in real-world Java decompilers remains a work in progress.
Eventually, developers of FernFlower have confirmed eight of the ten issues, five of which manifest as semantic errors. It is worth noting that among these five issues, four were reported for the first time. Developers of Jadx have confirmed five issues, and four of them have been fixed. The lack of response to the remaining issues may be attributed to several factors. Firstly, one reason may be inferred from their comment on one of the fixed issues (#J-1), where the developers posited that this bug was attributable to unusual code structures (such as the empty loop in #J-1) that might rarely be encountered in real-world scenarios. However, we believe such unusual code is still worth noticing, because inserting it into normal code to confuse Java or Android decompilers is a well-known obfuscation technique [Balachandran et al. 2016; Dong et al. 2018]. Secondly, when fixing the lambda-expression-related bug #J-24, Jadx's developers mentioned that this bug was only partially fixed, as the inlining of multi-instruction lambdas was still in progress. This comment indicates that supporting multi-instruction inlining is a complex, long-term task, and the developers were aware that this inlining stage might contain quite a number of bugs. In a parallel context, the TBD tag (to be addressed in the next milestone) given to the generic-type-related bug #J-21, coupled with the absence of responses regarding other generic-type-related bugs thereafter, could arise from similar considerations. This assumption is reasonable since the inference of generic types is also a difficult, long-term task, like the inlining of lambda expressions. We conjecture these two reasons may explain why Jadx's developers have not replied to the remaining issues. For CFR, its developers have exhibited limited activity in bug fixing, which may account for the lack of response.

RELATED WORK
The correctness of decompilation has been studied extensively. In this section, we discuss representative related efforts with respect to our study.
Studies on the correctness of Java decompilation can be traced back to as early as 2009. Hamilton and Danicic [Hamilton and Danicic 2009] provided the first evaluation of the decompilation results of 13 Java decompilers with respect to a 9-level correctness metric, ranging from throwing exceptions to generating semantically equivalent code with perfect layout. However, only 9 test samples from previous research were used in the evaluation. This evaluation was redone in 2017 [Kostelanský and Dedera 2017] on the latest versions of these Java decompilers. More recently, Mauthe et al. [Mauthe et al. 2021] provided a large-scale empirical study of the decompilation success rate of four popular decompilers on ten thousand real-world Android apps and malware samples. Mauthe et al. used a coarse-grained metric that categorizes decompilation results into only three classes: success, failure, and timeout. Harrand et al. [Harrand et al. 2019, 2020] assessed the decompilation results of eight Java decompilers on 14 well-known Java projects with respect to three quality indicators: syntactic correctness, syntactic distortion, and semantic equivalence. All these efforts focus only on the decompilation results of Java decompilers. Our work is the first comprehensive study on both the characteristics and causes of Java decompiler bugs, wherein several representative bugs are given to illustrate their root causes and the main challenges behind them.
Based on the study results, we proposed JD-Tester, a differential testing framework for Java decompilers. We thus also discuss related work on testing decompilation/disassembly. Paleari et al. [Paleari et al. 2010] presented N-version disassembly, a method for checking the correctness of disassemblers via differential testing. Several efforts checked the semantic equivalence of IRs generated by binary lifters with formal verification techniques [Dasgupta et al. 2020; Kim et al. 2017]. Liu et al. [Liu and Wang 2020] proposed a testing framework that is the most related to JD-Tester,

Fig. 2. The source code (left) and the bytecode (right) of Case S and Case R.

Fig. 3. The source code (left) and the bytecode (right) of Case T-p1 and Case T-p2.

Table 1. Statistics of the three projects.

Table 2. Issues and commits of the 333 unique bugs in the three projects.

Table 3. Artifacts provided in the 333 bugs.

Table 4. Error symptoms of the 333 bugs.

Table 5. Distribution of bugs by whether they provide source code and whether they manifest as semantic errors.

Table 6. Buggy files of the 333 bugs.

Table 7. The top 10 buggy files of the 333 bugs.

Table 8. The distribution and density of the 333 bugs across the seven decompilation stages.

Table 9. The average modified LOCs of the 333 bugs related to the seven decompilation stages.

Figure 4b depicts our statistics of the modified LOCs of the 333 bugs, including all LOCs (#LOC_T) CFR and Jadx. In other words, CFR and Jadx failed on 36.0% and 84.0% of the 580 tests. However, as the evaluations progress with the test cases generated by Hephaestus, CFR, FernFlower, and Jadx demonstrate comparable failure rates, reaching 69.3%, 62.9%, and 43.8%, respectively. To note, almost all these failures are recompilation failures, with the exception of 22 decompilation failures (3.8% = 22/580) encountered by Jadx.

Table 10. 62 bugs revealed when testing the three decompilers with JD-Tester.