Symbol-Specific Sparsification of Interprocedural Distributive Environment Problems

Previous work has shown that one can often greatly speed up static analysis by computing data flows not for every edge in the program's control-flow graph but instead only along definition-use chains. This yields a so-called sparse static analysis. Recent work on SparseDroid has shown that taint analysis specifically can be "sparsified" with extraordinary effectiveness because the taint state of one variable does not depend on those of others. This allows one to soundly omit more flow-function computations than in the general case. In this work, we assess whether this result carries over to the more generic setting of so-called Interprocedural Distributive Environment (IDE) problems. As opposed to taint analysis, IDE comprises distributive problems with large or even infinitely broad domains, such as typestate analysis or linear constant propagation. Specifically, this paper presents Sparse IDE, a framework that realizes sparsification for any static analysis that fits the IDE framework. We implement Sparse IDE in SparseHeros, an extension to the popular Heros IDE solver, and evaluate its performance on real-world Java libraries by comparing it to the baseline IDE algorithm. To this end, we design, implement and evaluate a linear constant propagation analysis client on top of SparseHeros. Our experiments show that, although IDE analyses can only be sparsified with respect to symbols and not (numeric) values, Sparse IDE can nonetheless yield significantly lower runtimes and often also reduced memory consumption compared to the original IDE.


INTRODUCTION
Static program analysis has proven useful for diverse purposes including compiler optimization [18], program comprehension [9] and developer assistance [38]. It is now an essential part of software engineering for assuring bug-free [4], secure [22] and quality software [12]. The key strength of static program analysis is to account for all possible executions of a target program. But this imposes two often competing challenges: precision and scalability. Static analyses yield more precise results by tracking statement ordering and by distinguishing different calling contexts. IDE (Interprocedural Distributive Environment) [30], with its extensions [2,24,33], is a state-of-the-art precise interprocedural static analysis framework. It covers a wide class of data-flow problems ranging from variations of classical taint analysis [16] to typestate [11,20] and constant propagation [25] analyses. IDE represents data-flow analysis problems on an exploded supergraph and models data-flow facts as environments. Environments are mappings from symbols (often program variables) to domain values. The exploded supergraph is a data-flow graph induced by the inter-procedural control-flow graph (ICFG) for the whole program. Its nodes are pairs (s, d) of program statements and data-flow facts. A data-flow fact d holds at a statement s if in the exploded supergraph the corresponding node (s, d) is reachable from the start node. The edges of the exploded supergraph represent the effects of program statements on a data-flow fact. IDE computes over the exploded supergraph by tracking all data-flow facts densely across all program points. As previous work [1,15,19,41] has shown, this approach does not scale well for large-scale real-world programs. A key observation is, however, that in practice many program statements do not affect the analysis result. Such statements can thus be safely ignored, e.g., by sparsifying the exploded supergraph.
Sparsification is a well-known technique for scaling data-flow analyses [13,14,26,31,35,36] while still maintaining their precision. Sparsification approaches create sparse versions of the original CFGs of a target program by removing statements that are irrelevant to the analysis and then computing over the sparse CFGs. Recent on-demand approaches take sparsification further by utilizing the information available during the analysis. SparseBoomerang [17] accelerates demand-driven pointer analysis by computing over sparse CFGs specialized to the alias queries. SparseDroid [15] accelerates taint analysis by computing over sparse CFGs specialized to individual data-flow facts. Both approaches demonstrate sparsification on IFDS-based problems, which focus on mere symbol reachability, without considering value computation.
The IFDS (Interprocedural Finite Distributive Subset) [29] framework is the "small brother" of IDE. It reduces data-flow analysis problems to a pure graph reachability problem. Yet, IFDS is limited to data-flow problems with finite domains: all IFDS problems can be encoded as IDE problems, but only a subset of IDE problems can be encoded as IFDS problems [30]. As an example, consider the statement a = a + 1. Here, using IFDS one can encode a simple taint analysis inferring that a is tainted/reachable after the statement if and only if it was previously tainted/reachable. Efficient computation of a's numeric value, however, requires one to compute values within the infinitely broad domain of integers, going beyond pure reachability. As we show, this has implications for sparsification: while the statement a = a + 1 can be safely considered irrelevant w.r.t. a's reachability, and will be disregarded in sparsification approaches for IFDS [15,17], it is a relevant statement when constant propagation is considered: it changes a's value. This observation is not limited to constant propagation analysis; it applies to other data-flow analysis problems that require value mappings. For instance, a sparse typestate analysis must retain statements that alter a symbol's associated state value. Based on this observation, we generalize the recent work on SparseDroid, i.e., on sparse IFDS [15]: we propose Sparse IDE, a symbol-specific sparsification of the IDE framework that enables efficient sparsification even in the presence of arbitrarily large value domains. In addition, we also show the limits of sparsification in IDE: while one can effectively sparsify with respect to symbols, such sparsification cannot be performed with respect to values.
We formalize Sparse IDE, and show how this formalization also covers IFDS data-flow analysis problems as a special case. We implement Sparse IDE in a tool SparseHeros, extending the popular Heros IDE solver [5]. We compare both implementations in terms of performance, and show that sparsification maintains correctness. To this end, we implement a linear constant propagation analysis client that uses both implementations. To validate SparseHeros's correctness, we run both on ConstantBench, a novel micro-benchmark suite for integer linear constant propagation analysis. To evaluate its performance impact, we run the analysis client on real-world Java libraries using both Heros and SparseHeros. The analysis client produces the same results in both cases while terminating significantly faster when using SparseHeros.
To summarize, this paper presents the following original contributions, whose implementations are open-sourced:
• A formalization of Sparse IDE and its implementation in SparseHeros on top of Heros and Soot [37],
• its correctness evaluation on the ConstantBench micro-benchmark suite for linear constant propagation analysis, and
• its performance evaluation on real-world Java libraries.
The remainder of the paper is organized as follows. In Section 2, we present the background. In Section 3, we introduce Sparse IDE and in Section 4, we instantiate it on linear constant propagation analysis. In Section 5, we present the evaluation results. In Section 6, we discuss the limitations of our approach and threats to its validity. In Section 7, we discuss related work, and we conclude with Section 8.

BACKGROUND
This section briefly introduces the background that our work builds on. We begin with the IFDS and IDE frameworks. Then we introduce sparse data-flow analysis and discuss why it is an effective alternative. Finally, we explain how recent approaches sparsify further by utilizing the information available during the analysis runtime.

IFDS and IDE
IFDS [29] and IDE [30] are two frameworks for interprocedural flow- and context-sensitive data-flow analysis. IFDS represents data-flow analysis problems as graph reachability on an exploded supergraph, whose nodes are pairs of program statements and data-flow facts. The individual edges in the exploded supergraph constitute flow functions; they show each statement's effect on each data-flow fact's reachability. A flow function determines whether a data-flow fact is generated, propagates to the next statement, spawns another fact, or gets killed.
Figure 1 shows how the flow functions are represented as edges in the exploded supergraph. A data-flow fact above an edge means that it holds before applying the function; a fact below means that it holds after. A special fact, Λ, always holds. Facts connected to it are newly generated. The identity function, id, leaves data-flow facts unchanged. A second flow function shows the case where a data-flow fact a is generated. A third shows how an existing fact, a, creates another fact, b, e.g. at an assignment b = a.
IDE generalizes the IFDS framework by computing the domain values that symbols map to. It does so in two phases: first it determines whether symbols are reachable, just like IFDS, and then it computes their values. IDE achieves this by annotating the individual exploded supergraph edges with so-called edge functions, which constitute environment transformers. IFDS and IDE apply to a wide class of data-flow analysis problems. IFDS requires data-flow problems to be defined with flow functions that are distributive over the merge operator. Many reachability problems such as taint, reaching definitions, or live variables analysis fall into this category. IDE, on the other hand, additionally requires data-flow problems to be expressed with distributive environment transformers. IFDS is better suited to problems with a binary value domain, e.g. taint analysis, where the domain simply consists of two values, tainted or not tainted [3]. It has been applied to more complex domains, e.g. for typestate analysis, where the domain contains arbitrary object states [23]. The drawback of IFDS is that it represents data-flow facts as symbol-value pairs, which blows up the data-flow fact space with increasing size of the domain. Because of this representation, IFDS's runtime performance depends on the value domain's size. Further, it may not terminate when the value domain is infinitely broad, e.g., in constant propagation analysis, where the domain contains all integers. IDE, on the other hand, restricts data-flow facts to static symbols and computes their (approximated) runtime values using the edge functions along the paths where the symbols are reachable in the exploded supergraph. Therefore, IDE can terminate efficiently even with infinitely broad value domains; only the set of symbols must be finite.

Sparse Data-flow Analysis
Data-flow analysis techniques aim to produce precise results while remaining scalable within a reasonable time budget. Techniques that prioritize scalability often resort to sacrificing precision aspects: flow-insensitive analyses ignore control-flow ordering [40], field-insensitive analyses approximate field accesses [8], and context-insensitive analyses do not distinguish different calling contexts [21]. Sparse data-flow analyses, on the other hand, often improve a dense data-flow analysis' scalability while maintaining its precision. They sparsify a target program's control-flow graph by removing program statements that provably do not affect the analysis result. Sparsification often uses a cheaper pre-analysis stage to aid a more expensive analysis [14,31,35]. Recent on-demand sparse data-flow analyses sparsify further by exploiting information that is only available during analysis runtime [15,17].

Fact-Specific On-Demand Sparsification
When IFDS and IDE compute a data-flow fact's reachability, starting from the statement that generates the data-flow fact, they propagate it along all statements as long as it is not killed. At each statement, they check whether the statement is relevant for all the data-flow facts that have reached it. Figure 3 shows how reachability is computed for an example constant-propagation analysis setting. The fact-specific id edges and non-id edges show the edges which IFDS and IDE create when propagating data-flow facts. The data-flow facts actually only need to be propagated to the required nodes. For instance, data-flow fact a only needs to propagate to the statement b = a; all other statements are redundant for a. Similarly, b only needs to propagate to the statement c = b + 1. Based on this observation, He et al. [15] introduced the sparse IFDS algorithm in their implementation SparseDroid. Instead of propagating all the data-flow facts to the next statement, it propagates them simply to the next statement that uses the facts. Sparse IFDS keeps all non-id edges and replaces the fact-specific id edges with sparse id edges, effectively keeping all required nodes and skipping over all redundant nodes.
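The core mechanism can be illustrated with a deliberately simplified sketch (our own minimal model: statements as strings, and "uses a fact" approximated by textual occurrence; the real sparse IFDS algorithm consults flow functions on a CFG):

```java
import java.util.List;

// Sketch: skip a fact from its current statement directly to its next use.
// Statements are modeled as strings; a fact is "used" if it occurs in the text.
public class NextUse {
    // Returns the index of the next statement at or after 'from' that uses 'fact',
    // or stmts.size() if the fact is never used again (propagation can stop).
    public static int nextUse(List<String> stmts, int from, String fact) {
        for (int i = from; i < stmts.size(); i++) {
            if (stmts.get(i).contains(fact)) return i;
        }
        return stmts.size();
    }
}
```

In the example of Figure 3, fact a would be forwarded directly to b = a, and b directly to c = b + 1, skipping every statement in between.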
Fact-specific on-demand sparsification allows effective propagation of data-flow facts along the sparse CFGs specific to them, and it is not limited to data-flow analysis. Recent work [17] has applied it to pointer analysis, where the variable in an alias query is treated as the initial data-flow fact and propagated along its query-specific sparse CFGs. So far, however, fact-specific on-demand sparsification has only been applied to analysis problems that deal with fact reachability. In this work, we expand the scope of fact-specific on-demand sparsification to include data-flow analyses that compute over an additional value domain, specifically IDE.

SYMBOL-SPECIFIC ON-DEMAND SPARSIFICATION WITH SPARSE IDE
In this section, we first explain the original IDE algorithm [30] in detail. We then introduce the Sparse IDE algorithm by highlighting the modifications to the original IDE algorithm.

The Original IDE Algorithm
Sagiv et al. [30] define an IDE problem instance formally as IP = (G*, D, L, M), where
• G* is the program supergraph (ICFG), which consists of the control flow graphs (CFGs), G_p, of the individual procedures p,
• D is a finite set of program symbols,
• L is a finite-height lattice (which can be infinitely broad), and
• M is an assignment of distributive environment transformers to the edges of G*.
The original IDE algorithm [30] solves such an IDE problem IP in two phases. In Phase I, it creates the jump functions that show the reachability of each d ∈ D, by assuming that their initial mappings to L are always ⊤. In Phase II, it computes each d's actual value mapping to L by evaluating the edge functions defined in M.
According to Sagiv et al. [30], the total cost of the IDE algorithm is bounded by O(|E||D|³), which is the cost of Phase I. Since D is the set of symbols, it should not change if correctness is to be preserved. We therefore apply our sparsification approach in Phase I, where the jump functions are created, by reducing E, the set of edges. Phase II is oblivious to how the jump functions are created; it automatically benefits from the sparsification of Phase I.
Figure 4 shows the algorithm for Phase I. Each procedure p's CFG, G_p, consists of a start node s_p, an exit node e_p, and normal (non-call) nodes n or m. Procedure calls are represented with two nodes: the call-site node c denotes the point right before the procedure call, and the return-site node r denotes the point right after. Program symbols, e.g. variables, access paths, etc., are denoted with d′, d ∈ D ∪ {Λ}, including the special symbol Λ. Λ is required for generating new symbols at arbitrary program points.
Initialization. In lines 2-5, jump and summary functions are initialized. Jump functions, denoted by J, correspond to the same-level realizable paths (SLRPs) from the start node s_p of a procedure p to a node n in p. Summary functions, denoted by S, summarize the effect of a procedure call through same-level realizable paths from the call-site c to the return-site r. In line 3, J(⟨s_p, d′⟩ → ⟨n, d⟩) = λl.⊤ states that the jump function from the node ⟨s_p, d′⟩ to each ⟨n, d⟩ is initialized to λl.⊤. In line 5, S(⟨c, d′⟩ → ⟨r, d⟩) = λl.⊤ states that the summary function from each call-site node ⟨c, d′⟩ to its corresponding return-site ⟨r, d⟩ is initialized to λl.⊤. Line 6 initializes the PathWorkList to {⟨s_main, Λ⟩ → ⟨s_main, Λ⟩}, representing a self-loop edge on the start node of the main procedure whose jump function is the identity function, id. The jump function from the start node s_p until the current statement n is denoted with f.
Call nodes. Lines 12-19 handle the case where n is a call-site node in p, calling a procedure q. In line 14, the self-loop edge on the start node of the callee procedure q is initialized with id. In line 17, the edge from s_p to the corresponding return-site r is computed by composing f, the jump function until n, and the edge function from n to r. In line 19, the edge from s_p to the corresponding return-site r is computed by composing f and f₃, the corresponding summary function, when it does not map to ⊤.
Exit nodes. Lines 20-30 handle the case where n is the exit node e_p of p. Edges from each call-site node c to the start node s_p (shown with f₄) and from the exit node e_p to each caller's return-site r (shown with f₅) must be computed. In line 25, a new summary function s′ is computed by composing f₅, f, and f₄, and merging the result with the existing summary function for the same c and r. When it is a new summary, a new jump function is computed from the caller procedure's start node s_q to the return-site node r by composing s′ with the existing jump function f₃ from s_q to the call-site node c.
Normal nodes. Lines 31-33 handle the case where n is a non-call, i.e., intraprocedural, node. Edges from the start node s_p to each node m, a statement that appears directly after n in procedure p, are computed by composing the edges from s_p to n (shown with f) and the edges from n to m.

The Sparse IDE Algorithm
In the original IDE algorithm, each symbol d ∈ D ∪ {Λ} at a statement n is propagated to its direct successor statement m. As also pointed out in previous work [15], this behavior is desired when n is a call or exit node. For these nodes, the reachability of each d in different contexts is left to the data-flow function definition. Call-flow functions propagate each d into the context of the callee procedure. Return-flow functions propagate each d back to the context of the caller procedure. Call-to-return-flow functions propagate each d from before a procedure is called to after the procedure is called. However, when n is a non-call node, each d can safely be propagated to d's next use statement.
Figure 5 shows the modifications for the Sparse IDE algorithm for Phase I. We replace line 17 from the original IDE algorithm with lines 17-19 in the Sparse IDE algorithm. Instead of propagating d₃ to the direct return-site node r, we obtain r′, the next use statement of d₃ in its symbol-specific sparse control flow graph. Similarly, we replace line 33 with lines 33-35, to propagate d₃ to its next use statement m′ in its sparse control flow graph. Our sparsification approach mirrors that of the sparse IFDS algorithm [15]; however, since we generalize it to IDE, we also account for edge function composition.

Sparse IFDS Revisited
As shown in Figure 3, a statement can behave as an identity function, meaning it does not affect any data-flow fact d ∈ D. However, as shown by He et al. [15], many statements only affect a few data-flow facts, often even just a single fact. Their flow functions can be considered fact-specific identity functions for the facts that they do not affect. Sparse IFDS defines fact-specific identity functions as follows [15]: given a symbol d ∈ D and a flow function f ∈ 2^D → 2^D, f is a d-specific identity function if the following conditions hold:

(1.1) ∀S ⊆ D : d ∈ f(S) ⟺ d ∈ S
(1.2) ∀S ⊆ D : ∀d′ ∈ D \ {d} : d′ ∈ f(S) ⟺ d′ ∈ f(S \ {d})

Condition 1.1 states that d is not affected by other facts when applying f, and 1.2 states that d does not affect the other facts when applying f. However, these conditions only apply to symbols from D and ignore mappings from D to the value domain L; if applied to IDE problems, one would wrongly treat flow functions that are annotated with non-identity edge functions as d-specific identity functions as well.
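The two conditions can be checked by brute force over a small, explicit fact domain. The following sketch uses our own naming and is not He et al.'s implementation:

```java
import java.util.*;
import java.util.function.Function;

// Sketch: brute-force check whether a flow function f : 2^D -> 2^D
// is a d-specific identity function over a small, explicit domain D.
public class DSpecificId {
    public static boolean isDSpecificIdentity(
            Function<Set<String>, Set<String>> f, String d, Set<String> domain) {
        for (Set<String> s : powerSet(domain)) {
            Set<String> out = f.apply(s);
            // Condition 1.1: d's membership is preserved by f.
            if (out.contains(d) != s.contains(d)) return false;
            // Condition 1.2: removing d from the input does not change
            // the result for any other fact d'.
            Set<String> withoutD = new HashSet<>(s);
            withoutD.remove(d);
            Set<String> out2 = f.apply(withoutD);
            for (String dPrime : domain) {
                if (dPrime.equals(d)) continue;
                if (out.contains(dPrime) != out2.contains(dPrime)) return false;
            }
        }
        return true;
    }

    static List<Set<String>> powerSet(Set<String> domain) {
        List<String> elems = new ArrayList<>(domain);
        List<Set<String>> result = new ArrayList<>();
        for (int mask = 0; mask < (1 << elems.size()); mask++) {
            Set<String> s = new HashSet<>();
            for (int i = 0; i < elems.size(); i++)
                if ((mask & (1 << i)) != 0) s.add(elems.get(i));
            result.add(s);
        }
        return result;
    }
}
```

For the statement b = a, the flow function adds b whenever a is present; it satisfies (1.1) for d = a but violates (1.2), so the statement is not a-specific identity.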
Figure 6 shows two important cases where sparse IFDS would sparsify incorrectly. First, reassignments: a = 3 reassigns a, but sparse IFDS recognizes that a already exists (is "tainted"), and therefore it treats this statement as a-specific identity. Second, value updates: a = a + 1 updates a's value, but sparse IFDS has no notion of values; therefore, from its perspective, this statement is "identity" as well. Sparse IDE, on the other hand, is aware of the effects on the value domain and retains both statements.

Fact-Specific Identity Transformers
To generalize fact-specific sparsification to the IDE framework, we define symbol-specific identity transformers that take into account the environments that map the symbols from domain D to the values from domain L. Given a symbol d ∈ D and a value l ∈ L, env = [d ↦ l] is an environment mapping from D to L, i.e., env(d) = l. Then env is an element of the set of environments Env(D, L). An environment transformer t ∈ Env(D, L) → Env(D, L) is a d-specific identity transformer if two conditions hold. First, t leaves d's own mapping unchanged:

(2.1) given d ∈ D : ∀env ∈ Env(D, L) : t(env)(d) = env(d)

Second, for all other mappings, t produces identical results no matter whether or not d-specific mappings are present:

(2.2) given d ∈ D : ∀env ∈ Env(D, L) : ∀d′ ∈ D \ {d} : ∀l ∈ L : t(env[d ↦ l])(d′) = t(env)(d′)

We test the edge functions from Figure 2 on these conditions. t₁ is an a-specific identity transformer (t₁ ≡ id_a), because applying the identity transformer λenv.env does not change a's previous mapping. t₂ is not an a-specific identity transformer (t₂ ≢ id_a), because applying λenv.env[a ↦ 3] changes a's previous mapping. t₃ is also not an a-specific identity transformer (t₃ ≢ id_a), because applying λenv.env[b ↦ 2 · env(a) + 1] changes another symbol's mapping (for b) depending on what a maps to; and because it changes b's value, t₃ is not a b-specific identity transformer either (t₃ ≢ id_b). Note that, importantly, a transformer can only be considered a d-specific identity transformer if the above restrictions hold irrespective of any concrete l ∈ L that might be associated with d: (2.2) quantifies over all l ∈ L. This is necessary because IDE produces procedure summaries that must be sound with respect to all l, and thus their creation must not be made dependent on l. In other words, IDE can support symbol-specific but not value-specific sparsification!
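Conditions (2.1) and (2.2) can be made concrete with a small sketch (illustrative and our own naming; environments are modeled as integer maps, and L is only sampled at a few values, whereas the real quantification ranges over all of a possibly infinite L):

```java
import java.util.*;
import java.util.function.Function;

// Sketch: environments map symbols to integer values; a transformer is a
// d-specific identity if (2.1) it preserves d's mapping and (2.2) the
// results for all other symbols are independent of d's value.
public class DSpecificIdTransformer {
    // Checked only on the given sample environments and sample values for l;
    // a structural check would be needed for the full (possibly infinite) L.
    public static boolean isDSpecificIdentity(
            Function<Map<String, Integer>, Map<String, Integer>> t,
            String d, List<Map<String, Integer>> envs, List<Integer> sampleL) {
        for (Map<String, Integer> env : envs) {
            Map<String, Integer> out = t.apply(env);
            // (2.1) d's mapping is unchanged.
            if (!Objects.equals(out.get(d), env.get(d))) return false;
            // (2.2) for every other symbol, the result does not depend on
            // which value l we associate with d.
            for (Integer l : sampleL) {
                Map<String, Integer> envL = new HashMap<>(env);
                envL.put(d, l);
                Map<String, Integer> outL = t.apply(envL);
                for (String dPrime : env.keySet()) {
                    if (dPrime.equals(d)) continue;
                    if (!Objects.equals(out.get(dPrime), outL.get(dPrime))) return false;
                }
            }
        }
        return true;
    }
}
```

The transformer env ↦ env[b ↦ 2·env(a)+1] fails the check for a (the result for b depends on a's value) and for b (b's mapping changes), matching the discussion above.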

Determining Symbol-Specific Identity
When propagating fact d, we consider as irrelevant for d only those statements that fulfil conditions (2.1) and (2.2). But since these conditions are value-agnostic (they quantify over all l ∈ L), one can determine ahead of time, structurally, the statements whose environment transformers adhere to both conditions. Naturally, sparsification effectiveness is closely tied to the analysis-specific environment-transformer definitions. The environment transformer t for the statement a = a + 1 is t ≡ id_a for taint analysis, where t = λenv.env. For constant propagation analysis, however, t ≢ id_a, where t = λenv.env[a ↦ env(a) + 1].
Sparse IDE strictly generalizes Sparse IFDS as presented in SparseDroid. One can easily define sparse IFDS as an instantiation of sparse IDE by restricting the value domain L to {⊥, ⊤}, where symbols that map to ⊥ are considered reachable. In this setting, our definitions (2.1) and (2.2) become equivalent to (1.1) and (1.2).

APPLICATION TO LINEAR CONSTANT PROPAGATION
As Sagiv, Reps and Horwitz explain in their seminal work [30], constant propagation analysis is the perfect problem setting in which IDE outperforms IFDS [29]. This is not only because the problem's lattice is larger than the binary domain, but also because it is infinitely broad, in which case IFDS cannot terminate. We are, therefore, motivated to apply the Sparse IDE framework to linear constant propagation analysis. Heros, and thus SparseHeros, are generic tools; they are independent of the target language and its intermediate representation (IR). In this work, we use the Soot [37] static program analysis framework for Java and its intermediate representation, Jimple. Therefore, in the following, we explain our implementation based on the Jimple IR.

Analysis Definition
Linear constant propagation analysis handles linear expressions that generate a new data-flow fact by using just a single other fact, e.g. a = b or a = 2*b + 1. Full constant propagation analysis involves statements such as a = b + c. Such a flow function is not distributive; it cannot be precisely computed within the IDE framework. Our linear constant propagation analysis implementation handles the assignment statements shown in Table 1.
IR. The IR always ensures binary operation (binop) representation by reducing more complex operations to binary operations. For instance, a = 2*b + 1 would be reduced to s1 = 2 * b and a = s1 + 1. The IR also reduces longer access paths to multiple assignments with a single access path (n = 1). For instance, a statement such as a = b.f1.f2 would be reduced to s1 = b.f1, s2 = s1.f2, and a = s2. The same reduction applies to procedure invocations as well.
Flow functions. We generate a symbol when it is assigned a constant. As discussed, we handle binary operations in their linear form. We distinguish between the assignments that require alias handling and the ones that do not. Assignments such as local, field load, static field load, and array load overwrite the local variable on their left-hand side and therefore do not need to know its aliases. Assignments such as field store, static field store, and array store, on the other hand, require handling the aliases of the base variables or the array references. To handle aliasing we use the Boomerang [34] demand-driven pointer analysis framework. When necessary, we query the aliases of the base variables and add them to the set of propagated symbols. Note that in Table 1, the alias sets contain the query variable as well. The IDE framework requires three types of flow functions to model the effects of invoke statements. The call flow function propagates the symbol for the actual parameter to the context of the callee procedure, by mapping it to the procedure's corresponding formal parameter. The return flow function propagates the symbol for the returned variable to the context of the caller procedure, by mapping it to the symbol on the left-hand side of the invoke expression. The call-to-return flow function propagates the symbols that are not passed to the context of the callee procedure to the next statement after the invoke statement.
Edge functions. For most statements, the edge functions map the target symbol to the value of the source symbol, acting as identity transformers. The constant and binop statements are the only exceptions. The constant statement maps the target symbol to the given constant value, c. The binop statement maps the target symbol to a new value, which is computed by simulating the operation ⊙ using the source symbol's value and the constant operand, c. Edge functions must be composed and reduced to a simple value mapping when computing the actual values. Given two edge functions e₁, e₂ such that e₁ appears before e₂ as an edge in the exploded supergraph, we compose them as follows: if an edge function is the identity transformer, we always apply the other function, by the first two conditions. We always apply the subsequent edge function if it is a constant assignment, by the third condition. If the subsequent edge is a binop, we compute its value immediately in place by applying the preceding edge first, as suggested in previous work [5].
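The edge functions of linear constant propagation have the shape l ↦ a·l + b, which is closed under composition. The following sketch (our own compact representation, not Heros's actual EdgeFunction API) shows how compositions collapse to a simple value mapping:

```java
// Sketch: linear edge functions l -> a*l + b, as used in linear constant
// propagation; composing two of them yields another linear function.
public class LinearEdgeFunction {
    public final int a, b; // represents l -> a*l + b

    public LinearEdgeFunction(int a, int b) { this.a = a; this.b = b; }

    public static LinearEdgeFunction identity() { return new LinearEdgeFunction(1, 0); }

    // Edge function of a constant assignment, e.g. a = 3.
    public static LinearEdgeFunction constant(int c) { return new LinearEdgeFunction(0, c); }

    public int apply(int l) { return a * l + b; }

    // 'this' is applied first, 'then' afterwards:
    // then(this(l)) = then.a * (a*l + b) + then.b.
    public LinearEdgeFunction compose(LinearEdgeFunction then) {
        return new LinearEdgeFunction(then.a * this.a, then.a * this.b + then.b);
    }
}
```

For the path a = 3; b = a + 1, composing l ↦ 3 with l ↦ l + 1 reduces to the constant function l ↦ 4, independent of the incoming value.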
Lattice. We perform linear constant propagation on integers. Therefore, the lattice is Z extended with ⊤ and ⊥. Given l₁, l₂ in this lattice, we define the meet operator as follows: if one value is ⊤, the meet operator yields the other value, by the first two conditions. If either of the values is ⊥, the meet yields ⊥, and if both of the values are ⊤, it yields ⊤, by the third and fourth conditions, respectively.
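A minimal sketch of such a meet (our own encoding of ⊤ and ⊥ as sentinel objects; the case of two distinct constants meeting to ⊥ follows the standard constant-propagation convention):

```java
// Sketch: the lattice of integers plus TOP and BOTTOM with its meet.
// TOP = "no information yet", BOTTOM = "not a constant".
public class ConstLattice {
    public static final Object TOP = "TOP";
    public static final Object BOTTOM = "BOTTOM";

    // meet(TOP, v) = v; any BOTTOM operand yields BOTTOM; equal constants
    // meet to themselves, distinct constants to BOTTOM.
    public static Object meet(Object v1, Object v2) {
        if (v1 == TOP) return v2;
        if (v2 == TOP) return v1;
        if (v1 == BOTTOM || v2 == BOTTOM) return BOTTOM;
        return v1.equals(v2) ? v1 : BOTTOM;
    }
}
```

Note that meet(⊤, ⊤) = ⊤ falls out of the first condition, so the ordering of the cases matters.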

Sparsification for Constant Propagation
Our sparsification approach has much in common with the one proposed by He et al. [15], though modifications were necessary. We build the sparse control flow graphs (CFGs) by ignoring symbol-specific identity functions. Given a procedure m, G_m is its original dense CFG. We build sparse CFGs specific to each symbol d in m, denoted as G_m^d, and propagate d across its own sparse CFG. As shown with the IR in Table 1, d can be a local, an instance field or static field, or an array access. G_m^d is constructed by determining whether each statement's corresponding flow function in G_m is a d-specific identity function.
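A simplified sketch of this construction for straight-line code (names are ours; the real construction also retains branching statements and call sites, and caches per symbol-and-procedure pair rather than per symbol):

```java
import java.util.*;

// Sketch: build a symbol-specific "sparse CFG" for straight-line code by
// keeping only statements that are not d-specific identities, approximated
// here as "the statement mentions d". Results are cached per symbol.
public class SparseCfgBuilder {
    private final Map<String, List<String>> cache = new HashMap<>();

    public List<String> sparseCfg(String symbol, List<String> denseCfg) {
        return cache.computeIfAbsent(symbol, d -> {
            List<String> sparse = new ArrayList<>();
            for (String stmt : denseCfg)
                if (stmt.contains(d)) sparse.add(stmt);
            return sparse;
        });
    }
}
```

Each symbol is then propagated along its own, much shorter statement sequence instead of the full dense CFG.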
As a major modification, and most importantly, we account for a statement's effect on the value domain. In addition to determining whether each statement's corresponding flow function is a d-specific identity function, we determine whether its edge function is a d-specific identity transformer, with the assumptions explained in Section 3.3. Further, we propagate the tautological fact, Λ, (sparsely) to the statements that can generate new data-flow facts, e.g. constant assignments; otherwise, it would be impossible to generate new facts at arbitrary program points. Finally, we soundly retain all branching statements to keep the original CFGs' control flow as it is.

EVALUATION
We next explain the research questions that guide our evaluation and its experimental setup, and then we discuss the evaluation results. Sparse data-flow analyses promise extensive performance improvements, while still maintaining the precision of their non-sparse counterparts. Therefore, first, we compare the sparse analysis results against the non-sparse analysis results. Second, we measure whether the sparse analysis produces the promised performance benefits. Third, we investigate the factors contributing to the performance impact. We thus focus on the following research questions:
• RQ1: Does Sparse IDE produce the same results as the original IDE?
• RQ2: How does the sparsification impact the performance in terms of runtime and memory?
• RQ3: To what extent does the number of propagations correlate with the performance impact?

Experimental Setup
We implement the proposed approach in SparseHeros, by extending the open-source Heros IDE solver's latest version at the time of writing (e7e4a85) [32]. Using SparseHeros and the Soot static analysis framework [37], we implement a linear constant propagation analysis. To handle aliasing, we integrate our client analysis with the Boomerang [34] demand-driven pointer analysis, using its latest version (1179227) [7]. Heros, and thus SparseHeros, support multi-threading; yet, because Boomerang is single-threaded, our client analysis uses a single thread. Therefore, our evaluation results present single-thread performance.
As benchmark subjects we use:
• ConstantBench: A benchmark suite for constant propagation analysis targeting Java did not previously exist. We therefore created ConstantBench as a micro-benchmark suite for integer linear constant propagation analysis. We run both Heros and SparseHeros on this benchmark suite and compare the analysis results that they produce.
• Real-world Libraries: We include real-world Java libraries to investigate the performance of our approach under the

workload of large-scale and complex programs. As opposed to applications, libraries do not have a specific entry method. We follow the closed-package assumption [27] for analyzing library code, and treat public methods of the libraries as entry methods. We consider a method an entry method if it adheres to the following entry method selection criteria:
-c1: The method is a public instance method that is not abstract, native or a constructor,
-c2: The method contains an integer assignment statement.
We selected the most downloaded (>5000) Java libraries from the Maven repository [28]. We discarded the libraries that do not contain any entry methods according to the selection criteria, and the ones that caused an error in the underlying static analysis tool, Soot [37]. In the end, we retained 30 libraries.

[Figure 8: Memory consumption of Sparse IDE compared to the baseline original IDE in %, annotated with exact memory consumption in GB, using the same sorting as Figure 7.]
• Replication Package: We set up a replication package, available at https://zenodo.org/records/10461449

We performed the evaluations on an Intel i7 Quad-Core at 2.3 GHz with 32 GB memory. We configured the JVM with a 25 GB maximum heap size (-Xmx25g) and a 1 GB stack size (-Xss1g).

5.3 RQ2: How does the sparsification impact the performance in terms of runtime and memory?
Figure 7 shows the relative analysis runtime of Sparse IDE in comparison to the runtime of the baseline original IDE algorithm. We sorted the results for each library by the time spent by the original IDE algorithm. Note that we keep the same sorting for the rest of the paper. This sorting highlights the fact that our Sparse IDE approach pays off more in cases where the original IDE's runtime is relatively large. Compared to the original IDE algorithm, Sparse IDE performs up to 30.7x faster. We measure the mean speedup as 7.9x and the median speedup as 6.7x. The concrete measurements are presented in Table 3. The results show that, in terms of runtime, Sparse IDE outperforms the original IDE in each run, except for the libraries #1-#3 (jcl-over-slf4j, slf4j-api, lombok), which have the shortest analysis times. In each run, the sparse CFG construction overhead is lower than 1% of the total Sparse IDE analysis runtime, which is substantially smaller than the achieved speedups.
Figure 8 shows the relative memory consumption of Sparse IDE in comparison to the memory consumption of the original IDE algorithm. We measured up to a 94% reduction in memory consumption in the best case, and up to a 19% increase in the worst. The Sparse IDE algorithm, compared to the original IDE, associates data-flow facts with fewer statements; therefore, we anticipated memory improvements. On the other hand, because we cache sparse CFGs per symbol and procedure pair, for some input programs memory consumption increases. However, as shown in Figure 8, these cases are limited to a few outliers. Moreover, the mean and median impacts on memory consumption are a 51% and 63% reduction, respectively.
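The caching scheme described above, one sparse CFG per (symbol, procedure) pair, can be sketched as a memoizing map. This is a minimal, hypothetical illustration of the trade-off (memory spent on cached graphs versus avoided re-sparsification), not the SparseHeros implementation; all names are placeholders.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiFunction;

// Minimal sketch (hypothetical names): cache one symbol-specific sparse CFG
// per (symbol, procedure) pair, so repeated propagations of the same fact
// through the same procedure reuse the already built graph.
final class SparseCfgCache<S, P, G> {
    // Cache key: a (symbol, procedure) pair.
    private record Key<A, B>(A symbol, B procedure) {}

    private final Map<Key<S, P>, G> cache = new HashMap<>();
    private final BiFunction<S, P, G> sparsifier; // builds a symbol-specific sparse CFG

    SparseCfgCache(BiFunction<S, P, G> sparsifier) {
        this.sparsifier = sparsifier;
    }

    G getOrBuild(S symbol, P procedure) {
        // Build the sparse CFG only on the first request for this pair.
        return cache.computeIfAbsent(new Key<>(symbol, procedure),
                k -> sparsifier.apply(k.symbol(), k.procedure()));
    }

    int size() { return cache.size(); }
}
```

The cached graphs are what can increase memory consumption for some inputs: every distinct (symbol, procedure) pair adds an entry that lives for the rest of the analysis.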
We statistically assess the significance of the Sparse IDE algorithm's impact on runtime and memory improvements. According to the Wilcoxon signed-rank test [39] at the 0.05 significance level, Sparse IDE significantly improves both the runtime (p = 6.1e−08) and the memory consumption (p = 5.7e−07) of the original IDE algorithm. The essence of the Sparse IDE approach is that, compared to the original IDE algorithm, it propagates data-flow facts to fewer statements. We investigate to what extent this contributes to improving the scalability of the original IDE algorithm. Figure 9 shows how the ratio of data-flow fact propagations in IDE and Sparse IDE correlates with the ratio of runtime speedups. We observe that reducing the number of propagations is an effective approach to improving IDE's scalability in terms of runtime. Similarly, Figure 10 correlates the same with the ratio of memory consumptions in IDE and Sparse IDE. We observe a comparable trend, but not to the same degree. Given these findings, future work could investigate the potential synergies between our approach and recent approaches that improve scalability, in particular in terms of memory [1,19].

LIMITATIONS AND THREATS TO VALIDITY
By definition, Sparse IDE can solve the same data-flow problems as the original IDE framework [30]. It requires data-flow analysis problems to be expressible as distributive environment problems. Many popular static analyses, such as taint analysis for vulnerability detection [3] or typestate analysis for API misuse detection [10], are expressible as distributive environment problems. Just like other fact-specific sparsification approaches [15,17], Sparse IDE exploits analysis domain knowledge: domain-specific analysis semantics must be correctly encoded with flow and edge function definitions within the IDE framework. Sparse IDE should theoretically lead to a similar performance impact on other data-flow analysis problems where IDE is applicable. For instance, when performing a typestate analysis, Sparse IDE would safely omit the statements that have no impact on the tracked state. However, due to space constraints, we were not able to empirically show whether our evaluation results carry over to other analysis problems.
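The typestate example above can be sketched as a filter over a procedure's statements: only statements that use or redefine the tracked symbol can change its state; all others act as identity transformers and can be skipped. The following is a hedged, simplified illustration (the `Stmt` record and its use/def sets are hypothetical, not the SparseHeros representation).

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: for a typestate analysis, a statement is relevant to
// the tracked symbol only if it uses or redefines that symbol (e.g. a call
// on it, or an assignment to/from it). All other statements are identity
// transformers for that symbol and can be omitted during propagation.
final class SymbolSparsifier {
    record Stmt(String text, List<String> usedOrDefined) {}

    static List<Stmt> sparsify(List<Stmt> cfgStmts, String symbol) {
        return cfgStmts.stream()
                .filter(s -> s.usedOrDefined().contains(symbol))
                .collect(Collectors.toList());
    }
}
```

For a tracked file object `f`, only the statements touching `f` (allocation, `open`, `close`) would remain in its symbol-specific sparse CFG; unrelated integer arithmetic would be skipped.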
The reported evaluation results might depend on the selected set of Java libraries and the entry-method selection criteria. Nevertheless, for real-world library selection, we followed the systematic procedure described in Section 5.1.
To account for variations in runtime and memory measurements, we conducted three runs and presented the average across these runs.
A direct comparison to SparseDroid [15] was not possible for several reasons. It extends an existing taint analysis client (FlowDroid [3]) that has a basic integrated alias analysis, whereas our analysis client utilizes a sophisticated external demand-driven pointer analysis [34]. Moreover, SparseDroid's implementation is not publicly available, and, most importantly, IFDS may not terminate when the value domain is infinitely broad.

RELATED WORK
The IFDS [29] and IDE [30] frameworks enabled precise interprocedural data-flow analyses that are flow- and context-sensitive. Previous works have extended these frameworks with diverse goals. Naeem et al. [24] proposed four extensions to the IFDS framework to improve its scalability and precision under certain practical analysis conditions. Heros [5] introduced a Java-based generic IFDS and IDE solver. Reviser [2] proposed an algorithm to adapt IFDS and IDE to incremental program updates. CleanDroid [1] introduced a technique for reducing the memory footprint of IFDS-based data-flow analyses. DiskDroid [19] applied a disk-assisted computing approach to improve the scalability of IFDS-based taint analysis.
Sparsification has been applied to improve the scalability of static analyses. Choi et al. [6] introduced sparse data-flow evaluation graphs based on SSA (static single assignment) form. Oh et al. [26] presented an abstract-interpretation-based framework for designing generic sparse analyses, which guarantees to preserve the precision of the non-sparse analysis through data dependencies. Pinpoint [31], SVF [35] and SFS [14] utilize cheaper pre-analyses to sparsify pointer analyses. Recent on-demand sparsification approaches exploit the data-flow facts that become available during analysis runtime for further sparsification. SparseBoomerang [17] exploits the variables in alias queries during demand-driven pointer analysis to create query-specific sparse CFGs. The sparse IFDS algorithm [15] exploits data-flow facts to create fact-specific sparse CFGs and propagates each fact on its own sparse CFG. In this work, we present the more generic Sparse IDE algorithm, which efficiently solves not just IFDS-based reachability problems but also IDE problems that require value computation.

CONCLUSION AND FUTURE WORK
In this work, we presented the Sparse IDE framework as a scalable alternative to the original IDE framework. Sparse IDE is the first fact-specific sparsification approach that allows for computations on infinitely broad domains. The essence of Sparse IDE is creating symbol-specific sparse control-flow graphs on demand and propagating data-flow facts sparsely through these graphs. Sparse IDE produces equally precise results as the original IDE, while significantly improving its scalability. We also explicitly discussed the limits of sparsification for IDE: while symbol-specific sparsification is possible and useful, one cannot sparsify with respect to the (typically numeric and infinite) value domain.
In the future, we plan to apply the Sparse IDE framework to other data-flow analysis problems and investigate problem-specific requirements for building sparse CFGs.We also plan to combine Sparse IDE with other scalability-improving techniques that are orthogonal to our sparsification approach.

Figure 2 shows how the edge functions are represented. The identity environment transformer keeps the values as they are. A second transformer shows the case where a data-flow fact a is mapped to a domain value, e.g. through a constant assignment a = 3. A third shows how the value of b is calculated depending on the value of a, e.g. through a linear arithmetic operation b = 2*a + 1. IDE can only compute linear equations precisely. IFDS and IDE apply to a wide class of data-flow analysis problems. IFDS requires data-flow problems to be defined with flow functions that are distributive over the merge operator. Many reachability problems such as taint, reaching definitions, or live variables analysis fall into this category. IDE, on the other hand, also requires data-flow problems to be expressed with distributive environment transformers. IFDS suits better the problems with a binary value domain, e.g. taint analysis, where the domain simply consists of two values, tainted or not tainted [3]. It has been applied to more complex domains, e.g. for typestate analysis, where the domain contains arbitrary object states [23]. The drawback of IFDS is that it represents data-flow facts as symbol-value pairs, which blows up the data-flow fact space with the increasing size of the domain. Because of this representation, IFDS's runtime performance depends on the value domain's size. Further, it may not terminate when the value domain is infinitely broad.
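The edge functions described above (identity, constant assignment, and linear combination) can be sketched as composable functions of the form v ↦ a·v + b, which is what makes IDE's value phase tractable: linear functions are closed under composition, so whole paths summarize to a single (a, b) pair. The following is an illustrative sketch only, not the Heros `EdgeFunction` API.

```java
// Minimal sketch of IDE edge functions for linear constant propagation:
// every edge function has the form v -> a*v + b, closed under composition,
// so IDE can summarize an entire path with a single (a, b) pair.
// Illustrative only; not the Heros EdgeFunction interface.
final class LinearEdgeFunction {
    final long a, b; // represents v -> a*v + b

    LinearEdgeFunction(long a, long b) { this.a = a; this.b = b; }

    static LinearEdgeFunction identity() { return new LinearEdgeFunction(1, 0); }       // v -> v
    static LinearEdgeFunction constant(long c) { return new LinearEdgeFunction(0, c); } // v -> c

    // (this ∘ other)(v) = this(other(v)) = a*(other.a*v + other.b) + b
    LinearEdgeFunction composeWith(LinearEdgeFunction other) {
        return new LinearEdgeFunction(a * other.a, a * other.b + b);
    }

    long apply(long v) { return a * v + b; }
}
```

For the example in the figure, composing b = 2*a + 1 after the constant assignment a = 3 yields the constant function v ↦ 7, independent of the incoming value.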

Figure 3: Original and sparse propagations after applying fact-specific on-demand sparsification.

Figure 5: Modifications for the Sparse IDE algorithm for Phase I (mirrors the design from [15]).

Figure 6: Comparison of the sparsification approaches of Sparse IFDS and Sparse IDE.

A transformer t is a d-specific identity transformer, denoted t ≡ id_d, if the following holds: first, the transformer t keeps all d-specific mappings intact.

Figure 9: Ratio of data-flow fact propagations and corresponding speedup ratios, in log scale.

Table 1: Statements for Linear Constant Propagation Analysis with Corresponding IRs and Flow/Edge Functions.

Figure 7: Relative runtime of Sparse IDE compared to the baseline original IDE in %, annotated with exact runtimes in seconds, sorted by original IDE's runtime.

Table 2: Assignment cases test possible flow and edge functions, as well as flow sensitivity. Branching and Loops cases test the meet operation. Field sensitivity cases test field sensitivity and aliasing scenarios. Context sensitivity cases test various calling contexts. Array cases test array handling, and NonLinear cases test the analysis' behavior under unanticipated non-linear operations. The results validate the correctness of Sparse IDE by showing that SparseHeros produces the same outputs as the non-sparse Heros.

Table 3: Performance of Sparse IDE compared to the baseline original IDE algorithm.