Compositional Taint Analysis for Enforcing Security Policies at Scale

Automated static dataflow analysis is an effective technique for detecting security-critical issues like sensitive data leaks and vulnerability to injection attacks. Ensuring high precision and recall requires an analysis that is context-, field- and object-sensitive. However, it is challenging to attain high precision and recall while scaling to large industrial code bases. Compositional analyses, in which individual software components are analyzed separately, independently of their usage contexts, compute reusable summaries of components. This is an essential feature when deploying such analyses in CI/CD at code-review time or when scanning deployed container images: in both settings, the majority of software components stay the same between subsequent scans. However, it is not obvious how to extend such analyses to check the kind of contextual taint specifications that arise in practice while maintaining compositionality. In this work we present contextual dataflow modeling, an extension to the compositional analysis that checks complex taint specifications and significantly increases recall and precision. Furthermore, we show how such a high-fidelity analysis can scale in production using three key optimizations: (i) discarding intermediate results for previously-analyzed components, an optimization exploiting the compositional nature of our analysis; (ii) a scope-reduction analysis that narrows the scope of the taint analysis w.r.t. the taint specifications being checked; and (iii) caching of analysis models. We show a 9.85% reduction in false positive rate on a comprehensive test suite comprising the OWASP open-source benchmarks as well as internal real-world code samples. We measure the performance and scalability impact of each individual optimization using open-source JVM packages from the Maven Central repository and internal AWS service codebases.
This combination of high precision, recall, performance, and scalability has allowed us to enforce security policies at scale both internally within Amazon as well as for external customers.


INTRODUCTION
Enterprises enforce a wide range of security policies on software applications to detect potential vulnerabilities [8, 20, 22], data leaks [13, 21], information flow policy breaches [10], etc. A common route to enforcing these policies is statically analyzing code and configurations and issuing warnings to the user, either early in the software lifecycle, e.g., during code reviews [2, 6, 7], or by analyzing deployed artifacts like containerized applications [3, 9] and issuing high-severity warnings [18, 19]. Irrespective of the stage at which static analysis tools are deployed, it is essential that these tools have a low false positive rate, to minimize the effort and time required to investigate warnings, and a low false negative rate, to ensure high coverage w.r.t. the properties being checked. Typically, achieving these goals is at odds with scaling to millions of lines of code in industry-scale applications [31].
Tracking dataflow from sources to sinks can detect a large class of security vulnerabilities, i.e., it can report dataflow from APIs where user-controlled or "tainted" data enters the application to where the data reaches security-sensitive endpoints. A vast amount of research exists on scaling static taint analysis, such as demand-driven approaches that can "start anywhere" in code [47, 49], modular bottom-up analysis [35], and bi-abduction-based analysis [30]. In this paper, we describe CompTaint, a compositional taint analysis for Java code that is deployed internally within AWS and externally as part of two cloud-based services: Amazon CodeGuru Reviewer [2] and Amazon Inspector [18].
Design Choice CompTaint implements a field-, object- and context-sensitive compositional heap analysis following the approach in [35, 41]. We extend this heap analysis to a compositional taint analysis using the approach in [36]. We made the design decision to focus on a compositional analysis because it unblocks optimizations that are key to our two main use cases: code-review integration on internal code bases and in CodeGuru Reviewer [2], and container scanning as part of Amazon Inspector [18]. A compositional analysis typically involves computing a generalizable summary of a program component that can be applied in multiple contexts, e.g., computing an analysis summary of a method that can be applied to different calling contexts to obtain the context-sensitive analysis states at call sites. This is important when deploying an analysis that posts recommendations at code-review time, as most of the code stays the same between commits. Reusing the analysis results from a previous scan for unchanged components ensures a fast turnaround. Furthermore, code artifacts deployed in containers often consist of many open-source libraries that do not change between deployments. Precomputing analysis results for such libraries greatly reduces analysis time.
The benefits of such a modular analysis (avoiding repeated reanalysis of components, e.g., per usage context, and analyzing independent components in parallel) are well-established and have been discussed by previous work [40, 41]. In this paper, we present three orthogonal performance optimizations that were key to deploying this analysis in production, including discarding the intermediate state, an optimization intrinsic to the compositional nature of the analysis. Furthermore, we present an extension to the taint analysis to verify complex taint specifications that arise in practice while maintaining the compositional nature of the analysis, resulting in a significant improvement in precision.

Soundness and Precision
A lot of work in the literature explores the impact of memory abstractions [37] and design choices around context, flow, and field sensitivity [46, 48, 49] on the precision of static analyses. While precisely tracking dataflow from sources to sinks is indeed important to maintaining a low false positive rate, an equally important aspect that has received significantly less attention is the problem of precisely identifying sources and sinks in source code. With the notable exception of CodeQL [6], many taint specifications in the literature simply list a set of APIs of interest [17, 43] that are marked as sinks or sources. However, in practice the specific behavior of these APIs that determines whether they are sinks, sources or sanitizers depends on the context in which they are called. For example, the Java Cipher class will either perform encryption (behave as a sanitizer) or decryption (behave as a source) depending on how it was initialized.
We found that in addition to precisely tracking dataflow, the precision of the analysis significantly depends on the accuracy of identifying program locations matching such contextual taint specifications. The context could be constant values passed to certain APIs, as in the Cipher example, sequences of API calls that define sources or sinks, and so on. To address this concern while maintaining compositionality, we developed a novel speculative context resolution technique integrated into the compositional taint analysis. This technique reduced the false positive rate of CompTaint by 9% on average on a large corpus of real-world examples, as motivated in § 2.1.
Steps before Production To ascertain that CompTaint is production-ready, we evaluated it on the OWASP benchmark [14] with ground truth, conducted shadow reviews on datasets without ground truth, and iterated on adding analysis features to address recall and precision. Once the analysis achieved the best-in-class OWASP score among competing tools and a stipulated high acceptance rate in its internal deployment (< 20% false positives for all its information flow rules), we focused on scaling the analysis to larger analysis targets, followed by a large number of analysis targets. A target is any analyzable artifact. For example, a target could be a JAR file from the build artifacts of a code repository, or even a collection of JAR files including the runtime dependency closure of a set of code repositories. Before deploying in production, we evaluated the analysis on datasets representative of two deployment scenarios: (a) a small number of code artifacts as a target, representing the deployment in CI/CD on code reviews alongside other cloud-based SAST tools that run in AWS [1, 34, 42]; (b) large dependency closures of a code artifact, typically containing hundreds of code artifacts, representing analysis of containerized applications running in the cloud [18, 19]. Out of the box, the analysis did not scale to the single-target deployment scenario above.
Contributions In this paper, we first describe CompTaint's compositional taint analysis, emphasizing key features that allowed it to meet the recall and precision bar inside AWS. Specifically:
• We developed an encoding on top of our abstraction of the heap to perform a compositional taint analysis, including a novel speculative context resolution technique to identify contexts around sources, sinks, and sanitizers, which significantly increased the precision of our analysis while retaining its compositional formulation.
Next, we describe a set of optimizations that made it possible to scale CompTaint to large industry-scale applications in its production deployment:
• Discarding intermediate analysis state: we leverage the compositionality of our analysis to discard a large fraction of the abstract state for analysis components that are already summarized. We measure its effect and show how it favorably impacts CompTaint.
• Analysis scope reduction: we design a lightweight scope-reduction analysis that prunes entry points into the program that, if analyzed, could not produce a security vulnerability given a set of input taint specifications. This optimization elides the analysis of a sizeable amount of code, significantly reducing analysis complexity without compromising soundness.
• Caching invocation models: we implement caching of the applicable taint specifications matching invocation sites of taint-relevant APIs. We show that this substantially reduces the time for both the scope-reduction analysis and the taint analysis.
Evaluation In this paper, we evaluate CompTaint on 20 artifacts from Maven Central [12] and code artifacts from 4 external AWS services. In order to present results comparable to CompTaint's deployment in Amazon Inspector, where it scans large containerized applications, we create analysis targets by generating code artifacts of the dependency closures of 500 Maven Central repositories and use a sampling methodology to select the closures. Likewise, to measure CompTaint's performance on scans of industry-scale cloud services, we analyze the dependency closures of AWS services starting from a few known root repositories. We measure the effect of each optimization on the above datasets and describe how these techniques underpinning our analysis turned out to be critical in production. In order to evaluate the efficacy of speculative context resolution, we evaluate CompTaint on a dataset of injection vulnerabilities. Further, to establish that the baseline analysis with context resolution, before the performance optimizations, has state-of-the-art recall and precision, we evaluate CompTaint on a dataset that includes OWASP [14], an industry standard for benchmarking security properties, among other real-world code examples.
Deployment CompTaint is deployed internally at Amazon, integrated with the code-review system. CompTaint runs an ensemble of checks and automatically posts recommendations on code reviews based on its findings. Developers have the option of marking recommendations as useful or not useful. Based on this developer feedback, CompTaint has an average acceptance rate of > 80%. CompTaint is also deployed externally as part of an AWS service called Amazon Inspector [4, 18] and Amazon CodeGuru Reviewer [2]. CompTaint powers Amazon Inspector to execute high-fidelity scans of containerized AWS Lambda [5, 18] functions.

MOTIVATION
In this section, we present several motivating examples to illustrate the kind of complex contextual taint specifications that arise in practice. Additionally, we motivate the need for additional performance optimizations by showing empirical results from running the baseline analysis on benchmarks from Maven Central [12] and code artifacts from four external AWS services using the methodology described in § 6.
Traditionally, taint tracking tools [17, 43] specify sources, sinks and sanitizers at the API level by matching against a given method signature. However, in practice, whether a given API acts as a source, sink or sanitizer often depends on the context. Consider Code 1: whether the Cipher.doFinal method performs encryption, and so acts as a sanitizer for sensitive data, depends on whether the Cipher class was initialized with the Cipher.ENCRYPT_MODE option. It is not safe to assume that any call to Cipher.doFinal performs encryption. As another example, consider checking whether Code 2 is vulnerable to cross-site scripting: we want to ensure that attacker-controlled data does not reach the HttpServletResponse.getWriter().write method. Note that the HttpServletResponse.getWriter() method returns a PrintWriter. Simply matching on the PrintWriter.write method signature results in many spurious findings.
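The Cipher pattern can be sketched as follows. This is a hypothetical illustration (not the paper's Code 1; the class and method names here are ours): the same API call, Cipher.doFinal, acts as a sanitizer when the cipher was initialized with ENCRYPT_MODE and as a source of sensitive plaintext when initialized with DECRYPT_MODE, so an API-level specification on doFinal alone cannot tell the two apart.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

public class CipherContext {

    public static SecretKey newKey() {
        try {
            return KeyGenerator.getInstance("AES").generateKey();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // The taint-relevant behavior of doFinal below depends entirely on
    // the mode passed to init, not on the method signature being called.
    public static byte[] process(SecretKey key, int mode, byte[] data) {
        try {
            Cipher cipher = Cipher.getInstance("AES");
            cipher.init(mode, key);
            return cipher.doFinal(data);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        SecretKey key = newKey();
        byte[] secret = "card-number".getBytes(StandardCharsets.UTF_8);
        byte[] ct = process(key, Cipher.ENCRYPT_MODE, secret); // sanitizer
        byte[] pt = process(key, Cipher.DECRYPT_MODE, ct);     // source
        if (!Arrays.equals(pt, secret)) throw new AssertionError();
    }
}
```

A contextual specification must therefore track the constant flowing into init to decide which role doFinal plays at each call site.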
Finally, Code 3 (an XML external entity sink) shows a more complex example of object deserialization using an XStream [23] instance, typically used to serialize and deserialize objects in XML and JSON formats. The simplicity of using XStream comes at the cost of exploitability: it has been exploited by researchers and adversaries to inflict remote command execution and denial-of-service attacks [24]. In Code 3, a new URL is created from untrusted external input on line 6, and an InputStream created from the URL is later deserialized using XStream on line 7. The XStream library now provides methods to allow-list trusted types using allowTypes. Line 14 shows this potential mitigation to the vulnerability by calling safeConfigure before readUrl on line 3. Observe that tainted data (variable str) still flows into the sink on line 11. The context that the analysis must capture is associated with variable xs and not with the tainted variable str, and the sink fromXML() is neutralized due to the safe state of the xs object. This example also illustrates that precisely tracking the context requires an inter-procedural analysis.

Contextual Taint Speci cations
Not taking the context into account and matching the taint specification at the API level results in a precision loss of 9.85% on average, and as much as 50% on certain vulnerability categories (§ 6.1: Table 1 shows detailed results).
Compositionality in Contextual Dataflow In order to accurately identify the context around matching taint specifications on program values that are not tainted but are relevant to context, the analysis could use other sub-analyses to identify the sources, sinks, and sanitizers precisely, either a priori or synchronously with the main analysis. These sub-analyses could be lightweight, local analyses that identify these contexts imprecisely, or demand-driven, heavyweight, precise interprocedural analyses such as constant propagation, typestate, and complex value-flow analysis. We overview the compositional taint analysis that powers CompTaint in § 3. Our design resolves these contexts in the same pass, integrated with the compositional taint analysis. First, in the use cases we encountered, the context spanned large depths of inter-procedural dataflow, ruling out local lightweight analyses. Second, on-demand analysis does not scale beyond a bounded depth of inter-procedural flow in practice due to an exponential number of recursive queries [34], and it is challenging to reuse partial analysis results because new contexts render the partial summaries invalid [28]. Third, we wanted to keep the compositional formulation so that the analysis remains compatible with our goal of leveraging reusable summaries of already-analyzed program components. Fourth, as § 3 will detail, the abstraction of the heap computed for every strongly connected component (SCC) in the program is expensive, and recomputing it for the flow of different kinds of information through the heap (constants, API calls for context, as well as other metadata for the taint analysis) is prohibitive, ruling out the possibility of using multiple sequential analyses.

Practical Challenges: Speed and Scale
Before we describe the techniques in § 3 underpinning the high precision in contextual dataflows (results in § 6.1), we note that this appealing result initially came with practical setbacks. We performed offline experimentation on initial versions of our evaluation datasets (see § 6.2.1) before deploying CompTaint in production. We ran the baseline analysis, without the performance optimizations in § 4, on the dependency closures of repositories from Maven Central [12] and dependency closures of target repositories of 4 AWS services. We observed that 19.5% of the 82 Maven dependency closures timed out with a time limit of 1 hr; the logs showed that on average 57.5% of the whole program was not analyzed on the Maven closures, computed based on the number of SCCs processed. This raises the following question: Is there a path to efficiency without sacrificing accuracy? To answer the question, we started investigating three main areas: (a) Given a set of taint specifications, does the analysis need to analyze all components in the global call graph built from all the target code artifacts to retain soundness and precision? (b) Are there performance bottlenecks in the analysis? (c) Are there any symptoms of memory bottlenecks, and if so, can we address the issues by leveraging the compositional analysis design?

COMPOSITIONAL ANALYSIS
In this section we give a high-level overview of CompTaint's compositional analysis algorithm. To handle heap aliasing compositionally, we use the approach described in [41] to compute context-independent summaries that are agnostic to the input heap. To achieve compositional taint tracking, we extend the compositional heap summaries of [41] to taint summaries by taking the approach presented in [36]: heap effects in summaries are extended with taint effects (§ 3.2). Taint effects capture how the taint specification applies to code, e.g., whether a heap location contains tainted data coming from a source, or whether it flows into a sink.
At a high level, CompTaint considers each method in the program as a component, i.e., the unit of composition. For each method, CompTaint computes its effects using the effects computed for the methods it calls. § 3.1 describes the definition of a component in the presence of recursion. Effects, which we describe in § 3.2, capture dataflow-relevant behavior, including heap accesses, and taint sources and sinks, among other analysis state. CompTaint computes methods' effects in dependency order, i.e., callees before callers. The dependency order is determined from the call graph, which we describe in § 3.1. CompTaint computes the effects of each method by iterating over the effects of its statements. Since the call graph may be cyclic, and individual methods can contain loops, CompTaint computes the limits of these iteration sequences to ensure methods' effects capture all possible behaviors. We guarantee termination by ensuring these limits have fixed points by applying abstractions to approximate effects [32].
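The iteration-to-a-limit described above can be sketched generically. This is a minimal sketch under our own naming, not CompTaint's API: the transfer function is applied until the abstract state stops changing; in the real analysis, abstraction (e.g., bounding the state) is what keeps the chain finite.

```java
import java.util.*;
import java.util.function.UnaryOperator;

public class FixedPoint {
    // Repeatedly applies the transfer function until the abstract state
    // stabilizes. Termination relies on the state domain being finite
    // (or abstracted to be finite), as the surrounding text describes.
    public static <T> T compute(T init, UnaryOperator<T> transfer) {
        T prev = init;
        T next = transfer.apply(init);
        while (!next.equals(prev)) {
            prev = next;
            next = transfer.apply(next);
        }
        return next;
    }
}
```

For example, closing a set of facts under a monotone step function reaches its limit in finitely many iterations.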

Component Dependency Order
To determine the dependency order between program components, CompTaint first computes a whole-program call graph. Technically, the call graph provides a mapping from program statements that might invoke some method, i.e., call sites, to methods that are potentially invoked, i.e., call targets. We obtain a dependency graph among methods by identifying call sites with their enclosing methods. However, this dependency graph may be cyclic, due to either recursion or call-graph imprecision. To obtain the desired dependency order among components, we compute the strongly connected components (SCCs) of the method dependency graph. CompTaint then considers each SCC as one single component, and computes components' effects in SCC dependency order.
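The condensation step above can be sketched with Tarjan's SCC algorithm, which conveniently emits SCCs in exactly the order needed here: every SCC appears after all SCCs it calls into. This is a minimal sketch with our own names, not CompTaint's implementation.

```java
import java.util.*;

public class SccOrder {
    private final Map<String, List<String>> calls; // caller -> callees
    private final Map<String, Integer> index = new HashMap<>();
    private final Map<String, Integer> low = new HashMap<>();
    private final Deque<String> stack = new ArrayDeque<>();
    private final Set<String> onStack = new HashSet<>();
    private final List<List<String>> sccs = new ArrayList<>();
    private int counter = 0;

    public SccOrder(Map<String, List<String>> calls) { this.calls = calls; }

    // Returns SCCs callee-first: each component is listed only after
    // every component it depends on, i.e., the summarization order.
    public List<List<String>> bottomUpOrder() {
        for (String m : calls.keySet())
            if (!index.containsKey(m)) strongConnect(m);
        return sccs;
    }

    private void strongConnect(String m) {
        index.put(m, counter); low.put(m, counter); counter++;
        stack.push(m); onStack.add(m);
        for (String callee : calls.getOrDefault(m, List.of())) {
            if (!index.containsKey(callee)) {
                strongConnect(callee);
                low.put(m, Math.min(low.get(m), low.get(callee)));
            } else if (onStack.contains(callee)) {
                low.put(m, Math.min(low.get(m), index.get(callee)));
            }
        }
        if (low.get(m).equals(index.get(m))) { // m is the root of an SCC
            List<String> scc = new ArrayList<>();
            String w;
            do { w = stack.pop(); onStack.remove(w); scc.add(w); } while (!w.equals(m));
            sccs.add(scc);
        }
    }
}
```

A mutually recursive pair of methods thus forms one component, summarized once, after its callees and before its callers.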
To achieve an adequate balance of precision and scalability, CompTaint computes call graphs via the variable type analysis (VTA) algorithm [50] implemented by SPARK [39]. This algorithm utilizes an inexpensive yet whole-program context-, flow- and object-insensitive "pointer analysis" using a data structure called the type-propagation graph or pointer assignment graph (PAG). Graph nodes represent program variables, and edges represent assignments. Program types are seeded to their corresponding graph nodes at allocation sites, and propagated across graph edges to the nodes corresponding to call-site receivers. We obtain the resulting call graph by collecting methods' implementations for the types propagated to each call site as potential targets. We achieve this in linear time by computing and propagating types over the strongly connected components of the PAG.

Compositional E ects
CompTaint computes effects capturing the behaviors relevant to dataflow analysis. These effects include whether a given program value originated from a dataflow source, reached a dataflow sink, or was processed by a dataflow sanitizer. Since these effects are semantic properties relative to the policy being enforced, their specifications are provided as input rather than hard-coded into the analysis. CompTaint consumes such specifications as models that apply to program statements. For example, models can specify that sink effects are applied to the input arguments of SLF4J logging API calls, or that a sanitizer effect is applied to the return value of an application-specific sanitizer method.
Because CompTaint analyzes each component in isolation, we must capture these effects, e.g., of a sink, without knowing whether the given value originated from a source. CompTaint represents such compositional effects symbolically with respect to method parameters. Simple effects like source, sink, and sanitizer amount to unary predicates on symbolic parameters, as well as local and global variables. Input models induce such effects; for example, in Code 3 at line 6 a source model for URL.openStream() would apply a source effect on its return value. Flow-through effects capture binary dataflow relations among symbolic parameters, e.g., flow from a method parameter to its return value. For example, in Code 3 at line 10, a flow model for InputStream.readAllBytes() would apply a flow-through effect from its receiver object to its return value. When methods' effects are composed together, i.e., at call sites, the resolution of symbolic effects can trigger combinational logic. For example, when a sink effect of a method parameter is resolved to a call-site argument with a source effect, a source-to-sink flow can be detected; if the first parameter had a sanitize effect instead, the source effect could be removed.
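The call-site composition described above can be sketched as follows. This is a toy formulation under our own names, not CompTaint's representation: a callee summary states effects on parameter indices, and composing it with the caller's taint state for the actual arguments either reports a source-to-sink flow, clears taint (sanitize), or introduces new taint (source).

```java
import java.util.*;

public class EffectComposition {
    public enum Effect { SOURCE, SINK, SANITIZE }

    // summary: parameter index -> effects the callee applies to it.
    // tainted: the caller's set of currently tainted values (mutated).
    public static List<String> compose(Map<Integer, Set<Effect>> summary,
                                       List<String> args,
                                       Set<String> tainted) {
        List<String> findings = new ArrayList<>();
        for (Map.Entry<Integer, Set<Effect>> e : summary.entrySet()) {
            String arg = args.get(e.getKey());
            // A sanitize effect removes taint before the sink is checked.
            if (e.getValue().contains(Effect.SANITIZE)) tainted.remove(arg);
            // A sink effect resolved against a tainted argument is a finding.
            if (e.getValue().contains(Effect.SINK) && tainted.contains(arg))
                findings.add("source-to-sink flow via " + arg);
            // A source effect taints the argument in the caller's state.
            if (e.getValue().contains(Effect.SOURCE)) tainted.add(arg);
        }
        return findings;
    }
}
```

The real analysis works over object graphs and symbolic heap locations rather than strings, but the resolution logic at call sites follows this shape.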
To achieve an adequate level of precision, effects are context-, flow-, field- and object-sensitive. The aforementioned symbolic representation provides context sensitivity, since symbolic values are resolved according to call-site context. We achieve flow sensitivity by computing effects sequentially over program statements and composing effects at call sites in call-graph dependency order. To achieve field and object sensitivity, CompTaint follows the modular heap analysis framework of Madhavan et al. [41] and Feng et al. [35], representing effects over object graphs: nodes correspond to objects reachable from parameters, local, and global variables, and edges capture field accesses among objects. In this way, aliasing among accesses is captured by multiple incoming edges to a given node. This representation provides field sensitivity, since distinct fields of any given object may be incident on distinct nodes in the graph, and object sensitivity, since distinct objects in the graph may share the same type.

Speculative Context Resolution
Next, we discuss the challenge with contextual taint specifications. The fundamental problem stems from two distinct flows: one for the taint, and another for the dataflow determining the context around the taint. Referring back to the code example in Code 3, the XStream.fromXML() method applies a sink effect on str only in program contexts where XStream.allowTypes() has not been called earlier on its receiver xs. Note that when this context dependency is actually resolved, for example at line 14 inside the safeConfigure method, the tainted value str is not available, and thus we cannot simply apply a sanitize effect on it. Instead, the validity of the sink effect on str at line 11 depends on the state of xs. If safeConfigure were called just before line 11, then this context could be immediately resolved for any context where deserialize is called, and we could elide the sink effect on str. In general, however, this context may be resolved interprocedurally, for example by calling safeConfigure before the call to readUrl at line 3, when neither the source effect at line 6 nor the sink effect at line 11 have yet manifested. As such, when analyzing the deserialize method in isolation, the validity of the sink effect at line 11 cannot be resolved, since it may indeed be called in a context where safeConfigure was never called.
In order to capture such inter-procedural contextual dataflows in our compositional analysis design, we introduce speculative effects: an effect that is only valid when additional context predicates are also satisfied.

- target:
    class: XStream
    method: allowTypes
  source: !this        # taints the receiver object
  kind: SAFE_CONFIG    # to capture context
- target:
    class: XStream
    method: fromXML
  sink: !allArgs       # sinks all arguments
  kind: XML_READ
  context: { on: !this, if: { has: NONE, kind: SAFE_CONFIG } }

Code 4: CompTaint specifications for handling contextual dataflows in Code 3.

A context predicate evaluates a logical combination of primitive predicates on a symbolic method parameter. CompTaint supports two types of predicates, which check set membership of the kind(s) of taint or of the values of program constants among a specified set of values.
The general support for contextual dataflows in CompTaint necessitated careful handling of speculative effects to handle multiple context predicates, their partial resolution in method summaries, and their interactions with regular or speculative sanitize effects. We elide these details here, but such intricate handling was needed to precisely resolve contextual dataflows in observed real-world code patterns.
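The context predicate of Code 4 can be sketched as follows. This is a toy formulation under our own names, not CompTaint's API: the allowTypes model marks its receiver with the SAFE_CONFIG kind, and the speculative sink effect of fromXML on its arguments is valid only if the receiver carries no such mark.

```java
import java.util.*;

public class SpeculativeSink {
    // receiver object -> taint kinds recorded on it
    private final Map<String, Set<String>> kindsOf = new HashMap<>();

    // Model for XStream.allowTypes: taints the receiver with SAFE_CONFIG.
    public void applyAllowTypesModel(String receiver) {
        kindsOf.computeIfAbsent(receiver, r -> new HashSet<>()).add("SAFE_CONFIG");
    }

    // Evaluates context: { on: !this, if: { has: NONE, kind: SAFE_CONFIG } }
    // for the fromXML sink: the sink is valid only if the receiver has
    // no SAFE_CONFIG kind.
    public boolean fromXmlSinkIsValid(String receiver) {
        return !kindsOf.getOrDefault(receiver, Set.of()).contains("SAFE_CONFIG");
    }
}
```

In the actual analysis, the predicate may remain unresolved in a method summary (the effect stays speculative) and is discharged only once the receiver's state is known at a calling context.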

OPTIMIZATIONS
This section describes the three optimizations that had a significant impact on CompTaint's deployment.

Discarding Intermediate E ects
Recall that CompTaint implements a compositional analysis that computes individual method summaries and analyzes each SCC in the method dependency graph to a fixed point. This means that we can reduce the peak memory usage by discarding intermediate per-statement effects for previously-analyzed components, loading program components dynamically as they are analyzed, and unloading previously-analyzed program components. CompTaint currently exploits the first, but not the latter two opportunities. Note that within a component, CompTaint must keep the effects for each program statement in order to compute the fixed points of the effect iteration sequence limits. Once the fixed points have been computed for a given component, only the method-level effects need be retained, i.e., to apply at call sites; per-statement effects are deallocated. Note that a traditional whole-program analysis would need to keep the state at all program locations in order to reach a fixed point, so this optimization leverages the compositional nature of the analysis.
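The lifecycle above can be sketched as follows. This is a minimal sketch with our own names, not CompTaint's data structures: per-statement effects are needed only while iterating an SCC to its fixed point; once the component is finalized, only the method-level summaries survive and the per-statement map becomes garbage.

```java
import java.util.*;

public class ComponentState {
    // Retained after finalization: summaries applied at callers' call sites.
    public final Map<String, String> methodSummaries = new HashMap<>();
    // Intermediate state, needed only during fixed-point iteration.
    private Map<String, List<String>> perStatementEffects = new HashMap<>();

    public void recordStatementEffect(String stmt, String effect) {
        perStatementEffects.computeIfAbsent(stmt, s -> new ArrayList<>()).add(effect);
    }

    // Called once the SCC's fixed point is reached: fold the statement
    // effects into a method summary, then drop the intermediate state.
    public void finalizeComponent(String method) {
        methodSummaries.put(method, "summary over " + perStatementEffects.size() + " statements");
        perStatementEffects = null; // eligible for garbage collection
    }

    public boolean intermediateDiscarded() { return perStatementEffects == null; }
}
```

A whole-program analysis could not do this, since any program point might still change before the global fixed point is reached.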

Analysis Scope Reduction
Given a set of input specifications and a call graph built globally over all the targets for an instance of the analysis, the goal is to determine parts of the program on which the heavyweight heap-effect analysis can be elided without loss in soundness or precision. The analysis that determines what can be elided must be lightweight.
Soundness Versus Cost At a very high level, one might start from an insight as follows: a subgraph G′ of the whole-program call graph G is relevant for the analysis if dataflow from a source of tainted data to a sink occurs in G′. A simple over-approximation of this idea is that if a source and a sink are not reachable in a subgraph G′ rooted at vertex v over outgoing edge e, then G′ could be elided from the analysis, assuming G′ is reachable from the roots of G only via e. However, it is straightforward to come up with a counterexample to this argument.

public void entry() {
    valB = foo(valA);    // aliases valA and valB
    bar(valA, valB);     // taints valA and sinks valB
}

In the example above, an invocation of foo is followed by an invocation of bar in method entry. While G′, the program reachable from foo, does not taint or sink the data flowing into foo, it creates an aliasing relationship between valA and valB. The subgraph rooted at bar then taints valA and sinks valB, creating an insecure dataflow.
Clearly, eliding the analysis of G′ would be unsound; however, precisely checking for aliasing would require an analysis as expensive as the full-blown taint analysis.

Eliding Safe Call Graph Roots
To avoid analyzing a subgraph, it is not sufficient to conclude that the subgraph is devoid of program locations with matching sources or sinks; we must also ascertain that the subgraph does not induce aliasing relations that are then used in the same subgraph or another subgraph of the call graph. Fundamentally, to elide the analysis of a subgraph G′ rooted at v, the analysis needs to consider the sources and sinks reachable from v and the aliasing created in G′. As a sound over-approximation, we can elide roots of the call graph from analysis (inferred as safe roots) if no matching sources and sinks are reachable from them, i.e., even if the subgraphs reachable from those roots create aliasing. For example, in the example above, if no source or sink were reachable from the subgraph rooted at the call to bar, the root entry would be safe; hence the entire program reachable from entry could be elided from taint analysis.
Adding Precision to Root Elision Given a set of taint specifications, we derive a set of taint rules R. The single-element effects described in § 3, such as source and sink, belong to a hierarchy of types called kinds. A rule in R specifies the kinds of sources and sinks that together constitute a vulnerability. For example, a rule to detect the XXE vulnerability [16] in Code 3 is specified by source type UNTRUSTED_DATA_NETWORK and sink type XML_READ. A root of the call graph is only relevant for analysis if it has reachable source and sink types associated by some rule. The scope-reduction analysis computes all source types and sink types reachable from the roots of the call graph (the entrypoints) and discards the entrypoints that lack any reachable (source, sink) pair corresponding to a rule in R. The scope-reduction analysis is lightweight and discards entrypoints that are guaranteed to be safe. The call graph is reused across the scope-reduction analysis and the taint analysis. The scope reduction only requires matching taint specifications (sources and sinks) and propagates the matching source and sink types up to the entrypoints, bottom-up in the SCC graph. Note that its analysis domain does not need any notion of access paths or variables. CompTaint uses the results of the scope-reduction analysis to recompute an SCC graph using only the potentially unsafe entrypoints that are relevant for the analysis; the heavyweight taint analysis that follows uses the reduced SCC graph. In § 6, we discuss the impact of this optimization.
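The root-elision check can be sketched as follows. This is a minimal sketch with our own names, not CompTaint's implementation: it collects the source and sink kinds transitively reachable from each entrypoint and keeps an entrypoint only if some rule's (source kind, sink kind) pair is fully reachable from it. The deployed analysis propagates kinds bottom-up over the SCC graph instead of doing a reachability pass per root, but the result is the same for illustration.

```java
import java.util.*;

public class ScopeReduction {
    // rules: each rule is a (sourceKind, sinkKind) pair.
    public static Set<String> relevantRoots(Map<String, List<String>> calls,
                                            Map<String, Set<String>> sourceKinds,
                                            Map<String, Set<String>> sinkKinds,
                                            Set<String> roots,
                                            Set<List<String>> rules) {
        Set<String> keep = new HashSet<>();
        for (String root : roots) {
            // Collect all source/sink kinds reachable from this root.
            Set<String> src = new HashSet<>(), snk = new HashSet<>();
            Deque<String> work = new ArrayDeque<>(List.of(root));
            Set<String> seen = new HashSet<>();
            while (!work.isEmpty()) {
                String m = work.pop();
                if (!seen.add(m)) continue;
                src.addAll(sourceKinds.getOrDefault(m, Set.of()));
                snk.addAll(sinkKinds.getOrDefault(m, Set.of()));
                work.addAll(calls.getOrDefault(m, List.of()));
            }
            // Keep the root only if some rule's pair is fully reachable.
            for (List<String> rule : rules)
                if (src.contains(rule.get(0)) && snk.contains(rule.get(1)))
                    keep.add(root);
        }
        return keep;
    }
}
```

Roots not in the returned set are the safe roots; everything reachable only from them is elided from the heavyweight taint analysis.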

Caching Invocation Models
CompTaint provides a library of source, sink, and sanitizer specifications that are applied to the program under analysis. Additionally, to model the flow of tainted data in libraries, CompTaint supports flow models that apply flow-through effects. These models are applied at invocation sites of different methods, i.e., API calls in the program, and are referred to as invocation models. Any instantiation of CompTaint must match every model in the specification library against every call site. CompTaint, as described in § 3, executes a fixed-point iteration on every SCC. If CompTaint executes n iterations over the entire program, and m models are matched at each of c invocation sites, the time complexity of model matching is O(n × m × c).
To avoid repeating a linear scan of all the models every time a call site is analyzed in an iteration, CompTaint creates an index from each call target to the models that match the target method(s) at a call site. Once cached, the cost of model matching at a call site is roughly a constant-time lookup of the cached models that apply to the targets at that call site. Asymptotically, the time complexity is then dominated by n and c, i.e., O(n × c). This is significant since a tool like CompTaint has a perpetually growing list of models, owing to its vast number of customers and the common libraries and SDKs used by its different customers.
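A minimal sketch of this indexing idea, under illustrative names and a toy model representation (not CompTaint's actual data structures): the linear scan runs once per distinct call target, and all subsequent fixed-point iterations hit the cache.

```python
from functools import lru_cache

# Toy model library: (target method, effect, taint kind).
MODELS = [
    ("java.net.URL.openStream", "source", "UNTRUSTED_DATA_NETWORK"),
    ("javax.xml.parsers.DocumentBuilder.parse", "sink", "XML_READ"),
    ("org.apache.commons.text.StringEscapeUtils.escapeHtml4", "sanitizer", "HTML"),
]

@lru_cache(maxsize=None)
def models_for(target):
    """Linear scan over all m models, performed once per distinct call
    target; later lookups are constant-time cache hits."""
    return tuple(m for m in MODELS if m[0] == target)

# During fixed-point iteration, re-analyzing the same call site is cheap:
for _ in range(3):  # simulated analysis iterations
    hits = models_for("java.net.URL.openStream")
print(hits[0][1])  # source
```

With the cache in place, the m factor drops out of the per-visit cost, matching the O(n × c) bound discussed above.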
The caching described here is an over-approximation and ignores the context resolution described in § 3.3. The scope-reduction analysis and the taint analysis benefit equally from caching invocation models. Recall that the scope-reduction analysis only needs caching of source and sink models, unlike the taint analysis, which additionally caches sanitizer and flow models. § 6 discusses the impact of this optimization on CompTaint's performance.

IMPLEMENTATION
CompTaint is implemented as a modular static data-flow analysis framework for Java. At the heart of this framework lies an abstract reachability algorithm module that traverses abstract program statements and control-flow edges to compute a fixed point. This module can plug in the underlying program representation; currently we support the Soot Jimple representation [51] for Java bytecode analysis, and the MU Graph representation [25] for Java and Python source code analysis. Before the analysis, we compute the entrypoints for the analysis. Entrypoints can be annotated explicitly. In addition, we generate a synthetic entrypoint for a subject that captures invocations to all public methods in a non-deterministic order. We then build the whole-program call graph using the variable-type analysis (VTA) [50] implementation from the Soot Pointer Analysis Research Kit [39] to determine the component dependency order as described in § 3.1. Client analyses extend the reachability analysis by providing implementations for their analysis effects, states, and state transformers. CompTaint implements an alias analysis by modeling heap locations as nodes in a graph, with program statements carrying alias effects for assignments, reads, and writes inducing edges among them. CompTaint then extends this with taint attributes for heap locations and effects for source, sink, sanitize, and flow of taint attributes. The aliasing and taint effects are computed and summarized simultaneously for each program component. Throughout the analysis, various relations from program locations to effects on attributes of heap locations are asynchronously written to a tracing database on disk. When CompTaint detects a finding, it uses the tracing database to reconstruct a trace on demand. In addition to the optimizations discussed in § 4, CompTaint provides a number of options for configuring the scope of the analysis, e.g., abstract-state size limits for SCC components, making it tractable within the various SLAs of its deployment use cases. To ensure we can handle very large inputs where termination may not be feasible, CompTaint has the ability to report partial findings. A trace reconstruction thread runs in parallel and queries the tracing database to report detailed traces as findings are discovered. Note that this works even when we reach the analysis state-size budget on an SCC component: due to the compositional nature of the analysis, we can simply compute an empty summary for the offending component and continue the rest of the analysis.
For security policy enforcement, CompTaint provides an extensible YAML-based language to specify rules and models. Rules map interactions of source and sink kinds to known vulnerabilities, and models specify which API methods induce taint effects of those kinds. CompTaint checks 17 information-flow policy rules to prevent data leaks and the top OWASP injection vulnerabilities [15]. It has an extensive library of models for the JDK, Javax, Apache Commons, Guava, and popular Java libraries for logging, authentication, serialization, DOM parsing, database connectivity, and web-app frameworks. Additionally, it uses models for the AWS SDK and service APIs for scanning Amazon-internal codebases.
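To make the rule/model split concrete, here is a hypothetical specification in the spirit of the YAML language described above. The schema, field names, and method signatures are illustrative assumptions, not CompTaint's actual format:

```yaml
# Hypothetical XXE rule: pairs a source kind with a sink kind.
rule:
  id: XXE
  source_kind: UNTRUSTED_DATA_NETWORK
  sink_kind: XML_READ

# Hypothetical invocation models inducing taint effects of those kinds.
models:
  - method: "java.net.URLConnection.getInputStream()"
    effect: source
    kind: UNTRUSTED_DATA_NETWORK
  - method: "javax.xml.parsers.DocumentBuilder.parse(java.io.InputStream)"
    effect: sink
    kind: XML_READ
```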
Limitations: CompTaint's current implementation is robust for Java bytecode analysis and is thoroughly tested for versions 8 and 11 of the JDK. It does not analyze native code or code that uses reflection. It does not currently support runtime dependency-injection frameworks. When analyzing concurrent programs, it considers their single-threaded execution, so it does not guarantee detection of data-flows via shared-memory interference or inter-process communication.

EVALUATION
In this section we present experimental results showing the precision and scalability of CompTaint. For evaluating precision we use a labeled dataset consisting of the OWASP Benchmark [14] as well as an internal test suite. For performance evaluation, we use a set of open-source Maven Java projects, as well as a set of internal Amazon Java code bases. In both cases, we analyze not only the application packages but also the packages in the runtime dependency closure. Note that we do not evaluate precision and recall on this larger dataset because we do not have ground-truth labels for it.

Precision Impact of Contextual Dataflow
To evaluate the precision impact of contextual dataflow models, we use a comprehensive labeled dataset of injection vulnerabilities from the OWASP benchmark [14], an industry standard for evaluating the accuracy and coverage of automated software vulnerability detection tools. Due to the synthetic nature of these benchmarks, we further complement them by adding 120 real-world code examples based on false positives reported by Amazon developers on recommendations reported by different SAST tools on Amazon's internal code reviews; these include five additional injection categories not covered by the OWASP tests (shaded bottom five rows in Table 1). Table 1 summarizes the false positive rate (FPR) when running CompTaint on both datasets. On the OWASP benchmarks [14] alone, CompTaint achieves 100% recall and a 13.23% false positive rate on the six applicable categories. The table reports FPR with the Baseline and with CompTaint's contextual dataflow modeling, i.e., modeling the validity of sources, sinks, and sanitizers based on inter-procedural context, similar to Code 3. Across all injection attack categories, the absence of contextual modeling causes a precision loss of 9.85% on average, most notably a loss of 50% on code-injection vulnerabilities.
To evaluate precision on real-world code, we use our internal deployment of CompTaint at code-review time. CompTaint posts findings as comments on code reviews, and Amazon developers can mark recommendations as useful or not useful. Contextual dataflow modeling lowers the false positive rate, computed based on this developer feedback, to less than 20% on internal code.
Notably, this significant improvement in precision is achieved with modest effort in writing and maintaining taint specifications. CompTaint uses a library of 1534 taint specifications, and only 42 (2.7%) of these require additional contextual modeling.
Performance Impact of Optimizations

Experimental Setup. We provide the methodology for building dependency closures from Maven and internal service repositories.
Maven Analysis Targets
We used the libraries.io dataset [11], which has precomputed dependencies between libraries, and retained Java projects from Maven with Apache, MIT, or BSD-like licenses. This yielded 44,757 projects, not counting versions of the same project. We built the dependency graph modulo versions, conservatively counting every version of only the runtime dependencies. We started from the roots, 10,555 projects, and computed the transitive closure of dependencies of each. Since evaluating dependency closures with large overlap is redundant, we reduced overlap as follows. We computed the Jaccard distance between each root and all other roots, based on their sets of dependencies, and sorted the candidates by mean Jaccard distance to all others. We selected the top 500 projects with the latest versions in libraries.io. We binned these subject closures by number of jars, e.g., 1 jar, 2 jars, 3 jars, and so on. We limit subject sizes to 20 jars since subjects with 20+ jars time out on most configurations, making it infeasible to empirically demonstrate the effect of the optimizations. We used stratified sampling on this distribution to get 20 closures uniformly distributed across buckets.
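The overlap-reduction step can be sketched as follows; this is an illustrative reconstruction of the methodology with hypothetical names, not the actual selection script:

```python
def jaccard_distance(a, b):
    """1 - |intersection| / |union| of two dependency sets."""
    union = a | b
    return 1.0 - (len(a & b) / len(union)) if union else 0.0

def rank_roots(dep_sets):
    """Sort root projects by mean Jaccard distance to all other roots,
    most distinct (least overlapping) first."""
    roots = list(dep_sets)
    def mean_dist(r):
        others = [o for o in roots if o != r]
        return sum(jaccard_distance(dep_sets[r], dep_sets[o]) for o in others) / len(others)
    return sorted(roots, key=mean_dist, reverse=True)

deps = {
    "a": {"log4j", "guava"},
    "b": {"log4j", "guava", "gson"},   # overlaps heavily with a
    "c": {"netty", "jackson"},         # disjoint from a and b
}
print(rank_roots(deps))  # 'c' ranks first: it overlaps least with the others
```

In the actual setup, the top-ranked closures would then be binned by jar count and stratified-sampled, as described above.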

Internal Code Analysis Targets
We also evaluated our analysis on four large internal applications. We selected code repositories with application code and discarded third-party code, e.g., open-source libraries. For each application, we built a closure from these jars that includes all their bytecode. We include method signatures and type hierarchies for the rest of the classpath. Table 2 shows statistics about the size of these subjects. For the purposes of this evaluation we use "subject" and "closure" interchangeably, the latter referring to the dependency closure of the former, the root repositories. To evaluate the impact of the optimizations, we run CompTaint on Amazon EC2 m5.12xlarge hosts using the different configurations shown below, each with a 64 GB Java heap limit and a 1-hour time limit.
• ScopeReduction: Only scope reduction enabled over baseline.
• ScopeReduction + Caching: Enables caching of invocation models in addition to scope reduction, i.e., adds caching to the ScopeReduction configuration.
• Discarding: Enables only discarding of analysis state for already-summarized components over baseline.
• CompTaint: Enables all analysis optimizations.
It is worth noting that these configurations are analysis-semantics preserving and have no effect on the number of detected findings. We confirmed that the number of traces generated from each of the configurations is identical for all subjects.

Impact of Scope Reduction on Analyzed Code
In order to answer EQ1: How much effect does scope-reduction analysis have in soundly pruning the size of the analysis problem?, we compare Baseline, with no performance optimizations, against ScopeReduction. Our experiments show that scope-reduction analysis reduces the number of relevant entry points in every subject. The average reduction is 87%: 89% on Maven and 83% on service code. Figure 1 and Figure 2 show the reduction in the number of entry points, while Figure 3 and Figure 4 show the reduction in the number of methods analyzed. Note that adjudging entry points as safe or irrelevant may not lead to proportionally fewer methods analyzed; for example, a large fraction of code may be reachable from a small fraction of relevant entry points. In practice, however, we see a substantial reduction in methods analyzed for the reduced set of entry points above: on average 70%, with 72% on Maven and 59% on service closures.

Impact of Scope Reduction on Analysis Time
To address EQ2: What is the effect of scope-reduction analysis in reducing analysis time?, we analyze the difference between analysis time with and without scope-reduction analysis (ScopeReduction and Baseline, respectively), shown in Figure 5 and Figure 6. Without caching, there is an average 47% reduction for Maven subjects. For service code, there is a reduction in analysis time on 2 of the subjects; for the remaining 2, scope-reduction analysis in fact adds overhead to the baseline. Next, in EQ3, we discuss how model caching turns this around and reverts its performance to that of a lightweight analysis, as hypothesized.

Effect of Invocation Model Caching
To understand the effect of model caching, we use the configuration called ScopeReduction + Caching.
Figure 5 and Figure 6 show the analysis time for ScopeReduction + Caching, used to answer EQ3: How does model caching improve the time taken by taint analysis and scope-reduction analysis? The average time reduction versus baseline rises to 60% for Maven subjects and 19% for service subjects. Hence, combining both optimizations produces worthwhile savings overall. Figure 7 shows the amount of time spent in scope-reduction analysis with and without caching. The average reduction is 89.2%. Note that in several Maven closures the analysis time was reduced by close to 100%, since all entry points were deemed irrelevant by scope-reduction analysis. We do not present these subjects in the figures since they are less interesting, but in practice such scope reduction has proven useful in production for reducing time and cost.
On average, the time spent in scope-reduction analysis without caching represents 13.6% of total baseline time. With caching, this drops to just 1.4% of the total baseline time and 7.5% of the total CompTaint time. Overall, 7.5% of analysis time is spent reducing the code analyzed by 70% on average, with a significant reduction in overall analysis time as discussed above. We conclude that the cost of the scope-reduction analysis does pay off when combined with model caching.

Discarding Abstract State
To answer EQ4: To what extent does discarding intermediate abstract state impact the total amount of abstract state needed to complete the analysis?, we measure the size of the abstract state for Maven closures (nodes and edges in the graph modeling the heap) with and without discarding intermediate state. Figure 8 shows the size of the abstract state for Maven closures, with and without discarding intermediate state (minus a few cases where Baseline times out). We observe an average reduction of 94%. We also measured peak heap memory usage to estimate the effect of this optimization. Although we see a reduction in peak heap usage on service code (not shown), peak heap usage depends on the heap budget and on the frequency and number of garbage collections, so growth in memory usage does not always correlate with an increase in analysis problem size. This dramatic reduction in abstract state size translates to lower analysis time on some services, e.g., CompTaint versus ScopeReduction + Caching in Figure 6. On Maven, we observe that discarding abstract state sometimes comes at a small cost in time due to more garbage collections. Nevertheless, holding only the necessary state in memory lowers the chance of out-of-memory errors on pathological subjects with complex, memory-intensive components. Overall, CompTaint reduces analysis time over baseline by 69.1% on Maven and by 16.3% on service closures.

RELATED WORK
We discuss relevant related work geared towards scaling static taint analysis. RAPID [34] internally uses an IFDS-based [44] type-state analysis and a Boomerang-based taint analysis [47, 49]. RAPID combines type-state checking and taint analysis to check properties similar to CompTaint's. RAPID scales on large subjects only with bounded call-stack depths and cannot reuse analysis results of analyzed components due to context-dependent summarization [28]. RAPID required partitioning [31, 34] in order to scale to subjects of the sizes we evaluate, at the cost of soundness.
ANTaint [52] is an approach deployed at Alibaba for data-leak detection and data-consistency checks. It uses the FlowDroid [27] taint analysis with several changes that improve precision, recall, and scalability on service-oriented applications (SOAs), such as Spring applications. Another approach tailored to SOAs is JackEE [26], a Doop-based [29] data-flow analysis that demonstrates improvements in precision and scalability. JackEE achieves this via two techniques: a generalized modeling of framework runtime behavior and a sound-modulo-analysis model of selected Java data structures. While JackEE shows a speedup of 4X compared to other analyses on selected applications, the improvements are tailored to specific frameworks and a subset of standard Java data structures. CompTaint introduces more general optimizations. P/Taint [36] is another approach based on the Doop framework. In conventional taint analysis approaches, the data-flow analysis is a client of the points-to analysis (e.g., Beacon [38], FlowDroid [27]); the unification of both analyses into a single analysis is the key feature of P/Taint, which mainly focuses on improving precision and recall. CompTaint is an industry-scale analysis that emphasizes maintaining compositionality, but like P/Taint it unifies taint propagation and heap analysis. Tricorder [45] employs a collection of intraprocedural analyses and uses a microservices architecture for scalability; CompTaint is specifically built for scaling inter-procedurally. Infer and Zoncolan [33] are inter-procedural, bi-abduction-based [30] analyses that operate at scale at Facebook. A qualitative comparison of the approaches, such as a comparison with CompTaint's compositional contextual modeling, requires further analysis details that are not published to the best of our knowledge. There is a rich body of work on CFL-reachability-based static analysis. Graspan [53] models reachability as a transitive-closure problem on graphs and uses large-scale graph processing for scalability. Grapple extends it to checking finite-state properties [54]. CompTaint combines taint tracking with contextual dataflow modeling, a finite-state property, into a single compositional analysis.

CONCLUSION
In this paper we presented an industry-scale compositional static analysis that is deployed internally at Amazon and externally as part of AWS cloud services. We gave an overview of the compositional algorithm we implemented and detailed our contribution of modeling contextual dataflow over the heap analysis. We described the setbacks we experienced before deploying CompTaint in production and how a set of sound optimizations allowed us to productionize the tool. We measured the precision benefit of contextual dataflow modeling. We systematically built benchmarks to demonstrate challenges in real deployment scenarios that require analyzing large artifacts, and presented the effect of the optimizations on these subjects.

Figure 1: Number of relevant versus total public entry points for Maven closures.

Figure 3: Number of methods analyzed, with and without scope-reduction analysis, for Maven closures. TO stands for timeouts.

Figure 4: Number of methods analyzed, with and without scope-reduction analysis, for service closures.

Figure 5: Total analysis time for Maven closures for all the configurations. The label TO adjacent to bars stands for timeouts.

Figure 7: Time spent performing the scope-reduction part of the analysis, for Maven closures.

Table 1: False positive rate (FPR) of CompTaint on a labeled dataset of injection vulnerabilities compiled from 1572 OWASP tests, and 120 real-world code examples from the wild with false positives reported by AWS developers on recommendations reported by different SAST tools on code reviews.

Table 2: Experimental subject closures. Maven subjects include their full transitive closure of runtime dependencies. Service subjects include their closure of internal code, excluding third-party dependencies.