Program Analysis for Adaptive Data Analysis

Data analyses are usually designed to identify some property of the population from which the data are drawn, generalizing beyond the specific data sample. For this reason, data analyses are often designed so that they produce a low generalization error; that is, so that the result of an analysis run on a data sample does not differ too much from the result one would obtain by running the analysis over the entire population. An adaptive data analysis can be seen as a process composed of multiple queries interrogating some data, where the choice of which query to run next may depend on the results of previous queries. The generalization error of each individual query/analysis can be controlled with an array of well-established statistical techniques. However, when queries are arbitrarily composed, the individual errors can propagate through the chain of queries and lead to a high overall generalization error. To address this issue, data analysts have designed several techniques that guarantee bounds not only on the generalization errors of single queries, but also on the generalization error of the composed analysis. Which of these techniques to use often depends on the chain of queries that an adaptive data analysis can generate. In this work, we consider adaptive data analyses implemented as while-like programs, and we design a program analysis which can help identify which technique to use to control their generalization error. More specifically, we formalize the intuitive notion of adaptivity as a quantitative property of programs, since the adaptivity level of a data analysis is a key measure for choosing the right technique. Based on this definition, we design a program analysis for soundly approximating this quantity.
The program analysis generates a representation of the data analysis as a weighted dependency graph, where the weight is an upper bound on the number of times each variable can be reached, and uses a path search strategy to guarantee an upper bound on the adaptivity. We implement our program analysis and show that it can help analyze the adaptivity of several concrete data analyses with different adaptivity structures.


INTRODUCTION
Consider a dataset D consisting of n independent samples from some unknown population P. How can we ensure that the conclusions drawn from D generalize to the population P? Despite decades of research in statistics and machine learning on methods for ensuring generalization, there is an increased recognition that many scientific findings generalize poorly (e.g., [19, 26]). While there are many reasons a conclusion might fail to generalize, one that is receiving increasing attention is adaptivity, which occurs when the choice of method for analyzing the dataset depends on previous interactions with the same dataset [19]. Adaptivity can arise from many common practices, such as exploratory data analysis, using the same dataset for feature selection and regression, and the re-use of datasets across research projects. Unfortunately, adaptivity invalidates traditional methods for ensuring generalization and statistical validity, which assume that the method is selected independently of the data. The misinterpretation of adaptively selected results has even been blamed for a "statistical crisis" in empirical science [19].
A line of work initiated by Dwork et al. [15] and Hardt and Ullman [25] posed the question: can we design general-purpose methods that ensure generalization in the presence of adaptivity, together with guarantees on their accuracy? The idea that has emerged in these works is to use randomization to help ensure generalization. Specifically, these works propose to mediate the access of an adaptive data analysis to the data by means of queries from some pre-determined family (we will consider here a specific family of queries often called "statistical" or "linear" queries) that are sent to a mechanism, which uses some randomized process to guarantee that the result of the query does not depend too much on the specific sampled dataset. This guarantees that the results of the queries generalize well. This approach is described in Fig. 1.

Fig. 1. Overview of our Adaptive Data Analysis model. We have a population that we are interested in studying, and a dataset containing individual samples from this population. The adaptive data analysis we are interested in running has access to the dataset through queries of some pre-determined family (e.g., statistical or linear queries) mediated by a mechanism. This mechanism uses randomization to reduce the generalization error of the queries issued to the data.

This line of work has identified many new algorithmic techniques for ensuring generalization in adaptive data analysis, leading to algorithms with greater statistical power than all previous approaches. Common methods proposed by these works include the addition of noise to the result of a query, data splitting, etc. Moreover, these works have also identified problematic strategies for adaptive analysis, showing limitations on the statistical power one can hope to achieve. Subsequent works have further extended the methods, techniques, and theoretical underpinnings of this approach, e.g., [7, 13, 14, 16, 27, 31, 35, 36].
A key development in this line of work is that the best method for ensuring generalization in an adaptive data analysis depends to a large extent on the number of rounds of adaptivity, i.e., the depth of the chain of queries. As an informal example, the program x ← q1(D); y ← q2(D, x); z ← q3(D, y) has three rounds of adaptivity, since q2 depends on D not only directly, because D is one of its inputs, but also via the result of q1, which is also run on D; similarly, q3 depends on D directly but also via the result of q2, which in turn depends on the result of q1. The works we discussed above showed that not only does the analysis of the generalization error depend on the number of rounds, but knowing the number of rounds actually allows one to choose methods that lead to the smallest possible generalization error; we will discuss this further in Section 2.
For example, these works showed that when an adaptive data analysis uses a large number of rounds of adaptivity, then a low generalization error can be achieved by a mechanism adding to the result of each query Gaussian noise scaled to the number of rounds. When instead an adaptive data analysis uses a small number of rounds of adaptivity, then a low generalization error can be achieved by using more specialized methods, such as the data-splitting mechanism or the reusable holdout technique from Dwork et al. [15]. To better understand this idea, we show in Fig. 2 three experiments showcasing these situations. More precisely, in Fig. 2(a) we show the results of a specific analysis with two rounds of adaptivity. This analysis can be seen as a classifier which first runs 400 non-adaptive queries on the first 400 attributes of the data, looking for correlations between the attributes and a label, and then runs one last query which depends on all these correlations. Without any mechanism, the generalization error of the last query is rather large, and the lowest generalization error is achieved when the data-splitting method is used. Fig. 2(c) shows how this situation also changes with the number of queries. Specifically, it shows the root mean square error of the last adaptive query as the number of queries varies. This also highlights the fact that different mechanisms, for the same analysis, produce results with very different generalization errors. In Fig. 2(b), we show the results of a specific analysis with four hundred rounds of adaptivity. At each step, this analysis runs an adaptive query based on the results of the previous ones. Without any mechanism, the generalization error of most of the queries is rather large, and this error can be lowered by using Gaussian noise.
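To make the two families of mechanisms concrete, here is a minimal sketch of both, in Python. All names are ours, not from the cited works, and the noise scale `sigma` is a free parameter here, standing in for the carefully calibrated scales those works derive from the number of rounds and queries.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_mechanism(data, queries, sigma):
    """Answer each statistical query with its empirical mean plus
    Gaussian noise of scale sigma (calibration is out of scope here)."""
    return [float(np.mean([q(row) for row in data])) + rng.normal(0, sigma)
            for q in queries]

def data_splitting_mechanism(data, round_batches):
    """Answer each batch of non-adaptive queries on a fresh, disjoint
    chunk of the data, so later rounds never touch data used earlier."""
    chunks = np.array_split(np.asarray(data), len(round_batches))
    answers = []
    for chunk, batch in zip(chunks, round_batches):
        answers.append([float(np.mean([q(row) for row in chunk])) for q in batch])
    return answers
```

The point of the sketch is the structural difference: the Gaussian mechanism perturbs every answer on the full data, while data splitting spends a disjoint slice of the sample per round, which is why it pays off only when the number of rounds is small.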
This scenario motivates us to explore the design of program analysis techniques that can be used to estimate the number of rounds of adaptivity that a program implementing a data analysis can perform. These techniques could be used to help a data analyst in the choice of the mechanism to use, and they could ultimately be integrated into a tool for adaptive data analysis such as the Guess and Check framework by Rogers et al. [31].
The first problem we face is how to formally define a model for adaptive data analysis which is general enough to support the methods we discussed above and which permits formulating the notion of adaptivity these methods use. We take the approach of designing a programming framework for submitting queries to some mechanism giving access to the data, mediated by one of the techniques we mentioned before, e.g., adding Gaussian noise, randomly selecting a subset of the data, using the reusable holdout technique, etc. In this approach, a program models an analyst asking a sequence of queries to the mechanism. The mechanism runs the queries on the data, applying one of the methods above, and returns the result to the program. The program can then use this result to decide which query to run next. Overall, we are interested in controlling the generalization error of the query results returned by the mechanism by means of the adaptivity.
The second problem we face is how to define the adaptivity of a given program. Intuitively, a query q' may depend on another query q if there are two values that q can return which affect the execution of q' in different ways. For example, as shown in [14], and as we did in our example in Fig. 2(a), one can design a machine learning algorithm for constructing a classifier which first computes each feature's correlation with the label via a sequence of queries, and then constructs the classifier based on the correlation values. If one feature's correlation changes, the classifier depending on the features is also affected. This notion of dependency builds on the execution trace as a causal history.
In particular, we are interested in the history or provenance of a query up until it is executed; we are not concerned with how the result is used afterwards, except for tracking whether the result of the query may further cause some other query. This is because we focus on the generalization error of queries and not on their post-processing. To formalize this intuition as a quantitative program property, we use a trace semantics recording the execution history of programs on some given input, and we create a dependency graph in which the dependency between different variables (queries are also assigned to variables) is explicit and which tracks which variables are associated with a query request. We then enrich this graph with weights describing the number of times each variable is evaluated in a program evaluation starting from an initial state. The adaptivity is then defined as the length of the walk visiting the most query-related variables on this graph. In other words, we define adaptivity as a quantitative form of program dependency.
The third problem we face is how to estimate the adaptivity of a given program. The adaptive data analysis model we consider and our definition of adaptivity suggest that for this task we can use a program analysis based on some form of dependency analysis. This analysis needs to take into consideration: (1) the fact that, in general, a query q is not a monolithic block but rather may depend, through the use of variables and values, on other parts of the program; hence, it needs to include some form of data flow analysis; (2) the fact that, in general, the decision on whether to run a query or not may depend on some other value; hence, it needs to include some form of control flow analysis; (3) the fact that, in general, we are not only interested in whether there is a dependency or not, but in the length of the chain of dependencies; hence, it needs to compute some quantitative information about the program dependencies.
To address these considerations and obtain a sound upper bound on the adaptivity of a program, we develop a static program analysis algorithm, named AdaptFun, which combines data flow and control flow analysis with reachability-bound analysis [22]. This combination gives tighter bounds on the adaptivity of a program than the ones one would achieve by directly using the data and control flow analyses alone, or by directly using reachability-bound analysis techniques alone. We evaluate AdaptFun on a number of examples, showing that it can efficiently estimate precise upper bounds on the adaptivity of different programs. All the proofs and extended definitions can be found in the supplementary material.
To summarize, our work aims at the design of a static analysis for programs implementing adaptive analyses that can estimate their rounds of adaptivity. Specifically, our contributions are: (1) a programming framework for adaptive data analyses where programs represent analysts that can query generalization-preserving mechanisms mediating the access to some data; (2) a formal definition of the notion of adaptivity under the analyst-mechanism model, built on a variable-based dependency graph that is constructed using sets of program execution traces; (3) a static program analysis algorithm, AdaptFun, combining data flow, control flow, and reachability-bound analysis in order to provide tight bounds on the adaptivity of a program; (4) a soundness proof of the program analysis showing that the adaptivity estimated by AdaptFun bounds the true adaptivity of the program; (5) an implementation of AdaptFun and an experimental evaluation of the bounds this implementation provides on several examples.

Some results in Adaptive Data Analysis
In Adaptive Data Analysis, an analyst is interested in studying some distribution P over some domain X. Following previous works [7, 15, 25], we focus on the setting where the analyst is interested in answers to statistical queries (also known as linear queries) over the distribution. A statistical query is usually defined by some function query : X → [−1, 1], and the answer to such a query is the expected value of this function over the distribution.

In this work we consider analysts that ask a sequence of k queries query_1, ..., query_k. If the queries are all chosen in advance, independently of each other's answers, then we say they are non-adaptive. If the choice of each query query_j may depend on the prefix query_1, a_1, ..., query_{j−1}, a_{j−1} of previous queries and answers, then they are fully adaptive. An important intermediate notion is r-round adaptivity, where the sequence can be partitioned into r batches of non-adaptive queries. Note that non-adaptive queries are 1-round adaptive and fully adaptive queries are k-round adaptive.
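The distinction can be sketched in a few lines of Python (the names are illustrative, and `empirical_mean` is a trivial exact-answer stand-in for a mechanism): in the non-adaptive case all queries are fixed before any answer is seen, while in the fully adaptive case each query is a function of the answers received so far.

```python
def empirical_mean(data, q):
    # Trivial stand-in mechanism: answers a statistical query exactly.
    return sum(q(row) for row in data) / len(data)

data = [0.2, 0.8, 0.5, 0.9]

# Non-adaptive (1-round): every query is chosen before any answer is seen.
non_adaptive = [lambda r: r, lambda r: r * r]
answers = [empirical_mean(data, q) for q in non_adaptive]

# Fully adaptive: query j is constructed from the answers to queries 1..j-1.
adaptive_answers = []
for _ in range(3):
    threshold = sum(adaptive_answers)             # depends on earlier answers
    q = lambda r, t=threshold: 1.0 if r > t else 0.0
    adaptive_answers.append(empirical_mean(data, q))
```

Each iteration of the adaptive loop opens a new round, since its query cannot be written down until the previous round's answer is known.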
We now review what is known about the problem of answering r-round adaptive queries.
In fact, these bounds are tight (up to constant factors), which means that even allowing one extra round of adaptivity leads to an exponential increase in the generalization error, from log k to k.
Dwork et al. [15] and Bassily et al. [7] showed that, by using carefully calibrated Gaussian noise to limit the dependency of a single query on the specific data instance, one can actually achieve a much stronger generalization error as a function of the number of queries.

Theorem 2.2 ([7, 15]). For any distribution P, any n, any k ≥ 2, and any r-round adaptive statistical queries, if we answer the queries with carefully calibrated Gaussian noise we have:

More interestingly, Dwork et al. [15] also gave refined bounds that can be achieved with different mechanisms depending on the number of rounds of adaptivity.

Theorem 2.3 ([15]). For any k and n, there exists a mechanism such that, for any distribution P, any r ≥ 2, and any r-round adaptive statistical queries, it satisfies:

Notice that Theorem 2.3 has a different quantification, in that the optimal choice of mechanism depends on the number of queries and the number of rounds of adaptivity. This suggests that if one knows a good a priori upper bound on the number of rounds of adaptivity, one can choose the appropriate mechanism and get a much better guarantee in terms of generalization error. As an example, as we can see in Fig. 2, if we know that an algorithm is two-round adaptive, we can choose data splitting as the mechanism, while if we know that an algorithm has many rounds of adaptivity, we can choose Gaussian noise. It is worth stressing that by knowing the number of rounds of adaptivity one can also compute a concrete upper bound on the generalization error of a data analysis. This information allows one to have a quantitative, a priori estimation of the effectiveness of a data analysis. This motivates us to design a static program analysis aimed at giving good a priori upper bounds on the number of rounds of adaptivity of a program.

AdaptFun formally through an example.
We illustrate the key technical components of our framework through a simple adaptive data analysis with two rounds of adaptivity. In this analysis, an analyst asks k + 1 queries to a mechanism in two phases. In the first phase, the analyst asks k queries and stores the answers that are provided by the mechanism. In the second phase, the analyst constructs a new query based on the results of the previous k queries and sends this query to the mechanism. The mechanism is abstract here, and our goal is to use static analysis to provide an upper bound on adaptivity that can help choose the mechanism. This data analysis assumes that the data domain X contains at least k numeric attributes (every query in the first phase focuses on one of them), which we index by natural numbers. The implementation of this data analysis in the language of AdaptFun is presented in Fig. 3(a).

twoRounds(k) ≜
  [j ← k]^1 ;
  [l ← 0]^2 ;
  while [j > 0] do
    ( [x ← query(χ[j] · χ[k])]^3 ;
      [j ← j − 1]^4 ;
      [l ← l + x]^5 ) ;
  [a ← query(χ[k] · l)]^6
The AdaptFun language extends a standard while language with a query request construct denoted query. Queries have the form query(q), where q is a special expression (see syntax in Section 3) representing a function on the rows of the database. We use U to denote the codomain of queries. This function characterizes the linear query we are interested in running. Indeed, as we discussed in the previous section, linear queries compute the empirical mean of a function on rows; we use χ to abstract a possible row in the database. As an example, x ← query(χ[j] · χ[k]) computes an approximation, according to the mechanism used, of the empirical mean of the product of the j-th attribute and the k-th attribute, identified by χ[j] · χ[k]. Notice that we do not materialize the mechanism, but we assume that it is implicitly run when we execute the query. In Fig. 3(a), the queries inside the while loop correspond to the first phase of the data analysis and compute, for each of the first k attributes, an approximation of the empirical mean of its product with the k-th attribute. The query outside the loop corresponds to the second phase and computes an approximation of the empirical mean where each record is weighted by the sum of the results of the first k queries.
This example is intuitively 2-round adaptive since we have two clearly distinguished phases, and the queries that we ask in the first phase do not depend on each other (the query χ[j] · χ[k] at line 3 only relies on the counter j and the input k), while the last query (at line 6) depends on the results of all the previous queries. However, capturing this concept formally is surprisingly difficult. The difficulty comes from the fact that a query can depend on the result of another query in multiple ways, by means of data dependency or control flow dependency.
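As a cross-check of this intuition, the analysis of Fig. 3(a) can be transliterated into Python; the `mechanism` argument stands in for whatever mechanism mediates the queries, and all names below are ours, not part of AdaptFun.

```python
def two_rounds(k, data, mechanism):
    """Transliteration of twoRounds(k): `data` is a list of rows (each with
    at least k numeric attributes; 1-indexed in the paper, 0-indexed here),
    and `mechanism` answers a row-function query."""
    j, l = k, 0.0
    while j > 0:
        # Phase 1: k queries that rely only on the counter j and input k.
        x = mechanism(data, lambda chi, j=j: chi[j - 1] * chi[k - 1])
        j -= 1
        l = l + x
    # Phase 2: one query built from all phase-1 answers -> the second round.
    return mechanism(data, lambda chi: chi[k - 1] * l)

def exact_mean(data, q):
    # Simplest possible mechanism: the exact empirical mean, no randomization.
    return sum(q(row) for row in data) / len(data)
```

Swapping `exact_mean` for a noise-adding or data-splitting mechanism changes only the `mechanism` argument, which is exactly the separation between analyst and mechanism that the framework enforces.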

Adaptivity definition.
The central property we are after in this work is the adaptivity of a program. We define this notion formally in three steps, which we will describe in detail in Section 4. First, we define a notion of dependency, or better may-dependency, between variables. To do this we take inspiration from previous works on dependency analysis and information flow control, and we say that a variable may depend on another one if changing the execution of the latter can affect the execution of the former. We can see in Fig. 3(a) that the value of the variable a, which corresponds to the result of the execution of the query in the second phase (in the command with label 6), is affected by the value of the variable x, which corresponds to the result of the execution of the query at line 3 in the first phase, via the variable l. To formally define this notion of dependency, as in information flow control, we use the execution history of programs recorded by a trace semantics (see Definition 3).
Second, we build an annotated, weighted, directed graph representing the possible dependencies between labeled variables. We call this graph the semantics-based dependency graph to stress that it summarizes the dependencies we could see if we knew the overall behavior of the program. The vertices of the graph are the assigned program variables with the labels of their assignments; edges are pairs of labeled variables which satisfy the dependency relation; weights are functions associated with vertices, describing the number of times the assignment corresponding to the vertex is executed when the program is run from a given starting state; and the annotations, which we call query annotations, are bits associated with vertices, describing whether the corresponding assignment comes from a query (1) or not (0). The semantics-based dependency graph of the twoRounds(k) program we gave in Fig. 3(a) is described in Fig. 3(b) (we use dashed arrows for two edges that will be highlighted in the next step; for the moment these can be considered similar to the other edges, i.e., solid arrows). We have all the variables that are assigned in the program with their labels, and edges representing dependency relations between them. For example, we have two edges (a^6, l^5) and (l^5, x^3) describing the dependency between the variables assigned by queries. The vertices a^6 and x^3 are the only ones with query annotation 1 (the subscript), since they are the only two variables that appear in assignments involving queries. Notice that the graph contains cycles; in this example it contains two self-loops. These cycles capture the fact that the variables l^5 and j^4 are updated at every iteration of the loop using their previous values. Cycles are essential to capture mutual dependencies like the ones that are generated in loops. Adaptivity is a quantitative notion, so capturing this form of dependency is not enough. This is why we also use weights. The weight of a vertex is a function that, given an
initial state returns a natural number representing the number of times the assignment corresponding to the vertex is visited during the program execution starting in this initial state. For example, the vertex a^6 has weight λτ.1, since for every initial state τ the corresponding assignment will be executed one time; the vertex l^5, on the other hand, has weight λτ.τ(k), since the corresponding assignment will be executed a number of times that corresponds to the value of k in the initial state τ, where τ(k) is the operation reading the value of k from τ.
Third, we can finally define adaptivity using the semantics-based dependency graph. We actually define this notion with respect to an initial state τ, since different states can give very different adaptivity. We consider the longest walk that visits each vertex v of the semantics-based dependency graph no more than the number of times that the weight of v assigns to τ, and that visits as many query nodes as possible. The number of query nodes visited is the adaptivity of the program with respect to τ. Looking again at Fig. 3(b), and assuming that τ(k) ≥ 1, we can see that the walk along the dashed arrows, a^6 → l^5 → x^3, has two vertices with query annotation 1, and we cannot find another walk having more than 2 vertices with query annotation 1. So the adaptivity of the program in Fig. 3(a) with respect to τ is 2. If we consider an initial state τ such that τ(k) = 0, we have that the adaptivity with respect to τ is instead 1.
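For small graphs this definition can be checked directly by brute force. The sketch below is our own encoding, not part of the paper's formalism: it enumerates weight-respecting walks over the graph of Fig. 3(b), restricted to the edges discussed above, for an initial state with τ(k) = 2, and counts query-annotated visits.

```python
def adaptivity(edges, weight, query_annot):
    """Maximum number of query-annotated vertex visits over all walks
    that visit each vertex v at most weight[v] times."""
    best = 0

    def dfs(v, remaining, qcount):
        nonlocal best
        best = max(best, qcount)
        for u in edges.get(v, ()):
            if remaining[u] > 0:
                remaining[u] -= 1
                dfs(u, remaining, qcount + query_annot[u])
                remaining[u] += 1

    for v in weight:
        if weight[v] > 0:
            remaining = dict(weight)
            remaining[v] -= 1
            dfs(v, remaining, query_annot[v])
    return best

# Fig. 3(b) for an initial state with tau(k) = 2: an edge (v, u) means
# "v depends on u"; l^5 and j^4 carry self-loops.
edges = {"a6": ["l5"], "l5": ["x3", "l5"], "j4": ["j4"]}
weight = {"a6": 1, "l5": 2, "x3": 2, "j4": 2}
query_annot = {"a6": 1, "l5": 0, "x3": 1, "j4": 0}
```

Running `adaptivity(edges, weight, query_annot)` on this instance yields 2, matching the walk a^6 → l^5 → x^3: the self-loop on l^5 lets a walk revisit it up to its weight, but no walk can collect more than two query-annotated visits.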

Static analysis.
To compute statically a sound and accurate upper bound on the adaptivity of a program, we design a program analysis framework named AdaptFun, which we will describe formally in Section 5. The structure of AdaptFun (Fig. 4) reflects in part the definition of adaptivity we discussed in the previous section. Specifically, AdaptFun is composed of two algorithms (the ones in dashed boxes in the figure): one for building a dependency graph, which we call the estimated dependency graph, and another for estimating the adaptivity from this graph. The first algorithm, which we will describe formally in Section 5, generates the estimated dependency graph using several program analysis techniques. Specifically, AdaptFun extracts the vertices and the query annotations by looking at the assigned variables of the program, it estimates the edges by using control flow and data flow analysis, and it estimates the weights by using symbolic reachability-bound analysis; the weights in this graph are symbolic expressions over the input variables. The second algorithm estimates the longest walk which respects the weights and visits as many query nodes as possible. The two algorithms together give us an upper bound on the program's adaptivity.
We show in Fig. 3(c) the estimated dependency graph that our static analysis algorithm returns for the program twoRounds(k) in Fig. 3(a). Vertices and query annotations are the same as the ones in Fig. 3(b), and they are simply inferred by scanning the program. As we said before, the edges are estimated using control flow and data flow analysis. For the twoRounds(k) example, every edge in Fig. 3(b) is precisely inferred by our combined analysis; this is why Fig. 3(c) contains exactly the same edges. The weight of every vertex is computed using a reachability-bound estimation algorithm which outputs a symbolic expression over the input variables (in this example only k) representing an upper bound on the number of times each assignment is executed. For example, consider the vertex x^3: its weight is k, and this provides an upper bound on the values returned by the weight function λτ.τ(k) associated with the vertex x^3 in Fig. 3(b) for any initial state.
The algorithm searching for the longest walk first finds the path a^6 : 1 → l^5 : k → x^3 : k (each vertex shown with its weight), and then constructs a walk based on this path. Every vertex on this walk is visited once, and the number of vertices with query annotation 1 in this walk is 2, which is the upper bound we expect. It is worth noting here that x^3 and l^5 can only be visited once, because there is no edge to go back to them, even though they both have the weight k. In this sense, instead of simply computing the weighted length of this path (2k + 1) as the adaptivity, AdaptBD computes the upper bound 2. Note that 2 is not always tight, for example when k = 0.

LABELED QUERY WHILE LANGUAGE
The language of AdaptFun is a standard while language, with labels to identify different components and with primitives for queries, equipped with a trace-based operational semantics, which is the main technical tool we will use to define a program's adaptivity.
Expressions include standard arithmetic expressions (with values v ∈ N∞) and boolean expressions (denoted a and b, respectively), and extended query expressions q. A query expression q can be either a simple arithmetic expression a; an expression of the form χ[a], where χ represents a row of the database and a represents an index used to identify a specific attribute of the row χ; a combination of two query expressions, q₁ ⊕ q₂; or a normal form α. For example, the query expression χ[3] + 5 denotes the computation that obtains the value in the 3rd column of one row χ and then adds 5 to it. Commands are the typical ones from while languages, with an additional command x ← query(q) for query requests, which can be used to interrogate the database and compute the linear query corresponding to q. Each command is annotated with a label l; we will use natural numbers as labels, and we will use them to record the location of each command, so that we can uniquely identify it. We also have a set LV of labeled variables; these are simply variables with a label. We denote by LV(c) the set of labeled variables which are assigned in an assignment command in the program c. We denote by QV(c) the set of labeled variables that are assigned the result of a query in the program c.

Trace-based Operational Semantics
We use a trace-based operational semantics tracking the history of program executions. The operational semantics is parameterized by a database that can be accessed only through queries. Since this database is fixed, we omit it from the semantics, but it is important to keep in mind that this database exists and is what allows us to evaluate queries. A trace τ is a list of events generated when executing specific commands. We denote by T the set of traces, and we will use list notation for traces, where [] is the empty trace, the operator :: combines an event and a trace into a new trace, and the operator ++ concatenates two traces.
We have two kinds of events: assignment events and testing events. Each event consists of a quadruple, and we use E_asn and E_test to denote the sets of all assignment events and testing events, respectively.
Event ε ::= (x, l, v, •) | (x, l, v, α)   Assignment Event
          | (b, l, v, •)                  Testing Event

An assignment event tracks the execution of an assignment or of a query request and consists of the assigned variable, the label of the command that generates it, the value assigned to the variable, and, if the command is a query request, the normal form α of the query expression, otherwise a default value •. A testing event tracks the execution of if and while commands and consists of the guard of the command, the label of the command, and the result of evaluating the guard, while the last element is •. We use the operator τ(x) to fetch the latest value assigned to x in the trace τ.
We use the operator cnt to count the occurrences of a labeled variable in a trace. We denote by TL(τ) ⊆ L the set of the labels occurring in τ. Finally, we use T₀(c) ⊆ T to denote the set of initial traces, the ones which assign a value to the input variables. The trace-based operational semantics is described in terms of a small-step evaluation relation ⟨c, τ⟩ → ⟨c′, τ′⟩ describing how a program-trace configuration evaluates to another program-trace configuration. The rules for the operational semantics are described in Fig. 5. The rules for assignment and query generate assignment events, while the rules for while and if generate testing events. The rules for the standard while language constructs are the usual rules extended to deal with traces. We have relations ⟨a, τ⟩ ⇓ v and ⟨b, τ⟩ ⇓ v to evaluate arithmetic and boolean expressions, respectively. Their definitions are in the supplementary material. The only rule that is non-standard is the query rule. When evaluating a query, the query expression q is first simplified to its normal form α using an evaluation relation ⟨q, τ⟩ ⇓ α. The normal form α characterizes the linear query that is run against the database. The query result v is the expected value of the function represented by α applied to each row of the dataset. We summarize this process with the notation query(α) = v, which we use in the rule query. Once the answer of the query is computed, the rule records all the needed information in the trace. As usual, we will use →* for the reflexive and transitive closure of →.
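Under the event representation above, the trace operators have direct implementations. The sketch below (Python, our own encoding, with `None` standing for the • placeholder) shows τ(x) and cnt on a trace of the shape the loop of twoRounds(k) could produce:

```python
# An event is a quadruple (variable-or-guard, label, value, query-normal-form);
# None plays the role of the default value "•".

def latest(trace, x):
    """tau(x): the latest value assigned to variable x in the trace tau."""
    for name, _label, value, _q in reversed(trace):
        if name == x:
            return value
    raise KeyError(x)

def cnt(trace, x, label):
    """cnt: number of occurrences of the labeled variable x^label in the trace."""
    return sum(1 for name, lab, _v, _q in trace if name == x and lab == label)

trace = [("j", 1, 2, None),
         ("x", 3, 0.5, "chi[2]*chi[2]"),   # query event: normal form recorded
         ("j", 4, 1, None),
         ("x", 3, 0.7, "chi[1]*chi[2]"),
         ("j", 4, 0, None)]
```

Here `latest(trace, "x")` returns the answer of the most recent query assigned to x, and `cnt(trace, "x", 3)` counts how many loop iterations executed the assignment labeled 3, which is exactly the quantity the weights of the dependency graph record.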
The query expression evaluation relation ⟨q, τ⟩ ⇓ α is defined by the following rules, which reduce a query expression to its normal form.

DEFINITION OF ADAPTIVITY
In this section, we formally present the definition of adaptivity for a given program. As we discussed in Section 2.2.1, we first define a dependency relation between program variables, we then define a semantics-based dependency graph, and we finally look at the longest walks in this graph.

May-dependency between variables
We are interested in defining a notion of dependency between program variables, since assigned variables are a good proxy to study dependencies between queries: we can recover query requests from variables associated with queries. We consider dependencies that can be generated by either data or control flow. For example, in the first program the query query(χ[3] + x) depends on the query query(χ[2]) through a value dependency via the labeled variable x^1. Conversely, in the second program the query query(χ[2]) depends on the query query(χ[1]) via a control dependency: the guard of the if command involves the labeled variable x^1.
To define dependency between program variables we consider two events that are generated from the same command; hence they have the same variable name or boolean expression and the same label, but have either a different value or a different query expression. This is captured by the following definition. Definition 1. Two events ε1, ε2 ∈ E differ in their value, or query value, denoted Diff(ε1, ε2), if and only if their values, or query values, are not equal, where =_q denotes semantic equivalence between query values and π_i projects the i-th element from the quadruple of an event.
We can now define when an event may depend on another one. Definition 2 (Event May-Dependency). An event ε2 ∈ E_asn may-depends on an event ε1 ∈ E_asn in a program c, denoted DEP_e(ε1, ε2, c), if and only if the conditions below hold. There are several components in this definition. The part with label (2a) requires that ε1 and ε′1 differ in their value (Diff(ε1, ε′1)). The next two parts, (2b) and (2c), capture the value dependency and the control dependency, respectively. As in the literature on non-interference, and following [10], we formulate these dependencies as relational properties, i.e. in terms of two different execution traces. We force these two traces to differ by using the event ε1 in one and ε′1 in the other. For the value dependency, we check whether the change also creates a change in the value of ε2 or not. We additionally check that the two events we consider appear the same number of times in the two traces: this is to make sure that, if the events are generated by assignments in a loop, we consider the same iterations. For the control dependency, we check whether the change in ε1 affects the appearance of ε2 in the computation or not. For this we require the presence of a test event whose value is affected by the change in ε1, in order to guarantee that the computation goes through a control-flow guard. Similarly to the previous condition, we additionally check that the two test events we consider appear the same number of times in the two traces.
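The relational flavor of this definition can be illustrated with a toy two-run check (our own illustration, not the formal definition): we witness a value dependency by re-running a program with the first assignment's value changed and observing that the second assignment's value changes too.

```python
def program(first_value):
    """A two-assignment program; the two assignments play the roles
    of the events in the two execution traces of Definition 2."""
    x = first_value   # the event whose value we perturb (ε1 vs ε′1)
    y = x + 1         # the candidate dependent event ε2
    return y

# Two runs that differ only in the first event: the second event's
# value also differs, so it may-depends on the first one.
assert program(0) != program(5)
```

The real definition also requires matching occurrence counts for events generated inside loops, and a separate clause for control dependencies; this sketch shows only the simplest value-dependency case.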
We can now extend the dependency relation to variables by considering all the assignment events generated during the program's execution.

Definition 3 (Variable May-Dependency). A labeled variable y^j may-depends on a labeled variable x^i in a program c if and only if there exist two assignment events ε1 and ε2, generated by the assignments to x^i and y^j respectively, such that DEP_e(ε1, ε2, c).
Notice that in the definition above the two variables can also be the same; this allows us to capture self-dependencies.

Semantics-based Dependency Graph
We can now define the semantics-based dependency graph of a program c. We want this graph to combine quantitative reachability information with dependency information.
As we discussed before, vertices and query annotations are just read off from the program c. We have an edge in E_trace(c) if we have a may-dependency between two labeled variables in c. A weight function w ∈ W_trace(c) is a function that, for every initial trace τ0 ∈ T0(c), gives the number of times the assignment of the corresponding vertex is visited. Notice that weight functions are total and have range N. This means that if a program c has some non-terminating behavior, the set W_trace(c) will be empty. To rule out this situation, we consider as well-formed only graphs which have a weight for every vertex. In the rest of the paper we will implicitly consider only well-formed semantics-based dependency graphs.
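An illustrative encoding of such a graph (our own representation, not the paper's datatype) keeps vertices, may-dependency edges, weight functions from initial traces to visit counts, and a query annotation per vertex. Here an "initial trace" is simplified to a dict of input values.

```python
class DepGraph:
    """Weighted dependency graph: (V, E, W, Q) as in the paper's quadruple."""

    def __init__(self):
        self.V = set()   # vertices: labeled variables, e.g. ('x', 1)
        self.E = set()   # may-dependency edges between vertices
        self.W = {}      # weight: vertex -> (initial trace -> visit count)
        self.Q = {}      # query annotation: vertex -> bool

    def add_vertex(self, v, weight_fn, is_query):
        self.V.add(v)
        self.W[v] = weight_fn   # total function, range N
        self.Q[v] = is_query

    def add_edge(self, src, dst):
        self.E.add((src, dst))

g = DepGraph()
g.add_vertex(('x', 1), lambda t0: 1, True)         # query executed once
g.add_vertex(('y', 2), lambda t0: t0['k'], True)   # query inside a k-iteration loop
g.add_edge(('y', 2), ('x', 1))
assert g.W[('y', 2)]({'k': 5}) == 5
```

Well-formedness in the paper's sense corresponds to W being defined (and finite) on every vertex.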

Trace-based Adaptivity
We can now define the adaptivity of a program formally. This notion is formulated, given an initial trace specifying the values of the input variables, as the walk on the graph G_trace(c) which has the largest number of query requests. Definition 5 (Walk on G_trace(c)). Given the semantics-based dependency graph G_trace(c) = (V_trace, E_trace, W_trace, Q_trace) of a program c, a walk k on G_trace(c) is a function that, given as input an initial trace τ0, returns a sequence of edges (e_1, ..., e_{n-1}) for which there is a sequence of vertices (v_1, ..., v_n) such that: • e_i = (v_i, v_{i+1}) ∈ E_trace for every 1 ≤ i < n.
Because for the adaptivity we are interested in the dependencies between queries, we calculate a special "length" of a walk, the query length, by counting only the vertices corresponding to queries. Definition 6 (Query Length). Given the semantics-based dependency graph G_trace(c) of a program c and a walk k ∈ WK(G_trace(c)), the query length of k is a function len_q(k) : T0(c) → N that, given an initial trace τ0, returns the number of vertices in the vertex sequence (v_1, ..., v_n) which correspond to query variables. Definition 7 (Adaptivity of a Program). Given a program c, its adaptivity A(c) is a function A(c) : T0(c) → N such that, for an initial trace τ0 ∈ T0(c), A(c)(τ0) = max { len_q(k)(τ0) | k ∈ WK(G_trace(c)) }.
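Definitions 6 and 7 can be sketched as follows, under a simplifying assumption: we enumerate a finite set of walks explicitly (given as vertex sequences), whereas the definition quantifies over all walks of G_trace(c).

```python
def query_length(vertices, is_query):
    """Number of query vertices occurring in a walk's vertex sequence
    (Definition 6, with the walk given directly as its vertex list)."""
    return sum(1 for v in vertices if is_query[v])

def adaptivity(walks, is_query):
    """Maximum query length over the given finite set of walks (Definition 7)."""
    return max((query_length(w, is_query) for w in walks), default=0)

is_query = {'x1': True, 'y2': True, 'z3': False}
walks = [['x1'], ['y2', 'x1'], ['z3', 'y2', 'x1']]
assert adaptivity(walks, is_query) == 2   # z3 is not a query vertex
```

Since walks may revisit vertices up to their weights, the real quantification is over a set of walks whose size depends on the initial trace; that is what makes computing this quantity non-trivial.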

THE ADAPTIVITY ANALYSIS ALGORITHM -ADAPTFUN
In this section, we present our program analysis AdaptFun for computing an upper bound on the adaptivity of a given program c. The high-level idea behind AdaptFun is to first build an estimated dependency graph G_est(c) of a program c (Section 5.1) which overapproximates the semantics-based dependency graph in two dimensions: it overapproximates the dependencies between assigned variables (Section 5.1.2), and it overapproximates the weights (Section 5.1.3). Then, AdaptFun uses a custom algorithm to estimate the longest walk on this graph, providing in this way an upper bound on the adaptivity of the program.
Given a program c, the set of vertices V_est(c) and the query annotations Q_est(c) of the estimated dependency graph can be computed by simply scanning the program c. These sets can be computed precisely and correspond to the same sets in the semantics-based dependency graph. This means that G_est(c) has the same underlying vertex structure as the semantics-based graph G_trace(c). The differences will be in the sets of edges and weights.

Weight and Edge Estimation
The set of edges E_est(c) and the set of weights W_est(c) of the estimated dependency graph are estimated through an analysis combining control flow, data flow, and loop-bound analysis. These analyses are naturally described over an abstract transition graph of the input program, which we describe next.

Abstract Transition Graph.
We say that we have a transition from a program point l to a program point l′ if and only if the command with label l′ can execute right after the command with label l. The abstract transition graph absG(c) of a program c is a graph whose set of vertices absV(c) is the set of labels of program points in c (including a label ex for the exit point), and whose set of edges absE(c) is the set of transitions in c. Each edge of the graph is annotated with either the symbol ⊤, a boolean expression, or a difference constraint [33].
A difference constraint is an inequality of the form x′ ≤ y + v or x′ ≤ v, where x, y are variables and v ∈ SC is a symbolic constant: either a natural number, the symbol ∞, an input variable, or a symbol representing a query request. We denote by DC the set of difference constraints.
A difference constraint on an edge from l to l′ denotes that after executing the command at location l, the value of the variable x is at most the value that the expression y + v (resp. v) had before the execution of the command at l′. A boolean value b on an edge l →b l′ denotes that after evaluating the guard of an if or while command with label l, b holds and the next command to be executed is the one with label l′. A ⊤ symbol on an edge l →⊤ l′ denotes that the command with label l is a skip, or another command that does not interfere with any loop counter variable.
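A toy encoding of difference constraints (our own representation, not the paper's; symbolic constants are simplified to concrete numbers) makes the intended reading checkable: a transition respects x′ ≤ y + v when the post-state value of x is bounded by the pre-state value of y plus v.

```python
from collections import namedtuple

# lhs' <= rhs_var + const; rhs_var=None models the form x' <= v.
DC = namedtuple('DC', ['lhs', 'rhs_var', 'const'])

def respects(dc, before, after):
    """True if the post-state value of dc.lhs obeys the constraint
    with respect to the pre-state values."""
    base = before[dc.rhs_var] if dc.rhs_var is not None else 0
    return after[dc.lhs] <= base + dc.const

inc = DC('i', 'i', 1)                       # i' <= i + 1: a loop counter increment
assert respects(inc, {'i': 3}, {'i': 4})
assert not respects(inc, {'i': 3}, {'i': 9})
```

In the actual analysis, const ranges over symbolic constants in SC (numbers, ∞, input variables, query symbols) rather than plain integers.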
We compute difference constraints and the other annotations via a simple program-abstraction method adopted from [33], described in detail in the supplementary material.
Example. We show in Fig. 6(b) the abstract transition graph absG(twoRounds(k)) of the twoRounds(k) program we gave in Fig. 3(a), which we also report in Fig. 6(a).

Edge Estimation.
The set of edges E_est(c) is estimated through a combined data- and control-flow analysis with three components. Reaching-definition analysis: the first component is a reaching-definition analysis computing, for each label l in the graph absG(c), the set of labeled variables that may reach l, as follows.
(1). For each label l, the analysis generates two initial sets of labeled variables, in(l) and out(l), containing all the labeled variables that are newly generated but not yet reassigned before and after executing the command with label l, respectively.
(2). The analysis iterates over absG(c) and updates in(l) and out(l) until they are stable. The final in(l) is the set of reaching definitions RD(l, c) for l. Feasible data-flow analysis: the second component is a feasible data-flow analysis computing, for every pair of labeled variables x^i, y^j ∈ LV(c), whether there is a flow from x^i to y^j. This analysis is based on a relation flowsTo(x^i, y^j, c) built over the sets RD(l, c) for every location l. Definition 8 (Feasible Data-Flow). Given a program c and two labeled variables x^i, y^j in this program, flowsTo(x^i, y^j, c) holds when x^i reaches the command labeled j and is used in it. This relation gives us an overapproximation of the variable may-dependency relation for direct dependencies (dependencies that do not go through other variables). Edge construction: the third component constructs the edges by computing a transitive closure (through other variables) of the flowsTo relation. There is a directed edge from y^j to x^i if and only if there is a chain of variables in the flowsTo relation between y^j and x^i. We prove that the set E_est(c) soundly approximates the set of edges of G_trace(c). Lemma 5.1 (Mapping from Edges of G_trace to G_est). For every program c, every edge of G_trace(c) corresponds to an edge of G_est(c). Example. Consider Fig. 3(c). The edge from the vertex labeled 6 to the vertex labeled 5 is built by the flowsTo relation, because the variable assigned at label 5 is used directly in the query expression at label 6 and it belongs to RD(6, twoRounds(k)) by the reaching-definition analysis. The edge from the vertex labeled 3 to the vertex labeled 5 represents the control flow from label 5 to label 3, which is soundly approximated by our flowsTo relation. The edge from the vertex labeled 6 to the vertex labeled 3 is produced by the transitivity of the two flowsTo facts above.
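The edge-construction step can be sketched as a fixpoint computing the transitive closure of a given flowsTo relation (here represented as a set of pairs; computing the relation itself from RD is assumed done).

```python
def edges_from_flows(flows):
    """Transitive closure of a set of (src, dst) flowsTo pairs.

    A naive fixpoint: keep adding (a, d) whenever (a, b) and (b, d)
    are already in the closure, until nothing changes. This yields
    the estimated edge set through chains of intermediate variables.
    """
    closure = set(flows)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

# Mirrors the example: a flow 6 -> 5 and a flow 5 -> 3 produce the
# transitive edge 6 -> 3.
flows = {('x6', 'x5'), ('x5', 'x3')}
assert ('x6', 'x3') in edges_from_flows(flows)
```

The quadratic fixpoint is for clarity only; an implementation would use a worklist or Warshall-style closure.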

Weight Estimation.
The set W_est(c) of weights for the estimated dependency graph is estimated from the abstract transition graph absG(c) of c using reachability-bound analysis [22]. Specifically, we estimate as the weight of a node with label l a symbolic upper bound, obtained by reachability-bound analysis, on the number of times the command with label l can be executed. These symbolic upper bounds are expressions with the input variables as free variables; hence they correspond to the weight functions in the semantics-based dependency graphs.
Our reachability-bound algorithm adapts ideas from previous work [32, 33, 39] to our setting. Specifically, it provides an upper bound on the number of times every command can be executed by using three steps.
(1) This step assigns to each edge of absE(c) a local bound as follows. We look at the strongly connected components of absG(c). If the edge does not belong to any strongly connected component, then the local bound is 1, representing the fact that the edge is not in a loop and so it gets executed at most once. If the edge belongs to a strongly connected component and one of the variables x in its difference constraint decreases, then the local bound is x. Otherwise, if the edge belongs to a strongly connected component, there is a variable x that decreases in the difference constraint of some other edge, and by removing this other edge the original edge no longer belongs to any strongly connected component of absG(c), then the local bound is x. Otherwise, the local bound is ∞. Notice that the output is either a symbolic constant in SC or a variable that is not an input variable.
(2) This step aims at determining the reachability bound TB(e, c) of every edge e ∈ absE(c).
Every bound is a symbolic expression built out of symbols in SC and the operations +, *, and max.
For every edge, if the local bound computed at the previous step is a symbol in SC, then this is already the reachability bound. If instead the local bound of the edge is a variable x which is not an input variable, this step will eliminate it and replace it with a symbolic expression. In order to do this, this step computes two quantities: first, it recursively sums the reachability bounds of all the edges whose difference constraint may increment the variable x, plus the corresponding increment; second, it recursively sums the reachability bounds of all the edges whose difference constraint may reset the variable x to a (symbolic) expression that does not depend on it, multiplied by the maximal value of this symbolic expression. The sum of these two quantities provides the symbolic expression that is an upper bound on the number of times the edge can be reached. To compute these two quantities we use two mutually recursive procedures. Using the reachability bound TB(e, c) for every edge e = (l, dc, l′), we can provide a bound w on the number of visits of each vertex l ∈ absV(c). Formally: w = Σ{TB(e, c) | e = (l, _, _)}. Notice that w is a symbolic arithmetic expression over symbols in SC. In particular, it may contain the input variables, and so it may effectively be used as a function of the input and capture loop bounds in terms of these inputs. Theorem 5.1 (Soundness of the Reachability-Bound Estimation). Let c be a program and W_est(c) be its estimated weight set. Then, for each (x^l, w) ∈ W_est(c), τ0 ∈ T0(c), τ ∈ T, and n ∈ N, we have: if ⟨c, τ0⟩ →* ⟨skip, τ0 ++ τ⟩ and ⟨w, τ0⟩ ⇓ n, then cnt(x^l, τ) ≤ n. Notice that in this theorem the evaluation ⟨w, τ0⟩ ⇓ n is needed in order to obtain a concrete value n from the symbolic weight w, by specifying values for the input variables through τ0.
Example. Consider again Fig. 3(c): the estimated weight for the vertex labeled 5 is k, and this is a sound estimation. For an arbitrary τ0 ∈ T0(c), the symbolic weight k evaluates under τ0 to k(τ0), and the semantics-based weight w for the same vertex (as in Fig. 3(b)) satisfies w(τ0) = k(τ0).
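The cycle-membership test at the core of step (1) can be sketched as follows (our own simplification: an edge gets local bound 1 when it lies outside every cycle of absG, i.e., its target cannot reach its source; the variable-decrement cases of the real algorithm are collapsed into a placeholder).

```python
def reaches(graph, src, dst):
    """Iterative DFS: can dst be reached from src? (True when src == dst.)"""
    seen, stack = set(), [src]
    while stack:
        n = stack.pop()
        if n == dst:
            return True
        if n not in seen:
            seen.add(n)
            stack.extend(graph.get(n, []))
    return False

def local_bound(graph, edge):
    """Local bound of an edge: 1 if it is in no cycle, else a placeholder
    standing in for the decreasing-variable / infinity cases."""
    u, v = edge
    return 1 if not reaches(graph, v, u) else 'INF'

g = {1: [2], 2: [3, 2], 3: []}       # a self-loop at node 2
assert local_bound(g, (1, 2)) == 1   # not inside any cycle
assert local_bound(g, (2, 2)) == 'INF'
```

In the full algorithm the 'INF' branch is refined: when a variable in the edge's (or a cycle-breaking edge's) difference constraint decreases, that variable becomes the local bound instead of ∞.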

Adaptivity Upper Bound Computation
We estimate the adaptivity upper bound A_est(c) for a program c as the maximum query length over all finite walks in its estimated dependency graph G_est(c).
Notice that, differently from a walk on G_trace(c), a walk k ∈ WK(G_est(c)) on the graph G_est(c) does not rely on an initial trace. This is because, similarly to what we did for the weights in W_est(c) in the previous section, we use symbolic expressions over input variables. Accordingly, the adaptivity bound A_est(c) will also be a symbolic arithmetic expression over the input variables. With this symbolic expression we can prove the upper bound sound with respect to any initial trace. Theorem 5.2 (Soundness of A_est(c)). For every program c, its estimated adaptivity is a sound upper bound on its adaptivity.
Symbolic expressions as used in the weights are well suited to express symbolic bounds, but they make the computation of a maximal walk harder. Specifically, one has to face two challenges. The first is non-termination: a naive traversal strategy leads to non-termination because the weight of each vertex in G_est(c) is a symbolic expression containing input variables. We could try to use a depth-first search strategy, using the longest weighted path to approximate the longest finite walk, with the weight as its number of visits. However, this approach would face the second challenge: approximation. It would consistently and considerably over-approximate the adaptivity.
To address these two challenges we design an algorithm, AdaptBD, combining depth-first search and breadth-first search. The idea of this algorithm is to reduce the task of computing the longest walk to the task of computing local versions of the adaptivity on the maximal strongly connected components (SCCs) of the graph G_est(c), and then to compose these local results into the program adaptivity. The algorithm uses another algorithm, AdaptBD_SCC, recursively, in order to find the longest walk within a strongly connected component of G_est(c). The pseudocode of AdaptBD_SCC is given as Algorithm 1. For instance, consider a query vertex labeled 7 with a may-dependency on itself: its number of visits counts the execution times of the query command with label 7, which equals the number of loop iterations, i.e., the initial value of the loop counter, and the longest walk (the dotted arrows in the figure) traverses this vertex that many times. It is worth stressing that our algorithm still computes an accurate bound w.r.t. this definition, even if the definition itself is over-approximating. Indeed, AdaptFun gives us adaptivity 2 + k.
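The composition step of AdaptBD can be sketched under simplifying assumptions: once each SCC's local adaptivity is known (here given directly as scc_score), composing them reduces to a longest-path computation on the DAG of SCCs, done by memoized dynamic programming over successors.

```python
from functools import lru_cache

def longest(dag, scc_score):
    """Longest path in a DAG of SCCs, scoring each SCC by its local
    adaptivity; assumes dag maps each node to its successor list."""
    @lru_cache(maxsize=None)
    def best(node):
        succ = dag.get(node, [])
        return scc_score[node] + max((best(s) for s in succ), default=0)
    return max(best(n) for n in scc_score)

# Hypothetical condensation: SCC 'B' is a loop contributing 2 to the
# adaptivity, 'A' and 'C' contribute 1 each.
dag = {'A': ['B'], 'B': ['C'], 'C': []}
scc_score = {'A': 1, 'B': 2, 'C': 1}
assert longest(dag, scc_score) == 4
```

In AdaptFun the SCC scores are symbolic expressions (e.g., the k in 2 + k comes from an SCC whose local adaptivity is the loop bound), so the + and max above operate on symbolic terms rather than integers.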

IMPLEMENTATION
We implemented AdaptFun as a tool which takes a labeled program as input and outputs an upper bound on the program's adaptivity and the total number of queries that the program runs. This implementation consists of a module written in OCaml for the generation of the estimated graph G_est, and a module written in Python for the weight estimation algorithm (Section 5.1.3) and the algorithm AdaptBD (Section 5.2). The OCaml program takes the labeled program as input and outputs a version of the graph G_est (without weights) and the abstract transition graph absG for the program. These two objects are then fed into the Python program, which computes the weights and outputs the adaptivity bound and the query number. We evaluated this implementation on 25 examples, with performance summarized in Tab. 1. The 1st column is the example name. For each example, the 2nd column is its number of rounds of adaptivity; the AdaptFun outputs are in the 3rd and 4th columns: the adaptivity upper bound and the example's total number of query requests. The last 4 columns report AdaptFun's performance w.r.t. the number of program lines. We track the running time of the OCaml code for parsing the program and generating G_est, and the running times of the weight analysis and of AdaptBD in Python. We implemented two weight-estimation methods. The first one (referred to as I in Tab. 1) is the one we presented formally in Section 5.1. This method is accurate but slow: it does not perform well on big programs. The second one (referred to as II) is a relaxation of the first one. It is more efficient, but it over-approximates complicated loops. Based on the two implementations, AdaptFun produces two bounds on the adaptivity, corresponding to the left and right sides (I | II) in the 3rd, 4th and 6th columns. The first 5 programs are adapted

RELATED WORK
Dependency Definitions and Analysis. There is a vast literature on dependency definitions and dependency analysis. We consider a semantic definition of dependency which covers (intraprocedural) data and control dependency [8, 11, 30]. Our definition is inspired by classical works on traditional dependency analysis [12] and noninterference [20]. Formally, our definition is similar to the one by Cousot [10], which also identifies dependencies by considering differences in two execution traces. However, Cousot excludes some forms of implicit dependencies, e.g. the ones generated by empty observations, which we instead consider. Common tools to study dependencies are dependency graphs [17]. We use here a semantics-based approach to dependency graphs similar, for example, to the works by Austin and Sohi [5], Hammer et al. [23], and [24]. Our approach shares some similarities with the use of dependency graphs in works analyzing dependencies between events, e.g. in event programming. Memon [29] uses an event-flow graph, representing all the possible event interactions, where vertices are GUI events and edges represent pairs of events that can be performed immediately one after the other. In a similar way, we use edges to track the may-dependency between variables, looking at all the possible interactions. Arlt et al. [4] use weighted edges indicating a dependency between two events, e.g.
one event possibly reads data written by the other event, with the weight showing the intensity of the dependency (the quantity of data involved). We also use weights, but on vertices and with a different meaning: they are functions describing the number of times the vertices can be visited given an initial state. Differently from all these previous works, we use a dependency graph with the quantitative information needed to identify the length of chains of dependencies. Our weight estimation is inspired by works in complexity analysis and WCET. Specifically, it is inspired by works on reachability-bound analysis using program abstraction and invariant inference [21, 22, 34] and by work on invariant inference through cost equations and ranking functions [2, 3, 9, 18].
Generalization in Adaptive Data Analysis. Starting from the works by Dwork et al. [15] and Hardt and Ullman [25], several works have designed methods that ensure generalization for adaptive data analyses [7, 13, 14, 16, 27, 31, 35, 36]. Several of these works drew inspiration from differential privacy, a notion of formal data privacy. By limiting the influence that an individual can have on the result of a data analysis, even in adaptive settings, differential privacy can also be used to limit the influence that a specific data sample can have on the statistical validity of a data analysis. This connection actually goes in both directions, as discussed for example by Yeom et al. [38]. Considering this connection between generalization and privacy, it is not surprising that some of the works on programming-language techniques for privacy-preserving data analysis are related to our work. Adaptive Fuzz [37] is a programming framework for differential privacy that is designed around the concept of adaptivity. This framework is based on a typed functional language that distinguishes between several forms of adaptive and non-adaptive composition theorems, with the goal of achieving better upper bounds on the privacy cost. Adaptive Fuzz uses a type system and some partial evaluation to guarantee that programs respect differential privacy. However, it does not include any technique to bound the number of rounds of adaptivity. Lobo-Vesga et al.
[28] propose a language for differential privacy in which one can reason about the accuracy of programs in terms of confidence intervals on the error that the use of differential privacy can generate. These are akin to bounds on the generalization error. This language is based on a static analysis which, however, cannot handle adaptivity. The way we formalize the access to the data mediated by a mechanism is reminiscent of how the interaction with an oracle is modeled in the verification of security properties. As an example, the recent works by Barbosa et al. [6] and Aguirre et al. [1] use different techniques to track the number of accesses to an oracle. However, reasoning about the number of accesses is easier than estimating the adaptivity of these calls, as we do here.

CONCLUSION AND FUTURE WORKS
We presented AdaptFun, a program analysis useful to provide an upper bound on the adaptivity of a data analysis, as well as on the total number of queries asked. This estimation can help data analysts control the generalization errors of their analyses by choosing different algorithmic techniques based on the adaptivity. Besides, a key contribution of our work is the formalization of the notion of adaptivity for adaptive data analysis. We showed the applicability of our approach by implementing and experimentally evaluating our program analysis.
As future work, we plan to investigate the potential integration of AdaptFun in an adaptive data analysis framework like Guess and Check by Rogers et al. [31]. As we discussed, this framework is designed to support adaptive data analyses with limited generalization error. As our experiments show, this framework could benefit from the information provided by AdaptFun to provide more precise estimates and improved confidence intervals. Another direction we will explore is to make the upper bounds provided by AdaptFun more precise by integrating our algorithm with a path-sensitive approach.

Fig. 2 .
The generalization errors of two adaptive data analysis examples, under different choices of mechanisms. (a) Data analysis with 2 rounds of adaptivity, (b) data analysis with 400 rounds of adaptivity, (c) the same one as (a).

Fig. 3 .
(a) The program twoRounds(k), an example with two rounds of adaptivity, (b) the corresponding semantics-based dependency graph, (c) the estimated dependency graph from AdaptFun.

Fig. 7 .
(a) The simplified multiple-rounds example multiRoundsS, (b) the estimated dependency graph from AdaptFun.

Fig. 8 .
(a) The multi-rounds single example, (b) the semantics-based dependency graph.