Overlapping Batch Confidence Intervals on Statistical Functionals Constructed from Time Series: Application to Quantiles, Optimization, and Estimation

We propose a general-purpose confidence interval procedure (CIP) for statistical functionals constructed using data from a stationary time series. The procedures we propose are based on derived distribution-free analogues of the $\chi^2$ and Student's $t$ random variables for the statistical functional context, and hence apply in a wide variety of settings including quantile estimation, gradient estimation, M-estimation, CVaR estimation, and arrival process rate estimation, apart from more traditional statistical settings. Like the method of subsampling, we use overlapping batches of time series data to estimate the underlying variance parameter; unlike subsampling and the bootstrap, however, we assume that the implied point estimator of the statistical functional obeys a central limit theorem (CLT), which helps identify the weak asymptotics (called OB-x limits, x = I, II, III) of batched Studentized statistics. The OB-x limits, certain functionals of the Wiener process parameterized by the size of the batches and the extent of their overlap, form the essential machinery for characterizing dependence, and consequently the correctness of the proposed CIPs. The message from extensive numerical experimentation is that in settings where a functional CLT on the point estimator is in effect, using \emph{large overlapping batches} alongside OB-x critical values yields confidence intervals that are often of significantly higher quality than those obtained from more generic methods like subsampling or the bootstrap. We illustrate using examples from CVaR estimation, ARMA parameter estimation, and NHPP rate estimation; R and MATLAB code for OB-x critical values is available at~\texttt{web.ics.purdue.edu/~pasupath/}.

Remark. The initial segment $X_i, 1 \le i \le n$ of the observable process $\{X_i, i \ge 1\}$ is assumed to be a "collected dataset" or the output of a simulation that is exogenous to the problem at hand. We assume no facility for variance reduction, e.g., by changing the measure governing the process $\{X_i, i \ge 1\}$, as is sometimes possible in simulation settings. See [11,23,24,50,62,63] for variance-reduced confidence interval problems in the quantile context.

Motivation
Statistical functionals subsume a variety of interesting quantities arising in modern data settings, and are thus useful mathematical objects on which to construct confidence intervals. Consider, for instance, the following examples of statistical functionals. As a matter of notation, whenever relevant, $X : \Omega \to \mathbb{S}$ is an $\mathbb{S}$-valued random variable distributed according to $P$ and "obtainable" from the measure governing the observed stationary time series $\{X_i, i \ge 1\}$, and $x \in \mathbb{S}$ denotes an "outcome" in $\mathbb{S}$.

(e) Root Finding. For $g : \mathcal{X} \times \mathbb{S} \to \mathbb{R}$, $\theta(P) = x^* \in \mathcal{X}$ is such that $\int g(x^*, s) \, dP(s) = 0$.

(g) ARMA($p$, $q$). The ARMA($p$, $q$) process is a discrete-time real-valued process $\{X_t, t \ge 1\}$ having $p$ "autoregressive" parameters $\phi_i, i = 1, 2, \ldots, p$, and $q$ "moving average" parameters $\theta_j, j = 1, 2, \ldots, q$, and is expressed as
$$X_t = c + \sum_{i=1}^{p} \phi_i X_{t-i} + \epsilon_t + \sum_{j=1}^{q} \theta_j \epsilon_{t-j},$$
where $\{\epsilon_t, t \ge 1\}$ are independent and identically distributed (iid) random variables having mean zero and unit variance. Given observations $X_t, t = 1, 2, \ldots, n$ of the process $\{X_t, t \ge 1\}$, the estimators $\hat{c}$, $\hat{\phi}_i, i = 1, 2, \ldots, p$, and $\hat{\theta}_j, j = 1, 2, \ldots, q$ of the parameters $c$, $\phi_i, i = 1, 2, \ldots, p$, and $\theta_j, j = 1, 2, \ldots, q$ are statistical functionals that can be estimated by minimizing the sum of squared residuals [14]:
$$\text{minimize:} \quad \sum_{t=1}^{n} e_t^2, \qquad \text{where the residuals are given by} \quad e_t = X_t - c - \sum_{i=1}^{p} \phi_i X_{t-i} - \sum_{j=1}^{q} \theta_j e_{t-j}.$$
(A code sketch of this least-squares fit appears at the end of this subsection.) In addition to the above examples, a wide variety of quantities arising within classical statistics, e.g., higher-order moments, ratios of moments, clusters obtained through $k$-means clustering, the $\alpha$-trimmed mean, the Mann-Whitney functional, and the simplicial depth functional, are all statistical functionals, making the question of constructing confidence intervals on statistical functionals of wide interest. (See [75, Chapter 7] and [56, Chapter 6] for other examples and a full treatment of statistical functionals.)

Remark. Whereas $\theta(P)$ in some of the examples listed above is naturally $\mathbb{R}^d$-valued with $d > 1$, e.g., (g), the treatment in this paper is entirely real-valued, that is, $\theta(P) \in \mathbb{R}$. Extending our methods from $\mathbb{R}$ to $\mathbb{R}^d$ is straightforward, but further extension into a function space involves non-trivial technical aspects.
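To make example (g) concrete, the sketch below fits an ARMA($p$, $q$) model by minimizing the sum of squared residuals displayed above. This is a minimal Python illustration under stated assumptions: the helper names (arma_sse, fit_arma) are ours rather than the paper's, a generic Nelder-Mead optimizer stands in for the specialized routines of [14], and pre-sample values of $X_t$ and $e_t$ are set to zero.

```python
import numpy as np
from scipy.optimize import minimize

def arma_sse(params, x, p, q):
    """Sum of squared residuals e_t = x_t - c - sum phi_i x_{t-i} - sum theta_j e_{t-j}."""
    c, phi, theta = params[0], params[1:1 + p], params[1 + p:]
    e = np.zeros(len(x))
    for t in range(len(x)):
        ar = sum(phi[i] * x[t - 1 - i] for i in range(p) if t - 1 - i >= 0)
        ma = sum(theta[j] * e[t - 1 - j] for j in range(q) if t - 1 - j >= 0)
        e[t] = x[t] - c - ar - ma
    return np.sum(e ** 2)

def fit_arma(x, p, q):
    """Least-squares ARMA fit; returns [c_hat, phi_hats..., theta_hats...]."""
    x0 = np.zeros(1 + p + q)                 # start at c = phi = theta = 0
    res = minimize(arma_sse, x0, args=(x, p, q), method="Nelder-Mead")
    return res.x
```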

Organization of the Paper
In the following section, we discuss literature on confidence intervals with a view toward providing perspective on how the proposed methods fit within the existing literature. This is followed by Section 3, where we present the main idea underlying the interval estimators we propose, along with a synopsis of results. Section 4 includes key assumptions, followed by Sections 5-7, which present the theorems corresponding to the OB-I, OB-II, and OB-III limits. In Section 8, we present a brief discussion of some implementation questions that we consider important. We end with Section 9, where numerical illustrations using three different contexts demonstrate the effectiveness of large batch OB-I and OB-II confidence intervals.

EXISTING LITERATURE, PERSPECTIVE, AND CONTRIBUTION
In this section, we provide an overview of CIPs in general through a taxonomy that categorizes CIPs into those that assume a CLT is in effect and those that do not. We discuss CLT-based methods, followed by further perspective and a summary of the current paper's position within this landscape. (We include a concise description of the two most famous non-CLT-based methods, subsampling and bootstrapping, in Appendix A.)

Fig. 1. A taxonomy of methods for constructing confidence intervals on statistical functionals, categorized by whether a CLT on $\hat{\theta}_n$ is known to exist. When a CLT is not known to exist, one resorts to generic methods, e.g., subsampling [68] or the bootstrap [20,33,68]. "CLT-based" methods split into consistent methods, e.g., small batch ($\beta = 0$) OB-x, which construct a consistent estimator of the variance constant $\sigma$, and cancellation methods, e.g., large batch ($\beta > 0$) OB-x, which allow the use of large batches and construct ratio estimators that "cancel out" the variance constant $\sigma$.
Analogous to the taxonomy [42] of CIPs on the steady-state mean of a real-valued process, it is instructive to categorize CIPs for statistical functionals based on whether a central limit theorem of the form
$$\sqrt{n}\left(\hat{\theta}_n - \theta(P)\right) \xrightarrow{d} \sigma N(0,1) \quad (2)$$
exists. In (2), $\hat{\theta}_n$ is an implied point estimator of $\theta(P)$ constructed from the time series $\{X_i, i \ge 1\}$, $N(0,1)$ is the standard normal random variable, and $\sigma \in (0, \infty)$ is an unknown parameter often called the variance constant. Further, and as depicted in Figure 1, a CIP that assumes (2) may be either a consistent method, by which we mean that the CIP constructs another observable process $\{\hat{\sigma}_n, n \ge 1\}$ from $\{X_i, i \ge 1\}$ to consistently estimate $\sigma$, that is,
$$\hat{\sigma}_n \xrightarrow{p} \sigma, \quad (3)$$
or a cancellation method, by which we mean that the CIP constructs a process $\{\hat{\eta}_n, n \ge 1\}$ such that
$$\left(\sqrt{n}\left(\hat{\theta}_n - \theta(P)\right), \hat{\eta}_n\right) \xrightarrow{d} \left(\sigma N(0,1), \sigma V\right) \text{ as } n \to \infty, \quad (4)$$
and $V$ is a well-defined non-vanishing random variable whose distribution is free of unknown quantities, e.g., $\sigma$ and $\theta(P)$. (The canonical $\sqrt{n}$ scaling in (4) can be generalized to other scalings, as considered in [47].) In consistent methods, since (2) and (3) hold, Slutsky's theorem (B.2) assures us that an asymptotically valid two-sided $(1-\alpha)$ confidence interval on $\theta(P)$ is
$$\left[\hat{\theta}_n - z_{1-\alpha/2}\,\frac{\hat{\sigma}_n}{\sqrt{n}},\; \hat{\theta}_n + z_{1-\alpha/2}\,\frac{\hat{\sigma}_n}{\sqrt{n}}\right], \quad (5)$$
where $z_{1-\alpha/2}$ is the $1-\alpha/2$ quantile of the standard normal distribution. It is in this sense that a consistent method essentially reduces the confidence interval construction problem to the often nontrivial problem [3,11,43] of consistently estimating the variance parameter $\sigma$. Various consistent methods exist in the steady-state mean context.
For example, the regenerative method [13,54], the spectral procedure [4,16-18,74] with certain restrictions on the bandwidth, and the batch means procedure, where the variance parameter is estimated using one of various well-established methods, e.g., nonoverlapping batch means (NBM) [2], overlapping batch means (OBM) [2], or the Cramér-von Mises (CvM) estimator [2], provided the batch size tends to infinity in such a way that the batch size expressed as a fraction of the total data size tends to zero. See [1,2] and the references therein for a thorough account of estimating the variance parameter associated with a steady-state real-valued process.
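To make the consistent-method recipe concrete, the sketch below pairs the interval (5) with a nonoverlapping batch means (NBM) estimator of the variance constant, for the special case of the steady-state (sample) mean. This is a minimal Python sketch; the function names are ours, and the batch size m must grow with n (with m/n tending to zero) for the consistency in (3) to hold.

```python
import numpy as np
from scipy.stats import norm

def nbm_sigma2(x, m):
    """NBM estimator of sigma^2 for the sample-mean functional."""
    b = len(x) // m                                # number of full batches
    means = x[:b * m].reshape(b, m).mean(axis=1)   # batch means
    return m * means.var(ddof=1)

def consistent_ci(x, m, alpha=0.05):
    """Two-sided (1 - alpha) interval (5) with NBM variance estimation."""
    n = len(x)
    hw = norm.ppf(1 - alpha / 2) * np.sqrt(nbm_sigma2(x, m) / n)
    return x.mean() - hw, x.mean() + hw
```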
In contrast to consistent methods, cancellation methods are based on the important idea that $\sigma$ need not be estimated consistently to construct a valid confidence interval on $\theta(P)$. This seems to have been first observed in the seminal account [71] introducing standardized time series in the context of constructing confidence intervals on the steady-state mean. Specifically, in cancellation methods, since (2) and (4) hold, and $V$ is non-vanishing, applying the continuous mapping theorem [6] leads to "cancellation" of $\sigma$ in the sense that
$$\frac{\sqrt{n}\left(\hat{\theta}_n - \theta(P)\right)}{\hat{\eta}_n} \xrightarrow{d} \frac{N(0,1)}{V},$$
leading to the two-sided $(1-\alpha)$ confidence interval
$$\left[\hat{\theta}_n - q_{1-\alpha/2}\,\frac{\hat{\eta}_n}{\sqrt{n}},\; \hat{\theta}_n + q_{1-\alpha/2}\,\frac{\hat{\eta}_n}{\sqrt{n}}\right],$$
where $q_{\gamma}$ is the $\gamma$-quantile of $N(0,1)/V$. If constructing a consistent estimator of $\sigma$ is the principal challenge in consistent methods, selecting $\hat{\eta}_n$ and characterizing $V$ turns out to be the principal challenge in cancellation methods. Cancellation methods have been studied [9,46,49,61,71] in the context of constructing confidence intervals on the steady-state mean, and more recently for quantiles; see the exceptionally well-written articles [8,25].

Further Perspective and Summary of Contribution
The uniqueness of any CIP (including subsampling, the bootstrap, and what we propose here) stems from the manner in which the procedure approximates the sampling distribution of its chosen statistic. So, while subsampling uses the empirical cdf in (83) formed from subsamples, and the bootstrap uses resampling, the methods proposed in this paper approximate the sampling distribution of the Studentized statistic $(\hat{\theta}_n - \theta(P))/\hat{\sigma}_n$ by characterizing its weak limit. In particular, we assume the existence of a functional CLT governing $\hat{\theta}_n$ and exploit the resulting structure to characterize the weak limit of $(\hat{\theta}_n - \theta(P))/\hat{\sigma}_n$.
To be clear, neither subsampling nor the bootstrap assumes a CLT on $\hat{\theta}_n$, and this is their strength. (Specifically, the bootstrap and subsampling only assume the existence of a scaled weak limit on $\hat{\theta}_n$; they do not assume, for instance, that the weak limit appearing in (82) is standard normal.) However, our argument is that there exist numerous important contexts where a functional CLT on $\hat{\theta}_n$ holds and can be usefully exploited if we can identify the weak limit of the statistic in use. For example, vis-à-vis subsampling, knowledge of the weak limit allows replacing the empirical quantiles in (86) by their limiting counterparts, in the process allowing us to dispense with subsampling's key stipulation that batch sizes be small, that is, $m_n/n \to 0$.
To further clarify, we now provide a summary of contribution.
(1) We propose three overlapping batch (OB-x, x = I, II, III) CIPs for statistical functionals constructed using data from a stationary time series, differing in their choice of centering variable and variance estimator.
(2) We derive the weak limits (called OB-x limits, x = I, II, III) of the statistic underlying each of the proposed OB CIPs.
Of these, the OB-II limit and its bias-correction factor (Theorem 6.1) have not appeared in the literature, even in the steady-state mean context, to the best of our knowledge; OB-II might prove especially relevant in computationally intensive settings. The OB-I and OB-III limits (Theorem 5.1 and Theorem 7.1, respectively) have appeared in the literature, but only in the steady-state mean [1,2] and quantile [8] contexts. The asymptotic moment expression for the OB-I limit (Theorem 5.2) has not appeared elsewhere, but the corresponding result for the special case of fully overlapping batches in the steady-state mean context appeared in [18].
(3) To aid future investigation of computationally intensive contexts, our analysis of overlapping batches is general in the sense that it introduces an offset parameter $d_n$ whose value connotes the extent of batch overlap, e.g., $d_n = 1$ connotes fully overlapping batches and $d_n \ge m_n$ connotes non-overlapping batches, with $d_n > m_n$ corresponding to what has been called spaced batch means [37]. We shall see (Theorem 5.2) that the extent of overlap features prominently in the asymptotic variance of the variance estimator.
(4) Extensive numerical experimentation over a variety of applications indicates that cancellation methods resulting from the use of large batches, that is, when $m_n/n \to \beta > 0$, exhibit consistently better behavior. The aspects responsible for such better behavior are not yet fully understood and should form a topic of future investigation.
(5) We provide access to code (that includes a critical value calculation module for OB-I, OB-II, and OB-III) for constructing confidence intervals on a statistical functional using our recommended OB-x methods.

MAIN IDEA AND SYNOPSIS OF RESULTS
To set the stage for precisely describing the proposed confidence interval procedure, consider partitioning the available "data" $X_1, X_2, \ldots, X_n$ into $b_n$ possibly overlapping batches, each of size $m_n$, as shown in Figure 2. The first of these batches consists of observations $X_1, X_2, \ldots, X_{m_n}$, the second of observations $X_{d_n+1}, X_{d_n+2}, \ldots, X_{d_n+m_n}$, and so on, with the last batch consisting of observations $X_{(b_n-1)d_n+1}, X_{(b_n-1)d_n+2}, \ldots, X_n$. The quantity $d_n \ge 1$ represents the offset between batches, with the choice $d_n = 1$ corresponding to "fully overlapping" batches and any choice $d_n \ge m_n$ corresponding to "non-overlapping" batches. Notice then that the offset $d_n$ and the number of batches $b_n$ are related as
$$b_n = \frac{n - m_n}{d_n} + 1. \quad (6)$$
Suppose that the batch size $m_n$ and the number of batches $b_n$ are chosen so that the following limits exist:
$$\beta := \lim_{n \to \infty} \frac{m_n}{n}, \qquad b_\infty := \lim_{n \to \infty} b_n. \quad (7)$$
Note that $\beta = 0$ and $b_\infty = \infty$ are allowed in (7). We will sometimes refer to $\beta$ as the asymptotic batch size and to $b_\infty$ as the asymptotic number of batches. Also, we will refer to $\beta = 0$ as the small batch regime, and to $\beta > 0$ as the large batch regime.
We shall see shortly that the sectioning estimator $\hat{\theta}_n$ appearing in (8), constructed from the entire dataset $X_1, X_2, \ldots, X_n$, is a candidate for centering the confidence interval that we construct. An alternative to the sectioning estimator is the batching estimator [63], obtained by averaging the batch point estimators $\hat{\theta}_{m_n,j}, j = 1, 2, \ldots, b_n$, that is,
$$\bar{\theta}_n := \frac{1}{b_n} \sum_{j=1}^{b_n} \hat{\theta}_{m_n,j},$$
where $\hat{\theta}_{m_n,j}$ denotes the point estimator constructed using only the observations in the $j$-th batch. The sectioning and batching point estimators are the two natural choices for "centering" the confidence intervals on $\theta(P)$. We will see that confidence intervals constructed with the batching estimator might be especially useful in computationally intensive contexts.
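The batching scheme and the two centering candidates are easy to express in code. A minimal Python sketch (names ours): given a batch size m, an offset d, and an arbitrary estimator function, it forms the $b_n$ batch estimates and returns the sectioning and batching estimators.

```python
import numpy as np

def batch_estimators(x, m, d, estimator):
    """Apply `estimator` to each batch x[s : s + m], s = 0, d, 2d, ..."""
    return np.array([estimator(x[s:s + m])
                     for s in range(0, len(x) - m + 1, d)])

def sectioning_and_batching(x, m, d, estimator):
    theta_hat = estimator(x)                                 # sectioning
    theta_bar = batch_estimators(x, m, d, estimator).mean()  # batching
    return theta_hat, theta_bar

# e.g., n = 1000, m = 100, d = 1 yields b_n = (n - m)/d + 1 = 901 batches
```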

Estimating the Variance Constant $\sigma^2$
Since the variance constant $\sigma^2$ (defined in (9)) is a measure of the inherent variability of the point estimator $\hat{\theta}_n$, the estimation of $\sigma^2$ plays a key role in the confidence intervals we construct. The expression in (9) suggests that a natural estimator of $\sigma^2$ is the sample variance of $\hat{\theta}_{m_n,j}, j = 1, 2, \ldots, b_n$ defined in (10), after appropriate scaling:
$$\hat{\sigma}^2_{\text{OB-I}}(m_n, d_n) := \frac{m_n}{c_1(\beta, b_\infty)} \, \frac{1}{b_n} \sum_{j=1}^{b_n} \left(\hat{\theta}_{m_n,j} - \hat{\theta}_n\right)^2, \quad (12)$$
where $\beta$ defined in (7) is the limiting batch size. It will become clear from our later analysis that $c_1(\beta, b_\infty)$ appearing in (12) is a "bias-correction" constant introduced to make $\hat{\sigma}^2_{\text{OB-I}}(m_n, d_n)$ asymptotically unbiased. Notice that the estimator $\hat{\sigma}^2_{\text{OB-I}}(m_n, d_n)$ of the variance constant $\sigma^2$ appearing in (12) uses the sectioning estimator $\hat{\theta}_n$ when computing the sample variance. An alternative is to use the batching estimator $\bar{\theta}_n$ in place of the sectioning estimator to obtain the second candidate estimator of the variance constant $\sigma^2$:
$$\hat{\sigma}^2_{\text{OB-II}}(m_n, d_n) := \frac{m_n}{c_2(\beta, b_\infty)} \, \frac{1}{b_n} \sum_{j=1}^{b_n} \left(\hat{\theta}_{m_n,j} - \bar{\theta}_n\right)^2,$$
where, as we shall see in Theorem 6.1, the bias-correction constant $c_2(\beta, b_\infty)$ has a more complicated form, and $b_\infty$ defined in (7) is the limiting number of batches.
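In code, the OB-I estimator is the scaled spread of the batch estimates around the sectioning estimate. A minimal sketch, assuming the sample-variance form of (12) with the bias-correction constant $c_1(\beta, b_\infty) = 1 - \beta$ of Theorem 5.1; the OB-II version differs only in centering at the batching estimator and in its (more complicated) bias-correction constant.

```python
import numpy as np

def sigma2_ob1(x, m, d, estimator):
    """OB-I estimator of the variance constant sigma^2, as in (12)."""
    n = len(x)
    theta_hat = estimator(x)                        # sectioning estimator
    thetas = np.array([estimator(x[s:s + m])        # batch estimators
                       for s in range(0, n - m + 1, d)])
    c1 = 1.0 - m / n                                # c_1 = 1 - beta
    return (m / c1) * np.mean((thetas - theta_hat) ** 2)
```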

Structure of the Proposed Confidence Intervals
The proposed interval has the same elements as a classical confidence interval, namely: (A) a "centering" variable, e.g., the sectioning estimator $\hat{\theta}_n \in \mathbb{R}$ or the batching estimator $\bar{\theta}_n \in \mathbb{R}$, as described in Section 3.1; (B) a point estimator of the asymptotic variance $\sigma^2$, e.g., $\hat{\sigma}^2_{\text{OB-x}}(m_n, d_n)$, x = I, II, III; and (C) a statistic whose weak limit supplies the critical values associated with the confidence interval.
Once the elements in (A)-(C) are specified, a $(1-\alpha)$ confidence interval on $\theta(P)$ can then be constructed in the usual way.
For example, when the sectioning estimator $\hat{\theta}_n$ is used in (A), the variance estimator $\hat{\sigma}^2_{\text{OB-I}}(m_n, d_n)$ is used in (B), and the Studentized root
$$T_{\text{OB-I}}(m_n, d_n) := \frac{\sqrt{n}\left(\hat{\theta}_n - \theta(P)\right)}{\hat{\sigma}_{\text{OB-I}}(m_n, d_n)}$$
is used in (C), we obtain the (two-sided) confidence interval
$$\left[\hat{\theta}_n - t_{\text{OB-I},1-\alpha/2}(\beta, b_\infty)\,\frac{\hat{\sigma}_{\text{OB-I}}(m_n, d_n)}{\sqrt{n}},\; \hat{\theta}_n + t_{\text{OB-I},1-\alpha/2}(\beta, b_\infty)\,\frac{\hat{\sigma}_{\text{OB-I}}(m_n, d_n)}{\sqrt{n}}\right], \quad (19)$$
where $t_{\text{OB-I},\gamma}(\beta, b_\infty)$ is the $\gamma$-quantile (or critical value) of the random variable $T_{\text{OB-I}}(\beta, b_\infty)$. (A one-sided confidence interval analogous to (19) is straightforward.) Similarly, using the batching estimator $\bar{\theta}_n$ in (A), the variance estimator $\hat{\sigma}^2_{\text{OB-II}}(m_n, d_n)$ in (B), and the Studentized root
$$T_{\text{OB-II}}(m_n, d_n) := \frac{\sqrt{n}\left(\bar{\theta}_n - \theta(P)\right)}{\hat{\sigma}_{\text{OB-II}}(m_n, d_n)}$$
in (C), we obtain our second proposed (two-sided) confidence interval
$$\left[\bar{\theta}_n - t_{\text{OB-II},1-\alpha/2}(\beta, b_\infty)\,\frac{\hat{\sigma}_{\text{OB-II}}(m_n, d_n)}{\sqrt{n}},\; \bar{\theta}_n + t_{\text{OB-II},1-\alpha/2}(\beta, b_\infty)\,\frac{\hat{\sigma}_{\text{OB-II}}(m_n, d_n)}{\sqrt{n}}\right], \quad (21)$$
where $t_{\text{OB-II},\gamma}(\beta, b_\infty)$ is the $\gamma$-quantile (or critical value) of the random variable $T_{\text{OB-II}}(\beta, b_\infty)$.
And, finally, using the sectioning estimator $\hat{\theta}_n$ in (A), the variance estimator $\hat{\sigma}^2_{\text{OB-III}}(m_n, d_n)$ in (B), and the Studentized root
$$T_{\text{OB-III}}(m_n, d_n) := \frac{\sqrt{n}\left(\hat{\theta}_n - \theta(P)\right)}{\hat{\sigma}_{\text{OB-III}}(m_n, d_n)}$$
in (C), we obtain our third proposed (two-sided) confidence interval
$$\left[\hat{\theta}_n - t_{\text{OB-III},1-\alpha/2}(\beta, b_\infty)\,\frac{\hat{\sigma}_{\text{OB-III}}(m_n, d_n)}{\sqrt{n}},\; \hat{\theta}_n + t_{\text{OB-III},1-\alpha/2}(\beta, b_\infty)\,\frac{\hat{\sigma}_{\text{OB-III}}(m_n, d_n)}{\sqrt{n}}\right], \quad (23)$$
where $t_{\text{OB-III},\gamma}(\beta, b_\infty)$ is the $\gamma$-quantile (or critical value) of the random variable $T_{\text{OB-III}}(\beta, b_\infty)$.
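Assembling interval (19) is then mechanical once a critical value is available. A minimal sketch, assuming sigma2_ob1 from the sketch in Section 3.2; the critical value must come from the OB-I tables or from the authors' distributed code, so the 1.76 in the usage comment (the approximate 0.95-quantile for $\beta = 0.1$, $b_\infty = \infty$ quoted in Section 8) is only a placeholder.

```python
import numpy as np

def ob1_ci(x, m, d, estimator, t_crit):
    """Two-sided interval (19), centered at the sectioning estimator."""
    n = len(x)
    theta_hat = estimator(x)
    hw = t_crit * np.sqrt(sigma2_ob1(x, m, d, estimator) / n)
    return theta_hat - hw, theta_hat + hw

# Example: 90% two-sided interval for a mean, beta = m/n = 0.1, d = 1:
# rng = np.random.default_rng(0)
# lo, hi = ob1_ci(rng.normal(size=1000), 100, 1, np.mean, t_crit=1.76)
```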
Other combinations of the elements in (A)-(C) are of course possible; any such change may cause a corresponding change in the weak limits along with the critical values, a line of investigation we do not pursue.
The preceding discussion should emphasize that the Studentized root $T_{\text{OB-x}}(m_n, d_n)$, x = I, II, III forms the essential element of the confidence intervals we propose. And, since the exact distribution of $T_{\text{OB-x}}(m_n, d_n)$, x = I, II, III is unknown in general, the outlined procedure approximates its distribution by the (purported) weak limit $T_{\text{OB-x}}(\beta, b_\infty)$, x = I, II, III.

Synopsis of Results
The proposed intervals (19), (21), and (23) rely crucially on the existence of the following weak limits:
$$T_{\text{OB-x}}(m_n, d_n) \xrightarrow{d} T_{\text{OB-x}}(\beta, b_\infty), \quad \text{x = I, II, III},$$
where $\beta$ and $b_\infty$ are the limiting batch size and number of batches as defined in (7). The existence of the weak limits $T_{\text{OB-x}}$, x = I, II, III, however, needs to be established, and their characterization will occupy much of the rest of the paper.
Furthermore, on our way to characterizing $T_{\text{OB-x}}$, x = I, II, III, we will also establish the weak limits of the estimators $\hat{\sigma}^2_{\text{OB-x}}(m_n, d_n)$, x = I, II, III of the variance constant $\sigma^2$. The random variables $T_{\text{OB-x}}$, x = I, II, III and $\chi^2_{\text{OB-x}}$, x = I, II, III should be seen as distribution-free statistical functional analogues of the Student's $t$ and $\chi^2$ random variables, respectively.
As summarized in Table 1, the nature of $T_{\text{OB-x}}$, x = I, II, III (and of $\chi^2_{\text{OB-x}}$, x = I, II, III) depends on the limiting batch size $\beta$ and the limiting number of batches $b_\infty$. In particular, depending on whether $\beta = 0$ (small batch regime) or $\beta > 0$ (large batch regime), the statistics behave quite differently. For example, the small batch regime ($\beta = 0$) produces the normal limit along with consistent estimation of $\sigma^2$, whereas the large batch regime ($\beta > 0$) produces limits that are functionals of the Wiener process along with no consistent estimation of $\sigma^2$. The asymptotic number of batches $b_\infty$ affects the nature of the limiting distributions in the large batch regime. See Table 1 for a synopsis.

Table 1. A synopsis of results. In the service of constructing confidence intervals on $\theta(P)$, we construct three Studentized roots $T_{\text{OB-x}}(m_n, d_n)$, x = I, II, III, obtained using combinations of candidates for the point estimator of $\theta(P)$ and for the point estimator of $\sigma^2$. The three Studentized roots give rise to the OB-x, x = I, II, III weak limits, whose nature depends on the limiting batch size $\beta := \lim_{n\to\infty} m_n/n$ and the limiting number of batches $b_\infty := \lim_{n\to\infty} b_n$. Expressions for the weak limits $T_{\text{OB-x}}(\beta, b_\infty)$, x = I, II, III appear in Theorems 5.1-7.1. Critical values for the OB-I and OB-II distributions appear on page 17 and page 24.

KEY ASSUMPTIONS
In this section, we state and comment on various regularity assumptions that will be invoked when proving the technical results. Not all of these assumptions are "standing assumptions," in that some of the results to follow (especially when $\beta = 0$) will need only a subset of them.
Assumption 4 (Asymptotic Moment Existence). The sequence $\{\hat{\theta}_n, n \ge 1\}$ of sectioning estimators is such that $\mathbb{E}\left[n\left(\hat{\theta}_n - \theta(P)\right)^2\right] \to \sigma^2$ as $n \to \infty$, where $\sigma$ is the constant appearing in Assumption 3.
Assumption 5 (Strong Invariance). The sequence $\{\hat{\theta}_n, n \ge 1\}$ of sectioning estimators satisfies the following strong invariance principle: on a rich enough probability space, there exist a standard Wiener process $\{W(t), t \ge 0\}$, a constant $\lambda > 0$, and a real-valued random variable $\Gamma$ with $\mathbb{E}[\Gamma] < \infty$ such that, almost surely,
$$\left| t\left(\hat{\theta}_{\lfloor t \rfloor} - \theta(P)\right) - \sigma W(t) \right| \le \Gamma\, t^{1/2 - \lambda} \quad \text{for all } t \ge 1. \quad (27)$$
Assumption 1 on the stationarity of the sequence $\{X_i, i \ge 1\}$ is mild and standard in settings where a confidence interval is sought. Assumption 2 on strong mixing is a weak asymptotic independence condition imposed to rigorize the intuitive idea that the dependence between events formed from subsets of the sequence $\{X_i, i \ge 1\}$ in the far past and the far future decays to zero as their separation diverges. Assumption 1 and Assumption 2 are used only in our results involving small batches, that is, when $m_n/n \to 0$.
As discussed in the introductory part of the paper, Assumption 3 on the existence of a CLT on $\hat{\theta}_n$ is fundamental to the methods presented here. The inequality in (27) of Assumption 5, sometimes called "strong invariance," essentially stipulates that the scaled process $\sqrt{t}\,(\hat{\theta}_{\lfloor t \rfloor} - \theta(P))$, $t \le n$, can be approximated uniformly, to within $t^{-\lambda}$ almost surely, by a suitably scaled standard Wiener process on a rich enough probability space. As argued in Philipp and Stout [66] and Glynn and Iglehart [45], Assumption 5 holds for a variety of weakly dependent processes. See [15] for strong invariance theorems on partial sums, empirical processes, and quantile processes.
As will become evident, Assumption 5 is used only in proving results that involve large batches, that is, when $m_n/n \to \beta > 0$. We believe all these results will still hold with a functional CLT on $\hat{\theta}_n$ in place of Assumption 5. (Loosely, strong approximation ⇒ functional CLT ⇒ CLT; see, for instance, [44,72].) Despite the increased generality that a functional CLT affords, we have chosen to remain with Assumption 5 since the resulting proofs are more intuitive.
Remark. It is likely that Assumption 5 can be relaxed, e.g., by replacing the canonical $\sqrt{n}$ scaling with $n^{\gamma}$ for some known $\gamma > 0$, without changing most of the results reported in this paper. Such generalization is part of an ongoing investigation and entails identifying the alterations needed in the technical conditions involving the batch size and the number of batches.

THE OB-I LIMIT
In this section, we characterize the weak limit of $T_{\text{OB-I}}(m_n, d_n)$ described in Section 3.4. Along the way, we also characterize the asymptotic behavior of the variance estimator $\hat{\sigma}^2_{\text{OB-I}}(m_n, d_n)$. The ensuing Section 5.1 treats the $\beta := \lim_{n\to\infty} m_n/n > 0$ (large batch) regime, and Section 5.3 treats the $\beta = 0$ (small batch) regime.

Large Batch Regime for OB-I
Theorem 5.1 that follows asserts that $\hat{\sigma}^2_{\text{OB-I}}(m_n, d_n)/\sigma^2$ and $T_{\text{OB-I}}(m_n, d_n)$ converge weakly to certain functionals of the Wiener process that we denote $\chi^2_{\text{OB-I}}(\beta, b_\infty)$ and $T_{\text{OB-I}}(\beta, b_\infty)$, respectively. Importantly, Theorem 5.1 needs the strong invariance Assumption 5 to hold so that the dependence across batches can be characterized precisely.

Theorem 5.1 (OB-I Large Batch Regime). Suppose Assumption 5 holds, and that $\beta = \lim_{n\to\infty} m_n/n \in (0, 1)$. Assume also that the limits in (7) exist. Then the weak limits displayed in (30) hold, that is, $\hat{\sigma}^2_{\text{OB-I}}(m_n, d_n)/\sigma^2 \xrightarrow{d} \chi^2_{\text{OB-I}}(\beta, b_\infty)$ and $T_{\text{OB-I}}(m_n, d_n) \xrightarrow{d} T_{\text{OB-I}}(\beta, b_\infty)$.

The following theorem characterizes the (asymptotic) moments of the OB-I variance estimator $\hat{\sigma}^2_{\text{OB-I}}(m_n, d_n)$.
and $\lambda > 0$ is the constant appearing in Assumption 5.
(a) The estimator $\hat{\sigma}^2_{\text{OB-I}}(m_n, d_n)$ does not consistently estimate the variance parameter $\sigma^2$, but converges weakly to the product of $\sigma^2$ and the random variable $\chi^2_{\text{OB-I}}(\beta, b_\infty)$ appearing in (30). As in all cancellation methods, the weak limit of $T_{\text{OB-I}}(m_n, d_n)$ does not involve $\sigma^2$ since it "cancels out." We slightly abuse notation for ease of exposition and use (30) to define the $T_{\text{OB-I}}$ random variable. (b) The factor $c_1(\beta, b_\infty) = 1 - \beta$ is a "bias correction" factor introduced to ensure that $\hat{\sigma}^2_{\text{OB-I}}(m_n, d_n)$ is asymptotically unbiased.
(c) The expression for $\chi^2_{\text{OB-I}}(\beta, b_\infty)$ in Theorem 5.1 seems to have appeared first in [1, pp. 326] for the steady-state mean context and assuming fully overlapping batches, that is, for $d_n = 1$ and $b_\infty = \infty$. (The reader should be aware that while $b_\infty$ in the current paper refers to the limiting number of batches, the corresponding quantity in [1] refers to the ratio $n/m_n \to \beta^{-1}$. Furthermore, a simple re-scaling of the Wiener process is needed to see that the expression appearing in Theorem 5.1 and that in [1, pp. 326] are equivalent.) Similarly, the special case of fully overlapping batches and $b_\infty = \infty$ for $\text{Var}(\hat{\sigma}^2_{\text{OB-I}}(m_n, d_n))$ in Theorem 5.2 appears in [18, pp. 290] for the context of the steady-state mean.
(d) We can show through calculus on (32) that $\inf_{\beta \in (0,1)} \lim_{n\to\infty} \text{Var}(\hat{\sigma}^2_{\text{OB-I}}(m_n, d_n)) = 0$ is approached as $\beta \to 0$. (The infimum is not attained, although there is a local minimum around $\beta = 0.467$.) This suggests using small batches, but that is counter to what is seen in practice. Our numerical experience here and elsewhere suggests rather strongly that the asymptotic batch size $\beta$ has a "first-order effect" on coverage probability (with large $\beta$ being better) and a "second-order effect" on expected half-width (with large $\beta$ being bad), whereas the asymptotic number of batches $b_\infty$ has a "first-order effect" on expected half-width (with large $b_\infty$ good) but a "second-order effect" on coverage probability. These arguments suggest that using (32) as the sole means of judging the quality of confidence intervals is misleading.
(e) The offset parameter $d_n$ comes into play through its effect on the limiting number of batches $b_\infty$. Specifically, notice from (6) that $b_\infty = \infty$ if $d_n/n \to 0$, and $b_\infty < \infty$ if $\lim_{n\to\infty} d_n/n > 0$ (assuming the limit exists).
Now, we see that, except on a set of measure zero in the probability space implied by Assumption 5, there exists $\Gamma(\omega)$ such that (40) holds uniformly in $j$. Furthermore, due to Theorem B.4, for any given $\epsilon > 0$, except on a set of measure zero in the probability space implied by Assumption 5, there exists $n_0(\epsilon, \omega)$ such that (41) holds for all $n \ge n_0(\epsilon, \omega)$, uniformly in $j$, after ignoring non-integralities.
Plugging (40) and (41) into (39), we arrive at (42). Notice that the second term appearing on the right-hand side of (42) is dominant and goes to zero almost surely. Let us now calculate the weak limit of the remaining term.
Let us now prove that the second statement in (30) holds. From Assumption 5, we have the bound in (45), where $\Gamma$ is a well-defined random variable with finite mean, and (46) holds. We can then write the decomposition in (47), where the quantities introduced in (36) appear and the error terms go to zero almost surely. Now apply to (47) the same steps leading to the weak limits in (43) and (44) (first replace by an object that is equal in distribution, and then take the limit as $n \to \infty$) to conclude that the second assertion in (30) holds. □

Proof of Theorem 5.2. Let us first prove the asymptotic expansion appearing in (31). Simple algebra yields (48), implying (49). Plugging (49) and the inequality (42) into (36) (after noticing that we have assumed that $\Gamma$ appearing in Assumption 5 satisfies $\mathbb{E}[\Gamma^2] < \infty$), we conclude that (50) holds, where $c_{1,\beta}$ is defined in (31), and we recall that $\lambda$ is the constant appearing in Assumption 5. This proves the assertion in (31).
Using a similar but tedious calculation, we arrive at (51).
Again plugging (51) and the inequality (42) into (36) (after noticing that we have assumed that $\Gamma$ appearing in Assumption 5 satisfies $\mathbb{E}[\Gamma^4] < \infty$), we conclude that (52) holds as $n \to \infty$, thus proving the assertion in (32). □

Small Batch Regime for OB-I

Theorem 5.1 characterizes the effect of using large batch sizes, that is, $\lim_{n\to\infty} m_n/n = \beta > 0$, on the asymptotic behavior of $T_{\text{OB-I}}(m_n, d_n)$ and $\hat{\sigma}^2_{\text{OB-I}}(m_n, d_n)$. Theorem 5.3 does the same for the small batch ($\beta = 0$) context. In particular, Theorem 5.3 asserts that when small batches are used, $\hat{\sigma}^2_{\text{OB-I}}(m_n, d_n)$ consistently estimates $\sigma^2$, and that $T_{\text{OB-I}}(m_n, d_n)$ converges to the standard normal distribution.
Through prior arguments, we proved that the first term on the right-hand side of (64) tends to $\sigma^2$ in probability; also, because $\sqrt{n}(\hat{\theta}_n - \theta(P)) \xrightarrow{d} \sigma N(0,1)$ and $\beta := \lim_{n\to\infty} m_n/n = 0$, Slutsky's theorem (B.2) ensures that the second term on the right-hand side of (64) is $o_P(1)$. To see that the third term on the right-hand side of (64) also tends to zero in probability, notice again that $\sqrt{n}(\hat{\theta}_n - \theta(P)) \xrightarrow{d} \sigma N(0,1)$, and make use of Slutsky's theorem (B.2). This proves the first assertion of the theorem in (53).
To prove the second assertion in (53), we again apply Slutsky's theorem (B.2) after noticing that the numerator in the expression for $T_{\text{OB-I}}(m_n, d_n)$ converges weakly to $\sigma N(0,1)$ due to Assumption 3, and that the denominator converges in probability to $\sigma$ from the first assertion. □ We now make a few observations regarding Theorem 5.3.

Theorem 5.3 relies on the point estimator $\hat{\sigma}^2_{\text{OB-I}}(m_n, d_n)$ being a consistent estimator of $\sigma^2$, whereas Theorem 5.1 results in a cancellation method that does not rely on the consistency of $\hat{\sigma}^2_{\text{OB-I}}(m_n, d_n)$. This is why Theorem 5.3 insists that $b_\infty = \infty$, whereas Theorem 5.1 does not.
(d) As is evident from (64), characterizing the next-order term for the mean and variance of $\hat{\sigma}^2_{\text{OB-I}}(m_n, d_n)$ (akin to Theorem 5.2) will involve assumptions on the nature of higher-order terms in the uniform convergence condition appearing as Assumption 4.

THE OB-II LIMIT
In this section, we characterize the weak limit of $T_{\text{OB-II}}(m_n, d_n)$. As described in Section 3.4, recall that the OB-II statistic differs from its OB-I counterpart in that it replaces the sectioning estimator $\hat{\theta}_n$ with the batching estimator $\bar{\theta}_n$ as the centering variable. As in the OB-I context, the ensuing sections treat the large batch and small batch regimes separately.

Proof. See Appendix C. □
We make a number of observations in light of Theorem 6.1.
(a) As in Theorem 5.1, we see that the variance parameter $\sigma^2$ is not estimated consistently in Theorem 6.1. Instead, the estimator $\hat{\sigma}^2_{\text{OB-II}}(m_n, d_n)$ converges weakly to the product of $\sigma^2$ and $\chi^2_{\text{OB-II}}(\beta, b_\infty)$. Again, we slightly abuse notation and define the weak limit appearing in (71) as the $T_{\text{OB-II}}(\beta, b_\infty)$ random variable.

(b) Unlike the OB-I interval estimator, the OB-II interval estimator uses $\bar{\theta}_n$ as the centering variable and when estimating the variance constant. For this reason, and as we shall briefly discuss later, the OB-II estimator is attractive from a computational standpoint.
(d) As can be seen, the "bias correction" factor $c_2(\beta, b_\infty)$ in (69) for the OB-II context is considerably more complicated.
The OB-II analogue of the OB-I asymptotic variance appearing in (32) of Theorem 5.2 has been elusive.
(e) The table in Figure 4 displays the critical values $t_{\text{OB-II},1-\alpha}(\beta, b_\infty) := \min\{x : P(T_{\text{OB-II}}(\beta, b_\infty) \le x) \ge 1 - \alpha\}$ associated with the $T_{\text{OB-II}}$ distribution as a function of $1 - \alpha$ and for different values of the parameters $\beta, b_\infty$. R and MATLAB code for calculating the critical values can be obtained through https://web.ics.purdue.edu/~pasupath. (A brute-force Monte Carlo alternative is sketched below.)
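In the absence of the tables, the weak convergence $T_{\text{OB-x}}(m_n, d_n) \xrightarrow{d} T_{\text{OB-x}}(\beta, b_\infty)$ itself suggests a brute-force approximation: simulate the Studentized statistic on iid $N(0,1)$ data (so that $\theta(P) = 0$ and $\sigma = 1$) with large $n$, and read off empirical quantiles. A rough Monte Carlo sketch for OB-I, assuming sigma2_ob1 from the sketch in Section 3.2; the distributed R and MATLAB code should be preferred in practice.

```python
import numpy as np

def ob1_quantile_mc(beta, q=0.95, n=2000, d=1, reps=5000, seed=1):
    """Monte Carlo estimate of the q-quantile of T_OB-I(beta, b_inf)."""
    rng = np.random.default_rng(seed)
    m = int(beta * n)
    ts = np.empty(reps)
    for r in range(reps):
        x = rng.normal(size=n)              # theta(P) = 0, sigma = 1
        ts[r] = np.sqrt(n) * x.mean() / np.sqrt(sigma2_ob1(x, m, d, np.mean))
    return np.quantile(ts, q)

# e.g., ob1_quantile_mc(0.1) should land near the tabulated value of ~1.76
```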

Small Batch ($\beta = 0$) Regime for OB-II
We now treat the small batch regime ($\beta := \lim_{n\to\infty} m_n/n = 0$) for OB-II. Like Theorem 5.1, Theorem 6.1 needed the strong invariance Assumption 5 to hold so that the dependence across batches could be characterized. Theorem 6.2 (OB-II Small Batch Regime). Suppose Assumptions 1-5 hold, and that $\beta = \lim_{n\to\infty} m_n/n = 0$. Assume also that the number of batches $b_n \to \infty$. Then, as $n \to \infty$, the assertions in (72) hold. From arguments identical to those in the proof of Theorem 5.3 (specifically, (57)-(63)), we see that $\hat{\sigma}^2_{\text{OB-II}}(m_n, d_n)$ consistently estimates $\sigma^2$, that is, (75) holds. To complete the first part of the theorem's assertion in (72), we write the decomposition in (76). From (75), we see that the first term on the right-hand side of (76) tends to $\sigma^2$ in probability; also, because $\sqrt{n}(\bar{\theta}_n - \theta(P)) \xrightarrow{d} \sigma N(0,1)$, the remaining terms on the right-hand side of (76) tend to zero in probability upon making use of Slutsky's theorem (B.2). This proves the first assertion of the theorem in (72).
To prove the second assertion in (72), we again apply Slutsky's theorem (B.2) after noticing that the numerator in the expression for $T_{\text{OB-II}}(m_n, d_n)$ converges weakly to $\sigma N(0,1)$ due to Assumption 3, and that the denominator converges in probability to $\sigma$ from the first assertion. □

CONSIDERATIONS DURING IMPLEMENTATION
In this section, we discuss "practitioner" questions that seem to arise repeatedly. In summary, substituting the normal or Student's $t$ critical value for the OB critical values will not provide the correct coverage unless $\beta = 0$; moreover, the deviation from the nominal coverage caused by such substitution can become substantial as the asymptotic batch size $\beta$ becomes large.

Which OB CIP?
We have presented three statistics along with their weak limits OB-x, x = I, II, III, amounting to three possible CIPs. Numerical evidence provided in the ensuing section suggests that using these CIPs with large overlapping batches tends to result in confidence intervals having good behavior across a variety of contexts. How do the OB CIPs compare against each other?
Unfortunately, providing a satisfactory answer appears to be context-dependent and requires much further investigation, especially around the question of batch size choice. The sectioning estimator $\hat{\theta}_n$ used within the OB-I CIP typically has variance $O(1/n)$ and bias $O(1/n^{\kappa})$ for some $\kappa \ge 1/2$, whereas the batching estimator $\bar{\theta}_n$ used within OB-II has lower variance (when using overlapping batches) and higher bias; how these collude to determine the quality of the resulting confidence intervals is a context-dependent question.
In summary, from the standpoint of interval quality as assessed by coverage probability and expected half-width, little is known theoretically about the relative behavior of OB-x, x = I, II, III, especially when implemented with their corresponding optimal batch sizes. This should form the agenda for future investigation.
Table 2. Time complexities of the three OB CIPs. Recall that $n$ represents the size of the dataset, $m_n$ the batch size, $b_n$ the number of batches, and $C(m)$ the time complexity of constructing the estimator of the statistical functional $\theta(P)$ using a batch of size $m$.

CIP      Time Complexity
OB-I     $O(C(n) + b_n\,C(m_n))$
OB-II    $O(b_n\,C(m_n))$

The difference between the proposed procedures is much clearer from the standpoint of computational complexity.
Suppose $C(|u - \ell|)$ is the time complexity of calculating the estimator $\hat{\theta}(\{X_i, \ell \le i \le u\})$ described in Section 3.1. Then, as can be seen in Table 2, simple calculations reveal that the OB-II CIP is the most computationally efficient and the OB-III CIP the least computationally efficient. The relative complexities of the three CIPs become stark when using large batches with significant overlap, that is, when $m_n/n \to \beta > 0$ and $d_n = O(1)$: this leads to $O(n\,C(m_n))$ complexity for OB-I and OB-II, and still higher complexity for OB-III. With sparse overlap resulting in a finite number of asymptotic batches, that is, if $b_\infty < \infty$, OB-I has complexity $O(C(n))$, OB-II has complexity $O(C(m_n))$, and OB-III again has higher complexity. An important qualification to the above discussion is that, depending on the specific context, the complexities listed in Table 2 can be conservative and should only be used as broad guidance. Specifically, in the sequential context where the data are revealed one (or a few) at a time, instead of all at once, the estimators $\hat{\theta}_n$ and $\hat{\theta}_{m_n,j}$ can often be constructed sequentially, in a way where the resulting complexities are much better than the "one shot" complexities listed in Table 2 (see the sketch below). Nevertheless, we expect OB-III to be the most computationally expensive, and OB-II to be the least computationally expensive.
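To illustrate the sequential point, for an estimator as simple as the sample mean, all fully overlapping ($d_n = 1$) batch estimates can be computed in $O(n)$ total time with a rolling sum, far below the one-shot $b_n \cdot C(m_n)$ accounting. A minimal sketch under that assumption (sample-mean functional only):

```python
import numpy as np

def rolling_batch_means(x, m):
    """All fully overlapping batch means of window size m in O(n) time."""
    c = np.cumsum(np.concatenate(([0.0], x)))
    return (c[m:] - c[:-m]) / m      # mean of x[j:j+m] for each start j
```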

NUMERICAL ILLUSTRATION
We now present numerical results from three popular contexts to gain further insight into the behavior of confidence intervals produced by OB-I, OB-II, and subsampling.

Example 1: CVaR Estimation.
Let $\xi_\alpha$ denote the CVaR at level $\alpha$ associated with the standard normal random variable. From the definition of CVaR [70], we have
$$\xi_\alpha = \frac{\phi\left(\Phi^{-1}(\alpha)\right)}{1 - \alpha},$$
where $\phi(\cdot)$ and $\Phi(\cdot)$ are the standard normal density and cdf, respectively. With observations from an iid sequence $\{X_i, i \ge 1\}$ of standard normal random variables, we can construct a point estimator for $\xi_\alpha$ as the average of the largest $\lceil n(1-\alpha) \rceil$ order statistics:
$$\hat{\xi}_{n,\alpha} = \frac{1}{\lceil n(1-\alpha) \rceil} \sum_{i = n - \lceil n(1-\alpha) \rceil + 1}^{n} X_{(i)},$$
where $X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}$ denote the order statistics. We wish to construct a 0.95-confidence interval on $\xi_\alpha$ for $\alpha = 0.7, 0.9, 0.95$ with number of observations $n = 100, 500, 1000, 2000, 3000$, and $5000$.
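A minimal sketch of this example (Python; names ours): the closed form above versus the tail-averaging point estimator.

```python
import numpy as np
from scipy.stats import norm

def cvar_normal(alpha):
    """Closed-form CVaR of N(0,1): phi(Phi^{-1}(alpha)) / (1 - alpha)."""
    return norm.pdf(norm.ppf(alpha)) / (1 - alpha)

def cvar_estimate(x, alpha):
    """Average of the largest ceil(n(1 - alpha)) order statistics."""
    k = int(np.ceil(len(x) * (1 - alpha)))
    return np.mean(np.sort(x)[-k:])

# e.g., cvar_normal(0.9) ~= 1.755; cvar_estimate of 5000 iid N(0,1) draws
# is close, and batching this estimator yields the OB intervals studied here.
```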
Table 3. The table summarizes coverage probabilities obtained using small batch OB-I ($\beta = 0$), large batch OB-I ($\beta = 0.1, 0.25$), small batch OB-II ($\beta = 0$), large batch OB-II ($\beta = 0.1, 0.25$), and subsampling (SS) for the CVaR problem with $\alpha = 0.7$. The numbers in parentheses are estimated expected half-widths of the confidence intervals.

Tables 3-5 display clear trends that repeat, more or less, across the different experiments we present. All methods seem to tend to the nominal coverage as the available data increases. However, OB-I and OB-II with $\beta > 0$ seem to reach the nominal coverage much faster than the rest. For example, in Table 3, OB-I and OB-II with $\beta = 0.25$ reach the vicinity of the nominal coverage after only about $n = 100$ observations; OB-I and OB-II with $\beta = 0.1$ do so after about $n = 500$ observations. Similarly, in Table 5, OB-I and OB-II with $\beta = 0.25$ reach the vicinity of the nominal coverage after about $n = 1000$ observations, while for $\beta = 0.1$ the corresponding number is $n = 2000$. The performance of OB confidence intervals with small batches is comparable to that of subsampling; both OB-x with $\beta = 0$ and subsampling struggle on the CVaR problem with $\alpha = 0.95$.

Example 2: Parameter Estimation for AR(1).
Consider the AR(1) process given by
$$X_t = \phi X_{t-1} + \epsilon_t, \quad t \ge 1,$$
where $\{\epsilon_t, t \ge 1\}$ are iid innovations with mean zero and standard deviation $\sigma_\epsilon$.
With observations from the time series $\{X_t, t \ge 1\}$, the least-squares point estimator for $\theta(P) := \phi$ (after fixing $X_0 = 0$) is
$$\hat{\phi}_n = \frac{\sum_{t=1}^{n} X_t X_{t-1}}{\sum_{t=1}^{n} X_{t-1}^2}.$$
We wish to construct a 0.95-confidence interval on $\phi = 0.5, 0.9$ for $\sigma_\epsilon = 1$ and with number of observations $n = 100, 500, 1000, 5000$, and $10000$. Tables 6-7 are in the same format as Tables 3-5 and display the results for the AR(1) example. The trends in coverage probabilities appear similar to those observed in Example 1, with large batches playing a seemingly important role in ensuring close-to-nominal coverage. Interestingly, Example 2 does a better job of distinguishing between OB-I and OB-II for the same $\beta$, and between OB methods and subsampling. For example, due to the increased estimator bias associated with $\phi = 0.9$, OB-I with $\beta = 0, 0.1, 0.25$ appears to dominate OB-II with corresponding $\beta = 0, 0.1, 0.25$. Subsampling clearly generates intervals with smaller expected half-width, but its coverage is substantially lower than nominal, especially when $\phi = 0.9$. Such differences were not as evident in Example 1, probably because of the more muted effects of bias.
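A minimal sketch of this example (names ours): simulate the AR(1) recursion and compute the least-squares estimator displayed above.

```python
import numpy as np

def simulate_ar1(phi, n, seed=0):
    """X_t = phi * X_{t-1} + eps_t, with X_0 = 0 and eps_t iid N(0, 1)."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.normal()
    return x

def ar1_ls(x):
    """Least-squares estimator: sum X_t X_{t-1} / sum X_{t-1}^2."""
    return np.dot(x[1:], x[:-1]) / np.dot(x[:-1], x[:-1])

# e.g., ar1_ls(simulate_ar1(0.9, 5000)) is typically within ~0.02 of 0.9
```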

The Effect of Overlap in Batches
Toward understanding the effect of the batch offset parameter $d_n$, we conducted additional numerical experiments on the "more difficult versions" of the CVaR problem ($\alpha = 0.9$) and the AR(1) problem ($\phi = 0.9$), with a dataset of size $n = 1000$. As can be seen in Table 9 and Table 10, various values of $d_n$ were chosen, expressed as a function of the batch size $m_n$ or the dataset size $n$.
The trends in Table 9 and Table 10 are interesting, although predictable. Increasing $d_n$ clearly hastens the convergence of the coverage probability to nominal. This effect is pronounced in the small batch regime and most muted in the OB-I large batch regime. Correspondingly, there is also an increase in the expected half-widths, with the small batch regime exhibiting a sharp rise when $d_n$ is very large, or correspondingly, when the number of batches is very small.
Again, this effect is most muted in the OB-I large batch regime.
The main insight is that large $d_n$ values help with estimating the variance constant correctly in the small batch regime, since the dependence between batch estimates decreases as $d_n$ increases. However, the price is larger half-widths due to the necessarily smaller number of batches. The large batch regime avoids this problem by modeling the dependence structure between batch estimates.
Table 9. The table summarizes coverage probabilities obtained using small batch OB-I with Student's $t$ critical values ($b_n - 1$ degrees of freedom), large batch OB-I ($\beta = 0.25$), small batch OB-II with Student's $t$ critical values ($b_n - 1$ degrees of freedom), and large batch OB-II ($\beta = 0.25$) for the CVaR problem with $\alpha = 0.9$ and $n = 1000$, estimated over 10000 independent runs. The numbers in parentheses are estimated expected half-widths of the confidence intervals.
Since its original introduction in 1979 [32], the bootstrap has received tremendous attention owing to its simplicity and wide applicability, resulting in popular refinements [28-31], the ability to handle time series [7,33], extensions to the functional context [22], higher-order corrections [51,57] to improve coverage accuracy, and most recently a computationally "cheap" version [55]. Debates on whether subsampling or the bootstrap is better have continued, but it is now known that subsampling is more general, in that the bootstrap requires the bootstrap distribution $\hat{J}_n(\cdot, \hat{P}_n)$ to behave smoothly (around $P$) when seen as a function of its second argument. We go into no further detail on this point, but see [68, Section 2.3]. Also see [39] for an interesting theorem on the sense in which the bootstrap in its basic form is not valid if the variance parameter does not exist.

B SOME USEFUL RESULTS
We will invoke the following useful result from [26] that provides a weak law for triangular arrays of real-valued random variables that are not necessarily identically distributed.
We will now individually characterize the behavior of the terms $\bar{E}(m_n, d_n)$ and $\bar{I}$ above.
thus proving the assertion for finite $b_\infty$.
Next, observe that (100) holds. Some algebra yields (101) and (102). Plugging (101) and (102) into (100), we obtain (103), where the last equality holds by the definition of $c_2(\beta, b_\infty)$ in (69) and since we have assumed that $|m_n/n - \beta| = o(n^{-\lambda})$.

We write $X_n \xrightarrow{wp1} X$ to refer to almost sure convergence, $X_n \xrightarrow{p} X$ to refer to convergence in probability, and $X_n \xrightarrow{d} X$ to refer to convergence in distribution (or weak convergence). (vi) We write $X \stackrel{d}{=} Y$ to mean that the random variables $X$ and $Y$ have the same distribution. (vii) The empirical measure $P_n$ constructed from the sequence $\{X_i, i \ge 1\}$ is given by $P_n(A) = n^{-1} \sum_{i=1}^{n} \mathbb{I}\{X_i \in A\}$.

This work presents CLT-based overlapping batch CIPs for constructing confidence intervals on statistical functionals.There exists a well-developed literature on CLT-based OB CIPs for the steady-state mean, and more recently for quantiles, but the only treatment of statistical functionals through CLT-based methods that we are aware of is [61, Section 2.4].

(a) Unlike in the large batch setting ($\beta > 0$) of Theorem 5.1, the first assertion of Theorem 5.3 guarantees that $\hat{\sigma}^2_{\text{OB-I}}(m_n, d_n)$ is a consistent estimator of $\sigma^2$. (b) Unlike Theorem 5.1, Theorem 5.3 does not need Assumption 5, simply because $\sigma^2$ is being estimated consistently, implying that the dependence between the numerator and the denominator of $T_{\text{OB-I}}(m_n, d_n)$ does not have to be explicitly modeled. This is what allows the use of Slutsky's theorem in Theorem 5.3. (c) Theorem 5.3 assumes very little about the overlap of the batches, apart from requiring the number of batches to diverge. In this sense, Theorem 5.3 is fundamentally different from Theorem 5.1.
OB Critical Values versus Gaussian or Student's $t$ Critical Values. In the absence of the OB-I and OB-II critical value tables on page 17 and page 24, respectively, it has been customary to use critical values from the $z$-table or the Student's $t$ table with an appropriate number of degrees of freedom. From a practical standpoint, how much difference does it make if one uses the $z$-table or the Student's $t$ table instead of the OB critical value table? When the batch size is large, that is, if $\beta := \lim_{n\to\infty} m_n/n > 0$, and when the limiting number of batches $b_\infty = \infty$, the OB-I and OB-II critical values correspond to the rightmost columns of the tables appearing on page 17 and page 24, respectively. Looking at these columns, it should be immediately clear that the OB-I and OB-II critical values can be quite different from those of the standard normal distribution. For instance, when $\beta = 0.1$, the 0.95-quantiles of the OB-I and OB-II distributions are each around 1.76, whereas the corresponding standard normal quantile is $\Phi^{-1}(0.95) = 1.645$, a difference of more than 7%. This difference increases as $\beta$ increases, and vanishes as $\beta \to 0$. When $\beta := \lim_{n\to\infty} m_n/n > 0$ but the limiting number of batches $b_\infty < \infty$, the natural temptation, in the absence of the OB-I and OB-II distributions, might be to use the Student's $t$ critical value with $b_\infty - 1$ degrees of freedom. (Some algebra reveals that when $\beta > 0$, $b_\infty < \beta^{-1}$ results in non-overlapping batches and $b_\infty \ge \beta^{-1}$ results in overlapping batches.) However, notice again that the quantiles reported on pages 17 and 24 can be quite different from the corresponding Student's $t$ critical value with $b_\infty - 1$ degrees of freedom. For instance, when $\beta = 0.2$ and $b_\infty = 51$, the 0.95-quantiles of the OB-I and OB-II distributions are 1.893 and 1.902, respectively, whereas the 0.95-quantile of the Student's $t$ distribution with 50 degrees of freedom is 1.6749, a difference of more than 11%. As $\beta \to 0$ and assuming $b_\infty < \infty$, the quantiles of the OB-II distribution converge to those of the Student's $t$ distribution with $b_\infty - 1$ degrees of freedom; the difference between the quantiles of the $T_{\text{OB-I}}$ distribution and those of the Student's $t$ distribution with $b_\infty - 1$ degrees of freedom persists even as $\beta \to 0$.

Here, $\alpha_{1,1+j}$ is the strong mixing constant associated with the sigma algebras $\sigma(X_1, X_2, \ldots, X_{m_n})$ and $\sigma(X_{jd_n+1}, X_{jd_n+2}, \ldots, X_{jd_n+m_n})$ formed by random variables in batch 1 and batch $1+j$; the first inequality in (63) follows upon application of Corollary 2.5 in [34, pp. 347] with $p = 1$, $q = \infty$, $r = \infty$; the second inequality in (63) follows since the mixing constants are bounded; and the last inequality in (63) follows since Assumption 2 implies $\alpha_{1,1+j} \to 0$, implying in turn that the corresponding Cesàro average vanishes.

As in previous examples, the numbers in parentheses refer to the estimated expected half-width. The trends in Table 8 are consistent with those from the previous examples, with OB-I delivering confidence intervals that are clearly better in terms of coverage, although nominal coverage seems to need a higher value of $\beta$ than in previous examples. Subsampling does not reach nominal coverage even with $n = 50000$, although the generated intervals have much smaller half-widths.

Table 8. The table summarizes coverage probabilities obtained using large batch OB-I ($\beta = 0.1, 0.25$) and subsampling for the NHPP rate estimation problem with $\lambda(t) = 4 + 8t$. The three numbers in each column indicate the coverage probability estimates corresponding to a confidence interval on $\lambda(t)$ for $t = 0.25, 0.5, 0.75$. The numbers in parentheses are estimated expected half-widths of the confidence intervals.