Testing Closeness of Multivariate Distributions via Ramsey Theory

We investigate the statistical task of closeness (or equivalence) testing for multidimensional distributions. Specifically, given sample access to two unknown distributions p, q on R^d, we want to distinguish between the case that p = q versus ∥p − q∥_{A_k} > ϵ, where ∥p − q∥_{A_k} denotes the generalized A_k distance between p and q, i.e., the maximum discrepancy between the distributions over any collection of k disjoint, axis-aligned rectangles. Our main result is the first closeness tester for this problem with sub-learning sample complexity in any fixed dimension and a nearly-matching sample complexity lower bound. In more detail, we provide a computationally efficient closeness tester with sample complexity O((k^{6/7}/poly_d(ϵ)) log^d(k)). On the lower bound side, we establish a qualitatively matching sample complexity lower bound of Ω(k^{6/7}/poly(ϵ)), even for d = 2. These sample complexity bounds are surprising because the sample complexity of the problem in the univariate setting is Θ(k^{4/5}/poly(ϵ)). This has the interesting consequence that the jump from one to two dimensions leads to a substantial increase in sample complexity, while increases beyond that do not. As a corollary of our general A_k tester, we obtain d_TV-closeness testers for pairs of k-histograms on R^d over a common unknown partition, and pairs of uniform distributions supported on the union of k unknown disjoint axis-aligned rectangles. Both our algorithm and our lower bound make essential use of tools from Ramsey theory.


Introduction
Background and Motivation A fundamental statistical task is to ascertain whether a set of samples comes from a given model, where the model may consist of either a single fully specified probability distribution or a family of probability distributions. The study of this broad task was initiated over a century ago in the field now known as statistical hypothesis testing [Pea00, NP33]; see, e.g., [LR05] for an introductory textbook on the topic. In the past three decades, hypothesis testing has been extensively studied by the theoretical computer science and information-theory communities, under the name distribution testing, in the framework of property testing [RS96, GGR98]. It is instructive to note that the TCS-style definition of hypothesis testing is equivalent to the minimax testing definition introduced and studied by Ingster and coauthors [Ing94, Ing97, IS03].
The paradigmatic problem in distribution testing is the following: given sample access to one or more unknown probability distributions, we want to correctly distinguish (with high probability) between the cases that the underlying distributions satisfy some global property P or are "far" from satisfying the property. The primary objective is to obtain a tester that is statistically efficient, i.e., one that has information-theoretically optimal sample complexity. An additional important criterion is computational efficiency; that is, the testing algorithm should run in sample-polynomial time. After the pioneering early works formulating this field [GR00, BFR+00] from a TCS perspective, there has been substantial progress on testing a wide range of properties; see, e.g., [BFF+01, BDKR02, BKR04, Pan08, Val11, VV11, ADJ+11, LRR11, VV14, CDVV14, DKN15a, CDKS17, DDK18, CDKS18, DGK+21, CJKL22, CDKL22] for a sample of works, and [Rub12, Can22] for surveys on the topic.
Here we study the problem of closeness testing (or equivalence testing) between two unknown probability distributions. Specifically, given independent samples from a pair of distributions p, q, we want to determine whether the two distributions are the same versus ϵ-far from each other. Early work on this problem [BFR+00] focused on the setting where p, q are arbitrary discrete distributions of a given support size n, and the metric used to quantify "closeness" is the ℓ_1-distance (equivalently, total variation distance). It is now known [CDVV14] that the optimal sample complexity of ℓ_1-closeness testing for distributions with support of size n is Θ(max{n^{2/3}/ϵ^{4/3}, n^{1/2}/ϵ^2}).
In summary, it is known that the complexity measure determining the sample complexity of testing the equivalence (and a range of other related properties) of unstructured (i.e., potentially arbitrary) discrete distributions is the domain size of the underlying distributions. Unfortunately, this implies that if p, q are (potentially arbitrary) continuous distributions (even in one dimension!), no closeness tester with finite sample complexity exists. There are two natural approaches to circumvent this bottleneck. The first approach is to assume that p, q have some nice structure, in which case the domain size may not be the right complexity measure for the testing problem. The second approach is to make no assumptions on the underlying distributions, but relax the metric under which we measure closeness.
Interestingly, it turns out that these two seemingly orthogonal approaches are intimately related to each other. In particular, for the important special case of one-dimensional distributions, a line of works, see, e.g., [DDS+13, DKN15b, DKN15a, DKN17], developed a general framework that yields optimal testers (for closeness and other properties) for a range of structured distribution families.
The key idea underlying these testers is to design a single tester for arbitrary one-dimensional distributions but under a different, carefully selected, metric; and then appropriately use this metric as a proxy for the total variation distance (for each structured distribution family of interest).
In more detail, for one-dimensional distributions p, q : R → R_+, the appropriate metric is known as the A_k-distance [DL01, CDSS14a] and is defined as follows: The A_k-distance between one-dimensional distributions p and q, denoted by ∥p − q∥_{A_k}, is defined as the maximum ℓ_1-distance between the reduced distributions obtained from p, q over all partitions of the domain into at most k intervals. The motivation for this particular definition of the A_k-distance [CDSS14a, DKN15b] between one-dimensional distributions comes from the VC-inequality (see, e.g., page 31 of [DL01]).
The positive integer k in the definition of the A_k-distance is a tunable parameter that is selected appropriately depending on the application. For k = 2, the A_k-distance amounts to the distance between the cumulative distribution functions (known as the Kolmogorov distance). As k increases, the metric becomes stronger and converges to the total variation distance as k → ∞ (under mild assumptions on the distributions). Moreover, if the underlying distributions p, q belong to some class of shape-restricted densities (e.g., univariate histograms or log-concave distributions), a finite value of k suffices for the A_k-distance to closely approximate the total variation distance.
It is worth noting that, in addition to distribution testing, the one-dimensional A_k distance has been a crucial ingredient in developing efficient learning algorithms for structured univariate distributions [CDSS13, CDSS14a, ADLS17, CLM20].
Testing Closeness of Multivariate Distributions The main motivation behind this work is to generalize the aforementioned framework to the multivariate setting, with a focus on the task of closeness testing. A first step towards this goal is an appropriate generalization of the notion of A_k-distance (which applies to one-dimensional distributions) to distributions on R^d for all d ≥ 1. Here we study the following natural definition, which has been previously used in the context of learning [DLS18] and uniformity testing [DKP19] for multivariate distributions.
Definition 1.1 (Multidimensional A_k-distance). For two probability distributions (with densities/mass functions) p, q : R^d → R_+ and k ∈ Z_+, we define the multi-dimensional A_k-distance between p and q as the maximum value of Σ_{i=1}^k |p(R_i) − q(R_i)| over all collections R_1, . . . , R_k of k arbitrarily chosen non-overlapping axis-aligned rectangles in R^d.

Motivation for Definition 1.1 Recall that the total variation distance between two distributions p, q on R^d is defined as d_TV(p, q) = sup_{A∈S} |p(A) − q(A)|, where S is the collection of all measurable subsets of R^d. Since learning or testing under the total variation distance may be too strong a goal if the underlying distributions lack structure, a reasonable compromise is to consider alternative metrics. The VC-inequality states the following: Let A be any collection of subsets of R^d with VC-dimension d. Then for any distribution p on R^d it holds that E[sup_{A∈A} |p̂_n(A) − p(A)|] = O(√(d/n)), where p̂_n is the empirical distribution obtained after drawing n i.i.d. samples from p. In other words, for n ≫ d/ϵ^2, the empirical distribution is ϵ-close to p with respect to the A-metric, defined as ∥p − q∥_A := sup_{A∈A} |p(A) − q(A)|. For the univariate case, the A_k-distance defined in the aforementioned works [CDSS14a, DKN15b, DKN15a] is obtained from the A-metric by considering the family of all unions of at most k intervals (which has VC-dimension 2k).
Our Definition 1.1 is a natural generalization of the one-dimensional definition, where we consider the family of all unions of at most k rectangles, which has VC-dimension Θ(kd). Since learning an arbitrary distribution on R^d under this metric requires Θ(kd/ϵ^2) samples, it is natural to ask whether the distribution testing problem has qualitatively lower sample complexity. We also note that the multidimensional A_k-distance is a strengthening of the Kolmogorov-Smirnov (KS) metric and converges to the total variation distance as k → ∞ (under mild assumptions). It should be noted that a line of work in mathematical statistics, see, e.g., [Bic69, FR79, Hen88, JPZ97] for some classical works, has developed two-sample testers (aka closeness testers) for non-parametric multivariate distributions under the KS metric. Our work can be viewed as a strengthening and generalization of these results in the minimax setting.
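To make Definition 1.1 concrete, here is a minimal sketch (our own illustration, not code from the paper) that estimates the discrepancy witnessed by one fixed collection of disjoint axis-aligned rectangles from samples. The true A_k distance is the supremum of this quantity over all such collections; the function names and the example rectangles are ours.

```python
import numpy as np

def rect_mass(samples, rect):
    """Empirical mass of an axis-aligned rectangle: the fraction of sample
    points whose every coordinate lies in the corresponding interval."""
    inside = np.ones(len(samples), dtype=bool)
    for i, (lo, hi) in enumerate(rect):
        inside &= (samples[:, i] >= lo) & (samples[:, i] <= hi)
    return inside.mean()

def ak_discrepancy(samples_p, samples_q, rectangles):
    """Sum of |p(R_i) - q(R_i)| over the given disjoint rectangles, with the
    true masses replaced by empirical masses from the two sample sets."""
    return sum(abs(rect_mass(samples_p, R) - rect_mass(samples_q, R))
               for R in rectangles)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    samples_p = rng.uniform(0.0, 1.0, size=(5000, 2))
    samples_q = rng.uniform(0.0, 1.0, size=(5000, 2)) ** 1.3  # a perturbed copy
    rects = [[(0.0, 0.5), (0.0, 0.5)], [(0.5, 1.0), (0.5, 1.0)]]  # k = 2 rectangles
    print(ak_discrepancy(samples_p, samples_q, rects))
```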
We believe that, in addition to being a potential tool for performing multivariate d_TV-closeness testing for structured distributions, the A_k distance is an interesting metric on its own merits. To see this, we recall that one of the main motivations for considering the total variation distance is the following property: If a decision algorithm is run twice on different inputs that follow two distributions that are close in total variation distance, then the acceptance probabilities will also be approximately the same in the two cases. Hence, for two distributions p, q that have passed d_TV-closeness testing, we can be confident that running some downstream decision algorithm on inputs drawn from p and from q will give similar results. For the A_k distance, we have an analogous property if one restricts the algorithm in the above statement to be an axis-aligned decision tree (i.e., a decision tree whose internal nodes follow a branching rule of the form x_i < b for some coordinate i ∈ [d] and some real number b ∈ R) with at most k leaves. Though they form a restricted family of algorithms, axis-aligned decision trees are commonly used in machine learning applications due to their exceptional interpretability; see, e.g., [YA01, BDS10, BPGB20]. This suggests that testing in A_k distance, even though it is a weaker test compared to its d_TV-counterpart for arbitrary distributions (which is provably impossible without structural information), may be sufficient for certain structured downstream decision-making tasks.
We return to our closeness testing task. One approach to solve the multidimensional A_k-closeness testing problem is to learn p and q up to A_k-distance ϵ/4, and then check whether the hypotheses are ϵ/4-close to each other. Thus, the sample complexity of closeness testing is bounded above by the sample complexity of learning (within constant factors). Since Θ(kd/ϵ^2) samples suffice to learn an arbitrary distribution on R^d up to A_k-distance ϵ, the naive "testing-by-learning" approach requires Ω(k) samples (even in one dimension and for constant ϵ).
It is natural to ask whether a smaller sample size could be achieved for testing, since closeness testing is, in some sense, less demanding than learning. That is, the goal is to develop a closeness tester with sample complexity strongly sublinear in k, namely O(k^c) for some constant c < 1. The aforementioned line of work on univariate distributions [DKN15b, DKN15a, DKN17] developed identity and closeness testers under the A_k-distance with strongly sublinear sample complexity. These testers were also applied to give total variation distance testers for classes of "shape constrained" distributions [BBBB72, GJ14], including histograms and log-concave distributions.
This discussion motivates the following natural question: What is the sample complexity of A_k-closeness testing for multivariate distributions?
Prior to this work, no closeness tester with sub-learning sample complexity was known even for d = 2. The main contribution of this work is a sample near-optimal and computationally efficient A_k-closeness tester in any fixed dimension. Moreover, we show that the sample complexity of our tester is optimal as a function of k, within logarithmic factors. As an immediate corollary, we obtain the first closeness tester for multivariate histogram distributions (with respect to the same unknown set of axis-aligned rectangles) under the total variation distance. Specifically, our main result (Theorem 1.2) establishes the following: For any k, d ∈ Z_+, ϵ > 0, and sample access to arbitrary distributions p, q on R^d, there exists a closeness testing algorithm under the A_k-distance using O((k^{6/7}/poly_d(ϵ)) log^d(k)) samples. Moreover, this bound is information-theoretically optimal as a function of k, even for d = 2. We remark that our A_k-testing algorithm applies to any pair of distributions (over both continuous and discrete domains).
As a corollary, we obtain the first closeness tester (with sub-learning sample complexity) between k-histograms with respect to the total variation distance. A probability distribution on R^d with density p is called a k-histogram if there exists a partition of the support into k axis-aligned rectangles R_1, . . . , R_k such that p is constant on R_i, for all i = 1, . . . , k. This is one of the most basic non-parametric distribution families and has been extensively studied in statistics [Sco79, FD81, Sco92, LN96, DL04, WN07, Kle09] and computer science, including database theory [JKM+98, CMN98, TGIK02, GGI+02, GKS06, ILR12, ADH+15] and theoretical ML [DDS12, CDSS13, CDSS14a, CDSS14b, ADLS17, ADK15, DDS+13, DKN15a, DKN15b, DKN17, DKP19, CDKL22]. Prior to this work, no closeness testing algorithm with sub-learning sample complexity was known for k-histograms, even for d = 2. As a corollary of our main result, we provide such an algorithm (see Corollary 2.18) for the case where the two histograms are supported on the same unknown partition. In addition, we also obtain a d_TV-closeness tester for uniform distributions supported on k unknown disjoint axis-aligned rectangles (see Corollary 2.19). We remark that, though histograms and uniform distributions over unions of axis-aligned rectangles are conceptually similar, these two families of distributions are orthogonal to each other.

Our Results
We study the complexity of closeness testing between two (arbitrary) distributions p, q on R^d with respect to the A_k distance. Our main result is the following.
Theorem 1.2 (Main Result). Given ϵ > 0, an integer k ≥ 2, and sample access to distributions with density functions p, q : R^d → R_+, there exists a computationally efficient algorithm which draws C 2^{d/3} k^{6/7} log^{3d}(k)/ϵ^{α_d} samples from p, q, for a sufficiently large universal constant C > 0, where α_d = O(d^2 2^{2^{d+1}}), and with probability at least 2/3 correctly distinguishes whether p = q versus ∥p − q∥_{A_k} ≥ ϵ. Moreover, Ω(min{k^{6/7}/ϵ^{8/7}, k}) many samples are information-theoretically necessary for this hypothesis testing task, even if p, q are two-dimensional discrete distributions on a sufficiently large domain.

Discussion
To interpret Theorem 1.2, some comments are in order. We reiterate that the focus of our work is on the non-parametric setting, and consequently we view the dimension d as a fixed constant. In this regime, the sample complexity of our algorithm is Õ_d(k^{6/7})/poly_d(ϵ).
The one-dimensional special case of our closeness testing result was solved in [DKN15a], where the authors established a tight sample complexity bound of Θ(k^{4/5}/ϵ^{6/5} + k^{1/2}/ϵ^2). Prior to our work, no o(k) sample upper bound was known for this testing problem even for d = 2 and ϵ = 0.99.
For the regime of fixed dimension that we focus on, our upper and lower bounds are essentially optimal in terms of their dependence on k, the main parameter of interest. For simplicity, let us fix ϵ to be a universal constant. Examining the exponent of k in the dominant term of the sample complexity, we observe a surprising pattern: the exponent begins at 4/5 when d = 1 (as follows from the prior work [DKN15a]), jumps to 6/7 when d = 2, and then stays at 6/7 as d increases (as follows from Theorem 1.2)! This suggests that the d = 1 case is a degenerate case and that the essence and complexity of the problem is not entirely revealed until d = 2. Some remarks are in order regarding the dependence of the sample complexity on the parameters ϵ and d. First, we briefly comment on the log^d(k) term. Perhaps surprisingly, prior work [DKP19] has shown a sample complexity lower bound of (√k/ϵ^2) · Ω(log(k)/d)^{d−1} for the easier problem of A_k-uniformity testing. This suggests that the log^d(k) factor is necessary for closeness testing as well, assuming that k is sufficiently large. Finally, we conjecture that the correct dependence on ϵ in the sample complexity of this task should be a fixed-degree polynomial, independent of d. We leave this as an interesting technical question for future work (see Question 4.1).
Regarding our sample complexity lower bound, Theorem 1.2 does not specify how large the domain size of the hard distributions needs to be. Due to the application of Ramsey-theoretic arguments in the proof of our lower bound, we need it to be extremely large in terms of k (a tower function of k). In Section 3.4, we show that the domain size can be optimized to be (at most) doubly exponential in k, using a significantly more sophisticated construction (Theorem 3.8).
As immediate corollaries of our main theorem, we obtain d_TV-closeness testers (with strongly sub-learning sample complexities) for multivariate structured distributions. In particular, we highlight here the d_TV-closeness tester for distributions on R^d that are k-histograms, i.e., piecewise constant over (the same) k unknown disjoint axis-aligned rectangles. Notably, the sample complexity of this tester is the same as that of our A_k closeness tester. This implication and additional applications are given in Section 2.4.

Overview of Techniques
Here we provide a detailed overview of our technical approach to establish our upper and lower bounds.
Closeness Tester By definition of the A_k distance, there exist k disjoint axis-aligned rectangles {R_i}_{i=1}^k on R^d which witness the A_k discrepancy between p and q; that is, Σ_{i=1}^k |p(R_i) − q(R_i)| = ∥p − q∥_{A_k}. If we knew what these rectangles were, the testing task would be easy. Indeed, we could simply consider the reduced measures of p and q over {R_i}_{i=1}^k (recall that these measures, after normalization, become distributions with support size k that we can simulate access to) and then use an optimal ℓ_1-closeness tester as a black box. Given the optimal ℓ_1-closeness tester of [CDVV14], such an approach would lead to a sample complexity upper bound of O(k^{2/3}) (for constant ϵ). Of course, the difficulty is that we are not given these rectangles a priori, which intuitively could make the problem require more samples than ℓ_1-closeness testing on a domain of size k.
The lack of a priori knowledge of the witnessing rectangles is the major obstacle towards developing a closeness tester with sub-learning sample complexity. Overcoming this bottleneck necessitates the bulk of the new technical ideas developed here. To achieve this, at a very high level, we will proceed to compute some small set of rectangles that capture a "non-trivial" fraction of the discrepancy (i.e., A_k-distance) between p and q.
A simple but important observation in this context is the following: one should not expect that an obliviously selected (i.e., without drawing samples from the underlying distributions) set of rectangles suffices for this purpose. Indeed, this holds even in the one-dimensional setting: as was noted in [DKN15a], any obliviously chosen set of intervals may capture no discrepancy between a pair of adversarially chosen one-dimensional distributions, even though they have large A_k distance.
That is, it appears necessary to select rectangles using samples from the tested distributions. Note that, in any dimension d, one needs at least two points in R^d to define an axis-aligned rectangle. In particular, given two sample points x, y ∈ R^d, we consider the natural rectangle defined by these points, namely R_{x,y} := Π_{i=1}^d [min(x_i, y_i), max(x_i, y_i)], the smallest axis-aligned rectangle containing both x and y. The main intuition behind this definition is the following. Suppose that we draw two samples x, y from the mixture (1/2)(p + q) (the uniform mixture of p and q), and they both happen to land in some rectangle R such that the discrepancy |p(R) − q(R)| is non-trivial. Then, intuitively, the rectangle R_{x,y} will capture (in expectation) a non-trivial fraction of the rectangle R, and therefore also a non-trivial fraction of the discrepancy between p and q within R. The latter statement turns out to be true (see Proposition 2.1) and its proof makes essential use of tools from Ramsey theory.
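As a small illustration, the following sketch (our own; it assumes R_{x,y} is the minimal axis-aligned box spanned by x and y, as in the definition above) computes the rectangle defined by two sample points.

```python
import numpy as np

def spanned_rectangle(x, y):
    """Return R_{x,y}, the smallest axis-aligned rectangle containing x and y,
    as a list of per-coordinate (low, high) pairs."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return list(zip(np.minimum(x, y), np.maximum(x, y)))

# Two points in R^2 define the rectangle [0.2, 0.7] x [0.1, 0.9].
print(spanned_rectangle([0.2, 0.9], [0.7, 0.1]))
```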
Before we provide an overview of the ideas required to prove Proposition 2.1, we explain how to leverage this statement to develop our closeness tester. Suppose that the A_k-distance between p, q is ϵ. Then, at the cost of increasing k and decreasing ϵ by at most a constant factor, we can without loss of generality assume that there exist k rectangles {R_i}_{i=1}^k, each of which has probability mass approximately 1/k and witnesses roughly ϵ/k discrepancy. If we draw m samples from each of p, q, approximately m^2/k of these rectangles will contain two samples. Given Proposition 2.1, we know that each pair of samples landing in some R_i can be used to define a rectangle that, with some non-trivial probability, captures a non-trivial fraction of the discrepancy between p and q within R_i.
A potential concern is how one would find the right set of rectangles defined by the sample points (i.e., those that capture enough discrepancy). The statement of Proposition 2.1 only ensures the existence of such rectangles, but offers no clues on how one could reliably identify them. Perhaps the most natural approach is to try all possible sets of Θ(m^2/k) many rectangles defined by the coordinates of the sample points, and then run a standard ℓ_1-closeness tester (on the corresponding reduced distributions) to compare the probability mass of p and q on the selected rectangles. Unfortunately, in addition to its computational intractability, it is not even clear whether this method can lead to any sample complexity sublinear in k. In particular, the standard analysis of the above strategy would apply a union bound over the failure probabilities of running the ℓ_1-closeness tester on each possible reduced distribution (defined by each set of rectangles). Since there are at least m^{Ω(m^2/k)} many different ways to select the set of rectangles, this increases the sample complexity of the ℓ_1-closeness tester by a factor of Ω(m^2/k), making it hopeless to achieve any sub-learning sample complexity (even balancing the quantities m^2/k and m directly would give us m = k).
To circumvent this obstacle, we leverage an idea from [DKP19], which we term Grid Covering (see Definition 2.4). At a high level, we show that we can cover the set of all possible rectangles that can be defined by the sample coordinates, which we refer to as S, by a carefully chosen subset of these rectangles, which we refer to as F, such that each rectangle from S can be expressed as the union of at most polylogarithmically many rectangles from F. Moreover, F will be constructed to have the subtle property that any point in R^d is contained in at most polylogarithmically many rectangles within the subset (in sharp contrast, in the worst case, a point may be included in a constant fraction of S).
To take advantage of this property, we consider the notion of induced distributions (see Definition 2.6), p_F, q_F, on F: to sample from p_F, we first draw a sample point x ∈ R^d from p and return uniformly at random some rectangle from F that includes x (and similarly for q_F). As a consequence of the aforementioned properties of F, the discrepancy (under some appropriate metric) between p_F and q_F will shrink by at most a polylogarithmic factor compared to the discrepancy between p and q captured by the best Θ(m^2/k) rectangles from our original collection of rectangles S (defined using the sample points); see Lemma 2.7. Importantly, the new pair of distributions are both discrete, and the discrepancies between them will be supported on a small number of domain elements. Therefore, one could hope to apply techniques from "standard" ℓ_1-closeness testing of discrete distributions from there on. While this turns out to be manageable, we emphasize that the induced distributions still have very large support size. Hence, a direct application of ℓ_1-closeness testing on an arbitrary discrete domain is not sufficient for our purposes. We will return to this issue when we analyze the sample complexity of our tester in detail.
It remains to show correctness of this scheme. That is, we want to establish that there exists a small set of rectangles defined by the sample points which captures a non-trivial amount of discrepancy between p and q with high constant probability. To show this, we return to {R_i}_{i=1}^k, a set of k rectangles which witness the A_k distance between p and q. We will prove that for each of these rectangles R_i, if two samples x, y ∈ R^d are drawn from R_i, there is a non-negligible probability that R_{x,y}, the rectangle defined by x and y, captures a non-trivial fraction of the discrepancy in R_i (see Proposition 2.1 and its proof in Section 2.2).
As a starting point to achieve this, we show that if two sample points x, y are drawn from the restriction of (1/2)(p + q) to a rectangle R, there is a decent probability that R_{x,y} will capture a non-trivial fraction of the mass of (1/2)(p + q) in R (see Lemma 2.9). This statement turns out to be essentially equivalent to a result in Ramsey theory shown by De Bruijn that can be viewed as a generalization of the classical Erdős-Szekeres theorem (see Fact 2.10). In particular, De Bruijn showed that given N points in R^d (for N at least doubly exponential in d), there exists a triplet (x, y, z) of these points such that one of the points, z, is inside the rectangle R_{x,y} defined by the other two. This statement provides us with the desired discrepancy result for the special case that one of p, q has non-trivial probability mass in the rectangle R while the other has mass zero.
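The following brute-force checker (our own illustration; the paper uses the fact purely non-constructively) searches a point set for a triple (x, y, z) with z inside R_{x,y}, the configuration guaranteed by De Bruijn's theorem once the point set is large enough.

```python
from itertools import permutations
import numpy as np

def find_dominated_triple(points):
    """Return a triple (x, y, z) with z inside the axis-aligned box spanned by
    x and y, or None if the point set contains no such triple."""
    pts = [np.asarray(p, dtype=float) for p in points]
    for x, y, z in permutations(pts, 3):
        lo, hi = np.minimum(x, y), np.maximum(x, y)
        if np.all(lo <= z) and np.all(z <= hi):
            return x, y, z
    return None

# For random points such a triple typically appears well before the
# doubly-exponential (in d) worst-case bound kicks in.
rng = np.random.default_rng(1)
print(find_dominated_triple(rng.uniform(size=(20, 3))))
```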
To prove the desired discrepancy result for the general case, we introduce and leverage the notion of the discrepancy density of a set S ⊂ R^d, defined to be the discrepancy between p, q in S divided by the total mass assigned by p and q to S (see Definition 2.12). At a high level, our analysis proceeds as follows. We define an iterative process that selects rectangles with increasing discrepancy density. As the discrepancy density approaches one, the situation qualitatively resembles the case that only one of p, q assigns non-zero mass to the rectangle. We now provide some further details of the process. Note that if in expectation the rectangle R_{x,y}, defined by random points x, y from R, captures a non-trivial amount of discrepancy between p, q in R, we are done. Otherwise, there exists a rectangle R_{x*,y*} such that the probability masses of p and q in R_{x*,y*} differ by a negligible amount. As a result, since the probability masses of p and q within R_{x*,y*} are approximately the same, the complement of R_{x*,y*}, which we denote by S := R \ R_{x*,y*}, must have higher discrepancy density between p and q. Since the complement S can be shown to be a union of a small number of axis-aligned rectangles (see Claim 2.13), we can select one of these rectangles to restart the process. By iterating this procedure, we obtain a sequence of rectangles whose discrepancy densities increase monotonically, until we reach the case that a random pair of points drawn from one of these rectangles captures a non-trivial amount of discrepancy between p and q in expectation.
Up to this point, we have summarized the key ideas needed for the correctness analysis of our closeness tester. We now proceed to describe the tester in more detail and provide a sketch of its sample complexity analysis. Using Proposition 2.1 and (an adaptation of) the grid-covering approach of [DKP19], we obtain a pair of discrete induced distributions (that we can simulate access to based on the samples drawn) such that they have poly_d(ϵ) · m^2/k^2 discrepancy concentrated over approximately m^2/k domain elements (up to polylogarithmic factors). Leveraging the guarantees of the pair of induced distributions we have constructed, it is tempting to apply the so-called ℓ_{1,k}-tester from [DKN17]. In particular, given samples from a pair of discrete distributions, such a tester aims at distinguishing between the cases that the underlying distributions are equal versus far in ℓ_{1,k}-distance, i.e., there exist k domain elements such that the ℓ_1-distance restricted to these elements is large. Due to the "sparsity assumption" on the discrepancies, the sample complexity of ℓ_{1,k}-closeness testing is comparable to that of standard ℓ_1-closeness testing on a domain of size k, even though the actual domain size of the input distributions may be much larger. In particular, using the guarantees of the ℓ_{1,k}-tester in a black-box manner, we can detect the existing discrepancy between the pair of induced distributions with a number of samples that depends on m, the initial number of samples drawn to construct the rectangles. Balancing the number of samples used for defining the rectangles against the number of samples used for detecting the discrepancy, we obtain that m = Θ_d(k^{7/8}/poly_d(ϵ)) suffices. This sample upper bound is strongly sub-linear in k, but it turns out (in hindsight) to provide a sub-optimal dependence on k.
Intuitively, the reason that the above guarantee turns out to be sub-optimal is the following. A key property of our underlying discrete distributions would be left unused if we were to apply the guarantees of ℓ_{1,k}-testing in a black-box manner. Specifically, since the k rectangles that witness the A_k-distance between p and q are themselves each of probability mass at most O(1/k), the rectangles defined by our sample points (that capture non-trivial discrepancies) will also each be of mass at most O(1/k). This in turn implies that in the end we only need to detect discrepancies supported on a small number of light domain elements (i.e., domain elements with small probability mass in the constructed discrete distributions). By carefully incorporating this additional property (i.e., that the bins witnessing discrepancies are themselves of small probability mass) into the analysis of the ℓ_{1,k}-tester, we obtain an improved sample complexity upper bound of (m^2/k)^{2/3} / (poly_d(ϵ) m^2/k^2)^{4/3}. See Lemma 2.8 for the new tester and its analysis. We believe that this tester, customized for detecting discrepancies supported on a small number of light domain elements, may be applicable in other scenarios, as it allows us to escape from some worst-case scenarios of ℓ_{1,k}-testing. Finally, balancing the number of samples used for defining the rectangles against the number of samples used for detecting the discrepancy gives us a sample bound of approximately m = Θ_d(k^{6/7}/poly_d(ϵ)).
Sample Complexity Lower Bound Our sample complexity lower bound applies specifically to 2-dimensional distributions. This suffices for us to conclude that our sample upper bound is nearly optimal as a function of k for any constant dimension d > 1.
The starting point of our sample lower bound technique is the lower bound for one-dimensional A_k closeness testing shown in [DKN15a]. Specifically, we start by showing that it is no loss of generality to establish a lower bound only against "order-based" testers, and then prove a lower bound for such testers. In the following discussion, we elaborate on each of these steps.
We start by noting that most reasonable testers seem to only be able to take advantage of the ordering of the x-coordinates and the y-coordinates of the points they observe, and not the precise numerical values of these coordinates (see Definition 3.1). We call such a tester an order-based tester. Intuitively, this holds because the A_k distance is invariant under applying a monotonic transformation to all of the x-coordinates or all of the y-coordinates, and only the ordering of these coordinates is invariant under all monotonic transformations. In fact, we show that if there exists a non-order-based A_k closeness tester on a domain of size N, we can use it to construct an order-based tester that has almost the same guarantees, albeit on a smaller domain (see Lemma 3.2). Hence, using our reduction, we can translate any sample complexity lower bound against order-based testers into one against general testers, at the cost of increasing the domain size. To obtain the reduction, we show that for any 2-dimensional A_k tester on a sufficiently large domain, there exists a large subset of its domain such that, if the samples are drawn from this subdomain, the general tester's output will depend only on the order of the samples. In other words, restricted to this subdomain, the tester becomes exactly an order-based tester.
The argument itself resembles the one in [DKN15a]. The key difference is that, due to the tester being 2-dimensional, the structure of the order information becomes much more complicated. More specifically, there is now order information from both of the dimensions. To deal with this issue, we need to take a two-fold approach. Namely, we need to first select a subset of coordinates in the first dimension to make the tester's output independent of the samples' order information in the first dimension, and then adaptively select the subsets of coordinates in the second dimension to hide the remaining order information (see Lemma 3.2).
For order-based testers, we construct families of distributions that are hard to distinguish. At the center of the construction are two small gadgets, each consisting of a pair of distributions. We denote the two gadgets by Y and N, respectively. In the Y gadget, the two distributions are both uniform distributions supported on the edges of a square whose diagonals are parallel to the x and y axes, respectively (which we term a "diagonal square"). In the N gadget, one distribution is distributed uniformly over a randomly chosen pair of parallel edges of the square, and the other one is distributed uniformly over the remaining two edges. The key point is that, though the two distributions in the Y gadget are identical and the distributions in the N gadget have A_k distance equal to one (even for k = 4), we show that no order-based tester can distinguish between the two gadgets when at most three samples are drawn (see Section 3.2).
To construct the full hard instance, we replicate the gadgets many times in a fairly standard way. In particular, we let p, q have their supports in several "boxes". If the tester draws m samples, to introduce "noise", we produce roughly m heavy boxes on which p, q are identical. We also have k light boxes, each with mass approximately ϵ/k, on which p, q either use the construction of the Y gadget, and are therefore identical (if we want to construct p = q); or they use the construction of the N gadget, and are therefore far from each other (if we want to construct ∥p − q∥_{A_k} = ϵ). As we have discussed, observing up to three samples from any of the light boxes gives an order-based tester no information regarding which case one is in. In other words, one will only gain information from light boxes with at least four samples; note that there will only exist approximately m^4/k^3 such boxes if one draws m samples. Additionally, the m heavy boxes will "add noise" on the order of √m, and thus one can only distinguish between the two cases if m^4/k^3 ≫ m^{1/2} (or, equivalently, m ≫ k^{6/7}). This heuristic argument can be made rigorous with an appropriate use of information theory (see Section 3.3).
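A minimal sketch (our own rendering of the gadget described above, with hypothetical coordinates for the diagonal square) of how one could sample from the Y and N gadgets; the full hard instance additionally rescales and replicates these gadgets inside the heavy and light boxes.

```python
import numpy as np

# Vertices of a "diagonal square" (its diagonals are parallel to the axes).
VERTS = {"N": (0.0, 1.0), "E": (1.0, 0.0), "S": (0.0, -1.0), "W": (-1.0, 0.0)}
EDGES = [("N", "E"), ("E", "S"), ("S", "W"), ("W", "N")]  # edges 0,2 are parallel, as are 1,3

def sample_on_edges(edge_ids, n, rng):
    """Draw n points uniformly from the union of the listed edges."""
    out = np.empty((n, 2))
    for i in range(n):
        a, b = EDGES[rng.choice(edge_ids)]
        t = rng.uniform()
        out[i] = (1 - t) * np.array(VERTS[a]) + t * np.array(VERTS[b])
    return out

def gadget(case, n, rng):
    """Return (samples from p, samples from q) for the 'Y' or 'N' gadget."""
    if case == "Y":   # p = q = uniform over all four edges
        return sample_on_edges([0, 1, 2, 3], n, rng), sample_on_edges([0, 1, 2, 3], n, rng)
    # N gadget: p uses a random pair of parallel edges, q uses the other pair.
    pair_p, pair_q = ([0, 2], [1, 3]) if rng.uniform() < 0.5 else ([1, 3], [0, 2])
    return sample_on_edges(pair_p, n, rng), sample_on_edges(pair_q, n, rng)

rng = np.random.default_rng(2)
p_pts, q_pts = gadget("N", 3, rng)  # with only three samples the two cases look alike
```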
A disadvantage of the above proof technique is that the Ramsey theory argument (used in the first step) only applies if the domain is extremely large. Using an enhancement of the technique from [DKN17], we can reduce this to domains of doubly exponential size in k (see Theorem 3.8).
To achieve this, we need to modify our diagonal-square construction so that three samples provide little information to the tester even when the numerical values of these samples are also revealed. To do this, we show that by applying carefully chosen random functions to the x- and y-coordinates, we can effectively obscure almost all non-order-based information contained in any set of three samples. For the univariate case, [DKN17] showed that, for two samples, applying a random affine transformation can obscure both the difference and the average of a pair of points. However, when there are three points a > b > c, applying an affine transformation preserves the value of (a − c)/(b − c). Hence, a non-trivial amount of information may be retrieved by the tester by computing this quantity, even if a random affine transformation is applied. To address this issue in our two-dimensional setting, we will apply an exponential function x → exp(exp(λ)x), where λ is a carefully chosen uniform random variable. Then, if a, b, c are not too close, the ratio (a − c)/(b − c) computed on the transformed points will be exponentially close to exp(exp(λ)(a − b)) = exp(exp(λ + log(a − b))). When λ is large compared to log(a − b), this quantity will therefore have roughly the same distribution of outputs, independent of a, b, c. As a result, the transformation effectively hides the information encoded by the ratio (a − c)/(b − c). Afterwards, we can mirror the analysis from [DKN17] and apply a suitable random affine transformation to hide all of the remaining information. The details of the construction and its analysis can be found in Section 3.4.
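A quick numerical illustration (our own, with an arbitrary choice of λ and of the three points) of the obfuscation step: after applying F(x) = exp(exp(λ)x), the affine-invariant ratio of the transformed points is governed by λ + log(a − b), so randomizing λ over a wide range washes out its dependence on the original points.

```python
import numpy as np

def transformed_ratio(a, b, c, lam):
    """(F(a) - F(c)) / (F(b) - F(c)) for F(x) = exp(exp(lam) * x)."""
    F = lambda x: np.exp(np.exp(lam) * x)
    return (F(a) - F(c)) / (F(b) - F(c))

a, b, c = 0.9, 0.5, 0.1              # three points with a > b > c, not too close
lam = 3.0
lhs = transformed_ratio(a, b, c, lam)
rhs = np.exp(np.exp(lam) * (a - b))  # = exp(exp(lam + log(a - b)))
print(lhs, rhs)                      # the two agree up to an exponentially small error
```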

Basic Notation
For n ∈ Z_+, we denote [n] := {1, . . . , n}. We will use S_m for the set of all permutations over m distinct elements. Given m > 0, we use Poi(m) to denote the Poisson distribution with mean m.
An axis-aligned rectangle R is a set in R^d that can be represented as the product of d intervals, R = I_1 × · · · × I_d. We will use p, q to denote the probability density functions of our distributions (or probability mass functions for discrete distributions). For discrete distributions p, q over [n], their ℓ_1 and ℓ_2 distances are ∥p − q∥_1 = Σ_{i=1}^n |p_i − q_i| and ∥p − q∥_2 = (Σ_{i=1}^n (p_i − q_i)^2)^{1/2}. The total variation distance between distributions p, q is defined to be d_TV(p, q) = (1/2)∥p − q∥_1. Let R ⊂ R^d be a subset of the domain of p. We denote by p_{|R} the conditional distribution of p restricted to R, i.e., p_{|R}(x) = p(x)/p(R) for x ∈ R and p_{|R}(x) = 0 otherwise. The reduced measure corresponding to p and a collection R = {R_1, . . . , R_k} of k disjoint rectangles, which we denote by p_R, is the discrete measure on [k] defined by p_R(i) = p(R_i) for i ∈ [k].

Organization
The structure of this paper is as follows: In Section 2, we develop the analysis tools required to design and analyze our closeness tester. Section 3 contains our sample complexity lower bound.
In Section 4, we provide some conclusions and open problems.

Closeness Testing Algorithm
In this section, we describe and analyze our multivariate A_k-closeness tester. The structure of this section is as follows: In Section 2.1, we present our algorithm and its analysis. The proof of our main structural result (Proposition 2.1), which relies on Ramsey theory, is given in Section 2.2. In Section 2.3, we describe and analyze our new closeness tester for discrete distributions which detects discrepancies supported on a small number of light domain elements (Lemma 2.8). Finally, Section 2.4 describes some applications of our A_k closeness tester to testing closeness of structured distributions under the total variation distance.

The Tester and its Analysis
We start with an overview of our algorithmic approach followed by a detailed pseudo-code and analysis of our tester.
Overview of Algorithmic Approach Let R = {R_i}_{i=1}^k be a collection of k disjoint rectangles which witness the A_k-distance between p, q. The main technical obstacle of A_k closeness testing is that the algorithm does not know (a priori) such a collection of rectangles. To circumvent this issue, we draw samples from p, q and use the obtained information to construct a set of rectangles that capture a non-trivial amount of discrepancy between the underlying distributions. A natural way to construct our rectangles is as follows. Given a collection of sample points from p and q, we group these points into disjoint pairs and let our rectangles be those defined by the corresponding pairs.
Note that the number of ways to group the sample points into disjoint pairs scales exponentially with the number of samples drawn. But before we discuss how the grouping is done in our algorithm, we need to prove that this approach can work in principle, i.e., that if one draws sufficiently many samples, there exists a small set of rectangles (each defined by a pair of sample points) that captures enough discrepancy between p and q.
Let x, y be two samples drawn from the mixture (1/2)(p + q). Conditioned on the event that x, y both land in some rectangle R ∈ R of the witnessing partition, we show that R_{x,y}, the rectangle defined by x, y, will in expectation capture a non-trivial amount of the discrepancy in R. The formal statement is specified in Proposition 2.1 and its proof is given in Section 2.2.
By applying Proposition 2.1 to each rectangle R_i ∈ R, one can show the existence of a collection of k′ = O(k) rectangles defined by the sample points which capture enough discrepancy between p, q (Lemma 2.3). It then remains to find these rectangles and invoke an appropriate closeness testing procedure to compare the probability mass of p, q on them. Trying all possible collections of rectangles defined by the sample points is certainly not computationally feasible. Even worse, the natural analysis of this brute-force strategy would require one to union-bound the failure probabilities of the closeness testing steps executed on each possible collection of rectangles. As the number of possible collections scales exponentially with the size of the collection, i.e., k′, each individual closeness testing routine is only allowed to fail with exponentially small probability, making the sample complexity of this approach at least linear in k.
We instead follow an approach inspired by the idea of a Good Oblivious Covering in [DKP19]. In particular, we consider a sub-collection of rectangles defined by the coordinates of the sample points that forms a nice "cover" of all possible such rectangles. We then proceed to define the notion of "induced" distributions of p, q on the cover, such that the two corresponding induced distributions have large ℓ_2-discrepancy supported on a small number of domain elements if and only if there exists a collection of rectangles defined by the sample points over which the probability masses of p, q differ significantly. Then, applying a novel variant of the ℓ_{1,k}-tester from [DKN17] (see Lemma 2.8) yields our final tester.
We are now ready to proceed with the details of the proof.
Discrepancy from Random Points Let R be an axis-aligned rectangle such that p(R) and q(R) differ substantially, and let x, y be sample points drawn from (1/2)(p + q)_{|R}, the uniform mixture distribution of p, q restricted to R. We consider the rectangle defined by x, y, which we denote by R_{x,y}. Our main structural result, serving as the direct motivation for our algorithm, shows that R_{x,y} captures a non-trivial amount of discrepancy between p and q with non-trivial probability.
Proposition 2.1 (Random Point Discrepancy). Let p, q be distributions on R^d and let R ⊂ R^d be an axis-aligned rectangle satisfying |p(R) − q(R)| ≥ ϵ(p(R) + q(R)). Let x, y be random points sampled from (1/2)(p + q)_{|R}. Then there exists a number α_d = C d^2 2^{2^{d+1}}, for some sufficiently large universal constant C > 0, such that E_{x,y}[|p(R_{x,y}) − q(R_{x,y})|] ≥ ϵ^{α_d} (p(R) + q(R)).

The proof of Proposition 2.1 makes essential use of Ramsey theory and is one of the main technical contributions of this work. We defer its proof to Section 2.2.
Here we comment on the quantitative aspects of this result. Specifically, it is not clear whether the ϵ^{α_d} multiplicative factor in the right-hand side of the final inequality is best possible. It is a plausible conjecture that the optimal dependence is poly(ϵ), independent of the dimension d (see Question 4.1). Such an improvement would directly improve the sample complexity of our closeness tester as a function of ϵ.
Existence of Witnessing Grid-aligned Rectangles We begin with an assumption that simplifies our analysis: the cumulative density function of each coordinate of p and of q is continuous. We will eventually remove this assumption in the proof of our main theorem. Suppose that ∥p − q∥_{A_k} ≥ ϵ. Then there exists a collection of k disjoint axis-aligned rectangles R_1, . . . , R_k with Σ_{i=1}^k |p(R_i) − q(R_i)| ≥ ϵ. By Proposition 2.1, if two samples x, y happen to land in the same rectangle R_i, the rectangle R_{x,y} they define will capture a non-trivial fraction of the discrepancy in R_i. For this reason, we restrict our attention to rectangles lying on the sample-point grid defined below.

Definition 2.2 (Sample-Point Grid). Let S = {x^{(1)}, . . . , x^{(m)}} ⊂ R^d be a set of sample points such that no two points overlap in any of their coordinates, i.e., x^{(i)}_j ≠ x^{(i′)}_j for all i ≠ i′ and j ∈ [d]. The sample-point grid G_S (with respect to S) is the set of all points z ∈ R^d such that, for each i ∈ [d], the i-th coordinate z_i is chosen from the set {x^{(1)}_i, . . . , x^{(m)}_i}. Given an axis-aligned rectangle R, we say that R is a grid-aligned rectangle with respect to G_S if all its vertices are grid points from G_S.
Let G_S be a sample-point grid with respect to a collection of sufficiently many i.i.d. samples from (1/2)(p + q). We first show that, with high constant probability, there exist O(k) many rectangles aligned with G_S that capture enough discrepancy between p, q in ℓ_2 distance.
Lemma 2.3 (Existence of a Small Set of Witnessing Grid-aligned Rectangles). Let α_d > 0 be as defined in Proposition 2.1. Let p, q be distributions over R^d with ∥p − q∥_{A_k} ≥ ϵ, let S be a set of Poi(m) i.i.d. samples drawn from (1/2)(p + q), where m is larger than a sufficiently large universal constant C > 0 times an appropriate function of k, ϵ, and d, and let G_S be the sample-point grid defined by these points. With probability at least 9/10, there exist k′ ≤ 3k disjoint grid-aligned rectangles R̃_1, . . . , R̃_{k′} with respect to G_S satisfying Σ_{i=1}^{k′} (p(R̃_i) − q(R̃_i))^2 = Ω((ϵ/4)^{2α_d} m^2/k^3).

Proof. Let R_1, . . . , R_k be disjoint rectangles with Σ_{i=1}^k |p(R_i) − q(R_i)| ≥ ϵ, and write v_i := (1/2)(p(R_i) + q(R_i)) and ϵ_i := |p(R_i) − q(R_i)|/(p(R_i) + q(R_i)). We first perform some preliminary simplifications to make sure that v_i = O(1/k) and ϵ_i ≥ ϵ/4. Given v_i > 1/k, we can subdivide R_i into ⌊v_i k⌋ sub-rectangles evenly along the first coordinate, according to the cumulative density function of the first coordinate of (1/2)(p + q). We next discard any rectangles R_i such that ϵ_i < ϵ/4, which leads to us losing at most Σ_i (p(R_i) + q(R_i)) ϵ/4 ≤ ϵ/2 discrepancy. In summary, after these operations, we will have a collection of k̃ ≤ 3k rectangles R_1, . . . , R_{k̃} such that for each rectangle R_i in the collection we have v_i ≤ 1/k, ϵ_i ≥ ϵ/4, and Σ_{i=1}^{k̃} v_i ϵ_i ≥ ϵ/2. Let S be the set of Poi(m) many i.i.d. samples drawn and G_S be the corresponding sample-point grid. We define the random variable Y_i as follows: if exactly two samples x, y ∈ S fall in the same rectangle R_i for i ∈ [k̃], then Y_i = (p(R_{x,y}) − q(R_{x,y}))^2; otherwise, Y_i = 0. By the definition of Y_i, we know that if Y_i > 0, then there exists some rectangle R ⊂ R_i aligned with G_S such that (p(R) − q(R))^2 = Y_i. Hence, Σ_{i=1}^{k̃} Y_i is always a lower bound on the discrepancy collected by the best collection of at most k̃ ≤ 3k rectangles aligned with the grid G_S, for any instance of the set S. Consequently, to prove the lemma, it suffices to show that Σ_{i=1}^{k̃} Y_i ≥ Ω((ϵ/4)^{2α_d} m^2/k^3) with probability at least 9/10.
Consider the event E_i that exactly two sample points land inside R_i. Conditioned on E_i, these two points x, y are distributed as two random points drawn from (1/2)(p + q)_{|R_i}. By our preliminary simplification, we have that |p(R_i) − q(R_i)| ≥ ϵ_i (p(R_i) + q(R_i)) with ϵ_i ≥ ϵ/4, so Proposition 2.1 applies to R_i with parameter ϵ_i. Combining this with Jensen's inequality then gives a lower bound on E[(p(R_{x,y}) − q(R_{x,y}))^2]. Since Y_i conditioned on the event E_i is distributed as (p(R_{x,y}) − q(R_{x,y}))^2 and Y_i is always non-negative, this yields a corresponding lower bound on E[Y_i]. Summing over all Y_i's, we obtain that E[Σ_i Y_i] is at least Ω((ϵ/4)^{2α_d} m^2/k^3), where the first inequality uses Equation (1), in the second inequality we bound ϵ_i from below by ϵ/4, and in the third inequality we use the fact that Σ_{i=1}^b a_i^4, subject to Σ_i a_i = A and a_i ≥ 0, is minimized at a_i = A/b.
On the other hand, since Y_i is defined to be non-zero only when there exist two points landing in R_i, and takes values at most v_i^2, we also obtain an upper bound on the variance of Σ_i Y_i; the last inequality in that bound uses the assumed lower bound on m, for some sufficiently large universal constant C > 0. Then, by Chebyshev's inequality, it follows that Σ_i Y_i ≥ Ω((ϵ/4)^{2α_d} m^2/k^3) with probability at least 9/10. This concludes the proof of Lemma 2.3.
Existence of Good Grid Covering By Lemma 2.3, there exist O(k) grid-aligned rectangles that capture Ω_{d,ϵ}(m^2/k^3) discrepancy between p, q. A naive tester may proceed as follows: Choose a set of k′ = O(k) disjoint rectangles aligned with the sample-point grid, and then perform closeness testing between the reduced distributions of p, q on the chosen rectangles. Then, with non-trivial probability, the chosen rectangles will capture enough discrepancy between p, q, and a standard closeness tester would suffice. Unfortunately, the number of ways to choose k′ disjoint grid-aligned rectangles from a grid containing m^d grid points is at least m^{Ω(d k′)}. If we were to try all possible collections of k′ disjoint grid-aligned rectangles, the resulting tester would likely be inefficient, as discussed in our techniques overview (Section 1.2), in terms of both sample complexity and computational complexity. To circumvent this issue, we will instead consider a carefully chosen subset of all grid-aligned rectangles with respect to the sample-point grid, such that any grid-aligned rectangle can be decomposed into the union of a small number of rectangles from the family. Moreover, the subset is carefully constructed to have the subtle property that any point x ∈ R^d is contained in a small number of rectangles from the subset. This leads us to the concept of Grid Covering, which is based on the idea of Good Oblivious Covering (Definition 2 from [DKP19]).
Definition 2.4 (Grid Covering). Let m be a power of 2, let S be a set of (m + 1) points in R^d, and let G_S be the corresponding sample-point grid. A grid covering is a family of rectangles aligned with the sample-point grid, which we denote by F(G_S), satisfying the following: • Any rectangle aligned with the grid can be represented as the union of at most 2^d log^d m disjoint rectangles from F(G_S).
• Any point in R^d is contained in exactly log^d m rectangles from F(G_S).
With a construction similar to that in [DKP19], we show that a Grid Covering always exists.
Lemma 2.5 (Existence of Grid Covering). Let m be a power of 2, let S be a set of (m + 1) points from R^d, and let G_S be the corresponding sample-point grid. Then there exists a grid covering F(G_S).
Proof. For each coordinate j ∈ [d], let x^{(1)}_j, . . . , x^{(m)}_j be the j-th coordinates of the collected samples, sorted in increasing order. We will refer to these numbers as the "grid values". For each i ∈ [log m], we define I_{j,i} as the partition of the interval [x^{(1)}_j, x^{(m)}_j] into 2^i many sub-intervals such that each sub-interval in the partition contains an equal number of grid values. Then the rectangles in F(G_S) are those of the following form: for each j ∈ [d], an interval I_j ∈ ∪_i I_{j,i} is chosen, and the rectangle is simply the product of the d selected intervals I_j.
Then it is easy to see that, for any value z ∈ [x^{(1)}_j, x^{(m)}_j], z is within log m intervals from ∪_i I_{j,i} (one interval from each partition). As a result, any point in R^d is within log^d m rectangles from F(G_S).
Let R be a grid-aligned rectangle that is the product of the intervals I_1, . . . , I_d. Notice that the interval I_j can be decomposed into at most 2 log m intervals from ∪_i I_{j,i} (at most 2 intervals from each partition I_{j,i}). Thus, R can be decomposed into at most 2^d log^d m rectangles from F(G_S). This completes the proof.
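The per-coordinate ingredient of this construction can be sketched as follows (our own code following the proof; interval boundaries are placed halfway between consecutive grid values, a detail the proof leaves implicit). The d-dimensional covering F(G_S) is then the set of products of one interval per coordinate drawn from these families.

```python
import numpy as np

def dyadic_intervals(sorted_vals):
    """For one coordinate with m grid values (m a power of 2), return the
    families I_{j,1}, ..., I_{j,log m}: level i partitions the range into 2^i
    intervals, each containing an equal number of grid values."""
    m = len(sorted_vals)
    levels = []
    for i in range(1, int(np.log2(m)) + 1):
        width = m // (2 ** i)
        cuts = [sorted_vals[0]] + [
            (sorted_vals[t * width - 1] + sorted_vals[t * width]) / 2
            for t in range(1, 2 ** i)
        ] + [sorted_vals[-1]]
        levels.append(list(zip(cuts[:-1], cuts[1:])))
    return levels

def membership_count(levels, value):
    """A grid value lies in exactly one interval per level, i.e., log m in total."""
    return sum(sum(lo <= value <= hi for lo, hi in lvl) for lvl in levels)

vals = np.sort(np.random.default_rng(3).uniform(size=16))   # m = 16 grid values
print(membership_count(dyadic_intervals(vals), float(vals[5])))  # prints 4 = log2(16)
```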
We next define the notion of the induced distributions of p, q on F(G_S).
Definition 2.6 (Induced Distribution). Given a distribution p on R^d and a family of sets F whose elements are non-empty sets in R^d that are not necessarily disjoint, the induced distribution p_F is defined as follows. To draw a random sample from p_F, one first draws a random sample x from p. If x does not belong to any set in F, we return the special element ∅. Otherwise, we return a uniformly random set S ∈ F such that x ∈ S.
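A direct sketch of Definition 2.6 (our own code; F is represented as a list of axis-aligned rectangles and draw_p is any routine returning one sample from p):

```python
def sample_induced(draw_p, F, rng):
    """Draw one sample from the induced distribution p_F: sample x ~ p, then
    return (the index of) a uniformly random rectangle of F containing x, or
    None (the special element) if no rectangle contains x."""
    x = draw_p()
    containing = [idx for idx, rect in enumerate(F)
                  if all(lo <= xi <= hi for xi, (lo, hi) in zip(x, rect))]
    if not containing:
        return None
    return int(rng.choice(containing))
```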
Notice that for a rectangle R ∈ F(G_S), we have that p_{F(G_S)}(R) = p(R)/log^d m, since each point appears in exactly log^d m rectangles from F(G_S). This then allows us to show that the ℓ_2-discrepancy between the induced distributions p_{F(G_S)}, q_{F(G_S)} must be non-trivial if the grid G_S satisfies the conclusion of Lemma 2.3. Specifically, we show:

Lemma 2.7. Let m be a power of 2, S be a set of (m + 1) points in R^d, and G_S be the corresponding sample-point grid. Moreover, suppose that the conclusion of Lemma 2.3 holds for G_S. Then there exists a subset of rectangles H ⊂ F(G_S) such that the following conditions hold: (i) |H| = O(2^d k log^d m); (ii) every rectangle in H has probability mass O(1/(k log^d m)) under each of p_{F(G_S)} and q_{F(G_S)}; and (iii) Σ_{R∈H} (p_{F(G_S)}(R) − q_{F(G_S)}(R))^2 = Ω(2^{-d} log^{-3d}(m) (ϵ/4)^{2α_d} m^2/k^3).

Proof. Since we assume that the conclusion of Lemma 2.3 is satisfied, there exist k′ ≤ 3k many grid-aligned rectangles R̃_1, . . . , R̃_{k′} satisfying Σ_{i=1}^{k′} (p(R̃_i) − q(R̃_i))^2 = Ω((ϵ/4)^{2α_d} m^2/k^3). By the definition of the grid covering, each R̃_i can be decomposed into at most 2^d log^d(m) rectangles from F(G_S). Let H_i be the set of rectangles in F(G_S) into which R̃_i is decomposed. We will consider H := ∪_{i=1}^{k′} H_i, whose size is at most 3k · 2^d log^d(m), which shows (i). Moreover, by the definition of the induced distribution, for any rectangle R ∈ F(G_S), we have p_{F(G_S)}(R) = p(R)/log^d m (and similarly for q); since each rectangle in H_i is contained in R̃_i, which has mass O(1/k) under each of p and q, this shows (ii). It remains to show (iii). For each H_i, the Cauchy-Schwarz inequality gives (p(R̃_i) − q(R̃_i))^2 ≤ |H_i| Σ_{R∈H_i} (p(R) − q(R))^2 ≤ 2^d log^d(m) Σ_{R∈H_i} (p(R) − q(R))^2; summing over i and rescaling by the factor log^d m relating p to p_{F(G_S)} yields (iii).

Unlike the naive testing approach (running an ℓ_1-closeness tester on many different pairs of reduced distributions), we can now run a closeness tester just on the induced distributions p_{F(G_S)}, q_{F(G_S)}. A technical issue is that the domain size of F(G_S) is still very large. This makes the black-box application of any ℓ_1-closeness tester sample-inefficient. Instead, we need to leverage the fact that a non-trivial fraction of the discrepancy between the two distributions is supported on a small number of elements. Interestingly, a tester with similar guarantees was developed in [DKN17] (see Lemma 2.5 therein). However, as is, that tester is not sufficient for our purposes. More specifically, we essentially need to develop an ℓ_2-version of it. The reason is that we need to distinguish between the case p = q versus the case that a non-trivial amount of ℓ_2-discrepancy is supported on a few elements that are themselves not too heavy. In particular, using tools developed in [DK16] and [CDVV14], we show the following:

Lemma 2.8. Let p, q be discrete distributions on [n] and s ∈ [n]. Given ϵ > 0 and Poi(m) many i.i.d. samples from each of p, q, for m = Θ(max{ϵ^{-4/3}, ϵ^{-2}/√s}), there exists a tester Flatten-Closeness that distinguishes between the following cases with probability at least 9/10: (a) p = q versus (b) there exists a set of elements H of size s such that (i) Σ_{i∈H} (p_i − q_i)^2 ≥ ϵ^2 and (ii) each element of H has probability mass O(1/s) under each of p and q.

The proof of Lemma 2.8 builds on the approach of the ℓ_{1,k}-tester. An important difference is that we now need to carefully incorporate the upper bound on the mass of the elements witnessing the discrepancy into the analysis. We defer the proof to Section 2.3.
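For intuition on the kind of statistic such a tester is built on, here is the standard Poissonized ℓ_2 statistic from [CDVV14] (a hedged sketch of ours; the actual Flatten-Closeness tester of Lemma 2.8 additionally exploits the lightness of the discrepancy elements and uses a different thresholding than this plain version).

```python
from collections import Counter

def ell2_statistic(samples_p, samples_q):
    """Z = sum_i [(X_i - Y_i)^2 - X_i - Y_i], where X_i, Y_i are the counts of
    element i among Poi(m) samples from p and q. Then E[Z] = m^2 * ||p - q||_2^2,
    so Z is centered at 0 when p = q and grows with the ell_2 discrepancy."""
    X, Y = Counter(samples_p), Counter(samples_q)
    return sum((X[a] - Y[a]) ** 2 - X[a] - Y[a] for a in set(X) | set(Y))
```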
We are now ready to present the pseudo-code of our testing algorithm and provide its proof of correctness.
Algorithm 1 Multidimensional A_k Closeness Tester
Require: sample access to p, q on R^d; accuracy ϵ.
1: Set m = C′ k^{6/7} ϵ^{−2α_d/3} log^d(k) 2^{d/3}, where C′ is a sufficiently large constant and α_d is defined in Proposition 2.1.
2: Draw Poi(m) samples from (1/2)(p + q) and denote the set of samples by S.
3: Add arbitrarily some distinct points to S such that |S| is a power of 2.
4: Construct the grid G_S (Definition 2.2) and the grid covering F(G_S) (Definition 2.4).
5: Run the ℓ_2-closeness tester of Lemma 2.8 on the induced distributions p_{F(G_S)}, q_{F(G_S)} with accuracy parameter κ = c 2^{−d} log^{−3d}(k) (ϵ/4)^{2α_d} m^2/k^3 for some sufficiently small constant c > 0.
6: Accept if that closeness tester accepts; otherwise Reject.
Proof of Upper Bound in Theorem 1.2. We first present the analysis assuming that p, q are continuous distributions, and at the end give a preprocessing step to make sure the algorithm works for general distributions. Let F(G_S) be defined as in Algorithm 1. If p = q, we have p_{F(G_S)} = q_{F(G_S)}. Therefore, the tester will accept with probability at least 2/3 by Lemma 2.8.
Next, we consider the case ∥p − q∥_{A_k} > ϵ. We claim that with probability at least 9/10 there exist k′ ≤ k grid-aligned rectangles (with respect to G_S) such that the conclusion of Lemma 2.3 is satisfied. Without the operation of adding extra points to S in Line 3 of Algorithm 1, the claim follows directly from Lemma 2.3. It is then easy to see that R_1, …, R_{k′} remain grid-aligned rectangles after the extra points are added. Hence, the claim follows.
Condition on the event that the conclusion of Lemma 2.3 holds. We can then apply Lemma 2.7, which gives us that there exists a set of elements H ⊂ F(G_S) supporting a non-trivial amount of ℓ_2-discrepancy between the induced distributions. Then, applying Lemma 2.8 shows that the tester rejects with probability at least 9/10, provided m is at least a sufficiently large constant C > 0 times the sample complexity of Lemma 2.8 with accuracy parameter κ = Θ(1) 2^{−d} log^{−3d}(k) (ϵ/4)^{2α_d} m^2/k^3. One can verify that m = C′ k^{6/7} ϵ^{−2α_d/3} log^d(k) 2^{d/3} suffices, where C′ is a sufficiently large constant.

Now let us relax the assumption that the marginal distributions of p, q in each coordinate have continuous cumulative distribution functions. We begin with the observation that the algorithm's output essentially depends only on the order information of the sample points. That is, given two different sets of samples whose rank orderings along each coordinate j ∈ [d] are the same, the output of the algorithm will always be the same.
Based on this observation, we know that the algorithm will satisfy the same guarantee if we give it only the "rank" information of the samples. For j ∈ [d], we sort the j-th coordinates x^{(1)}_j, …, x^{(m)}_j of the samples in increasing order, and denote by π(j)_i the rank of x^{(i)}_j in the sorted sequence. Then, for each i ∈ [m], we replace the original sample with the new sample x̃^{(i)} defined by x̃^{(i)}_j = π(j)_i. If the marginal distributions of p or q are not continuous, we may observe multiple samples sharing the same value at some coordinates. In that case, when computing the rank information of the samples, we break ties uniformly at random. Now consider the distributions p′, q′ obtained by stretching any point mass of the marginal distributions at any coordinate into an interval. If the algorithm takes samples from p′, q′ instead, the guarantees are satisfied, since p′, q′ are both continuous distributions and ∥p − q∥_{A_k} = ∥p′ − q′∥_{A_k}. On the other hand, the order of samples taken from p′, q′ has the same distribution as the order of samples taken from p, q after we break ties uniformly at random. This concludes the proof.
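A minimal NumPy sketch of this rank-based preprocessing, including the uniformly random tie-breaking, is given below; the array layout (one row per sample, one column per coordinate) is an assumption made for the illustration.

import numpy as np

def to_rank_information(samples, rng=None):
    # samples: (m, d) array.  Returns an (m, d) integer array whose (i, j) entry is the
    # rank of sample i along coordinate j, with ties broken uniformly at random.
    rng = np.random.default_rng() if rng is None else rng
    samples = np.asarray(samples, dtype=float)
    m, d = samples.shape
    ranks = np.empty((m, d), dtype=int)
    for j in range(d):
        keys = rng.random(m)                        # independent random tie-breaking key
        order = np.lexsort((keys, samples[:, j]))   # sort by value, then by the random key
        ranks[order, j] = np.arange(m)
    return ranks

print(to_rank_information([[0.3, 1.0], [0.3, 0.2], [0.9, 0.2]]))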

Proof of Proposition 2.1
Let R be an axis-aligned rectangle such that |p(R) − q(R)| ≥ ϵ(p(R) + q(R)).Let x, y be samples from (1/2)(p + q) |R -the uniform mixture of p, q restricted to R. We want to show that in expectation over x, y the discrepancy |p(R x,y ) − q(R x,y )| is large.
Warm-up: Special case p(R) > 0 and q(R) = 0. Towards establishing the desired statement, we first analyze the special case that p(R) > 0 and q(R) = 0.The proof for this case also serves as intuition regarding why selecting the interval R x,y is a good choice.
In this case, the discrepancy between p and q is simply p(R x,y ) -the probability mass of R x,y with respect to p. Therefore, whether R x,y captures enough discrepancy boils down to the following question: Let x, y be random points drawn from an arbitrary distribution D over R d .What is the minimum amount of mass captured by the rectangle R x,y in expectation?We show the quantity is indeed non-trivial.
Interestingly, the proof of this statement relies on a certain generalized version of the famous Erdős-Szekeres theorem. In particular, the generalized Erdős-Szekeres theorem gives the smallest length that a sequence of points in R^d must have in order to guarantee a subsequence of a prescribed length that is monotonic in every coordinate.
Lemma 2.9. Let x, y be random samples independently drawn from a distribution D on R^d. Then E_{x,y∼D}[D(R_{x,y})] ≥ β_d, for some constant β_d > 0 depending only on d. We note that the above statement is qualitatively nearly tight as a function of d.
Proof of Lemma 2.9. To prove the lemma, we make essential use of the following generalized version of the Erdős-Szekeres theorem, due to de Bruijn.
Fact 2.10 (de Bruijn's Generalized Erdős-Szekeres Theorem, see [Kru53]). Let ψ(n, d) denote the least integer N such that every sequence of points x^{(1)}, …, x^{(N)} in R^d contains a monotonic subsequence x^{(i_1)}, …, x^{(i_n)} of length n satisfying the following: for each coordinate j ∈ [d], we have either that x^{(i_1)}_j ≤ x^{(i_2)}_j ≤ … ≤ x^{(i_n)}_j or that x^{(i_1)}_j ≥ x^{(i_2)}_j ≥ … ≥ x^{(i_n)}_j. Then it holds that ψ(n, d) = (n − 1)^{2^d} + 1.

As an immediate corollary, we obtain the following:

Corollary 2.11. Let S ⊂ R^d be a set of points with size |S| ≥ 2^{2^{d−1}} + 1. Then there exists a triple x, y, z ∈ S such that z ∈ R_{x,y}. Furthermore, there exists a set of points S of size |S| = 2^{2^{d−1}} such that there is no triple x, y, z ∈ S satisfying z ∈ R_{x,y}.
Proof. Let S be an arbitrary set of m points in R^d. Let x^{(1)}, …, x^{(m)} be a sequence of points in R^{d−1} obtained by (1) sorting the points in S based on their first coordinates, and (2) throwing away their first coordinates. Applying Fact 2.10 with n = 3 guarantees a subsequence x^{(i_1)}, x^{(i_2)}, x^{(i_3)} that is either monotonically increasing or monotonically decreasing in each of the (d − 1) coordinates whenever m ≥ 2^{2^{d−1}} + 1, and this threshold is tight. Furthermore, by our construction of the sequence of x^{(i)}'s, the first coordinates of the corresponding points in S are always monotonically increasing. Taking x, y to be the points of S corresponding to x^{(i_1)}, x^{(i_3)} and z the point corresponding to x^{(i_2)}, monotonicity in every coordinate gives z ∈ R_{x,y}. This concludes the proof.
It is worth noting that for a set of points S ⊂ R^d to be guaranteed to contain a triple x, y, z such that z ∈ R_{x,y}, the size of S needs to be doubly exponential in d; this bound is tight, since the corollary is essentially equivalent to Fact 2.10, which is itself quantitatively tight. Now let S be the set of 2^{2^{d−1}} points such that there is no triple x, y, z ∈ S satisfying z ∈ R_{x,y} (Corollary 2.11 ensures the existence of such a set of points). Let D be the uniform distribution over S. One can see that D(R_{x,y}) ≤ 2/2^{2^{d−1}} for any x, y ∈ S. It hence follows that E_{x,y∼D}[D(R_{x,y})] ≤ 2/2^{2^{d−1}}, showing that our lower bound is qualitatively tight.
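The combinatorial statement behind Corollary 2.11 is easy to probe numerically. The brute-force Python check below searches a point set for a triple x, y, z with z contained in the rectangle R_{x,y}; for d = 2, Corollary 2.11 says any 2^{2^{d−1}} + 1 = 5 points in general position must contain such a triple.

from itertools import permutations
import random

def in_rectangle(z, x, y):
    # True if z lies in the axis-aligned rectangle R_{x,y} spanned by corners x and y.
    return all(min(a, b) <= c <= max(a, b) for a, b, c in zip(x, y, z))

def find_sandwiched_triple(points):
    # Return some (x, y, z) with z in R_{x,y}, or None if no such triple exists.
    for x, y, z in permutations(points, 3):
        if in_rectangle(z, x, y):
            return x, y, z
    return None

pts = [(random.random(), random.random()) for _ in range(5)]  # 2^{2^{d-1}} + 1 points for d = 2
print(find_sandwiched_triple(pts))  # a triple exists for any 5 points in general position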
To relate Lemma 2.9 to the generalized Erdős-Szekeres theorem (Fact 2.10), we make the following observations: (i) the probability mass of R_{x,y} under D is equal to the probability that a third random point z drawn from D happens to land in R_{x,y}, and (ii) drawing three random samples from D is equivalent to first drawing N random samples from D and then choosing 3 distinct points from these N points uniformly at random. Let D_N be the empirical distribution obtained after drawing N i.i.d. samples from D. The observations above allow us to express E_{x,y∼D}[D(R_{x,y})] as the probability that a uniformly random ordered triple x, y, z of the N drawn points satisfies z ∈ R_{x,y}. If we have N ≥ 2^{2^{d−1}} + 1, Corollary 2.11 guarantees the existence of such a triple among the N points. Hence, the probability above is at least 1/N^3. This completes the proof of Lemma 2.9.
General Case.We are now ready to handle the general case and complete the proof of Proposition 2.1.To do so, we leverage the concept of the discrepancy density defined below.
Definition 2.12 (Discrepancy Density).Let p, q be distributions over R d .For a set S ⊆ R d , we define the discrepancy density of S with respect to p, q as follows: ρ(S; p, q) def = 2 |p(S) − q(S)|/(p(S) + q(S)) .
The high-level intuition is the following. Let R_{x,y} be the axis-aligned rectangle defined by x, y, where x, y are independent random samples drawn from (1/2)(p + q)|_R. By Lemma 2.9, the probability mass of R_{x,y} (with respect to the mixture distribution (1/2)(p + q)) is a non-trivial fraction of the mass of R in expectation. If the discrepancy between p, q within R_{x,y} is a non-trivial fraction of the mass of R_{x,y}, we are done. Otherwise, if we were to "remove" the region R_{x,y} from R, we would discard approximately equal amounts of p mass and q mass. Therefore, the discrepancy density of the remaining space, ρ(R\R_{x,y}; p, q), must have increased. We can then carve the remaining space into at most 2d axis-aligned sub-rectangles and pick a sub-rectangle with significantly higher discrepancy density to restart the process. When the discrepancy density approaches one, the situation qualitatively resembles the special case where q(R) = 0 (or p(R) = 0); and if the mass of R_{x,y} is non-trivial, the discrepancy captured will also be non-trivial. The formal proof follows.
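Before the formal argument, the carving step (made precise in Claim 2.13 below) can be written out explicitly: removing a sub-box R̃ from an axis-aligned box R leaves at most 2d axis-aligned boxes, two slabs per coordinate. The following minimal Python sketch illustrates this decomposition; the box representation as (lower, upper) corner tuples is our own choice.

def carve(R, R_sub):
    # R, R_sub: axis-aligned boxes given as (lower, upper) length-d corner tuples,
    # with R_sub contained in R.  Returns at most 2d disjoint boxes covering R \ R_sub.
    lo, hi = list(R[0]), list(R[1])
    slo, shi = list(R_sub[0]), list(R_sub[1])
    pieces = []
    cur_lo, cur_hi = lo[:], hi[:]
    for j in range(len(lo)):
        # slab below R_sub in coordinate j, within the current remaining box
        if cur_lo[j] < slo[j]:
            piece_hi = cur_hi[:]; piece_hi[j] = slo[j]
            pieces.append((cur_lo[:], piece_hi))
        # slab above R_sub in coordinate j
        if shi[j] < cur_hi[j]:
            piece_lo = cur_lo[:]; piece_lo[j] = shi[j]
            pieces.append((piece_lo, cur_hi[:]))
        # shrink to the R_sub range in coordinate j and continue with the rest
        cur_lo[j], cur_hi[j] = slo[j], shi[j]
    return pieces

print(carve(((0, 0), (4, 4)), ((1, 1), (2, 3))))  # four rectangles covering the complement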
Proof of Proposition 2.1. For notational convenience, we will denote D := (1/2)(p + q). Let x, y be two sample points drawn from D|_R, the restriction of D to R. Then, by Lemma 2.9, the expected mass D(R_{x,y}) is at least β_d D(R), for some β_d depending only on d. We will use E to denote the event {D(R_{x,y}) ≥ β_d D(R)/2}. Then we must have that Pr_{x,y∼D|_R}[E] ≥ β_d/2, since otherwise E_{x,y∼D|_R}[D(R_{x,y})] would be no more than β_d D(R). We consider two complementary cases. First, if the discrepancy |p(R_{x,y}) − q(R_{x,y})| is a sufficiently large fraction of D(R_{x,y}), we are done. Otherwise, p and q place nearly equal mass on R_{x,y}. Since we also condition on the event E, we know that there exists a rectangle R̃ ⊂ R with D(R̃) ≥ β_d D(R)/2 on which p and q place nearly equal mass. We consider the remaining space R \ R̃. Its discrepancy density ρ(R \ R̃; p, q) exceeds that of R by a multiplicative factor 1 + γ, where γ is determined by the mass that has been removed; in bounding γ from below we use that D(R̃) ≥ β_d D(R)/2 by our choice of R̃. This then gives that γ ≥ β_d/4. It turns out that the remaining space R \ R̃ can be carved into 2d axis-aligned rectangles. An illustration of the d = 2 case is given in Figure 1.
Specifically, we show the following:

Claim 2.13. Let R̃ ⊆ R be an axis-aligned rectangle. The set R \ R̃ can be decomposed into 2d axis-aligned rectangles.

Proof. The proof proceeds via induction. The base case (d = 2) is clear, as shown in Figure 1. Assume that the statement holds for d = k. We proceed to show that it still holds for d = k + 1. Suppose that R is defined by points x, y ∈ R^{k+1} and R̃ is defined by points x̃, ỹ ∈ R^{k+1}. We let R_{2k+1} be the rectangle that occupies the interval [x_1, x̃_1] in the first dimension and occupies the same intervals as R in the other dimensions; similarly, let R_{2k+2} be the rectangle that occupies the interval [ỹ_1, y_1] in the first dimension and occupies the same intervals as R in the other dimensions.
Then the remaining space R \ (R_{2k+1} ∪ R_{2k+2}) lies entirely in the interval [x̃_1, ỹ_1] in the first dimension. We can then discard the first dimension. We denote the projection of R \ (R_{2k+1} ∪ R_{2k+2}) onto the remaining subspace R^k as R′, and the projection of R̃ as R̃′. We can then apply our inductive hypothesis on R′ and R̃′ to obtain 2k rectangles, which together with R_{2k+1} and R_{2k+2} yield the desired decomposition into 2(k + 1) rectangles.

We denote the rectangles obtained by applying the above claim to R \ R̃ as R_1, …, R_{2d}, and for each R_i we denote by γ_i (and λ_i) the analogous quantities defined with respect to R_i. We claim that there exists a rectangle R_{i*} in the remaining space such that Equation (5) is satisfied. By the definition of D(R_i) and γ_i, we can sum over the R_i, where the first step follows from the fact that the rectangles form a partition of the remaining space, and the second step follows from the fact that the sum of the discrepancies over the rectangles must be at least the total discrepancy in the remaining space. Substituting Equation (6) into Equation (7) and simplifying the result implies that γ_{i*} ≥ γ/2d for some index i*. On the other hand, a corresponding bound holds for the mass of R_{i*}. This then establishes the existence of an i* such that Equation (5) is satisfied. We can inductively restart the process with R′ = R_{i*} and ϵ′ = ϵ(1 + λ_{i*}). In each iteration, the discrepancy density must increase by at least a multiplicative factor of (1 + γ/2d) ≥ (1 + C β_d/2d), for some universal constant C > 0. Since the discrepancy density is at most one, the process must terminate in O(d β_d^{−1} log(1/ϵ)) many iterations, and we will eventually find some rectangle R* such that Equation (3) is satisfied.
It remains to show that the mass D(R*) is bounded below. Suppose that in the t-th iteration, we start from the rectangle R^{(t)} with discrepancy density ϵ^{(t)} and end with the rectangle R^{(t+1)} = R^{(t)}_{i*}. Denote by R̃^{(t)} the rectangle discarded and by ϵ^{(t+1/2)} := ϵ^{(t)}(1 + γ^{(t)}) the discrepancy density of the remaining space R^{(t)} \ R̃^{(t)}. We analyze how much the mass of the rectangle can shrink in each iteration, as follows:

, where the first line uses our choice of R^{(t)}_{i*} such that Equation (5) is satisfied, the second line uses the elementary inequality (1 − x) ≥ (1 − x)^2/(1 − x/2)^2 for any x ≤ 1, the third line uses the definition of γ^{(t)}, and the last line uses the facts that γ^{(t)}_{i*} ≥ γ^{(t)}/2d by our choice of R^{(t)}_{i*} such that Equation (5) is satisfied, and that γ^{(t)} ≥ Ω(β_d) by our choice of R̃^{(t)}.
Notice that the discrepancy density increases by a multiplicative factor of 1 + γ^{(t)}_{i*} in the t-th iteration. Combining this with the fact that the process terminates in at most O(d β_d^{−1} log(1/ϵ)) iterations and Equation (8), we can bound the total shrinkage of the mass over all iterations. This then shows that there exists a rectangle R* satisfying Equation (3) whose mass D(R*) is bounded below as claimed, for some sufficiently large universal constant C. This concludes the proof of Proposition 2.1.

Proof of Lemma 2.8
Given a discrete distribution p, flattening [DK16] is the technique of using a small set of samples from p to appropriately subdivide its bins (domain elements) aiming to reduce the ℓ 2 -norm of the distribution.Formally, the flattening technique yields what was described in [DK16] as a split distribution.
Definition 2.14 (Definition 2.4 from [DK16]). Given a distribution p on [n] and a multiset S of elements of [n], define the split distribution p_S on [n + |S|] as follows: For 1 ≤ i ≤ n, let a_i denote 1 plus the number of elements of S that are equal to i. Thus, Σ_{i=1}^n a_i = n + |S|. We can therefore associate the elements of [n + |S|] to elements of the set B = {(i, j) : i ∈ [n], 1 ≤ j ≤ a_i}. We now define a distribution p_S with support B, by letting a random sample from p_S be given by (i, j), where i is drawn randomly from p and j is drawn randomly from [a_i].
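As a minimal illustration of Definition 2.14, the Python sketch below simulates one draw from the split distribution p_S: the multiset S is given as a list, and a sample is produced from a single sample of p plus one uniform index, in line with item (i) of Fact 2.15 below. Names and the toy distribution are illustrative only.

import random
from collections import Counter

def split_sample(sample_p, S):
    # One sample from the split distribution p_S of Definition 2.14.
    # sample_p: returns an element i of [n] drawn from p;  S: a multiset (list) of elements of [n].
    counts = Counter(S)
    i = sample_p()                 # a single sample from p suffices (cf. Fact 2.15(i))
    a_i = 1 + counts[i]            # bin i is split into a_i pieces
    j = random.randint(1, a_i)     # uniformly random piece
    return (i, j)

# Toy usage: p puts mass (0.7, 0.2, 0.1) on {1, 2, 3}; S splits the heavy bin 1 into 3 pieces.
weights = [0.7, 0.2, 0.1]
S = [1, 1]
print(split_sample(lambda: random.choices([1, 2, 3], weights=weights)[0], S))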
We will use the following basic facts about split distributions.
Fact 2.15 (Fact 2.5 and Lemma 2.6 from [DK16]). Let p and q be probability distributions on [n], and S a given multiset of [n]. Then: (i) We can simulate a sample from the split distribution p_S or q_S by taking a single sample from p or q, respectively. (ii) It holds that ∥p_S − q_S∥_1 = ∥p − q∥_1; moreover, if S consists of Poi(m) i.i.d. samples from p, then E_S[∥p_S∥_2^2] ≤ 1/m.

We will also leverage the following ℓ_2-distance estimator to develop our final tester.
We can easily convert the above ℓ 2 -distance estimator to an ℓ 2 -closeness tester, which is more applicable to our setting.
Let S be a multiset of Poi(m) i.i.d.samples from (1/2) (p + q) for some m ≤ s/100.First, we argue that the ℓ 2 -distance between p, q will not decrease by too much after flattening, by taking advantage of the fact that the ℓ 2 -discrepancy between p, q is supported on a few light elements.
Let X_i be the random variable denoting the number of samples from S landing in the i-th element, i.e., X_i ∼ Poi(m(p_i + q_i)/2). Then the expected discrepancy, restricted to the set H of elements witnessing the discrepancy, after flattening is at least Σ_{i∈H} (p_i − q_i)^2 (1 − e^{−λ_i})/λ_i, where λ_i := m(p_i + q_i)/2 and (1 − e^{−x})/x is a decreasing function of x. Since m ≤ s/100 and (p_i + q_i)/2 ≤ 1/s, it follows that (1 − e^{−λ_i})/λ_i is bounded below by (1 − e^{−0.01})/0.01 ≥ 0.99. This then gives us Equation (9). On the other hand, the variance of the discrepancy restricted to the elements in H, after flattening, can be bounded from above, where in the last inequality of that bound we use the fact that 1/(1 + X_i) is at most 1, together with Equation (9). This then gives us Equation (10). Combining Equations (9) and (10), we obtain Equation (11). On the other hand, by Fact 2.15 and Markov's inequality, we obtain Equation (12). By the union bound, Equations (11), (12) and Corollary 2.17, it follows that the ℓ_2-closeness tester of Corollary 2.17 succeeds with probability at least 2/3, if we take m′ = C_1 m ϵ^{−2} many samples, for a sufficiently large constant C_1. Balancing m and m′ (with the restriction that m ≤ s/100 in mind) then gives us that the overall tester succeeds with probability at least 2/3 if we draw Poi(m) many i.i.d. samples with m = Θ(max(ϵ^{−4/3}, ϵ^{−2}/√s)).
This concludes the proof of Lemma 2.8.
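For intuition on the kind of ℓ_2 statistic that underlies such testers, the NumPy sketch below computes the standard Poissonized ℓ_2 estimator in the spirit of [CDVV14] (not necessarily the exact test of Corollary 2.17): with X_i ∼ Poi(m p_i) and Y_i ∼ Poi(m q_i) independent, the statistic Z = Σ_i [(X_i − Y_i)^2 − X_i − Y_i] has expectation m^2 ∥p − q∥_2^2. All constants in the demo are placeholders.

import numpy as np

def poissonized_l2_statistic(p, q, m, rng=None):
    # Z = sum_i [(X_i - Y_i)^2 - X_i - Y_i], with X_i ~ Poi(m p_i), Y_i ~ Poi(m q_i);
    # E[Z] = m^2 * ||p - q||_2^2.
    rng = np.random.default_rng() if rng is None else rng
    X = rng.poisson(m * np.asarray(p))
    Y = rng.poisson(m * np.asarray(q))
    return np.sum((X - Y) ** 2 - X - Y)

rng = np.random.default_rng(0)
p = np.full(1000, 1 / 1000)
q = p.copy()
q[:10] += 0.001    # move a little mass onto ten light bins ...
q[10:20] -= 0.001  # ... and off ten others, so ||p - q||_2^2 = 2e-5
print(poissonized_l2_statistic(p, p, 20000, rng))  # fluctuates around 0 when p = q
print(poissonized_l2_statistic(p, q, 20000, rng))  # fluctuates around m^2 * 2e-5 = 8000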

Applications: Closeness Testing of Multivariate Structured Distributions under Total Variation Distance
The most direct application of our multivariate A_k-closeness tester is to the problem of testing closeness of multivariate histogram distributions, i.e., distributions that are piecewise constant over (the same) unknown collection of axis-aligned rectangles, with respect to the total variation distance. This follows directly from our main theorem, since for any pair of k-histogram distributions p, q with respect to the same set of rectangles, we have ∥p − q∥_{A_k} ≥ d_TV(p, q). Formally, we have the following:

Corollary 2.18. Let {R_i}_{i=1}^k be a set of axis-aligned rectangles in R^d. Suppose p, q are distributions over R^d that are piecewise constant over each of {R_i}_{i=1}^k, i.e., p(x) = p(y) for any x, y ∈ R_i, and the same for q. Then there exists a tester which distinguishes between p = q and d_TV(p, q) > ϵ with sample complexity C k^{6/7} ϵ^{−2α_d/3} log^d(k) 2^{d/3}, where C is a sufficiently large universal constant and α_d = O(d^2 2^{2^{d+1}}).
We now proceed with our second application.We consider the binary hypothesis class H consisting of all possible k-unions of axis-aligned rectangles within the unit cube [0, 1] d .Given two hypotheses h 1 , h 2 ∈ H, we can test whether h 1 is equivalent to h 2 or they are far from each other under the uniform distribution over the unit cube [0, 1] d .
Corollary 2.19. Let H be the class of all possible k-unions of axis-aligned rectangles within the unit cube [0, 1]^d. Let h_1, h_2 be two unknown hypotheses from H. Given ϵ > 0 and sample access to (x, h_i(x)), where x follows the uniform distribution over [0, 1]^d, there exists an efficient algorithm which distinguishes with probability at least 2/3 between (i) h_1(x) = h_2(x) for all x, and (ii) Pr_{x∼U}[h_1(x) ≠ h_2(x)] > ϵ, where U is the uniform distribution over [0, 1]^d. Moreover, the algorithm has sample complexity C k^{6/7} ϵ^{−2α_d/3} log^d(k) 2^{d/3}, where C is a sufficiently large constant and α_d is as in Corollary 2.18.

Proof. Consider the distributions p, q defined as follows. To draw a sample from p, we take a sample (x, h_1(x)) where x ∼ U. If h_1(x) = 1, we return x. Otherwise, we return some arbitrarily chosen point s ∉ [0, 1]^d. We define q similarly based on h_2. If h_1 and h_2 are identical, it is easy to see that p = q. If E_{x∼U}[1{h_1(x) ≠ h_2(x)}] > ϵ, we claim that ∥p − q∥_{A_k} ≥ ϵ/2. Suppose that h_1 is the union of the rectangles {R_i}_{i=1}^k and h_2 is the union of the rectangles {R′_i}_{i=1}^k. Then the disagreement probability splits into the probability that h_1(x) = 1 and h_2(x) = 0 and the probability that h_2(x) = 1 and h_1(x) = 0. Without loss of generality, we assume that the first term is larger, and hence at least ϵ/2. On the other hand, this discrepancy between p and q is witnessed by the rectangles defining h_1. Thus, this gives ∥p − q∥_{A_k} ≥ ϵ/2. Therefore, we can distinguish between the two cases by performing A_k-closeness testing between p, q with accuracy parameter ϵ/2.
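The reduction used in this proof is mechanical, and a minimal Python sketch may help make it concrete: labeled samples (x, h(x)) are converted into samples from the surrogate distribution p by collapsing points with label 0 onto a fixed point outside the unit cube. The hypothesis h1 and the chosen outside point are illustrative; the resulting samples would then be handed to an A_k-closeness tester.

import random

OUTSIDE = (-1.0, -1.0)  # an arbitrary fixed point outside [0, 1]^2

def surrogate_sample(h):
    # One sample from the distribution p built from hypothesis h in the proof above (d = 2).
    x = (random.random(), random.random())
    return x if h(x) == 1 else OUTSIDE

# Toy hypothesis: indicator of the single rectangle [0, 0.5] x [0, 0.5].
h1 = lambda z: 1 if (z[0] <= 0.5 and z[1] <= 0.5) else 0
print([surrogate_sample(h1) for _ in range(3)])
# The analogous samples for h2 and these would then be fed to an A_k closeness tester.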

Sample Complexity Lower Bound
In this section, we prove our sample complexity lower bound. Specifically, we show that the task of A_k-closeness testing gets information-theoretically harder as we go from one dimension to two dimensions. For the one-dimensional case, it was shown in [DKN15a] that the sample complexity of A_k-closeness testing is Θ(max(k^{4/5} ϵ^{−6/5}, k^{1/2} ϵ^{−2})). Perhaps surprisingly, for two-dimensional distributions, we prove a sample complexity lower bound of Ω(k^{6/7}/ϵ^{8/7}) in the sublinear regime, where ϵ > k^{−1/8}. This lower bound clearly dominates the sample complexity of one-dimensional A_k testing in the same regime.
At a very high level, we build on the lower bound framework of [DKN15a].In particular, our lower bound proof consists of two steps.First, we argue that, if the domain size is a sufficiently large function of d, k, we can assume without loss of generality that the output of the tester only depends on the relative order of samples ranked in each coordinate.This is shown in Section 3.1.
Then, for such "order-based" testers, we present two explicit families of pairs of two-dimensional distributions such that a random pair of distributions from the first family are identical, and a random pair of distributions from the second family are far from each other in A k -distance.Moreover, a random pair of distributions from the first family is hard (i.e., requires many samples) to distinguish from a random pair from the second.This step requires a carefully designed gadget consisting of distributions over R 2 supported on the edges of a square.We present the construction and analyze its key properties in Section 3.2.
Next, we appropriately replicate the gadget many times to create the full hard instance of 2-dimensional A_k-closeness testing. The description of the hard instance and its detailed analysis can be found in Section 3.3. Finally, we provide an alternative way to prove a sample complexity lower bound against general A_k testers, while requiring the domain size to be at most doubly exponential in k. This involves a careful application of randomly chosen monotonic transformations to the x and y coordinates of all points in order to hide extra "non-order-based" information that a tester could retrieve from the numerical values of the sample coordinates. This more refined construction and its analysis are presented in Section 3.4.

Order-Based Testers
Here we define the class of order-based testers and show that we can translate lower bounds against order-based testers to lower bounds against general testers at the cost of increasing the domain size. More formally, we consider algorithms which are restricted to obtain information from what we call the Order Sampling process, as opposed to the usual direct sampling. This can be thought of as follows. We first draw i.i.d. samples from the unknown distributions. Then, instead of feeding them directly to the algorithm, we perform an appropriate pre-processing step to extract only the information related to the order of the coordinates of the samples, and reveal only this order information to the algorithm.

Definition 3.1 (Order Sampling). Let p, q be a pair of distributions on R^2. Let {(x_i, y_i), ℓ_i}_{i=1}^m be m i.i.d. samples, where (x_i, y_i) is sampled from (1/2)(p + q) and ℓ_i records whether the sample comes from p or q. Let σ(x), σ(y) ∈ S_m be the permutations representing the ranks of the x-coordinates and the y-coordinates, respectively. The Order Tuple associated with the m samples is given by Order({x_i, y_i, ℓ_i}_{i=1}^m) = (σ(x), σ(y), ℓ). Furthermore, we will use D(p, q, m) to denote the distribution over the tuple (σ(x), σ(y), ℓ) obtained through this process.
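For concreteness, the order sampling process of Definition 3.1 can be simulated with a few lines of NumPy; the helper below returns exactly (σ(x), σ(y), ℓ), and the toy distributions in the demo are arbitrary choices.

import numpy as np

def order_tuple(xs, ys, labels):
    # Order({(x_i, y_i), l_i}): rank permutations of the x- and y-coordinates plus the labels.
    sigma_x = np.argsort(np.argsort(np.asarray(xs)))
    sigma_y = np.argsort(np.argsort(np.asarray(ys)))
    return sigma_x, sigma_y, np.asarray(labels)

def order_sample(sample_p, sample_q, m, rng=None):
    # Draw m samples from (p + q)/2, record which distribution each came from,
    # and reveal only the order tuple, as in Definition 3.1.
    rng = np.random.default_rng() if rng is None else rng
    labels = rng.integers(0, 2, size=m)      # 0 -> sample from p, 1 -> sample from q
    pts = [sample_p(rng) if l == 0 else sample_q(rng) for l in labels]
    xs, ys = zip(*pts)
    return order_tuple(xs, ys, labels)

print(order_sample(lambda r: tuple(r.random(2)),
                   lambda r: tuple(r.random(2) ** 2), 4))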
As our first structural lemma, we show that if an algorithm is able to perform A k -closeness testing with direct sample access on a domain of size N × N , then we can always use it to build another algorithm which performs the test with only the order tuple of the same number of samples -albeit on a smaller domain of size n × n.The proof uses a Ramsey-theoretic argument and generalizes Theorem 13 in [DKN15a].
Lemma 3.2.For all n, m, k ∈ Z + where m < n and ϵ > 0, there exist N 1 , N 2 ∈ Z + such that the following holds: If there exists an algorithm A that for every pair of distributions p, q over [N 1 ] × [N 2 ] distinguishes the case p = q from the case ∥p − q∥ A k > ϵ with probability at least 4/5 while taking m samples from p and q, then there exists an algorithm A ′ that for every pair of distributions p ′ , q ′ over [n] × [n] distinguishes the case p ′ = q ′ versus ∥p ′ − q ′ ∥ A k > ϵ with probability at least 2/3 given a tuple T from the order sampling process D(p ′ , q ′ , m).
Proof. Suppose we are given the algorithm A which can perform A_k-closeness testing over the domain [N_1] × [N_2] given direct i.i.d. sample access to p, q. We show that we can use A to construct another algorithm A′ which performs the test with only tuples obtained from the order sampling process over the domain [n] × [n]. Let {(x_i, y_i), ℓ_i}_{i=1}^m be the samples drawn by A. We will write A({(x_i, y_i), ℓ_i}_{i=1}^m) to denote the probability that A outputs "YES" given these samples. Before we specify our construction, we remark that we can assume without loss of generality that the image of A({(x_i, y_i), ℓ_i}_{i=1}^m) has size at most 11. This is because we can always round the probability to the nearest multiple of 1/10 and lose only 1/10 in the overall success probability.
Let p, q be the unknown distributions supported on [n] × [n]. The key step is to argue the existence of two monotonic transformations f_x : [n] → [N_1] and f_y : [n] → [N_2], where N_2 is chosen to be a sufficiently large function of n, and N_1 is chosen to be a sufficiently large function of n and N_2, such that if one feeds the samples {(f_x(x_i), f_y(y_i)), ℓ_i} to A, the output of A becomes a function only of Order({(x_i, y_i), ℓ_i}). In other words, we want to find two mappings f_x, f_y such that the output of A is the same whenever Order({(x_i, y_i), ℓ_i}) is the same. Given such mappings, we can then define A′({(x_i, y_i), ℓ_i}) := A({(f_x(x_i), f_y(y_i)), ℓ_i}). Then, it is easy to see that A′ is an order-based tester. Furthermore, since f_x, f_y are both monotonic, the domain transformation preserves the A_k distance between p and q. Hence, A′ enjoys the same guarantee and gives the correct answer with probability at least 2/3.
We next show the existence of such a pair of transformations f x , f y .We do so in two steps.First, we show the existence of the transformation f x which will make the output of algorithm A independent of the actual values of the x-coordinate.This then allows us to construct an algorithm A x that depends only on the rank information of the x-coordinates, the y-coordinates and the labels.Then we show the existence of f y , which is defined with respect to A x , that makes the output of A x independent of the actual values of the y-coordinates.This then allows us to conclude the existence of the algorithm A ′ .
For convenience, we will rewrite the tuples {(x_i, y_i), ℓ_i}_{i=1}^m as (X, σ(x), {y_i}_{i=1}^m, {ℓ_i}_{i=1}^m), where X is the set of x-coordinates and σ(x) is the permutation which maps i ∈ [m] to the rank of x_i. For each fixed set X, let g_X denote the mapping that sends (σ(x), {y_i}, {ℓ_i}) to the acceptance probability of A. Notice that the set of values that g_X takes has size at most 11, since we assume the acceptance probability of A conditioned on any input can take at most 11 different values. We note that there can be at most 11^{m! N_2^m 2^m} many different types of mappings g_X.
If we view each such set X as a hyper-edge of the complete m-uniform hypergraph on the vertex set [N_1], and the associated mapping g_X as the coloring of that hyper-edge, then by Ramsey's theorem there exists a subset of vertices V of size n such that all hyper-edges contained in V receive the same color, as long as N_1 is sufficiently large compared to N_2 and m. In other words, there exists a subdomain V ⊂ [N_1] such that if the x-coordinates of the samples are all from this subdomain, the acceptance probability of algorithm A becomes a function of only {y_i}, {ℓ_i}, σ(x), independent of the actual x-coordinates X ⊆ V. We then choose f_x to be the order-preserving mapping from [n] to [N_1] whose image is exactly V.
We next consider the algorithm A_x which first applies the transformation f_x and then runs the testing algorithm A on the resulting samples. From the argument above, we know that algorithm A_x depends only on σ(x), {y_i}_{i=1}^m, {ℓ_i}_{i=1}^m. Similarly, we can rewrite the tuple as (σ(x), σ(y), Y, {ℓ_i}_{i=1}^m), where Y is the set of y-coordinates and σ(y) is the permutation which maps i to the rank of y_i. With a similar argument, as long as N_2 is sufficiently large compared to n, m, we can show the existence of an order-preserving mapping f_y such that if we apply the mapping f_y first and then run A_x, the output of A_x becomes a function only of σ(x), σ(y), {ℓ_i}_{i=1}^m, independent of the actual set of y-coordinates Y. Notice that (σ(x), σ(y), {ℓ_i}_{i=1}^m) is exactly the order tuple Order({(x_i, y_i), ℓ_i}_{i=1}^m). Hence, such a pair of transformations f_x, f_y is exactly what we need to construct algorithm A′. Setting A′({(x_i, y_i), ℓ_i}) := A({(f_x(x_i), f_y(y_i)), ℓ_i}) then concludes the proof.

Square-Edge Distributions
We now present the building block of our lower bound construction, which consists of distributions supported on the edges of a square. Notice that though the domain is R^2, the supports of such distributions are lower-dimensional. We will use t, r to represent such distributions; one can refer to Figure 2 for a visual illustration. In the YES case, we set p_Yes = q_Yes = (t + r)/2. In the NO case, with probability 1/2, we have p_No = t, q_No = r; otherwise, we have p_No = r and q_No = t. Then, if we perform order sampling with m samples from p_No, q_No, we obtain an order tuple following the uniform mixture (1/2)(D(t, r, m) + D(r, t, m)). Notice that in the YES case, we have p_Yes = q_Yes; in the NO case, we have ∥p_No − q_No∥_{A_k} = 1 deterministically, even for k = 4. Yet, we show in the next lemma that the distributions over order tuples in the two cases are the same when m is no more than 3. This immediately gives us that no order-based algorithm can distinguish between the two cases with fewer than 4 samples.

Lemma 3.5. We have that D((t + r)/2, (t + r)/2, m) = (D(t, r, m) + D(r, t, m))/2 for m = 1, 2, 3.
Proof. First, note that for any permutations π, τ ∈ S_m, we have Pr[σ(x)_Yes = π, σ(y)_Yes = τ] = Pr[σ(x)_No = π, σ(y)_No = τ]. This is because the samples, ignoring the labels, in both cases come from the distribution supported uniformly on the four edges of the square.
Hence, it suffices to show that the distribution over the label vector ℓ No conditioned on any geometric patterns σ(x) No , σ(y) No is uniform.
With this observation in mind, the m = 1 case is trivial, since there is only one geometric pattern and it is clear that the label ℓ_No is uniform. For m = 2, let the coordinates of the first sample be (a, b), which divide the space into four quadrants. Then, by Fact 3.4, no matter which of the four quadrants the second sample falls into, the probability that the point comes from t is the same as the probability that it comes from r. Hence, the uniformity of ℓ_No follows.
For m = 3, we will begin with σ(x) = (1, 2, 3), σ(y) = (1, 2, 3) and show that ℓ_No is uniform conditioned on that. We claim that this is true even if we further condition on the coordinates of the "middle point": we condition on x_2 = x, y_2 = y for some arbitrarily chosen point (x, y) from the support. It is easy to see that the marginal distribution of ℓ_2 is uniform, since it only depends on whether we are sampling from D(t, r, 3) or D(r, t, 3). For the same reason, further conditioning on the value of ℓ_2 completely determines whether we are sampling from D(t, r, 3) or D(r, t, 3). Consequently, (x_1, y_1, ℓ_1) and (x_3, y_3, ℓ_3) are now independent samples from the lower-left and the upper-right quadrants determined by (x, y), and the uniformity of their labels follows from Fact 3.4 as in the m = 2 case.

The lemma then easily follows, since we also have r = Θ(k). We first bound the mutual information as a summation over all possible order tuples, grouped by the size of the order tuple (recall that for an order tuple t, the size of the order tuple, denoted |t|, is simply the number of samples from which the order tuple is derived). We will use the indicator variable H_i to denote whether the i-th square is chosen to be a "heavy" square. Notice that H_i is independent of X. Furthermore, if the i-th square is chosen to be a heavy square, the distribution of T_i conditioned on X = 0 is exactly the same as conditioned on X = 1, so such terms do not contribute. Next, we note that Pr[T_i = t | X = 0, H_i = 0] for |t| = λ is given by the distribution (1/2)D(t, r, λ) + (1/2)D(r, t, λ). On the other hand, Pr[T_i = t | X = 1, H_i = 0] for |t| = λ is given by the distribution D((t + r)/2, (t + r)/2, λ). Hence, by Lemma 3.5, the two agree for any t satisfying |t| ≤ 3. This allows us to discard the summation over any t with |t| ≤ 3. Hence, the expression can be further upper bounded, where in the second line we upper bound the difference in the numerator by the corresponding sum and upper bound the denominator by Pr[T_i = t, H_i = 1], in the third line we use that Σ_i a_i b_i ≤ (max_i a_i)(Σ_i b_i) when a_i, b_i ≥ 0, and in the final equality we note that the sum of the probabilities of T_i = t over all t with |t| = λ is exactly the probability that |T_i| = λ. To show this, we first bound max_{t:|t|=λ} of the relevant ratio. Recall that T_i can be decomposed into a binary vector ℓ_i representing the labels and a permutation tuple σ_i ∈ S_λ × S_λ representing the rank information of the x and y coordinates. We note that σ_i | H_i = 0, |T_i| = λ has the same distribution as σ_i | H_i = 1, |T_i| = λ. Then, conditioned on σ_i, the distribution of ℓ_i is uniform when H_i = 1. This then gives the desired bound on max_{t:|t|=λ}. Combining this with Equation (15) and multiplying both sides by 1/Pr[H_i = 1] then gives Equation (14). Substituting Equation (14) into Equation (13) then gives us the claimed bound.
This concludes the proof of Lemma 3.7.
We are now ready to conclude the proof of our main lower bound result.
Proof of Lower Bound in Theorem 1.2. By Lemma 3.6, given that m < k/2, with probability at least 99% both p, q are measures of total mass Θ(1), and if X = 0, it holds that ∥p − q∥_{A_k} > Ω(ϵ) with probability at least 99%. By Lemma 3.7, the mutual information between the random bit X and the order tuple T ∼ D(p, q, m′), for m′ ∼ Poi(m), is at most O(m^7 ϵ^8/k^6). This means that no algorithm, given T as input, can reliably predict the value of X with probability more than 2/3 unless m > Ω(1) min(k^{6/7}/ϵ^{8/7}, k). By Lemma 3.6, it holds that p/∥p∥_1, q/∥q∥_1 are a pair of identical distributions if X = 0, and a pair of distributions that are Ω(ϵ)-far in A_k distance with probability at least 99% if X = 1. Furthermore, with probability 99%, T is an order tuple of at most O(m) many samples. Therefore, we conclude that the sample complexity of A_k testing is at least Ω(1) min(k^{6/7}/ϵ^{8/7}, k).

Even though the distributions p, q used in the construction are continuous, we next show that they can easily be "rounded" to discrete distributions that remain hard for the testing algorithm. In particular, we can construct a grid G which splits the domain into Θ(m^6) squares such that the mass of any square R ∈ G under (1/2)(p + q) is bounded by 1/m^3. Then, we consider the discrete distributions p′, q′ which round the points falling in a square R ∈ G to its top-left vertex. It is easy to see that if p = q, then p′ = q′. Moreover, for an arbitrary rectangle R ⊂ R^2, we have p(R) − q(R) = p′(R) − q′(R) ± Θ(1/m^3). Hence, the effect of the rounding on the A_k distance between p, q is at most Θ(k/m^3), which can be safely ignored when m < k. On the other hand, D(p, q, m′) is nearly the same as D(p′, q′, m′), since the distributions over the order tuples are the same as long as no two points fall in the same square (which fails with probability at most O(1/m)). Hence, the cases p′ = q′ and ∥p′ − q′∥_{A_k} > ϵ are also hard to distinguish given tuples from the order sampling process unless m > Ω(1) min(k^{6/7}/ϵ^{8/7}, k). Finally, by Lemma 3.2, we can translate any lower bound under order sampling back to the usual direct sampling.
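As a small illustration of the rounding step used above, the Python sketch below snaps 2-D points to the top-left vertices of a square grid; the cell width is an arbitrary placeholder rather than the Θ(m^6)-cell grid of the argument.

import numpy as np

def snap_to_grid(points, cell=0.01):
    # Round each 2-D point to the top-left vertex of its grid cell:
    # x rounded down to a multiple of `cell`, y rounded up.
    pts = np.asarray(points, dtype=float)
    x = np.floor(pts[:, 0] / cell) * cell
    y = np.ceil(pts[:, 1] / cell) * cell
    return np.stack([x, y], axis=1)

print(snap_to_grid([[0.123, 0.456], [0.5, 0.501]], cell=0.1))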

Domain Size Optimization
The lower bound of Theorem 1.2 holds only when the domain size N is substantially larger than the other parameters. In particular, the statement does not quantitatively characterize the sample complexity as a function of the domain size. The bottleneck of the analysis lies in Lemma 3.2, which offers an inefficient (in terms of the size of the domain after the transformation) way of transforming the domain to "hide" the extra information that an algorithm can extract from the samples beyond their relative order. In this section, we provide a more efficient and constructive way to disguise the information in the values of each sample's coordinates, and we build on it to provide a tighter lower bound in terms of the domain size.
The main result of this section is the following:

Theorem 3.8 (Stronger Lower Bound for Discrete Distributions). Fix an integer V > 0. Let p and q be distributions on [V] × [V] and let ϵ > 0 be less than a sufficiently small constant. Any tester that distinguishes between p = q and ∥p − q∥_{A_k} ≥ ϵ for some k ≤ V with probability at least 2/3 must use at least m samples, where m ≥ Ω(1) min(k^{2/3} ϵ^{−4/3} (log log V / log log log V)^{1/3}, k^{6/7} ϵ^{−8/7}).

Before presenting the transformation formally, we provide some high-level intuition. Recall that in the lower bound construction from Section 3.3 the domain is partitioned into r × r many squares, where r = Θ(k), and the distributions are supported on squares lying on the diagonal. The argument then proceeds to bound the order information of samples coming from each of the squares. Now suppose that the algorithm is allowed to look at the absolute coordinates of the samples. If only 1 or 2 points fall in some square, the only extra information we need to hide is its absolute position and the distance between the points. To do so, we can generalize the techniques developed in [DKN17] to randomly scale and shift the square along both the x- and y-axes.
For 2-dimensional A k closeness testing, if the algorithm takes Θ(k 6/7 ϵ −8/7 ) many samples, since there are Θ(k) many squares in total, 3 or more samples could fall in the same square.Then the algorithm also gets to see the ratio of distances between different pairs of points, which remains invariant even if the coordinates of the points are scaled uniformly within the square.To handle this, we will instead apply an uneven scaling on different parts of the square.In particular, we map points with x-coordinate a to exp(a λ) with some randomly chosen λ (and the same for the y-coordinate), which then makes the ratio of distances also noisy.
To formalize this idea, we first define a distribution over monotonic mappings, which we will then use to transform the points.

Definition 3.9 (Distribution over monotonic mappings). Let W > 0. We define M(W) as a distribution over monotonic mappings of the form f : [0, 1] → R_+. To sample a mapping from M(W), we first sample three parameters λ_1, λ_2, λ_3, which are uniform random variables over the intervals [log log W, 2 log log W], [0, log^3 W], [0, exp(2 log^3 W)], respectively. Then, the mapping f ∼ M(W) is given by f(x) = exp(x exp(λ_1)) exp(λ_2) + λ_3.

Let a < b < c be three points lying in [0, 1]. Here we show that, as long as a, b, c are sufficiently separated, transforming the points by a random mapping f from M(W) helps obfuscate the information a tester can retrieve from them. In particular, we argue that the distribution of (f(a), f(b), f(c)) (where the randomness is over f) is close to some fixed distribution D for any choice of well-separated points a, b, c.

Lemma 3.10. Let f ∼ M(W) with f(x) = exp(x exp(λ_1)) exp(λ_2) + λ_3. Then, there exists a fixed distribution D over the domain R_+^3 such that for any three points a < b < c from [0, 1] satisfying

min(b − a, c − b) ≥ 1/ log log W,    (16)

we always have d_TV((f(a), f(b), f(c)), D) ≤ O(log log log W / log log W).

Proof. Define A := (f(c) − f(a))/(f(b) − f(a)), B := f(b) − f(a), and C := f(a). First, we note that it suffices to show that (log log A, log B, C) is close in total variation distance to some distribution D′ for an arbitrary choice of a, b, c satisfying the condition in Equation (16), since the map (f(a), f(b), f(c)) ↦ (log log A, log B, C) is a bijection.
In particular, let U_1, U_2, U_3 be uniform distributions over the intervals [log log W, 2 log log W], [0, log^3 W] and [0, exp(2 log^3 W)], respectively. We argue that (log log A, log B, C) is close to the distribution U_1 × U_2 × U_3. The proof strategy is the following. We first bound the total variation distance between log log A and U_1. Then, conditioned on log log A, we show that log B is close to U_2. Finally, conditioned on all the other variables, we show that C is close to U_3.
Suppose f(x) = exp(exp(λ_1) x + λ_2) + λ_3. We have that log log A = g_{a,b,c}(λ_1), where g_{a,b,c}(x) := log log [ (exp(c exp(x)) − exp(a exp(x))) / (exp(b exp(x)) − exp(a exp(x))) ]. It is easy to verify that g_{a,b,c} is monotonically increasing as a function of x for any a < b < c. Since λ_1 is uniform over [log log W, 2 log log W], the support of log log A is [g_{a,b,c}(log log W), g_{a,b,c}(2 log log W)]. By the change of variables rule for probability density functions, we can write down the density of log log A explicitly. Before we bound the total variation distance between log log A and U_1, we discuss some useful properties of g_{a,b,c}.
Claim 3.11. Suppose a < b < c in [0, 1] satisfy Equation (16). Then, for all x ∈ [log log W, 2 log log W] and sufficiently large W, we have: (i) g_{a,b,c}(x) ≤ x; (ii) |g_{a,b,c}(x) − x| = O(log log log W); and (iii) g′_{a,b,c}(x) = 1 ± O(1/ log log W).

Proof. For the proof of this claim, we will temporarily drop the subscript of g_{a,b,c} and write only g. For property (i), the inequality follows from a direct computation. For properties (ii) and (iii), our strategy is to show that g(x) is approximately just x + log(c − b) for sufficiently large W. To do so, we consider the function h(x) := exp(g(x)) = log [ (exp(c exp(x)) − exp(a exp(x))) / (exp(b exp(x)) − exp(a exp(x))) ]. Our goal now is to show that h(x) is approximately (c − b) exp(x). Denote L_θ(x) := log(1 − exp(−θ exp(x))). We then have h(x) = (c − b) exp(x) + L_{c−a}(x) − L_{b−a}(x). For θ ∈ [1/ log log W, 1] and x ∈ [log log W, 2 log log W], we claim that L_θ(x) becomes almost the zero function, in terms of both its values and its derivative, as W grows. Using the inequality −log(1 − exp(−z)) ≤ 1/z for z > 0, we obtain Equation (17). Furthermore, the derivative of L_θ can be bounded similarly, giving Equation (18). Combining Equations (17) and (18), we then obtain Equation (19). Notice that h(x) is at most exp(x) + O(1/ log log W). On the other hand, since (c − b) is at least 1/ log log W, h(x) is at least (c − b) exp(x)/2, where the last inequality holds since for sufficiently large W we have O(exp(−log log W)) ≤ 1/2. This then gives us property (ii).
Using Equations (19) and (17), we can then bound the derivative of g(x), where in the second-to-last equality we use the fact that (c − b) exp(x) is at least log W / log log W and hence bounded below by a constant for sufficiently large W. This concludes the proof of Claim 3.11.
To bound the total variation distance between log log A and U_1, we introduce U_{a,b,c}, which denotes the uniform distribution over [g_{a,b,c}(log log W), g_{a,b,c}(2 log log W)]. Then, by the triangle inequality, we have that d_TV(log log A, U_1) ≤ d_TV(log log A, U_{a,b,c}) + d_TV(U_{a,b,c}, U_1). Notice that the second term is just the total variation distance between two uniform random variables, one over the interval [log log W, 2 log log W] and the other over [g_{a,b,c}(log log W), g_{a,b,c}(2 log log W)]. Hence, in total, we have d_TV(U_1, U_{a,b,c}) ≤ O(log log log W / log log W).
For the term d_TV(log log A, U_{a,b,c}), one can see that the two variables have the same support. We will first show that the PDFs of log log A and U_{a,b,c} are point-wise close. In particular, for x ∈ [g(log log W), g(2 log log W)], we can compute the density of log log A via the change of variables formula, using the facts that λ_1 is a uniform variable over an interval of length log log W and that g′_{a,b,c}(x) = 1 ± O(1/ log log W).

Next, we show that log B conditioned on log log A is close to U_2. We can simplify the expression of log B and arrive at log B = λ_2 + log(exp(b exp(λ_1)) − exp(a exp(λ_1))).
Notice that since log log A depends only on a, b, c, λ_1 (the terms involving λ_2 and λ_3 cancel in the expression of A), conditioning on log log A only fixes λ_1, while λ_2 is still uniform over [0, log^3 W], which is the same as U_2. Hence, to show that log B is close to U_2, it suffices to show that log B − λ_2 is small after fixing any valid choice of λ_1, a, b. We can bound |log B − λ_2| = |log(exp(b exp(λ_1)) − exp(a exp(λ_1)))| ≤ O(log^2 W), where in the first inequality we again use that |log(1 − exp(−z))| ≤ 1/z for z > 0, and in the second inequality we use a ≤ 1, log log W ≤ λ_1 ≤ 2 log log W, and b − a ≥ 1/ log log W. Then, recall that log B and U_2 are both uniform variables supported on intervals of the same length but with different offsets (differing by O(log^2 W)). Thus, conditioned on any value of λ_1, the total variation distance between log B and U_2 is at most O(log^2 W / log^3 W) = O(1/ log W). Lastly, consider the random variable C := exp(a exp(λ_1)) exp(λ_2) + λ_3. Again, we remark that conditioning on B and A only fixes λ_1, λ_2, so λ_3 is still a uniform random variable over [0, exp(2 log^3 W)], just like U_3. Hence, the total variation distance between C and U_3 can be bounded by O(exp(a exp(λ_1)) exp(λ_2) / exp(2 log^3 W)), which is negligible since a ≤ 1, λ_1 ≤ 2 log log W, λ_2 ≤ log^3 W. This concludes the proof of Lemma 3.10.
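For concreteness, a random monotone map from M(W) (Definition 3.9) can be sampled in a few lines of Python. The quantities involved are doubly exponential, so the demo keeps W small to stay within floating-point range; for realistic W one would work in log space.

import math
import random

def sample_monotone_map(W):
    # f(x) = exp(x * exp(lam1)) * exp(lam2) + lam3, with lam1, lam2, lam3 as in Definition 3.9.
    loglogW = math.log(math.log(W))
    lam1 = random.uniform(loglogW, 2 * loglogW)
    lam2 = random.uniform(0.0, math.log(W) ** 3)
    lam3 = random.uniform(0.0, math.exp(2 * math.log(W) ** 3))
    return lambda x: math.exp(x * math.exp(lam1)) * math.exp(lam2) + lam3

W = 100  # kept small so the doubly-exponential quantities stay within float range
f = sample_monotone_map(W)
a, b, c = 0.1, 0.4, 0.9  # three increasing points in [0, 1]
print(f(a) < f(b) < f(c))  # the map is strictly increasing, so this prints True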
Let t, r be the square-edge distributions defined in Definition 3.3. In the lower bound construction from Section 3.3, within each square, (p, q) is either ((t + r)/2, (t + r)/2), (t, r), or (r, t). Lemma 3.5 states that, if the tester is only given the order information of three samples, it cannot tell whether the samples are taken from ((t + r)/2, (t + r)/2) or from a random pair among (t, r) and (r, t). In order to hide the extra information, one needs to apply the transformation specified in Lemma 3.2, which increases the domain size substantially. Here, we argue that applying transformations sampled from M(W) also eliminates most of the extra information beyond the order information.
Lemma 3.12. Let t, r be the square-edge distributions defined in Definition 3.3. Let {u_i, v_i, b_i}_{i=1}^m be samples drawn from the pair of distributions ((t + r)/2, (t + r)/2). With probability 1/2, we draw {x_i, y_i, ℓ_i}_{i=1}^m from (t, r); otherwise, we draw {x_i, y_i, ℓ_i}_{i=1}^m from (r, t). Let f_1, f_2, f_3, f_4 be four random mappings drawn independently from M(W). Then, the quantity d_TV({(f_1(u_i), f_2(v_i), b_i)}_{i=1}^m, {(f_3(x_i), f_4(y_i), ℓ_i)}_{i=1}^m) is 0 for m = 1 and O(log log log W / log log W) for m = 2, 3.
Proof. We first analyze the case m = 1. We claim that the tuple (u_1, v_1, b_1) has the same distribution as (x_1, y_1, ℓ_1), since conditioned on any values of (u_1, v_1), the distribution of b_1 is uniform (and similarly for (x_1, y_1, ℓ_1)). Then, since f_1, f_2, f_3, f_4 are all identically distributed, it follows that the distributions in the two cases are the same.
We then proceed to prove the cases m = 2, 3. We remark that the total variation distance for m = 2 is at most that for m = 3, since one can always explicitly drop the extra sample, and this operation can only decrease the total variation distance. Thus, we only need to consider the case m = 3. By Lemma 3.5, Order({u_i, v_i, b_i}_{i=1}^m) has the same distribution as Order({x_i, y_i, ℓ_i}_{i=1}^m). Hence, there exists a coupling J between {u_i, v_i, b_i}_{i=1}^m and {x_i, y_i, ℓ_i}_{i=1}^m such that if we sample from J we always have Order({u_i, v_i, b_i}_{i=1}^m) = Order({x_i, y_i, ℓ_i}_{i=1}^m). Hence, we can bound the overall total variation distance by the expectation, under J, of the total variation distance for fixed values of {u_i, v_i, b_i}_{i=1}^m and {x_i, y_i, ℓ_i}_{i=1}^m that share the same order information, where the remaining randomness is over the choice of the transformations f_1, f_2, f_3, f_4. Since the transformations along the two dimensions are picked independently and b_i = ℓ_i under the coupling J, it suffices to bound the total variation distance along each dimension separately; the arguments for the two dimensions are identical, so we focus on the first dimension. Consider the event E that min(u_i − u_{i−1}, x_i − x_{i−1}) ≥ 1/ log log W for every i. By the union bound, it is easy to see that E fails to hold under J with probability at most O(1/ log log W). By the triangle inequality, we have d_TV({f_1(u_i)}_{i=1}^m, {f_3(x_i)}_{i=1}^m) ≤ d_TV({f_1(u_i)}_{i=1}^m, D) + d_TV({f_3(x_i)}_{i=1}^m, D), where D is the distribution defined in Lemma 3.10. Conditioned on the event E, this expression is bounded by O(log log log W / log log W). Since the total variation distance is always bounded by 1 and E fails to hold with probability at most O(1/ log log W), the overall total variation distance is at most O(log log log W / log log W). This completes the proof of Lemma 3.12.
We are now ready to conclude the proof of Theorem 3.8.
Proof of Theorem 3.8. Throughout the proof, we assume that m < k/2, as this is the regime where we can use the random process described in Section 3.3 to generate the measures. Let X be an unbiased binary random variable, let p, q be a pair of measures generated according to the random process described in Section 3.3, and let p′, q′ be the measures obtained after applying the random transformation defined by mappings sampled from M(W). Since the transformation is monotonic along both the x- and y-axes, we have ∥p − q∥_{A_k} = ∥p′ − q′∥_{A_k}. Therefore, when X = 0, we have p′ = q′; when X = 1, we have ∥p′ − q′∥_{A_k} > ϵ. By Lemma 3.13, the mutual information between the random bit X and the output of any algorithm that uses Poi(m) samples is at most O((m^3 ϵ^4/k^2) (log log log W / log log W) + m^7 ϵ^8/k^6). Hence, no tester can reliably distinguish between the cases p′ = q′ and ∥p′ − q′∥_{A_k} > ϵ with probability more than 2/3 unless

m ≥ Ω(1) min(k^{2/3} ϵ^{−4/3} (log log W / log log log W)^{1/3}, k^{6/7} ϵ^{−8/7}).    (20)

Note that the measures p′, q′ are continuous. The remaining step is to turn them into discrete measures p̃′, q̃′ such that distinguishing between p̃′ = q̃′ versus ∥p̃′ − q̃′∥_{A_k} ≥ ϵ is about as hard as distinguishing p′ = q′ versus ∥p′ − q′∥_{A_k} ≥ ϵ.
First, we argue that, for any horizontal or vertical strip of width at most ϵ/8, the mass of p′, q′ is at most ϵ/(8k). It is easy to see that the claim is true for p, q, since their marginal distributions in each dimension are uniform over intervals whose lengths add up to at least k. For the transformed distributions p′, q′, the bound still holds, since the transformation only stretches the distributions along the x- and y-axes.
Then, we can construct a grid G which splits the domain into small squares, each of size ϵ/8 × ϵ/8. Consider p̃′, q̃′ which round the points falling in each square of G to its top-left vertex. Then, for an arbitrary rectangle R, the difference |p′(R) − q′(R)| − |p̃′(R) − q̃′(R)| is at most the mass of p′ or q′ in the two vertical strips and the two horizontal strips along the boundary of R, each of width at most ϵ/8. Thus, for any R, it holds that |p̃′(R) − q̃′(R)| ≥ |p′(R) − q′(R)| − ϵ/(2k). Consequently, it holds that ∥p̃′ − q̃′∥_{A_k} ≥ ϵ/2 if ∥p′ − q′∥_{A_k} ≥ ϵ. On the other hand, if p′ = q′, it is easy to see that we still have p̃′ = q̃′ after the rounding. Thus, if there is an algorithm which can distinguish between the cases p̃′ = q̃′ and ∥p̃′ − q̃′∥_{A_k} > ϵ/2, we can use it to distinguish between the cases p′ = q′ and ∥p′ − q′∥_{A_k} > ϵ as well, by simulating the rounding process. Hence, the sample complexity lower bound in Equation (20) applies to p̃′, q̃′ as well. Finally, we note that the supports of the transformed measures p′, q′ are always contained in some W′ × W′ square, where W′ = k exp(4 log^3 W). Hence, p̃′, q̃′ are supported on a V × V discrete grid, where V = Θ(k exp(4 log^3 W)/ϵ). If the first term in the sample complexity bound (Equation (20)) is the dominating one, we must have log log W ≥ log(k/ϵ). This then implies that V is at most exp(5 log^3 W), which further implies that log log W ≥ Ω(1) log log V. On the other hand, it is easy to see that V > W, and so 1/log log log W > 1/log log log V. We can then rewrite Equation (20) as m ≥ Ω(1) min(k^{2/3} ϵ^{−4/3} (log log V / log log log V)^{1/3}, k^{6/7} ϵ^{−8/7}), which is indeed the desired lower bound.

Conclusions and Open Problems
In this work, we studied the problem of closeness testing between two multidimensional distributions under the A k distance.Our main contribution is the first tester for this task with sublinear sample complexity.The sample complexity of our tester is provably near-optimal as a function of the parameter k (within logarithmic factors) for any fixed dimension d ≥ 2.
Conceptually, our sample complexity lower bound implies that the testing problem is provably harder in the multidimensional setting. In particular, there is a "phase transition" between the one-dimensional and the two-dimensional cases. On the positive side, we show that as the dimension d further increases, the dependency of the sample complexity on k, the main parameter of our interest, stays approximately the same.
As immediate corollaries of our A_k closeness tester, we also obtain the first closeness testers for families of structured multidimensional distributions, including k-histograms and uniform distributions over unions of axis-aligned rectangles, under the total variation distance.
While Theorem 1.2 implies that our upper and lower bounds are nearly optimal in terms of their dependence on k, their dependence on ϵ does not match. In particular, the upper bound scales polynomially with 1/ϵ, where the degree of the polynomial depends on the dimension d. On the other hand, the lower bound applies to 2-dimensional distributions, and hence has a constant exponent in its (polynomial) ϵ-dependence. This leads to the following question.
Question 4.1.What is the optimal sample complexity as a function of ϵ for multidimensional A k closeness testing?
In the current and prior works, the multidimensional A_k-distance is defined as the maximum discrepancy between two distributions over k disjoint axis-aligned rectangles. On the other hand, the A_k-distance for univariate distributions is defined with respect to intervals. This definition inherently uses axis-aligned rectangles in R^d as the natural generalization of intervals in R. Yet, rectangles are not necessarily the only valid choice. More specifically, one can replace axis-aligned rectangles in the definition of the multidimensional A_k distance with other geometric shapes whose 1-dimensional projection corresponds to intervals. For example, we can use shapes like unit balls, simplices, or any other convex set. Such natural variants of the multidimensional A_k distance can be used to build d_TV-closeness testers for other families of structured distributions, such as log-concave distributions. This leads to the following question.

Question 4.2. Are there alternative definitions of multidimensional A_k distance for multivariate distributions that can lead to optimal d_TV-closeness/identity testers for other multivariate shape-restricted distributions?
Exploring other notions of multidimensional A k distance is of significant interest and may lead to a unified theory of testing multivariate structured distributions.

Figure 1: The black rectangle represents R̃ ⊂ R in R^2. One can see that there is a natural way to carve the remaining space R \ R̃ into four axis-aligned rectangles.
[log log W, 2 log log W] and the other over [g_{a,b,c}(log log W), g_{a,b,c}(2 log log W)]. By Claim 3.11, it holds that |g_{a,b,c}(x) − x| = O(log log log W) and g_{a,b,c}(x) ≤ x. We thus have g_{a,b,c}(log log W) ≤ log log W ≤ g_{a,b,c}(2 log log W) ≤ 2 log log W. Hence, the total variation distance between U_1 and U_{a,b,c} is exactly (1/2) [ (log log W − g_{a,b,c}(log log W)) / (g_{a,b,c}(2 log log W) − g_{a,b,c}(log log W)) + (2 log log W − g_{a,b,c}(2 log log W)) / log log W + (g_{a,b,c}(2 log log W) − log log W) | 1/log log W − 1/(g_{a,b,c}(2 log log W) − g_{a,b,c}(log log W)) | ], where the first two terms capture the difference between U_1 and U_{a,b,c} on the part of the domain where exactly one of U_1 and U_{a,b,c} is supported, and the last term captures the difference on the part of the domain where they are both supported. For the first two terms, the numerators are of size O(log log log W) and the denominators are at least log log W − O(log log log W), since |g_{a,b,c}(x) − x| = O(log log log W). Therefore, both of them are of order O(log log log W / log log W). For the last term, we have g_{a,b,c}(2 log log W) − log log W ≤ log log W + O(log log log W) and g′_{a,b,c}(x) = 1 ± O(1/ log log W) by Claim 3.11; moreover, the interval on which log log A and U_{a,b,c} are supported has length at most O(log log W). We then have d_TV(log log A, U_{a,b,c}) ≤ O(log log log W / log log W). Hence, overall, we then have d_TV(log log A, U_1) ≤ O(log log log W / log log W).

distributed as Poi(ϵ m/k). Therefore, we have Pr[|P_1| = γ | H_i = 0] = Poi(ϵ m/k, γ) ≤ (ϵ m/k)^γ / γ!. Together with our assumption m < k, we can simplify the bound as I(X : P_1) ≤ O(1)