XOR Lemmas for Communication via Marginal Information

We define the $\textit{marginal information}$ of a communication protocol, and use it to prove XOR lemmas for communication complexity. We show that if every $C$-bit protocol has bounded advantage for computing a Boolean function $f$, then every $\tilde \Omega(C \sqrt{n})$-bit protocol has advantage $\exp(-\Omega(n))$ for computing the $n$-fold xor $f^{\oplus n}$. We prove exponentially small bounds in the average-case setting, and near-optimal bounds for product distributions and for bounded-round protocols.


INTRODUCTION
If a function is hard to compute, is it even harder to compute it many times? This old question is often challenging, and new answers are usually accompanied by foundational ideas. We give new answers in the framework of communication complexity, accompanied by a new measure of complexity called marginal information. This definition provides a new tool for proving lower bounds in theoretical computer science.
A wide variety of important lower bounds in computer science ultimately rely on information theoretic lower bounds in communication complexity, including lower bounds on the depth of monotone circuits [17], lower bounds on data structures [19] and lower bounds on the extension complexity of polytopes [3,16,26,30], to name a few nice examples. We refer the reader to the textbook [23] for an introduction to the basic definitions and concepts in communication complexity, the role played by the questions we address here, and the connections to other areas.
Given a Boolean function $f : X \times Y \to \{0,1\}$, define the functions $f^n : X^n \times Y^n \to \{0,1\}^n$ and $f^{\oplus n} : X^n \times Y^n \to \{0,1\}$ as follows:
$$f^n(x, y) = \big(f(x_1, y_1), \ldots, f(x_n, y_n)\big), \qquad f^{\oplus n}(x, y) = \bigoplus_{i=1}^n f(x_i, y_i).$$
So, $f^n$ computes $f$ on $n$ different pairs of inputs, and $f^{\oplus n}$ computes the parity of the outputs of $f^n$. If $f$ is hard to compute, are $f^n$ and $f^{\oplus n}$ even harder to compute? For deterministic communication complexity, Feder, Kushilevitz, Naor and Nisan [10] proved that if $|X|, |Y| \le 2^\ell$ and $f$ requires $C$ bits of communication, then $f^n$ requires at least $n(\sqrt{C} - \log_2 \ell - 1)$ bits of communication. In this work, we study randomized communication complexity. Let $\|\pi\|$ denote the communication complexity of a randomized communication protocol $\pi$, and define the advantage
$$\mathsf{adv}(C, f) = \sup_{\pi : \|\pi\| \le C}\ \min_{x, y}\ \operatorname{E}\left[(-1)^{f(x,y) + \pi(x,y)}\right],$$
where $\pi(x, y)$ denotes the (randomized) output of the protocol on input $(x, y)$. This quantity measures the best worst-case advantage achievable by a $C$-bit protocol over random guessing. We can now state our main result:

Theorem 1. There is a universal constant $\kappa > 0$ such that if $C > 1/\kappa$ and $\mathsf{adv}(C, f) < 1/2$, then $\mathsf{adv}\!\left(\kappa\, C\sqrt{n}/\log(Cn),\ f^{\oplus n}\right) < \exp(-\kappa n)$.
The constant 1/2 is not important; it can be replaced by any constant less than 1. Some assumption of the type $C > 1/\kappa$ is necessary, because if $x, y \in \{0,1\}$ and $f(x, y) = x \oplus y$, then $\mathsf{adv}(1, f) = 0$, yet $\mathsf{adv}(2, f^{\oplus n}) = 1$. Prior to our work, the best known upper bound was proved by the second author with Barak, Braverman and Chen [2], who showed that the advantage is bounded by 1/2 for a similar choice of the other parameters. Our work builds on the work of Yu [31], who proved exponentially small bounds on the advantage in the setting of bounded-round communication protocols.
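To make these definitions concrete, here is a minimal Python sketch (our own illustration, not from the paper) of $f^{\oplus n}$ together with the 2-bit protocol for the counterexample $f(x, y) = x \oplus y$: each party announces the parity of its own block, and the transcript alone determines the answer.

```python
import random

def f(x, y):
    """The toy function f(x, y) = x XOR y on single bits."""
    return x ^ y

def f_xor_n(xs, ys):
    """f^{(+) n}: the parity of f applied coordinate-wise."""
    return sum(f(xi, yi) for xi, yi in zip(xs, ys)) % 2

def two_bit_protocol(xs, ys):
    """A 2-bit protocol for f^{(+) n} when f is XOR: each party sends the
    parity of its own block, and the output is the XOR of the two bits."""
    alice_bit = sum(xs) % 2   # 1 bit of communication from Alice
    bob_bit = sum(ys) % 2     # 1 bit of communication from Bob
    return alice_bit ^ bob_bit

n = 8
xs = [random.randint(0, 1) for _ in range(n)]
ys = [random.randint(0, 1) for _ in range(n)]
assert two_bit_protocol(xs, ys) == f_xor_n(xs, ys)  # adv(2, f^{(+) n}) = 1
```

In contrast, a 1-bit transcript can depend on only one of the two inputs, so it carries no worst-case advantage for $f$ itself.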
Our ideas lead to many results similar to Theorem 1. Next, we review the history that led us to the notion of marginal information, explain the intuitions behind the choices made in the definition, and then describe all of our results in Section 1.2. At a high level, our proofs follow a three-step strategy:
Step 1 Every protocol computing $f^{\oplus n}$ with significant advantage and small communication has small marginal information; see Theorem 5.
Step 2 Marginal information is subadditive, so the marginal information for computing $f$ is smaller by a factor of $n$; see Theorem 6.
Step 3 Small marginal information can be compressed to give protocols with small communication; see Theorems 7 to 10.
Definitions of information are famously subtle. In order to make this strategy work, the marginal information needs to permit all 3 steps, and even minor changes to the definition can make one of the steps infeasible.
Our current definition builds on important insights and intuitions developed in theoretical computer science over a period of decades. An early precursor to the use of information theory in computer science is the work of Kalyanasundaram and Schnitger, who used Kolmogorov complexity to prove lower bounds on the randomized communication complexity of the disjointness function [27]. The proof was subsequently simplified by Razborov [25], who gave a beautiful short argument that used Shannon's notion of entropy [28] and implicitly followed the outline of Steps 1, 2, 3 described above. This is related to the questions we study here because the disjointness function can be thought of as a way to compute the AND of 2 bits $n$ times. Step 1 is relatively easy for this problem. Step 2 involved a clever way to split the dependence between random variables, and was accomplished using the subadditivity of entropy. Step 3 is also not too difficult.
The next chapter of the story was written during the study of parallel repetition, a vital tool in the development of probabilistically checkable proofs. Raz [24] proved the first exponentially small bounds in this context using the Kullback–Leibler divergence as a measure of information. Given a distribution $p(xy)$ and a carefully chosen event $W$, Raz measured the divergence
$$\mathrm{D}\!\left(p(xy \mid W)\ \big\|\ p(xy)\right). \qquad (1)$$
In the proof, it is crucial that the event $W$ is rectangular, meaning that if $x, y$ are independent, then they remain independent even after conditioning on $W$. Once again, Step 1 is not too difficult. Raz used the subadditivity of divergence and a similar set of clever random variables as in [25] to split the dependence and accomplish Step 2. Later, Holenstein [13] introduced a method called correlated sampling to simplify the analogue of Step 3 in Raz's proof, and obtained better bounds. The second author used these tools to prove optimal bounds for parallel repetition in the setting relevant to probabilistically checkable proofs [21]. Chakrabarti, Shi, Wirth and Yao [9] were the first to propose using general measures of information complexity to address the questions we consider in this paper. Let $x, y$ denote the inputs, $m$ denote the public randomness and transcript of a communication protocol, and $p(xym)$ denote the joint distribution induced by the protocol. [9] proposed to measure the mutual information
$$I(xy ; m).$$
Years later, this measure was renamed external information by [2].
The external information measures the information learned by an external observer about the parties' inputs.
Step 1 is easy for this measure of information. However, the subadditivity of Step 2 does not hold in general; the proof only goes through when the input distribution $p(xy)$ is a product distribution. Jain, Radhakrishnan and Sen [15], and Harsha, Jain, McAllester and Radhakrishnan [12] gave ways to implement Step 3 that led to bounds on the success probability for computing $f^n$ in the setting where the inputs are assumed to come from a product distribution and the communication protocols are restricted to having a bounded number of rounds. Meanwhile, Bar-Yossef, Jayram, Kumar and Sivakumar [1] showed how to reframe Razborov's proof using mutual information instead of entropy, and proved other results using this formulation which contained hints of the definition of information that came next.
The first upper bounds on the success probability in the general setting came when the second author together with Barak, Braverman and Chen [2] adapted the methods developed in the study of parallel repetition to these problems. In contrast with the external information, they defined the internal information, which is the sum of two mutual information terms
$$I(x ; m \mid y) + I(y ; m \mid x). \qquad (2)$$
The internal information measures what is learned by each party about the other's input. Equation (1) was the inspiration for Equation (2); indeed, each setting of $xy$ corresponds to a rectangular event.
When the inputs come from a product distribution, the internal and external information are the same, and [2] proved that subadditivity holds for internal information using an argument similar to the one used in the context of parallel repetition. Moreover, they showed how to leverage the technique of correlated sampling developed by Holenstein to simulate protocols with information $I$ and communication $C$ using roughly $\sqrt{I \cdot C}$ communication, up to logarithmic factors. They gave near optimal simulations with communication roughly $I \log^2 C$ for protocols with small external information $I$, using rejection sampling and a variant of Azuma's concentration inequality. These results proved that there is a constant $\kappa$ such that if $\mathsf{adv}(C, f) < 1/2$, then the advantage of $\kappa C\sqrt{n}$-bit protocols for $f^{\oplus n}$ is still bounded by a constant less than 1, which was the first result along the lines of Theorem 1. Later, the second author and Braverman [6] argued that this is the right definition of information, because the internal information cost of a function is equal to the amortized communication complexity of that function. This suggested that the internal information might well be the last word in this evolution of definitions, because it could be defined purely using the concept of communication complexity. It seemed like the only path to better results was through better methods to compress internal information. This is a belief we no longer hold. Nevertheless, a flurry of ideas about compressing protocols with internal information $I$ and communication $C$ followed. Braverman [4] showed how to obtain protocols with communication $\approx 2^{O(I)}$. The second author and Ramamoorthy [20] showed that if $I_A, I_B$ denote the internal information learned by each party, then one can achieve communication $\approx I_A \cdot 2^{O(I_B)}$, as well as other trade-offs between $C$, $I_A$ and $I_B$. Two excellent papers, the first by Kol [18] and the second by Sherstov [29], showed that communication roughly $I \log^2 C$ can be achieved when the inputs come from a product distribution. Ganor, Kol and Raz [11] (see also [22]) gave a nice counterexample: a function that has internal information $\approx k$ and can be computed with communication $\approx 2^{2^{O(k)}}$, but cannot be computed with communication $\approx 2^k$.
The next definition to evolve was proposed by the second author together with Braverman, Weinstein and Yehudayoff [7,8], inspired by the work of Jain, Pereszlényi and Yao [14]. Rather than bounding the information under the distribution $p(xym)$ induced by the protocol, they bounded the infimum of the information achieved in the ball of distributions that are close to the protocol. They defined the information to be the infimum
$$\inf_{q}\ \left[ I_q(x ; m \mid y) + I_q(y ; m \mid x) \right], \qquad (3)$$
where here the infimum is taken over all distributions $q(xym)$ that are close to $p(xym)$ in statistical distance. This quantity was ultimately bounded by setting $q(xym) = p(xym \mid W)$, where here $W$ is a reasonably large event (not necessarily rectangular) that implies that the protocol correctly computes the function. The bound on Equation (3) does not lead to a bound on the information according to $p(xym)$, because it is quite possible that the points outside $W$ reveal a huge amount of information. Still, [8] were able to follow all 3 steps of the high-level approach to prove their results. Step 1 remained easy, but Steps 2 and 3 became more difficult using Equation (3). [8] obtained exponentially small upper bounds for the success probability of computing $f^n$, but did not manage to prove new bounds on the advantage for $f^{\oplus n}$ using this approach. Equation (3) may not seem very different from Equation (2), but it does involve a proxy $q$, and we pursue the use of such proxies further in the definition of marginal information that we discuss next.
In a paper full of new ideas, Yu [31] recently proved exponentially small bounds on the advantage of bounded-round protocols computing $f^{\oplus n}$. Although Yu's paper involves a potential function that superficially looks like a definition of information, his proof does not involve a method to compress protocols whose potential is small, and we are unable to extract a definition of information from his work. Still, his ideas inspired many of the choices made in our definition. To define the marginal information, we need the concept of a rectangular distribution, which was defined in [31]:

Definition 2. Given a set $W$ consisting of triples $(x, y, m)$, we say that $W$ is rectangular if its indicator function can be expressed as
$$\mathbb{1}_W(x, y, m) = g_A(x, m) \cdot g_B(y, m)$$
for some Boolean functions $g_A, g_B$. Given a distribution $\mu(xy)$ on inputs and a distribution $q(xym)$, we say that $q$ is rectangular with respect to $\mu$ if it can be expressed as
$$q(xym) = \mu(xy) \cdot g_A(x, m) \cdot g_B(y, m)$$
for some functions $g_A, g_B$.
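To make the second part of Definition 2 concrete, here is a small numpy sketch (our own illustration, assuming the factorization form stated above and that $\mu$ has full support) that tests rectangularity by checking that, for each fixed $m$, the ratio $q(x, y, m)/\mu(x, y)$ is a rank-one matrix in $(x, y)$:

```python
import numpy as np

def is_rectangular(q, mu, tol=1e-9):
    """q: array of shape (|X|, |Y|, |M|) with q[x, y, m] = q(x, y, m).
    mu: array of shape (|X|, |Y|) with mu[x, y] = mu(x, y), assumed to have
    full support. Returns True iff for every m the matrix q[:, :, m] / mu
    factors as g_A(x, m) * g_B(y, m), i.e. has rank at most one."""
    for m in range(q.shape[2]):
        ratio = q[:, :, m] / mu
        if np.linalg.matrix_rank(ratio, tol=tol) > 1:
            return False
    return True

# Building a rectangular q directly from the factorization in Definition 2.
mu = np.full((2, 2), 0.25)                     # uniform input distribution
g_A = np.array([[1.0, 0.5], [0.5, 1.0]])       # g_A[x, m]
g_B = np.array([[0.5, 1.0], [1.0, 0.5]])       # g_B[y, m]
q = np.einsum('xy,xm,ym->xym', mu, g_A, g_B)
q /= q.sum()                                   # normalize to a distribution
assert is_rectangular(q, mu)
```

The example at the end builds a rectangular $q$ directly from the factorization; this mirrors the intuition, discussed next, that conditioning a protocol distribution on a rectangular event produces such a $q$.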
For intuition, it is helpful to think of a rectangular distribution as the result of conditioning a protocol distribution $p(xym)$ on a rectangular event. That would produce a rectangular distribution, but the space of rectangular distributions actually contains other distributions that cannot be obtained in this way.
From our perspective, the most useful insight of Yu's work is that if $q$ is restricted to being rectangular, then one can allow $q$ to be quite far from $p$ in Equation (3) and still carry out a meaningful compression of a protocol to implement Step 3. That is because the rectangular nature of $q$ allows the parties to use hashing and rejection sampling to convert a protocol that samples from $p$ into a protocol that samples from $q$. If $q(xym) = p(xym \mid W)$ for a rectangular event $W$, this is easy to understand: the parties can communicate 2 bits to compute whether $(x, y, m) \in W$, and output the most likely value of $f$ under $q$ given $m$ when $(x, y, m) \in W$. If $(x, y, m) \notin W$, they can output a random guess for the value of $f$. So, it is enough to bound the information terms for points in $W$, and enough to guarantee that the compression is efficient for such points. This observation is very powerful, because it allows us to throw away problematic points in the support of the distributions we are working with and pass to appropriate sub-rectangles throughout our proofs.
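Schematically, the two-bit step looks as follows in Python (our own illustration; the helpers g_A, g_B and guess_under_q are hypothetical placeholders for the factors of $W$ and for the likeliest value of $f$ under $q$ given $m$):

```python
import random

def two_bit_wrapper(x, y, m, g_A, g_B, guess_under_q):
    """Wrap one run of the original protocol, whose transcript on (x, y) is m.
    g_A, g_B: Boolean functions whose product is the indicator of the
    rectangular event W, as in Definition 2.
    guess_under_q: for each transcript m, the likeliest value of f under
    q = p(. | W)."""
    alice_bit = g_A(x, m)        # 1 extra bit from Alice
    bob_bit = g_B(y, m)          # 1 extra bit from Bob
    if alice_bit and bob_bit:    # (x, y, m) lies in W
        return guess_under_q(m)
    return random.randint(0, 1)  # outside W: a random guess, zero advantage
```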
For all of this to work, it is crucial that the protocol retains some advantage within the support of $q$. For this reason, we need to keep track of the information in the support of $q$ as well as the advantage within the support of $q$, and so, for the first time, the measure of information is going to depend on the function that the protocol computes. We are ready to state the definition:

Definition 3. For $C \ge 1$ and $\epsilon = 1/15$, the marginal information $\mathsf{M}(\pi, f)$ of a protocol $\pi$ for computing $f$ is defined as an infimum, taken over all distributions $q$ that are rectangular with respect to the input distribution $\mu(xy)$, of a supremum taken over all $m$ in the support of $q$.
The parameter is named $C$ above because it turns out that protocols computing $f$ can be efficiently compressed when $\mathsf{M} = O(C)$, and any compression must have communication $\Omega(C)$. Compare Definition 3 with Equations (2) and (3). The fact that $q$ must be tethered to $p$ is ensured by including the term $q(xym)/p(xym)$. If $q(xym) = p(xym \mid W)$ for a rectangular event $W$, then $q(xym)/p(xym)$ will be equal to $1/p(W)$ for points in $W$. The last term in the product computes the advantage of $q$ for computing $f$, because under $q$ and given $m$, the best guess for the value of $f$ is determined by the sign of $\operatorname{E}_{q(xy \mid m)}\!\left[(-1)^{f(x,y)}\right]$, and its advantage is the absolute value of this quantity. In words, the marginal information measures the supremum over all $m$ of the information per unit of advantage, of the best rectangular approximation $q$.
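The advantage term can be made concrete with a short numpy sketch (our own illustration) that, given a joint distribution on $(x, y, m)$ and a table of values of $f$, computes the best guess and the advantage for every transcript $m$:

```python
import numpy as np

def per_transcript_advantage(q, f_table):
    """q: array of shape (|X|, |Y|, |M|) with q[x, y, m] = q(x, y, m).
    f_table: array of shape (|X|, |Y|) with the 0/1 values of f.
    Returns, for each transcript m, the advantage |E_{q(xy|m)}[(-1)^f]| and
    the corresponding best guess for the value of f."""
    signs = (-1.0) ** f_table            # (-1)^{f(x, y)}
    advantages, guesses = [], []
    for m in range(q.shape[2]):
        q_m = q[:, :, m]
        mass = q_m.sum()                 # q(m)
        if mass == 0:
            advantages.append(0.0)
            guesses.append(0)
            continue
        bias = float((q_m / mass * signs).sum())  # E_{q(xy|m)}[(-1)^f]
        advantages.append(abs(bias))
        guesses.append(0 if bias > 0 else 1)      # the sign determines the guess
    return advantages, guesses
```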
In analogy with the external information, we define the external marginal information:

Definition 4. For $C \ge 1$ and $\epsilon = 1/15$, the external marginal information $\mathsf{M}^{\mathrm{ext}}(\pi, f)$ of a protocol $\pi$ for computing $f$ is defined analogously, where the infimum is taken over all distributions $q$ that are rectangular with respect to the input distribution $\mu(xy)$, and the supremum is taken over all $m$ in the support of $q$.
When the distribution on inputs is a product distribution, it turns out that the external marginal information is equal to the marginal information.
To state our results about marginal information, we first define the average-case measure of advantage. Given a distribution $\mu(xy)$ on inputs, define
$$\mathsf{adv}_\mu(C, f) = \sup_{\pi : \|\pi\| \le C}\ \left|\operatorname{E}\left[(-1)^{f(x,y) + \pi(x,y)}\right]\right|,$$
where here the expectation is over the choice of inputs $(x, y) \sim \mu$ as well as the random coins of the communication protocol. To study the more restricted setting where the protocols we are working with have a bounded number of rounds, define the worst-case and average-case quantities $\mathsf{adv}^r(C, f)$ and $\mathsf{adv}^r_\mu(C, f)$ in the same way, where throughout, the supremums are taken over $r$-round protocols.
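For example, the average-case advantage of a concrete protocol can be estimated by direct sampling, as in the following Python sketch (our own illustration; protocol_output and sample_input are hypothetical stand-ins for the protocol's output rule and for sampling from $\mu$):

```python
import random

def empirical_advantage(f, protocol_output, sample_input, trials=100_000):
    """Estimate |E[(-1)^(f(x, y) + pi(x, y))]|, where the expectation is over
    inputs (x, y) ~ mu (drawn via sample_input) and the protocol's randomness
    (inside protocol_output)."""
    total = 0
    for _ in range(trials):
        x, y = sample_input()          # (x, y) ~ mu
        out = protocol_output(x, y)    # the protocol's output bit
        total += (-1) ** (f(x, y) ^ out)
    return abs(total) / trials

# A protocol that guesses at random has advantage close to 0 for any f and mu.
print(empirical_advantage(lambda x, y: x & y,
                          lambda x, y: random.randint(0, 1),
                          lambda: (random.randint(0, 1), random.randint(0, 1))))
```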
Returning to our high-level approach, we prove the following results about marginal information, which allow us to carry out Steps 1, 2, 3:

(1) First, we show that a protocol with small communication and large advantage has small marginal information, to handle Step 1:

Theorem 5. For every Boolean function $f(x, y)$ and every protocol of communication complexity $C$, the marginal information for computing $f$ is bounded in terms of $C$ and the per-transcript advantage terms $\left|\operatorname{E}\left[(-1)^{f(x,y)} \mid m\right]\right|$.
For any fixed $m$, the quantity $\left|\operatorname{E}_{p(xy \mid m)}\!\left[(-1)^{f(x,y)}\right]\right|$ measures the advantage of the protocol for computing $f$ conditioned on that value of $m$. So, if $\mathsf{adv}_\mu(C, f^{\oplus n}) \ge \exp(-n)$ via a protocol corresponding to the distribution $p$, then the above theorem implies that $\mathsf{M}(\pi, f^{\oplus n}) \le O(C + n)$. Unlike all previous definitions, for marginal information Step 1 involves significant work. Our proof crucially uses the fact that the protocol has bounded communication complexity: for example, it would not be enough to start with a bound on the internal information.

(2) Next, we prove that marginal information is sub-additive with respect to the $n$-fold xor of $f$. If the transcript is $m = (m_0, m_1, \ldots, m_r)$, where $m_j$ denotes the $j$'th message of the protocol, we show:

Theorem 6. There is a universal constant $\Delta$ such that if $n \ge 1$ and $p$ is a protocol distribution for computing $f^{\oplus n}$ with $\mu(xy) = \prod_{i=1}^n \mu_i(x_i y_i)$, then there is a protocol $p'$ for computing $f$ on one of the coordinates $i$ such that $p'(xy) = \mu_i(xy)$, $p'$ has the same number of messages as $p$, for $j > 1$ the support of the $j$'th message is identical in $p$ and $p'$, and moreover
$$\mathsf{M}(p', f) \le \Delta \cdot \mathsf{M}(p, f^{\oplus n})/n.$$
If $\mathsf{M}(p, f^{\oplus n}) \le O(Cn)$, this theorem proves that $\mathsf{M}(p', f) \le O(C)$. This might well be the most technically novel part of our proof; it is certainly where we spent the most time. The main challenge is proving the result for $n = 2$, which is very delicate. If $n = 2$ and $\mathsf{M}(p, f^{\oplus 2})$ is small, then there is a rectangular distribution $q$ such that the pair $(p, q)$ leads to a small value of $\mathsf{M}(p, f^{\oplus 2})$. We show how to use $p, q$ to generate a new pair $(p_1, q_1)$ or a new pair $(p_2, q_2)$, proving that either $\mathsf{M}(p_1, f)$ or $\mathsf{M}(p_2, f)$ is more or less bounded by $\mathsf{M}(p, f^{\oplus 2})/2$. We are unable to bound the length of the first message of $p'$ in terms of the length of the corresponding message of $p$ in Theorem 6, because in our proof the first message $m_1$ needs to encode one of the inputs of the original protocol. Fortunately, this is not a significant obstacle for the high-level strategy.
(3) Lastly, we show how to compress marginal information to handle Step 3. We have been able to match many of the prior results [2,4,6] about compressing information and external information with corresponding results about compressing marginal information and external marginal information, though our proofs are much more technical. Our most general simulation is captured by the following theorem:

Theorem 7. For every $k > 0$ there is a $\Delta > 0$ such that if $\mathsf{M}(\pi, f) \le k$, $p(xy) = \mu(xy)$, and moreover the messages $m = (m_0, \ldots, m_r)$ are such that $m_2, \ldots, m_r \in \{0,1\}$, then $\mathsf{adv}_\mu\!\left(\Delta\big(k + \sqrt{k r}\,\log(k r)\big),\ f\right) \ge 1/\Delta$.
Theorem 7 shows that if the marginal information is $O(k)$ and the protocol has $r$ messages, then one can obtain a protocol with communication roughly $\sqrt{k r}$, up to logarithmic factors, that has $\Omega(1)$ advantage for computing $f$. For the external marginal information, we prove:

Theorem 8. For every $k > 0$ there is a $\Delta > 0$ such that if $\mathsf{M}^{\mathrm{ext}}(\pi, f) \le k$, $p(xy) = \mu(xy)$, and moreover the messages $m = (m_0, \ldots, m_r)$ are such that $m_2, \ldots, m_r \in \{0,1\}$, then $\mathsf{adv}_\mu\!\left(\Delta k \log^2(k r),\ f\right) \ge 1/\Delta$.

This theorem gives improved results when the inputs come from a product distribution. It is quite possible that even better simulations can be obtained using the ideas of [5,18,29], but we have not managed to obtain such results. We also obtain results that are independent of the communication complexity:

Theorem 9. For every $k > 0$ there is a $\Delta > 0$ such that if $\mathsf{M}(\pi, f) \le k$ and $p(xy) = \mu(xy)$, then $\mathsf{adv}_\mu(\Delta k, f) \ge \exp(-\Delta k)$.
These results about the marginal information cost allow us to prove Theorem 1, as well as several other results of that flavor.

Using Marginal Information to Prove XOR Lemmas
To state all of our results, let us define the average-case and worst-case measures of success, $\mathsf{suc}(C, f)$, $\mathsf{suc}_\mu(C, f)$, $\mathsf{suc}^r(C, f)$ and $\mathsf{suc}^r_\mu(C, f)$, which measure the best probability of correctly computing $f$ achievable by a protocol with communication at most $C$, where in $\mathsf{suc}^r, \mathsf{suc}^r_\mu$ the supremum is taken over $r$-round protocols, and in $\mathsf{suc}_\mu, \mathsf{suc}^r_\mu$ the probability is over inputs sampled from $\mu(xy)$. Yao's min-max theorem yields
$$\mathsf{adv}(C, f) = \inf_\mu\ \mathsf{adv}_\mu(C, f). \qquad (4)$$
Given any distribution $\mu$ on $X \times Y$, define the $n$-fold product distribution $\mu^n$ on $X^n \times Y^n$ by $\mu^n(xy) = \prod_{i=1}^n \mu(x_i y_i)$. Theorem 1 is proved by proving this stronger bound:

Theorem 11. There is a universal constant $\kappa > 0$ such that if $C > 1/\kappa$ and $\mathsf{adv}_\mu(C, f) \le \kappa$, then $\mathsf{adv}_{\mu^n}\!\left(\kappa\, C\sqrt{n}/\log(C n),\ f^{\oplus n}\right) \le \exp(-\kappa n)$.
To prove Theorem 11, suppose that there is a protocol computing $f^{\oplus n}$ with advantage $\exp(-n)$ and communication $C' = \kappa \cdot C\sqrt{n}/\log(C n)$. If $C'/n \ge 1$, we set the parameter in the definition of marginal information to be $C'/n$ and apply Theorem 5 to show that $\mathsf{M}(\pi, f^{\oplus n}) \le O(C' + n) \le O(C')$. Next, apply Theorem 6 to find a protocol $\pi'$ with $\mathsf{M}(\pi', f) \le O(C'/n)$. Finally, apply Theorem 7 to obtain a protocol computing $f$ with advantage $\Omega(1)$ and communication proportional to $\sqrt{(C'/n) \cdot C'} = C'/\sqrt{n}$, up to logarithmic factors. If $C'/n < 1$, we set the parameter to be 1 and apply Theorem 5 to show that $\mathsf{M}(\pi, f^{\oplus n}) \le O(n)$. Next, apply Theorem 6 to find a protocol $\pi'$ with $\mathsf{M}(\pi', f) \le O(n/n) = O(1)$. Finally, we apply Theorem 9 to obtain a protocol computing $f$ with advantage $\Omega(1)$ and communication $O(1)$. Setting $\kappa$ sufficiently small, we obtain a contradiction in either case, which proves that there is no protocol as above. Theorem 1 can be obtained from Theorem 11 using Equation (4) and the fact that the worst-case success probability of a communication protocol can be increased by taking the majority outcome of several runs of the protocol. We leave these details to the reader.
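The amplification step in the last remark is standard; here is a generic Python sketch of majority-vote amplification (our own illustration, not the paper's construction):

```python
import random
from collections import Counter

def amplified_protocol(run_protocol, x, y, repetitions=199):
    """Boost a protocol whose worst-case advantage is some delta > 0 by running
    it independently several times and outputting the majority answer. By a
    Chernoff bound, O(log(1/eps)/delta^2) repetitions give error at most eps
    on every input, at the cost of multiplying the communication."""
    outcomes = Counter(run_protocol(x, y) for _ in range(repetitions))
    return outcomes.most_common(1)[0][0]

# Toy usage: a protocol that is correct with probability 0.6 on every input.
def truth(x, y):
    return x & y

def noisy(x, y):
    return truth(x, y) if random.random() < 0.6 else 1 - truth(x, y)

print(amplified_protocol(noisy, 1, 1))  # correct with probability > 0.99
```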
This matches the result proved by [8] mentioned earlier. These corollaries are obtained by observing that if $S \subseteq \{1, 2, \ldots, n\}$ is chosen uniformly at random, and $x, y$ are sampled according to $\mu^n$, then
$$\Pr\left[\pi(x, y) = f^n(x, y)\right] = \operatorname{E}\left[(-1)^{\sum_{i \in S}\left(f(x_i, y_i) + \pi(x, y)_i\right)}\right],$$
where $\pi(x, y)$ denotes the $n$-bit output of the protocol and the expectation is over $S$, the inputs and the randomness of the protocol, so a protocol computing $f^n$ with success probability $\exp(-n/2)$ yields a set of $n' = \Omega(n)$ coordinates where the protocol computes $f^{\oplus n'}$ with advantage $\exp(-\Omega(n))$. Again, we leave the details to the reader. When the distribution $\mu(xy) = \mu(x) \cdot \mu(y)$ is a product distribution, we obtain stronger bounds:

Theorem 14. There is a universal constant $\kappa > 0$ such that for every product distribution $\mu$, if $C > 1/\kappa$ and $\mathsf{adv}_\mu(C, f) < \kappa$, then $\mathsf{adv}_{\mu^n}\!\left(\kappa\, C n/\log^2(C n),\ f^{\oplus n}\right) < \exp(-\kappa n)$.
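The identity behind this observation, $\mathbb{1}[z = z'] = \operatorname{E}_S\big[(-1)^{\sum_{i \in S}(z_i + z'_i)}\big]$ for a uniformly random subset $S$, can be checked directly; a small Python sketch (our own illustration):

```python
import itertools
import random

def subset_parity_average(z, zprime):
    """Average of (-1)^{sum_{i in S} (z_i + z'_i)} over all subsets S of the
    coordinates; equals 1 when z == z' and 0 otherwise."""
    n = len(z)
    total = 0
    for S in itertools.product([0, 1], repeat=n):  # S as an indicator vector
        parity = sum(s * (zi ^ zj) for s, zi, zj in zip(S, z, zprime)) % 2
        total += (-1) ** parity
    return total / 2 ** n

z = [random.randint(0, 1) for _ in range(6)]
assert subset_parity_average(z, list(z)) == 1.0
zprime = list(z)
zprime[3] ^= 1
assert subset_parity_average(z, zprime) == 0.0
```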