Improving the Bit Complexity of Communication for Distributed Convex Optimization

We consider the communication complexity of some fundamental convex optimization problems in the point-to-point (coordinator) and blackboard communication models. We strengthen known bounds for approximately solving linear regression, $p$-norm regression (for $1\leq p\leq 2$), linear programming, minimizing the sum of finitely many convex nonsmooth functions with varying supports, and low rank approximation; for a number of these fundamental problems our bounds are nearly optimal, as proven by our lower bounds. Among our techniques, we use the notion of block leverage scores, which have been relatively unexplored in this context, as well as dropping all but the ``middle'' bits in Richardson-style algorithms. We also introduce a new communication problem for accurately approximating inner products and establish a lower bound using the spherical Radon transform. Our lower bound can be used to show the first separation of linear programming and linear systems in the distributed model when the number of constraints is polynomial, addressing an open question in prior work.


INTRODUCTION
The scale of modern optimization problems often necessitates working with datasets that are distributed across multiple machines, which then communicate with each other to solve the optimization problem at hand. A crucial performance metric for algorithms in such distributed settings is the communication complexity. Traditionally, this has referred to the number of rounds of communication needed between the machines to solve the problem, and there has been a long line of work (which we shortly describe) optimizing this metric. However, as was highlighted in [31,59,76], in many core algorithmic primitives underlying recent advances in continuous optimization, the claimed (theoretical) runtimes are predicated on the assumption of exact computations with infinite precision. When analyzed under the finite-precision model, the true runtimes can be substantially higher. Consequently, inferring the true cost of distributed optimization algorithms built with these components requires a careful analysis. To address this need, our focus in this paper is on designing, for some fundamental optimization problems, distributed algorithms efficient in the total number of bits communicated.
Before describing our setup and results, we first provide a brief overview of prior work in the related area of distributed optimization, a mature field encompassing problems spanning engineering, control theory, signal processing, and machine learning. For instance, multi-agent coordination, distributed tracking and localization, estimation problems in sensor networks, opinion dynamics, and packet routing are all naturally cast as distributed convex minimization [11,45,68]. Classically, the primary goal in these problems was to design a communication strategy between the computational agents so that they eventually arrive at the optimal objective value [73]. A considerable body of work [33,54,72,78] has therefore been devoted to obtaining asymptotic convergence guarantees for these problem classes. Going beyond asymptotic analysis, recent years have witnessed extensive progress in obtaining non-asymptotic rates (typically in terms of the number of rounds of communication) for problems in distributed machine learning such as distributed PAC learning [9], distributed online prediction [22], distributed estimation [26,35,54], and distributed delayed stochastic optimization [2,53].
A related paradigm that has recently emerged in distributed computing is that of federated learning [37]. In this paradigm, the processes of data acquisition, processing, and model training are largely carried out on a network's edge nodes such as smartphones [13], wearables [32], location-based services [67], and IoT sensors [40,52], under the orchestration of a central coordinator. Similar to the recent works on distributed machine learning mentioned in the preceding paragraph, for the works in this setting as well, it is the number of rounds of communication that is typically used as a proxy for total communication cost. Additional important concerns for works in federated learning include user privacy and robustness to distribution shifts in users' samples [63] and to heterogeneity in the computational capabilities of the nodes [64]. Finally, while our focus in this paper is the theory, we note that advances in the practice of distributed computing have been tremendously spurred by the development of programming models like MapReduce [21], which enable parallelizing the computation, distributing the data, and handling failures across thousands of machines.
Our setup. As mentioned earlier, only recently has there been a surge of interest in studying the bit complexity of optimization algorithms [31,59,76]. In this paper, we hope to continue pushing efforts in this direction and study the number of bits communicated to solve various distributed convex optimization problems under two models of communication, defined next. Our goal is to compute approximate solutions with efficient communication complexity.

Definition 1.1 (Coordinator Model). There are s machines (servers) and a central coordinator. Each machine can send information to and receive information from the coordinator. Any bit communicated with the coordinator counts toward the communication complexity of the algorithm.

Definition 1.2 (Blackboard Model). There are s machines and a coordinator (blackboard). Each machine can send information to and receive information from the coordinator. Only bits sent to the coordinator count toward the communication complexity of the algorithm.
The coordinator model is equivalent, up to a factor of two and an additive log s bits per message, to the point-to-point model of computation, in which machines directly interact with each other. The blackboard model may be viewed as having a shared memory between the machines, since it costs the machines only to write onto the blackboard, while reading from the blackboard is free.
We consider several fundamental optimization problems that have been studied extensively outside the distributed setting: least squares regression, low rank approximation, linear programming, and optimizing a sum of convex nonsmooth functions.We provide improved communication upper and lower bounds for these problems in the aforementioned distributed settings.While we obtain nearly tight upper and lower bounds for several of these problems in the "worst-case" settings, e.g., when matrices are arbitrarily poorly conditioned, another important component of our work is in improving bounds for well-behaved inputs, e.g., well-conditioned matrices or decomposable functions.

Our Contributions
In this paper, we address the communication complexity of least squares regression, low-rank approximation, and linear programming in the coordinator model, and finite-sum minimization of Lipschitz functions in the blackboard model. Our central technical novelty lies in developing methods, efficient in terms of bit complexity, for leverage score sampling, inverse maintenance, cutting-plane methods, and the use of block leverage scores in the distributed setting and in finite arithmetic. We summarize all formal statements in this section.
General Setup. In all problems, we consider a matrix that is divided among the s servers as per the row-partition model. This is in contrast to the arbitrary partition model, in which each server holds a matrix A^(j), with A = Σ_{j∈[s]} A^(j). In our model, the j-th machine stores a matrix A^(j) ∈ R^{n_j×d}, and our problem matrix A ∈ R^{n×d}, with n = Σ_{j=1}^s n_j, is formed by vertically stacking all the A^(j) matrices, i.e., A = [A^(j)].
For least squares regression and linear programming, each server additionally holds a vector b^(j) ∈ R^{n_j} whose vertical concatenation we denote by b = [b^(j)] ∈ R^n, with n = Σ_{j=1}^s n_j. When considering linear programming and finite-sum minimization, the vector c (the vector that appears in the objective obtained by reducing the original finite-sum minimization via an epigraph trick) is also shared between the machines (or can be shared with a small amount of communication). We explicitly describe the setup for each problem in its corresponding section.
We assume that the entries of A^(j) and b^(j) can be represented with L bits. We often model this by assuming that all entries are integers in {−2^L + 1, ..., 2^L}. Sometimes it will be more convenient to work with normalized vectors and matrices, in which case we allow entries to be of the form z·2^{-L} with z ∈ {−2^L + 1, ..., 2^L}. We say that such numbers are expressed to L bits of precision.
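As a small, self-contained illustration of this setup (the block sizes, value ranges, and helper names below are our own choices for illustration, not part of the formal model), one can generate a row-partitioned instance with L-bit fixed-point entries as follows:

```python
import numpy as np

def quantize(M, L):
    """Round entries to L bits of precision: each entry becomes z * 2^(-L)
    with z an integer in {-2^L + 1, ..., 2^L}; values are assumed to lie in [-1, 1]."""
    z = np.clip(np.round(M * 2**L), -2**L + 1, 2**L)
    return z / 2**L

rng = np.random.default_rng(0)
s, d, L = 4, 3, 10                       # s machines, dimension d, L bits of precision
sizes = [5, 7, 6, 8]                     # n_j: number of rows held by machine j
blocks_A = [quantize(rng.uniform(-1, 1, size=(nj, d)), L) for nj in sizes]
blocks_b = [quantize(rng.uniform(-1, 1, size=nj), L) for nj in sizes]

# The (implicit) global problem is obtained by vertically stacking the local blocks.
A = np.vstack(blocks_A)                  # A in R^{n x d}, n = sum_j n_j
b = np.concatenate(blocks_b)
print(A.shape, b.shape)                  # (26, 3) (26,)
```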
1.1.1 Least Squares Regression, ℓ_p Regression, and Low-Rank Approximation. In many large-scale machine learning applications, one is faced with a large, potentially noisy regression problem for which a constant-factor approximation is acceptable. Specifically, we are interested in computing an approximate solution x̂ satisfying, for a given constant ε, the relative-error bound (1.1). We formalize our setup below. For least squares regression in this model, [76] gave upper and lower bounds of Õ(sd²L) and Ω(sd + d²L), respectively. Their upper bound comes from sending the (A^(j))⊤A^(j)'s and (A^(j))⊤b^(j)'s to the coordinator, which then computes the exact solution via the normal equations. On the other hand, they show that consistent linear systems can be solved exactly using only Õ(sd + d²L) communication. Furthermore, for consistent systems, the optimal regression error is zero, and so a regression algorithm must output the precise solution. Least squares regression is, therefore, certainly as hard as solving consistent linear systems. This motivates the following question: Is solving least squares regression to constant accuracy harder than solving a consistent linear system?
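Before turning to our improvement, here is the [76]-style baseline made concrete: a small sketch (ours, for illustration only) of the coordinator solving the normal equations after each machine sends only its d × d Gram matrix (A^(j))⊤A^(j) and the d-dimensional vector (A^(j))⊤b^(j), i.e., roughly d² numbers per machine rather than the raw rows.

```python
import numpy as np

def solve_from_local_products(blocks_A, blocks_b):
    """Coordinator-side solve: machine j contributes A_j^T A_j and A_j^T b_j; summing
    these gives A^T A and A^T b, so the coordinator solves the normal equations."""
    gram = sum(Aj.T @ Aj for Aj in blocks_A)                      # A^T A
    rhs = sum(Aj.T @ bj for Aj, bj in zip(blocks_A, blocks_b))    # A^T b
    return np.linalg.lstsq(gram, rhs, rcond=None)[0]

# Sanity check against the centralized least squares solution.
rng = np.random.default_rng(1)
blocks_A = [rng.standard_normal((6, 3)) for _ in range(4)]
blocks_b = [rng.standard_normal(6) for _ in range(4)]
x_dist = solve_from_local_products(blocks_A, blocks_b)
x_cent = np.linalg.lstsq(np.vstack(blocks_A), np.concatenate(blocks_b), rcond=None)[0]
assert np.allclose(x_dist, x_cent)
```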
Our key (and surprising) takeaway message for this setting is that for constant ε, regression is no harder than solving linear systems. Specifically, we give a protocol which, for any constant ε > 0, achieves Õ(sdL + d²L) bits of communication for least squares regression, thus improving upon [76]'s Õ(sd²L) upper bound and matching its lower bound of Ω(sd + d²L) for constant L. Our upper bound also gives the first separation for least squares regression between the row-partition model and the arbitrary partition model, for which [46] showed an Ω(sd²) lower bound.
Theorem 1.4 (ℓ_2 Regression in the Coordinator Model). Given ε > 0 and a least squares regression problem in the setup of Problem 1.3 with input matrix A = [A^(j)] ∈ R^{n×d} and vector b = [b^(j)] ∈ R^n, there is a randomized protocol that allows the coordinator to solve the least squares regression problem with constant probability and relative error (1 ± ε). Additionally, if κ is a known upper bound on the condition number of A, then there is a protocol using Õ(sd log κ + d²ε^{-1}) communication.
If L is not constant, there still remains a gap between the above bound and [76]'s lower bound of Ω(sd + d²L). By proving an improved Ω(sdL) lower bound for ℓ_2 regression under a mild restriction on the number of rounds of the protocol, we close this gap (cf. Section 1.2.5).
Our upper bound from Theorem 1.4 extends to ℓ_p regression for 1 ≤ p < 2, as captured by Theorem 1.5. Notably, our protocols for regression have a small O(1) number of rounds of communication, with no dependence on the condition number of A.
Theorem 1.5 (ℓ_p Regression for 1 ≤ p < 2 in the Coordinator Model). For the setup described in Problem 1.3, there exists a randomized protocol that, with probability at least 1 − δ, allows the coordinator to produce an ε-distortion ℓ_p subspace embedding for the column span of A. As a result, the coordinator can solve ℓ_p regression (for 1 ≤ p < 2) with the same communication.
While the focus of our work for regression has been on the coordinator model (Theorem 1.4 and Theorem 1.5), we note that [76] already provide optimal communication cost algorithms for constant-accuracy regression in the blackboard model, as remarked below.
Low Rank Approximation. As an application of our aforementioned least squares regression techniques, we obtain improved bounds for low-rank approximation in the distributed setting, a problem several prior works [12,14,29,38] have considered. Notably, [14] studied the variant of the problem wherein the rows of A are partitioned among the servers, and all servers must learn a projection Π that yields an approximately optimal Frobenius-norm error, i.e., Inequality (1.2): ∥A − AΠ∥_F ≤ (1 + ε)∥A − A_k∥_F, where A_k is the best rank-k approximation of A. In this setting, [14] provide an upper bound of Õ(sdk) for constant ε, along with a nearly matching lower bound of Ω(sdk). However, their lower bound crucially requires all servers to learn the projection. A natural question we answer is whether relaxing this constraint could yield a better communication complexity. In other words: Is it possible to do better when only the coordinator needs to learn the projection?

Theorem 1.7 (Low-Rank Approximation in the Coordinator Model). For the setup described in Problem 1.3, suppose that the servers have shared randomness. Then there is a randomized protocol, with communication polynomial in 1/ε and the problem parameters, that with constant probability lets the coordinator produce a rank-k orthogonal projection Π ∈ R^{d×d} (where k ≤ d) satisfying Inequality (1.2).
1.1.2 High-Accuracy Least Squares Regression. Complementing our constant-factor regression results in the previous paragraphs, we study regression solved to machine precision. We show that when the matrix A ∈ R^{n×d} has a small condition number (i.e., poly(n)), we obtain an Õ(sd + d² log(ε^{-1})) communication complexity for solving least squares regression to high accuracy. Specifically, we obtain the following result.
Theorem 1.8 (High-Accuracy ℓ_2 Regression in the Coordinator Model). Given ε > 0 and the least squares regression setting of Problem 1.3 with input matrix A = [A^(j)] ∈ R^{n×d} and vector b = [b^(j)] ∈ R^n, there is a randomized algorithm that, with high probability, outputs a vector x̂ satisfying the additive-error guarantee (1.3), with communication depending on ε and on the condition number κ of A. Moreover, the vector x̂ is available on all the machines at the end of the algorithm.
This result improves upon the Õ(sd²L) bound of [76] (which also gives an associated lower bound of Ω(sd + d²L)). The error guarantee of Theorem 1.8 differs from that of Theorem 1.4 in two ways. First, the error is additive in Theorem 1.8 instead of multiplicative. The main reason is that the solution produced by the algorithm of Theorem 1.8 is available to all the machines instead of being available only to the coordinator. This is needed when we use this result for each iteration of linear programming (Theorem 1.10). We note that the solution produced by the algorithm of Theorem 1.4 can be shared among all the machines, but we would need to use a rational number representation to share it, and the communication cost would increase significantly in this case. The second difference is that the dependence of the running time on the error parameter ε is logarithmic in Theorem 1.8. This allows us to achieve high-accuracy solutions, which are again needed for our linear programming results to deal with adaptive adversary issues.
Our improvement is achieved by a novel rounding procedure for Richardson's iteration with preconditioning and has consequences outside the distributed setting as well. In particular, it implies an improvement in the bit complexity of solving a least squares regression problem (with an input that has constant bit complexity).

Remark 1.9. While our result in Theorem 1.8 operates only in the coordinator model, [76] studies this problem in the blackboard model as well. In particular, for ℓ_2 regression in the blackboard model with general accuracy parameter ε, [76] provides an algorithm with communication cost Õ(sd + d²ε^{-1}), along with an associated lower bound.

1.1.3 High-Accuracy Linear Programming. A core technical component in achieving the results of Section 1.1.2 is the communication-efficient computation of a spectral approximation of a matrix via its intimate connection to its approximate leverage scores. We utilize this idea to develop communication-efficient high-accuracy linear programming too, as we describe next.
The work of [76] studied this problem and gave an upper bound of Õ(sd³L + d⁴L) by implementing Clarkson's algorithm [18] in the coordinator model. To obtain this bound, [76] first note that, following the analysis of the original algorithm in [18], the total number of rounds of communication is O(d log n). In each round, the coordinator sends to all the servers a vector x_t, which is an optimal solution to the linear program Ax ≤ b. By polyhedral theory, there exists a non-singular subsystem Bx ≤ c such that x_t is the unique solution of Bx = c. By Cramer's rule, each of the d entries of x_t is a ratio of integers between −d!·2^{dL} and d!·2^{dL} and can therefore be represented in O(d(L + log d)) bits. Multiplying all these quantities yields the claimed communication complexity.
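For intuition, the bit bound above follows from a standard determinant estimate (a sketch of the usual argument, not a quotation from [76]): by Cramer's rule and Hadamard's inequality, for a non-singular d × d integer matrix B with entries bounded in magnitude by 2^L,
\[
x_i \;=\; \frac{\det(B_i)}{\det(B)}, \qquad |\det(B)|,\ |\det(B_i)| \;\le\; \prod_{k=1}^{d} \lVert \mathrm{row}_k \rVert_2 \;\le\; \big(\sqrt{d}\, 2^{L}\big)^{d} \;\le\; d!\, 2^{dL},
\]
where B_i replaces the i-th column of B by the right-hand side. Both numerator and denominator are therefore integers of magnitude at most d!·2^{dL}, so each coordinate can be written with O(d(L + log d)) bits.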
We take a different approach and improve upon [76]'s above bound of Õ(sd³L + d⁴L) to Õ(sd^{1.5}L + d²L). Our improvement is achieved by essentially adapting to the distributed setting recent advances in interior point methods for solving linear programs [42,43,75], with the associated toolkit of a weighted central path approach, efficient inverse maintenance, and data structures for efficient matrix-vector operations. Our rate holds for linear programs that have a small outer radius and a well-conditioned constraint matrix A, as we formalize next.
Theorem 1.10 (Linear Programming in the Coordinator Model). Given ε > 0, input matrix A = [A^(j)] ∈ R^{n×d}, and vectors c = [c^(j)] ∈ R^n and b ∈ R^d in the setup of Problem 1.3, there is a randomized algorithm that, with high probability, outputs a vector x ∈ R^n that is approximately feasible and approximately optimal, with error governed by ε and by R, the linear program's outer radius, i.e., ∥x∥_2 ≤ R for all feasible x. The algorithm's communication is bounded in terms of s, d, L, ε, the condition number κ of A, and the linear program's inner radius r, i.e., there exists a feasible x with x_i ≥ r for all i ∈ [n]. Moreover, the vector x is available on all the machines at the end of the algorithm.
As a special case of our Theorem 1.10, when the linear program's parameters κ, R, and r^{-1} are of scale poly(n), we obtain a communication complexity of Õ(sd^{1.5}L + d²L), which is also an improvement over [76]'s previous bound.
Remark 1.11. While we study linear programming only in the coordinator model, [76] studies this in the blackboard model as well. Specifically, in constant dimensions, [76] provides a randomized communication lower bound for linear programming in the blackboard model.
1.1.4 Finite-Sum Minimization with Varying Supports. Another problem class naturally amenable to study in the distributed setting is that of finite-sum minimization. We consider, in the blackboard model, the problem min_x Σ_{i=1}^m f_i(x), where each f_i : R^d → R is Lipschitz, convex, nonsmooth, and supported on d_i (potentially overlapping) coordinates. We call this problem "decomposable nonsmooth convex optimization". The assumption of varying supports appears prominently in decomposable submodular function minimization [8,61] and was recently studied by [25]. This problem, without this assumption, has seen extensive progress in variants of stochastic gradient descent (cf. Section 1.2.4). We formalize below the problem setup in the blackboard model. Directly adapting the algorithm of [25] to the above model yields a communication cost of Õ(max_{i∈[m]} d_i · Σ_{i=1}^m d_i). In this work, we improve this cost to Õ(Σ_{i=1}^m d_i²), as formalized next.
Theorem 1.13 (Distributed Decomposable Nonsmooth Convex Optimization). Given ε > 0 and the setup of Problem 1.12, consider the problem min_x Σ_{i=1}^m f_i(x), where each f_i : R^d → R is convex, Lipschitz, and dependent on d_i coordinates of x. Define x^* := arg min_{x∈R^d} Σ_{i=1}^m f_i(x). Suppose further that we know an initial x^(0) ∈ R^d such that ∥x^* − x^(0)∥_2 ≤ R. Then, there is an algorithm that outputs an ε-approximate minimizer x ∈ R^d. Our algorithm uses O(Σ_{i=1}^m d_i² log(ε^{-1})) · ω bits of communication, where ω, the word length, is logarithmic in the problem parameters. At the end of our algorithm, all servers hold this solution.
Our technical novelty, modifying the analysis and slightly modifying the algorithm of [25], yields an improvement not just in the distributed setting but also in the (non-distributed) setting in which [25] studied this problem. Specifically, as a corollary (Theorem 1.14), we improve the total oracle cost of decomposable nonsmooth convex optimization from Õ(O_max · Σ_{i=1}^m d_i) to Õ(Σ_{i=1}^m O_i d_i), where O_i is the cost of invoking the i-th separation oracle, and O_max is the maximum of all the O_i.

Theorem 1.14 (Solving Problem 1.12). Consider the problem min_{x∈R^d} Σ_{i=1}^m f_i(x) with each f_i : R^d → R a convex, Lipschitz, possibly non-smooth function, depending on d_i coordinates of x and accessible via a (sub-)gradient oracle. Define the minimizer x^* := arg min_{x∈R^d} Σ_{i=1}^m f_i(x). Suppose we are given a vector x^(0) ∈ R^d such that ∥x^* − x^(0)∥_2 ≤ R. Then, given a weight vector w ∈ R^m_{≥1}, there is an algorithm that, in time polynomial in the problem size and log(ε^{-1}), outputs an ε-approximate minimizer x ∈ R^d. Moreover, let Q_i be the number of subgradient oracle calls to f_i. Then, the algorithm's total oracle cost is Σ_{i=1}^m O_i Q_i, which the analysis bounds in a per-function manner (cf. Section 1.2.4).

As alluded to earlier, an important special case of decomposable nonsmooth convex optimization is decomposable submodular function minimization, which in turn has witnessed a long history of research [8,27,34,39,57]. Therefore, outside of distributed optimization, an immediate application of Theorem 1.14 is an improved cost of decomposable submodular function minimization, as we describe in Corollary 1.15.

1.1.5 Lower Bounds. Finally, we complement our upper bound results from the previous sections with lower bounds. [76] asked the following question: from the perspective of communication complexity, is solving a linear program harder than (exactly) solving a linear system? Towards answering this question, they showed that, in constant dimensions, checking feasibility of a linear program requires Ω(sL) communication in the coordinator model, while feasibility for linear systems requires exponentially less communication, thereby demonstrating an exponential separation between the two problems. However, their lower bound for linear programs was based on a hard instance with 2^{Ω(L)} constraints. So for linear program feasibility problems with polynomially many constraints, they leave open the possibility of a protocol with communication cost O(s log L) + poly(L). This is an important limitation of their lower bound, since they show, for example, that a modified Clarkson's algorithm [18,76] indeed gives a protocol with only an s · log L dependence. This, therefore, motivates the following question: is there an exponential separation between checking feasibility of linear programs and solving linear systems when there are only poly(s + L) constraints?
We answer this question in the affirmative, showing that such a separation does in fact hold, even for linear feasibility problems with O(s + L) constraints.
Theorem 1.16. Any protocol solving Linear Feasibility in the coordinator model requires at least Ω(sL) communication, for protocols that exchange at most L/log L rounds of messages with each server (under a mild relation between log s and L). This bound holds even when the number of constraints is promised to be at most O(s + L). For constant s, the Ω(L) lower bound holds with no assumption on the number of rounds.
In addition to linear programming, one could also ask for tight lower bounds for relative-error least squares regression, as discussed above. [76] gave a lower bound of Ω(sd + d²L) for constant ε; however, our algorithm requires Õ(sdL + d²L) bits. We close this gap by showing that the sdL term is unavoidable. Interestingly, our regression lower bound follows from the same techniques that we use to derive our linear programming lower bound.

Technical Overview
Before providing the details of our algorithms and analyses for each of the aforementioned results, we give high-level overviews of the techniques we use for each of them.

1.2.1 Least Squares Regression and Subspace Embeddings. We give two protocols for the regression problem instance of min_x ∥Ax − b∥_2. The first is based on sketching what we refer to as the block leverage scores, where the block leverage score of a block is simply the sum of the leverage scores of the rows in that block (leverage scores computed with respect to A). Our second protocol is based on non-adaptive adaptive sketches [50] from the data streaming literature. Both of our protocols for regression operate by constructing a subspace embedding matrix S for the span of the columns of A and b. This is a stronger guarantee than solving the regression problem, as the coordinator may compute SAx − Sb = S(Ax − b) and output the solution to the sketched regression problem [77]. The subspace embedding construction proves useful in contexts other than regression too. Indeed, we require the subspace embedding construction to get improved communication for low-rank approximation. We now describe our two approaches below.
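To illustrate why a subspace embedding suffices for regression, here is a generic sketch-and-solve demonstration (ours, using a dense Gaussian sketch purely for illustration; it is not the communication-efficient construction developed in this paper):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 2000, 10
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# An oblivious subspace embedding for the span of the columns of [A, b]:
# with m = O(d / eps^2) rows, ||S y||_2 ~ (1 +/- eps) ||y||_2 for all y in that span.
m = 40 * d
S = rng.standard_normal((m, n)) / np.sqrt(m)

x_sketch = np.linalg.lstsq(S @ A, S @ b, rcond=None)[0]   # solve the sketched problem
x_exact = np.linalg.lstsq(A, b, rcond=None)[0]

res_sketch = np.linalg.norm(A @ x_sketch - b)
res_exact = np.linalg.norm(A @ x_exact - b)
print(res_sketch / res_exact)   # close to 1: the sketched solution is near-optimal for the original problem
```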
Block Leverage Scores. While block leverage scores have previously been considered in various forms [41,51,58,60,80], as far as we are aware they have (naturally) been used only in the context of sampling entire blocks at a time. In our setting, we are ultimately interested only in sampling rows, but find that approximating the block leverage score of each server is a useful subroutine. Specifically, we show that for small k, sampling a k × d row-sketch of each block is almost sufficient to estimate all the block leverage scores. The catch is that we fail to accurately estimate block leverage scores that are larger than k. Intuitively, this is because such blocks could have more than k "important" rows. So our approach is to attempt to estimate the leverage score of all blocks via sketching using a small value of k. We might find that a small number of blocks have leverage scores that are too big for the estimates of their leverage scores to be accurate. To fix this, we focus on those blocks and sample a larger row-sketch from them in order to get a better estimate of their block leverage scores. Taking a larger sketch requires more communication per block. Crucially, however, since the block leverage scores sum to at most d, the number of servers with block leverage score greater than k is at most d/k. Thus we may proceed in a series of rounds, where in round i we focus on servers with block leverage score at least 2^i. There are at most d/2^i such servers, and for each such server we take a sketch of roughly 2^i rows, so each round after the first (of which there are only O(log d)) uses roughly d²L communication. We note that the first round requires a roughly 1 × d sized sketch from all servers, which yields an sdL dependence.
Once we have estimates of the block leverage scores, we observe that sampling sketched rows from the blocks proportionally to the block leverage scores suffices to obtain a subspace embedding for A.
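The following toy computation (ours; everything is computed centrally and exactly, whereas the protocol above only estimates these quantities from sketches) shows the two ingredients: block leverage scores as sums of row leverage scores, and standard leverage-score row sampling, which with good probability yields a subspace embedding.

```python
import numpy as np

def leverage_scores(A):
    """Row leverage scores via a thin QR factorization: l_i = ||Q[i, :]||_2^2."""
    Q, _ = np.linalg.qr(A)
    return np.sum(Q**2, axis=1)

rng = np.random.default_rng(3)
s, d = 5, 8
sizes = [150, 120, 180, 90, 160]                     # rows held by each server
blocks = [rng.standard_normal((nj, d)) for nj in sizes]
A = np.vstack(blocks)

scores = leverage_scores(A)
offsets = np.cumsum([0] + sizes)
block_scores = [scores[offsets[j]:offsets[j + 1]].sum() for j in range(s)]
print(np.round(block_scores, 3), "sum =", round(sum(block_scores), 3))   # block scores sum to rank(A) = d

# Leverage-score row sampling: keep row i with probability p_i = min(1, c * l_i), rescale by 1/sqrt(p_i).
c = 8 * np.log(d)
p = np.minimum(1.0, c * scores)
keep = rng.random(len(p)) < p
SA = A[keep] / np.sqrt(p[keep])[:, None]
print(A.shape, "->", SA.shape)   # SA is, with good probability, a subspace embedding of A
```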
Non-adaptive Adaptive Sketching. When p = 2, our protocol runs the recursive leverage score sampling procedure of [19], adapted to the distributed setting. One potential approach is to run this algorithm by sketching the inverse spectral approximations and broadcasting them to the servers. Unfortunately, when A is nearly singular, these sketches can have a high bit complexity. To avoid this, we instead use a version of an ℓ_2 sampling sketch which can be applied on the servers' side and sent to the coordinator, which allows the coordinator to sample from the appropriate (relative) leverage score distribution. An issue arises if some relative scores are much larger than one, as we need to truncate them to roughly one before using them as sampling probabilities (up to scaling). To fix this, we first give a subroutine to identify this subset of outlying rows.
ℓ_p Regression beyond p = 2. Our "non-adaptive adaptive" protocol above extends to give optimal guarantees for ℓ_1 regression and ℓ_p regression for 1 ≤ p ≤ 2, essentially by using the more general recursive Lewis weight sampling protocol of [20].
For 2 < p < 4, the recursive Lewis weight sampling algorithm of [20] can also be run exactly to construct an ℓ_p subspace embedding, simply by broadcasting the approximate Lewis quadratic form to all servers in each round. Since the quadratic form is a d × d matrix, this broadcasting incurs an Õ(sd²) cost per round and hence an Õ(sd²) cost for computing approximate Lewis weights for all rows. The coordinator then must sample d^{p/2} rows, resulting in a cost of Õ(sd² + d^{max(p/2,1)+1}).
If one is interested in sampling a coreset of rows to obtain an ℓ_p subspace embedding, then the d^{p/2+1} term is unavoidable, as we need to sample at least d^{p/2} rows [48]. However, for 1 ≤ p ≤ 2, our approach for ℓ_2 regression shows that the sd² term can be improved to roughly sd. Whether the sd² term can be improved for all p is an interesting question that we leave to future work.

1.2.2 High-Accuracy Least Squares Regression. While constant-factor approximations often suffice, in certain settings it is important to ask for a solution that is optimal to within machine precision, e.g., if such solutions are used in iterative methods for solving a larger optimization problem. In this setting we consider the problem instance min_x ∥Ax − b∥_2, where, given a row-partitioned system A, b, the goal of the coordinator is to output an x̂ for which ∥Ax̂ − b∥_2 ≤ min_x ∥Ax − b∥_2 + ε·∥A(A⊤A)†A⊤b∥_2. A direct application of gradient descent requires about κ iterations, where κ is the condition number of the matrix A.
Our protocol for solving this problem in the distributed setting to high accuracy is Richardson's iteration with preconditioning, coupled with careful rounding. We precondition using a constant-factor spectral approximation of the matrix to reduce the number of iterations to only O(log(ε^{-1})). This version of Richardson's iteration is equivalent to performing Newton's method with an approximate Hessian, since the preconditioner spectrally approximates the Hessian inverse (A⊤A)^{-1}. Computing a spectral approximation to A is equivalent to computing a subspace embedding, so to compute the preconditioner we employ our regression protocol from above. We note that the refinement sampling procedures we use in our two regression algorithms are very similar, since both are based on the same algorithm from [19]. However, we believe there are sufficient differences in the specifics to merit writing out the latter in full. Alternatively, it is possible to employ a somewhat simpler protocol (which also has the advantage of computing approximations to all leverage scores), since in the high-precision setting we allow for a condition-number dependence.
The key novelty in the implementation of our Richardson-style iteration is to communicate, in each step, only a partial number of bits of the residual vector. The idea here is that, as the solution converges, the bits with high place values do not change much between consecutive iterations and therefore need not be sent every time. Using a similar idea, we show that the Richardson iteration is robust to a small amount of noise, which helps us avoid updating the lowest-order bits. Overall, via a careful perturbation analysis, we show that communicating the updates on only the O(log κ) middle bits of each entry suffices to guarantee the convergence of Richardson's iteration.
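A toy version of this scheme (ours; the preconditioner is built from a Gaussian sketch, and the bit truncation below only models dropping the low-order bits, standing in for the more careful windowing and rounding analysis in the paper):

```python
import numpy as np

def drop_low_bits(v, bit):
    """Round each entry to a multiple of 2^bit; in the protocol the discarded
    low-order bits are simply never communicated."""
    scale = 2.0**bit
    return np.round(v / scale) * scale

rng = np.random.default_rng(4)
n, d = 500, 6
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]

# Preconditioner P ~ (A^T A)^{-1}, built from a small sketch SA
# (a constant-factor spectral approximation of A).
S = rng.standard_normal((20 * d, n)) / np.sqrt(20 * d)
P = np.linalg.inv((S @ A).T @ (S @ A))

x = np.zeros(d)
for _ in range(30):
    g = A.T @ (A @ x - b)            # gradient of (1/2)||Ax - b||^2, computed from the servers' blocks
    g = drop_low_bits(g, bit=-40)    # only a window of bits of each entry is communicated
    x = x - P @ g                    # preconditioned Richardson / approximate Newton step
print(np.linalg.norm(x - x_star))    # decreases geometrically until it hits the truncation floor
```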
1.2.3 High-Accuracy Linear Programming. Similar to Section 1.2.2, we ask the question of solving linear programs to high accuracy. This requires different techniques from the ones used in fast first-order algorithms for linear programs with runtimes depending polynomially on 1/ε [6,7,79]. Specifically, interior-point methods and cutting-plane methods are the standard approaches in the high-accuracy regime. Recent advances in fast high-accuracy algorithms for linear programs [42,43,75] were spurred by developments in the novel use of the Lewis weight barrier, techniques for efficient maintenance of the approximate inverse of a slowly-changing matrix, and efficient data structures for various linear algebraic primitives. Our approach for a communication-efficient high-accuracy linear program solver builds upon these developments, effectively adapting them into the coordinator model.
We first describe the standard framework of interior-point methods. In this paradigm, one reduces solving the problem min_{u∈S} c⊤u to that of solving a sequence of slowly-changing unconstrained problems min_u Ψ_t(u) := t·c⊤u + φ_S(u), parametrized by t, with a self-concordant barrier φ_S that enforces feasibility by becoming unbounded as u approaches the boundary of S. The algorithm starts at t = 0, for which an approximate minimizer x^*_0 of φ_S is known, and it alternates between increasing t and updating, via Newton's method, x to an approximate minimizer x^*_t of the new Ψ_t. For a sufficiently large t, the minimizer x^*_t also approximately optimizes the original problem min_{u∈S} c⊤u with sub-optimality gap O(ν/t), where ν is the self-concordance parameter of the barrier function used. This self-concordance parameter typically also appears in the iteration complexity.
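As a self-contained illustration of this framework, here is a textbook log-barrier path-following method for min{c⊤u : Au ≤ b} with exact Newton steps (our toy sketch; it is not the Lewis-weight-based, communication-efficient method developed in this paper):

```python
import numpy as np

def barrier_lp(A, b, c, t0=1.0, mu=1.5, eps=1e-6, newton_steps=50):
    """Minimize c^T u over {u : A u <= b} by following the central path of
    Psi_t(u) = t * c^T u - sum_i log(b_i - A_i u); assumes u = 0 is strictly feasible."""
    m, n = A.shape
    u, t = np.zeros(n), t0
    while m / t > eps:                                  # standard duality-gap style stopping rule
        for _ in range(newton_steps):                   # (re-)center for the current t
            slack = b - A @ u
            grad = t * c + A.T @ (1.0 / slack)
            hess = A.T @ ((1.0 / slack**2)[:, None] * A)
            step = np.linalg.solve(hess, grad)
            alpha = 1.0
            while np.any(A @ (u - alpha * step) >= b):  # backtrack to stay strictly feasible
                alpha *= 0.5
            u = u - alpha * step
            if grad @ step < 1e-12:                     # Newton decrement small: centered enough
                break
        t *= mu                                         # advance along the central path
    return u

# Tiny example: minimize x + y over the box -1 <= x, y <= 1 (optimum at (-1, -1)).
A = np.vstack([np.eye(2), -np.eye(2)])
b = np.ones(4)
c = np.ones(2)
print(barrier_lp(A, b, c))   # approximately [-1, -1]
```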
While this is the classical interior-point method as pioneered by [55], there has been a flurry of recent effort focusing on improving different components of this paradigm. The papers we use for our purposes are those by [42,43,75], which developed variants of the aforementioned central path method essentially by reducing the original LP to certain data structure problems such as inverse maintenance and heavy hitters. Adapting these approaches to the coordinator model, we provide an algorithm for approximately solving LPs with Õ((d^{1.5}s(L + log κ) + d²)·log(ε^{-1})) bits of communication (Theorem 1.10). Among the tools we employ for our analysis are those we develop for matrix spectral approximation and our result on the communication complexity of leverage scores, which we use to bound the communication complexity of iteratively computing Lewis weights for computing an initial feasible solution.
1.2.4 Decomposable Nonsmooth Convex Optimization. For decomposable nonsmooth convex optimization in the blackboard model, we improve an algorithm from the literature and then adapt this improved algorithm to the distributed setting. Specifically, we study min_{x∈R^d} Σ_{i=1}^m f_i(x), where each f_i is Lipschitz, convex, and dependent on d_i of the coordinates of x (note that the different f_i could have overlapping supports), and the i-th machine has subgradient-oracle access to the i-th function.
Most prior works [36,66,69] and their accelerated variants [1,3,30,49,82] designed for the non-distributed variant of finite-sum minimization assume each f_i to be smooth and strongly convex. Those designed for the distributed setting [15,70,81] also typically impose this assumption (some exceptions include [26]), but additionally use as their performance metric only the number of rounds of communication, as opposed to the total number of bits communicated, which is what we focus on. Variants of gradient descent [56] that are typically applicable to this problem also require a bounded condition number. There has also been work on non-smooth empirical risk minimization, but it usually requires that the objective be a sum of a smooth loss and a non-smooth regularizer. The formulation we study is a more general form of empirical risk minimization: in particular, our setting allows all f_i to be non-smooth.
The work of [25] combines ideas from classical cutting-plane and interior-point methods to obtain a nearly-linear (in the total effective dimension) number of subgradient oracle queries for solving the problem in the non-distributed setting. This is the algorithm we modify and adapt to the distributed setting; our modification also yields improvements in the non-distributed setting.
We first describe the result obtained by simply adapting the algorithm of [25] to the blackboard model. Following [25], we first use a simple epigraph trick to reduce this problem to one with a linear objective, constrained to an intersection of parametrized epigraphs of the f_i: min{c⊤x : x_i ∈ K_i ⊆ R^{d_i+1} for all i ∈ [m], Ax = b}, where the K_i are all convex sets. All servers hold identical copies of the problem data at all times. However, each server has only separation-oracle access to the set K_i, which comes from the equivalence to subgradient-oracle access to f_i by the result of [44].
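Concretely, the epigraph trick introduces one extra variable per summand (this restatement is ours; we write x_i for the block of coordinates that f_i depends on and t_i for its epigraph variable): the problem min_x Σ_i f_i(x) becomes
\[
\min_{(x,\,t_1,\dots,t_m)}\ \sum_{i=1}^{m} t_i \quad \text{subject to} \quad (x_i, t_i) \in K_i := \{(y,\tau) : f_i(y) \le \tau\} \subseteq \mathbb{R}^{d_i+1} \ \ \text{for all } i \in [m],
\]
a linear objective over an intersection of convex sets; linear constraints of the form Ax = b can then, for instance, tie together the copies of coordinates shared by several f_i, and a subgradient of f_i at y yields a separating hyperplane for K_i whenever f_i(y) > τ.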
We maintain crude outer and inner set approximations, K_{in,i} and K_{out,i}, to each set K_i (such that K_{in,i} ⊆ K_i ⊆ K_{out,i}) and update x, our candidate minimizer of the (new) objective c⊤x, using an interior-point method. Ideally, for some choice of barrier function defined over K ∩ {Ax = b}, we would update our candidate minimizer to move along the central path through this set. However, since we do not explicitly know K, we instead use a barrier function defined over its proxy, K_out ∩ {Ax = b}. We improve our approximations K_in and K_out using ideas inspired by classical cutting-plane methods [74]. Thus, our algorithm essentially alternates between performing a cutting-plane step (to improve our set approximation of K) and performing an interior-point method step (to enable the candidate minimizer x to make progress along the central path).
Each server runs a copy of the above algorithm. After updating the parameter t and computing x^*_out, the current target for the interior-point method step, each server tests feasibility of x^*_out. If there is a potential infeasibility of the i-th block, x^*_{out,i}, then the server queries its oracle at x^*_{out,i} (the i-th block of the current target point) and sends to the blackboard a separating hyperplane to update K_{out,i}, or a bit to indicate otherwise. The other servers then read this information and update either the set K_{out,i} or K_{in,i} on their ends. It was shown in [25] that this algorithm (outside the distributed setting) has an oracle query complexity of Õ(Σ_{i=1}^m d_i). In the distributed setting, this would translate to a communication complexity of Õ(max_{i∈[m]} d_i · Σ_{i=1}^m d_i). Our main novelty is to modify the prior analysis (and slightly modify a specific parameter of the algorithm) so as to obtain the more fine-grained oracle cost of Õ(Σ_{i=1}^m w_i d_i), for any arbitrary weight vector w ≥ 0. In our distributed algorithm, we set w_i = d_i, since the only communication that happens in a round is when a server sends hyperplane information (which takes about d_i words) to the blackboard. Thus, this translates to a communication cost of Õ(Σ_{i=1}^m d_i²), an improvement over the bound of Õ(max_{i∈[m]} d_i · Σ_{i=1}^m d_i) obtained from adapting [25] to the blackboard setting.
1.2.5 Lower Bounds. We are interested in obtaining tight lower bounds for least squares regression and low-rank approximation that capture the dependence on the bit complexity L. When proving such lower bounds, it is common to reduce from communication games such as multi-player set-disjointness [65]. However, it is not at all clear how one could encode such a combinatorial problem into an instance of regression that would yield a good bit-complexity lower bound. Indeed, most natural reductions from the standard communication problems would result in single-bit entries of A. This motivates us to introduce a new communication game (Problem 1.17) that forces the players to communicate a large number of bits of their inputs.
Problem 1.17. The coordinator holds an (infinite-precision) unit vector v ∈ R^d with d ≥ 3, and the servers hold unit vectors w_1, ..., w_s ∈ R^d, respectively. The coordinator must decide between (a) v⊤w_j = 0 for all j ∈ [s], and (b) for some j, |v⊤w_j| ≥ ε and v⊤w_i = 0 for all i ≠ j.
The two-player version of Problem 1.17 is reminiscent of the promise inner product problem (PromiseIP) over F_p, where the goal is to distinguish between v⊤w = 0 and v⊤w = 1 for v, w ∈ F_p^d. This problem was introduced by [71], who gave an Ω(d log p) lower bound, and was further considered by [47], who developed a multi-player version. We note that their multi-player version is for the "generalized inner product" and is therefore quite different from the game that we introduce. Furthermore, we are not aware of a version of PromiseIP over R that is suitable for our purposes, even though real versions of the inner product problem have been studied [5]. We give the following lower bound for our problem:

Theorem 1.18. A protocol that solves Problem 1.17 with probability at least 0.9 requires at least Ω(sd log(ε^{-1})) communication, for protocols that exchange at most log(ε^{-1})/log log(ε^{-1}) rounds of messages with each server.
To prove this, we begin by considering d = 3 and s = 1, so that the game involves two players, say Alice and Bob, holding vectors v and w in R³. We borrow techniques from Fourier analysis on the sphere to prove an Ω(log(ε^{-1})) communication lower bound. Our techniques are reminiscent of those in [62], although we require somewhat less sophisticated machinery. One might wonder why we choose to start with d = 3 rather than d = 2. It turns out that when d = 2 the Ω(log(ε^{-1})) lower bound does not hold! Indeed, Alice can form the vector v^⊥ so that the problem reduces to checking whether v^⊥ and w are approximately equal up to sign. This reduces to checking exact equality after truncating to approximately log(ε^{-1}) bits. But this is easy to accomplish with O(1) bits of communication by communicating an appropriate hash. It is not immediately clear whether a similar trick could apply in higher dimensions. In particular, any proof of the lower bound must explain the difference between d = 2 and d = 3. The difference turns out to be that the spherical Radon transform is smoothing in dimensions 3 and higher, but not in dimension 2.
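A toy version of the two-dimensional protocol (ours; it ignores the boundary effects of rounding, which a real protocol would handle, e.g., with a shared random shift, and it uses a short SHA-256 prefix as the "appropriate hash"):

```python
import hashlib
import numpy as np

def fingerprint(vec, eps):
    """Round each entry to a grid of size eps (about log2(1/eps) bits per entry) and hash;
    a few bytes of the hash identify the rounded vector with high probability."""
    grid = np.round(np.asarray(vec) / eps).astype(np.int64)
    return hashlib.sha256(grid.tobytes()).hexdigest()[:8]

eps = 1e-6
theta = 0.3
v = np.array([np.cos(theta), np.sin(theta)])            # Alice's unit vector
v_perp = np.array([-v[1], v[0]])                         # the (unique up to sign) direction orthogonal to v

w_orthogonal = -v_perp                                   # case (a): <v, w> = 0 forces w = +/- v_perp in R^2
w_correlated = np.cos(0.01) * v_perp + np.sin(0.01) * v  # case (b): |<v, w>| ~ 0.01, far above eps

for w in (w_orthogonal, w_correlated):
    alice = fingerprint(v_perp, eps)
    bob = (fingerprint(w, eps), fingerprint(-w, eps))    # Bob sends two short hashes (sign ambiguity)
    print("orthogonal" if alice in bob else "correlated")
```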
Given the d = 3 case, we boost our result to higher dimensions by viewing a d-dimensional vector as the concatenation of d/3 vectors, each of 3 dimensions, and then applying the direct-sum technique of [10]. This requires us to first prove an information lower bound for a particular input distribution. This turns out to be easier to accomplish for public-coin protocols, and we then upgrade to general (private-coin) protocols using a "reverse Newman" result of [17]. This last step is where our bounded-round assumption arises from. We note that this is a purely technical artifact of our proof and can likely be avoided. Finally, we show how to extend our lower bound from two players to s players. With this result, we are able to deduce new lower bounds for least-squares regression and testing feasibility of linear programs.
Least Squares Regression. [76] studied the communication complexity of the least squares regression problem and showed a communication lower bound of Ω(sd + d²L). We show that obtaining a constant-factor approximation to a least-squares regression problem requires Ω(sdL) communication, at least for protocols that use at most roughly L rounds of communication. This bounded-round assumption is mild, since our algorithms need only O(1) rounds, which is desirable.
The reduction is from Problem 1.17 above. Our approach is to construct a matrix from the inputs whose smallest singular value is roughly 2^{-L} in case (a) and roughly 2^{-L/2} in case (b). To create such a matrix A we stack the vectors v, w_1, ..., w_s and additionally append an orthonormal basis for v^⊥. We choose ε to be an extremely small constant so that, in either case, v is approximately the singular vector of A corresponding to σ_min(A). In case (a) we will arrange for σ_min(A) to be roughly 2^{-L}, whereas in case (b) w_j will cause σ_min(A) to increase, since w_j has positive inner product with v. While the additive change in σ_min(A) is small, the multiplicative effect will be large. We then set up a regression problem involving A so that an approximate least squares solution has norm roughly 1/σ_min(A) in either case, allowing us to distinguish cases (a) and (b).
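The following numerical sketch (ours) illustrates the multiplicative jump in σ_min(A); the specific 2^{-L} and 2^{-L/2} row scalings below are illustrative choices on our part, not the exact construction from the proof.

```python
import numpy as np

rng = np.random.default_rng(5)
d, s, L = 3, 4, 24
delta, rho = 2.0**(-L), 2.0**(-L / 2)   # scalings for the v row and the w_j rows

def random_unit(dim):
    u = rng.standard_normal(dim)
    return u / np.linalg.norm(u)

v = random_unit(d)
# Orthonormal basis of the hyperplane orthogonal to v (columns 2..d of a QR factorization).
Q, _ = np.linalg.qr(np.column_stack([v, rng.standard_normal((d, d - 1))]))
basis_perp = Q[:, 1:].T

def build(case_b):
    rows = [delta * v] + list(basis_perp)
    for j in range(s):
        w = random_unit(d)
        w -= (w @ v) * v                             # make every w_j orthogonal to v ...
        w /= np.linalg.norm(w)
        if case_b and j == 0:
            w = np.sqrt(0.75) * w + 0.5 * v          # ... except one w_j with <v, w_j> = 0.5 in case (b)
        rows.append(rho * w)
    return np.vstack(rows)

for case_b in (False, True):
    sigma_min = np.linalg.svd(build(case_b), compute_uv=False)[-1]
    label = "case (b)" if case_b else "case (a)"
    print(label, "sigma_min ~ 2^%.1f" % np.log2(sigma_min))   # ~2^-24 in case (a), ~2^-13 in case (b)
```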
Linear programming. Given our new communication lower bound, our reduction to linear feasibility is simple. We pick a collection of linear constraints that forces a feasible point x to (approximately) satisfy x = v, so that distinguishing the two cases of Problem 1.17 reduces to testing feasibility.

Problem 1.3 (Setup in the Coordinator Model). Suppose there is a coordinator and s machines that communicate with each other as per the coordinator model of communication (Definition 1.1), with shared randomness. Suppose each machine j ∈ [s] holds a matrix A^(j) ∈ R^{n_j×d} and a vector b^(j) ∈ R^{n_j}. Denote A = [A^(j)] ∈ R^{n×d} and b = [b^(j)] ∈ R^n, both represented with L bits in fixed-point arithmetic. Moreover, suppose the condition number of A is bounded by κ.

Problem 1.12 (Decomposable Nonsmooth Convex Optimization Setup). Suppose there is a blackboard/coordinator and m machines that communicate with each other as per the blackboard model of communication (Definition 1.2). Suppose each machine i ∈ [m] holds an oracle O_i that returns a subgradient (represented with L bits in fixed-point arithmetic) of the function f_i : R^d → R.