Approximate Earth Mover’s Distance in Truly-Subquadratic Time

We design an additive approximation scheme for estimating the cost of the min-weight bipartite matching problem: given a bipartite graph with non-negative edge costs and ε > 0, our algorithm estimates the cost of matching all but an O(ε)-fraction of the vertices in truly subquadratic time O(n^{2−δ(ε)}). Our algorithm has a natural interpretation for computing the Earth Mover's Distance (EMD), up to an ε-additive approximation. Notably, we make no assumptions about the underlying metric (more generally, the costs do not have to satisfy the triangle inequality). Note that, compared to the size of the instance (an arbitrary n × n cost matrix), our algorithm runs in sublinear time. Our algorithm can approximate a slightly more general problem: max-cardinality bipartite matching with a knapsack constraint, where the goal is to maximize the number of vertices that can be matched subject to a total cost budget B.


Introduction
Earth Mover's Distance (EMD; sometimes also Optimal Transport, Wasserstein-1 distance, or Kantorovich-Rubinstein distance) is perhaps the most important and natural measure of similarity between probability distributions over elements of a metric space [PC19; San15; Vil+09]. Formally, given two probability distributions µ and ν over a metric space (M, d), their EMD is defined as

    EMD(µ, ν) = min { E_{(x,y)∼ζ}[d(x, y)] : ζ is a coupling of µ and ν }.    (1)

When µ and ν are discrete distributions with support size n (perhaps after a discretization preprocessing), a straightforward algorithm for estimating their EMD is to sample Θ(n) elements from each, compute all Θ(n²) pairwise distances, and then compute a bipartite min-weight perfect matching. This algorithm clearly takes at least Ω(n²) time (even ignoring the computation of the matching), and incurs a small additive error due to the sampling.
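The straightforward estimator just described can be written out as a short sketch. Brute force over permutations stands in for a real min-weight perfect matching solver, so this toy version is only viable for very small sample sizes; the function names are illustrative, not from the paper.

```python
import itertools

def min_weight_perfect_matching(cost):
    """Exact min-weight perfect matching cost of an n-by-n cost matrix,
    by brute force over permutations (viable only for tiny n; a real
    implementation would use the Hungarian algorithm or min-cost flow)."""
    n = len(cost)
    return min(sum(cost[i][p[i]] for i in range(n))
               for p in itertools.permutations(range(n)))

def naive_emd_estimate(sample_mu, sample_nu, dist, n_samples):
    """Estimate EMD(mu, nu): draw n_samples points from each distribution,
    form the pairwise distance matrix, and match the two samples."""
    xs = [sample_mu() for _ in range(n_samples)]
    ys = [sample_nu() for _ in range(n_samples)]
    cost = [[dist(x, y) for y in ys] for x in xs]
    return min_weight_perfect_matching(cost) / n_samples
```

Already the distance-matrix construction costs Θ(n²) queries, which is exactly the barrier this paper breaks.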
Our main result is an asymptotically faster algorithm for estimating the EMD:

Theorem 1 (Main Theorem). Suppose we have sample access to two distributions µ, ν over a metric space (M, d) satisfying d(·, ·) ∈ [0, 1] and query access to d. Suppose further that µ, ν have support size at most n.
For each constant γ > 0 there exist a constant ε > 0 and an algorithm running in time O(n^{2−ε}) that outputs an estimate ÊMD such that ÊMD ∈ [EMD(µ, ν) ± γ].
Moreover, the algorithm takes Õ(n) samples from µ and ν.
Notably, our algorithm makes no assumption about the structure of the underlying metric. In fact, the cost can be an arbitrary non-negative function; we do not even assume the triangle inequality.
Beyond bounded support size. Support size is a brittle quantity; indeed, two distributions that are arbitrarily close in total variation (TV) distance (or EMD) can have completely different support sizes. Moreover, for continuous distributions the notion of support size is clearly inappropriate, and yet we would like to compute their EMD through sampling. To obviate this issue, Corollary 1.1 generalizes Theorem 1 to distributions that are close in EMD to some distributions with support size n.
Corollary 1.1. Suppose we have sample access to two distributions µ, ν over a metric space (M, d) satisfying d(·, ·) ∈ [0, 1] and query access to d. Suppose further that there exist µ′, ν′ with support size n such that EMD(µ, µ′), EMD(ν, ν′) ≤ ξ, for some ξ > 0.
For each constant γ > 0 there exist a constant ε > 0 and an algorithm running in time O(n^{2−ε}) that outputs an estimate ÊMD such that ÊMD ∈ [EMD(µ, ν) ± (4ξ + γ)].
Moreover, the algorithm takes Õ(n) samples from µ and ν.
For continuous µ, requiring that µ is close in EMD to a distribution with bounded support size is equivalent to saying that µ can be discretized effectively for EMD computation. Thus, this assumption is natural when computing the EMD between continuous distributions through discretization.
We stress that the algorithm in Corollary 1.1 does not assume knowledge of µ′ (nor ν′) beyond its support size n. Indeed, the empirical distribution over Õ(n) samples from µ (resp. ν) is a good approximation in EMD. Finally, the sample complexity in Theorem 1 and Corollary 1.1 is optimal, up to polylog(n) factors. Indeed, Theorem 1 in [VV10] implies a lower bound of Ω(n) on the sample complexity of testing EMD closeness.
Matching with knapsack constraint. Applying our main algorithm in a graph-theoretic setting, we give an approximation scheme for a knapsack bipartite matching problem, where the goal is to estimate the number of vertices that can be matched subject to a total budget constraint.
Theorem 2 (Main theorem, graph interpretation). For each constant γ > 0, there exist a constant ε > 0 and an algorithm running in time O(n^{2−ε}) with the following guarantees. The algorithm takes as input a budget B, and query access to the edge-cost matrix of an undirected, bipartite graph G over n vertices. The algorithm returns an estimate M̂ that is within ±γn of the size of the maximum matching in G with total cost at most B.
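The quantity that Theorem 2 estimates can be pinned down by brute force on tiny instances. The following hypothetical reference implementation (exponential time, useful only as a specification, not part of the paper's algorithm) enumerates all partial matchings of a small cost matrix:

```python
import itertools

def max_matching_under_budget(cost, budget):
    """Largest k such that some k pairs (i, j), with distinct i's and
    distinct j's, have total cost at most `budget`.  Brute force over all
    partial matchings of a small n-by-n bipartite cost matrix."""
    n = len(cost)
    for k in range(n, 0, -1):
        for left in itertools.combinations(range(n), k):
            for right in itertools.permutations(range(n), k):
                if sum(cost[i][j] for i, j in zip(left, right)) <= budget:
                    return k
    return 0
```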
Exact solution. Computing the EMD between two sets of n points boils down to computing the minimum cost of a perfect matching in a bipartite graph, a problem with a 70-year history [Kuh55]. Min-weight bipartite perfect matching can be cast as a min-cost flow (MCF) instance, and to date we can solve it in n^{2+o(1)} time (namely, near-linear in the size of the distance matrix) [Che+22a]. Plausibly, any exact algorithm requires inspecting the entire distance matrix, so Θ(n²) time is the best we can hope for. In addition, even in d-dimensional Euclidean space, where the input has size d·n ≪ n², no O(n^{2−ε}) algorithm exists, unless SETH is false [Roh19].
The landscape is much less interesting for general metrics. Indeed, a straightforward counterexample from [Bȃd+05] shows that any O(1)-approximation requires Ω(n²) queries to the distance matrix. This suggests that for general metrics we should content ourselves with an additive approximation.
An extremely popular algorithm to solve optimal transport in practice is the Sinkhorn algorithm [Cut13] (see [Le+21; Pha+20] for recent work). The Sinkhorn distance SNK is defined by adding an entropy regularization term −η·H(ζ) to the EMD objective in Equation (1). Approximating SNK via the Sinkhorn algorithm provably yields an εr-additive approximation to EMD and takes O_ε(n²) time, where r is the dataset diameter [ANR17].
Graph-theoretic approaches also led to εr-additive approximations [LMR19] in O_ε(n²) time. Notice that even though all previous approximation algorithms have roughly the same complexity as the MCF-based exact solution, they are backed by experiments showing their practicality, whereas exact algorithms for EMD are still largely impractical for very large graphs.
Breaking the O(n²) barrier for general metrics. As mentioned above, [AZ23] was the first work to break the quadratic barrier for approximate EMD. Indeed, they show a (1+ε)-multiplicative approximation algorithm for EMD on Euclidean space running in n^{2−Ω_ε(1)} time. Matching this result on general metrics is impossible, since no O(1)-multiplicative approximation can be achieved in o(n²) time [Bȃd+05]. A natural way to bypass the lower bound of [Bȃd+05] is to consider additive approximation. However, no ε-additive approximation algorithm for EMD on general metrics faster than O_ε(n²) was known prior to this work. Theorem 1 gives the first ε-additive approximation to EMD for general metrics running in n^{2−Ω_ε(1)} time, thus breaking the quadratic barrier for general metrics.
We stress that, although [AZ23] and this work prove similar results, they use completely different sets of techniques. Indeed, [AZ23] approximates the complete bipartite weighted graph induced by Euclidean distances with a (1+ε)-multiplicative spanner of size n^{2−Ω_ε(1)}. Their spanner construction is based on LSH and so it hinges on the Euclidean structure. Then, they run a near-linear-time MCF solver [Che+22a] to solve the matching problem on the metric induced by the spanner. In this work, instead, we build on sublinear algorithms for max-cardinality matching [Beh+23; Beh22; BKS23a; BKS23b; BRR23] and do not leverage any metric property, not even the triangle inequality. Section 2 contains a detailed explanation of our techniques.
It is worth noticing that since [AZ23] operates over d-dimensional Euclidean space, the input representation takes d·n space, and so their algorithm does not run in sublinear time. By contrast, our algorithm assumes query access to the distance matrix and runs in sublinear time.
In this work we focus on a different access model: we make no assumption on the ground metric and we assume query access to the distance matrix. This model is natural whenever the underlying metric is expensive to evaluate. For example, [ALT21] considers EMD over a shortest-path ground metric and experiments with heuristics to avoid computing all-pairs distances, which would be prohibitively expensive.
Comparison with MST. Minimum Spanning Tree (MST) and EMD are two of the most studied optimization problems in metric spaces. It is interesting to observe a separation between the sublinear-time complexity of MST and EMD for general metrics. Indeed, [CS09] shows an Õ_ε(n)-time algorithm approximating the cost of the MST up to a factor 1+ε, whereas no O(1)-approximation for EMD can be computed in o(n²) time [Bȃd+05]. Essentially, this is because the MST cost is a more robust objective than EMD: increasing a single entry of the distance matrix can increase the EMD arbitrarily, whereas for MST this cannot happen because of the triangle inequality.
A valuable take-home message from this work is that allowing additive approximation makes EMD more robust. A natural question is whether we can find an ε-additive approximation to EMD in Õ_ε(n) time, thus matching the above result on the MST cost. The Ω(n^{1.2}) lower bound on max-cardinality matching from [BRR23] suggests that this should not be possible. Indeed, we can reduce max-cardinality matching to EMD by embedding the bipartite graph into a (1, 2) metric space.

Technical Overview
Computing the Earth Mover's Distance between two sets of n points in a metric space can be achieved by solving Min-Weight Perfect Matching (MWPM) on the complete bipartite graph where edge costs are given by the metric d(·, ·). Here we seek a suitable notion of approximation for MWPM that recovers Theorem 1.
Min-weight perfect matching with outliers. Consider the following problem: given a constant γ > 0, find a matching M of size (1−γ)n in a bipartite graph such that the cost of M is at most the minimum cost of a perfect matching. A natural interpretation of this problem is to label a γ-fraction of vertices as outliers and leave them unmatched; so we dub this problem MWPM with outliers.
Assuming d(·, ·) ∈ [0, 1], solving MWPM with a γ-fraction of outliers immediately yields a γ-additive approximation to EMD, proving Theorem 1.
The main technical contribution of this work is the following theorem, which gives an algorithm that solves MWPM with outliers in sublinear time. For the sake of this overview, the reader should instantiate Theorem 3 with β = 1 and think of γ = 1 − α as the fraction of allowed outliers.
Theorem 3. For all constants 0 ≤ α < β ≤ 1 there exist a constant ε > 0 and an algorithm running in time O(n^{2−ε}) with the following guarantees.
The algorithm has adjacency-matrix access to an undirected, bipartite graph G = (V₀ ∪ V₁, E) and random access to the edge-cost function c : E → R₊. The algorithm returns ĉ such that, whp, c(M_α) ≤ ĉ ≤ c(M_β), where M_α is a minimum-weight matching of size αn and M_β is a minimum-weight matching of size βn.
Moreover, the algorithm returns a matching oracle data structure that, given a vertex u, returns in n^{1+f(ε)} time an edge (u, v) ∈ M̂, or ⊥ if u ∉ V(M̂), where f(ε) → 0 as ε → 0. The matching M̂ satisfies αn ≤ |M̂| ≤ βn and c(M_α) ≤ c(M̂) ≤ c(M_β).
Notice that the algorithm in Theorem 3 does not output the matching M̂ explicitly. However, it returns a matching oracle data structure which implicitly stores M̂. The rest of this overview sketches the proof of Theorem 3.
Our algorithm, in a nutshell. A powerful set of techniques was recently developed to approximate the size of a max-cardinality matching in sublinear time [Beh+23; Beh22; BKS23a; BKS23b; BRR23]. Our main contribution is a sublinear-time algorithm which leverages these techniques to implement (a certain step of) the classic Gabow-Tarjan algorithm [GT89] for MWPM. Since the techniques above return approximate solutions, the obtained matching is approximate as well, in the sense that we have to disregard a fraction of outliers when computing its cost to recover a meaningful guarantee. Careful thought is required to relax the definitions of certain objects in the Gabow-Tarjan algorithm so as to accommodate their computation in sublinear time. The bulk of our analysis is devoted to proving that these relaxations combine well and lead to the guarantee in Theorem 3.
Roadmap. First, we review (a certain step of) the Gabow-Tarjan algorithm, which we use as a template algorithm to be implemented in sublinear time. Then, we review some recent sublinear algorithms for max-cardinality matching. Finally, we sketch how to combine these tools to approximate the value of a minimum-weight matching.

A Template Algorithm
The original Gabow-Tarjan algorithm operates on several scales and this makes it (slightly) more involved. We focus here on a simpler case where all edge weights are integers in [1, C], for C = O(1). We will see in Section 6 that we can reduce to this case (incurring a small additive error). Here we describe our template algorithm at a high level.
A linear program for MWPM. First, recall the linear program for MWPM together with its dual. Here we consider a bipartite graph G(V = V₀ ∪ V₁, E) and cost function c(·, ·) ∈ [1, C]. We can interpret the LP below so that x_{u,v} = 1 iff u and v are matched, whereas the primal constraints require every vertex to be matched.

    minimize    Σ_{(u,v)∈E} x_{u,v} · c(u, v)
    subject to  Σ_{v : (u,v)∈E} x_{u,v} ≥ 1    ∀u ∈ V
                x_{u,v} ≥ 0                    ∀(u, v) ∈ E

    maximize    Σ_{u∈V} φ_u
    subject to  φ_u + φ_v ≤ c(u, v)            ∀(u, v) ∈ E
                φ_u ≥ 0                        ∀u ∈ V.

A high-level description. Essentially, our template algorithm is a primal-dual algorithm which (implicitly) maintains a pair (M, φ), where M is a partial matching (so primal infeasible), and {φ(v)}_{v∈V} is a vertex potential function, i.e., an (approximately) feasible dual solution. Moreover, for each e ∈ M the dual constraint corresponding to e is tight. In other words, the pair (M, φ) satisfies complementary slackness. The algorithm progressively grows the dual variables {φ(v)}_{v∈V} and the size of M. When M has size ≥ (1−γ)n we are done. Indeed, throwing out γn vertices (as well as their associated primal constraints), we have that (M, φ) is an (approximately) feasible primal-dual pair that satisfies complementary slackness, and thus it is (approximately) optimal.
The primal-dual algorithm. We maintain an initially empty matching M. Inspired by the dual, we define a potential function φ : V → Z and we enforce a relaxed version of the dual constraints: φ(u) + φ(v) ≤ c(u, v) + 1 for each (u, v) ∈ E. Moreover, we maintain that φ(u) + φ(v) = c(u, v) for each (u, v) ∈ M (complementary slackness). Let T be the set of edges for which the constraints above are tight. Orient the edges in T so that all edges in T∖M are oriented from V₀ to V₁ and all edges in M ⊆ T are oriented from V₁ to V₀. We denote by F the set of free (unmatched) vertices and let F₀ = F ∩ V₀ and F₁ = F ∩ V₁. We say that a path P = (v₀ → ··· → v₁) is an augmenting path if v₀ ∈ F₀, v₁ ∈ F₁, and P alternates between edges in T∖M and M. When we say that we augment M w.r.t. P we mean that we set M ← M ⊕ P. We alternate between the following two steps:
1. Find a large set of node-disjoint augmenting paths {P₁, ..., P_ℓ}. Augment M w.r.t. these paths. Decrement φ(v) by 1 for each v ∈ ⋃ᵢ Pᵢ ∩ V₁, to ensure the relaxed dual constraints are satisfied.
2. Define R as the set of vertices that are T-reachable from F₀. Increment φ(r₀) by 1 for each r₀ ∈ R ∩ V₀, and decrement φ(r₁) by 1 for each r₁ ∈ R ∩ V₁. This preserves the relaxed dual constraints and (eventually) adds some more edges to T.
Analysis sketch. It is routine to verify that steps 1 and 2 preserve the relaxed dual constraints. At any point the pair (M, φ) satisfies c(M) ≤ Σ_{v∈V₀∪V₁} φ(v) ≤ c(M′) + n for any perfect matching M′. We can content ourselves with this additive approximation; indeed, in Section 6 we will see how to charge it to the outliers. To argue that few free vertices are left after O_γ(1) iterations, notice that at iteration t we have φ|_{F₀} = t and φ|_{F₁} = 0. Computing a certain function of the potentials along the (M ⊕ M′)-augmenting paths shows that |F|·t ≤ O(n). Thus, O_γ(1) iterations are sufficient to obtain |F| ≤ γn. The arguments above are sufficient to show that our template algorithm finds an (almost) perfect matching with (almost) minimum weight. We will shove both "almost"s under the outlier carpet in Section 6.
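To make the template concrete, here is a minimal explicit sketch of the alternation of steps 1 and 2, assuming integer costs in [1, C]. This quadratic-time toy is not the paper's algorithm (whose point is precisely to implement these steps in sublinear time); a greedy DFS stands in for "a large set of node-disjoint augmenting paths", and the function name and interface are illustrative.

```python
def template_mwpm(c, rounds=12):
    """Explicit quadratic-time sketch of the one-scale primal-dual template.

    c is an n-by-n list of positive integer costs.  We maintain a matching
    and potentials phi with phi(u) + phi(v) <= c(u, v) + 1 on every edge
    and phi(u) + phi(v) = c(u, v) on matched edges.  Each round: (step 1)
    greedily find node-disjoint augmenting paths in the graph of tight
    edges and augment along them; (step 2) raise/lower the potentials of
    the vertices reachable from the free V0 vertices via tight edges.
    Returns (mate0, phi0, phi1), where mate0[u] is u's mate in V1 or None.
    """
    n = len(c)
    phi0, phi1 = [0] * n, [0] * n
    mate0, mate1 = [None] * n, [None] * n

    def tight_unmatched(u, v):          # tight edge, oriented V0 -> V1
        return mate0[u] != v and phi0[u] + phi1[v] == c[u][v] + 1

    def tight_matched(u, v):            # tight edge, oriented V1 -> V0
        return mate0[u] == v and phi0[u] + phi1[v] == c[u][v]

    for _ in range(rounds):
        # Step 1: greedily collect node-disjoint augmenting paths.
        used1 = [False] * n

        def dfs(u):                     # augmenting path from u, or None
            for v in range(n):
                if not used1[v] and tight_unmatched(u, v):
                    used1[v] = True
                    if mate1[v] is None:
                        return [(u, v)]
                    if tight_matched(mate1[v], v):
                        tail = dfs(mate1[v])
                        if tail is not None:
                            return [(u, v)] + tail
            return None

        for s in range(n):
            if mate0[s] is None:
                path = dfs(s)
                if path:
                    for u, v in path:   # flip the path into the matching
                        mate0[u], mate1[v] = v, u
                        phi1[v] -= 1    # keep matched edges tight at c(u, v)

        # Step 2: BFS the tight graph from free V0 vertices, then shift phi.
        R0 = {u for u in range(n) if mate0[u] is None}
        R1, frontier = set(), set(R0)
        while frontier:
            new = set()
            for u in frontier:
                for v in range(n):
                    if v not in R1 and tight_unmatched(u, v):
                        R1.add(v)
                        if mate1[v] is not None and mate1[v] not in R0:
                            R0.add(mate1[v])
                            new.add(mate1[v])
            frontier = new
        for u in R0:
            phi0[u] += 1
        for v in R1:
            phi1[v] -= 1
    return mate0, phi0, phi1
```

On small instances the returned potentials certify (near-)optimality: the sum of the potentials sandwiches the matching cost as in the analysis sketch above.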

Implementing the Template in Sublinear Time
Our sublinear-time implementation of the template algorithm hinges on matching oracles.
Matching oracles. Given a matching M′, we define a matching oracle for M′ as a data structure that, given u ∈ V, returns v ∈ V if (u, v) ∈ M′ and ⊥ otherwise. Note that, given a matching oracle for M′, if we are promised that |M′| = Ω(n) then O_γ(log n) calls to the oracle are enough to estimate |M′| ± γn. We stress that all matching oracles that we use have sublinear query time.
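The size-estimation claim can be illustrated with a short sketch, assuming left-side vertices 0..n−1 and an oracle that returns a mate or None (the function name and sample count are illustrative, not from the paper):

```python
import math
import random

def estimate_matching_size(match_oracle, n, eps, seed=0):
    """Estimate |M| up to an additive eps * n using O(eps^-2 * log n)
    oracle calls: sample left-side vertices uniformly and count how many
    are matched.  match_oracle(u) returns u's mate or None."""
    rng = random.Random(seed)
    trials = max(1, int(4 * math.log(max(n, 2)) / eps ** 2))
    hits = sum(match_oracle(rng.randrange(n)) is not None
               for _ in range(trials))
    return n * hits / trials
```

By a Chernoff bound, the sampled fraction of matched vertices concentrates within ±eps of |M|/n with high probability.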
Finding large matchings in sublinear time. An important ingredient of our algorithm is the LargeMatching(G, A, ε, δ) subroutine (Theorem 5), which is due to [BKS23a]. Given A ⊆ V, LargeMatching(G, A, ε, δ) returns either ⊥ or a matching oracle for some matching M′ in G[A]. If there exists a matching in G[A] of size δn, then LargeMatching returns a matching oracle for some M′ in G[A] with |M′| = Ω_δ(n). Otherwise, if there is no matching of size δn in G[A], LargeMatching returns ⊥. The parameter ε controls the running time and essentially guarantees that LargeMatching runs in O(n^{2−ε}) time while the matching oracle it outputs runs in O(n^{1+ε}) time.
We will use LargeMatching to implement both step 1 and step 2 of the template algorithm. However, this requires us to relax our notions of a maximal set of node-disjoint augmenting paths, as well as that of reachability. A major technical contribution of this work is to find the right relaxation of these notions so that:
1) We can analyze a variant of the template algorithm working with these relaxed objects and still recover a solution which is optimal if we neglect a γ-fraction of outliers.
2) We can compute these relaxed objects in sublinear time using LargeMatching as well as previously constructed matching oracles.
These relaxed notions are introduced in Section 3; point (1) is proven in Section 4 and point (2) is proven in Section 5.
Implementing step 1 in sublinear time. In [BKS23a] the authors implement McGregor's algorithm [McG05] for streaming Max-Cardinality Matching (MCM) in a sublinear fashion using LargeMatching (see Theorem 6 in this work). McGregor's algorithm finds a size-Ω(n) set of node-disjoint augmenting paths of fixed constant length, whenever there are at least Ω(n) of them. This notion is weaker than the maximal set of node-disjoint augmenting paths required in step 1 of our template algorithm in two regards: first, it only finds augmenting paths of fixed constant length; second, it finds only a constant fraction of such paths (as long as there is a linear number of them).
In our template algorithm, the invariant φ|_{F₁} = 0 is maintained (in step 2) because R ∩ F₁ = ∅. In turn, R ∩ F₁ = ∅ holds exactly because in step 1 we augment M along a maximal set of node-disjoint augmenting paths. Since our sublinear implementation of step 1 misses some augmenting paths, the updates performed in step 2 may violate the invariant φ(v) = 0 for some v ∈ F₁.
A careful implementation of step 2 (see the next paragraph) guarantees that only missed augmenting paths that are short lead to a violation of φ|_{F₁} = 0. Moreover, by repeatedly running the sublinear implementation of McGregor's algorithm from [BKS23a], we ensure that we miss at most γn short paths, for γ arbitrarily small. Thus, we can flag all vertices that belong to missed short augmenting paths as outliers, since there is only a small fraction of them.
Implementing step 2 in sublinear time. We implement an approximate version of the reachability query in step 2 as follows. We initialize the set of reachable vertices R as R ← F₀. Then, for a constant number of iterations: we compute a large matching M′ ⊆ T∖M between the vertices of R ∩ V₀ and V₁∖R; then we add to R all matched vertices of M′ as well as their M-mates, namely mate_M(u) for each matched vertex u. Notice that if an Ω(n)-size matching ⊆ T∖M between R ∩ V₀ and V₁∖R exists, then we find a matching ⊆ T∖M between R ∩ V₀ and V₁∖R of size Ω(n). This ensures that: (i) after a constant number of iterations LargeMatching returns ⊥; (ii) when LargeMatching returns ⊥ there exists a vertex cover C of ((R ∩ V₀) × (V₁∖R)) ∩ (T∖M) of size γn. Only constraints corresponding to edges incident to C might be violated during step 2. Furthermore, |C| = γn is small, and so we can simply label the vertices in C as outliers.
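A toy sketch of this iterative growth, with a greedy matching standing in for LargeMatching (names and data layout are hypothetical; the real subroutine returns matching oracles, not explicit matchings):

```python
def greedy_matching(edges, left_allowed, right_allowed):
    """Greedy stand-in for the LargeMatching subroutine: the real routine
    returns a matching oracle in sublinear time, while this toy returns
    the matching explicitly."""
    used_l, used_r, matching = set(), set(), []
    for u, v in edges:
        if (u in left_allowed and v in right_allowed
                and u not in used_l and v not in used_r):
            used_l.add(u)
            used_r.add(v)
            matching.append((u, v))
    return matching

def grow_reachable(free0, v1, tight_nonmatching_edges, mate, iterations=10):
    """Approximate reachability for step 2: repeatedly match R over V0
    against the unreached part of V1 along tight non-matching edges,
    adding the matched right endpoints and their M-mates to R."""
    R0, R1 = set(free0), set()
    for _ in range(iterations):
        found = greedy_matching(tight_nonmatching_edges, R0, set(v1) - R1)
        if not found:
            break
        for _, v in found:
            R1.add(v)
            if v in mate:               # pull in the M-mate of v, if any
                R0.add(mate[v])
    return R0, R1
```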
As we pointed out in the previous paragraph, the invariant φ|_{F₁} = 0 might be violated in step 2 if R ∩ F₁ ≠ ∅. We already showed that whenever the missed augmenting path causing the violation of φ|_{F₁} = 0 is short, we can charge this violation to a small set of outliers. To make sure that no long augmenting path leads to a violation of φ|_{F₁} = 0, we set our parameters so that the depth of the reachability tree built in step 2 is smaller than the length of "long" paths. Thus, any long path escapes R and cannot cause a violation.
Everything is an oracle. The implementation of both step 1 and step 2 operates on the graph T of tight constraints. To evaluate whether (u, v) ∈ T, we need to compute φ(u) and φ(v). In turn, the potential values depend on previous iterations of the algorithm. None of these iterations outputs an explicit description of the objects described in the template (potentials, matchings, augmenting paths, or sets of reachable vertices). Instead, these objects are output as oracle data structures, whose internals call (possibly multiple) matching oracles output by LargeMatching. We prove that essentially all these oracles have query time O(n^{1+ε}) for some small ε > 0. A careful analysis is required to show that we can build the oracles at iteration i+1 using the oracles at iteration i without blowing up their complexity.
Paper organization. In Section 3 we define some fundamental objects that we use throughout the paper. In Section 4 we present a template algorithm to be implemented in sublinear time, and prove its correctness. In Section 5 we implement the template algorithm in sublinear time. In Section 6 we put everything together and prove the main theorems stated in the introduction.

Preliminaries
We use the notation [a, b] := {a, ..., b − 1}, [b] = [0, b], and (a ± b) := [a − b, a + b], with the convention that c·(a ± b) = (ac ± bc). We denote our undirected bipartite graph by G = (V, E), where the bipartition is given by V = V₀ ∪ V₁. Our original graph is complete, and for each (u, v) ∈ V₀ × V₁ we denote by c(u, v) the cost of the edge (u, v). We stress that none of our algorithms requires c(·, ·) to be a metric. Given a matching M, we denote its combined cost by c(M). For each u ∈ V we say that u = mate_M(v) iff (u, v) ∈ M. When the matching M is clear from the context, we denote by F the set of unmatched (or free) vertices, and set F_i := F ∩ V_i for i = 0, 1.
When we say that an algorithm runs in time t, we mean that both its computational complexity and its number of queries to the cost matrix c(·, ·) are bounded by t. The computational complexity of our algorithms is always (asymptotically) equivalent to their query complexity, so we only analyse the latter. All the guarantees in this work hold with high probability.
Definition 3.1 (Augmenting paths). Given a matching M over G = (V, E), we say that P = (v₀, v₁, ..., v_{2ℓ+1}) is an augmenting path w.r.t. M if (v_{2i}, v_{2i+1}) ∈ E∖M for each i = 0, ..., ℓ and (v_{2j+1}, v_{2j+2}) ∈ M for each j = 0, ..., ℓ−1. When we say that we augment M w.r.t. P, we mean that we set M ← M ⊕ P, where ⊕ is the exclusive or (symmetric difference) on edge sets.
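Augmentation via M ⊕ P takes only a few lines; a sketch with edges stored as frozensets, so that edge orientation is irrelevant:

```python
def augment(M, P):
    """Augment matching M along path P = [v0, v1, ..., v_{2l+1}]:
    returns M ^ P, the symmetric difference on edge sets.  Edges on the
    path that were matched become unmatched and vice versa, so the
    matching grows by exactly one edge."""
    path_edges = {frozenset(e) for e in zip(P, P[1:])}
    return M ^ path_edges
```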
We use the same notion of 1-feasible potential as in [GT89].
Definition 3.2 (1-feasibility conditions). Given a potential φ : V → Z, we say that it satisfies the 1-feasibility conditions with respect to a matching M if the following hold.
(i) For each (u, v) ∈ E, φ(u) + φ(v) ≤ c(u, v) + 1.
(ii) For each (u, v) ∈ M, φ(u) + φ(v) = c(u, v).
Definition 3.3 (Eligibility Graph). We say that an edge (u, v) is eligible w.r.t. M if: (u, v) ∉ M and φ(u) + φ(v) = c(u, v) + 1, or; (u, v) ∈ M and φ(u) + φ(v) = c(u, v). We define the eligibility graph as the directed graph G_E = (V, E_E) that orients the eligible edges so that, for each eligible (u, v) ∈ V₀ × V₁, we have (u, v) ∈ E_E if (u, v) ∉ M and (v, u) ∈ E_E if (u, v) ∈ M.
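Definition 3.3 translates directly into code. In this sketch, edges, costs, and potentials are stored explicitly as dictionaries, whereas the paper accesses them only through oracles:

```python
def eligibility_graph(edges, c, phi, M):
    """Directed eligibility graph of Definition 3.3.

    edges: list of (u, v) with u in V0, v in V1; c: dict of edge costs;
    phi: dict of potentials; M: set of matched (u, v) pairs.
    A non-matching edge is eligible (oriented V0 -> V1) when
    phi(u) + phi(v) = c(u, v) + 1; a matching edge is eligible
    (oriented V1 -> V0) when phi(u) + phi(v) = c(u, v)."""
    arcs = []
    for u, v in edges:
        s = phi[u] + phi[v]
        if (u, v) in M and s == c[(u, v)]:
            arcs.append((v, u))
        elif (u, v) not in M and s == c[(u, v)] + 1:
            arcs.append((u, v))
    return arcs
```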
Notice that, whenever a potential is 1-feasible w.r.t. M, all edges in M are eligible.
Definition 3.4 (Forward Graph). We define the forward graph G_F = (V, E_F) as the subgraph of the eligibility graph containing only edges from V₀ to V₁. That is, we remove all edges (v, u) such that (u, v) ∈ M.
Now, we introduce two rather technical definitions, which provide us with approximate versions of the notions of "maximal set of node-disjoint augmenting paths" and "maximal forest".
Definition 3.5 ((k, ξ)-Quasi-Maximal Set of Node-Disjoint Augmenting Paths). Given a graph G = (V, E) and a matching M ⊆ E, we say that a set P of node-disjoint augmenting paths of length at most k is a (k, ξ)-QMSNDAP if for any Q such that Q ∪ P is a set of node-disjoint augmenting paths of length ≤ k we have |Q| ≤ ξn.
Intuitively, P is a (k, ξ)-QMSNDAP if we can add only a few more node-disjoint augmenting paths of length ≤ k to P before it becomes maximal. Next, we introduce an approximate notion of a "maximal forest" F in the eligibility graph G_E rooted in F₀. F is obtained starting from the vertices in F₀ and adding edges (in a way that we will specify later) so as to preserve that F has |F₀| connected components and no cycles. This construction will ensure that the connected components of our forest have small diameter and small size. We maintain that whenever v ∈ V₁ is added to F, then mate_M(v) is also added to F. F is approximately maximal in the sense that the cut (F, V∖F) in G_E admits a small vertex cover.
Definition 3.6 ((k, δ)-Quasi-Maximal Forest). Given the eligibility graph G_E = (V, E_E) w.r.t. the matching M, and the set of vertices F₀ ⊆ V₀, we say that F is a (k, δ)-QMF rooted in F₀ if:
1. F is a forest with |F₀| connected components, each containing exactly one vertex of F₀.
2. Whenever v ∈ V₁ belongs to F, so does mate_M(v).
3. For each u ∈ F there exists v ∈ F₀ at hop distance at most k from u.
4. Every connected component of F has size at most 2^k.
5. The edge set E_E ∩ (F × V∖F) has a vertex cover of size at most δn.
Now, we introduce a few results from past work on sublinear-time maximum-cardinality matching. The following theorem, which is the main technical contribution of [BKS23a], states that we can compute an εn-additive approximation of the size of a maximum-cardinality matching in strongly sublinear time.
Theorem 4 (Theorem 1.3, [BKS23a]). There is a randomized algorithm that, given the adjacency matrix of a graph G, in time n^{2−Ω_ε(1)} computes with high probability a (1, εn)-approximation μ̂ of µ(G). After that, given a vertex v, the algorithm returns in n^{1+f(ε)} time an edge (v, v′) ∈ M, or ⊥ if v ∉ V(M), where M is a fixed (1, εn)-approximate matching and f is an increasing function such that f(ε) → 0 as ε → 0.
The algorithm in Theorem 4 does not exactly output a matching, but rather a matching oracle. Namely, it outputs a data structure that stores a matching M implicitly. We formalize the notion of matching oracle below.
Definition 3.7 (Matching Oracle). Given a matching M, we define the matching oracle match_M(·) as a data structure such that match_M(u) = v if (u, v) ∈ M and match_M(u) = ⊥ otherwise. Throughout the paper we denote by t_M the time complexity of match_M(·).
Similarly to matching oracles, we make use of membership oracles mem_A(·) and potential oracles eval_φ(·), where A ⊆ V and φ is a potential function defined on V. As expected, mem_A(u) returns whether u ∈ A, and eval_φ(u) returns φ(u).
Theorem 5 (Essentially Theorem 4.1, [BKS23a]). Let G = (V, E) be a graph and A ⊆ V a vertex set. Suppose that we have access to the adjacency matrix of G and an A-membership oracle mem_A with t_A query time. We are given as input a sufficiently small ε > 0 and δ_in > 0.
There exists an algorithm LargeMatching(G, A, ε, δ_in) that preprocesses G in Õ_{δ_in}((t_A + n)·n^{1−ε}) time and either returns ⊥ or constructs a matching oracle match_M(·), with Õ_{δ_in}((t_A + n)·n^{4ε}) worst-case query time, for a matching M ⊆ G[A] of size at least δ_out·n, where δ_out = δ_in⁵/2000. If µ(G[A]) ≥ δ_in·n, then ⊥ is not returned. The guarantee holds with high probability.
Theorem 6 roughly says that, in sublinear time, we can increase the size of our current matching (oracle) by Ω(n) whenever there are Ω(n) short augmenting paths.
Theorem 6 (Essentially Theorem 5.2, [BKS23a]). Fix two constants k, γ > 0. For any sufficiently small ε_in > 0, there exists ε_out = Θ_{k,γ}(ε_in) such that the following holds. There exists an algorithm Augment(G, M_in, k, γ, ε_in) that makes O_{k,γ}(1) calls to LargeMatching, which take Õ_{k,γ}(n^{2−ε_in}) time in total. Further, either it returns an oracle match_{M_out}(·) with query time Õ_{k,γ}(n^{1+ε_out}) for some matching M_out in G of size |M_out| ≥ |M_in| + Θ_{k,γ}(1)·n (we say that it "succeeds" in this case), or it returns Failure. Finally, if the matching M_in admits a collection of γ·n node-disjoint length-(2k+1) augmenting paths in G, then the algorithm succeeds whp.
Theorem 6 differs from Theorem 5.2 in [BKS23a] in that it specifies that the only way Augment accesses the graph is through LargeMatching. We will use this property crucially to prove Lemma 5.2.

A Template Algorithm
In this section we study min-weight matching with small integral costs c(·, ·) ∈ [1, C], where C is a constant. We will see how to lift this restriction in Section 6. Algorithm 1 gives a template algorithm realising Theorem 3 that assumes we can implement certain subroutines; in Section 5 we will see how to implement these subroutines in sublinear time.
Comparison with Gabow-Tarjan. Intuitively, our template algorithm implements the Gabow-Tarjan algorithm [GT89] for a fixed scale in an approximate fashion. Indeed, instead of finding a maximal set of node-disjoint augmenting paths we find a (k, ξ)-QMSNDAP, and instead of growing a forest in the eligibility graph we grow a (k, δ)-QMF. See Figure 1 for a representation of step 1 and step 2.
Analysis.Here we analyse Algorithm 1 and show that it satisfies the following theorem.
Theorem 7. Fix a constant γ > 0. Suppose that we have adjacency-matrix access to the bipartite graph G = (V₀ ∪ V₁, E) and random access to the cost function c : E → [1, C], with C = O(1). Then, with high probability, Algorithm 1 returns ĉ such that c(M_{1−γ}) ≤ ĉ ≤ c(M_OPT), where M_{1−γ} is a min-weight matching of size (1−γ)n and M_OPT is a min-weight matching of size n.
To prove Theorem 7, we need a series of technical lemmas.
For each e ∈ E, update c(e) ← c(e)/γ (this is implemented lazily). Execute the following two steps for T iterations:
• Step 1. Find a (k, ξ)-QMSNDAP P in the eligibility graph G_E. Augment M w.r.t. the paths in P.
• Step 2. Grow a (k, δ)-QMF F rooted in F₀ in the eligibility graph and update the potentials of the vertices in F.
Sample a set S of O_{γ,C}(log n) edges of M with replacement. Discard the 3γ|S| edges with the highest costs and let Σ be the sum of the costs of the remaining edges.
Figure 1: We color the edges of M in red and the edges of T∖M in blue. On the left we have an example of step 1: solid edges represent paths in the QMSNDAP P along which we augment M in step 1. On the right we have an example of step 2: all vertices colored or circled in green belong to the QMF F. Circles help us visualize the implementation of step 2, described in Section 5. In Algorithm 3, F is built sequentially, where each iteration (lines 1-5) adds some edges to F. At first, only the non-circled green vertices belong to F. The first step adds the green-circled black edges, and the second step adds the green-circled green edges.
Proof Roadmap. The proof of Theorem 7 goes as follows. We prove that, after T iterations, all free vertices in F_0 have potential T. On the other hand, the majority of free vertices in F_1 have potential 0. We call spurious the free vertices in F_1 with non-zero potential, and we show that there are only few of them. Then (roughly) we look at the final matching M generated by Algorithm 1 and a perfect matching M′, and consider the graph G′ having M ⊕ M′ as its edge set. G′ can be partitioned into cycles and augmenting paths. Each augmenting path starts at a free vertex in F_0 and ends at a free vertex in F_1. If the 1-feasibility conditions are satisfied by all edges, then computing a certain function of the potentials along an augmenting path, and combining the results over all augmenting paths, yields an upper bound on the total number of free vertices. Unfortunately, not all edges satisfy the 1-feasibility constraints. We fix this by finding a small vertex cover of the 1-unfeasible edges; we call such a cover a suitable set of broken vertices. Ignoring spurious and broken vertices is sufficient to make our argument work.
Lemma 4.1. After t ∈ [T + 1] iterations we have φ(u) = t for each u ∈ F_0.
Proof. After t = 0 iterations we have φ(u) = 0 for each u ∈ V. First, we notice that the set of unmatched (or free) vertices F only shrinks over time, and so does F_0. Moreover, at each iteration we increase the potential of each free vertex in F_0 by 1, so the claim follows by induction on t.
Define the set S of vertices v ∈ F_1 such that φ(v) ≠ 0 as the set of spurious vertices.
Lemma 4.2.After T iterations we have at most γn spurious vertices.
Proof. We prove that at each iteration the number of spurious vertices increases by at most γn/T. A vertex cannot become spurious in Step 1: indeed, in Step 1 we only decrease the potentials of matched vertices. If a vertex v ∈ F_1 becomes spurious in Step 2, it means that there exists an augmenting path P from some u ∈ F_0 to v contained in a connected component of ℱ. Let 𝒬 be such that 𝒬 ∪ 𝒫 is a maximal set of node-disjoint augmenting paths of length ≤ k. By Definition 3.5 we have |𝒬| ≤ ξn. Define the set of forgotten vertices as ⋃_{Q ∈ 𝒬} Q. Thanks to Item 3 in Definition 3.6, the path from u to v has length ≤ k, thus P has length at most k. Recall that P is an augmenting path w.r.t. the matching obtained by augmenting M along 𝒫 at the end of Step 1. Therefore, P intersects a path in 𝒬 ∪ 𝒫.
We now argue that P cannot intersect any P′ ∈ 𝒫. Suppose by contradiction that it does. Let P = (P_0 … P_ℓ) and P′ = (P′_0 … P′_{ℓ′}). Let P_s be the first (w.r.t. the order induced by P) node where P and P′ intersect. We first rule out the case that s is even: for s = 0, P_0 = u ∈ F_0 implies that u did not belong to an augmenting path P′ in Step 1. Moreover, for s = 2i > 0, if P_{2i} = P′_j then P_{2i−1} = mate_M(P_{2i}) ∈ {P′_{j−1}, P′_{j+1}}, where M is the matching obtained at the end of Step 1; this contradicts the choice of P_s as the first intersection. Now suppose that s is odd, and hence P_s ∈ V_1 ∩ P′. Then φ(P_s) was decreased by 1 at the end of Step 1, hence no edge outside of M incident to P_s is eligible in Step 2.
Thus, P must intersect a path in 𝒬. On the other hand, ⋃_{Q ∈ 𝒬} Q contains at most kξn vertices, so at most kξn connected components of ℱ contain a forgotten vertex. Moreover, by Item 4 of Definition 3.6 every connected component of ℱ has size at most 2^k, thus at most kξn·2^k = γn/T vertices become spurious.
We say that B ⊆ V is a suitable set of broken vertices if all (u, v) ∈ (V_0 ∖ B) × (V_1 ∖ B) are 1-feasible.
Lemma 4.3.After T iterations, there exists a suitable set of broken vertices of size at most γn.
Proof. First, we prove that every edge (u, v) ∈ V_0 × V_1 which is 1-feasible at the beginning of Step 1 is also 1-feasible at the end of Step 1. Suppose that (u, v) becomes 1-unfeasible in Step 1. Let M and M′ be the matchings at the beginning and at the end of Step 1, respectively. Potentials only decrease in Step 1, so in order for (u, v) to become 1-unfeasible w.r.t. M′ we must have (u, v) ∈ M′. Moreover, we decrease the potential of v only if (u, v) ∈ P for some augmenting path P. Thus, at the beginning of Step 1 we had φ(u) + φ(v) = c(u, v) + 1, which implies φ(u) + φ(v) = c(u, v) at the end of Step 1; thus (u, v) is 1-feasible w.r.t. M′, a contradiction. Now, we grow a suitable set of broken vertices B. We initialize B = ∅ and show that each iteration of Step 2 increases the size of B by at most γn/T. If (u, v) ∈ V_0 × V_1 is 1-feasible at the beginning of Step 2 and becomes 1-unfeasible in Step 2, then we must have u ∈ ℱ and v ∉ ℱ. Indeed, by Item 2 in Definition 3.6, if (u, v) ∈ M′ then either both u and v belong to ℱ or neither of them does; this ensures that the sum of their potentials is unchanged. Else, if (u, v) ∉ M′, then in order for it to violate 1-feasibility we must increase φ(u) by one and not decrease φ(v), and this happens only if u ∈ ℱ and v ∉ ℱ. Item 5 in Definition 3.6 ensures that there exists a vertex cover U ⊆ V of the set of new 1-unfeasible edges with |U| ≤ δn = γn/T. We update B ← B ∪ U. Thus, after T iterations we have |B| ≤ γn.
Denote with B_M the set of matched vertices in B and with B′_M the set of vertices currently matched to vertices in B_M; let S be the set of spurious vertices and recall that |S| ≤ γn by Lemma 4.2. Let A_0 and A_1 be the sets obtained from V_0 and V_1 by removing the vertices in B ∪ B′_M ∪ S, and notice that they have the same size. Define A = A_0 ∪ A_1 and let M′ be a perfect matching over A.
The graph G_A = (A, M ⊕ M′) contains exactly ℓ := |F_0 ∩ A_0| = |F_1 ∩ A_1| node-disjoint augmenting paths P_1, …, P_ℓ, where P_i starts in f_0^{(i)} ∈ F_0 ∩ A_0 and ends in f_1^{(i)} ∈ F_1 ∩ A_1. We define the value of a path P as
V(P) = Σ_{(u,v) ∈ M′ ∩ P} (c(u, v) + 1) − Σ_{(u,v) ∈ M ∩ P} c(u, v).
By 1-feasibility of φ we have V(P_i) ≥ φ(f_0^{(i)}) + φ(f_1^{(i)}) = T, where the last equality holds by the definition of (non-)spurious vertices and Lemma 4.1. Then, we have Cn ≥ n + c(M′) ≥ Σ_{i=1}^ℓ V(P_i) ≥ ℓT. Thus, ℓ ≤ Cn/T = γn and, since every free vertex outside A belongs to B ∪ B′_M ∪ S, we conclude |F_0| = |F_1| ≤ ℓ + 3γn ≤ 4γn.
Let φ be the potential at the end of the execution of Algorithm 1. Denote with M_ALG the final matching obtained by Algorithm 1 and with M_OPT a min-weight perfect matching. Given a matching M, we denote with M[α] the matching obtained from M by removing the αn edges with highest cost.
Lemma 4.5. We have c(M_ALG[2γ]) ≤ c(M_OPT).
Proof. Let M_ALG∖B be the matching obtained from M_ALG by removing all edges incident to vertices in B. Since |B| ≤ γn, we have c(M_ALG[γ]) ≤ c(M_ALG∖B). Notice that all edges in M_ALG∖B are 1-feasible. For each (u, v) ∈ M_ALG∖B we have c(u, v) = φ(u) + φ(v), and for each (u, v) ∈ M_OPT we have φ(u) + φ(v) ≤ c(u, v) + 1. Thus, comparing the potential sums over the two matchings, c(M_ALG[γ]) ≤ c(M_ALG∖B) ≤ c(M_OPT) + n. Now, it is sufficient to notice that, since all edges have costs in [1/γ, C/γ], removing any γn edges from M_ALG[γ] decreases its cost by at least n. Thus, c(M_ALG[2γ]) ≤ c(M_OPT).
Now, we are ready to prove Theorem 7.
Proof of Theorem 7. Thanks to Lemma 4.5, we know that c(M_ALG[2γ]) ≤ c(M_OPT). Moreover, by Lemma 4.4 we have |M_ALG| = n − |F_0| ≥ (1 − 4γ)n; thus, defining M_{1−8γ} as the min-weight matching of size (1 − 8γ)n, we have c(M_{1−8γ}) ≤ c(M_ALG[4γ]). We are left to prove that the estimate ĉ = (n/|S|)·Σ returned by Algorithm 1 satisfies c(M_ALG[4γ]) ≤ ĉ ≤ c(M_ALG[2γ]). Let S and Σ be defined as in Algorithm 1, and let w be maximum such that 3γ|S| edges in S have cost ≥ w. If α_w·n is the number of edges in M_ALG that cost ≥ w, then using standard Chernoff-bound arguments we have that, whp, |α_w − 3γ| ≤ γ²/C. From now on we condition on this event. Notice that ((1 − α_w)n / ((1 − 3γ)|S|))·Σ is an unbiased estimator of c(M_ALG[α_w]). Moreover, since all costs are in [1/γ, C/γ], O_{γ,C}(log n) samples are sufficient to have ((1 − α_w)n / ((1 − 3γ)|S|))·Σ concentrated, up to a factor (1 ± γ²/C), around c(M_ALG[α_w]). Hence, assuming that γ is sufficiently small, we have ĉ ∈ (1 ± 3γ²/C)·c(M_ALG[α_w]) ⊆ c(M_ALG[α_w]) ± 3γn, where the last containment relation holds because all costs are ≤ C/γ and so c(M_ALG[α_w]) ≤ Cn/γ. Since all costs are ≥ 1/γ, we have c(M_ALG[α_w + 3γ²]) ≤ c(M_ALG[α_w]) − 3γn and c(M_ALG[α_w − 3γ²]) ≥ c(M_ALG[α_w]) + 3γn. Thus, picking γ small enough to have [α_w − 3γ², α_w + 3γ²] ⊆ [2γ, 4γ], we have c(M_ALG[4γ]) ≤ ĉ ≤ c(M_ALG[2γ]). Therefore, we have c(M_{1−8γ}) ≤ ĉ ≤ c(M_OPT), and rescaling γ gives exactly the desired result.
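The trimmed sampling estimator at the heart of this proof can be sketched in a few lines (a sketch under a simplifying assumption: `matched_costs` is the explicit list of edge costs of M_ALG, whereas the actual algorithm only has oracle access to the matching):

```python
import random

def trimmed_cost_estimate(matched_costs, n, gamma, num_samples, rng=random):
    """Estimate the cost of the matching after (implicitly) discarding its
    most expensive edges: sample edge costs with replacement, drop the top
    3*gamma fraction of the samples, and rescale the remaining sum.
    """
    sample = [rng.choice(matched_costs) for _ in range(num_samples)]
    sample.sort()
    keep = len(sample) - int(3 * gamma * len(sample))  # discard top 3γ|S|
    sigma = sum(sample[:keep])
    return n * sigma / len(sample)  # ĉ = (n / |S|) · Σ
```

Discarding the highest-cost samples is what makes the estimate land between c(M_ALG[4γ]) and c(M_ALG[2γ]) rather than overshooting because of a few expensive edges.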
Observation 8. As in the proof of Theorem 7, define w as the maximum value such that there are at least 3γ|S| edges with cost ≥ w in S, and define α_w such that exactly α_w·n edges in M_ALG have cost ≥ w. We have, whp, |α_w − 3γ| ≤ γ²/C; thus, for γ small enough, c(M_ALG[α_w]) ≤ c(M_ALG[2γ]) ≤ c(M_OPT) and (up to rescaling γ) |M_ALG[α_w]| ≥ (1 − γ)n. Moreover, given an edge e ∈ M_ALG we can decide whether e ∈ M_ALG[α_w] simply by checking whether c(e) < w.

Implementing the Template in Sublinear Time
In this section we explain how to implement Step 1 and Step 2 from the template algorithm in sublinear time.

From Potential Oracles to Membership Oracles
Throughout this section, we would like to apply Theorem 5 and Theorem 6 to the eligibility graph G_E = (V, E_E) and the forward graph G_F = (V, E_F). However, we do not have random access to the adjacency matrices of these graphs. Indeed, to establish whether (u, v) ∈ V_0 × V_1 is eligible we need to check the condition φ(u) + φ(v) = c(u, v) + 1 (or φ(u) + φ(v) = c(u, v)). However, we will see that the potential φ(·) requires more than a single query to be evaluated. Formally, we assume that we have a potential oracle eval_φ(·) that returns the value of φ(·) in time t_φ. Whenever checking whether (u, v) is an edge of G_F (resp. G_E) requires evaluating a condition of the form φ(u) + φ(v) = c(u, v) + 1 (resp. φ(u) + φ(v) = c(u, v)), we say that we have potential oracle access to the adjacency matrix of G_F (resp. G_E) with potential oracle time t_φ. We can think of t_φ as Õ(n^{1+ε}), and we will later prove that this is (roughly) the case.
Potential functions with constant-size range. If our potential function φ : V → ℛ has range of size |ℛ| ≤ R, then we say that it is an R-potential. If the eligibility (forward) graph is induced by an R-potential for R = O(1), we can rephrase Theorem 5 and Theorem 6 to work with potential oracle access, without any asymptotic overhead. The following lemma is an analog of Theorem 5 for forward graphs.
Lemma 5.1. Let G_F = (V, E_F) be a forward graph w.r.t. the R-potential φ, and let A ⊆ V be a vertex set. Suppose we have a potential oracle eval_φ with oracle time t_φ and a membership oracle mem_A with query time t_A. We are given as input constants 0 < ε ≤ 0.2 and δ_in > 0.
There exists an algorithm LargeMatchingForward(φ, A, δ_in) that preprocesses G_F in Õ_R((t_A + t_φ + n)·n^{1−ε}) time and either returns ⊥ or constructs a matching oracle match_M(·), with Õ_R((t_A + t_φ + n)·n^{4ε}) worst-case query time, for a matching M ⊆ G_F[A] of size at least δ_out·n, where δ_out = Θ_{δ_in,R}(1). If μ(G_F[A]) ≥ δ_in·n, then ⊥ is not returned. The guarantees hold with high probability.
Proof. Without loss of generality, we assume that φ takes values in [R]. Suppose that G_F[A] has a matching of size δ_in·n. We partition the edges E_F[A] = E_F ∩ (A × A) into R² sets E_{0,0}, …, E_{R−1,R−1} such that (u, v) ∈ E_{i,j} iff φ(u) = i and φ(v) = j. Then, there exist i, j ∈ [R] such that G_{i,j} = (V, E_{i,j}) has a matching of size δ_in·n/R². Moreover, once we restrict ourselves to G_{i,j}, each edge query (u, v) ∈ E_{i,j} becomes much easier: indeed, we just need to establish whether i + j = c(u, v) + 1. In order to restrict ourselves to G_{i,j}, it suffices to set A′ = A ∩ (φ^{−1}({i}) ∪ φ^{−1}({j})). Then the membership oracle mem_{A′} runs in time O(t_A + t_φ). Hence, using Theorem 5 we can find a matching of size δ_out·n, where δ_out = δ_in^5/(2000·R^{10}). Algorithmically, we run the algorithm from Theorem 5 R² times (once for each pair (i, j)) and halt as soon as the algorithm does not return ⊥.
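The partitioning trick in this proof can be sketched as follows (a minimal sketch with hypothetical oracle signatures: `phi`, `mem_A` and `cost` stand in for eval_φ, mem_A and the cost queries; inside a class E_{i,j} the potential sum is known in advance, so an edge query reduces to a single cost query):

```python
def restricted_oracles(i, j, phi, mem_A, cost):
    """Build membership / adjacency oracles for the class E_{i,j} of edges
    whose endpoints have potentials i and j respectively.

    `phi(u)` evaluates the potential, `mem_A(u)` tests membership in A, and
    `cost(u, v)` queries the edge cost.
    """
    def mem_A_prime(u):
        # A' = A ∩ (φ⁻¹({i}) ∪ φ⁻¹({j})): one potential evaluation per query.
        return mem_A(u) and phi(u) in (i, j)

    def is_edge(u, v):
        # Within the class the potential sum is i + j, so the edge condition
        # needs no potential evaluations at all.
        return i + j == cost(u, v) + 1

    return mem_A_prime, is_edge
```

Running the Theorem 5 algorithm once per pair (i, j) then costs only an R² = O(1) multiplicative overhead.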
The following is an analog of Theorem 6 for eligibility graphs.
Lemma 5.2. Let ε_in > 0 be a sufficiently small constant. Let α_{k,γ} and β_{k,γ} be constants that depend on k and γ, and set ε_out := α_{k,γ}·ε_in. Suppose we have an R-potential oracle eval_φ with running time t_φ = Õ(n^{1+ε_in}), a matching oracle match_{M_in} with running time t_{M_in} = Õ(n^{1+ε_in}), and an eligibility graph G_E = (V, E_E) w.r.t. φ and M_in. Then there exists an algorithm AugmentEligible with the following guarantees.
Either it returns an oracle match_{M_out}(·) with query time Õ_{k,γ,R}(n^{1+ε_out}) for some matching M_out of size |M_out| ≥ |M_in| + β_{k,γ}·n (we say that it "succeeds" in this case), or it returns ⊥. Finally, if the matching M_in admits a collection of γ·n node-disjoint augmenting paths of length ≤ k in G_E, then the algorithm succeeds whp.
Proof. We derive Lemma 5.2 by combining Theorem 6 and Lemma 5.1. First, we notice that Theorem 6 says that the algorithm succeeds (whp) whenever there are γ′n node-disjoint augmenting paths (NDAP) of length exactly 2k′+1, while Lemma 5.2 has the weaker requirement that there are at least γn NDAP of length ≤ k. A simple reduction is obtained by invoking Theorem 6 with γ′ = γ/k for all k′ such that 2k′+1 ≤ k (notice that all augmenting paths have odd length). In this way, if there exists a collection of γn NDAP of length ≤ k, then there exists a k′ ≤ (k−1)/2 such that we have γ′n NDAP of length exactly 2k′+1. All guarantees are preserved since we consider both γ and k constants. Now, we are left to address the fact that we do not have random access to the adjacency matrix of G_E, but rather potential oracle access. We notice that, according to Theorem 6, the implementation of Augment from [BKS23a] never makes any query to the adjacency matrix besides those performed inside LargeMatching. Moreover, Lemma 5.1 implies that LargeMatchingForward is not asymptotically slower than LargeMatching as long as R = O(1).
Finally, we observe that in Algorithm 1 each potential is increased (or decreased) at most T = O_{C,γ}(1) times. Hence, φ is an R-potential for R = 2T + 1 = O_{C,γ}(1). Thus, we can treat R as a constant when applying Lemma 5.1 or Lemma 5.2.

Implementing Step 1
In this subsection we implement Step 1 of the template algorithm in sublinear time. Here we assume that we have at our disposal a potential oracle eval_{φ_in} running in time t_{φ_in} = Õ(n^{1+ε_in}) and a matching oracle match_{M_in} with running time t_{M_in} = Õ(n^{1+ε_in}). We will output a potential oracle eval_{φ_out} running in time t_{φ_out} = Õ(n^{1+ε_out}) and a matching oracle match_{M_out} with running time t_{M_out} = Õ(n^{1+ε_out}). We show that there exists a (k, ξ)-QMSNDAP 𝒜 such that: the matching M_out is obtained from M_in by augmenting it along all paths in 𝒜; φ_out is obtained from φ_in by subtracting 1 from φ_in(v) for each v ∈ V_1 covered by a path in 𝒜.
Algorithm 2 Implementation of Step 1.
Set k and γ as in Algorithm 1. Initialize M ← M_in and ε ← ε_in.
Repeat until AugmentEligible(G_E, M, k, γ, ε) returns ⊥: let match_{M′} be the oracle it returns; set M ← M′ and ε ← α_{k,γ}·ε.
Set M_out ← M and ε_out ← ε.
Analysis. First, we observe that the algorithm above correctly implements the template with high probability (all our statements henceforth hold whp). Initialize 𝒜 ← ∅. For each run of AugmentEligible(G, M, k, γ, ε), we decompose M ⊕ M′ into a set of augmenting paths 𝒫 and a set of alternating cycles 𝒞, and we set 𝒜 ← 𝒜 ∪ 𝒫. When AugmentEligible(G, M, k, γ, ε) returns ⊥, it means (by Lemma 5.2) that there are at most γn node-disjoint augmenting paths of length ≤ k that do not intersect ⋃𝒜. Hence, 𝒜 is a (k, ξ)-QMSNDAP. Clearly, match_{M_out} implements the matching obtained from M_in by augmenting along the paths in 𝒜.
To see that the implementation of eval_{φ_out}(u) is correct, it suffices to notice that in the template algorithm we decrement φ(u) iff: (i) u ∈ V_1, and (ii) there exists an augmenting path P ∈ 𝒜 intersecting u. Since every node belongs to at most one path in 𝒜, if (ii) holds then u is matched in M_out and (u, v) ∈ M_out is an M_in-eligible edge, where v = match_{M_out}(u). Thus, (ii) is equivalent to: (iii) v = match_{M_out}(u) satisfies φ_in(v) + φ_in(u) = c(u, v) + 1. Finally, we bound ε_out as a function of ε_in.
Lemma 5.3. Step 1 can be implemented in Õ(n^{2−ε}) time for some constant ε > 0. Moreover, the oracle match_{M_out} has running time t_{M_out} and the oracle eval_{φ_out} has running time t_{φ_out} such that t_{M_out}, t_{φ_out} = Õ(n^{1+ε_out}) and ε_out = O_{γ,k}(ε_in).
Proof. Let α_{k,γ} and β_{k,γ} be as in Lemma 5.2. Algorithm 2 runs AugmentEligible at most 1/β_{k,γ} + 1 = O_{k,γ}(1) times, because the matching grows by β_{k,γ}·n after each successful run of AugmentEligible; thus, there can be at most 1/β_{k,γ} successful runs. It is apparent that, by Lemma 5.2, Step 1 can be implemented in Õ_{k,γ}(n^{2−ε}) time. Now we prove the bound on the oracle times. First, we observe that t_{φ_out} = O(t_{M_out} + t_{φ_in}) = Õ(n^{1+ε_out}). Moreover, at every run we have ε′ ≤ α_{k,γ}·ε; hence ε_out ≤ α_{k,γ}^{1/β_{k,γ}+1}·ε_in = O_{k,γ}(ε_in).

Implementing Step 2
In this subsection we implement Step 2 of Algorithm 1 in sublinear time. Once again, we assume that we have at our disposal a potential oracle eval_{φ_in} running in time t_{φ_in} = Õ(n^{1+ε_in}) and a matching oracle match_{M_in} with running time t_{M_in} = Õ(n^{1+ε_in}). We will output a potential oracle eval_{φ_out} running in time t_{φ_out} = Õ(n^{1+ε_out}). We show that there exists a (k, δ)-QMF ℱ with respect to M_in such that φ_out(u) = φ_in(u) + 1 for each u ∈ ℱ ∩ V_0 and φ_out(v) = φ_in(v) − 1 for each v ∈ ℱ ∩ V_1. The execution of Algorithm 3 is represented in Figure 1, where vertices colored in the same way are added to ℱ during the same iteration.
Analysis. First, we prove that Algorithm 3 implements the template (all guarantees hold whp); namely, that ℱ is a (k, δ)-QMF, where k and δ are defined as in Algorithm 1. With a slight abuse of notation, in Algorithm 3 we use ℱ to denote the set of nodes of the forest; here, we understand that for each u ∈ ℱ′_t ∖ ℱ_t we have an edge (u, match_{M_{t+1}}(u)) and for each v ∈ ℱ″_t ∖ ℱ′_t we have an edge (v, match_{M_in}(v)). Let τ be the total number of times LargeMatchingForward runs successfully in Algorithm 3; we will see that τ ≤ k/2. Notice that ℱ_τ is the last forest produced by Algorithm 3, and for each u ∈ ℱ_τ ∖ F_0 we add an edge incident to u; thus ℱ_τ is a forest with |F_0| connected components, one for each u ∈ F_0. Now we show that ℱ_τ is a (k, δ)-QMF w.r.t. M_in. We refer to the notation of Definition 3.6. Item 1 is clearly satisfied. Item 2 is satisfied because of line 4 in Algorithm 3.
Algorithm 3 Implementation of Step 2.
Set δ as in Algorithm 1. Initialize t ← 0 and ℱ_0 ← F_0, where F_0 is the set of M_in-unmatched vertices in V_0. Implement mem_{ℱ_0}(u) as: match_{M_in}(u) == ⊥. Repeat until LargeMatchingForward returns ⊥:
1. A_t ← (ℱ_t ∩ V_0) ∪ (V_1 ∖ ℱ_t).
2. match_{M_{t+1}} ← LargeMatchingForward(φ, A_t, δ).
3. ℱ′_t ← ℱ_t ∪ {match_{M_{t+1}}(u) | u ∈ ℱ_t}. Implement mem_{ℱ′_t}(u) as: mem_{ℱ_t}(u) or match_{M_{t+1}}(u) ∈ ℱ_t.
4. ℱ″_t ← ℱ′_t ∪ {match_{M_in}(u) | u ∈ ℱ′_t}. Implement mem_{ℱ″_t}(u) as: mem_{ℱ′_t}(u) or match_{M_in}(u) ∈ ℱ′_t.
5. ℱ_{t+1} ← ℱ″_t; t ← t + 1.
Implement eval_{φ_out}(u) as eval_{φ_in}(u) + mem_{ℱ_t}(u)·(−1)^{𝟙[u ∈ V_1]}.

Now we show that Item 3 is satisfied. Recall that k = 6000(2T+1)^{10}/δ^5 as in Algorithm 1 and that φ is a (2T+1)-potential. Thanks to Lemma 5.1, at each step we increment |ℱ| by at least (δ^5/(2000(2T+1)^{10}))·n. Thus, no more than ⌈2000(2T+1)^{10}/δ^5⌉ ≤ k/2 iterations are performed, and we cannot have more than k hops between u ∈ ℱ and v ∈ F_0 if u belongs to the connected component of v.
Now we prove that Item 4 is satisfied. At each iteration, the size of each connected component of ℱ_t at most triples. Indeed, let C be a connected component of ℱ: in step 3 we add to C at most |C| vertices (because we add one vertex for each edge of a matching incident to C), and in step 4 we add to C at most one more vertex for each vertex added in step 3.
Now we prove that Item 5 is satisfied. Algorithm 3 halts when LargeMatchingForward(φ, (ℱ ∩ V_0) ∪ (V_1 ∖ ℱ), δ) returns ⊥. This may only happen when there is no matching between ℱ ∩ V_0 and V_1 ∖ ℱ of size δn, which implies that there exists a vertex cover of size ≤ δn. Moreover, this is a vertex cover for the whole of E_E ∩ (ℱ × (V ∖ ℱ)), because all edges in E_E ∩ (V_1 × V_0) are in M_in and by Item 2 have both endpoints either in ℱ or in V ∖ ℱ.
It is easy to check that φ_out(u) = φ_in(u) + 1 for each u ∈ ℱ ∩ V_0 and φ_out(v) = φ_in(v) − 1 for each v ∈ ℱ ∩ V_1.
Lemma 5.4. Step 2 can be implemented in Õ(n^{2−ε}) time for some constant ε > 0. Moreover, the oracle eval_{φ_out} has running time t_{φ_out} = Õ(n^{1+ε_out}) with ε_out = O_{k,δ}(ε_in).
Proof. For s = 0, …, τ denote with ε_s > 0 a constant such that t_{A_s} = Õ(n^{1+ε_s}), where t_{A_s} is the running time of mem_{A_s}. Notice that ε_0 = ε_in. At step s we choose ε̄_s := 2ε_s as the ε parameter in Lemma 5.1. This implies that LargeMatchingForward runs in Õ(n^{1+ε_s}·n^{1−ε̄_s}) = Õ(n^{2−ε_s}) time. We have already proved that τ ≤ k, thus Algorithm 3 takes Õ(n^{2−ε}) time in total, where ε := min_{s ∈ [0,τ]} ε_s = ε_in.
Denote with t_ℱ the query time of mem_ℱ. For each s, we have t_{ℱ_{s+1}} = t_{M_{s+1}} + t_{M_in} + t_{ℱ_s}. Thanks to Lemma 5.1, we have t_{M_{s+1}} = Õ((t_{A_s} + t_{φ_in} + n)·n^{4ε̄_s}). Moreover, t_{A_s} = t_{ℱ_s}. Thus, t_{ℱ_{s+1}} = (t_{ℱ_s} + t_{φ_in} + n)·n^{8ε_s} + t_{M_in} + t_{ℱ_s}. Since t_{ℱ_0} = t_{φ_in} = t_{M_in} = Õ(n^{1+ε_in}), we have t_{φ_out} = t_{φ_in} + t_{ℱ_τ} = Õ(n^{1+9^τ·ε_in}) = Õ_k(n^{1+O_{k,δ}(ε_in)}).

Implementing the Template Algorithm
We can put together the results proved in the previous subsections and show that Algorithm 1 can be implemented in sublinear time.
Theorem 9. There exists a constant ε > 0 such that Algorithm 1 can be implemented in time O(n^{2−ε}). Moreover, using the notation of Observation 8, we can return a matching oracle match_{M_ALG[α_w]} running in time Õ(n^{1+f(ε)}), where f(ε) → 0 as ε → 0, such that M_ALG[α_w] satisfies |M_ALG[α_w]| ≥ (1 − γ)n and c(M_ALG[α_w]) ≤ c(M_OPT).
Proof. Algorithm 1 runs T = O_{C,γ}(1) iterations, and a single iteration consists of Step 1 and Step 2. At iteration s, denote with ε^{(s)}_in the value of ε_in for the input of Step 1 (or, equivalently, the value of ε_out for the output of Step 2 at iteration s − 1) and with ε^{(s)}_out the value of ε_out for the output of Step 1 (or, equivalently, the value of ε_in for the input of Step 2 at iteration s). Every time we run either Step 1 or Step 2, the value of ε_out is at most some constant factor larger than ε_in. This translates into ε^{(s)}_in = O_{C,k,γ}(ε^{(s−1)}_out) and ε^{(s)}_out = O_{C,k,γ}(ε^{(s)}_in). Thus, after T iterations ε^{(T)}_out is arbitrarily small, provided that ε^{(0)}_in is small enough. To conclude, we notice that the initial matching is empty and the initial potential is identically 0, so the first membership oracle and potential oracle run in linear time; thus we can set ε^{(0)}_in arbitrarily small. Finally, let M_ALG be the last matching computed by Algorithm 1. We have at our disposal a matching oracle match_{M_ALG} running in time Õ(n^{1+ε^{(T)}_out}), so we can easily sample a set S of O(log n) edges from M_ALG in time Õ(n^{1+ε^{(T)}_out}). This concludes the implementation of Algorithm 1.
Moreover, we compute w as the largest value such that at least 3γ|S| edges in S have cost ≥ w, and define α_w such that exactly α_w·n edges in M_ALG have cost ≥ w. Then, we implement a matching oracle match_{M_ALG[α_w]} for M_ALG[α_w], running in time Õ(n^{1+ε^{(T)}_out}), as follows: given u ∈ V_0 ∪ V_1, we set v ← match_{M_ALG}(u); if v = ⊥, we return ⊥; if c(u, v) < w, we return v, else we return ⊥. Thanks to Observation 8, we have |M_ALG[α_w]| ≥ (1 − γ)n and c(M_ALG[α_w]) ≤ c(M_OPT).
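This threshold-based oracle is a thin wrapper around match_{M_ALG} (a sketch; `match_alg` and `cost` are hypothetical stand-ins for the paper's oracles):

```python
def make_trimmed_matching_oracle(match_alg, cost, w):
    """Wrap a matching oracle so that edges of cost >= w are filtered out,
    i.e. implement match_{M_ALG[α_w]} on top of match_{M_ALG}."""
    def match_trimmed(u):
        v = match_alg(u)
        if v is None:          # u is unmatched in M_ALG: return ⊥
            return None
        return v if cost(u, v) < w else None  # keep only edges cheaper than w
    return match_trimmed
```

Each query costs one matching-oracle call plus one cost query, so the wrapper adds no asymptotic overhead.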
Proof of our Main Theorems
In this section we piece things together and prove Theorem 3. Then, we use Theorem 3 to prove Theorem 1, Corollary 1.1 and Theorem 2.

Proof of Theorem 3
In this subsection we strengthen Theorem 7, extend its scope to arbitrary costs and combine it with Theorem 9 to obtain Theorem 3. We restate the latter for convenience.
Theorem 3. For all constants 0 ≤ α < β ≤ 1 there exist a constant ε > 0 and an algorithm running in time O(n^{2−ε}) with the following guarantees.
The algorithm has adjacency-matrix access to an undirected, bipartite graph G = (V_0 ∪ V_1, E) and random access to the edge-cost function c : E → ℝ_+. The algorithm returns ĉ such that, whp, c(M_α) ≤ ĉ ≤ c(M_β), where M_α is a minimum-weight matching of size αn and M_β is a minimum-weight matching of size βn.
Moreover, the algorithm returns a matching oracle data structure that, given a vertex u, returns in n^{1+f(ε)} time an edge (u, v) ∈ M̂, or ⊥ if u ∉ V(M̂), where f(ε) → 0 as ε → 0. The matching M̂ satisfies αn ≤ |M̂| ≤ βn and c(M_α) ≤ c(M̂) ≤ c(M_β).
Roadmap of the proof. Theorem 7 works only for weights in [1, C]. In order to reduce to that case, we need to find a characteristic cost w of min-weight matchings with size in [αn, βn]. Then, we round every cost to a multiple of ½γ²w, where γ is a small constant. We show that, thanks to certain properties of the characteristic cost w, the approximation error induced by rounding the costs is negligible. Finally, we pad each side of the bipartition with dummy vertices to reduce the problem of finding a matching of approximate size βn to that of finding an approximate perfect matching, which is addressed in Theorem 7.
Notation. Similarly to Theorem 3, we denote with M_ξ the min-weight matching of size ξn in G. Likewise, we will define a graph Ḡ and denote with M̄_ξ the min-weight matching of size ξn in Ḡ. As in Section 5, given a matching M, we denote with M[δ] the matching obtained from M by removing the δn most expensive edges. We denote with μ(M) the cost of the most expensive edge in M. Given w ≥ 0, we denote with G_{≤w} the subgraph of edges of cost ≤ w. Throughout this subsection, fix a constant 0 < γ < (β − α)/4.
Reduction from arbitrary weights to [1, C]. The next technical lemma shows that if we can solve an easier version of the problem in Theorem 3, where we allow an additive error γ²wn on an instance where w is an upper bound for the cost function, then we can also solve the problem in Theorem 3. This reduction is achieved by finding a suitable characteristic cost w in sublinear time and running the aforementioned algorithm on G_{≤w}.
Lemma 6.1. Suppose that there exists an algorithm that takes as input a bipartite graph Ḡ = (V̄_0 ∪ V̄_1, Ē) endowed with a cost function c : Ē → [0, w], and outputs an estimate ĉ and a matching oracle match_M̂ such that (whp) ĉ satisfies c(M̄_α) ≤ ĉ ≤ c(M̄_{α+γ}) + γ²wn, while M̂ satisfies |M̂| ≥ αn and c(M̂) ≤ c(M̄_{α+γ}) + γ²wn. Suppose also that such an algorithm runs in time O(n^{2−ε}) and match_M̂ runs in time O(n^{1+ε}) for some ε > 0, where n = |V̄_0| = |V̄_1|. Then, there exists an algorithm that takes as input a bipartite graph G = (V_0 ∪ V_1, E) endowed with a cost function c : E → ℝ_+, and outputs an estimate ĉ and a matching oracle match_M̂ such that (whp) ĉ satisfies c(M_α) ≤ ĉ ≤ c(M_β), while M̂ satisfies |M̂| ≥ αn and c(M̂) ≤ c(M_β). Moreover, such an algorithm runs in time O(n^{2−ε}) and match_M̂ runs in time O(n^{1+ε}) for some ε > 0, where n = |V_0| = |V_1|.
We construct Ḡ and its cost function implicitly, because an explicit construction would take Ω(n²) time. Run the algorithm in the hypothesis on Ḡ and let c̄ and match_M̄ be its outputs. We set ĉ = c̄ − (2 − 2β + ξ)n, we set M̂ := M̄ ∩ (V_0 × V_1), and we implement match_M̂(u) as follows: let v ← match_M̄(u); if v = ⊥, return ⊥; if either u or v is a dummy vertex, return ⊥; else return v. It is easy to see that M̄_1 ∩ (V_0 × V_1) is a min-weight matching of size (β − ξ)n in G, hence c(M̄_1) = c(M_{β−ξ}) + |D_0| + |D_1| = c(M_{β−ξ}) + (2(1 − β) + ξ)n ≤ c(M_β) + (2 − 2β + ξ)n.
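The padded graph Ḡ is never built explicitly; its cost function can be exposed as an oracle over the original one. A minimal sketch, under the assumption (consistent with the accounting above, but not spelled out in this excerpt) that every edge incident to a dummy vertex costs exactly 1:

```python
def make_padded_cost_oracle(cost, n_real, n_padded):
    """Implicit cost oracle for the padded graph: on each side, vertices
    0..n_real-1 are real and vertices n_real..n_padded-1 are dummies.

    Assumption (hypothetical): dummy-incident edges cost 1, so a matching
    covering all dummies pays exactly one unit per dummy vertex.
    """
    def padded_cost(u, v):
        if u >= n_real or v >= n_real:
            return 1  # dummy edge
        return cost(u, v)  # real edge: defer to the original cost oracle
    return padded_cost
```

Because each query touches only the original oracle, the padding adds O(1) overhead per cost query, which is what keeps the overall implicit construction sublinear.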
Finally, we can prove Theorem 3.
Proof of Theorem 3. We notice that combining Theorem 7 and Theorem 9 we obtain a sublinear implementation of Algorithm 1 that takes a bipartite graph G = (V_0 ∪ V_1, E) and a cost function c : E → [1, C] as input, and outputs an estimate ĉ and a matching oracle match_M̂. The estimate ĉ satisfies c(M_{1−δ}) ≤ ĉ ≤ c(M_1), while M̂ satisfies |M̂| ≥ (1 − δ)n and c(M̂) ≤ c(M_OPT). Moreover, such an algorithm runs in time O(n^{2−ε}) and match_M̂ runs in time O(n^{1+ε}) for some ε > 0. Then, combining Lemma 6.3, Lemma 6.2 and Lemma 6.1, we obtain an algorithm that takes as input a bipartite graph G = (V_0 ∪ V_1, E) endowed with a cost function c : E → ℝ_+, and outputs an estimate ĉ and a matching oracle match_M̂ such that (whp) c(M_α) ≤ ĉ ≤ c(M_β), |M̂| ≥ αn, and c(M̂) ≤ c(M_β). Moreover, such an algorithm runs in time O(n^{2−ε}) and match_M̂ runs in time O(n^{1+ε}) for some ε > 0.

Proof of Theorem 1 and Corollary 1.1
Since Corollary 1.1 is more general than Theorem 1 we simply prove the former.
Corollary 1.1. Suppose we have sample access to two distributions μ, ν over a metric space (M, d_M) satisfying d(·, ·) ∈ [0, 1], and query access to d. Suppose further that there exist μ′, ν′ with support size n such that EMD(μ, μ′), EMD(ν, ν′) ≤ ξ, for some ξ > 0.
For each constant γ > 0 there exist a constant ε > 0 and an algorithm running in time O(n^{2−ε}) that outputs an estimate ÊMD such that ÊMD ∈ [EMD(μ, ν) ± (4ξ + γ)].
Moreover, such an algorithm takes Õ(n) samples from μ and ν.
Fix a constant γ > 0. From each probability distribution μ, ν we sample (with replacement) a multiset of m = Θ(n log n) points. We use V_μ, V_ν to denote the respective multisets, and μ̃, ν̃ to denote the empirical distributions of sampling a random point from V_μ, V_ν. Let T_μ, T_ν be the transport plans realizing EMD(μ, μ′) and EMD(ν, ν′) respectively. Namely, T_μ is a coupling between μ and μ′ such that EMD(μ, μ′) = E_{(x,y)∼T_μ}[d(x, y)], and likewise for T_ν. For each sample x in μ̃ we sample x′ ∼ T_μ(x, ·), and let V′_μ be the multiset of samples x′ for x ∈ V_μ. Define V′_ν similarly. Let μ̃′ and ν̃′ be the empirical distributions of sampling a random point from V′_μ and V′_ν.
Lemma 6.5. With high probability, EMD(μ̃′, μ′) ≤ γ.
Proof. First, we observe that V′_μ is distributed as a multiset of m samples from μ′. For any point x′ with at least (γ/4n)-mass in μ′, we expect Ω(log n) samples of x′ in V′_μ, so by a Chernoff bound we have that, with high probability, the number of samples of x′ concentrates to within a (1 ± γ/4) factor of its expectation. Furthermore, with high probability at most a (γ/2)-fraction of the samples correspond to points with less than (γ/4n)-mass in the original distribution. Thus, overall, the empirical distribution μ̃′ is within γ TV-distance of μ′. Finally, EMD(μ̃′, μ′) ≤ TV(μ̃′, μ′) because d(·, ·) ∈ [0, 1].
Lemma 6.6. EMD(μ, μ̃) ≤ ξ + 2γ with high probability.
Proof of Corollary 1.1. We consider the bipartite graph with a vertex for each point in V_μ, V_ν and edge costs induced by d(·, ·). We apply the algorithm guaranteed by Theorem 3 to estimate the cost of the min-weight matching of size between (1 − γ)m and m. We return the cost estimate ÊMD on the bipartite graph (normalized by dividing by m). Theorem 3 guarantees that ÊMD ∈ [EMD(μ̃, ν̃) ± γ]. Then, using the triangle inequality for EMD, as well as Lemma 6.6 on both μ and ν, we obtain |ÊMD − EMD(μ, ν)| ≤ |ÊMD − EMD(μ̃, ν̃)| + EMD(μ, μ̃) + EMD(ν, ν̃) ≤ 4ξ + 5γ.
Scaling γ down by a factor of 5, we retrieve Corollary 1.1.

Proof of Theorem 2
Theorem 2 (Main theorem, graph interpretation). For each constant γ > 0, there exist a constant ε > 0 and an algorithm running in time O(n^{2−ε}) with the following guarantees. The algorithm takes as input a budget B and query access to the edge-cost matrix of an undirected, bipartite graph G over n vertices. The algorithm returns an estimate M̂ that is within ±γn of the size of the maximum matching in G with total cost at most B.
Proof. Let M be the maximum matching in G with total cost at most B, and let |M| = ξn. We perform a binary search for ξ using the algorithm from Theorem 3 as a subroutine. This loses only a factor log(n) in query complexity, which gets absorbed in O(n^{2−ε}) by choosing a suitable constant ε.
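The binary search in this proof can be sketched as follows (a sketch with a hypothetical subroutine `est_cost(xi)` standing in for the Theorem 3 estimator of the cost of a min-weight matching of size ≈ ξn; the estimated cost is non-decreasing in ξ, so we search for the largest ξ whose estimated cost stays within the budget B):

```python
def max_size_within_budget(est_cost, B, eps=1e-3):
    """Binary-search the largest xi in [0, 1] with est_cost(xi) <= B,
    i.e. (approximately) the normalized size of the largest matching
    of total cost at most B. est_cost must be non-decreasing in xi.
    """
    lo, hi = 0.0, 1.0
    if est_cost(hi) <= B:
        return hi  # even a perfect matching fits within the budget
    while hi - lo > eps:  # O(log(1/eps)) calls to the estimator
        mid = (lo + hi) / 2
        if est_cost(mid) <= B:
            lo = mid
        else:
            hi = mid
    return lo
```

Since each estimator call costs O(n^{2−ε}) and only O(log n) calls are needed, the overall running time stays truly subquadratic.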

Lemma 4.4. After T iterations of the template algorithm we have |F_0| = |F_1| ≤ 4γn.
Proof. Denote with M the final matching obtained by Algorithm 1. Let B be a suitable set of broken vertices with |B| ≤ γn, as in Lemma 4.3. Partition B = B_M ∪ B_F, where B_F := B ∩ F is the set of unmatched vertices in B and B_M is the set of matched vertices in B. Consider the set B′_M of vertices currently matched to vertices in B_M.