New Graph Decompositions and Combinatorial Boolean Matrix Multiplication Algorithms

We revisit the fundamental Boolean Matrix Multiplication (BMM) problem. With the invention of algebraic fast matrix multiplication over 50 years ago, it also became known that BMM can be solved in truly subcubic $O(n^\omega)$ time, where $\omega<3$; much work has gone into bringing $\omega$ closer to $2$. Since then, a parallel line of work has sought comparably fast combinatorial algorithms but with limited success. The naive $O(n^3)$-time algorithm was initially improved by a $\log^2{n}$ factor [Arlazarov et al.; RAS'70], then by $\log^{2.25}{n}$ [Bansal and Williams; FOCS'09], then by $\log^3{n}$ [Chan; SODA'15], and finally by $\log^4{n}$ [Yu; ICALP'15]. We design a combinatorial algorithm for BMM running in time $n^3 / 2^{\Omega(\sqrt[7]{\log n})}$ -- a speed-up over cubic time that is stronger than any poly-log factor. This comes tantalizingly close to refuting the conjecture from the 90s that truly subcubic combinatorial algorithms for BMM are impossible. This popular conjecture is the basis for dozens of fine-grained hardness results. Our main technical contribution is a new regularity decomposition theorem for Boolean matrices (or equivalently, bipartite graphs) under a notion of regularity that was recently introduced and analyzed analytically in the context of communication complexity [Kelley, Lovett, Meka; arXiv'23], and is related to a similar notion from the recent work on $3$-term arithmetic progression free sets [Kelley, Meka; FOCS'23].


Introduction
Boolean Matrix Multiplication (BMM) is one of the most basic and fundamental combinatorial problems. It can be solved in $O(n^{\omega})$ time, where $2 \leq \omega < 2.3716$ [33,74] is the exponent of (integer) matrix multiplication. The algebraic technique underlying Strassen's algorithm [66] and all subsequent "fast matrix multiplication" algorithms has several limitations (discussed in Section 1.1) related to generalizability, elegance, and practical efficiency. Therefore, in addition to the line of work trying to enhance this algebraic machinery in order to reach $\omega = 2$, a parallel line of work aims to match (or improve) these subcubic bounds with different combinatorial techniques.
The first result in this direction is the "Four-Russians" algorithm by Arlazarov, Dinic, Kronrod, and Faradžev [12] that achieves $o(n^3)$ complexity by precomputing the answers to small subinstances. This approach can give $O(n^3/\log^2 n)$ time, but nothing faster [11]. After many decades, Bansal and Williams [13] used regularity lemmas to gain an additional $\log^{0.25} n$ factor speed-up. The usefulness of graph regularity techniques was undermined a few years later when Chan [23] gave a better, $O(n^3/\log^3 n)$ bound using only simple divide-and-conquer. Yu [75] optimized the divide-and-conquer method to achieve an $O(n^3/\log^4 n)$ bound that has stood since 2015. A pessimistic conjecture that has been popular since the 90s [65,56] states that truly subcubic running times are impossible for combinatorial algorithms; that is, one may be able to shave some more logarithmic factors, but we cannot reach $O(n^{3-\epsilon})$ time with $\epsilon > 0$.
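To make the lookup-table idea behind the Four-Russians method concrete, here is a minimal Python sketch of a Four-Russians-style Boolean product on bit-packed rows; the block width of roughly $\log n$ and the row-mask encoding are illustrative choices, not the exact scheme of [12].

```python
import math

def four_russians_bmm(A, B):
    # A, B: lists of n Python ints; bit j of A[i] encodes the entry A[i][j].
    # Returns C in the same encoding, with C[i][k] = OR_j (A[i][j] AND B[j][k]).
    n = len(A)
    t = max(1, int(math.log2(n))) if n > 1 else 1   # block width ~ log n
    C = [0] * n
    for start in range(0, n, t):
        width = min(t, n - start)
        # For every bit pattern over this block of columns of A (= rows of B),
        # precompute the OR of the selected rows of B by dynamic programming.
        table = [0] * (1 << width)
        for p in range(1, 1 << width):
            low = p & (-p)                           # lowest set bit of p
            table[p] = table[p ^ low] | B[start + low.bit_length() - 1]
        for i in range(n):
            pattern = (A[i] >> start) & ((1 << width) - 1)
            C[i] |= table[pattern]
    return C
```

On a word RAM this performs roughly $n^2/\log n$ table lookups, each an OR of $n$-bit masks, which is how the classical logarithmic savings arise.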
Main Result. In this paper, we prove a new regularity decomposition theorem that leads to a quasi-polynomial saving for BMM, combinatorially, coming much closer than before to refuting Conjecture 1.1.

Theorem 1.2 (Combinatorial BMM).
There is a deterministic combinatorial algorithm computing the Boolean product of two $n \times n$ matrices in time $n^3/2^{\Omega(\sqrt[7]{\log n})}$.
We immediately get a similar combinatorial super-poly-logarithmic saving for many other problems (that can be reduced to BMM). The list includes central problems in their domains such as context-free grammar parsing from formal languages [68], computing the transitive closure of a directed graph [35,60], join-project queries from databases [10,69], parameterized problems such as k-clique [61] and k-dominating-set [34], and various matrix product problems [59,29,70].
Our (and most previous) results are obtained by designing algorithms for the simpler triangle detection problem and then using a well-known subcubic equivalence with BMM [72]. Therefore, we focus on triangle detection below.
A New Regularity Decomposition Theorem. In abstract terms, a graph is regular if it behaves somewhat pseudo-randomly, and a regularity decomposition theorem states that any graph can be decomposed into a "small" number of regular subgraphs. Such results are interesting mathematically because they say that any graph can be simplified dramatically, and also algorithmically because they let us reduce a problem from arbitrary to (the often easier) random-like graphs. The possibility and efficiency of such results depend on the precise notion of regularity; generally, there is a trade-off between strength (i.e., how close "regular" is to random) and efficiency (i.e., the number of subgraphs in the decomposition).
For example, the celebrated Szemerédi Regularity Lemma [67] yields a decomposition into subgraphs with very strong pseudo-random properties, but to achieve meaningful results for graphs with density δ, the number of parts inherently scales as a tower function of height poly(1/δ) [42]. A comparable but weaker notion of regularity due to Frieze and Kannan [38] admits decompositions into fewer, but still exponentially many parts (specifically, $2^{O(\delta^{-2})}$). At the other end of the spectrum, expander decompositions use a much weaker notion of pseudo-randomness, but gain significantly in efficiency.
Considering a specific problem, e.g., triangle detection, the challenge is finding a sweet spot in which the regularity notion is strong enough to make the problem algorithmically easy, yet weak enough to make the decomposition efficient. Unfortunately, expanders are too weak [6]. Based on Szemerédi regularity and Frieze-Kannan regularity, Bansal and Williams [13] indeed scored nontrivial algorithmic improvements for Boolean Matrix Multiplication: a $(\log^* n)^{\Omega(1)}$-shave based on Szemerédi's Regularity Lemma, and a $(\log n)^{1/4}$-shave based on Frieze-Kannan regularity. In both cases, however, due to the excessive number of pieces in the regularity decomposition, it seems hopeless to go beyond log-shaves.
In this paper we employ a regularity notion called grid regularity that was recently introduced by Kelley, Lovett and Meka [52] in the context of communication complexity and is based on similar results in the work of Kelley and Meka [53] on 3-term arithmetic progressions. This regularity notion is weak but still useful for triangle detection and BMM. Phrased in terms of matrices, the key takeaway from their work is that a single grid-regular matrix is not necessarily very random, but the product of two grid-regular matrices is very random (see Theorem 2.1). Equivalently, in a 3-layered graph in which both edge sets are grid regular, the number of 2-paths from left to right behaves randomly. Our main contribution is that we (1) establish a decomposition theorem for this notion of regularity into quasi-polynomially many parts (specifically, $2^{O((\log \delta^{-1})^7)}$), and (2) provide an efficient deterministic algorithm to compute this decomposition.

On Combinatorial Algorithms
Despite the large number of papers on combinatorial algorithms for BMM, there is currently no satisfying and precise definition for this notion. The main reason for this, we believe, is that there are multiple strong motivations for seeking combinatorial algorithms (discussed below) that are not necessarily consistent with one another. A simple algorithm need not be practical, or vice versa, and an algorithm that generalizes to one setting may not generalize to another. Therefore, one can either focus on precise definitions that are limited to one motivation or embrace a more inclusive but loose definition. Many examples of the former approach exist (see below), and they give rise to interesting research questions. The latter, however, is more popular in the community: we currently have no truly subcubic algorithm for BMM other than Strassen's algorithm (and its successors), so we should first seek to break Conjecture 1.1 by any other technique and hope that at least one of the motivations gets satisfied.
To help shed light on this, let us review the limitations of the existing algebraic technique that motivate us to seek other algorithms. The first two may be more well-known, but the third is more pressing for fine-grained complexity and algorithm design. Along the way, we discuss to what extent our algorithm satisfies each consideration.
• Simplicity: Strassen's algorithm (and even more so its successors) exploits cancellations using formulas that may be considered unintuitive; consequently, the values manipulated at intermediate stages of the computation are quite uninterpretable. Instead, one may hope for techniques that are simpler and more interpretable; this is probably the historical reason for the name "combinatorial". Some works have proposed precise definitions along these lines, e.g., that for solving triangle detection an algorithm may only generate sets by basic operations on the neighborhoods of nodes, and strong lower bounds exist [11,32]. Unfortunately, the current definitions are not flexible enough to capture the regularity decomposition techniques of Bansal-Williams [13] and of our paper, nor even the simple divide-and-conquer of Chan [23] and Yu [75], even though these techniques are widely considered "combinatorial" (even by the authors proposing the definitions). In particular, our algorithm decomposes the graph by repeatedly finding and removing irregular pieces (i.e., pieces that contain too many bicliques) and then uses a brute-force algorithm on the (sparse) parts.
• Practical Efficiency: Here, one should make a distinction between Strassen's $n^{2.81}$ algorithm and its successors that have reduced ω much further. The latter algorithms are considered "galactic" and the interest in them is mostly theoretical. On the other hand, Strassen's algorithm has been used in practice, but the gains are limited; some reasons include its bad locality (generating too many cache misses) and the need to manipulate large numbers. For many decades, researchers have sought techniques that are more efficient in practice and also have worst-case guarantees. The lack of success in finding such algorithms partly motivated Conjecture 1.1 in the 90s [65,56,9]. A definition of this notion has to be empirical, and determining whether our algorithm satisfies it requires experiments. Regularity decompositions are infamously impractical, but our underlying paradigm of decomposing the input into pseudo-random parts has the potential to be practical.
• Generalizability: Much of the interest in BMM and triangle detection is because they are simplified special cases of more difficult, important problems. A technique that only solves the special case is not satisfying. In the following, we give four examples of such problems where (1) for well-established reasons, Strassen's technique does not give a truly subcubic algorithm, and (2) a truly subcubic algorithm that does generalize would be groundbreaking, as it would refute a popular conjecture in fine-grained complexity that has nothing to do with "combinatorial algorithms", i.e., we do not know how to refute it with any algorithms. In each case, a precise definition of "combinatorial" in the sense of generalizing to that particular setting can readily be made: it is an algorithm that solves the corresponding problem.
- To refute the famous All-Pairs Shortest-Paths (APSP) Conjecture, it is enough (in fact, equivalent) to solve weighted generalizations of BMM and triangle detection in truly subcubic time: (min,+)-matrix-multiplication and Negative-Triangle [72]. The issue with Strassen's technique is that it exploits cancellations by subtracting numbers, and min does not have an inverse.
- To refute the Online Matrix-Vector (OMV) Conjecture [48], we need to solve BMM in an online setting in which the columns of the second matrix arrive one by one and, at each step, we must output the answer before seeing the next column, in a total time that is truly subcubic. The issue with Strassen's algorithm is that its formulas depend on later columns.
- To refute the Hyper-Clique Conjecture [58], we need a generalization to hypergraphs that lets us detect a 4-clique in a 3-uniform hypergraph. A technique entirely different from Strassen's is needed because formulas that reduce the number of multiplications provably do not exist -- the border rank of the corresponding tensor matches the trivial upper bound [58].
- To refute the famous 3-SUM Conjecture [62,54] (and also the APSP Conjecture [73]), it is enough to obtain a generalization to the (witness) reporting setting. In particular, we would like an algorithm that preprocesses a graph in truly subcubic time and can then enumerate all triangles with constant delay. Unfortunately, the witness information is lost under the cancellations that are exploited in Strassen's algorithm.
Do our new algorithm and, more generally, the regularity decomposition technique generalize to these four settings? For the first three, it is unclear and left for future research (there was much less incentive to pursue this before our work, since the regularity approach was outperformed by divide-and-conquer). For the reporting setting, Abboud, Fischer, and Shechter [4] recently observed that the ideas in the Bansal-Williams algorithm can give a triangle enumeration algorithm with $n^3/\log^{2.25} n$ preprocessing and constant delay, which is the state of the art even when using algebraic techniques. Building on our new decomposition theorem, we give an improved bound, demonstrating that our techniques are useful beyond "combinatorial" algorithms.

Theorem 1.3 (Triangle Enumeration Algorithm).
There is a deterministic algorithm that preprocesses a given graph in time $n^3/(\log n)^6 \cdot (\log\log n)^{O(1)}$ and then enumerates all triangles with constant delay.
It may seem strange that we shave only a $\log^6 n$ factor and not a super-poly-log. The reason for this is that $O(n^3/\log^6 n)$ is the best we know how to achieve (using any technique) for triangle enumeration even on random graphs (see Section 7.1). Thus, it is a natural limit for any technique (like ours) based on a reduction to random-like instances. Moreover, due to the reduction from 3-SUM to triangle enumeration [62,54], shaving any additional $\log^{\epsilon} n$ factor over our bound would improve on the longstanding upper bound for (integer) 3-SUM [14] (see Section 7.4).

Machine Model
We assume the standard Word RAM model with word size Θ(log n) (where n is the input size). Since, for most of our algorithmic results, additional log-factors in the running times would not matter, the choice of the machine model is not crucial. Only in Section 7, where we care about log-factors, does this choice matter.

Graphs and Matrices
We typically denote sets (of nodes) by X, Y, Z and matrices by A, B, C. Moreover, we typically view binary matrices $A \in \{0,1\}^{X \times Y}$ as bipartite graphs on the node sets X, Y, where an edge (x, y) is present if and only if A(x, y) = 1. Let $A \in \{0,1\}^{X \times Y}$ and $B \in \{0,1\}^{Y \times Z}$ be matrices. We denote by AB their standard matrix product, and by $A \bullet B$ a scaled matrix product defined by $(A \bullet B)(x, z) = \mathbb{E}_{y \in Y}[A(x, y) \cdot B(y, z)]$. Following the bipartite graph analogy, for sets $X' \subseteq X$ and $Y' \subseteq Y$ we let $A[X', Y']$ denote the submatrix restricted to the rows in X′ and the columns in Y′ -- that is, the subgraph induced by $X' \cup Y'$. Let $A^T$ denote the transpose of A -- that is, the subgraph obtained by exchanging the sides X and Y. We call $\mathbb{E}[A] = \mathbb{E}_{x \in X, y \in Y}[A(x, y)]$ the density of A. For nodes x ∈ X and y ∈ Y, we define their (relative) degrees as $\deg_A(x) = \mathbb{E}_{y \in Y}[A(x, y)]$ and $\deg_A(y) = \mathbb{E}_{x \in X}[A(x, y)]$. We say that A is ϵ-left-min-degree or simply ϵ-min-degree if $\min_{x \in X} \deg_A(x) \geq (1 - \epsilon)\,\mathbb{E}[A]$. (The symmetric notion of ϵ-right-min-degree is never used in the paper.) Moreover, for α, ϵ, δ ≥ 0, we say that A is (α, ϵ, δ)-uniform if all but a δ-fraction of the entries (x, y) satisfy $A(x, y) \geq (1 - \epsilon)\alpha$.
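The following small Python sketch restates these basic quantities for 0/1 numpy matrices; the helper names are ours, chosen for illustration, and the formulas follow the definitions above.

```python
import numpy as np

def density(A):
    """E[A]: the fraction of one-entries of the bipartite graph A."""
    return A.mean()

def left_degree(A, x):
    """deg_A(x): the relative degree of row-node x (fraction of its neighbors in Y)."""
    return A[x].mean()

def scaled_product(A, B):
    """(A • B)(x, z) = E_y[A(x, y) * B(y, z)], i.e. the counting product AB rescaled by 1/|Y|."""
    return (A @ B) / A.shape[1]

def is_min_degree(A, eps):
    """eps-min-degree: every row-node has relative degree at least (1 - eps) * E[A]."""
    return A.mean(axis=1).min() >= (1 - eps) * A.mean()
```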

Grid Regularity
Recall that we abstractly consider a graph regular if it behaves somewhat pseudo-randomly. In this paper we employ the following formal notion of regularity, defined via the "grid norm" of a matrix. Specifically, for a matrix with non-negative entries $A \in \mathbb{R}_{\geq 0}^{X \times Y}$ and integers $k, \ell \geq 1$, we define its $(k, \ell)$-grid norm as $\|A\|_{U(k,\ell)} = \big( \mathbb{E}_{x_1, \dots, x_k \in X} \, \mathbb{E}_{y_1, \dots, y_\ell \in Y} \prod_{i \in [k]} \prod_{j \in [\ell]} A(x_i, y_j) \big)^{1/(k\ell)}$; note that equivalently $\|A\|_{U(k,\ell)}^{k\ell} = \mathbb{E}_{y_1, \dots, y_\ell \in Y} \big[ \big( \mathbb{E}_{x \in X} \prod_{j \in [\ell]} A(x, y_j) \big)^k \big]$. Strictly speaking, $\| \cdot \|_{U(k,\ell)}$ is not necessarily a norm, but we will nevertheless intuitively treat it as such. In combinatorial terms, the grid norm of a bipartite graph $A \in \{0,1\}^{X \times Y}$ measures (up to normalization) the number of $(k, \ell)$-bicliques that occur as subgraphs of A (including subgraphs in which some nodes of the biclique coincide). Note that the grid norm $\|A\|_{U(k,\ell)}$ ranges from $\mathbb{E}[A]$ to 1, and thereby constitutes some measure of pseudo-randomness: on the one hand, purely random bipartite graphs (with edge density $\mathbb{E}[A]$) have grid norm $\|A\|_{U(k,\ell)} \approx \mathbb{E}[A]$, whereas structured graphs (e.g., graphs with large induced subgraphs of increased density) often have larger grid norms. In this spirit, we say that A is $(\epsilon, k, \ell)$-regular if $\|A\|_{U(k,\ell)} \leq (1 + \epsilon)\,\mathbb{E}[A]$. For specific constant values of k and ℓ, grid norms have appeared in many previous mathematical works (e.g., [43,44]), and also implicitly in recent algorithmic structure-to-randomness reductions [3,2,51].
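The biclique-counting interpretation can be spelled out as a brute-force computation; this sketch follows the definition above verbatim (with repeated vertices allowed) and is only sensible for tiny matrices.

```python
from itertools import product
import numpy as np

def grid_norm(A, k, l):
    """||A||_{U(k,l)}: average the (k,l)-biclique indicator over all tuples
    (x_1..x_k, y_1..y_l), allowing repeats, then take the (1/(k*l))-th root."""
    X, Y = A.shape
    total = 0.0
    for xs in product(range(X), repeat=k):
        for ys in product(range(Y), repeat=l):
            total += A[np.ix_(xs, ys)].prod()
    return (total / (X ** k * Y ** l)) ** (1.0 / (k * l))
```

For a random 0/1 matrix of density p this returns roughly p, whereas planting a dense induced rectangle pushes the value noticeably above p, matching the pseudo-randomness discussion above.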

Kelley-Lovett-Meka's Structural Theorem
What makes grid norms useful for us? In a recent result, Kelley, Lovett and Meka [52] use analytical methods to obtain the following structural result, linking the regularity of two graphs A and B to their product matrix.

Technical Overview
Theorem 2.1 is the starting point. Our goal in the following is to exploit this structural theorem algorithmically and derive an improved combinatorial algorithm for Boolean matrix multiplication. In this section we describe our key ideas.
Boolean Matrix Multiplication and Triangle Detection. Recall that the Triangle Detection problem is to test whether a given undirected, tripartite graph (X, Y, Z, A, B, C) (with vertex parts X, Y, Z and edge parts $A \in \{0,1\}^{X \times Y}$, $B \in \{0,1\}^{Y \times Z}$, $C \in \{0,1\}^{X \times Z}$) contains a triangle (that is, a vertex triple $(x, y, z) \in X \times Y \times Z$ with A(x, y) = B(y, z) = C(x, z) = 1). While at first glance Triangle Detection appears to be a simpler problem than BMM (note that the output consists of a single bit versus $n^2$ bits), it has been known since the early days of fine-grained complexity that both problems are, in fact, equivalent in terms of subcubic algorithms [72]. Specifically, if Triangle Detection can be solved in time $O(n^3/f(n))$, then Boolean Matrix Multiplication is in time $O(n^3/f(n^{1/3}))$. This reduction is essentially loss-less for the quasi-polynomial speed-up that we aim for in this paper. Therefore, we focus on designing an efficient algorithm for Triangle Detection in the following exposition.
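To spell out why the reduction is essentially lossless for our savings function, plug $f(n) = 2^{c(\log n)^{1/7}}$ into the reduction; this is a routine calculation.

```latex
% The speed-up survives the n -> n^{1/3} substitution up to a constant in the exponent:
\[
  f(n) = 2^{c\,(\log n)^{1/7}}
  \quad\Longrightarrow\quad
  f(n^{1/3}) = 2^{c\,(\frac{1}{3}\log n)^{1/7}}
             = 2^{c\,3^{-1/7}\,(\log n)^{1/7}}
             = 2^{\Omega((\log n)^{1/7})}.
\]
```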
Triangle Detection on Regular Graphs. We start by describing a dream scenario to understand how Kelley-Lovett-Meka's structural theorem [52] comes into play.
Our aim is to solve Triangle Detection in time $n^3/2^{\Omega(d)}$ for some parameter d. Moreover, let ϵ > 0 be a small constant (say, ϵ = 1/160). For the dream scenario suppose that the edge parts A and $B^T$ are (ϵ, 2, d)-regular and ϵ-min-degree, i.e., regular in the sense of Theorem 2.1. Under these assumptions, Theorem 2.1 yields that the scaled matrix product $A \bullet B$ is $(\mathbb{E}[A]\,\mathbb{E}[B], 80\epsilon, 2^{-\epsilon d/2})$-uniform. Explicitly, for our choice of ϵ = 1/160, this means that at least a $(1 - 2^{-\epsilon d/2})$-fraction of the entries of $A \bullet B$ are at least $\tfrac{1}{2}\,\mathbb{E}[A]\,\mathbb{E}[B]$. In particular, the matrix $A \bullet B$ (and thereby also AB) has zeros in at most a $2^{-\epsilon d/2}$-fraction of its entries (assuming that A and B are nonzero).
This puts us in a win-win situation: Either the matrix C is sparse ($\mathbb{E}[C] \leq 2^{-\epsilon d/2}$), in which case we can detect a triangle in time $n^3/2^{\Omega(d)}$ (by enumerating all $n^2/2^{\Omega(d)}$ edges in C and all remaining nodes y ∈ Y). Or the matrix C is dense ($\mathbb{E}[C] > 2^{-\epsilon d/2}$), and it follows from the uniformity that AB and C have a common nonzero entry. Note that this certifies that there is a triangle without the need to compute anything further.
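To make the win-win concrete, here is a hedged Python sketch of this case analysis for 0/1 numpy matrices; it presumes the regularity and min-degree hypotheses above, which is why the dense case may simply report success.

```python
import numpy as np

def detect_triangle_dream_case(A, B, C, eps=1/160, d=100):
    """Either brute-force a sparse C, or certify a triangle when C is dense.
    Assumes A and B^T are (eps, 2, d)-regular and eps-min-degree, so that
    A • B is uniform in the sense of Theorem 2.1."""
    threshold = 2.0 ** (-eps * d / 2)
    if C.mean() <= threshold:
        # Sparse case: enumerate the few edges (x, z) of C and all middle nodes y.
        for x, z in zip(*np.nonzero(C)):
            common = A[x, :] * B[:, z]
            if common.any():
                return (int(x), int(np.argmax(common)), int(z))   # a triangle
        return None                                               # no triangle
    # Dense case: uniformity guarantees that AB and C share a nonzero entry,
    # so a triangle exists without any further computation.
    return "triangle exists"
```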
Our Regularity Decomposition. Of course, we cannot simply assume the dream scenario where A and B satisfy the regularity and min-degree conditions. Instead, we hope to decompose A and B into regular pieces in the same flavor as Szemerédi's or Frieze-Kannan's regularity lemmas. For grid regularity, unfortunately, such a decomposition theorem was not known.
One of our key contributions is such a decomposition theorem; see Theorem 3.1. We emphasize that this theorem is novel even existentially (i.e., even without the extra requirement that the decomposition must be computed efficiently). There is an algorithm ABDecomposition(X, Y, Z, A, B, ϵ, d) that computes a collection of tuples $\{(X_k, Y_k, Z_k, A_k, B_k)\}_{k=1}^K$ such that:
1. $AB = \sum_{k \in [K]} A_k B_k$.
2. For all k ∈ [K]: (i) $\mathbb{E}[A_k] \leq 2^{-d}$ or $\mathbb{E}[B_k] \leq 2^{-d}$, or (ii) $A_k$ and $B_k^T$ are both (ϵ, 2, d)-regular and ϵ-min-degree.
3. The total size $\sum_{k \in [K]} |X_k|\,|Y_k|\,|Z_k|$ is bounded in terms of $|X|\,|Y|\,|Z|$ and $\mathrm{poly}(d)$.
4. $K \leq \exp(d^7\,\mathrm{poly}(\epsilon^{-1}))$.
The algorithm is deterministic and runs in time $n^2 \cdot \exp(d^7\,\mathrm{poly}(\epsilon^{-1}))$ (where n denotes the total number of nodes). To illustrate how these four properties become useful, we complete the description of the Triangle Detection algorithm. We first precompute the decomposition as in the theorem. Additionally, define $C_k = C[X_k, Z_k]$. Property (1) of the theorem states that $AB = \sum_k A_k B_k$ (here, by slight abuse of notation, in the sum we interpret each term $A_k B_k$ as an $X \times Z$-matrix by extending $A_k B_k$ with zeros). Therefore, the set of triangles in the original graph is exactly the disjoint union of the triangles in the tripartite subgraphs $(X_k, Y_k, Z_k, A_k, B_k, C_k)$. It thus remains to detect a triangle in any of these subgraphs.
For each such subgraph, we are again in a win-win situation: If at least one of the edge parts is sparse (i.e., $\mathbb{E}[A_k] \leq 2^{-d}$, $\mathbb{E}[B_k] \leq 2^{-d}$, or $\mathbb{E}[C_k] \leq 2^{-\epsilon d/2}$), then we can solve the subinstance efficiently in time $|X_k|\,|Y_k|\,|Z_k|/2^{\Omega(d)}$. Otherwise, Property (2) of the theorem implies that we are in the dream scenario that $A_k$ and $B_k^T$ are (ϵ, 2, d)-regular and ϵ-min-degree. Following the same argument as before, building on the structural Theorem 2.1, it follows that $A_k B_k$ and $C_k$ share a common nonzero entry, which entails the existence of a triangle. In summary, the algorithm solves each sparse subinstance in time $|X_k|\,|Y_k|\,|Z_k|/2^{\Omega(d)}$ and stops as soon as it encounters a dense subinstance.
The remaining Properties (3) and (4) are necessary to bound the running time of this algorithm. On the one hand, by Property (3), solving all sparse instances takes total time $n^3/2^{\Omega(d)}$. On the other hand, precomputing the decomposition, and testing for each subinstance whether it is dense or sparse, takes time $n^2 \cdot \exp(d^7\,\mathrm{poly}(\epsilon^{-1}))$. By choosing $d = \Theta(\sqrt[7]{\log n})$ with a sufficiently small constant such that the precomputation time becomes $O(n^{2.1})$, say, the total running time becomes $n^3/2^{\Omega(\sqrt[7]{\log n})}$ as claimed. The remainder of this overview is devoted to a proof overview of Theorem 3.1.
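The overall control flow then looks as follows; this sketch takes the decomposition as a precomputed input and uses illustrative thresholds, so it is an outline of the argument above rather than the paper's pseudocode.

```python
import numpy as np

def detect_triangle_from_parts(parts, C, eps, d):
    """parts: list of (Xk, Yk, Zk, Ak, Bk) where Xk, Zk are index arrays into C
    and Ak, Bk are the corresponding 0/1 submatrices of A and B."""
    for Xk, Yk, Zk, Ak, Bk in parts:
        Ck = C[np.ix_(Xk, Zk)]
        sparse = (Ak.mean() <= 2.0 ** (-d) or Bk.mean() <= 2.0 ** (-d)
                  or Ck.mean() <= 2.0 ** (-eps * d / 2))
        if sparse:
            # Brute force; for simplicity we always enumerate the edges of Ck here,
            # while the argument above enumerates whichever side is the sparse one.
            for x, z in zip(*np.nonzero(Ck)):
                if (Ak[x, :] * Bk[:, z]).any():
                    return True
        else:
            # Dense subinstance: regularity plus uniformity certify a triangle.
            return True
    return False
```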

Enforcing Regularity and Min-Degree
Towards proving the decomposition theorem, our first milestone is to develop tools to enforce the (a) regularity and (b) min-degree conditions.
Both tools follow a common theme: To achieve some property, we either certify that (a large part of) the given graph already satisfies the property, or that we can alternatively find a large induced subgraph which is substantially denser than average (density increment). In the former case we have been successful, and in the latter case we will simply recurse on the selected denser piece. Since the density increases with each recursive call, we control the recursion depth and the loss we thereby incur. More details follow in Sections 3.2 and 3.3.
Enforcing Min-Degree. Let us start with the conceptually easier min-degree property. Here, specifically, we would like to ensure the ϵ-min-degree condition, i.e., that all nodes x ∈ X satisfy $\deg_A(x) \geq (1-\epsilon)\,\mathbb{E}[A]$. In fact, it is enough for us to find a subgraph of, say, half the total size that is ϵ-min-degree. An easy algorithm is to repeatedly remove low-degree nodes x until the remaining graph becomes ϵ-min-degree. If this algorithm terminates before removing half the nodes, then we have succeeded in finding a large ϵ-min-degree subgraph. If instead the algorithm removes half the nodes in X and the remaining graph A′ is still not ϵ-min-degree, then we claim that the remaining graph has density at least $(1 + \tfrac{\epsilon}{2})\,\mathbb{E}[A]$, i.e., we have found a density increment. For more details, see Lemma 5.1.
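A hedged sketch of this peeling procedure follows; the thresholds and the "half the nodes" cutoff track the informal description above, not necessarily the precise constants of Lemma 5.1.

```python
import numpy as np

def enforce_min_degree(A, eps):
    """Repeatedly delete a row whose relative degree is below (1 - eps) times the
    current density. Returns (kept_row_indices, case): "min-degree" if at least
    half of the rows survive, "density-increment" otherwise."""
    keep = list(range(A.shape[0]))
    while len(keep) > A.shape[0] // 2:
        sub = A[keep]
        degs = sub.mean(axis=1)
        if degs.min() >= (1 - eps) * sub.mean():
            return keep, "min-degree"          # Case 1: large eps-min-degree subgraph
        keep.pop(int(degs.argmin()))           # drop the lowest-degree row
    return keep, "density-increment"           # Case 2: remaining half is denser
```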
Enforcing Regularity. The more challenging task is to ensure that a graph is regular (or that we can alternatively find a density increment). Following the terminology from [52] (which in turn originates from [53]), we refer to this step as "sifting". Specifically, we rely on the theorem stated as Theorem 3.2 below, which we will later apply with k = 2 and ℓ = d; the algorithm it provides is deterministic. The existential claim of Theorem 3.2 was already established by Kelley, Lovett and Meka [52] (up to insignificant changes in the parameters) by a simple "one-shot" proof. While it is possible to turn their ideas into a randomized sifting algorithm, we follow a different proof that can ultimately be turned into a deterministic algorithm. The rough idea is to prove that whenever A is not (ϵ, k, ℓ)-regular, then either many nodes x ∈ X have exceptionally high degree $\deg_A(x)$ (in which case we can return the set X′ of such high-degree nodes and Y′ = Y), or we can find a large induced subgraph of A that is (ϵ, k − 1, ℓ)-irregular (see Lemma 4.2). In the latter case, we recurse on that subgraph, so after at most k recursive calls we find a density increment.
In order to detect this exceptionally irregular subgraph, it is necessary to obtain an accurate estimate of its grid norm. To this end, we prove that any grid norm $\|A\|_{U(k,\ell)}$ can be approximated up to some additive error α > 0 in time $n^2 \cdot \alpha^{-O(k\ell(k+\ell))}$ by a deterministic algorithm (Lemma 4.6).
Here we crucially build on the technology of oblivious samplers. We defer further details to Section 4.
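For intuition, the straightforward randomized estimator that this deterministic algorithm replaces looks as follows: it samples random row- and column-tuples and averages the biclique indicator (the deterministic version substitutes an oblivious sampler for the random samples). This is a hedged illustration, not the algorithm of Lemma 4.6.

```python
import numpy as np

def estimate_grid_norm(A, k, l, samples=10000, seed=None):
    """Monte-Carlo estimate of ||A||_{U(k,l)} for a 0/1 matrix A: sample random
    tuples (x_1..x_k, y_1..y_l) and average the product of the k*l entries."""
    rng = np.random.default_rng(seed)
    X, Y = A.shape
    total = 0.0
    for _ in range(samples):
        xs = rng.integers(0, X, size=k)
        ys = rng.integers(0, Y, size=l)
        total += A[np.ix_(xs, ys)].prod()
    return (total / samples) ** (1.0 / (k * l))
```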

Decomposing A
As a warm-up and building block towards Theorem 3.1, let us first focus on decomposing a single bipartite graph $A \in \{0,1\}^{X \times Y}$. Specifically, we establish the following decomposition with four properties analogous to those of Theorem 3.1.
The algorithm is deterministic and runs in time $n^2 \cdot \exp(d^3\,\mathrm{poly}(\epsilon^{-1}))$. Theorem 3.3 in itself is already an interesting regularity decomposition, which we believe will likely find further applications in the future. The proof of the theorem is along the following lines. First of all, if $\mathbb{E}[A] \leq 2^{-d}$, we may simply return the trivial decomposition {(X, Y, A)}. Otherwise, consider the following subtask (see Lemma 5.2): The goal is to find $X^* \subseteq X$ and $Y^* \subseteq Y$ such that the induced subgraph $A[X^*, Y^*]$ is (ϵ, 2, d)-regular and ϵ-min-degree; we call $X^* \times Y^*$ a good rectangle. We can find a good rectangle using the density increment technique. First, make half of X satisfy the min-degree condition. Then, apply Theorem 3.2 to certify that this remaining graph is (ϵ, 2, d)-regular. If both steps succeed, we have successfully identified a good rectangle (namely, the entire remaining graph). Otherwise, if either step fails and instead returns a large subgraph with density at least $(1 + \tfrac{\epsilon}{2})\,\mathbb{E}[A]$, we simply recurse on the denser subgraph to find a good rectangle. With each recursive call the density strictly increases, and thus this process eventually terminates. In fact, the recursion depth is bounded by O(d/ϵ) given that the initial density is $\mathbb{E}[A] \geq 2^{-d}$. Since with each recursive call we reduce the number of vertices to an $(\epsilon\,\mathbb{E}[A])^{O(d)} = \exp(-d^2\,\mathrm{poly}(\epsilon^{-1}))$-fraction (by Theorem 3.2), the returned good rectangle covers at least an $\exp(-d^3\,\mathrm{poly}(\epsilon^{-1}))$-fraction of the original graph.
Coming back to Theorem 3.3, we can compute the decomposition using density decrements. Namely, we repeatedly find good rectangles $X^* \times Y^*$ as in the previous paragraph, take the subgraph $(X^*, Y^*, A[X^*, Y^*])$ as one part of the decomposition, and then remove all edges in the rectangle $X^* \times Y^*$ from A. Eventually A becomes $2^{-d}$-sparse, and at this point we return the remaining trivial decomposition {(X, Y, A)}. In each step we remove at least $\exp(-d^3\,\mathrm{poly}(\epsilon^{-1})) \cdot |X|\,|Y|$ edges from A, and therefore this process leads to at most $L \leq \exp(d^3\,\mathrm{poly}(\epsilon^{-1}))$ many parts.
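In code, the peeling loop looks roughly like this; the good-rectangle finder is passed in as a black box, since its implementation (Lemma 5.2) is the subject of the previous paragraph, and the part format is our own illustrative choice.

```python
import numpy as np

def a_decomposition(A, d, find_good_rectangle):
    """Density-decrement loop behind Theorem 3.3 (hedged sketch): repeatedly extract
    a good rectangle, record it as one part, zero out its edges, and stop once the
    remaining graph is 2^{-d}-sparse. `find_good_rectangle(A)` is assumed to return
    two index arrays (rows, cols)."""
    A = A.copy()
    parts = []
    while A.mean() > 2.0 ** (-d):
        rows, cols = find_good_rectangle(A)
        parts.append((rows, cols, A[np.ix_(rows, cols)].copy()))
        A[np.ix_(rows, cols)] = 0                      # density decrement
    parts.append((np.arange(A.shape[0]), np.arange(A.shape[1]), A))  # sparse remainder
    return parts
```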
So far we have neglected Property 3, but it turns out that a closer inspection of this process indeed yields the claimed bound. Proving this statement builds on the critical insight that any good rectangle that we remove is at least as dense as the current remaining graph. Specifically, let $A_\ell$ denote the remaining matrix A before the ℓ-th step of the algorithm, and let $L_1$ be the smallest index at which the density has halved. Then we can express the total contribution of the first $L_1$ steps accordingly. We can apply the same argument to analyze the algorithm in phases: that is, letting $L_i$ be the smallest step at which the density has dropped by a $2^{-i}$-factor, we can similarly bound the contribution of each phase.

Decomposing AB
We finally turn to the full decomposition from Theorem 3.1. The idea is to use the one-part decomposition developed in the previous subsection as a black box to decompose A, and to decompose B via density increments/decrements. Unfortunately, the details of this step are much more intricate.
Let us first sketch an approach that will not work out as planned. In light of the previous subsection, the hope is that we can find a good rectangle in A together with a compatible good rectangle in B, and refer to the resulting triple of vertex sets as a good cube. If there were an algorithm to find good cubes, then we would easily obtain the desired decomposition: We repeatedly find a good cube, emit it as a part of the decomposition, and remove the edges in $Y^* \times Z^*$ from B.
However, we face serious problems trying to find a good cube. The natural idea is to use the density increment technique to find one; the first issue it runs into can be handled by recursing on the denser subgraph to find a good cube. The second issue, that $B[Y_\ell, Z^*]$ is not ϵ-min-degree, is more serious. In contrast to the regularity condition, we cannot enforce the min-degree condition for the whole graph at once. Our solution to this issue is somewhat reminiscent of the divide-and-conquer approaches for Triangle Detection. As outlined before, we can compute a good cube in which the relevant subgraphs are (ϵ, 2, d)-regular and ϵ-min-degree. By tweaking the parameters of the min-degree lemma, we can further achieve that $|Z_\ell| \geq (1 - \gamma)|Z^*|$ for some parameter γ to be determined soon. As before, for each good cube we emit $\{(X_\ell, Y_\ell, Z_\ell, A_\ell, B[Y_\ell, Z_\ell])\}$ as one part of the decomposition, and then remove the edges in $Y^* \times Z^*$ from B and repeat. However, we additionally recurse on all subinstances on the vertex parts $(X_\ell, Y_\ell, Z^* \setminus Z_\ell)$ to cover the edges missed in the previous parts. To control the cost caused by this additional layer of recursion, we need to guarantee that the subinstances we recurse on are sufficiently small, which is where the parameter γ enters.

Further Improvements?
While we have successfully achieved quasi-polynomial savings combinatorially for BMM (and many other problems), Conjecture 1.1 still stands, and the question remains whether we can do better. E.g., can we achieve savings of the form $2^{\Theta(\sqrt{\log n})}$ rather than $2^{\Theta(\sqrt[7]{\log n})}$, or possibly even truly polynomial savings?
Over random matrices in which each entry is 1 with probability p, we can solve BMM in $O(n^{2.5})$ time, which means that the general framework of worst-case to random-case reductions via regularity decompositions could go much further. However, it is not clear whether the specific notion of grid regularity can go beyond $2^{\Theta(\sqrt{\log n})}$ savings, because the known lower bounds for Triangle Removal (à la Behrend's construction [15,76]) seem to apply as well. We have focused on presenting our new technique in an easy and modular fashion rather than on optimizing the constant in the quasi-polynomial savings. It remains an interesting open question to fine-tune the parameters.

Sifting
In this section we describe the "sifting" algorithm that, given a graph A, either determines that A is regular or finds a subgraph of A that is denser than average.Kelley, Lovett and Meka [52] have proved this statement via a non-algorithmic proof that can rather easily be turned into a randomized algorithm.Our approach here differs from that original version, as our more ambitious goal is to obtain a deterministic sifting algorithm.We start with a simple inverse of Markov's inequality: Lemma 4.1 (Inverse of Markov's Inequality).Let Z be a random variable that takes values in [0, 1].Then, for any α ∈ [0, 1], The high-level idea behind the sifting algorithm is that we can either (1) find a denser subgraph by simply taking the high-degree nodes, or (2) recurse on a smaller subgraph with parameter k − 1.This idea is recorded in the next lemma.Here and for the remainder of this section we write Lemma 4.2 (Recursive Sifting).Let A ∈ {0, 1} X×Y , let δ, ϵ > 0 and k, ℓ ≥ 1 and assume that ∥A∥ U (k,ℓ) ≥ (1 + ϵ)δ.Then one of the following two cases applies: 2. or k > 1 and there is some x ∈ X such that: • deg A (x) ≥ δ k , and Proof.First consider the case k = 1.We prove that case 1 applies by sampling x ∈ X uniformly at random, and showing that with probability at least ϵ • δ ℓ this choice satisfies deg A (x) ≥ δ.Indeed, using the inverse Markov inequality: Next, let k > 1 and suppose that case 1 does not hold.We prove that selecting a uniformly random element x ∈ X satisfies case 2 with positive probability.In fact, we prove that a uniformly random element x ∈ X satisfies the following two stronger properties with positive probability: Clearly, (i) fails with probability at most ϵ 2 • δ kℓ (by the assumption that case 1 of the lemma statement does not hold).Considering property (ii), we first bound the following expectation: Recall that Y x is the set of neighbors of x.Hence, for any function f we can rewrite the expecta- Therefore, by the inverse Markov inequality, for a uniformly random x ∈ X property (ii) holds with probability at least By a union bound, both properties (i) and (ii) hold simultaneously with positive probability at least ϵ 2 • E[A] kℓ .Finally, consider an element x satisfying (i) and (ii); we show that x also satisfies the two conditions from the lemma statement.Since δ For the sifting algorithm we also need the following lemma about approximating ∥ • ∥ U (k,ℓ) .We postpone the deterministic proof of Lemma 4.3 to Section 4.1, and encourage the reader to instead think of Lemma 4.3 as the straightforward randomized algorithm (that subsamples X k × Y ℓ to approximately count the number of (k, ℓ)-bicliques).There is a deterministic algorithm that computes, for all x ∈ X, an approximation v x satisfying that v x = ∥A x ∥ U (k,ℓ) ± (α/ deg A (x) We are ready to prove Theorem 3.2.For convenience, we restate the statement here.Theorem 3.2 (Sifting).Let A ∈ {0, 1} X×Y , let ϵ > 0 and k, ℓ ≥ 1.There is an algorithm Sift(X, Y, A, ϵ, k, ℓ) that returns either Algorithm 4.1 Implements the algorithm from Theorem 3.2.
return "regular" From this alternative algorithm we can easily obtain the desired algorithm Sift(X, Y, A, ϵ, k, ℓ): Simply call and return Sift'(X, Y, A, , k, ℓ).We design Sift'(X, Y, A, δ, ϵ, k, ℓ) as a simple recursive algorithm; see Algorithm 4.1 for the pseudocode.In a first step (Lines 2 to 4) we construct the set X ′ ← {x ∈ X : deg A (x) ≥ δ} of highdegree nodes.If this set turns out to be sufficiently large, |X ′ | ≥ ϵ 2 • δ kℓ • |X|, we can successfully return X ′ and Y ′ ← Y .Otherwise, we distinguish two cases: If k = 1, then we simply report "regular" (Lines 5 to 6).If instead k > 1, then our principle strategy is to identify a node x ∈ X such that subgraph A x is as irregular as possible, and to recurse on that subgraph (Lines 8 to 10).Specifically, using Lemma 4.3 we compute approximations v x of ∥A x ∥ U (k−1,ℓ) , for all x ∈ X, with parameter α = ϵδ 2 2k 2 .We then select the element x ∈ X with deg A (x) ≥ δ k that maximizes v x .Finally, we recurse on Sift'(X, Y x , A x , δ, ϵ . (Here we tweak the parameter ϵ to account for the loss in the approximation.) 2 ).Indeed, in the base case we return the sets X ′ and Y ′ = Y with size at least It remains to prove that whenever the given graph is irregular, i.e. ∥A∥ U (k,ℓ) ≥ (1 + ϵ)δ, then our algorithm does not return "regular".On the one hand, if k = 1 then Lemma 4.2 implies that the set of high-degree nodes X ′ = {x ∈ X : deg A (x) ≥ δ} has size at least ϵ 2 • δ ℓ • |X|, and therefore the algorithm terminates in Line 4 by returning X ′ , Y .On the other hand, consider the case k > 1.Then either the algorithm terminates in Line 4, or Lemma 4.2 implies that there is some node By induction the recursive call does not return "regular".
Running Time. The recursion depth of the algorithm is at most k, hence it suffices to bound the running time of a single execution. The only costly step is the computation of the approximations of $\|A_x\|_{U(k-1,\ell)}$, which takes time $n^2 \cdot \alpha^{-O(k\ell(k+\ell))} = n^2 \cdot (\epsilon\delta/k)^{-O(k\ell(k+\ell))}$ by Lemma 4.3; all other steps can be implemented in time $O(n^2)$.

Deterministic Regularity Approximation
To obtain a deterministic algorithm, it remains to prove Lemma 4.3. As the key tool in our derandomization we rely on oblivious samplers as developed in an extensive line of research [26,41,77,39,63,46] (see also the survey [40]). For our application the exact dependence on the accuracy parameters ϵ, δ does not matter much, and we thus rely on one of the early constructions:

Lemma 4.4 (Oblivious Sampling [41]). Let X be a set and let δ, ϵ > 0. There is a deterministic algorithm computing, in time $|X| \cdot \mathrm{poly}(\epsilon^{-1}, \delta^{-1}, \log |X|)$, a family S of subsets S ⊆ X such that a uniformly random member of S estimates the average of any bounded function on X up to additive error ϵ, except with probability δ. We call S an (ϵ, δ)-oblivious sampler of X.

Lemma 4.6 (Regularity Approximation via Oblivious Sampling). Let $A \in \{0,1\}^{X \times Y}$, let δ, ϵ > 0 and k, ℓ ≥ 1, and let S, T be (ϵ, δ)-oblivious samplers of X and Y, respectively. Then the grid norm of A is approximated by the grid norms of the sampled submatrices A[S, T].

Proof. Fix $y_1, \dots, y_\ell \in Y$ and consider the function $f(x) = \prod_{j \in [\ell]} A(x, y_j)$. We sample a uniformly random set S ∈ S. Since S is an (ϵ, δ)-oblivious sampler of X, with probability at least 1 − δ the average of f over S is within ϵ of its average over X. Since moreover both expectations are [0, 1]-bounded, we conclude that they agree up to a small additive error in expectation over S. Now unfix $y_1, \dots, y_\ell \in Y$. From the previous consideration it follows that the grid norm of A[S, Y] approximates that of A. We can now apply the same argument again to A[S, Y], with the roles of X and Y interchanged, to obtain the analogous statement for A[S, T], and therefore the claim follows.
Next, we enumerate each x ∈ X and compute the corresponding grid norm estimate by enumerating each T ∈ T and each tuple $(y_1, \dots, y_\ell) \in T^{\ell}$. Finally, we return these values as the desired approximations $v_x$.
For the correctness, let $A'_x \in \{0,1\}^{X \times Y}$ be the matrix obtained from A where all columns $y \notin Y_x$ are zeroed out (in contrast to $A_x$, where we have deleted these columns). The grid norms of $A'_x$ and $A_x$ are related by a rescaling involving $\deg_A(x)$, and thus, by the previous Lemmas 4.5 and 4.6, the computed values approximate $\|A_x\|_{U(k,\ell)}$ as claimed.

Regularity Decompositions
In this section we establish the regularity decompositions (Theorems 3.1 and 3.3). The structure of this section closely follows the outline from Section 3.
We start with the following lemma stating that any graph can either be made ϵ-min-degree without losing many nodes, or we can find a denser subgraph.

The algorithm is deterministic and runs in time O(|X| |Y |).
Proof. Consider the following algorithm; for the pseudocode see Algorithm 5.1. Initially, we assign X′ ← X and A′ ← A. As long as there exists some x ∈ X′ with $\deg_{A'}(x) < (1-\epsilon)\,\mathbb{E}[A']$, we remove x from X′ and update A′. When this rule terminates we output the resulting set X′ (Case 1). If, however, we reach size $|X'| \leq (1 - \gamma) \cdot |X|$, then we stop the algorithm prematurely and return the set X′ from that stage of the algorithm (Case 2).
Correctness of Case 1. Suppose that the algorithm terminates in Case 1. It is clear that A′ is ϵ-min-degree. Moreover, one can easily verify that $\mathbb{E}[A'] \geq \mathbb{E}[A]$, since we have only removed nodes with degree smaller than average. Finally, since the algorithm has not stopped in Case 2 before, we indeed have $|X'| > (1 - \gamma) \cdot |X|$.

Correctness of Case 2. Suppose now that the algorithm terminates in Case 2. We first argue that the remaining graph has the claimed increased density. To this end, we let X′ and A′ be as when the algorithm terminates, which happens in the first iteration when $|X'| \leq (1 - \gamma) \cdot |X|$.

Since the density of A′ only increases over the course of the algorithm, and since we only remove nodes whose degree is below the $(1-\epsilon)$-threshold, the claimed density increment follows.
Algorithm 5.2 Implements the algorithm from Lemma 5.2.
1: procedure GoodRect(X, Y, A, ϵ, d) ... return GoodRect(X′, Y, A′, ϵ, d) ... return GoodRect(X′′, Y′′, A[X′′, Y′′], ϵ, d)

We precompute the degrees of all nodes in X and sort X according to these degrees. Throughout, we maintain the set X′ and the number of remaining edges |A′|. In each step we can find in constant time a node x ∈ X′ with degree below the threshold, if it exists (namely, the node in X′ with smallest degree).
We remark that, while the previous lemma only guarantees the left-sided min-degree condition, it is equally possible to guarantee the condition on both sides. However, we never need this stronger statement in our upcoming proofs and therefore stick to this simpler version.

A-Decomposition
In this subsection we prove Theorem 3.3 (i.e., the decomposition of a single bipartite graph into regular subgraphs). As outlined in Section 3, the proof consists of two steps: a method to find good rectangles via density increments (see Lemma 5.2), and a decomposition via density decrements that repeatedly removes good rectangles (see Theorem 3.3).
Correctness of Property 3. First assume that the algorithm does not recurse and returns X′, Y. In this case, Lemma 5.1 guarantees that |X′| is sufficiently large, where in the latter bound we used that log(1 + ϵ) ≥ ϵ for all ϵ ∈ (0, 1), and that initially $\mathbb{E}[A] \geq 2^{-d}$.

Running Time. The algorithm reaches recursion depth at most O(d/ϵ), which causes a negligible overhead in the running time. The call to MinDegree (Lemma 5.1) takes time $O(n^2)$, and the call to Sift (Theorem 3.2) takes time $n^2 \cdot \exp(d^3\,\mathrm{poly}(\epsilon^{-1}))$. All in all, the running time is $n^2 \cdot \exp(d^3\,\mathrm{poly}(\epsilon^{-1}))$ as claimed.

Theorem 3.3 (A-Decomposition).
Let $A \in \{0,1\}^{X \times Y}$, let ϵ ∈ (0, 1) and d ≥ 1. There is an algorithm ADecomposition(X, Y, A, ϵ, d) computing tuples $\{(X_\ell, Y_\ell, A_\ell)\}_{\ell=1}^L$ with $X_\ell \subseteq X$, $Y_\ell \subseteq Y$, and $A_\ell \in \{0,1\}^{X_\ell \times Y_\ell}$ satisfying four properties analogous to those of Theorem 3.1. It is easy to see that Property 1 holds: We either cover A entirely in the base case, or we cover some part $X^* \times Y^*$ and remove that part in the recursive call. Moreover, Property 2 easily follows from the guarantee of Lemma 5.2. It remains to prove Properties 3 and 4, and to analyze the running time.
Correctness of Property 3. For a collection of tuples S as returned by the algorithm, let us define $C(S) = \sum_{\ell} |X_\ell|\,|Y_\ell|$. Our goal is to prove that $C(S) \leq (d + 2) \cdot |X|\,|Y|$ for any input (X, Y, A), where S is the set returned by the algorithm. This is clear whenever $\mathbb{E}[A] \leq 2^{-d}$ (as then the algorithm returns the trivial partition {(X, Y, A)}). For denser inputs, the bound follows by induction over the recursion.

Correctness of Property 4. Note that L is the recursion depth of the algorithm. To prove that L is bounded as claimed, we argue that with every recursive call the density of the matrix decreases. Specifically, each output $X^*, Y^*$ has size at least an $\exp(-d^3\,\mathrm{poly}(\epsilon^{-1}))$-fraction of $|X|\,|Y|$, and the density of the submatrix $A[X^*, Y^*]$ is at least $\mathbb{E}[A] \geq 2^{-d}$ (otherwise, the algorithm had terminated already). Thus, each recursive call reduces the density of the graph by at least $\exp(-d^3\,\mathrm{poly}(\epsilon^{-1}))$, and the algorithm necessarily terminates after $L \leq \exp(d^3\,\mathrm{poly}(\epsilon^{-1}))$ recursive calls.
Running Time. We have already bounded the recursion depth in the previous paragraph, so we focus on a single execution. The dominant cost is the call to GoodRect (Lemma 5.2), which takes time $n^2 \cdot \exp(d^3\,\mathrm{poly}(\epsilon^{-1}))$.

AB-Decomposition
We finally turn to the proof of Theorem 3.1. The outline, as discussed in Section 3, follows the previous subsection on a high level (but differs in many more difficult technical aspects). We first devise a method to find good cubes via density increments (see Lemma 5.3), and then derive the decomposition via density decrements that repeatedly remove good cubes (see Theorem 3.1).
The algorithm is deterministic and runs in time $n^2 \cdot \gamma^{-1} \exp(d^3\,\mathrm{poly}(\epsilon^{-1}))$.

Proof. Let us start with a description of the algorithm; for the pseudocode see Algorithm 5.4. We first call MinDegree (Lemma 5.1) on (Y, Z, B) with min-degree parameter ϵγ/2, which either certifies the min-degree condition (yielding Y′ and B′) or finds a denser subgraph. In the latter case we simply recurse on the subinstance induced by X, Y′, Z (Lines 2 to 4).
Next, we run Theorem 3.3 to compute an edge decomposition $\{(X_\ell, Y_\ell, A_\ell)\}_{\ell=1}^L$ of (X, Y′, A′) (Line 5). The hope is that, for all pieces ℓ ∈ [L], the induced graphs $B[Y_\ell, Z]$ also satisfy the min-degree and regularity conditions, in which case we could return the decomposition without changes. To ensure both, we enumerate each ℓ ∈ [L] (Line 6). By calling MinDegree (Lemma 5.1) we ensure that either $B^T_\ell$ is ϵ-min-degree or $B_\ell$ has increased density. The former case is exactly the desired min-degree condition, and in the latter case we again recurse on the subinstance induced by X, $Y_\ell$, $Z_\ell$ (Lines 7 to 9). It remains to ensure regularity. To this end we call Sift($Z_\ell$, $Y_\ell$, $B^T_\ell$, ϵ, 2, d) (Theorem 3.2), which either certifies that $B^T_\ell$ is (ϵ, 2, d)-regular, or finds $Z'' \subseteq Z_\ell$ and $Y'' \subseteq Y_\ell$ such that the density of the induced subgraph $B_\ell[Y'', Z'']$ increases. In the latter case, again, we recurse (Lines 10 and 11).
If after all these tests the algorithm has not recursed, we finally return the computed decomposition. Some properties of the algorithm are easy to prove. For instance, Properties 1 and 7 follow immediately from Theorem 3.3. Property 2 is easy to prove as well: Theorem 3.3 implies that for each ℓ ∈ [L], we have that $\mathbb{E}[A_\ell] \leq 2^{-d}$ or that $A_\ell$ is ϵ-min-degree and (ϵ, 2, d)-regular. In addition, the algorithm only terminates and returns an output after certifying that, for all ℓ ∈ [L], $B^T_\ell$ is ϵ-min-degree and (ϵ, 2, d)-regular. The other properties require more work. We start with the following claim:

Claim. The algorithm only recurses on subgraphs of B with density at least $(1 + \Omega(\epsilon\gamma))\,\mathbb{E}[B]$.

Algorithm 5.4 Implements the algorithm from Lemma 5.3. 1: procedure GoodCube(X, Y, Z, A, B, ϵ, γ, d) ... return GoodCube(X, Y′, Z, A′, B′, ϵ, γ, d) ... for each ℓ ∈ [L] do ...

Proof. If the algorithm recurses in Lines 3 and 4, then the claim is immediate. After passing Line 4, Lemma 5.1 guarantees that the graph B′ is ϵγ/2-min-degree. From this min-degree condition we know that, for any set $Y_\ell \subseteq Y'$, the subgraph $B'[Y_\ell, Z]$ retains almost the full density. In particular, if the algorithm recurses after the min-degree test (Lines 7 to 9) then we recurse on a subgraph of the claimed density; here in the last step we used that γ ∈ (0, 1/2) and ϵ ∈ (0, 1). Similarly, if the algorithm recurses in Line 11 then the density is at least as large.

Running Time. We finally analyze the running time of the algorithm. As argued before, the recursion depth is bounded by $O(d\epsilon^{-1}\gamma^{-1})$. In each execution of Algorithm 5.4 we call ADecomposition once, which takes time $n^2 \cdot \exp(d^3\,\mathrm{poly}(\epsilon^{-1}))$. In addition we call Sift L times, taking time $L \cdot n^2 \cdot \exp(d^2\,\mathrm{poly}(\epsilon^{-1}))$. The total running time becomes $n^2 \cdot \gamma^{-1} \exp(d^3\,\mathrm{poly}(\epsilon^{-1}))$.
There is an algorithm ABDecomposition(X, Y, Z, A, B, ϵ, d) that computes a collection of tuples $\{(X_k, Y_k, Z_k, A_k, B_k)\}_{k=1}^K$ such that:
1. $AB = \sum_{k \in [K]} A_k B_k$ (interpreting each term as an X × Z-matrix by padding with zeros).
2. For all k ∈ [K]: (i) $\mathbb{E}[A_k] \leq 2^{-d}$ or $\mathbb{E}[B_k] \leq 2^{-d}$, or (ii) $A_k$ and $B_k^T$ are both (ϵ, 2, d)-regular and ϵ-min-degree.
3. The total size $\sum_{k \in [K]} |X_k|\,|Y_k|\,|Z_k|$ is bounded in terms of $|X|\,|Y|\,|Z|$ and $\mathrm{poly}(d)$.
4. $K \leq \exp(d^7\,\mathrm{poly}(\epsilon^{-1}))$.
The algorithm is deterministic and runs in time $n^2 \cdot \exp(d^7\,\mathrm{poly}(\epsilon^{-1}))$.

Proof. We start with the description of the algorithm; see Algorithm 5.5 for the pseudocode. Throughout we assume an additional input parameter 0 ≤ h ≤ d which acts somewhat as the recursion depth of the algorithm. For the initial call, we set h = 0. The algorithm has two base cases. If B is sufficiently sparse, $\mathbb{E}[B] \leq 2^{-d}$, then we return the trivial decomposition {(X, Y, Z, A, B)} (Lines 2 and 3). Moreover, if the algorithm has reached recursion depth h = d, then we return a trivial decomposition of size at most $2^d$. Specifically, we split B into submatrices $B_1, \dots, B_{2^d} \in \{0,1\}^{Y \times Z}$, each of density at most $2^{-d}$, and return the decomposition $\{(X, Y, Z, A, B_i)\}_{i=1}^{2^d}$ (Lines 4 to 6). Otherwise, run GoodCube(X, Y, Z, A, B, ϵ, γ, d) (Lemma 5.3) with a parameter γ to be fixed below. We then recursively compute, for each ℓ ∈ [L], the decomposition $S_\ell$ of $(X_\ell, Y_\ell, Z^* \setminus Z_\ell, A_\ell, B'_\ell)$ (Lines 8 to 10). Moreover, we recursively compute the decomposition $S^*$ of $(X, Y, Z, A, B - B[Y^*, Z^*])$ (Line 11; here, we denote by $B - B[Y^*, Z^*]$ the matrix obtained from B after zeroing out the entries in $Y^* \times Z^*$).

Correctness of Property 1. For a set S as returned by the algorithm, we write $\Sigma(S) = \sum_{(X', Y', Z', A', B') \in S} A'B'$, where, as in the theorem statement, we interpret each term A′B′ in the sum as an X × Z-matrix by extending with zeros. The goal is to prove that Σ(S) = AB, where S is the set returned by the algorithm on input (X, Y, Z, A, B).

Algorithm 5.5 Implements the algorithm from Theorem 3.1.
Arbitrarily partition B into submatrices ... for each ℓ ∈ [L] do ... return ABDecomposition'(X, Y, Z, A, B, ϵ, d, 0)

This is clear in both base cases. So assume that the algorithm recurses. Then, by induction, the parts emitted in this call together with the recursively computed decompositions cover AB exactly; here, in the second-to-last step of the calculation we have applied Property 1 of Lemma 5.3.

Correctness of Property 2.
In both base cases we return partitions in which all parts $B_k$ are sparse, i.e., $\mathbb{E}[B_k] \leq 2^{-d}$. In the recursive case the output consists of the union of three different sets: For each $(X_\ell, Y_\ell, Z_\ell, A_\ell, B_\ell)$ the claim follows from Property 2 of Lemma 5.3, and for each element in $S_\ell$ or $S^*$ the claim holds by induction.
Correctness of Property 3. For a collection S of tuples as returned by the algorithm, let us write $C(S) = \sum_{(X', Y', Z', A', B') \in S} |X'|\,|Y'|\,|Z'|$. Our goal is to prove that C(S) satisfies the bound from Property 3, where S is the output of our algorithm with inputs (X, Y, Z, A, B, ϵ, d, h). In the sparse case, if $\mathbb{E}[B] \leq 2^{-d}$, then we return the trivial decomposition with $C(S) = |X|\,|Y|\,|Z|$. We prove by induction that otherwise the claimed bound applies. Clearly, this upper bound is true if h = d, so suppose that h < d. Then the algorithm recurses, and C(S) splits into three contributions: the good cubes emitted in this call, the recursive decompositions $S_\ell$, and the recursive decomposition $S^*$. In the following we bound these three contributions individually.
For the first contribution we readily exploit Property 4 of Lemma 5.3. For the second contribution we exploit Property 5 of Lemma 5.3; here, in the last step, we have used our choice of γ. Summing over all three contributions (1), (2) and (3) yields the claimed bound on C(S).
Correctness of Property 4.
Here, a quantity M is associated with the sets Y*, Z* returned by Lemma 5.3. Moreover, let $L = \exp(d^3\,\mathrm{poly}(\epsilon^{-1}))$ be as in Lemma 5.3. We prove by induction a bound on |S|, where S is the set returned by the algorithm. This bound is easily verified in the two base cases.
If the algorithm recurses instead, then the bound follows by combining the inductive hypotheses for the recursive calls. In the end we plug in values for L and M to obtain $|S| \leq \exp(d^5 \gamma^{-1}\,\mathrm{poly}(\epsilon^{-1})) = \exp(d^7\,\mathrm{poly}(\epsilon^{-1}))$, as stated.
Running Time. From the previous consideration we also learn that the number of recursive calls is bounded by $\exp(d^7\,\mathrm{poly}(\epsilon^{-1}))$. In each recursive call, the dominant step is to call GoodCube in time $n^2 \cdot \exp(d^3\,\mathrm{poly}(\epsilon^{-1}))$. Hence, the total time is $n^2 \cdot \exp(d^7\,\mathrm{poly}(\epsilon^{-1}))$.

Boolean Matrix Multiplication and Triangle Detection
In this section we formally derive our efficient algorithm for Boolean Matrix Multiplication from the regularity decompositions developed in the previous sections. Our algorithm relies on the following fine-grained reduction from BMM to Triangle Detection due to Vassilevska Williams and Williams [72]: Lemma 6.1 (Boolean Matrix Multiplication to Triangle Detection, [72]). If Triangle Detection is in time $O(n^3/f(n))$ (for some nondecreasing function f(n)), then Boolean Matrix Multiplication is in time $O(n^3/f(n^{1/3}))$.

Theorem 6.2 (Triangle Detection).
There is a deterministic combinatorial algorithm detecting whether a graph contains a triangle in time $n^3/2^{\Omega(\sqrt[7]{\log n})}$.
Proof. We assume without loss of generality that the input graph is tripartite, (X, Y, Z, A, B, C). Let ϵ = 1/160 and let d ≥ 1 be a parameter to be determined later. Using Theorem 3.1, we compute the decomposition $\{(X_k, Y_k, Z_k, A_k, B_k)\}_{k=1}^K$ and additionally set $C_k = C[X_k, Z_k]$ for each k ∈ [K].
For each piece we distinguish two cases: If $\mathbb{E}[A_k] \leq 2^{-d}$, $\mathbb{E}[B_k] \leq 2^{-d}$, or $\mathbb{E}[C_k] \leq 2^{-\epsilon d/2}$, then we search the subinstance for a triangle in time $|X_k|\,|Y_k|\,|Z_k|/2^{\Omega(d)}$ by exploiting the respective sparseness. Otherwise, we immediately report that a triangle exists.
For the correctness, first observe that there is a triangle in the given graph if and only if there exists (x, z) ∈ X × Z with (AB)(x, z) ≥ 1 and C(x, z) = 1. Since Property 1 of Theorem 3.1 guarantees that $AB = \sum_k A_k B_k$, there is a triangle in the original graph if and only if there is some k ∈ [K] and $(x, z) \in X_k \times Z_k$ with $(A_k B_k)(x, z) \geq 1$ and $C_k(x, z) = 1$. The correctness of the first case is thus clear. But it remains to argue that if $\mathbb{E}[A_k], \mathbb{E}[B_k] > 2^{-d}$ and $\mathbb{E}[C_k] > 2^{-\epsilon d/2}$, then there exists a triangle in $(X_k, Y_k, Z_k, A_k, B_k, C_k)$. Indeed, by Property 2 of Theorem 3.1 and the first two assumptions, we have that $A_k$ and $B_k^T$ are (ϵ, 2, d)-regular and ϵ-min-degree. In this case, Theorem 2.1 implies that $A_k \bullet B_k$ is $(\mathbb{E}[A_k]\,\mathbb{E}[B_k], 80\epsilon, 2^{-\epsilon d/2})$-uniform. By definition, this means that not more than a $2^{-\epsilon d/2}$-fraction of the entries in $A_k \bullet B_k$ fall below $(1-80\epsilon)\,\mathbb{E}[A_k]\,\mathbb{E}[B_k]$. In particular (since we have $\mathbb{E}[A_k], \mathbb{E}[B_k] > 0$), it follows that at most a $2^{-\epsilon d/2}$-fraction of the entries in $A_k \bullet B_k$ are zero. Using finally that $\mathbb{E}[C_k] > 2^{-\epsilon d/2}$, we conclude that there exists some common entry $(x, z) \in X_k \times Z_k$ where both $(A_k B_k)(x, z) \geq 1$ and $C_k(x, z) = 1$.
Let us finally analyze the running time. Detecting a triangle in each sparse subinstance takes total time $n^3/2^{\Omega(d)}$, using Property 3 of Theorem 3.1. Furthermore, precomputing the regularity decomposition takes time $n^2 \cdot \exp(d^7\,\mathrm{poly}(\epsilon^{-1}))$. This running time is optimized by picking $d = \Theta(\sqrt[7]{\log n})$, where the constant is sufficiently small such that the preprocessing time becomes $O(n^{2.1})$, say. For this choice, the total running time is indeed $n^3/2^{\Omega(\sqrt[7]{\log n})}$. Our main Theorem 1.2 is immediate by combining Lemma 6.1 and Theorem 6.2.
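For concreteness, the parameter balancing in the last step can be written out as follows, with c > 0 a sufficiently small constant and ϵ = 1/160 fixed:

```latex
% Balancing the two contributions to the running time with d = c (\log n)^{1/7}:
\[
  \underbrace{n^2 \cdot \exp\!\big(O(d^7\,\mathrm{poly}(\epsilon^{-1}))\big)}_{\text{precomputation}}
  \;=\; n^2 \cdot n^{O(c^7)} \;\leq\; O(n^{2.1}),
  \qquad
  \underbrace{\frac{n^3}{2^{\Omega(d)}}}_{\text{sparse subinstances}}
  \;=\; \frac{n^3}{2^{\Omega((\log n)^{1/7})}},
\]
% so the second term dominates and the total time is n^3 / 2^{\Omega((\log n)^{1/7})}.
```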

Triangle Enumeration
In this section we give an improved algorithm for enumerating triangles in graphs, based on our previous decomposition theorems:

Theorem 1.3 (Triangle Enumeration Algorithm, restated).
There is a deterministic algorithm that preprocesses a given graph in time $n^3/(\log n)^6 \cdot (\log\log n)^{O(1)}$ and then enumerates all triangles with constant delay.
We make an important distinction: A triangle listing algorithm receives as input a graph and returns as output a list of its t triangles -- here, we care about triangle listing algorithms with running times of the form $O(n^3/f(n) + t)$. A triangle enumeration algorithm first preprocesses a graph in time $O(n^3/f(n))$. Afterwards, it can enumerate all triangles in the graph with constant delay (i.e., upon query, the algorithm spends time O(1) to report the next triangle). For the majority of this section we work with triangle listing algorithms, but in Section 7.3 we show that both types are equivalent in our context. We structure the remainder of this section as follows: We quickly give the main idea of our algorithm in Section 7.1, with details following in Section 7.2. Finally, in Section 7.4 we include a proof that further improvements to our enumeration algorithm would entail a 3-SUM algorithm that is faster than what is currently known.
1. We arbitrarily partition X into I = ⌈|X|/s⌉ groups $X_1, \dots, X_I$ of size at most s; similarly partition Z into J = ⌈|Z|/s⌉ groups $Z_1, \dots, Z_J$ of size at most s.
2. We enumerate each tuple (i, j, S, T) where i ∈ [I], j ∈ [J] and where $S \subseteq X_i$, $T \subseteq Z_j$ have size |S|, |T| ≤ r. Note that $\log(s^r) = r \log s \leq \frac{\log n}{1000 \log\log n} \cdot 100 \log\log n = \frac{\log n}{10}$; therefore, there are at most $n \cdot n \cdot n^{0.1} \cdot n^{0.1} = n^{2.2}$ such tuples (i, j, S, T). Moreover, we can encode each such tuple in O(log n) bits, which takes O(1) machine words (one concrete encoding is sketched after this list). For each tuple (i, j, S, T) we prepare a list of all edges in C[S, T] and store a pointer (of constant word size) to this list.
3. Next, we enumerate each tuple (y, i, j) where y ∈ Y, i ∈ [I] and j ∈ [J]. We partition the set $N_A(y) \cap X_i$ (i.e., the set of neighbors of y in $X_i$) arbitrarily into subsets S of size r (plus possibly one subset of size less than r); let S(y, i) denote the resulting partition. Similarly, we partition $N_B(y) \cap Z_j$ into subsets of size at most r (plus possibly one subset of size less than r); let T(y, j) denote the resulting partition. Now, for each pair S ∈ S(y, i), T ∈ T(y, j), we query the list of edges associated to the tuple (i, j, S, T). We enumerate each edge (x, z) in this list associated to (i, j, S, T) and store the triangle (x, y, z).
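As mentioned in step 2, one concrete way to realize the O(log n)-bit encoding is to list each set's members as fixed-width offsets within their group; the helper below is an illustrative layout of our own choosing, not the paper's exact representation.

```python
def encode_subset(members, group_start, s_bits):
    """Pack a subset S (|S| <= r) of one size-s group into an integer key: list the
    members' offsets within the group using s_bits = ceil(log2 s) bits each, preceded
    by a leading 1 bit so that subsets of different sizes never collide. Total key
    length is 1 + r * s_bits <= 1 + (log n)/10 bits, i.e. O(1) machine words."""
    key = 1
    for v in sorted(members):
        key = (key << s_bits) | (v - group_start)
    return key

# Usage idea (names are hypothetical): the lookup table of step 2 can be a dict keyed by
# (i, j, encode_subset(S, start_X[i], s_bits), encode_subset(T, start_Z[j], s_bits)).
```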
It is easy to verify that this algorithm lists all triangles in G. Let us focus on the running time; the time bound from the lemma statement follows by bounding the cost of each step and plugging in the chosen parameters s and r. Of course, the exact bit representation matters. The easiest option here is to represent each set S as a (sorted) list of its at most r elements. As each element can be represented using log s bits, this representation indeed takes $r \log s \leq \tfrac{1}{10} \log n$ bits in total.

Theorem 7.2 (Triangle Listing).
There is a deterministic algorithm that lists all t triangles in a given graph in time $O(n^3/(\log n)^6 \cdot (\log\log n)^{O(1)} + t)$.
Proof. Let G = (X, Y, Z, A, B, C) be a given tripartite graph. We design a recursive algorithm that lists all triangles in G. To this end, we maintain one (global) list of triangles, and each recursive call of the algorithm appends triangles to this list. Let n denote the total number of nodes in the original graph G (at the top level of the recursion), and let γ, δ, ϵ ∈ (0, 1) and d ≥ 4/ϵ be parameters to be determined later. The first step is to call Theorem 3.1 on input (X, Y, Z, A, B, ϵ, d) to compute a decomposition into subgraphs $G_k = (X_k, Y_k, Z_k, A_k, B_k, C_k)$. By the guarantee of Theorem 3.1 this decomposition partitions the set of triangles in G, and thus the remaining goal is to separately list all triangles in the graphs $G_k$. To this end, for each k ∈ [K], we distinguish the following cases. In the dense case, we further subdivide the graph $G_k$ based on the approximate degrees in $X_k$ with respect to $Z_k$. Specifically, let L = ⌈ϵd/2⌉ and split $X_k$ into buckets $X_{k,1}, \dots, X_{k,L}$ according to these degrees. For each ℓ ∈ [L], we define the matrix $C_{k,\ell} \in \{0,1\}^{X_k \times Z_k}$ as the submatrix of $C_k$ obtained by zeroing out all rows not in $X_{k,\ell}$; our remaining goal is to list the disjoint union of triangles in the graphs $G_{k,\ell}$. For each ℓ ∈ [L], we in turn distinguish three subcases. (In the first subcase, it is easy to check that the triangles in the constructed graph are exactly the triangles in $G_{k,\ell}$.) (In the second subcase, it is easy to check that the triangles in the constructed graph are in one-to-one correspondence with the triangles in $G_{k,\ell}$; note that we have exchanged the node and edge sets such that Lemma 7.1 benefits from minimizing the relevant quantity.) The third subcase is symmetric to the previous one. More precisely, we can reduce to the previous case by considering instead the graph $(Z_k, Y_k, X_k, B_k^T, A_k^T, C_k^T)$, whose triangles are clearly in one-to-one correspondence with those in $G_k$.
Finally, we add one more rule to the algorithm: as soon as we reach recursion depth $H$ (for some parameter $H$ to be determined), we simply solve the instance by brute force in time $O(|X|\,|Y|\,|Z|)$. This completes the description of the algorithm. The correctness should be clear from the in-text explanations.
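For concreteness, a minimal Python sketch of the brute-force routine used at recursion depth $H$ (the representation of the tripartite graph as three vertex lists and three sets of edge pairs is an assumption of this sketch):

    def brute_force_triangles(X, Y, Z, A, B, C):
        # List all triangles of the tripartite graph (X, Y, Z, A, B, C) in
        # O(|X| |Y| |Z|) time; A, B, C are sets of pairs from X x Y, Y x Z, X x Z.
        return [(x, y, z)
                for x in X for y in Y for z in Z
                if (x, y) in A and (y, z) in B and (x, z) in C]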
Running Time. It remains to bound the running time. For simplicity, we fix all the parameters here, and then analyze the cases individually.

Claim 7.4 (Case 2). The total running time of Case 2 is bounded as follows, where $t_2$ is the number of triangles listed in Case 2.
Proof. We can assume that $\mathbb{E}[A_k], \mathbb{E}[B_k] > 2^{-d}$ whenever we enter Case 2 (since Case 1 did not apply). Thus, the regularity decomposition (Theorem 3.1) guarantees that $A_k$ and $B_k^T$ are both $(\epsilon, 2, d)$-regular and $\epsilon$-min-degree. As a corollary of Theorem 2.1 we obtain a corresponding lower bound on the number of triangles in $G_k$. By Lemma 7.1 and Theorem 3.1, it follows that Case 2 indeed takes the claimed total time.

Proof (of Claim 7.5, stated below). Focus on some pair $(k, \ell)$ that falls into Case 3.3. The analysis of this case is inspired by the previous algorithm for purely random graphs. Our goal is to prove that both (i) the number of 2-paths in $G_{k,\ell}$ (via the edge parts $A_k$ and $C_{k,\ell}$) and (ii) the number of triangles in $G_{k,\ell}$ behave as if $G_{k,\ell}$ were purely random. We start with (i): by the construction of $C_{k,\ell}$, the density of $C_{k,\ell}$ is bounded from below, and therefore we can bound the number of 2-paths. Next, we turn to (ii) and bound the number of triangles in $G_{k,\ell}$ from below: using Theorem 2.1 we infer that at most a $2^{-\epsilon d/2} \le 2^{-L}$ fraction of the entries deviate from this random-like behavior. By combining both statements, we can bound the running time of Lemma 7.1. By our choice of parameters, $2\gamma L (d+2)^2 = \frac{1}{4}$. Therefore, the total running time (obtained as the sum of Equations (4) and (5)) is bounded as claimed. To obtain the time bound claimed in the theorem statement, recall that we initially call the algorithm at recursion depth $h = 0$, that $K \le \exp(\operatorname{poly}(d, \epsilon^{-1})) = \exp(\operatorname{poly}(\log\log n))$, and that $H = O(\log\log n)$. Hence, the term $(2KL)^H \cdot n^{2.3} \le n^{2.3+o(1)}$ is negligible in the total running time.
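To spell out the last bound (using, as above, that $H = O(\log\log n)$, $K \le \exp(\operatorname{poly}(\log\log n))$, and that $L \le d = \operatorname{poly}(\log\log n)$ under the stated parameter choices):

$$(2KL)^{H} = \exp\bigl(H \cdot (\log 2 + \log K + \log L)\bigr) \le \exp\bigl(O(\log\log n) \cdot \operatorname{poly}(\log\log n)\bigr) = \exp(\operatorname{poly}(\log\log n)) = n^{o(1)},$$

and hence $(2KL)^{H} \cdot n^{2.3} \le n^{2.3 + o(1)}$.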

Triangle Enumeration versus Triangle Listing
We finally turn our triangle listing algorithm into an enumeration algorithm. In fact, we prove the following equivalence (Lemma 7.7 below): a triangle enumeration algorithm with preprocessing time $O(n^3/(\log n)^{6+\epsilon})$ and constant delay entails an algorithm for the 3-SUM problem in expected time $O(n^2/(\log n)^{2+\epsilon'})$ (i.e., a $(\log n)^{\epsilon'}$ improvement over the Baran-Demaine-Pătraşcu algorithm). Since this statement is not explicit in [54], we devote this section to an almost-self-contained proof. Specifically, we prove the following statement:

Lemma 7.9 (Reducing 3-SUM to Triangle Listing). Let $\alpha \ge 0$. If there is a randomized algorithm listing all $t$ triangles in a graph in expected time $O(n^3/f(n) + t)$ (for some computable nondecreasing function $f(n)$), then there is a randomized 3-SUM algorithm in expected time $O(n^2/f(n^{2/3})^{1/3})$.

The proof relies on a standard lemma on linear hashing (Lemma 7.10 below).
Running Time.Sampling solutions and constructing the graph G takes negligible time, so the critical contribution is the running time of the listing algorithm.To this end, we first bound the expected number of triangles in the graph.
Claim. The expected number of triangles in $G$ is $O(n + n^3/m^3)$.

Proof. Fix a triple $a, b, c \in S$. We analyze how many triangles $(x, y, z)$ are labeled with $(a, b, c)$. Recall that for $(x, y, z)$ to be a triangle, it satisfies nine constraints, three for each edge. Focus on the constraints involving $h_1$, and recall that necessarily $z_1 = h_1(0)$. Thus, the second and third constraints imply that there are only $|\Phi| = O(1)$ feasible choices for $x_1$ and $y_1$. In particular, it follows that there are at most $O(1)$ triangles labeled with $(a, b, c)$. Next, let us write $k\Phi = \{\phi_1 + \cdots + \phi_k : \phi_1, \ldots, \phi_k \in \Phi\}$. Summing all three constraints, we obtain that $3h_1(0) - h_1(a) - h_1(b) - h_1(c) \in 3\Phi$, and by applying the almost-linearity twice, it follows that $h_1(0) - h_1(a + b + c) \in 5\Phi$. By the analogous argument for $h_2$ and $h_3$, we conclude that there is a triangle labeled with $(a, b, c)$ only if $h_1(0) - h_1(a + b + c) \in 5\Phi$, $h_2(0) - h_2(a + b + c) \in 5\Phi$, and $h_3(0) - h_3(a + b + c) \in 5\Phi$. Finally, to bound the expected number of triangles, we distinguish two cases. On the one hand, the number of triples with $a + b + c = 0$ is at most $O(n)$, as otherwise, with high probability, we would have detected a 3-SUM solution in the first step of the algorithm. Each such triple contributes at most $O(1)$ triangles. On the other hand, there are up to $n^3$ triples with $a + b + c \neq 0$. Each such triple contributes at most $O(1)$ triangles, and only if the final three constraints are satisfied. By pairwise independence, each constraint holds with probability at most $O(\frac{|\Phi|}{m}) = O(\frac{1}{m})$, and the three constraints are independent. It follows that these triples contribute $O(\frac{n^3}{m^3})$ triangles in expectation.

Observe that the graph $G$ has $O(m^2)$ vertices. Therefore, given the previous claim, the total expected running time is bounded by $O(m^6/f(m^2) + n^3/m^3)$. By choosing $m = \lceil n^{1/3} \cdot f(n^{2/3})^{1/9} \rceil$, this becomes $O(n^2/f(n^{2/3})^{1/3})$ (using that $f$ is nondecreasing).
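Spelling out this last calculation (up to constant factors, and using that $m^2 \ge n^{2/3}$ and that $f$ is nondecreasing):

$$\frac{m^{6}}{f(m^{2})} = O\!\left(\frac{n^{2} \cdot f(n^{2/3})^{2/3}}{f(n^{2/3})}\right) = O\!\left(\frac{n^{2}}{f(n^{2/3})^{1/3}}\right), \qquad \frac{n^{3}}{m^{3}} \le \frac{n^{3}}{n \cdot f(n^{2/3})^{1/3}} = \frac{n^{2}}{f(n^{2/3})^{1/3}}.$$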
… for each phase. After the $d$-th phase the process has reduced the density of the remaining graph to at most $2^{-d}$, and the process terminates. Therefore, all in all, we have $\sum_{\ell=1}^{L} |X_\ell|\,|Y_\ell| \le 2d \cdot |X|\,|Y|$. (In the formal proof we obtain a slightly sharper bound.) Suppose now that $A[X, Y^*]$ is $(\epsilon, 2, d)$-regular and $\epsilon$-min-degree. Then we can apply Theorem 3.3 to decompose the matrix $A[X, Y^*]$ into pieces $\{(X_\ell, Y_\ell, A_\ell)\}_{\ell=1}^{L}$. However, the subgraphs $B[Y_\ell, Z^*]$ are not necessarily $(\epsilon, 2, d)$-regular and $\epsilon$-min-degree. The first issue is fixable: using the sifting algorithm we can test whether all subgraphs $B[Y_\ell, Z^*]$ are $(\epsilon, 2, d)$-regular; if any such subgraph fails this test, then Theorem 3.2 instead returns a denser subgraph of $B[Y_\ell, Z^*]$.
…, say. And indeed, using that $\sum_{\ell=1}^{L} |X_\ell|\,|Y_\ell| \le \operatorname{poly}(d) \cdot |X|\,|Y|$ and by setting $\gamma = \frac{1}{\operatorname{poly}(d)}$ small enough, this can be achieved. The overhead nevertheless leads to a significant blow-up in the dependence on $d$, from $\exp(d^3)$ to $\exp(d^7)$. See Section 5 for the details.

Claim 7.3 (Cases 1 and 3.1). The total running time of Cases 1 and 3.1 is $O(|X|\,|Y|\,|Z|/(\log n)^{99})$.

Proof. The algorithm deals with Cases 1 and 3.1 in time $O(|X_k|\,|Y_k|\,|Z_k|/2^{\epsilon d/4})$, which, by our choice of the parameters $\epsilon, d$, is $O(|X_k|\,|Y_k|\,|Z_k|/(\log n)^{100})$. In total, taking into account all $k \in [K]$, and possibly the at most $L$ repetitions of Case 3.1, by Theorem 3.1 this becomes the claimed bound $O(|X|\,|Y|\,|Z|/(\log n)^{99})$.
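Spelling out this final summation (under the assumption, indicated by the appeal to Theorem 3.1, that $\sum_{k \in [K]} |X_k|\,|Y_k|\,|Z_k| = O(|X|\,|Y|\,|Z|)$, and using that $L \le d = \operatorname{poly}(\log\log n) \le \log n$):

$$\sum_{k \in [K]} L \cdot O\!\left(\frac{|X_k|\,|Y_k|\,|Z_k|}{(\log n)^{100}}\right) = O\!\left(\frac{L}{(\log n)^{100}} \cdot |X|\,|Y|\,|Z|\right) = O\!\left(\frac{|X|\,|Y|\,|Z|}{(\log n)^{99}}\right).$$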

Lemma 7.7 (Equivalence of Triangle Enumeration and Listing). The following equivalences hold (in terms of deterministic algorithms):

Lemma 7.10 (Linear Hashing [28]). Let $n \ge m \ge 1$. There is a family $\mathcal{H} = \{h : [-n\,..\,n] \to [m]\}$ of hash functions that can be sampled in expected time $\operatorname{poly}(\log n)$ and evaluated in constant time, and that satisfies the following properties:
• (Almost-Linearity) There exists some constant-size set $\Phi$ such that for all $h \in \mathcal{H}$ and all keys $a, b \in [-n\,..\,n]$, we have $h(a + b) - h(a) - h(b) + h(0) \in \Phi$.
• (Pairwise Independence) For all distinct keys $a, b \in [-n\,..\,n]$ and buckets $x, y \in [m]$: $\Pr_{h \in \mathcal{H}}[h(a) = x \text{ and } h(b) = y] \le O(\frac{1}{m^2})$.

Proof of Lemma 7.9. Let $S$ denote the given 3-SUM instance, and assume that $S \subseteq [-n^c\,..\,n^c]$ for some constant $c$. As a first step, we randomly sample $O(n \log n)$ pairs $a, b \in S$ and test whether $-a - b \in S$. If we find a solution in this step, we stop and return "yes". Otherwise, sample three linear hash functions $h_1, h_2, h_3 : [-n^c\,..\,n^c] \to [m]$ as in the previous lemma, and construct the following tripartite graph $G = (X, Y, Z, A, B, C)$. As vertex parts, we take $X = [m] \times [m] \times \{h_3(0)\}$, $Y = [m] \times \{h_2(0)\} \times [m]$, $Z = \{h_1(0)\} \times [m] \times [m]$. We add an edge $(x, y) \in X \times Y$ to $A$ if and only if there is some $a \in S$ such that the three associated hash constraints (involving $h_1, h_2, h_3$ and $a$) are satisfied; in this case, we say that the edge $(x, y)$ is labeled with $a$. We add edges to $B$ and $C$ in the analogous way. We then use the efficient triangle listing algorithm to list all triangles in $G$ (viewing $G$ as an unlabeled graph). For each triangle $(x, y, z)$ that is returned, we test whether it has three edge labels $(a, b, c)$ that satisfy $a + b + c = 0$. If so, we report "yes"; if no such triangle is found, we return "no".
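To illustrate the almost-linearity property from Lemma 7.10, here is a sketch of one standard multiply-shift family (whether this is the exact family used in [28] is an assumption of this sketch, and we do not verify the pairwise-independence bound here; we only demonstrate that $h(a+b) - h(a) - h(b) + h(0)$ falls into a set $\Phi$ of four fixed values):

    import random

    def sample_multiply_shift(w, ell):
        # One hash function h : Z -> [0, 2^ell) of the multiply-shift family:
        # multiply by a random odd w-bit number, reduce mod 2^w, keep the top ell bits.
        a = random.randrange(1, 2 ** w, 2)
        def h(x):
            return ((a * x) % (2 ** w)) >> (w - ell)
        return h

    # Almost-linearity check: h(x + y) - h(x) - h(y) + h(0) always lies in
    # Phi = {0, 1, -m, 1 - m} where m = 2^ell (a set of constant size 4).
    w, ell = 64, 10
    m = 2 ** ell
    h = sample_multiply_shift(w, ell)
    Phi = {0, 1, -m, 1 - m}
    for _ in range(10000):
        x, y = random.randrange(-10**9, 10**9), random.randrange(-10**9, 10**9)
        assert h(x + y) - h(x) - h(y) + h(0) in Phi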

Compute approximations $v_x$ of $\|A_x\|_{U(k-1,\ell)}$ by Lemma 4.3 with parameter $\alpha = \frac{\epsilon\delta}{2}$.

Running Time. It is easy to check that this algorithm can be implemented in time $O(|X|\,|Y|)$. In the base case we have $|X^*| \ge \frac{1}{2}|X|$, and thus $|X^*|\,|Y^*| \ge \frac{1}{2}|X|\,|Y|$. Next, consider the recursive cases. By Lemma 5.1 and Theorem 3.2, in both cases we recurse on a rectangle of size at least $\min\{\frac{1}{2}, \frac{\epsilon}{16} \cdot \frac{\mathbb{E}[A]}{2d}\} \cdot |X|\,|Y| = \frac{\epsilon}{16} \cdot \frac{\mathbb{E}[A]}{2d} \cdot |X|\,|Y|$ and with density at least $(1 + \frac{\epsilon}{2})\,\mathbb{E}[A]$. It follows by induction that the algorithm returns a rectangle of the desired size.
Claim 7.5 (Case 3.3). The total running time of Case 3.3 is $O\!\left(KLn^{2.3} + \frac{|X|\,|Y|\,|Z|}{(\log n)^{99}} + t_{3.3}\right)$, where $t_{3.3}$ is the number of triangles listed in Case 3.3.