Optimal algebraic Breadth-First Search for sparse graphs

There has been a rise in the popularity of algebraic methods for graph algorithms given the development of the GraphBLAS library and other sparse matrix methods. An exemplar for these approaches is Breadth-First Search (BFS). The algebraic BFS algorithm is simply a recurrence of matrix-vector multiplications with the $n \times n$ adjacency matrix, but the many redundant operations over nonzeros ultimately lead to suboptimal performance. Therefore an optimal algebraic BFS should be of keen interest, especially if it is easily integrated with existing matrix methods. Current methods, notably in the GraphBLAS, use a Sparse Matrix masked-Sparse Vector (SpMmSpV) multiplication in which the input vector is kept in a sparse representation in each step of the BFS, and nonzeros in the vector are masked in subsequent steps. This has been an area of recent research in GraphBLAS and other libraries. While in theory these masking methods are asymptotically optimal on sparse graphs, many add work that leads to suboptimal runtime. We give a new optimal, algebraic BFS for sparse graphs, thus closing a gap in the literature. Our method multiplies progressively smaller submatrices of the adjacency matrix at each step. Let $n$ and $m$ refer to the number of vertices and edges, respectively. On a sparse graph, our method takes $O(n)$ algebraic operations as opposed to $O(m)$ operations needed by theoretically optimal sparse matrix approaches. Thus for sparse graphs it matches the bounds of the best-known sequential algorithm, and on a Parallel Random Access Machine (PRAM) it is work-optimal. Our result holds for both directed and undirected graphs. Compared to a leading GraphBLAS library, our method achieves up to 24x faster sequential time; in parallel it can be 17x faster on large graphs and 12x faster on large-diameter graphs.


Introduction
Breadth-First Search (BFS) is a principal search algorithm and fundamental primitive for many graph algorithms such as computing reachability and shortest paths. Let n and m refer to the number of vertices and edges, respectively. By labeling the vertices 1..n, a symmetric n × n adjacency matrix, A, can be constructed so that every nonzero element of the matrix denotes an edge, leading to O(m) nonzeros in total. Hence each column or row vector of this matrix describes the adjacency or neighborhood of a vertex. The linear algebraic BFS algorithm is simply a recurrence of matrix-vector multiplications with this adjacency matrix and the previous multiplication product. It solves the $x_{k+1} = Ax_k$ relation, where each matrix-vector product captures the next level in the search. The computation is optimal if it makes O(m) algebraic operations overall. Observe that this recurrence can be iterated to give $x_{n+1} = A^n x_1$, and therefore matrix exponentiation of A by repeated squaring leads to a sublinear-time, parallel BFS, but requires $\Omega(n^3 \log n)$ work.
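As a small illustration of the recurrence, the following sketch (our own construction, using dense numpy arrays purely for readability) evaluates $x_{k+1} = Ax_k$ over the Boolean semiring and collects the BFS levels:

```python
import numpy as np

# Illustrative sketch (our construction): BFS levels from the x_{k+1} = A x_k
# recurrence over the Boolean semiring (OR as addition, AND as multiplication).
# Dense numpy arrays are used only for readability; zeros would be skipped in
# a real sparse implementation.
def algebraic_bfs_levels(A, src):
    n = A.shape[0]
    visited = np.zeros(n, dtype=bool)
    x = np.zeros(n, dtype=bool)
    x[src] = True
    levels = []
    while x.any():
        levels.append(np.flatnonzero(x).tolist())
        visited |= x
        # One step of the recurrence: OR together the columns A[:, j]
        # selected by the nonzeros of x, then mask out visited vertices.
        x = (A & x).any(axis=1) & ~visited
    return levels

# Example: path 0-1-2 with an extra edge 1-3 (undirected, symmetric A).
adj = np.zeros((4, 4), dtype=bool)
for u, v in [(0, 1), (1, 2), (1, 3)]:
    adj[u, v] = adj[v, u] = True

print(algebraic_bfs_levels(adj, 0))  # → [[0], [1], [2, 3]]
```

Note that the `visited` mask above is what a masked (SpMmSpV-style) method maintains; without it the recurrence would revisit vertices indefinitely.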
For sparse graphs, computing the algebraic BFS by $x_{k+1} = Ax_k$ is appealing due in part to the availability of highly optimized matrix libraries that are finely tuned to computer architectures. These libraries take advantage of the memory subsystem, and it is this low-level interaction with hardware that enables the algebraic BFS to be faster in practice than the theoretically optimal combinatorial algorithm. Newer approaches employ a Sparse Matrix masked-Sparse Vector (SpMmSpV) multiplication [1,6,26,27]. In these methods, previously visited frontier vertices are masked out of the sparse input vector at each step. This can lead to an optimal BFS on sparse graphs, but makes more algebraic operations than needed. Moreover, many of the new SpMmSpV-BFS methods add work that degrades the performance to O(mn) time.
We give an algebraic BFS that is optimal for sparse graphs. Our method multiplies progressively smaller submatrices of the adjacency matrix at each step. It masks both row and column vectors in A, so it avoids returning nonzeros from the transpose of a column vector in subsequent calculations. Only the column vectors are needed to project the next frontier, so the row vectors for the current frontier vertices are also masked during the multiplication on that frontier. This exploits the symmetry of A and thus requires at most m nonzeros. Our new BFS also short-circuits each matrix-vector multiplication by updating the mask within a step, and thus references n − 1 nonzeros in total. This holds for both undirected and directed graphs. In contrast, an optimal SpMmSpV-BFS, even with short-circuiting, references all 2m nonzeros in the worst case on an undirected graph. Therefore our approach requires significantly fewer algebraic operations than a theoretically optimal SpMmSpV-BFS method. Our submatrix multiplication method is quite simple, so it is surprising that it has been overlooked [8]. Our new method can be easily integrated with existing matrix methods, and may benefit the masking techniques in the GraphBLAS library [7,11,25,26], so we expect it would provide substantial value in practical settings.
Although our technique is applicable to dense graphs, it only leads to optimal performance on sparse graphs. For the remainder of this paper we will only consider sparse graphs. We summarize our contribution in Section 3. A brief background on sparse matrix methods for BFS is given in Section 4. We review the current SpMmSpV approaches in Section 5. In Sections 6 and 7 we define a new algebraic BFS by submatrix multiplication and analyze the asymptotic bounds on operations. Then in Section 8 we describe our main algorithm and show that its performance matches the combinatorial algorithm, and that on a Parallel Random Access Machine (PRAM) it is work-optimal. Then in Sections 9 and 10 we use the popular Compressed Sparse Row (CSR) format to demonstrate the theoretical contribution and how easily it can be integrated with existing sparse matrix methods. Experimental results are given in Section 11, where we demonstrate faster performance than the leading GraphBLAS library: up to 24x faster sequential time, and in parallel up to 17x faster on large graphs and 12x faster on large-diameter graphs.

Notation
All following descriptions are for simple, undirected graphs denoted by G = (V, E) with n = |V| vertices and m = |E| edges. The number of neighbors of a vertex is given by its degree $d(v) = |\{u \mid (u, v) \in E\}|$. Let $D_G$ denote the diameter of G, meaning the maximum distance between any pair of vertices in G. The vertices in G are labeled $[n] = \{1, 2, \ldots, n\}$. Let $A \in \{0, 1\}^{n \times n}$ be the adjacency matrix for G. We use A[·, ·] to denote the submatrix of A by its rows and columns. For example, A[{1, 2, 4}, {3, 4}] is the submatrix given by rows 1, 2, 4 and columns 3, 4 in A. We also use $A_{i,*}$, $A_{*,i}$ for the i-th row and column vectors of A, respectively. The support of a vector x, denoted by supp(x), refers to the set of indices corresponding to nonzero entries in x, thus $\mathrm{supp}(x) = \{i \mid x(i) \neq 0\}$.
Correctness in the algebraic BFS requires only the distinction between zero and nonzero values and this also holds in our method. In addition to the Arithmetic semiring it is safe to use the Boolean (OR for addition, AND for multiplication) or Tropical min-plus semiring, both of which also avoid bit-complexity concerns. When appropriate, we'll denote the addition and multiplication operators by the conventional symbols, ⊕ and ⊗, respectively. To keep our discussion simple, we'll assume that any algebraic operation takes O(1) time.
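As a small illustration (our own sketch, not taken from any library), a single matrix-vector step under the Boolean and the min-plus semirings yields the same nonzero support:

```python
# Sketch (our construction): one matrix-vector step under two semirings.
# Only the zero/nonzero pattern matters for BFS correctness.
INF = float("inf")  # the min-plus semiring "zero"

def bool_matvec(A, x):
    # Boolean semiring: OR for addition, AND for multiplication.
    n = len(x)
    return [any(A[i][j] and x[j] for j in range(n)) for i in range(n)]

def minplus_matvec(W, x):
    # Tropical min-plus semiring: min for addition, + for multiplication.
    n = len(x)
    return [min(W[i][j] + x[j] for j in range(n)) for i in range(n)]

# Path graph 0-1-2: Boolean adjacency and its min-plus counterpart
# (edges cost 1, non-edges are the semiring zero, +infinity).
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
W = [[INF if a == 0 else 1 for a in row] for row in A]

yb = bool_matvec(A, [True, False, False])  # frontier at vertex 0
ym = minplus_matvec(W, [0, INF, INF])
print(yb)  # → [False, True, False]
print(ym)  # → [inf, 1, inf]
# Both supports are {1}: the neighbor of vertex 0.
```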

Our contribution
We give a new algebraic BFS by submatrix multiplication that takes fewer algebraic operations than a theoretically optimal SpMmSpV-BFS method. We denote our algebraic BFS by the recurrence $x_{k+1} = A[V_{k+1}, V_k]\,x_k$, where $A[V_{k+1}, V_k]$ is the submatrix of A with row and column indices given by the sets $V_{k+1}$ and $V_k$ of vertices not yet found in the search as of step k. The vector $x_k$ is also masked by the indices in $V_k$, so in all steps the matrix and vector are compatible. We will show that our BFS takes O(n) algebraic operations as opposed to the O(m) operations of an optimal SpMmSpV-BFS. Our results hold for both undirected and directed graphs; only $A^T[V_{k+1}, V_k]$ is used for directed graphs. Our main theoretical results are given by the following theorems.

Theorem 1. Breadth-First Search can be computed by $x_{k+1} = A[V_{k+1}, V_k]\,x_k$.

Theorem 2. Computing Breadth-First Search by $x_{k+1} = A[V_{k+1}, V_k]\,x_k$ using one nonzero in each row will multiply n − 1 nonzeros in A.
We introduce a new algebraic BFS in Algorithm 1 based on Theorems 1 and 2. The adjacency matrix remains unchanged; rather, we mask the rows and columns in the matrix that correspond to previously visited vertices. The input vector in each step is also effectively masked so it is a sparse vector. Hence our method multiplies a sparse submatrix by a sparse subvector of decreasing size at each step. This leads to an asymptotic speedup over the conventional algebraic method for both sequential and parallel computation. Our algorithm is optimal on sparse graphs and work-optimal on a PRAM. Our main algorithmic results are the following. The sequential and parallel versions of this algorithm are deterministic and asymptotically optimal for any ordering of matrix and vector indices. The current state-of-the-art SpMmSpV-BFS approaches are only optimal if the vector indices are unordered [1,26]. It also appears that other recent SpMmSpV methods take O(mn) time overall for BFS because their masking method requires an elementwise multiplication with a dense vector or explicitly testing every vertex in each step [6,26,27].

Background
The sequential combinatorial algorithm for Breadth-First Search is well-known [9] and attributed to the 1959 discovery by Moore [20]. The algorithm proceeds iteratively, where in each step it finds the neighbors of vertices from the previous step such that these neighbors are also unique to the current step. Each step constructs a set of vertices that are not in other steps, known as the frontier set. These sets are the levels of the BFS tree, and so the traversal is called level-synchronous. The algorithm takes O(m + n) time due to referencing O(n) vertices and testing O(m) edges. To avoid cycles the algorithm must proceed one level at a time. There are $D_G$ levels, where the diameter $D_G$ is bounded by n, hence the search is inherently sequential. As of yet, there is no sublinear-time, parallel algorithm for BFS that achieves O(m + n) work.
Not long after Moore's discovery, Floyd and Warshall used the adjacency matrix A to solve transitive closure and shortest-path problems [14,24], which are generalizations of Breadth-First Search. The linear algebraic BFS algorithm is simply the $x_{k+1} = Ax_k$ recurrence. Suppose that A is a sparse matrix; then computing an algebraic BFS by $x_{k+1} = Ax_k$ takes O(mn) time because all O(m) nonzeros in A are multiplied in all O(n) steps. This Sparse Matrix-Vector (SpMV) approach is clearly wasteful because nonzeros in A will be multiplied by zeros in x. The situation is not improved using a Sparse Matrix Sparse Vector (SpMSpV) multiplication where both the input vector and the matrix are in sparse format. Applying SpMSpV in BFS can still take $\Omega(m D_G)$ time because nonzeros reappear in x, meaning every vertex is visited again, and so x becomes dense. This is readily observed given a long path connected to a clique where the search starts with a vertex in this clique. Hence it is not enough to treat both A and x as sparse, and consequently a straightforward SpMSpV method for BFS is not optimal. But practical implementations have provided speedup over the combinatorial algorithm [2,3,5,6,18]. These practical implementations rely on the level-synchronous sequential algorithm where the focus is on parallelizing the work within a level of the BFS tree. Newer methods that mask frontier vertices in subsequent input vectors to avoid revisiting vertices will be the subject of our review.
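The path-attached-to-a-clique blow-up can be checked numerically. The sketch below (our own construction and naming) counts the column-nonzero multiplications made by the unmasked recurrence and compares them against the 2m nonzeros of A:

```python
# Sketch (our construction): operation count of the unmasked x_{k+1} = A x_k
# recurrence on a clique with an attached path, starting inside the clique.
def naive_op_count(adj, src, steps):
    x, ops = {src}, 0
    for _ in range(steps):
        nxt = set()
        for j in x:                # each nonzero of the sparse input vector...
            ops += len(adj[j])     # ...multiplies every nonzero of column A[:, j]
            nxt.update(adj[j])
        x = nxt
    return ops

c, p = 10, 20                      # clique size and path length
adj = [set() for _ in range(c + p)]
def add_edge(u, v):
    adj[u].add(v); adj[v].add(u)
for u in range(c):
    for v in range(u + 1, c):
        add_edge(u, v)             # the clique
add_edge(0, c)                     # attach the path to clique vertex 0
for i in range(c, c + p - 1):
    add_edge(i, i + 1)             # the path

m = sum(len(a) for a in adj) // 2
ops = naive_op_count(adj, 1, steps=p)
print(ops, 2 * m)  # the naive count far exceeds the 2m nonzeros of A
```

Because the clique vertices reappear in x at every step, the per-step cost stays near the clique's edge count for all of the path's roughly $D_G$ steps.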

Related work
On average there are $\bar{d} = 2m/n$ nonzeros in each column of A. In a single SpMSpV multiplication there are f nonzeros in the sparse vector and thus $\Omega(\bar{d}f)$ operations on average. Accessing only unique columns of A over all BFS steps then leads to O(m) runtime, which is optimal for sparse graphs. This can be achieved by SpMmSpV methods that hide or mask previously seen frontier nonzeros from the sparse input vector at each BFS step. But many of the current approaches add more work that degrades the runtime. In the following review of current SpMSpV and SpMmSpV algorithms for BFS, we ignore any work related to initialization, parallelization, or other overhead that does not affect the asymptotic complexity.
In an algebraic BFS on sparse graphs, the nonzeros from the multiplication must be written to a new sparse output vector. Then using SpMSpV for the algebraic BFS requires a multi-way merge due to the linear combination of either rows or columns of A that are projected by the nonzeros in the sparse vector in the multiplication. Strategies for efficient merging include using a priority queue (heap) or sorting, but this results in Ω(n log n) runtime for BFS [27]. Another popular method is to employ a sparse accumulator (SPA) [1,15], which is comprised of a dense vector of values, a dense vector of true/false flags, and an unordered list of indices to nonzeros in the dense vector. This SPA can be used to mask out previous frontier nonzeros. But it is stated in [1] that there is no known algorithm for SpMSpV that attains the lower bound of $\Omega(\bar{d}f)$ if the indices in the sparse vector must be sorted. This is because the list of row indices in the SPA must be sorted if the sparse matrix was stored with ordered indices and the multiplication algorithm requires that ordering [15]. Thus SpMmSpV methods using a SPA take Ω(n log n) time if the output vector needs sorted indices, making their use in a BFS non-optimal.
The focus of these new SpMSpV methods is in efficiently reading and writing the sparse vector. But there is an analysis gap on the asymptotic cost of preventing previous frontier vertices in the BFS from reappearing in the sparse vector. Masking out these frontier nonzeros was analyzed in [26] and it appears to require an elementwise multiplication with a dense masking vector, which must be O(n) size to accommodate all vertices. This suggests these SpMmSpV methods with masking take O(mn) time for BFS. In [6] an elementwise multiplication with a dense predecessor array is performed in each step of the BFS to mask the old frontier, leading to O(mn) runtime. The SpMmSpV method in [6] also required sorted output so either method of a priority queue or SPA leads to suboptimal time. The SpMmSpV algorithm for BFS in [27] tests all vertices in each step and zeros out those in the output vector that have already been reached, leading to Ω(mn) time. A masked, column-based matrix-vector method for BFS that relies on radix sorting is given in [26] but takes Ω(m log n) time. The authors allow unsorted indices to avoid the Ω(log n) factor but elementwise multiplication with the dense masking vector results in O(mn) time. Sorted vectors are also used in [1], thereby taking Ω(m log n) time for BFS. A version with unsorted indices is given in [1] but the authors do not describe how visited vertices are avoided or masked.
Although our focus is on an optimal matrix-based BFS, we highlight a popular method known as Direction-Optimizing BFS [2]. The Direction-Optimizing BFS is not asymptotically optimal, but works well in practice for low-diameter, real-world graphs. For steps where the frontier is large, the algorithm in [2] switches to a "bottom-up" evaluation where the neighbors of each unvisited vertex, rather than of a frontier vertex, are scanned until a visited neighbor is found, and thus the unvisited vertex is marked visited. In practice this could perform fewer edge checks because it stops on the first visited neighbor. But suppose a small fraction of edges without a visited endpoint are checked during the "bottom-up" phase. These edges could be in the same component of the BFS traversal but still some steps away from being reached, or they are in different components and will never be reached. If this fraction is repeatedly checked as a function of the input, it leads to sub-optimal runtime. For example, if m/16 edges are checked as few as log n times, this leads to $\Omega(m \log n)$ runtime.

Submatrix multiplication
Recall that matrix-vector multiplication by the outer-product is the linear combination of the column vectors in the matrix scaled by the entries in the input vector as follows.

$$Ax = x(1)\,A_{*,1} \oplus x(2)\,A_{*,2} \oplus \cdots \oplus x(n)\,A_{*,n}$$
The neighbors of vertex i are the nonzero elements in the $A_{*,i}$ column vector of A. Here we show the dense matrix and vector for illustration only, so the reader should keep in mind that zeros are ignored in the calculations, including those in the vectors. The linear combination of the $A_{*,2}$ and $A_{*,3}$ columns results in the nonzeros at indices 1, 2, 3, 4 of the product vector. In BFS this corresponds to finding the neighbors 1, 3, 4 of vertex 2 and neighbors 1, 2 of vertex 3. The search continues by multiplying the matrix with this new product vector. But in the next step the $A_{*,2}$ and $A_{*,3}$ columns are projected again, resulting in redundant operations that do not add new vertices to the search. This will result in each vertex being revisited, leading to O(mn) time. Masking nonzeros in x prevents their recurrence and leads to O(m) optimal time. But masking the vector alone still incurs twice as many algebraic operations as theoretically needed. For example, merely ignoring the 2, 3 elements in this next input vector does not eliminate their recurrence because the columns $A_{*,1}$, $A_{*,4}$ will give 2, 3 again in the following output vector. A theoretically optimal SpMmSpV method for BFS will thus multiply all 2m nonzeros in A.

We notice that the computation can be performed over progressively smaller submatrices of A so at most half the nonzero elements in A are needed. This is because each unique frontier nonzero over all steps can be produced by a single A(i, j) element. As described earlier, nonzeros can reappear in the input vector in later steps because they are found in the transpose of their column vectors. Then after each step k, masking the transpose elements $A_{j,*}$, $A_{*,j}$ and $x_k(j)$ for each nonzero j-th index of $x_k$ will ignore these nonzeros for all remaining steps. Moreover, only column vectors are needed to progress the search, so the $A_{j,*}$ row vectors can be masked at step k, one step earlier than their transpose counterparts.
This eliminates returning frontier vertices that are adjacent. Therefore the algebraic Breadth-First Search can be computed over submatrices of A in which the row and column indices that remain are those not used in preceding steps. This is illustrated in Figure 1. Consequently the number of algebraic operations is reduced. For sparse graphs the algebraic method will now be as efficient as the combinatorial BFS.
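A minimal sequential sketch of this idea (our own construction, with adjacency sets standing in for the masked submatrices) shows that each nonzero of A is touched at most once:

```python
# Sketch (our construction): BFS over progressively smaller submatrices.
# `remaining` plays the role of V_{k+1}: rows/columns of A not yet masked.
def submatrix_bfs(adj, src):
    remaining = set(range(len(adj))) - {src}
    frontier = {src}
    levels, ops = [sorted(frontier)], 0
    while frontier:
        nxt = set()
        for j in frontier:
            for i in adj[j] & remaining:   # the masked column A[V_{k+1}, j]
                ops += 1                   # one multiplied nonzero A(i, j)
                nxt.add(i)
        remaining -= nxt                   # mask the new frontier's rows/columns
        frontier = nxt
        if frontier:
            levels.append(sorted(frontier))
    return levels, ops

# 4-cycle 0-1-2-3-0: m = 4 edges, and only 4 nonzeros are ever multiplied
# instead of the 2m = 8 an unmasked method would touch.
adj = [{1, 3}, {0, 2}, {1, 3}, {0, 2}]
print(submatrix_bfs(adj, 0))  # → ([[0], [1, 3], [2]], 4)
```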
We emphasize that A can be left unchanged by masking the appropriate submatrices of A. It is also implicit in the following descriptions that the submatrices of A and the input vectors x are compatible at each step because the column span of A always matches the row dimension of the input vector.
Figure 1: Optimal algebraic BFS starting from vertex 2 (steps 1–4).

Proof. Recall the matrix-vector outer-product is a linear combination of column vectors in A. Only nonzeros in $x_k$ can produce nonzeros in the resultant vector $x_{k+1}$, because any nonzero $x_{k+1}(i)$ vector element is due to a nonzero $x_k(j) \otimes A(i, j)$ product. Also observe that an $A_{*,j}$ column vector can only produce a nonzero $x_{k+1}(i)$ if A(i, j) is nonzero. Thus, subsequent operations on the $A_{*,j}$ column vector do not produce new $x_{k+1}(i)$ nonzeros. For Breadth-First Search this does not update a new level. Then for each j in $\mathrm{supp}(x_k)$ at step k, the vectors $A_{j,*}$, $A_{*,j}$ can be ignored in all remaining steps. Since only column vectors are used in the linear combination, each $A_{j,*}$ row vector can also be ignored at step k. Hence each $A_{*,j}$ column vector is multiplied at most once.

Proof. Only nonzeros are required in Breadth-First Search, so the matrix-vector product in Theorem 1 is computed from only the j indices in $V_k$ by $x_{k+1} = \bigoplus_{j \in \mathrm{supp}(x_k)} x_k(j) \otimes A[V_{k+1}, j]$. Now for all remaining steps, since $V_k$ does not contain frontier indices from earlier steps, the multiplication over the submatrix $A[V_{k+1}, V_k]$ does not include $A_{*,j}$, $A_{j,*}$. Then there cannot be a vertex i at some later step that produces j by $x_{k+1}(j) = x_k(i) \otimes A(j, i)$, because all $A_{j,*}$ are prohibited. Thus each $A_{*,j}$ column vector can be multiplied only once, and subsequently any element in A is accessed no more than once.
Notice it is valid to compute BFS by $x_{k+1} = A[V_k, V_k]\,x_k$, where the submatrices are symmetric in each step, and still remain asymptotically optimal. Using this symmetric form makes some of the theoretical results more convenient. For example, we can show there is a linear transformation on the conventional algebraic recurrence that produces our output, and thus the equality between our method and the conventional recurrence. An interested reader can refer to Appendix A for the proof.

Bounds on algebraic operations
The submatrices $A[V_{k+1}, V_k]$ strictly decrease in size as the search progresses, and so the number of algebraic operations also decreases. A simple analysis will show that our submatrix approach accesses at most m nonzeros on any sparse graph. Given flexibility on updating the mask, it requires only n − 1 nonzeros, and therefore we can design an algorithm that takes O(n) algebraic operations as opposed to the O(m) operations of an optimal SpMmSpV-BFS. Moreover, this O(n) bound holds for both undirected and directed graphs, demonstrating that our technique has significant advantages over an optimal SpMmSpV-BFS.
It is easy to see that an optimal SpMmSpV-BFS method multiplies 2m nonzeros in A because each column of A, and therefore every nonzero in A, is used in the computation. Our method takes half as many operations.

Claim 2. The transpose of any A(i, j) element multiplied in computing $x_{k+1} = A[V_{k+1}, V_k]\,x_k$ will not be multiplied at any step.
Proof. At each step k, the column vectors $A_{*,j}$ for $j \in \mathrm{supp}(x_k)$ are multiplied. Then these column vectors and their transpose are masked in all remaining steps. Thus any A(i, j) element multiplied at step k, and its transpose element A(j, i), will not be accessed in later steps. Moreover, the A(j, i) elements are not multiplied in step k because the $A_{j,*}$ row vectors are also masked at step k. Therefore the transposes of the A(i, j) matrix elements multiplied in computing $x_{k+1} = A[V_{k+1}, V_k]\,x_k$ are not used.

Lemma 1. Computing Breadth-First Search by $x_{k+1} = A[V_{k+1}, V_k]\,x_k$ multiplies at most m nonzeros in A.
It follows from Claim 2 that Lemma 1 holds. Now observe that only one nonzero in each row of A is required to produce a correct output in x. Hence at most n nonzeros in A are needed to compute BFS by this method. It also does not require a symmetric matrix and therefore holds for both undirected and directed graphs. This leads to only O(n) algebraic operations.

Theorem 2. Computing Breadth-First Search by $x_{k+1} = A[V_{k+1}, V_k]\,x_k$ using one nonzero in each row will multiply n − 1 nonzeros in A.
Proof. Claim 2 establishes that the transpose of A(i, j) elements used in computing BFS by Theorem 1 are not multiplied. Moreover, only one nonzero from a row in A is needed by the computation. It follows from Claim 1 that columns are not repeated, hence only unique A(i, j) elements can produce the frontier vertices. Excluding the source vertex, this leads to a total of n − 1 nonzeros in A in the multiplication.
We remark that short-circuiting, as proposed in Theorem 2, can also be applied to SpMmSpV-BFS methods, meaning the evaluation of the semiring to produce a new frontier nonzero stops when the result is not zero. But this still leads to more algebraic operations than our approach.

Observation 1. An optimal SpMmSpV-BFS using one nonzero in each row of A saves between 0 and 2(m − n) algebraic operations, and therefore in total requires between 2n and 2m algebraic operations.
Proof. If a column vector is projected by the matrix-vector multiplication at step k, then its transpose will appear in the linear combination at step k + 1. Short-circuiting the semiring evaluation on this transpose, now a row vector, at step k + 1 will evaluate just one nonzero. But not all nonzeros from the column vector at step k will be multiplied at step k + 1 because some will have been previous frontier nonzeros. In the worst case, such as a path, there are no savings from short-circuiting. At best the short-circuiting can save d(v) − 2 operations in each row, because at least one nonzero must be a parent from a previous frontier and another evaluates to a new frontier nonzero, leading to $\sum_{v \in V} (d(v) - 2) = 2(m - n)$ fewer evaluations. Hence short-circuiting saves between 0 and 2(m − n) algebraic operations as claimed. Then the total number of algebraic operations after short-circuiting is between 2m − 2(m − n) = 2n and 2m. Therefore an optimal SpMmSpV-BFS with short-circuiting requires between 2n and 2m algebraic operations.
We see from Observation 1 that an optimal SpMmSpV-BFS with short-circuiting requires at best 2n algebraic operations in the most optimistic scenario; in the worst case no semiring evaluations are short-circuited and hence it multiplies all 2m nonzeros. In contrast, our approach multiplies at most n − 1 nonzeros, as asserted by Theorem 2.
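The n − 1 bound can be checked with a small variant of the earlier submatrix sketch (again our own construction), where the mask is updated within the step, in the spirit of Theorem 2, so only the first nonzero per discovered row is multiplied:

```python
# Sketch (our construction): updating the mask within a step so exactly one
# nonzero per discovered row is multiplied.
def submatrix_bfs_shortcircuit(adj, src):
    remaining = set(range(len(adj))) - {src}
    frontier, ops = {src}, 0
    while frontier:
        nxt = set()
        for j in frontier:
            for i in list(adj[j] & remaining):
                remaining.discard(i)   # mask row i immediately within the step
                ops += 1
                nxt.add(i)
        frontier = nxt
    return ops

# On a connected graph, exactly n - 1 nonzeros are multiplied.
clique = [set(range(5)) - {v} for v in range(5)]  # K5: n = 5, m = 10
print(submatrix_bfs_shortcircuit(clique, 0))  # → 4
```

On K5 an unmasked method would touch 2m = 20 nonzeros; within-step masking reduces this to n − 1 = 4.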
Let us briefly explore our submatrix multiplication on directed graphs. Recall the conventional linear algebraic BFS is given by the recurrence $x_{k+1} = Ax_k$, with the implicit assumption that the underlying graph is undirected. With a directed graph the adjacency matrix A is not symmetric. By convention, egress and ingress edges are indicated as nonzeros in the rows and columns of A, respectively. Since BFS traverses the egress edges of frontier vertices, we use $x_{k+1} = A^T x_k$. Now observe that the outer-product for the matrix-vector multiplication proceeds as usual by projecting the column vectors from $A^T$ that correspond to nonzeros in $x_k$. Since our method does not mask out the column vectors for nonzeros in $x_k$ until the next step, it works on both undirected and directed graphs.
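The same masking carries over directly to the directed case. In the sketch below (our example digraph and naming), the out-neighbor lists stand in for the columns of $A^T$:

```python
# Sketch (our construction): the submatrix method on a directed graph,
# following egress edges, i.e. the columns of A^T.
out = {0: [1, 2], 1: [3], 2: [3], 3: []}   # example digraph
remaining = set(out) - {0}
frontier, levels = {0}, [[0]]
while frontier:
    nxt = set()
    for j in frontier:
        for i in set(out[j]) & remaining:  # masked column A^T[V_{k+1}, j]
            remaining.discard(i)           # one nonzero per discovered row
            nxt.add(i)
    frontier = nxt
    if frontier:
        levels.append(sorted(frontier))
print(levels)  # → [[0], [1, 2], [3]]
```

Note that vertex 3 has two in-edges, but only one of them is multiplied because 3 is masked as soon as it is first discovered.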

Optimal algorithm
We now give our main algorithm for sparse graphs that employs the new method in Theorem 1. Our new Algorithm 1 does not specify a sparse matrix format to be as general as possible.

Algorithm 1
Require: A ⊲ adjacency matrix in sparse representation
Initialize $V_1$ with 1, 2, . . . , n
Initialize $x_1$ with the source
Set $V_2 := V_1 \setminus \mathrm{supp}(x_1)$
1: for k = 1, 2, . . . until end of component do
2:   for all $j \in \mathrm{supp}(x_k)$ do

Each j corresponding to nonzeros in $x_k$ projects an $A[V_{k+1}, j]$ masked column vector in the multiplication. The indices i corresponding to nonzeros in $A[V_{k+1}, j]$ produce the next $x_{k+1}$ frontier nonzeros. After operating on an A(i, j) element during step k, the i-th row in A is masked by removing i from $V_{k+1}$ (line 6). This ensures only one nonzero in a row is multiplied, and at the end of the step $V_{k+2}$ has been realized. Therefore any A(i, j) multiplied at a step k will not be used in subsequent calculations, and by Claim 2 its transpose will not be multiplied.

Algorithm 1 is also work-optimal on a PRAM under the Work-Depth (WD) model [4,17,21]. In this model a for all construct denotes a parallel region in which instructions are performed concurrently. All other statements outside this construct are sequential. The scheduling of processors is handled implicitly and an arbitrary number of simultaneous operations can be performed each step. The work is the sum of actual operations performed and the number of processors is not a parameter in the WD model. The depth is the longest chain of dependencies, often the number of computation steps, and is denoted by D. The work-complexity, W, denotes the total number of operations. A parallel algorithm in WD can be simulated on a p-processor PRAM in O(W/p + D) time. Using p = W/D processors achieves optimal work on the PRAM.

Proof. The algorithm computes Breadth-First Search by Theorem 1. A processor is assigned to a nonzero in the sparse matrix representation, so there are O(m) processors for the edges.
At each step k every processor reads its $x_k(j)$ value, performs the algebraic operations for the matrix product, and writes out the new $x_{k+1}(i)$ value; specifically, the processor performs $x_{k+1}(i) = x_{k+1}(i) \oplus (A(i, j) \otimes x_k(j))$. A processor writing the $x_{k+1}(i)$ value also updates $V_{k+1}$ so i can be masked out of subsequent calculations. Hence only the submatrix $A[V_{k+1}, V_k]$ is used in the computation at each step k. Observe that concurrent writes to the same i index in $x_{k+1}$ do not change the result, and so arbitrary write resolution suffices. The same holds for updating i in $V_{k+1}$.

Our Algorithm 1 is an optimal algebraic BFS algorithm for sparse graphs that is deterministic and does not depend on ordering or lack of ordering in the matrix and vector indices. Although this algorithm fills a gap in the study of BFS, it offers no theoretical advantage over the simple combinatorial algorithm. However, we believe it offers a practical advantage over a theoretically optimal SpMmSpV-BFS because it requires O(n) instead of O(m) algebraic operations. Our technique of hiding or masking portions of A and the input vector could benefit new algebraic graph libraries that already feature masked sparse linear algebra operations [7,11,25,26], specifically for SpMmSpV. Since our method uses progressively smaller submatrices of A each step, it could prove useful in lowering the communication cost in distributed computing methods that bitmap and distribute partitions of the adjacency matrix [22]. Next we'll demonstrate how to easily integrate our method in the popular Compressed Sparse Row (CSR) format, which is used in many sparse matrix libraries [13,16,23].

Practical optimal sequential algorithm
The CSR format is a well-known sparse matrix representation that utilizes three arrays, nz, col, and row, to identify the nonzero elements. The nz array holds the nonzero values in row-major order. The col array contains the column indices for nonzeros in the same row-major order as nz, and row is an array of col indices pointing to the first nonzero in each row of the matrix. The last value in row is one past the index of the last nonzero, i.e., the total number of nonzeros. A matrix-vector multiplication in CSR iterates over the gap between successive values of the row array to access each nonzero in a row. Recall we can use $A^T$ in the matrix-vector multiplications since this works for both directed and undirected graphs. So we give the basic CSR matrix-vector multiplication for $y \leftarrow A^T x$ in Listing 1.
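As a concrete sketch in the spirit of Listing 1 (our own small example matrix, using the text's nz/col/row naming), the loop below evaluates $y \leftarrow A^T x$ directly from the CSR arrays:

```python
# Sketch (our construction): y <- A^T x from CSR arrays.
# A (3x3): row 0 has nonzeros in columns 1 and 2; row 1 in column 2;
# row 2 is empty. All nonzero values are 1.
nz  = [1, 1, 1]
col = [1, 2, 2]
row = [0, 2, 3, 3]          # last entry = total number of nonzeros

def csr_transpose_matvec(nz, col, row, x):
    n = len(row) - 1
    y = [0] * n
    for i in range(n):                      # row i of A = column i of A^T
        for k in range(row[i], row[i + 1]):
            # A(i, col[k]) contributes to y[col[k]] in the product A^T x
            y[col[k]] += nz[k] * x[i]
    return y

print(csr_transpose_matvec(nz, col, row, [1, 1, 0]))  # → [0, 1, 2]
```

Each row of A is scanned once, scattering its contributions into y, which is exactly the column-oriented access into $A^T$ that the BFS needs.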

Listing 1 Transpose Matrix-vector multiplication in CSR
Require: nz, col, row ⊲ CSR data structures
Require: x, y ⊲ input and output vectors

Our Algorithm 2 requires a simple modification to this CSR matrix-vector multiplication. We add an array, T, to store nonzero indices for vertices that have been visited. This is used to limit the multiplication to the appropriate submatrix of A every step. At each step we iterate over only the nonzeros in x, the indices of which are stored in another array L. We could have used a sparse vector representation for x each step, but for simplicity we just increment a pointer in L. Observe that each col[i] is immediately marked visited. This effectively updates $V_{k+1}$ and simultaneously ensures that only the first nonzero in a row is used at step k, hence there will be only n unique entries in the array L. Then by Theorem 2 the total number of algebraic operations is 2(n − 1) rather than 2m. If desired, all col[i] in a step can contribute to the output by simply adding a test whether T[i] is zero before the inner loop over the row array, then removing the update to T[col[i]] and allowing L to take O(m) values.
Since Algorithm 2 is based on Algorithm 1 and does not add any new operations, then Theorem 5 follows immediately. It isn't difficult to see that Algorithm 2 is very similar to the combinatorial algorithm. Algorithm 2 is algebraic in the sense that it solves BFS by the matrix equations in Theorem 1 using any appropriate semiring for the addition and multiplication operations. Our intent here is to demonstrate that existing sparse matrix methods require only a simple adaptation to achieve the optimality of the combinatorial algorithm while maintaining their practical advantages. Our main algorithmic result is given in Algorithm 1 since it captures the general concept given in Theorem 1 and therefore is amenable to many forms of sparse matrix and sparse vector implementations.
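A compact sequential sketch in the spirit of Algorithm 2 (our own construction; it folds the per-step pointer into a single queue-like scan of L, which conflates the level boundaries but performs the same masked multiplications) using the CSR arrays:

```python
# Sketch (our construction): masked BFS over CSR arrays in the spirit of
# Algorithm 2. T masks visited rows/columns; L collects discovered vertices.
def csr_masked_bfs(col, row, src):
    n = len(row) - 1
    T = [0] * n                    # T[v] = 1 once row/column v is masked
    level = [-1] * n
    L, head = [src], 0
    T[src], level[src] = 1, 0
    while head < len(L):
        j = L[head]; head += 1
        for k in range(row[j], row[j + 1]):   # nonzeros of column A^T[:, j]
            i = col[k]
            if not T[i]:
                T[i] = 1                      # mask immediately: one nonzero per row
                level[i] = level[j] + 1
                L.append(i)
    return L, level

# Undirected path 0-1-2-3 in CSR (symmetric, so A = A^T).
col = [1, 0, 2, 1, 3, 2]
row = [0, 1, 3, 5, 6]
print(csr_masked_bfs(col, row, 0))  # → ([0, 1, 2, 3], [0, 1, 2, 3])
```

As the section notes, this is very close to the combinatorial algorithm; the algebraic view is that each `T[i]` test decides whether the A(i, j) nonzero participates in the masked product.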
10 Practical work-optimal parallel algorithm

We give a parallel, work-optimal CSR algorithm in Algorithm 3 that is both simple and practical. Aside from the use of for all over the L and row arrays, a significant difference between this algorithm and Algorithm 2 is that we must avoid adding duplicates to the L array. We use the PRAM model here so we can focus on the main aspects of the algorithm without the added complexity of how physical machines and parallel programming manage write conflicts. Later, we'll comment on how we handle synchronization in our practical implementation of Algorithm 3.
In a PRAM all processors execute instructions simultaneously and can concurrently access any memory cell from a global pool of memory. During the execution of Algorithm 3, two processors at level k of the BFS tree can find the same new vertex u = col[i] at level k + 1. Concurrent writes to y(col[i]) at line 6 do not pose a problem because every processor writes the same value. But we have to avoid adding u to L more than once. This can be handled with a new parent array P, where processors concurrently write to P[u] their frontier vertex v that reached u. It doesn't matter which processor wins the write to P[u], so any write-resolution protocol suffices. In the next clock cycle, each processor reads P[u] and only the processor with a matching parent vertex for u can add u to L, hence avoiding duplicates.
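The parent-array step can be illustrated with a small sequential simulation; the function name and the shuffle used to model an arbitrary write-resolution protocol are our own, not part of the paper's algorithm:

```python
import random

def dedup_next_frontier(discoveries, n):
    """discoveries: distinct (v, u) pairs, meaning frontier vertex v reached u.
    Returns the next frontier with each u appearing exactly once."""
    P = [None] * n
    # Phase 1: every processor concurrently writes its frontier vertex v to P[u].
    # Any write-resolution protocol suffices; shuffling models an arbitrary winner.
    random.shuffle(discoveries)
    for v, u in discoveries:
        P[u] = v
    # Phase 2: each processor reads P[u]; only the one whose parent won the
    # write adds u, so u enters the next frontier exactly once.
    return [u for v, u in discoveries if P[u] == v]

# Two processors (frontier vertices 0 and 1) both discover vertex 3;
# it is added to the next frontier only once.
L_next = dedup_next_frontier([(0, 3), (1, 3), (2, 4)], n=5)
```

Which processor wins for vertex 3 varies between runs, but the resulting frontier always contains 3 and 4 exactly once each, mirroring the PRAM argument above.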
Algorithm 3 has the same work and depth as Algorithm 1, so Theorem 6 follows. A practical implementation of Algorithm 3 will need to handle read and write conflicts on a specific machine architecture and parallel programming methodology. We created a multithreaded implementation of Algorithm 3 using OpenMP. Atomics are used for reading and updating the T array. To avoid adding duplicates to L, a local array is kept for each thread where it can store the new vertices discovered from its subset of frontier vertices. Each thread iterates over the vertices in its local array and atomically updates T to remove any that had been discovered by other threads. This eliminates all duplicates, so only unique vertices can be added to the next frontier. Since there can be at most O(m) duplicates overall, our OpenMP implementation remains asymptotically work-optimal.
Each thread must update L with its final subset of new frontier vertices. To avoid synchronization on L, each thread is given a unique block of space in L to which it can concurrently add its frontier subset without write conflicts. This can be achieved without wasting space as follows. Each thread concurrently writes the count of its frontier vertices to a new global array indexed by thread IDs. Since each thread has a unique ID, no synchronization is needed. All threads wait until this array has values from every thread. Then in parallel, each thread sums up the values in this global array from the first thread ID to its own. This gives each thread a unique starting offset in L at which to write its data, with no overlaps or gaps, thereby achieving full concurrency on L.
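This offset scheme is a standard exclusive prefix sum over the per-thread counts. A sequential sketch follows; the function name is ours, and the final loop stands in for the concurrent per-thread writes:

```python
from itertools import accumulate

def scatter_frontiers(local_frontiers):
    """local_frontiers[tid] holds the new vertices found by thread tid.
    Packs them into one array L with no overlaps or gaps."""
    counts = [len(f) for f in local_frontiers]        # global count array by thread ID
    # Each thread sums counts from thread 0 up to (but excluding) its own ID,
    # i.e., an exclusive prefix sum, giving its unique starting offset in L.
    offsets = [0] + list(accumulate(counts))[:-1]
    L = [0] * sum(counts)
    for tid, frontier in enumerate(local_frontiers):  # each body could run on its own thread
        L[offsets[tid]:offsets[tid] + counts[tid]] = frontier
    return L

L = scatter_frontiers([[3, 5], [7], [2, 4, 6]])   # -> [3, 5, 7, 2, 4, 6]
```

Because the offsets partition L exactly, the writes are disjoint, which is what permits fully concurrent updates in the OpenMP implementation described above.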

11 Experiments
We will show that Theorem 1 leads to significant savings in algebraic operations on real-world graphs. Then we'll compare the sequential and parallel runtime performance of our approach with that of the GraphBLAS library. We begin by comparing Algorithm 2 to simple CSR implementations for BFS using a dense vector (SpMV), a sparse vector (SpMSpV), and a masked sparse vector (SpMmSpV). In each of the BFS implementations in our tests, the visited vertices in a search are tracked in a similar manner as in Algorithm 2. We denote these test implementations as follows: SpMV-BFS for the dense vector method, SpMSpV-BFS for the sparse vector method, and SpMmSpV-BFS for the "masked" sparse vector method. In SpMV-BFS all nonzeros in A are multiplied each step. In SpMSpV-BFS we only count algebraic operations due to nonzeros in the sparse vector. The SpMmSpV-BFS is nearly identical to Algorithm 2 but tests visited frontier vertices after the algebraic operations, and is therefore an asymptotically optimal "masked" sparse vector method.
Listed in Table 1 are graphs from the Stanford Network Analysis Project (SNAP) [19] used in the experiments. We chose a source vertex for each graph such that BFS runs over the entire reported diameter. We count two algebraic operations for the (⊕, ⊗)-semiring operations in the inner loop of the CSR multiplication. A comparison of the methods is illustrated by a log-scale histogram in Figure 2, where it is clear that Algorithm 2 requires orders of magnitude fewer algebraic operations than the other methods, with the exception of the masked sparse vector approach. The SpMmSpV-BFS should take 4m operations, as we claimed in Sections 1 and 6. At the worst end, SpMV-BFS multiplies all 2m nonzeros in A every step, leading to $4m(D_G + 1)$ algebraic operations. In contrast, Algorithm 2 multiplies n − 1 nonzeros, as asserted by Theorem 2, leading to 2(n − 1) algebraic operations in total. All of this is borne out in the experimental results listed in Table 2. A simple modification to Algorithm 2 was described in Section 9 that would include all nonzeros in a row; according to Lemma 1, our optimal algebraic BFS should then take 2m algebraic operations on a sparse graph. We tested this modification in the experiments and verified that it indeed results in 2m algebraic operations on each of the graphs; therefore a straightforward implementation at worst takes half the number of operations of an optimal SpMmSpV-BFS.
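The operation counts above can be collected into a small back-of-the-envelope helper; the function and dictionary labels are ours, while the formulas are the ones stated in the text for sparse graphs:

```python
def expected_ops(n, m, diameter):
    """Total semiring (add, multiply) operation counts per BFS variant,
    using the formulas from the text (n vertices, m edges, BFS diameter)."""
    return {
        "SpMV-BFS": 4 * m * (diameter + 1),      # all 2m nonzeros multiplied every step
        "SpMmSpV-BFS": 4 * m,                    # optimal masked sparse vector
        "Algorithm 2": 2 * (n - 1),              # one multiply-add per discovered vertex
        "Algorithm 2, all row nonzeros": 2 * m,  # Section 9 modification
    }

# e.g., a path graph on 6 vertices: n = 6, m = 5, diameter = 5
counts = expected_ops(n=6, m=5, diameter=5)
```

Even on this tiny example the gap is visible: 120 operations for SpMV-BFS versus 10 for Algorithm 2.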
We compared our algorithm implementations against the SuiteSparse GraphBLAS library [10][11][12]. This is considered a complete reference implementation, and for convenience we will refer to it simply as GraphBLAS. This library implements SpMmSpV-BFS using a masked sparse vector [12, cf. User Guide, p. 191]. We use versions 2.2.2 and 3.0.1 of GraphBLAS for the sequential and parallel tests, respectively. The parallel BFS implementations of both our method and GraphBLAS use OpenMP multithreading. The experiments were run on a single workstation with over 256 GB of RAM and 28 Intel Xeon E5-2680 cores. Each BFS starts at vertex 0, so the absolute diameter is less than those reported in Table 1. We use $T_{GB}$ to denote the runtime for GraphBLAS and $T_{Alg}$ for the runtime of either Algorithm 2 or Algorithm 3. In the plots to follow, the performance comparison is given as the ratio $T_{GB}/T_{Alg}$, so if $T_{Alg}$ is faster then the ratio is greater than one and appears higher on the y-axis. In all experiments our sequential and parallel implementations were faster than GraphBLAS, often by over an order of magnitude. A comparison of sequential runtime is given in Figure 3. Our BFS is 19-24x faster than GraphBLAS on the road network graphs, and on average it is 22x faster. These graphs have large diameter, so inefficiencies are compounded with each step. The gains for the lower-diameter graphs are more modest, but on average we are still more than 3.5x faster. In these graphs the work is more concentrated within each level of the BFS rather than distributed over the total number of levels. Memory cache effects could boost performance on these graphs more than on the large-diameter graphs, where there is considerably less work in each level of the BFS tree. Although our approach requires significantly fewer algebraic operations, we suspect memory access is the bottleneck and thus accounts for the lower than expected improvement over GraphBLAS on the larger, low-diameter graphs.
We remark that our OpenMP implementation of Algorithm 3 performs more algebraic operations than the sequential implementation of Algorithm 2 because threads can concurrently produce the same frontier vertices in the matrix-vector multiplication before these duplicates are removed. This and the overhead of managing parallel and sequential regions account for the gap in performance with respect to the experimental results for the sequential algorithm. But since there are only O(m) duplicates overall, the implementation remains asymptotically work-optimal. We remind the reader that details on our implementation are at the end of Section 10.
The parallel runtime is given in Figure 4, where it is also evident that our method is considerably faster than GraphBLAS. In two-thirds of the tests our algorithm completed in less than one-tenth of a second, and it took over one second in just one test, where two threads were used on the largest graph, com-Orkut, finishing in 1.59 seconds. At 64 threads on the two largest graphs, com-Orkut and com-LiveJournal, our BFS completes in 0.153 and 0.082 seconds, resulting in 11x and 17x faster times than GraphBLAS, respectively. For com-Orkut we achieve over 1.5 GTEPS (Giga Traversed Edges Per Second). Also notable on these large graphs is that our performance with respect to GraphBLAS scales linearly on the whole. The exception is a drop-off with com-Amazon at 64 threads. We don't have an explanation for this drop, but both our method and GraphBLAS suffered it. For the large-diameter graphs our performance peaks with fewer threads, again because the per-level work is very low, so the overhead of adding more threads becomes significant. Since we only have 28 physical cores, running 64 virtual threads on these graphs with very few edges in each BFS level may compound cache misses. But we are still about 9-12x faster than GraphBLAS at our peak for these graphs. The raw wallclock timings are available in Appendix B.
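For reference, GTEPS is simply traversed edges divided by wallclock time; a tiny helper (the function name and the sample numbers below are illustrative, not the paper's measurements):

```python
def gteps(edges_traversed, seconds):
    """Giga Traversed Edges Per Second: edges / time, scaled to 1e9."""
    return edges_traversed / seconds / 1e9

# Hypothetical example: traversing 2.0e8 edges in 0.13 s gives ~1.54 GTEPS.
rate = gteps(edges_traversed=2.0e8, seconds=0.13)
```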

Conclusion
We introduced a new algebraic formulation for Breadth-First Search that ensures optimal work on sparse graphs by multiplying progressively smaller submatrices of the adjacency matrix A. We showed that BFS can be computed by the recurrence $x_{k+1} = A[V_{k+1}, V_k]x_k$, where masking the row and column vectors in A corresponding to frontier nonzeros in each input vector prevents multiplying the same frontier nonzeros again. Our submatrix multiplication approach takes O(n) algebraic operations instead of the O(m) of a theoretically optimal SpMmSpV-BFS, on both undirected and directed graphs. We gave a general algorithm using our submatrix formulation for BFS, showing it is work-optimal for both sequential and parallel processing. We demonstrated that it requires only a simple amendment to CSR sparse-matrix multiplication and verified that it leads to significant performance improvement over the current state-of-the-art SpMmSpV-BFS. We believe our approach can be easily integrated with other existing matrix methods, and can benefit those that support masking operations, such as the GraphBLAS.
Our paper closes a gap between the linear algebraic and graph-theoretic solutions for BFS. The methods we introduced may be useful in devising linear algebraic formulations of other graph algorithms.

Proposition 1. There is a linear transformation on $y_k = Ay_{k-1}$ that gives $x_{k+1} = A[V_k, V_k]x_k$, specifically $x_{k+1} = A_k y_k$ and subsequently $x_{k+1} = A_k A^{k-1} x_1$.
Proof. We first show that $S_k x_k$ is equal to $S_k y_k$. Here $y_k$ and $x_k$ contain nonzeros that have not been produced by previous steps. It follows from Theorem 1 that $x_k$ contains only such nonzeros. Now $S_k y_k$ annihilates the nonzeros in $y_k$ that are not in $V_k$, hence $S_k y_k$ is equal to $x_k$. Since $S_k$ is idempotent, $S_k x_k = S_k(S_k y_k) = S_k y_k$. Using this equality and Claim 5 we can show that $A_k y_k$ is equal to $A_k x_k$.
By Claim 5, $A_k y_k = S_k A S_k y_k$, and since $S_k y_k = S_k x_k$ we have $S_k A S_k y_k = S_k A S_k x_k = A_k x_k$. This leads to $x_{k+1} = A_k x_k = A_k y_k$. Iteration on $y_k = Ay_{k-1}$ yields $y_k = A^{k-1} y_1$. Since the source vector is the same for this conventional recurrence and that of Theorem 1, $x_1 = y_1$, giving the result $x_{k+1} = A_k y_k$ and $x_{k+1} = A_k A^{k-1} x_1$ as claimed.
We emphasize that the result of Proposition 1 supposes that a chosen semiring is applied consistently. If the arithmetic semiring was used to produce $x_{k+1}$, then it must also be used to compute $A_k A^{k-1} x_1$.

Tables 3 and 4 list the wallclock timings in seconds for the parallel runtime experiments plotted in Figure 4 of Section 11.