Parallelizing Sequential Graph Computations

This article presents GRAPE, a parallel GRAPh Engine for graph computations. GRAPE differs from prior systems in its ability to parallelize existing sequential graph algorithms as a whole, without the need for recasting the entire algorithm into a new model. Underlying GRAPE are a simple programming model and a principled approach based on fixpoint computation that starts with partial evaluation and uses an incremental function as the intermediate consequence operator. We show that users can devise existing sequential graph algorithms with minor additions, and GRAPE parallelizes the computation. Under a monotonic condition, the GRAPE parallelization guarantees to converge at correct answers as long as the sequential algorithms are correct. Moreover, we show that algorithms in MapReduce, BSP, and PRAM can be optimally simulated on GRAPE. In addition to the ease of programming, we experimentally verify that GRAPE achieves comparable performance to the state-of-the-art graph systems using real-life and synthetic graphs.

graph algorithms into their models. While graph computations have been studied for decades and a number of sequential (single-machine) graph algorithms are already in place, to use Pregel, for instance, one has to "think like a vertex" and recast the existing algorithms into a vertex-centric model, and similarly when programming with other systems. The recasting is nontrivial for people who are not very familiar with the parallel models. This makes these systems a privilege for experienced users only. Is it possible to have a system such that we can provide sequential (single-machine) graph algorithms as a whole (subject to minor changes) and parallelize the computation across multiple processors without drastic degradation in either performance or functionality of the existing systems?
GRAPE. To answer this question, we develop GRAPE, a parallel GRAPh Engine for graph computations such as graph traversal, graph pattern matching, graph connectivity, and collaborative filtering. Using familiar terms, we refer to a graph computation problem as a class Q of queries, and to an instance of the problem as a query Q ∈ Q. GRAPE differs from prior graph systems in the following.
(1) Ease of programming. GRAPE supports a simple programming model. For a class Q of graph queries, users only need to provide (existing) sequential (incremental) algorithms for Q with minor additions. There is no need to revise the logic of the existing algorithms, and it substantially reduces the effort to "think parallel." This makes parallel graph computations accessible to a large group of users who know the conventional graph algorithms covered in undergraduate textbooks.
(2) Termination and correctness. GRAPE parallelizes the sequential algorithms based on a combination of partial evaluation and incremental computation. It guarantees to converge at correct answers under a monotonic condition as long as the sequential algorithms provided are correct.
(3) Graph-level optimization. GRAPE naturally inherits all optimization strategies available for sequential algorithms and graphs, such as indexing, compression, and partitioning. In contrast, these strategies are hard to implement for vertex-centric programs.
(4) Scalability. The ease of programming does not imply performance degradation compared with the state-of-the-art systems such as vertex-centric Giraph [6] (Pregel) and GraphLab, and block-centric Blogel. For instance, Table 1 shows the performance of the systems for Single-Source Shortest-Path (SSSP) queries over Friendster [4], a social network with 65 million users and 1.8 billion relationships (edges) using 192 processors; GRAPE outperforms Giraph, GraphLab, and Blogel in both response time and communication costs (see Section 7 for more results).
A principled approach. To see how GRAPE achieves these results, we present its underlying principles. Consider a graph G that is partitioned into fragments (F 1 , . . . , F n ), and distributed across n processors (P 1 , . . . , P n ), where F i resides at P i for i ∈ [1, n], respectively. Given a query Q ∈ Q and a fragmented graph G, GRAPE computes the answer Q (G) to Q in G based on the following.
Partial evaluation. Given a function f (s, d ) and the s part of its input, partial evaluation is to specialize f (s, d ) with respect to the known input s [42]. That is, it performs the part of f's computation that depends only on s and generates a partial answer (i.e., a residual function f′ that depends on the as-yet-unavailable input d). For each processor P i in GRAPE, its local fragment F i is its known input s, while the data residing at other processors account for the as-yet-unavailable input d. GRAPE computes Q (F i ) at all processors P i in parallel as partial evaluation.
Incremental computation. Graph computations are often iterative. If Q (G) cannot be obtained in one step by combining partial results Q (F i ), GRAPE exchanges selected partial results as messages between processors and computes Q (F i ⊕ M i ) by treating message M i to P i as updates to certain status variables associated with nodes and edges in F i . It incrementally computes changes ΔO i to Q (F i ) such that Q (F i ⊕ M i ) = Q (F i ) ⊕ ΔO i , making maximum reuse of previous results Q (F i ). Here (i) M i is a message designated to worker P i , where fragment F i resides; (ii) F i ⊕ M i is the abbreviation of the following steps: (a) deduce the change ΔF i to F i from M i (we will show how to deduce the change in Section 3.2) and then (b) apply the change ΔF i to fragment F i ; and (iii) Q (F i ) ⊕ ΔO i is to apply the change ΔO i to the old result Q (F i ). To keep notation light, we overload the symbol ⊕; it will be clear from the context which operation ⊕ denotes. Incremental computation is often more efficient than recomputing Q (F i ⊕ M i ) from scratch since, in practice, M i is typically small and so is ΔO i . Better still, it may be bounded: its cost depends only on the sizes of the changes M i to input F i and changes ΔO i to output Q (F i ), not on the size |F i | of the entire fragment F i [29,57], thus minimizing unnecessary recomputation.
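To make the notion of partial evaluation of f (s, d ) concrete outside the graph setting, the following minimal Python sketch specializes a two-argument function with respect to its known argument. The names power and specialize_power are hypothetical and are not part of GRAPE; the example only illustrates how the work that depends on s alone is done once, leaving a residual function of d.

```python
from typing import Callable


def power(exponent: int, base: float) -> float:
    """The full function f(s, d): s = exponent (known now), d = base (known later)."""
    result = 1.0
    for _ in range(exponent):
        result *= base
    return result


def specialize_power(exponent: int) -> Callable[[float], float]:
    """Partially evaluate `power` with respect to the known exponent s."""
    # Everything that depends only on s is computed once, here:
    # the square-and-multiply schedule read off the bits of the exponent.
    schedule = [bit == "1" for bit in bin(exponent)[2:]]

    def residual(base: float) -> float:
        # The residual function depends only on the late input d.
        result = 1.0
        for multiply in schedule:
            result *= result
            if multiply:
                result *= base
        return result

    return residual


cube = specialize_power(3)      # residual function for s = 3
print(cube(2.0), power(3, 2.0))  # both compute 2 ** 3
```

In GRAPE the role of s is played by the local fragment F i and the role of d by the data held at other processors.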
Workflow. Based on partial evaluation and incremental computation, GRAPE works as follows.
(1) Plug. GRAPE offers a simple programming interface, as shown in Figure 1. For a class Q of graph queries, developers need to specify three functions, PEval, IncEval, and Assemble in the algorithm panel. PEval and IncEval are (often existing) sequential (single-machine) algorithms for Q, for partial evaluation and incremental computation, respectively; and Assemble is typically straightforward (see examples shortly). These can be picked from a library of graph algorithms; the only addition is a specification of messages for communication between processors.
(2) Play. In the configuration panel, users may pick such a specification (PEval, IncEval, and Assemble) registered for Q, a graph G, a graph partition strategy, and a number n of processors to work with (Figure 1). Given a query Q ∈ Q and a partitioned graph G, GRAPE parallelizes PEval, IncEval, and Assemble across n processors and computes Q (G) in three phases, as shown in Figure 2.
(a) Each processor P i first executes PEval against its local fragment F i to compute a partial answer Q (F i ) in parallel. This facilitates data-partitioned parallelism via partial evaluation.
(b) Then each processor P i may exchange partial results with other processors via synchronous message passing under BSP [65]. Upon receiving message M i , processor P i incrementally computes local answer Q (F i ⊕ M i ) by IncEval, operating on its local fragment F i "updated" by M i .
(c) The incremental step iterates until no further updates M i can be made to any F i . At this point, Assemble pulls partial answers Q (F i ⊕ M i ) from P i for i ∈ [1, n] and assembles Q (G).
That is, GRAPE parallelizes sequential algorithms as a whole and computes a simultaneous fixpoint by taking IncEval as the intermediate consequence operator. It guarantees to reach a fixpoint under a monotonic condition if the sequential algorithms are correct for Q. Moreover, it minimizes iterative recomputation by using IncEval and supports graph-level optimization on F i .
Example 1.1. Consider SSSP, a routine graph computation problem. Given a directed graph G with edges labeled with positive weights and a source node s in G (as a query Q), the goal is to find Q (G) including the shortest distance dist(s, v) from s to all nodes v in G.
Using GRAPE, one can pick the familiar Dijkstra's algorithm [32] as PEval and a bounded sequential incremental algorithm of Ramalingam and Reps [56,57] as IncEval. The algorithm of Ramalingam and Reps [56] essentially propagates changes to vertices or edges to other vertices in the graph following an order defined on some "keys" of the affected vertices. Similarly, the algorithm by Ramalingam and Reps [57] handles unit changes to graphs. The only addition to GRAPE is that, for each fragment F i , a variable dist(s, v) of positive numbers is declared for each node v, initially ∞ (except dist(s, s) = 0). As shown in Figure 2, PEval first computes Q (F i ); it then repeats incremental steps IncEval to compute Q (F i ⊕ M i ), where messages M i include updated (smaller) dist(s, u) (due to new "shortcuts" from s) for border nodes u (i.e., nodes with edges across different fragments). GRAPE guarantees the termination of the fixpoint computation when no more dist(s, v) can be changed to a smaller value. At this point, Assemble takes a union of Q (F i ) as Q (G), which is provably correct (see Section 3 for details).
drastic degradation in performance or functionality. To this end, GRAPE adopts the synchronization mechanism of BSP for its simplicity. As opposed to the prior systems, (a) GRAPE parallelizes sequential algorithms based on fixpoint computation with partial evaluation and incremental computation. (b) Following data-partitioned parallelism, given a partitioned graph, GRAPE allows workers to operate on different fragments in parallel, and exchanges among workers only the updated values of the status variables associated with border nodes. In contrast, for iterative computations, MapReduce needs to repartition a graph and ship its entire state in each round [50]. (c) The vertex-centric model of Pregel (synchronized) is a special case of GRAPE when each fragment is limited to a single vertex.
The communications of Pregel are via "interprocessor" messages, and a message from a node often has to go through several supersteps to reach another node. GRAPE reduces excessive messages and the scheduling cost of Pregel since communications within the same fragment are local. GRAPE also facilitates graph-level optimization methods that are hard to implement in vertex-centric systems (this is similarly so for asynchronized GraphLab). (d) Closer to GRAPE are block-centric models [63,71]. However, the programming interface of Tian et al. [63] is still vertex-centric, and Yan et al. [71] is a mix of vertex-centric and block-centric programming (V-compute and B-compute). The B-compute interface is essentially vertex-centric programming, treating each block as a vertex. Users have to recast existing sequential algorithms into a new model. In contrast, GRAPE takes sequential algorithms PEval and IncEval from the GRAPE library and applies them to blocks in parallel without recasting. (e) None of the prior systems uses (bounded) incremental steps to speed up iterative computations. (f) To the best of our knowledge, none of these systems provides assurance on termination and the correctness of parallel graph computations.
Partial evaluation has been studied for certain XML [19] and graph queries [29]. There has also been a host of work on incremental graph computation [24,29,57]. This work makes a first effort to provide a uniform model by combining partial evaluation and incremental computation together to parallelize sequential graph algorithms as a whole.
Parallelization of graph computations. A number of algorithms have been developed in Map-Reduce, vertex-centric models, and others [29,73]. In contrast, GRAPE aims to parallelize existing sequential graph algorithms without revising their logic and work flow. Moreover, parallel algorithms for MapReduce, BSP (vertex-centric or not), and PRAM can be easily migrated to GRAPE (Section 4.2).
Prior work on automated parallelization has focused on the instruction or operator level [54,58] by breaking dependencies via symbolic and automatic analyses. There has also been work at the data partition level [75] to perform multilevel partitioning ("parallel abstraction") and adapt locality-optimized access to different parallel abstractions. In contrast, GRAPE aims to parallelize sequential algorithms as a whole and make parallel computation accessible to end users, while others [54,58,75] target experienced developers of parallel algorithms. There have also been tools for translating imperative code to MapReduce (e.g., word count [55]). GRAPE advocates a different approach by parallelizing the runs of sequential graph algorithms to benefit from data-partitioned parallelism, without translation. This said, the techniques of other authors [54,55,58,75] are complementary to GRAPE.
Simulation results. Prior work has mostly focused on simulations between variants of PRAM with different memory management strategies, to characterize bounds of slowdown for deterministic or randomized solutions [38]. There has also been recent work on simulation of PRAM on MapReduce and BSP [43]. In particular, Karloff et al. [43] define a framework MRC for MapReduce computations and show that a large class of PRAM algorithms can be efficiently simulated by MRC with certain restrictions. This work extends that of Fan et al. [31] by providing new optimal deterministic simulation results of MapReduce, BSP, and PRAM on GRAPE, adopting the notion of optimal simulations [66].

Table 2. Notations

Symbols      Notations
Q            A class of graph queries
Q            A query Q ∈ Q
G            Graph, directed or undirected
P 0 , P i    P 0 : coordinator; P i : workers (i ∈ [1, n])
P            Graph partition strategy
G P          The fragmentation graph of G via partition strategy P
F i          The ith fragment of graph G via partition strategy P

PRELIMINARIES
We start with a review of basic notations.

Graphs. We consider graphs G = (V , E, L), directed or undirected, where V is a finite set of nodes, E ⊆ V × V is a set of edges, and each node v ∈ V (respectively, edge e ∈ E) carries a label L(v) (respectively, L(e)), indicating its content, as found in social networks, knowledge bases, and property graphs. We use two notions of subgraphs. A graph G′ = (V′, E′, L′) is called a subgraph of G if V′ ⊆ V , E′ ⊆ E, and for each node v ∈ V′ (respectively, edge e ∈ E′), L′(v) = L(v) (respectively, L′(e) = L(e)). A subgraph G′ is said to be induced by V′ if E′ consists of all the edges in G whose endpoints are both in V′.
Partition strategy. Given a graph G and a number m, a graph partition strategy P partitions G into fragments F = (F 1 , . . . , F m ) such that each In vertex partition (a.k.a. edge-cut) [12,18], a cut edge from fragment F i to F j has a copy in each of F i and F j . Denote by We refer to those nodes in F i .I ∪ F i .O as the border nodes of fragment F i with regard to partition strategy P. Note that F .I = F .O and F .I = F .O .
In edge partition (a.k.a. vertex-cut) [47], the cut vertices are called entry vertices and exit vertices for the partitions, which correspond to the sets F .O ∪ F .I and F .I ∪ F .O , respectively. In general, a border node is a vertex that relates to vertices or edges in two different fragments.
The fragmentation graph G P of G via P is an index such that, given each border node v in F i .I (respectively, As will be seen shortly, we will make use of G P to deduce the directions of messages.
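The border sets and the index G P can be sketched for the edge-cut case as follows. Because the definitions of F i .I and F i .O are incomplete in the text above, this sketch assumes one reading: F i .I holds the nodes owned by F i that have an incoming cut edge, and F i .O holds the nodes owned by other fragments that F i points to (so F i keeps a copy of them). The function names and the owner map are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Set, Tuple

Node = Hashable


def border_sets(
    edges: Iterable[Tuple[Node, Node]], owner: Dict[Node, int]
) -> Tuple[Dict[int, Set[Node]], Dict[int, Set[Node]]]:
    """Return (F.I, F.O) per fragment for an edge-cut partition (assumed reading)."""
    f_in: Dict[int, Set[Node]] = defaultdict(set)
    f_out: Dict[int, Set[Node]] = defaultdict(set)
    for u, v in edges:
        i, j = owner[u], owner[v]
        if i != j:              # (u, v) is a cut edge from F_i to F_j
            f_out[i].add(v)     # F_i holds a copy of v as an exit point
            f_in[j].add(v)      # v is an entry point of its owner F_j
    return dict(f_in), dict(f_out)


def fragmentation_index(
    f_in: Dict[int, Set[Node]], f_out: Dict[int, Set[Node]]
) -> Dict[Node, Tuple[Set[int], Set[int]]]:
    """Map each border node to (fragments where it is in F.I, fragments where it is in F.O)."""
    index: Dict[Node, Tuple[Set[int], Set[int]]] = defaultdict(lambda: (set(), set()))
    for i, nodes in f_in.items():
        for v in nodes:
            index[v][0].add(i)
    for i, nodes in f_out.items():
        for v in nodes:
            index[v][1].add(i)
    return dict(index)


# A 4-cycle a -> b -> c -> d -> a split into two fragments.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
owner = {"a": 1, "b": 1, "c": 2, "d": 2}
f_in, f_out = border_sets(edges, owner)
index = fragmentation_index(f_in, f_out)
```

Under this reading the union of all F i .O equals the union of all F i .I, and a lookup in the index tells a sender which fragment should receive a value attached to a border node.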
The notations of this article are summarized in Table 2.

PROGRAMMING WITH GRAPE
Here, we first introduce the parallel model of GRAPE. We then show how to program with GRAPE. Following BSP [65], GRAPE works with a coordinator P 0 and a set of m workers P 1 , . . . , P m .

The Parallel Model of GRAPE
Given a graph partition strategy P and sequential algorithms PEval, IncEval, and Assemble for a class Q of graph queries, GRAPE parallelizes the computations as follows. It first partitions graph G into fragments F = (F 1 , . . . , F m ) with strategy P and distributes the fragments across m shared-nothing virtual workers (P 1 , . . . , P m ). It maps m virtual workers to n physical workers such that fragment F i resides at worker P i for i ∈ [1, m]. When n < m, multiple virtual workers mapped to the same worker share memory. It also constructs fragmentation graph G P . Note that graph G is partitioned once for all queries Q ∈ Q posed on graph G.
Parallel model. Given a query Q ∈ Q, GRAPE computes answer Q (G) to Q in the partitioned graph G, as shown in Figure 2. Upon receiving Q at coordinator P 0 , GRAPE posts the same query Q to all the workers. To simplify the discussion, here we adopt synchronous message passing following BSP [65]. We will show how GRAPE implements point-to-point communication in Section 6. Furthermore, GRAPE also works under asynchronous parallel models [27]. Its parallel computation consists of the following three phases.
(1) Partial evaluation (PEval). In the first superstep, upon receiving query Q, each worker P i computes partial result Q (F i ) locally at its fragment F i using PEval, in parallel (for i ∈ [1, m]). It also identifies and initializes a set of update parameters for each F i that records the status of certain nodes (e.g., border nodes). At the end of the process, it generates a message from the update parameters at each P i and sends it to coordinator P 0 (see Section 3.2 for update parameters).
(2) Incremental computation (IncEval). GRAPE iterates the following supersteps until it terminates. Each superstep consists of two steps, one at the coordinator P 0 and the other at the workers.
(2.a) Coordinator. Coordinator P 0 checks whether for all i ∈ [1, m], worker P i is inactive; that is, P i is done with its local computation and there exists no pending message designated for P i . If so, GRAPE invokes Assemble and terminates (see below). Otherwise, P 0 composes a message M i by aggregating messages from the last superstep (see details shortly), sends M i to worker P i for i ∈ [1, m], and triggers the next superstep.
(2.b) Workers. Upon receiving message M i , each worker P i incrementally computes Q (F i ⊕ M i ) using IncEval by treating M i as updates, in parallel for all i ∈ [1, m]. Here, F i ⊕ M i denotes the fragment F i that is updated with message M i ; that is, F i after its update parameters are changed with the values in M i . At the end of the process, IncEval automatically finds the changes to the update parameters in each F i and sends the changes as a message to coordinator P 0 (see Section 3.3 for details).
GRAPE supports data-partitioned parallelism by partial evaluation on local fragments, in parallel by all workers. Its incremental step (2.b) speeds up iterative graph computations by reusing the partial results from the last superstep to minimize unnecessary recomputation.
(3) Termination (Assemble). The coordinator P 0 decides to terminate the process if there exists no more change to any update parameters (see (2.a) above). If so, P 0 pulls partial results from all workers and computes Q (G) by invoking Assemble. It returns query answer Q (G).
We will show in Section 4 that the parallel process converges at correct answers under a monotonic condition as long as the sequential algorithms PEval, IncEval, and Assemble are correct; moreover, the simple parallel model does not lose expressive power.
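The three phases can be sketched as a small single-process driver that plays the role of coordinator P 0 and runs the workers one after another within each superstep. This is an illustrative sketch only, not GRAPE's implementation: run_pie, route, and the toy problem (every worker learns the smallest value held by any worker) are hypothetical, and real workers would run in parallel.

```python
from typing import Any, Callable, Dict, Iterable, Tuple

Message = Dict[str, Any]


def run_pie(
    fragments: Dict[int, Any],
    peval: Callable[[Any], Tuple[Any, Message]],
    inceval: Callable[[Any, Any, Message], Tuple[Any, Message]],
    assemble: Callable[[Dict[int, Any]], Any],
    route: Callable[[int, str], Iterable[int]],
    aggregate: Callable[[Any, Any], Any],
) -> Any:
    # Phase 1: PEval at every worker; each reports its update parameters.
    partial: Dict[int, Any] = {}
    reported: Dict[int, Message] = {}
    for i, fragment in fragments.items():
        partial[i], reported[i] = peval(fragment)
    while True:
        # Phase 2.a: the coordinator groups reported changes into messages M_j,
        # resolving conflicts on the same variable with `aggregate`.
        inbox: Dict[int, Message] = {}
        for i, changes in reported.items():
            for var, val in changes.items():
                for j in route(i, var):
                    slot = inbox.setdefault(j, {})
                    slot[var] = aggregate(slot[var], val) if var in slot else val
        if not inbox:
            # Phase 3: no pending message for any worker, so assemble and stop.
            return assemble(partial)
        # Phase 2.b: IncEval at every worker that received a message.
        reported = {}
        for j, message in inbox.items():
            partial[j], reported[j] = inceval(fragments[j], partial[j], message)


# Toy PIE program: propagate the minimum value along a path of three workers.
fragments = {
    1: {"value": 7, "neighbors": [2]},
    2: {"value": 3, "neighbors": [1, 3]},
    3: {"value": 9, "neighbors": [2]},
}


def toy_peval(fragment):
    return fragment["value"], {"low": fragment["value"]}


def toy_inceval(fragment, old, message):
    new = min(old, message["low"])
    # Report the update parameter only if it actually changed.
    return new, ({"low": new} if new < old else {})


answer = run_pie(
    fragments,
    toy_peval,
    toy_inceval,
    assemble=dict,
    route=lambda i, var: fragments[i]["neighbors"],
    aggregate=min,
)
```

The loop ends exactly when no worker reports a changed update parameter, which mirrors the inactivity check in step (2.a).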

PEval: Partial Evaluation
We now introduce the programming model of GRAPE. GRAPE provides a programming interface for users to extend (existing) sequential algorithms with message declarations. GRAPE registers the algorithms as stored procedures in its API library ( Figure 1) and maps them to a query class Q.
More specifically, for a class Q of graph queries, one only needs to provide three core functions: PEval, IncEval, and Assemble (see the Plug Panel in Figure 1), referred to as a PIE program. These are conventional (existing) sequential algorithms and can be picked from the Library API of GRAPE. We next elaborate the three functions in a PIE program.
Function PEval takes a query Q ∈ Q and a fragment F i of G as input and computes partial answers Q (F i ) at worker P i in parallel for all i ∈ [1, m]. It may be any existing sequential algorithm for Q. One only needs to extend it with the following additions:
• the partial result is kept in a designated variable; and
• a message specification serves as its interface to IncEval.
Communication among workers is conducted via message passing. Messages are defined in terms of update parameters of each fragment F i as follows.
(1) Message preamble. Function PEval (a) declares status variables x associated with vertices and edges for each fragment F i and (b) specifies a set C i of nodes and edges relative to F i .I or F i .O with regard to each fragment F i . The status variables associated with C i are denoted by C i .x and are referred to as the update parameters of F i . The variables are declared and initialized in PEval. At the end of PEval, it sends the values of C i .x to coordinator P 0 .
Intuitively, variables in C i .x are the candidates to be updated by incremental steps. In other words, messages M i to worker P i are updates to the values of variables in C i .x.
More specifically, in GRAPE, C i is specified by an integer d and S, where S is either F i .I or F i .O . That is, C i is the set of nodes and edges within d hops of nodes in S. In most cases, d = 0 and C i is F i .I or F i .O . However, in some applications, one needs d > 0 (e.g., subgraph isomorphism; SubIso, see Section 5.1). In such cases, C i may include nodes and edges from other fragments F j of G.
A message M i is a set of key-value pairs ⟨x, val⟩, where x is a status variable declared in C i .x and val is its value. GRAPE supports arbitrarily typed status variables; for example, val of M i can be a numeric value (e.g., in the algorithm for SSSP in Example 3.1), a multiset of tuples ⟨r, key, value⟩ (in the simulation of MapReduce in the proof of Theorem 4.2), or even a user-defined structure (a class).
(2) Message segment. PEval specifies function aggregateMsg to resolve conflicts when multiple messages from different workers attempt to assign different values to the same update parameter (variable). When such a strategy is not provided, GRAPE picks a default exception handler.
(3) Message grouping. GRAPE deduces updates to C i .x for i ∈ [1, m] and treats them as messages exchanged among workers. More specifically, at coordinator P 0 , GRAPE identifies and maintains C i .x for each worker P i . Upon receiving messages from P i 's, GRAPE works as follows.
(a) Identifying C i . It deduces C i for i ∈ [1, m] by referencing fragmentation graph G P , and C i remains unchanged in the entire process. It maintains update parameters C i .x for F i .
(b) Composing M i . For messages from each P i , GRAPE does the following: (i) it identifies variables in C i .x with changed values; (ii) it deduces the designations P j of the messages by referencing the fragmentation graph G P ; if P is edge-cut, the variable tagged with a node v in F i .I will be sent to worker if P is vertex-cut, it identifies nodes shared by F i and F j (i ≠ j); and (iii) it combines all changed variable values designated to P j into a single message M j and sends M j to worker P j in the next superstep for all j ∈ [1, m].
If a variable x is assigned a set S of values from different workers, function aggregateMsg is applied to S to resolve the conflicts, and its result is taken as the value of x. When a node v has copies v i ∈ F i and v j ∈ F j residing in different fragments (for example, when v is a border node in F i .O ∩ F j .I with i ≠ j), v i .x and v j .x are treated as the same status variable x and are assigned the same value.
These are automatically conducted by GRAPE, which minimizes communication costs by passing only updated variable values. To reduce the workload at the coordinator, alternatively, each worker may maintain a copy of G P and deduce the designation of its messages in parallel (see Section 6).
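The message grouping step at the coordinator can be sketched as follows. The sketch assumes that update parameters are identified by string keys, that the coordinator keeps their latest values in a dictionary, and that a precomputed destinations map stands in for the lookup in G P ; compose_messages and these data shapes are hypothetical.

```python
import math
from typing import Any, Callable, Dict, List

Message = Dict[str, Any]


def compose_messages(
    reports: Dict[int, Message],
    known: Dict[str, Any],
    destinations: Dict[str, List[int]],
    aggregate: Callable[[Any, Any], Any] = min,
) -> Dict[int, Message]:
    """Group changed update parameters into one message per destination worker.

    `known` holds the values maintained at the coordinator and is updated in place.
    """
    # Resolve conflicts: fold every reported value into the current one.
    candidate = dict(known)
    for values in reports.values():
        for var, val in values.items():
            candidate[var] = aggregate(candidate[var], val) if var in candidate else val
    messages: Dict[int, Message] = {}
    for var, val in candidate.items():
        if var in known and known[var] == val:
            continue  # unchanged values are never shipped
        known[var] = val
        for worker in destinations.get(var, ()):
            messages.setdefault(worker, {})[var] = val
    return messages


known = {"dist[c]": math.inf}
destinations = {"dist[c]": [2]}
# Workers 1 and 3 disagree on dist[c]; aggregateMsg = min keeps the smaller value.
first = compose_messages({1: {"dist[c]": 5}, 3: {"dist[c]": 4}}, known, destinations)
# A later, larger report does not change the maintained value, so nothing is sent.
second = compose_messages({1: {"dist[c]": 6}}, known, destinations)
```

Shipping only changed values is what keeps the communication cost proportional to the updates rather than to the size of the fragments.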
Example 3.1. For a pair (s, v) of nodes, denote by dist(s, v) the shortest distance from s to v (i.e., the length of a shortest path from s to v). Given graph G and a node s in V , GRAPE computes dist(s, v) for all nodes v ∈ V . It adopts an edge-cut partition [18]. It deduces F i .O by referencing G P and stores F i .O at each fragment F i .
As shown in Figure 3, PEval (Lines 1-14) is essentially identical to Dijkstra's sequential algorithm [32]. The only changes are the message preamble and segment (underlined). It declares an integer variable dist(s, v) for each node v, initially ∞ (except dist(s, s) = 0). It specifies min as aggregateMsg to resolve conflicts: If there are multiple values for the same dist(s, v), the smallest value is taken by the linear order on integers. The update parameters are C i .x = {dist(s, v) | v ∈ F i .O }. At the end of its process, PEval sends C i .x to coordinator P 0 . At P 0 , GRAPE maintains dist(s, v) for all v ∈ F .O = F .I . Upon receiving messages from all workers, it takes the smallest value for each dist(s, v). It finds those variables with smaller values, deduces their destinations P j by referencing fragmentation graph G P , groups them into messages M j , and sends M j to worker P j .
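A PEval of this kind can be sketched in Python as plain Dijkstra plus a final step that collects the update parameters. This is not the code of Figure 3: the adjacency-dictionary layout, the function name peval_sssp, and the choice to report only finite distances are assumptions made for illustration.

```python
import heapq
import math
from typing import Dict, Hashable, List, Set, Tuple

Node = Hashable
Fragment = Dict[Node, List[Tuple[Node, float]]]  # node -> [(neighbor, weight)]


def peval_sssp(
    fragment: Fragment, source: Node, out_border: Set[Node]
) -> Tuple[Dict[Node, float], Dict[Node, float]]:
    """Dijkstra on one fragment, returning (dist, message for the coordinator)."""
    # Message preamble: declare dist(s, v) for every node seen in the fragment.
    dist: Dict[Node, float] = {u: math.inf for u in fragment}
    for edges in fragment.values():
        for v, _ in edges:
            dist.setdefault(v, math.inf)
    heap: List[Tuple[float, Node]] = []
    if source in dist:  # the source lives in at most one fragment
        dist[source] = 0
        heap.append((0, source))
    # Unmodified sequential logic of Dijkstra's algorithm.
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale queue entry
        for v, w in fragment.get(u, ()):
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], v))
    # Message segment: report the update parameters that received a finite value.
    message = {v: dist[v] for v in out_border if dist.get(v, math.inf) < math.inf}
    return dist, message


# Fragment F_1 owns s, a, b and holds a copy of the border node x.
fragment_1 = {"s": [("a", 2), ("b", 5)], "a": [("b", 1)], "b": [("x", 4)]}
dist_1, message_1 = peval_sssp(fragment_1, "s", {"x"})
```

A fragment that does not contain the source returns all distances as ∞ and an empty message, and it does its real work later in IncEval.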

IncEval: Incremental Evaluation
respectively, for the next round of incremental computation.
Function IncEval can be any existing sequential incremental algorithm for Q. It shares the message preamble of PEval. At the end of the process, it identifies changed values to C i .x at each fragment F i and sends the changes as messages to P 0 . Upon receiving the messages at coordinator P 0 , GRAPE composes these messages as described in 3(b) in Section 3.2.
Boundedness. Graph computations are typically iterative. GRAPE reduces the costs of iterative computations by promoting bounded incremental algorithms for IncEval.
Consider an incremental algorithm IncEval for Q. Given G, Q ∈ Q, Q (G), and updates M to G, it computes ΔO such that Q (G ⊕ M ) = Q (G) ⊕ ΔO, where ΔO denotes changes to the old output Q (G). It is said to be bounded if its cost can be expressed as a function of |CHANGED| = |M | + |ΔO | (i.e., the size of changes in the input and output [28,57]). Intuitively, |CHANGED| represents the updating costs inherent to the incremental problem for Q itself. For a bounded IncEval, its cost is determined by |CHANGED|, not by the size |F i | of entire F i , no matter how big |F i | is. That is, it reduces computation on possibly big F i to smaller data bounded by O (|CHANGED|).
Example 3.2. Continuing with Example 3.1, we provide IncEval in Figure 4. It is the sequential incremental algorithm for SSSP developed in Ramalingam and Reps [56,57], in response to changed dist(s, v) for v in F i .I (here message M i includes changes to dist(s, v) for v ∈ F i .I deduced from G P ). Using a queue Que, it starts with M i , propagates the changes to affected area, and updates the distances (see Ramalingam and Reps [56,57] for details). The partial result is now the revised distances (Line 11).
At the end of the process, IncEval sends to coordinator P 0 updated values of those status variables in C i .x, as in PEval. It applies aggregateMsg min to resolve conflicts.
The changes to the algorithm of Ramalingam and Reps [56,57] are underlined in Figure 4. Following those works [56,57], one can show that IncEval is bounded: Its cost is determined by the sizes of "updates" |M i | and the changes to the output. This reduces the cost of iterative computation of SSSP (the while and for loops).
Note that IncEval only needs to deal with changes M i (for example, changes to dist(s, v) for v ∈ F i .I in Example 3.2). That is, changes are restricted to the update parameters, rather than generic updates.
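Because the changes are restricted to decreases of dist(s, v) at border nodes, an IncEval for SSSP can be sketched as a queue-driven propagation that touches only the affected area. The code below is a simplified stand-in for the algorithm of Ramalingam and Reps, not a transcription of Figure 4; inceval_sssp and the data layout are hypothetical.

```python
import heapq
import math
from typing import Dict, Hashable, List, Set, Tuple

Node = Hashable
Fragment = Dict[Node, List[Tuple[Node, float]]]  # node -> [(neighbor, weight)]


def inceval_sssp(
    fragment: Fragment,
    dist: Dict[Node, float],
    message: Dict[Node, float],
    out_border: Set[Node],
) -> Dict[Node, float]:
    """Apply decreased border distances and propagate them; `dist` is updated in place.

    Returns the changed update parameters to report to the coordinator.
    """
    heap: List[Tuple[float, Node]] = []
    # F_i (+) M_i: adopt a received value only if it is an improvement.
    for v, d in message.items():
        if d < dist.get(v, math.inf):
            dist[v] = d
            heapq.heappush(heap, (d, v))
    improved: Set[Node] = set()
    # Propagate from the changed nodes only; untouched parts of F_i are never visited.
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale queue entry
        for v, w in fragment.get(u, ()):
            if d + w < dist.get(v, math.inf):
                dist[v] = d + w
                improved.add(v)
                heapq.heappush(heap, (dist[v], v))
    return {v: dist[v] for v in improved if v in out_border}


# Fragment F_2 owns x, y, z and holds a copy of the border node w.
fragment_2 = {"x": [("y", 1)], "y": [("z", 2)], "z": [("w", 1)]}
dist_2 = {"x": math.inf, "y": math.inf, "z": math.inf, "w": math.inf}
round_1 = inceval_sssp(fragment_2, dist_2, {"x": 7}, {"w"})  # first value for x
round_2 = inceval_sssp(fragment_2, dist_2, {"x": 9}, {"w"})  # worse value, ignored
round_3 = inceval_sssp(fragment_2, dist_2, {"x": 5}, {"w"})  # a new shortcut
```

When a message brings no improvement, the function returns immediately with an empty report, so the worker becomes inactive for that superstep.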

Assemble Partial Results
Function Assemble takes partial results Q (F i ⊕ M i ) and fragmentation graph G P as input and combines Q (F i ⊕ M i ) to get complete query answer Q (G). It is triggered when no more changes can be made to update parameters C i .x for any i ∈ [1, m].
The GRAPE process terminates with correct Q (G). Indeed, the updates to C i .x are "monotonic": the value of dist(s, v) for each node v decreases or remains unchanged. There are finitely many such variables. Furthermore, dist(s, v) is the shortest distance from s to v, as warranted by the correctness of the sequential algorithms of Fredman and Tarjan [32], and Ramalingam and Reps [56,57] (i.e., PEval and IncEval).
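For SSSP, Assemble amounts to a union of the per-fragment distance maps. The sketch below assumes each partial result is a dictionary from node to distance; since a border node may appear in two fragments, it keeps the smaller value, which at the fixpoint is simply the common value. The name assemble_sssp is hypothetical.

```python
import math
from typing import Dict, Hashable, Iterable

Node = Hashable


def assemble_sssp(partials: Iterable[Dict[Node, float]]) -> Dict[Node, float]:
    """Union the partial results Q(F_i) into Q(G)."""
    answer: Dict[Node, float] = {}
    for dist in partials:
        for v, d in dist.items():
            # Copies of a border node agree at the fixpoint; min is a safeguard.
            answer[v] = min(d, answer.get(v, math.inf))
    return answer


result = assemble_sssp([
    {"s": 0, "a": 2, "b": 3, "x": 7},  # partial result at F_1 (x is a border copy)
    {"x": 7, "y": 8, "z": 10},         # partial result at F_2
])
```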
Putting these together, one can see that a PIE program parallelizes a graph query class Q provided with a sequential algorithm (PEval) and a sequential incremental algorithm (IncEval) for Q. Moreover, Assemble is typically a straightforward algorithm. A large number of sequential (incremental) algorithms are already in place for various Q, after decades of study of graph computations. Thus GRAPE is promising for making parallel graph computations accessible to a large group of users.
(1) There have been methods for incrementalizing algorithms, to get incremental algorithms from their batch counterparts [11,24]. Moreover, incremental algorithm IncEval only needs to deal with changes to status variables (update parameters), not necessarily generic updates (although to focus on the main idea, we present IncEval using the familiar notion of incremental graph algorithms). Such changes are aggregated by function aggregateMsg and depend on how aggregateMsg is defined. Hence it is often not hard to develop IncEval by revising a batch algorithm in response to changes to update parameters, as will be shown by the case of CC (connected components) in Section 5.2.
(2) Incremental IncEval speeds up iterative computations by minimizing unnecessary recomputation of Q (F i ) at each worker P i , no matter if IncEval is bounded or not. Indeed, boundedness is not the only criterion for the effectiveness of incremental algorithms. Alternative performance guarantees for incremental graph algorithms have been developed, such as semi-boundedness [28], localizable incremental algorithms, and relative boundedness [24].
(3) In contrast to existing graph systems, GRAPE parallelizes sequential algorithms PEval and IncEval as a whole, with the additional declaration of a message segment (PEval). As a result, users do not have to "think like a vertex" [49,50,63,71] when programming. As opposed to vertex-centric and block-centric systems, GRAPE runs sequential algorithms on entire fragments. Moreover, IncEval employs incremental evaluation to reduce cost, which is a unique feature of GRAPE.
(4) GRAPE aims to help users develop parallel programs, especially those who are more familiar with conventional sequential programming. This said, users of GRAPE still need to know the domain knowledge required to design update parameters and aggregate functions.

FOUNDATION OF GRAPE
We next present fundamental results underlying GRAPE. We first identify a condition under which a PIE program guarantees to converge at correct answers under GRAPE (Section 4.1). We then demonstrate the expressive power of GRAPE by simulating BSP, MapReduce, and PRAM (Section 4.2).

Correctness of Parallel Model
Consider a partition strategy P and a PIE program ρ for a class Q of graph queries, where ρ consists of functions PEval, IncEval, and Assemble. Given a query Q ∈ Q, a graph G, and a natural number m, the GRAPE parallelization of ρ can be modeled as a simultaneous fixpoint operator defined on m fragments. More specifically, it starts with PEval for partial evaluation and conducts incremental computation by taking IncEval as the intermediate consequence operator:
$$R_i^{0} = \mathrm{PEval}(Q, F_i), \qquad R_i^{r+1} = \mathrm{IncEval}(Q, R_i^{r}, F_i, M_i),$$
where i ∈ [1, m]; r indicates a superstep; M i carries the changes to the update parameters C i .x i (via message); and $R_i^{r}$ denotes partial results (including values of C i .x i ) computed at fragment F i after the (r + 1)-th superstep. The computation proceeds until it reaches r 0 such that $R_i^{r_0} = R_i^{r_0+1}$ for all i ∈ [1, m]. Note that the computation does not reach a fixpoint as long as update parameters C i .x i keep changing. This is consistent with the parallel model of GRAPE (Section 3.1).
There has been a large body of work on fixpoint computation to study (a) whether a fixpoint computation converges [20,35,52,74]; and (b) how to accelerate fixpoint computation [35,40,60,69]. In this article, we mainly focus on (a). Issue (b) has been addressed in Fan et al. [27], which is based on GRAPE.
As an example, here we identify one convergence guarantee for the simple parallel model as a sufficient condition. We start with some notations.
(1) We say that PIE program ρ terminates under GRAPE with P if, for all queries Q ∈ Q and all graphs G, there always exists r 0 such that at superstep r 0 , $R_i^{r_0} = R_i^{r_0+1}$ for all i ∈ [1, m].
(2) We say that a PIE program ρ with PEval, IncEval and Assemble is correct for Q with regard to P if, for all queries Q ∈ Q and all graphs G fragmented into F 1 , . . . , F m with P, whenever $R_i^{r_0} = R_i^{r_0+1}$ for all i ∈ [1, m], we have $\mathrm{Assemble}(R_1^{r_0}, \ldots, R_m^{r_0}) = Q(G)$, where Q (G) is the answer to Q in G, and R r i is the partial result computed at F i after the (r + 1)-th superstep.
We say that GRAPE correctly parallelizes ρ with partition strategy P if, for all queries Q ∈ Q and all graphs G, ρ always terminates under GRAPE with P and returns Q (G).
(3) We say that PEval and IncEval satisfy the monotonic condition with regard to partition strategy P if, for all graphs G and every variable x ∈ C i .x for i ∈ [1, m], (a) the values of x are from a finite domain, and (b) there exists a partial order p x on the values of x such that IncEval updates x following p x . Intuitively, condition (a) says that x draws values from a finite domain, and condition (b) says that x is updated "monotonically" following p x . These ensure that PIE programs with PEval and IncEval terminate under GRAPE with P. For instance, dist(s, v) in Example 3.1 can only be changed in decreasing order (i.e., it is computed by function min for aggregateMsg of IncEval in the active domain of G), and hence PEval and IncEval for SSSP satisfy the monotonic condition.
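To see why a finite domain together with monotonic updates bounds the number of supersteps, a rough counting argument can be written out as follows. The symbol D_x, denoting the finite domain of an update parameter x, is notation introduced here for illustration only.

```latex
% D_x: the finite domain of update parameter x (notation assumed here).
% Each change moves x strictly along the partial order p_x, so no value repeats.
\#\text{changes}(x) \;\le\; |D_x| - 1
\quad\Longrightarrow\quad
\#\{\, r : \text{some } x \text{ changes in superstep } r \,\}
\;\le\; \sum_{i=1}^{m} \; \sum_{x \in C_i.\bar{x}} \bigl(|D_x| - 1\bigr) \;<\; \infty .
```

Since a superstep in which no update parameter changes produces no messages, the computation cannot continue beyond this bound.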
We next provide a condition that warrants the correctness of GRAPE parallelization.
Theorem 4.1 (Assurance Theorem). Consider a PIE program ρ with PEval, IncEval, and Assemble for a class Q of graph queries. GRAPE correctly parallelizes ρ with a graph partition strategy P if (a) PEval and IncEval satisfy the monotonic condition with regard to P, and (b) ρ with PEval, IncEval and Assemble is correct for Q with regard to P.
More specifically, (1) under the monotonic condition, the PIE program ρ is guaranteed to terminate under GRAPE and, better yet, (2) it converges at the correct answer Q (G) for all queries Q ∈ Q and all graphs G as long as the sequential algorithms PEval, IncEval, and Assemble of ρ are correct for Q. In other words, condition (a) guarantees termination of a PIE program under GRAPE, and conditions (a) and (b) put together guarantee the correctness of a PIE program under GRAPE.
Proof. We show the correctness of Theorem 4.1 by analyzing the computations of a PIE program. Consider any run of a PIE algorithm depicted in Figure 5. Observe the following.
. . . Since the values of the update parameters are from a finite domain and are updated monotonically, we know that there exists a number n such that $R_i^{t+1} = R_i^{t}$ for all i ∈ [1, m] and t ≥ n. That is, ρ terminates.
(1) The fixpoint computation model does not reduce the expressive power of GRAPE. Indeed, (a) fixpoint computation has sufficient expressive power; many data mining and machine learning algorithms can be modeled as fixpoint computations [60,69]. (b) We can conduct any computation in PEval and IncEval, and hence by GRAPE. We will see a formal characterization in Section 4.2.
(2) The monotonic condition is a sufficient condition for GRAPE computations to converge, but it is not a necessary condition. Indeed, there has been a large body of work on convergence [35,40,60,69] from which other characterizations can be deduced.
(3) It does not mean that only algorithms satisfying the monotonic condition can be parallelized in GRAPE. As will be shown by Theorem 4.2, any MapReduce algorithm can be migrated to GRAPE without extra complexity, and not all MapReduce algorithms are monotonic. The monotonicity is just a sufficient condition under which users do not have to worry about convergence.

The Expressivity of GRAPE
We next show that the simple parallel model of GRAPE does not imply degradation in the expressivity. As a result, GRAPE can readily switch to other parallel models without extra complexity.
Following Valiant [66], we say that a parallel model M 1 can optimally simulate model M 2 if there exists a compilation algorithm that transforms any program with cost C on M 2 to a program with cost O (C) on M 1 . The cost includes computational and communication costs. For GRAPE, these are measured by the running time of PEval, IncEval, and Assemble on all the processors and by the total size of the messages passed among all the processors in the entire process.
We show that GRAPE optimally simulates popular parallel models MapReduce [22], BSP [65], and PRAM [66]. Note that GRAPE parallelization is modeled as a simultaneous fixpoint computation. Moreover, GRAPE is a BSP system under the following constraints: (1) in each round of computation, GRAPE runs the same function PEval or IncEval, while other parallel systems may run different user-defined functions in different rounds (e.g., MapReduce); and (2) GRAPE only allows the status variables of the same vertex in different fragments to be exchanged, while there is no such restriction in some other parallel systems. We show that, despite these restrictions, GRAPE does not degrade in expressive power: It is as powerful as MapReduce, BSP, and PRAM.
As a consequence of the result, all algorithms developed for graph systems based on these models can be migrated to GRAPE without increasing complexity bounds, including Pregel [50], GraphX [34], Giraph++ [63], and Blogel [71]. The result below is stronger than its counterpart in Fan et al. [31] in that it does not use key-value pairs (messages) in the simulation (see electronic appendix for proof).
Theorem 4.2. (1) BSP can be optimally simulated by GRAPE; (2) all MapReduce programs using n processors can be optimally simulated by GRAPE using n processors; and (3) all CREW PRAM algorithms using O (P ) total memory, O (P ) processors and t time can be run in GRAPE in O (t ) supersteps using O (P ) processors with O (P ) memory.

Remark.
(1) Theorem 4.2 aims to show the expressive power of GRAPE (e.g., all MapReduce algorithms can be migrated to GRAPE without increasing the complexity bounds). Nonetheless, it is possible that some simulated applications are not efficient in practice due to a possible large constant in the simulation complexity O (C) (see the proof of Theorem 4.2 in the electronic appendix).
(2) As indicated by Theorem 4.2, all parallel algorithms for MapReduce, BSP, and PRAM are also supported by GRAPE. Moreover, those graph computations that have effective (e.g., bounded) incremental algorithms can be accelerated by GRAPE.
(3) Compared with the vertex-centric model, the ability to run sequential algorithms over an entire fragment has several benefits. One of them is that it can reduce the number of supersteps, as demonstrated by SSSP. This is because within the fragment each worker can do some computation that would have required extra supersteps in a vertex-centric system like Pregel. This is analogous to running multiple "local-supersteps" in a worker before running a global superstep. Similarly, it can reduce communication since no message passing is needed within a fragment, and this happens also because of the reduction in supersteps. Finally, because the sequential algorithms have access to the entire fragment, existing sequential algorithms can be executed, and optimization techniques that are developed for the sequential algorithms are inherited in the parallel setting. Hence, GRAPE can speed up parallel computations and achieve better performance by conducting efficient fragment-level local computations, without incurring excessive communication costs.
(4) However, for algorithms that make only one or very few fragments "active" at a time, GRAPE may not speed up their parallel computations. These include "local" queries to find neighbors of a given node, or k nearest neighbors (kNN) queries with very small constant k. The evaluation of such queries is restricted to a small subgraph localized by the given node. Such a localized subgraph may be entirely contained in one fragment at one worker, and hence may not fully enjoy parallel processing unless we allow a fine-grained parallelization within the fragment by, for example, using parallelized IncEval or partitioning the fragment into multiple small virtual fragments. In addition, GRAPE may not make P-complete problems such as Depth First Search (DFS) more efficient than other parallel platforms; these algorithms are inherently difficult to parallelize.

GRAPH COMPUTATIONS IN GRAPE
We have seen how GRAPE parallelizes graph traversal SSSP (Section 3). We next show how GRAPE parallelizes existing sequential algorithms for a variety of graph computations. We take graph pattern matching (defined in terms of graph simulation and subgraph isomorphism), graph connectivity, and collaborative filtering as examples (Sections 5.1-5.3, respectively).
We adopt edge-cut [12,18] in this section unless stated otherwise. Under vertex-cut [47] and other graph partition strategies, PIE programs can be developed similarly.

Graph Pattern Matching
We start with graph pattern matching, which is commonly used in social media marketing [30], social network analysis [25], and knowledge base expansion [23], among other things.
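A graph pattern is a graph Q = (V Q , E Q , L Q ), where V Q is a set of query nodes, E Q is a set of query edges, and each node u in V Q carries a label L Q (u). We study two semantics of graph pattern matching.
Graph simulation. A graph G = (V , E, L) matches a pattern Q via simulation if there exists a binary relation R ⊆ V Q × V such that (a) for each query node u ∈ V Q , there exists a node v ∈ V such that (u, v) ∈ R; and (b) for each pair (u, v) ∈ R, L Q (u) = L(v) and, for each query edge (u, u′) in E Q , there exists an edge (v, v′) in G such that (u′, v′) ∈ R. For (u, v) ∈ R, we refer to v as a match of u. It is known that if G matches Q, then there exists a unique maximum relation [39], referred to as Q (G). If G does not match Q, then Q (G) is the empty set. Moreover, Q (G) can be computed in O ((|V Q | + |E Q |)(|V | + |E|)) time [25,39]. Graph pattern matching via graph simulation is stated as follows.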
• Input: A directed graph G and a graph pattern Q.
• Output: The unique maximum relation Q (G).
We next show how GRAPE parallelizes graph simulation.
(1) PEval. GRAPE takes the sequential algorithm of Henzinger et al. [39] as PEval to compute Q (F i ) in parallel. Its message preamble declares a Boolean status variable x (u,v ) for each query node u in V Q and each node v in F i , indicating whether v matches u, initialized true. It takes F i .I as candidate set C i .
Before giving the details of PEval, we first review the algorithm in Henzinger et al. [39]. The simulation algorithm [39] computes the match set sim(u) for each query node u via least fixpoint computation. The initial match set sim(u) contains all possible candidate matches of u. These match sets are then iteratively refined by removing nonmatching nodes. The process stops when a fixpoint is reached.
As shown in Figure 6, the main body of PEval (Lines 3-17) is almost identical to the simulation algorithm of Henzinger et al. [39], except for the underlined parts that preprocess fragments. More specifically, PEval first preprocesses each fragment F i by removing incoming edges and their associated "foreign nodes" and by including nodes to which there exists an outgoing edge from F i (Lines 1-2). Such preprocessing is conducted to comply with the semantics of simulation relations: the match status of a data node v (i.e., whether v matches some query node u) is determined by the complete match status of all of v's outgoing neighbors. This also implies that the match status of v is propagated and updated via the reverse direction of edges linked to v. The preprocessing yields a revised fragment.
For each node u ∈ V Q , PEval starts with a set sim(u) of candidate matches v in F i (Lines 3-8) and iteratively removes from sim(u) those nodes that violate the simulation condition (Lines 9-17). It uses post(v) and pre(v) to keep track of successors and predecessors of node v, respectively (see Henzinger et al. [39] for details). It refines sim(u) for all u ∈ V Q . The partial result Q (F i ) is designated (Line 18).
At the end of the process, PEval sends C i .x = {x (u,v ) | u ∈ V Q , v ∈ F i .I } to coordinator P 0 . That is, the updated match status is propagated via the reverse direction of edges. At coordinator P 0 , GRAPE maintains x (u,v ) for all v ∈ F .I . Upon receiving messages from all workers, it changes x (u,v ) to false if it is false in one of the messages. This is specified by min as aggregateMsg, taking the order false ≺ true. GRAPE identifies those variables that become false, deduces their destinations by referencing G P , groups them into messages M j , and sends M j to P j .
(2) IncEval is the sequential incremental simulation algorithm of Fan et al. [28] in response to edge deletions. The changes to sim(u) are "equivalent to" removing some nodes from sim(u), which can be also seen as the results of removing some relevant edges. Thus, propagating the changes of these nodes can be done by propagating the changes of deleted edges. Hence we can use the algorithm of Fan et al. [28] for edge deletions. Note that we just make use of the algorithm for edge deletions as IncEval to process changes to x (u,v ) , but IncEval does not have to handle generic edge deletions in the graph.
As shown in Figure 7, if status variable x (u,v ) is changed to false by message M i , it is treated as deleting "cross edges" to v ∈ F i .O. Using a stack (Line 1), it starts with changed status variables in M i , propagates the changes to the affected area, and removes from sim those matches that become invalid (Lines 3-7; see Fan et al. [28] for more details). The partial result is now the revised sim relation (Line 8). At the end of the process, IncEval sends to coordinator P 0 those values of the status variables in C i .x that have been set false in the process, along the same lines as how PEval does it.
As shown in Fan et al. [28], IncEval is semi-bounded: Its cost is decided by the sizes of "updates" |M i | and changes to the affected area necessarily checked by all incremental algorithms for Sim, not by |F i |. This reduces the cost of iterative computation of graph simulation (the while and for loops).
(3) Assemble simply takes $Q(G) = \bigcup_{i \in [1, m]} Q(F_i)$, the union of all partial matches; that is, the sim relation computed at each fragment F i at the end of the process.
(4) The correctness of the GRAPE parallelization is warranted by Theorem 4.1 and the monotonic updates to C i .x. Indeed, x (u,v ) is initially true for each border node v and is changed at most once to false, taking the order false ≺ true. Furthermore, x (u,v ) denotes whether v matches u, as warranted by the correctness of the sequential algorithms [28,39] (PEval and IncEval).
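The refinement idea behind the match sets sim(u), and the role of min as aggregateMsg on Boolean status variables, can be sketched as follows. This is a naive refinement loop written for illustration; it is not the algorithm of Henzinger et al. [39] and does not meet its time bound. The function name `simulation` and the dictionary-based graph encoding are assumptions of the sketch.

```python
def simulation(q_labels, q_edges, g_labels, g_edges):
    """Return {query node: set of matches}; all sets are empty if G does not match Q."""
    succ = {v: set() for v in g_labels}
    for v, w in g_edges:
        succ[v].add(w)
    # Start from all label-compatible candidates, then refine.
    sim = {u: {v for v in g_labels if g_labels[v] == q_labels[u]} for u in q_labels}
    changed = True
    while changed:
        changed = False
        for u, u2 in q_edges:
            # v can match u only if some successor of v matches u2.
            bad = {v for v in sim[u] if not (succ[v] & sim[u2])}
            if bad:
                sim[u] -= bad
                changed = True
    if any(not matches for matches in sim.values()):
        return {u: set() for u in q_labels}
    return sim


q_labels = {"uA": "A", "uB": "B"}
q_edges = [("uA", "uB")]
g_labels = {1: "A", 2: "B", 3: "A"}
match = simulation(q_labels, q_edges, g_labels, [(1, 2)])
no_match = simulation(q_labels, q_edges, g_labels, [])

# With the order false < true, min over the copies of x(u,v) yields false as
# soon as any worker has invalidated the match.
resolved = min([True, False, True])
```

Node 3 carries label A but has no successor labeled B, so it is removed from sim(uA) during refinement.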
Subgraph isomorphism. We next parallelize subgraph isomorphism, under which a match of pattern Q in graph G is a subgraph of G that is isomorphic to Q. More specifically, a match of Q in G is a subgraph G′ = (V′, E′, L′) of G such that there exists a bijective function h from V Q to V′, where (1) for each node u ∈ V Q , L Q (u) = L′(h(u)) and (2) e = (u, u′) is an edge in Q if and only if e′ = (h(u), h(u′)) is an edge in G′ and L Q (e) = L′(e′).
Graph pattern matching via subgraph isomorphism seeks to compute the set Q (G) of all matches of Q in G. It is intractable: It is NP-complete to decide whether Q (G) is nonempty.
GRAPE parallelizes TurboISO, the sequential algorithm of Han et al. [37] for subgraph isomorphism. It has two supersteps, one for PEval and the other for IncEval, outlined as follows.
(1) PEval identifies update parameters C i .x at each fragment F i . It declares an integer variable dist(s, t ) as the status variable for each pair of nodes s and t in F i , to record their distance in fragment F i . It computes the d Q -neighbor N d Q (s) of each border node s ∈ F i .I ∪ F i .O . Here, d Q is the diameter of pattern Q; that is, the maximum length of the shortest paths between any two nodes in Q when Q is treated as an undirected graph; and N d (v) is the subgraph of G induced by the nodes within d hops of v. Intuitively, upon receiving update parameters C i .x from all workers, coordinator P 0 completes the d Q -neighbor of each border node s in the entire graph G and sends the d Q -neighbor to workers where s resides to compensate information loss caused by fragmentation of graph G. After this step, one can directly apply TurboISO to each expanded fragment in parallel.
More specifically, PEval computes C i .x at fragment F i , as shown in Figure 8. PEval performs a standard Breadth-First Search (BFS) traversal from each border node s in F i .I ∪ F i .O to identify (a) a set V (s) of nodes that are reachable from s in d Q hops in F i (Lines 2-10) and (b) a set E (s) of edges that are associated with nodes in V (s) (Line 11). Here, fragment F i is treated as an undirected graph in the BFS traversal, ignoring the orientations of the edges (Line 7). PEval annotates each node v in V (s) with its distance dist(s, v) from s (Lines 8-9). These compose a local (annotated) d Q -neighbor N d Q (s) in fragment F i . To simplify the discussion, PEval sends these local N d Q (s)'s to coordinator P 0 (Line 13). In practice, the union of all these d Q -neighbors is sent to P 0 in a single message. The size of such a message is bounded by the size |G | of graph G.
Upon receiving the local versions of N d Q (s) for all s in F .I ∪ F .O , coordinator P 0 expands each of them to the d Q -neighbor N d Q (s) in the entire graph G. This is specified by procedure Expand as aggregateMsg, which performs a BFS-like traversal on the data received and combines necessary nodes and edges by making use of fragmentation graph G P (see Section 2) and the annotations associated with the nodes. A message M i is composed and sent to worker P i for i ∈ [1, m], including all the nodes and edges in the d Q -neighbor of s in the entire graph G, for each s ∈ F i .O ∪ F i .I .
(2) IncEval is the sequential algorithm TurboISO [37]. Given a pattern Q and a graph G, TurboISO finds all isomorphic matches of Q in G as follows. (1) It first picks a start vertex from query Q and rewrites Q into a tree Q′ by performing a BFS search. Each node in the tree corresponds to a Neighborhood Equivalence Class (NEC), obtained by merging nodes with the same labels and neighborhoods. (2) It then identifies candidate regions in G that may contain matches of Q. (3) For each candidate region, it computes an order on the nodes in Q based on the number of their candidate matches in the region. (4) It then searches matches within the candidate region in this order. During the search, it only combines partial matches of the NECs instead of inspecting all possible enumerations. (5) Finally, it expands matches of the NECs to get exact matches of Q.
As shown in Figure 9, IncEval computes Q (F i ⊕ M i ) at each worker P i in parallel, on fragment F i extended with the d Q -neighbor of each node in F i .O ∪ F i .I , by applying TurboISO. IncEval sends no messages since the values of variables in C i .x remain unchanged. As a result, IncEval is executed once, and hence two supersteps suffice.
(4) The correctness of the process is assured by TurboISO and the locality of subgraph isomorphism: any two nodes in a match of Q are within d Q hops of each other.
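The bounded BFS that PEval uses to collect a local d Q -neighbor can be sketched as follows. This is an illustrative sketch only: the function name `d_neighbor` and the edge-list encoding are assumptions, and the edge set returned here is the one induced by the visited nodes, which is one reading of "edges associated with nodes in V (s)".

```python
from collections import deque


def d_neighbor(edges, s, d):
    """BFS from s over at most d hops, ignoring edge orientation.

    Returns (dist, induced), where dist annotates each reached node with its
    hop distance from s and induced lists the edges among the reached nodes.
    """
    adj = {}
    for v, w in edges:
        adj.setdefault(v, set()).add(w)
        adj.setdefault(w, set()).add(v)   # treat the fragment as undirected
    dist = {s: 0}
    queue = deque([s])
    while queue:
        v = queue.popleft()
        if dist[v] == d:                  # do not expand beyond d hops
            continue
        for w in adj.get(v, ()):
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    induced = [(v, w) for v, w in edges if v in dist and w in dist]
    return dist, induced


dist, induced = d_neighbor([(1, 2), (2, 3), (3, 4), (5, 1)], s=1, d=2)
```

Node 4 lies three hops from node 1, so it and the edge (3, 4) are left out when d = 2.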

Graph Connectivity
We next study graph connectivity for computing connected components (CC).
Consider an undirected graph G. A subgraph G s of G is a connected component of G if (a) it is connected (i.e., for any pair (v, v′) of nodes in G s , there exists a path between v and v′), and (b) it is maximal (i.e., adding any node to G s makes the induced subgraph no longer connected).
The CC problem is stated as follows and is known to be solvable in O (|G |) time [13].
• Input: An undirected graph G = (V , E, L).
• Output: All connected components of G.
GRAPE parallelizes CC as follows. It picks a sequential CC algorithm as PEval. At each fragment F i , PEval computes its local connected components and creates their ids. The component ids of the border nodes are exchanged with neighboring fragments. The (changed) ids are then used to incrementally update local components in each fragment by IncEval, which simulates a "merging" of two components whenever possible, until no more changes can be made.
(1) PEval declares an integer status variable v.cid for each node v in fragment F i , initialized as its node id. As shown in Figure 10, PEval first uses a standard sequential Depth-First Search (DFS) traversal to compute the local connected components of F i (Line 1). For each local component C, (a) PEval creates a "root" node v r carrying the minimum node id in C as v r .cid (Lines 3-4) and (b) links all the nodes in C to v r , and sets their cid as v r .cid (Lines 5-6). These can be completed in one pass of the edges of F i via DFS. At the end of the process, PEval sends {v.cid | v ∈ F i .I } to coordinator P 0 . In other words, the set consists of the update parameters at fragment F i .
At P 0 , GRAPE maintains v.cid for each v ∈ F .I . It updates v.cid by taking the smallest cid, if multiple cids are received, by taking min as aggregateMsg in the message segment of PEval. It groups the nodes with updated cids into messages M j and sends M j to P j by referencing G P .
(2) IncEval incrementally updates the cids of the nodes in each fragment F i upon receiving M i , in parallel, as shown in Figure 11. Observe that message M i sent to P i consists of v.cid with updated (smaller) values. For each v.cid in M i , IncEval finds the root v r of v (Line 3) and updates v r .cid to the minimal one (Lines 4-5). IncEval then propagates the changes from every updated root node v r to all nodes linked to v r by changing their cids to v r .cid (Lines 6-8). At the end of the process, IncEval sends to coordinator P 0 the updated cids of nodes in F i .I , just as in PEval.
One can verify that the incremental algorithm IncEval is bounded: It takes O (|M i |) time to identify the root nodes and O (|AFF|) time to update cids by following the direct links from the roots, where AFF consists of only those nodes with their cid changed. Hence, it avoids redundant local traversal.
(3) Assemble merges all the nodes having the same cid in a bucket as a single connected component, and returns the set of all these buckets as all the connected components.
(4) Correctness. The process terminates as the cids of the nodes are monotonically decreasing by the definition of aggregateMsg until no changes can be made. Moreover, it correctly merges two local connected components by propagating the smaller component id.
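The root-link structure that makes IncEval bounded can be sketched as follows. This is a simplified illustration, not the code of Figures 10 and 11: the names `peval_cc` and `inceval_cc` are hypothetical, and the "root" of a component is represented by its minimum node id rather than by a separately created node.

```python
def peval_cc(nodes, edges):
    """Local components via iterative DFS; returns (cid, root_of, members)."""
    adj = {v: [] for v in nodes}
    for v, w in edges:
        adj[v].append(w)
        adj[w].append(v)
    cid, root_of, members, seen = {}, {}, {}, set()
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            v = stack.pop()
            comp.append(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        root = min(comp)                 # the root carries the minimum node id
        members[root] = comp
        for v in comp:
            root_of[v], cid[v] = root, root
    return cid, root_of, members


def inceval_cc(cid, root_of, members, message):
    """Apply smaller cids from `message`; only affected components are touched."""
    new_root_cid = {}
    for v, new_cid in message.items():   # O(|M_i|): locate roots via direct links
        r = root_of[v]
        if new_cid < new_root_cid.get(r, cid[r]):
            new_root_cid[r] = new_cid
    changed = {}
    for r, c in new_root_cid.items():    # O(|AFF|): rewrite only changed nodes
        for v in members[r]:
            cid[v] = c
            changed[v] = c
    return changed


cid, root_of, members = peval_cc([5, 6, 7, 8, 9], [(5, 6), (6, 7), (8, 9)])
initial = dict(cid)
changed = inceval_cc(cid, root_of, members, {7: 2})
```

In the example, the message lowers the cid of border node 7 to 2; only the component {5, 6, 7} is rewritten, and the component {8, 9} is never visited.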

Collaborative Filtering
As an example of machine learning, we consider collaborative filtering (CF) [48], a method commonly used for inferring user-product rates in social recommendation. It takes as input a bipartite graph G that includes two types of nodes, namely, users U and products P, and a set of weighted edges E ⊆ U × P. (1) Each user u ∈ U (respectively, product p ∈ P) carries an (unknown) latent factor vector u.f (respectively, p.f ).
(2) Each edge e = (u, p) in E carries a weight r (e), estimated as $u.f^{T} * p.f$ (possibly ∅, i.e., "unknown"), that encodes a rating from user u to product p. The training set E T refers to edge set {e ∈ E | r (e) ≠ ∅}; that is, all the known ratings. The CF problem is as follows.
That is, CF predicts all the unknown ratings by learning the factor vectors that "best fit" E T . A common practice to approach CF is to use the Stochastic Gradient Descent (SGD) algorithm [48], which iteratively (1) computes a prediction error $\epsilon(u, p) = r(u, p) - u.f^{T} * p.f$ for each e = (u, p) ∈ E T and (2) updates u.f and p.f accordingly toward minimizing ϵ ( f , E T ). The SGD algorithm [48] is inherently sequential. To parallelize it, DSGD [33] partitions the dataset such that, at each round of computation, different workers can process disjoint datasets independently without conflicts. More specifically, it partitions the user set U into m disjoint subsets U (1), U (2), . . . , U (m) and, similarly, the product set P into disjoint P (1), P (2), . . . , P (m) such that $U = \bigcup_{i=1}^{m} U(i)$ and $P = \bigcup_{j=1}^{m} P(j)$, for a constant m. Correspondingly, the training set E is divided into m² blocks such that, for each 1 ≤ i, j ≤ m, block E (i, j), identified by the pair (i, j), is the subset of E induced by U (i) and P (j). Clearly, $E = \bigcup_{1 \le i, j \le m} E(i, j)$. Two blocks E (i, j) and E (i′, j′) are independent if i ≠ i′ and j ≠ j′. DSGD parallelizes SGD by utilizing the property that the factor vectors of independent blocks can be updated in parallel without conflicts.
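The block structure and the independence test described above can be sketched as follows. The helper names `partition_blocks` and `independent`, the 0-based block indices, and the tiny rating set are assumptions made for this illustration.

```python
from collections import defaultdict


def partition_blocks(ratings, user_part, product_part):
    """Group known ratings into blocks E(i, j) by user part i and product part j."""
    blocks = defaultdict(list)
    for (u, p), r in ratings.items():
        blocks[(user_part[u], product_part[p])].append((u, p, r))
    return dict(blocks)


def independent(b1, b2):
    """Blocks (i, j) and (i2, j2) are independent if they share no row and no column."""
    (i, j), (i2, j2) = b1, b2
    return i != i2 and j != j2


ratings = {("u1", "p1"): 5.0, ("u1", "p2"): 3.0, ("u2", "p2"): 4.0}
user_part = {"u1": 0, "u2": 1}        # hypothetical partition of users, m = 2
product_part = {"p1": 0, "p2": 1}     # hypothetical partition of products
blocks = partition_blocks(ratings, user_part, product_part)

# Independent blocks touch disjoint users and disjoint products.
users_00 = {u for u, _, _ in blocks[(0, 0)]}
users_11 = {u for u, _, _ in blocks[(1, 1)]}
```

Because independent blocks share neither users nor products, SGD updates on them never write to the same factor vector.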
Adopting the partition strategy of DSGD, GRAPE parallelizes the sequential SGD algorithm such that different workers can run SGD on different fragments of a graph G in parallel without conflicts. GRAPE partitions G by a vertex-cut strategy and distributes the m² blocks into m different fragments. More specifically, it defines a status variable (v.f , t ) for each node v, where v.f is the factor vector of v (initially ∅) and t is an integer (initially 0) that bookkeeps a timestamp at which v.f is last updated. The candidate set C i consists of the border nodes in set F i .I .
As shown in Figure 12, PEval essentially runs the sequential SGD algorithm of Koren et al. [48] on the training block E (i, i) as follows. Each time it picks an edge (u, p) from the training set uniformly at random and computes the prediction error ϵ (u, p), and it updates local factor vectors by a magnitude proportional to γ in the opposite direction of the gradient, following the update rule of [48] with regularization parameter λ:
$u.f \leftarrow u.f + \gamma\,(\epsilon(u, p) * p.f - \lambda * u.f)$,
$p.f \leftarrow p.f + \gamma\,(\epsilon(u, p) * u.f - \lambda * p.f)$. (1)
By the partition strategy, the training blocks E (1, 1), E (2, 2), . . . , E (d, d ) on different fragments line up on the diagonal of the rating matrix and are independent of each other. At the end of its process, PEval sends the updated factor vectors of the nodes in C i to coordinator P 0 , which keeps, for each node, the vector with the latest timestamp (i.e., it defines aggregateMsg as max on timestamps). This is well-defined since different workers process independent blocks. GRAPE then groups the updated vectors into messages M j and sends M j to P j as usual. That is, GRAPE passes the latest updates to factor vectors to workers.
In addition, coordinator P 0 selects m independent blocks to be processed in the next round. To do this, P 0 simply picks a permutation p 1 p 2 . . . p m of {1, 2, . . . ,m} following some fixed strategy (e.g., simple cycle scheduling). It sends a pair (j, p j ) along with M j to P j . By the partition strategy, block E (j, p j ) belongs to F j and E (j, p j ) is independent of E (i, p i ) if j ≠ i.
(2) IncEval iteratively updates the factor vectors of independent blocks. As shown in Figure 13, IncEval first updates the factor vectors with the latest changes (Lines 1-2). It then extracts the training set B i for the current round based on the block identifier assigned by P 0 (Line 3) and runs the sequential SGD algorithm [48] on B i just like PEval (Line 4). Since the training sets B 1 , B 2 , . . . , B d are extracted from the independent blocks E (1, p 1 ), E (2, p 2 ), . . . , E(d, p d ), they can be processed in parallel without conflict. At the end of the process, it sends the updated vectors in C i like PEval.
(3) Assemble simply takes the union of all the factor vectors of nodes from all the workers.
(4) Correctness. Observe that a permutation p 1 p 2 . . . p m of {1, 2, . . . ,m} corresponds to an m-monomial stratum of DSGD [33]. The permutation in each round is picked according to a stratum selection strategy. It is known that the strategy guarantees the convergence of DSGD (see Gemulla et al. [33] for more details). As a result, GRAPE converges and correctly infers CF models by the correctness of DSGD.
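One SGD update and a simple cyclic choice of independent blocks can be sketched as follows. This is an illustration under assumptions: `sgd_step` and `cyclic_schedule` are hypothetical names, `gamma` stands for the learning rate γ, the regularization weight `lam` follows the common formulation in Koren et al. [48] rather than anything stated for GRAPE, and block indices are 0-based.

```python
def sgd_step(uf, pf, rating, gamma=0.05, lam=0.01):
    """One SGD update for edge (u, p); returns new vectors and the error used."""
    err = rating - sum(a * b for a, b in zip(uf, pf))       # prediction error
    new_uf = [a + gamma * (err * b - lam * a) for a, b in zip(uf, pf)]
    new_pf = [b + gamma * (err * a - lam * b) for a, b in zip(uf, pf)]
    return new_uf, new_pf, err


def cyclic_schedule(m, round_no):
    """Permutation for one round: worker j processes block (j, (j + round_no) % m)."""
    return [(j, (j + round_no) % m) for j in range(m)]


uf, pf = [0.1, 0.1], [0.1, 0.1]
uf, pf, err0 = sgd_step(uf, pf, rating=1.0)
uf, pf, err1 = sgd_step(uf, pf, rating=1.0)

round0 = cyclic_schedule(3, 0)
round1 = cyclic_schedule(3, 1)
```

Each round assigns every worker a distinct row and a distinct column of the block grid, so the m selected blocks are pairwise independent.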

IMPLEMENTATION OF GRAPE
We next outline an implementation of parallel graph engine GRAPE.
Architecture overview. GRAPE adopts a four-tier architecture, depicted in Figure 14.  (1) Its top layer is a user interface. As shown in Figure 1, GRAPE supports interactions with (a) developers who specify and register sequential PEval, IncEval, and Assemble as a PIE program for a class Q of graph queries (the plug panel) and (b) end users who make use of PIE programs from API library, pick a graph G, enter queries Q ∈ Q, and "play" (the play panel). GRAPE parallelizes the PIE program, computes Q (G), and displays Q (G) in result and analytics consoles.
(2) At the core of the system is a parallel query engine. It manages sequential algorithms registered in the GRAPE API, makes parallel evaluation plans for PIE programs, and executes the plans for query answering (see Section 3.1). It also enforces consistency control and fault tolerance (see below).
(3) Underlying the query engine are (a) an MPI Controller (message passing interface) for communications between coordinator and workers, (b) an Index Manager for loading indices, (c) a Partition Manager to partition graphs, and (d) a Load Balancer to balance workload (see below).
(4) The storage layer manages graph data in the Distributed File System (DFS). It is accessible to the query engine, Index Manager, Partition Manager, and Load Balancer.
Message passing. The MPI Controller of GRAPE makes use of a standard MPI for parallel and distributed programs. It currently adopts MPICH [5], which is also the basis of other parallel graph systems such as GraphLab [49] and Blogel [71]. It generates messages and coordinates messages in synchronization steps using standard MPI primitives.
We remark that, at the conceptual level, to simplify the discussion, we adopt a coordinator to aggregate messages (Section 3). In practice, GRAPE implements point-to-point message passing instead: Workers exchange messages directly without going through a coordinator, accumulate messages received in a buffer, and the aggregation function is invoked at each worker. It is easy to verify that this implementation and the centralized aggregation with a coordinator produce the same results since the aggregation function is invoked after all messages are received.
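The equivalence between coordinator-based and point-to-point aggregation can be checked on a small example. The sketch below is illustrative only: `centralized` and `point_to_point` are hypothetical helpers, each worker's outbox is modeled as a dictionary of status-variable copies, and the aggregation function is assumed to be commutative and associative (as min and max are).

```python
from functools import reduce


def centralized(outboxes, aggregate):
    """Coordinator folds all copies of each variable, then answers every holder."""
    merged = {}
    for box in outboxes:
        for var, val in box.items():
            merged[var] = val if var not in merged else aggregate(merged[var], val)
    return [{var: merged[var] for var in box} for box in outboxes]


def point_to_point(outboxes, aggregate):
    """Workers exchange copies directly, buffer them, then aggregate locally."""
    buffers = [{var: [val] for var, val in box.items()} for box in outboxes]
    for i, box in enumerate(outboxes):
        for j, other in enumerate(outboxes):
            if i == j:
                continue
            for var, val in box.items():
                if var in other:          # worker j also holds a copy of var
                    buffers[j][var].append(val)
    # Aggregation runs only after all messages have been received.
    return [{var: reduce(aggregate, vals) for var, vals in buf.items()}
            for buf in buffers]


outboxes = [{"a": 3, "b": 7}, {"a": 1, "c": 4}, {"b": 5, "a": 2}]
via_coordinator = centralized(outboxes, min)
via_peers = point_to_point(outboxes, min)
```

Both routes give every worker the same resolved value for each variable it holds, because each worker's buffer ends up containing all copies of that variable before the aggregation function is applied.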
Graph partition. The Graph Partitioner supports a variety of built-in partition algorithms. Users may pick (a) METIS, a fast heuristic algorithm for sparse graphs [44]; (b) an edge-cut partition [12,18] and a vertex-cut partition [47]; (c) 1-D and 2-D partitions [17], which distribute vertices and the adjacency matrix to the workers, respectively, with an emphasis on maximizing the parallelism of graph traversal; and (d) a fast streaming-style partition strategy [62] that assigns edges to high-degree nodes to reduce cross edges. New data partition strategies can also be deployed at GRAPE.
Multithread. GRAPE supports multithreading. At each worker, there are multiple working threads, each acting as a virtual worker and handling one fragment. During computation, the threads at the same worker are maintained in a pool; a main thread at the worker takes the responsibility of assigning fragments and workload to idle threads in the pool. At the end of each round of the computation, the main thread generates messages and communicates with peer workers. The main thread is able to process messages in a buffer even when its working threads are still computing. This allows workers to overlap computation and communication and reduce response time.
Graph-level optimization. In contrast to prior graph systems, GRAPE supports data-partitioned parallelism by parallelizing the runs of sequential algorithms. Since fragments of a graph are graphs themselves, all optimization strategies developed for sequential (batch and incremental) algorithms can be readily used by GRAPE to improve the performance of PEval and IncEval over graph fragments. As examples, next we outline some of the graph-level optimization strategies.
(1) Indexing. Any indexing structure effective for sequential algorithms can be computed offline and directly used to optimize PEval, IncEval, and Assemble. GRAPE can support indices including (a) a 2-hop index [21] for reachability queries and (b) a neighborhood-index [45] for candidate filtering in graph pattern matching. Moreover, new indices can be incorporated into the GRAPE API library.
(2) Compression. Another strategy is query preserving compression [26] at the fragment level. Given a query class Q and a fragment F i , each worker P i computes a smaller F c i offline via a compression algorithm, such that for any query Q in Q, Q (F i ) can be computed from F c i without decompressing F c i , regardless of what sequential PEval and IncEval are used. As shown in Fan et al. [26], this compression scheme is effective for graph pattern matching and graph traversal, among other things.
(3) Dynamic grouping. GRAPE dynamically groups a set of border nodes by adding a "dummy" node and sends messages from the dummy nodes in batches, instead of one by one. This effectively reduces the amount of message passing in each synchronization step.
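The batching idea behind dynamic grouping can be sketched as follows. The text does not say how border nodes are grouped, so grouping by destination fragment is an assumption here, and `group_messages` and `owner` are hypothetical names.

```python
from collections import defaultdict

def group_messages(updates, owner):
    """Batch per-border-node updates by destination fragment.

    updates: dict mapping border node -> new value
    owner:   dict mapping border node -> id of the fragment to notify
    Returns one batched message per destination, playing the role of the
    payload sent from a "dummy" node (grouping rule is an assumption).
    """
    batches = defaultdict(dict)
    for node, value in updates.items():
        batches[owner[node]][node] = value
    return dict(batches)

updates = {"a": 3, "b": 7, "c": 1, "d": 4}
owner = {"a": 1, "b": 1, "c": 2, "d": 1}
batched = group_messages(updates, owner)
```

In this toy input, four per-node updates collapse into two batched messages, one per destination fragment.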
To the best of our knowledge, many of these optimization strategies are not supported by the state-of-the-art vertex-centric and block-centric graph query systems. For instance, indexing and query-preserving compression for sequential algorithms do not carry over to vertex-centric programs, and block-centric programming essentially treats blocks as vertices rather than graphs.
Fault tolerance. GRAPE employs an arbitrator mechanism to recover from both worker failures and coordinator failures (i.e., single-point failures). More specifically, it reserves a worker P_a as arbitrator and a worker S_c as a standby coordinator. The arbitrator keeps sending heartbeat signals to all workers and the coordinator. In case of failure, (a) if a worker fails to respond, the arbitrator transfers its computation tasks to another worker; and (b) if the coordinator fails, the arbitrator activates the standby coordinator S_c to continue parallel computation. It is also possible for GRAPE to adopt the optimistic recovery mechanism introduced in Schelter et al. [60] for the general fixpoint paradigm [15].
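The arbitrator's decision logic can be sketched as below. This is a simplified illustration, not GRAPE's recovery protocol: timeout-based failure detection and the choice of the least-loaded live worker as the target are assumptions (the text only says "another worker"), and all names are hypothetical.

```python
def check_heartbeats(last_seen, now, timeout, tasks, standby="S_c"):
    """Reassign work from silent workers and pick the active coordinator.

    last_seen: dict node -> time of its last heartbeat reply
               (includes the key "coordinator")
    tasks:     dict worker -> list of fragment ids assigned to it
    Returns (new_tasks, coordinator).
    """
    alive = {n for n, t in last_seen.items() if now - t <= timeout}
    live_workers = sorted(w for w in tasks if w in alive)
    new_tasks = {w: list(tasks[w]) for w in live_workers}
    for worker, frags in tasks.items():
        if worker not in alive and live_workers:
            # Assumption: hand the fragments to the least-loaded live worker.
            target = min(live_workers, key=lambda w: len(new_tasks[w]))
            new_tasks[target].extend(frags)
    coordinator = "coordinator" if "coordinator" in alive else standby
    return new_tasks, coordinator

last_seen = {"P1": 10.0, "P2": 4.0, "P3": 9.5, "coordinator": 3.0}
tasks = {"P1": [0], "P2": [1, 2], "P3": [3]}
new_tasks, coordinator = check_heartbeats(last_seen, now=10.0, timeout=2.0,
                                          tasks=tasks)
```

Here worker P2 and the coordinator have both gone silent, so P2's fragments move to a live worker and the standby coordinator takes over.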

Consistency. Multiple workers may update copies of the same status variable. To cope with this, (a) GRAPE allows users to specify a conflict resolution policy as function aggregateMsg in PEval (Section 3.2), for example, min for SSSP and CC (Section 5), based on a partial order on the domain of status variables (e.g., the linear order on integers). Based on the policy, inconsistencies are resolved in each synchronization step of the PEval and IncEval processes. Moreover, Theorem 4.1 guarantees consistency when the policy satisfies the monotonic condition. (b) GRAPE also supports default exception handlers when users opt not to specify aggregateMsg. In addition, GRAPE allows users to specify generic consistency control strategies and register them in the GRAPE API library.
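A conflict resolution policy such as min can be sketched as follows. The function below only borrows the name aggregateMsg from the text; its signature and data layout are assumptions made for illustration, not the GRAPE API.

```python
def aggregate_msg(current, incoming, resolve=min):
    """Resolve conflicting copies of status variables.

    current:  dict variable -> value held locally (updated in place)
    incoming: list of (variable, value) pairs received from peers
    resolve:  conflict resolution policy (min, as for SSSP and CC)
    Returns the set of variables whose value actually changed.
    Note: the default of +infinity for unseen variables suits min only.
    """
    changed = set()
    for var, value in incoming:
        old = current.get(var, float("inf"))
        new = resolve(old, value)
        if new != old:
            current[var] = new
            changed.add(var)
    return changed

dist = {"u": 5, "v": 9}
changed = aggregate_msg(dist, [("u", 7), ("v", 4), ("v", 6), ("w", 2)])
```

Because min only ever lowers a value, repeated application moves each variable in one direction along the order on integers, which is the kind of monotonic behavior the convergence condition relies on.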

EXPERIMENTAL STUDY
Using real-life and synthetic graphs, we next empirically evaluate GRAPE for its (1) efficiency and scalability, (2) communication costs, (3) effectiveness of incremental steps, and (4) compatibility with optimization techniques developed for sequential graph algorithms. We used real-life graphs larger than those that Fan et al. [31] experimented with. We evaluated the performance of GRAPE compared with Giraph (an open-source version of Pregel), GraphLab, and Blogel (the fastest block-centric system we are aware of). We compared GRAPE with the prior graph systems by parallelizing existing sequential algorithms with a preliminary implementation of GRAPE [9].
Experimental setting. We used five real-life graphs of different types: Friendster, UKWeb, traffic, DBpedia, and movieLens. To make use of unlabeled Friendster for Sim and SubIso, we assigned up to 100 random labels to nodes. We also randomly assigned weights to UKWeb, traffic, and Friendster for testing SSSP.
Synthetic graphs. To evaluate the scalability of GRAPE (Exp-1 and Exp-2), we also developed a generator to produce synthetic graphs G = (V, E, L) controlled by the numbers of nodes |V| (up to 250 million) and edges |E| (up to 2.5 billion), with labels L drawn from an alphabet of 100 labels.
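A generator with these controls might look like the sketch below. The text does not describe how edges or labels are distributed, so uniform random choice, directed simple edges, and the function name `generate_graph` are all assumptions.

```python
import random

def generate_graph(num_nodes, num_edges, num_labels=100, seed=0):
    """Random directed labeled graph G = (V, E, L).

    Returns (labels, edges): labels[v] is the label of node v, and edges
    is a set of (u, v) pairs without self-loops or duplicates.
    Assumption: nodes, edges, and labels are all drawn uniformly at random.
    """
    if num_edges > num_nodes * (num_nodes - 1):
        raise ValueError("too many edges for a simple directed graph")
    rng = random.Random(seed)
    labels = [rng.randrange(num_labels) for _ in range(num_nodes)]
    edges = set()
    while len(edges) < num_edges:
        u = rng.randrange(num_nodes)
        v = rng.randrange(num_nodes)
        if u != v:
            edges.add((u, v))
    return labels, edges

# Small instance with the same 1:10 node-to-edge ratio as the graphs above.
labels, edges = generate_graph(num_nodes=1000, num_edges=10000)
```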
Partitioning and Loading. We used XtraPuLP [61] as the default graph partition strategy. In theory, GRAPE works regardless of what partitioning strategy is used and guarantees to converge under the conditions given in Theorem 4.1. In practice, different strategies may yield partitions with various degrees of skewness and stragglers, which have an impact on the performance of GRAPE. Here, we picked XtraPuLP, which is widely used in practice. On Friendster, for example, XtraPuLP took about 16 minutes, and our computations took at most 62 seconds. However, graph partitioning is performed once offline. Afterward, various queries are answered online on the same partition. The partitioning costs for traffic, UKWeb, DBpedia, and movieLens are 6.0, 598.1, 32.2, 4.3 seconds, respectively.
In GRAPE, all workers load graph data from a distributed file system simultaneously. It takes four workers about 20 minutes to import Friendster for the first time (16s, 24m, 44s, 10s for traffic, UKWeb, DBpedia, movieLens, respectively). After the first loading, the graph is "serialized" to storage in a compact format, which reduces the time for reloading afterward, when necessary, to 40s (2s, 86s, 4s, 2s for traffic, UKWeb, DBpedia, movieLens, respectively).
It should be remarked that GRAPE is able to load a graph G once and process the query workload (i.e., a set of queries) posed on G without reloading G. In contrast, GraphLab, Giraph, and Blogel require the graph to be reloaded each time a single query is issued, and loading is costly over large graphs. In favor of these systems, we exclude the loading cost when reporting the experimental results.
Queries. We randomly generated the following queries for SSSP, Sim, and SubIso. (a) We sampled 10 source nodes in each graph used and constructed an SSSP query for each node. (b) We generated 20 pattern queries for Sim and SubIso, controlled by |Q| = (|V_Q|, |E_Q|) (the number of nodes and edges, respectively) using labels drawn from the graphs experimented with.
Algorithms. We implemented the PIE programs (PEval, IncEval, and Assemble) for the query classes given in Sections 3 and 5, namely, SSSP, Sim, SubIso, CC, and CF, which are registered in the API library of GRAPE. We adopted basic sequential algorithms and only used optimized Sim to demonstrate how GRAPE inherits optimization strategies developed for sequential algorithms (Exp-3).
We also implemented algorithms for these query classes for Giraph, GraphLab, and Blogel. We used the "default" code provided by the systems when available and made our best efforts to develop "optimal" algorithms otherwise. We also used the "default" graph partition algorithms provided by these systems (i.e., hash partitioning for GraphLab and Giraph and Voronoi partitioning for Blogel). We implemented synchronized algorithms for both GraphLab and Giraph for ease of comparison. As observed by other works [40,41,68], neither the asynchronous model nor the synchronous model consistently outperforms the other across different algorithms, input graphs, and cluster scales. We expect the observed relative performance trends to hold on other similar graph systems.
We deployed the systems on a cluster of up to 12 machines, each with 16 processors (Intel Xeon 2.2GHz) and 128G memory (thus in total 192 processors). This is the best configuration we could afford. Each experiment was repeated 5 times, and the average is reported here.
Experimental results. We next report our findings.
Exp-1: Efficiency. We first evaluated the efficiency and scalability of GRAPE by varying the number n of processors used, from 64 to 192. For each algorithm, we chose datasets based on its applications in the real world to demonstrate meaningful computations. For SSSP and CC, we experimented with real-life graphs UKWeb, traffic, and Friendster. For Sim and SubIso, we used Friendster and DBpedia. We used movieLens for CF, reflecting its application in movie recommendation.
(1) SSSP. Figure 15(a-c) reports the performance of the systems for SSSP over traffic, UKWeb, and Friendster, respectively. We report the average over 10 SSSP queries on each graph. The results on other graphs are consistent (not shown). From the results, we can see the following.
(a) GRAPE outperforms Giraph, GraphLab, and Blogel by 14,842, 3,992, and 756 times, respectively, over traffic with 192 processors (Figure 15(a)). In the same setting, it is 556, 102, and 36 times faster over UKWeb (Figure 15(b)), and 18, 1.7, and 4.6 times faster over Friendster (Figure 15(c)). These results demonstrate that by simply parallelizing sequential algorithms without further optimization, GRAPE already outperforms the state-of-the-art systems in response time for SSSP.
The improvement of GRAPE over all the systems on traffic is much larger than on Friendster and UKWeb since the traffic graph has a larger diameter. In addition, (i) for Giraph and GraphLab, this is because synchronous vertex-centric algorithms take more supersteps to converge on graphs with large diameters, such as traffic. Using 192 processors, Giraph takes 10,749 supersteps over traffic and 161 over UKWeb; similarly for GraphLab. In contrast, GRAPE is not vertex-centric, and it takes 31 supersteps on traffic and 24 on UKWeb. (ii) Blogel also takes more supersteps over traffic (1,690) than over UKWeb (42 supersteps) and Friendster (23 supersteps). It generates more blocks over traffic (with larger diameter) than over UKWeb and Friendster. Since Blogel treats blocks as vertices, the benefit of parallelism is degraded with more blocks.
(b) In all cases, GRAPE takes less time when n increases. On average, it is 1.4, 2.3, and 1.5 times faster for n from 64 to 192 over traffic, UKWeb, and Friendster, respectively. (i) Compared with the results in Fan et al. [31], which used fewer processors, this improvement degrades a bit. This is mainly because the larger number of fragments leads to more communication overhead. On the other hand, such impact is significantly mitigated by IncEval, which only ships changed update parameters. (ii) In contrast, Blogel does not demonstrate such consistency in scalability. It takes more time on traffic when n is larger. When n varies from 160 to 192, it also takes longer over Friendster. Its communication cost dominates the parallel cost as n grows, "canceling out" the benefit of parallelism. (iii) GRAPE has scalability comparable to GraphLab over Friendster and scales better over UKWeb and traffic. Giraph shows greater improvement with larger n, but with constantly higher cost (see (a)) than GRAPE.
(c) GRAPE significantly reduces supersteps. It takes on average 22 supersteps, while Giraph, GraphLab, and Blogel take 3,647, 3,647, and 585 supersteps, respectively. This is because GRAPE runs sequential algorithms over fragmented graphs with cross-fragment communication only when necessary, and moreover, IncEval ships only changes to status variables. In contrast, Giraph, GraphLab, and Blogel pass vertex-vertex (vertex-block) messages.
(d) SSSP under Blogel runs in VB-model. In each superstep, it first runs V-compute over all vertices to identify "active" vertices (i.e., vertices whose distance value is updated); it then runs B-compute on active vertices within blocks. Compared with pure vertex-centric models, running a sequential algorithm within blocks reduces communication cost. However, its V-compute incurs redundant computations since it runs over all vertices in each superstep. In contrast, GRAPE runs sequential algorithms within partitions and leverages incremental computation to reduce redundant computation and communication cost. In each round, IncEval only runs on affected vertices.
(2) CC. Figure 15(d, e) reports the performance for CC detection and tells us the following. (a) Both GRAPE and Blogel substantially outperform Giraph and GraphLab. For instance, when n = 192, GRAPE is on average 12,094 and 1,329 times faster than Giraph and GraphLab, respectively. (b) Blogel is faster than GRAPE in some cases (e.g., 3.5 seconds vs. 17.9 seconds over Friendster when n = 192). This is because Blogel embeds the computation of CC in its graph partition phase as precomputation, while this graph partition cost (on average 357 seconds using its built-in Voronoi partition) is not included in its response time. In other words, without precomputation, the performance of GRAPE is already comparable to the near "optimal" case reported by Blogel.
CC in GRAPE also works better than the one in Giraph++ [63]. This is because, after exchanging messages between blocks, Giraph++ invokes computation on all internal vertices and a large part of the computation is redundant. In contrast, IncEval of GRAPE processes only those affected vertices by capitalizing on auxiliary indices that were inherited from sequential algorithms.
(5) CF. For collaborative filtering, we used real-life movieLens [3] with a training set |E_T| = 90%|E|. We compared GRAPE with the built-in SGD-based CF in GraphLab and with CF implemented for Giraph and Blogel. It should be remarked that CF favors "vertex-centric" programming since each user or product node only needs to exchange data with its neighbors, as indicated by the fact that GraphLab and Giraph outperform Blogel. Nonetheless, as shown in Figure 15(k), GRAPE is on average 4.1, 2.6, and 12.4 times faster than Giraph, GraphLab, and Blogel, respectively, when the number n of processors varies from 64 to 192. Moreover, GRAPE scales well with n.
(6) Scalability of GRAPE. As observed in McSherry et al. [51], the speed-up of a system may degrade over more processors. We thus evaluated the scalability of GRAPE, which measures the ability to keep the same performance when both the size of graph G (denoted as (|V|, |E|)) and the number n of processors increase proportionally. We varied n from 64 to 192, and for each n, deployed GRAPE over a synthetic graph. The graph size varies from (50M, 500M) (i.e., 50 million nodes and 500 million edges; denoted as G_1) to (250M, 2.5B) (denoted as G_5), with a fixed ratio between edge number and node number and proportional to n. The scalability at, for example, (128, G_3) is the ratio of the time using 64 processors over G_1 to its counterpart using 128 processors over G_3. As shown in Figure 15(l), GRAPE preserves a reasonable scalability (close to linear scalability, the optimal scalability).
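The scalability ratio defined above can be computed as in the following sketch. The timings are hypothetical placeholders chosen for illustration; they are not measurements from the experiments.

```python
def scalability(baseline_time, time):
    """Ratio of the baseline time (64 processors over G_1) to the time at a
    larger configuration; 1.0 corresponds to linear (optimal) scalability."""
    return baseline_time / time

# Hypothetical timings in seconds, keyed by (processors, graph).
timings = {(64, "G_1"): 20.0, (128, "G_3"): 25.0, (192, "G_5"): 32.0}
baseline = timings[(64, "G_1")]
ratios = {cfg: scalability(baseline, t) for cfg, t in timings.items()}
```

With these made-up numbers the ratio drops from 1.0 at the baseline to 0.8 and 0.625, meaning the larger configurations take somewhat longer than the baseline even though the work per processor is held fixed.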
We further evaluated the COST [51] of GRAPE, which denotes the hardware configuration (the number of cores) required by GRAPE to outperform a competent single-threaded implementation. The results are reported in Table 3. For CC, we adopted the original implementation of McSherry et al. [51] as the single-threaded version. For SSSP, Sim, SubIso, and CF, no implementations are given in McSherry et al. [51], and we adopted the best single-threaded implementations to our knowledge for comparison with GRAPE. Following McSherry et al. [51], for large input graphs that are unable to fit into RAM, we used external I/O on SSD as an extension to RAM (marked with * in Table 3). From Table 3 we can see the following.
(1) For SSSP, CC, Sim, and CF, GRAPE achieves speedup over single-threaded implementations with just 2 or 4 cores over all tested input graphs.
(2) For SubIso, even with its relatively heavy prefetching cost (Section 5.1), GRAPE still outperforms the single-threaded implementations with 24 or 32 cores (just 2 physical machines).
(3) According to McSherry et al. [51], GraphLab had a COST of 512 cores and Spark GraphX had unbounded COST (no configuration can outperform single-threaded). Therefore, GRAPE demonstrates better scalability than GraphLab and GraphX with smaller extra overhead for parallel graph computations.
It should be remarked that parallelization overheads are inevitable for all distributed/parallel systems, including but not limited to GRAPE. Nonetheless, parallel processing often works better when dealing with large-scale graphs that are beyond the capacity of a single machine.
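The COST metric discussed above can be computed from timing data as in this sketch. The timings are hypothetical and serve only to show the two possible outcomes (a finite COST and an unbounded one); the function name `cost` is also an assumption.

```python
def cost(parallel_times, single_thread_time):
    """Smallest number of cores at which the parallel system beats the
    single-threaded baseline, or None if no tested configuration does
    (an "unbounded" COST).

    parallel_times: dict cores -> running time of the parallel system
    """
    winning = [cores for cores, t in parallel_times.items()
               if t < single_thread_time]
    return min(winning) if winning else None

# Hypothetical timings in seconds.
bounded = cost({1: 120.0, 2: 70.0, 4: 40.0, 8: 25.0}, single_thread_time=50.0)
unbounded = cost({1: 120.0, 2: 90.0}, single_thread_time=50.0)
```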
Exp-2: Communication cost. The communication cost (in bytes) reported by Giraph, GraphLab, and Blogel depends on their own implementation of message blocks and protocols. As observed in Han et al. [36], these built-in message or byte counters differ from each other: Blogel counts cross-process bytes, GraphLab reports cross-machine bytes, and Giraph tracks cross-partition bytes. For a fair comparison, we adopted a third-party tool, Nethogs [10], following the practice of Han et al. [36]. It tracks the total bytes sent by each machine during the run by monitoring the system file /proc/net/dev. This metric, better aligned to the parallel models of the systems, reveals consistent results with better insights.
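To make the measurement concrete, the sketch below sums the transmitted-bytes counters from text in the Linux /proc/net/dev format. It does not reproduce Nethogs; the function names, the sample snapshot, and the choice to skip the loopback interface are assumptions.

```python
def total_tx_bytes(proc_net_dev_text, skip=("lo",)):
    """Sum transmitted bytes over all interfaces except those in `skip`."""
    total = 0
    for line in proc_net_dev_text.splitlines()[2:]:  # first two lines are headers
        if ":" not in line:
            continue
        iface, stats = line.split(":", 1)
        if iface.strip() in skip:
            continue
        fields = stats.split()
        total += int(fields[8])  # 8 receive counters precede transmit bytes
    return total

def bytes_sent_during_run(before_text, after_text):
    """Bytes sent by a machine between two snapshots of the counters."""
    return total_tx_bytes(after_text) - total_tx_bytes(before_text)

# Made-up snapshot in the /proc/net/dev layout.
SAMPLE = """Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:    1000      10    0    0    0     0          0         0     1000      10    0    0    0     0       0          0
  eth0:  500000     400    0    0    0     0          0         0   250000     300    0    0    0     0       0          0
"""
sent = total_tx_bytes(SAMPLE)
```

On a real machine the two snapshots would be read from /proc/net/dev before and after the run; counting at the machine level is what makes the numbers comparable across systems with different internal counters.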
In the same setting as Exp-1, Figure 16 reports the communication costs of the systems. The results show that, in all cases, GRAPE incurs much less communication cost than Giraph and GraphLab. On datasets excluding traffic, with 192 processors, it ships on average 0.08%, 1.1%, 0.3%, 0.18%, and 8.4% of the data shipped for SSSP, Sim, CC, SubIso, and CF by Giraph, and 0.11%, 0.14%, 0.3%, 0.19%, and 44% by GraphLab, respectively; moreover, it reduces their cost by 6 and 5 orders of magnitude for SSSP and CC on traffic, respectively. While it ships more data than Blogel for CC due to the precomputation of Blogel remarked earlier, it only ships 6.2%, 0.1%, 1.9%, and 4.8% of the data shipped by Blogel for Sim, SubIso, SSSP, and CF, respectively. On traffic, GRAPE also reduces the communication cost of Blogel by 4 and 3 orders of magnitude for SSSP and CC, respectively.
(1) SSSP. Figure 16(a-c) shows that both GRAPE and Blogel incur communication costs that are orders of magnitude less than those of GraphLab and Giraph. This is because vertex-centric programming incurs a large amount of messages. Both block-centric programs (Blogel) and PIE programs (GRAPE) reduce unnecessary messages and trigger inter-block communication only when necessary. We also observe that GRAPE ships 0.9% and 10% of the data shipped by Blogel over UKWeb and Friendster, respectively. Indeed, GRAPE ships only changed values of update parameters, and needs fewer supersteps. These significantly reduce the size and number of messages.
(2) CC. Figure 16(d-f) demonstrates similar improvement of GRAPE over GraphLab and Giraph for CC. It ships on average 0.2% and 0.3% of the data shipped by Giraph and GraphLab on datasets excluding traffic, and 0.0015% and 0.0003% on traffic, respectively. Since Blogel precomputes CC (see Exp-1(2)), it ships little data. Nonetheless, GRAPE is not far worse than the near "optimal" case of Blogel on Friendster and UKWeb, and it ships only 0.05% of the data shipped by Blogel on traffic.
(3) Sim. Figure 16(g, h) reports the communication cost for graph simulation over Friendster and DBpedia, respectively. One can see that GRAPE ships substantially less data (e.g., on average 0.9%, 0.1%, 4.9% of the data shipped by Giraph, GraphLab, and Blogel, respectively). Observe that the communication cost of Blogel is much higher than that of GRAPE, even though it adopts interblock communication. This shows that the extension of vertex-centric to block-centric by Blogel may not suffice to reduce messages when it comes to complex queries. GRAPE works better than these systems by employing incremental IncEval to reduce redundant messages and computation.
(4) SubIso. Figure 16(i, j) reports the results for SubIso over Friendster and DBpedia, respectively. The results are consistent with their counterparts for Sim. On average, GRAPE ships 0.18%, 0.24%, and 0.11% of the data shipped by Giraph, GraphLab, and Blogel, respectively.
(6) Synthetic. In the same setting as Figure 15(l), Figure 16(l) reports the communication cost for SSSP using synthetic graphs. The results demonstrate that more communication cost is incurred over larger graphs and more processors, due to increased border nodes, as expected.
Exp-3: Incremental computation. We evaluated the effectiveness of incremental IncEval. We implemented a batch version of GRAPE for Sim queries, denoted as GRAPE_NI, which uses PEval to perform iterative computations and handle the messages instead of IncEval. It mimics the case when no incremental computation is used. As shown in Figure 17(a), over Friendster, (1) GRAPE outperforms GRAPE_NI by 9.0 times with 192 processors and (2) the gap is larger when fewer processors are employed (e.g., 11.0 times when 64 processors are used). This is because the fewer processors used, the larger the fragments residing at each processor, and, as a consequence, the heavier the computation costs incurred at each superstep. This verifies that incremental steps effectively reduce redundant local computations in iterative graph computations. The results on DBpedia are consistent (not shown).
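The contrast between batch and incremental handling of a message can be illustrated on a toy fragment. The sketch uses SSSP rather than Sim (the query class actually evaluated in Exp-3) because it is shorter; the fragment, the helper `relax_from`, and the way the batch step is emulated are assumptions, not the GRAPE_NI implementation.

```python
import heapq

INF = float("inf")

def relax_from(graph, dist, seeds):
    """Dijkstra-style relaxation from `seeds`; updates dist in place and
    returns the number of nodes settled (a proxy for local work)."""
    heap = [(dist[s], s) for s in seeds]
    heapq.heapify(heap)
    settled = 0
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        settled += 1
        for v, w in graph[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], v))
    return settled

# Toy fragment: node "e" is a border node updated by another fragment.
graph = {"a": [("b", 1)], "b": [("c", 1)], "c": [("d", 1)],
         "d": [], "e": [("d", 1)]}
dist = {v: INF for v in graph}
dist["a"] = 0
relax_from(graph, dist, ["a"])          # partial evaluation from source "a"

message = {"e": 1}                      # incoming update for border node e

batch = {v: min(d, message.get(v, INF)) for v, d in dist.items()}
batch_work = relax_from(graph, batch, [v for v in batch if batch[v] < INF])

inc = {v: min(d, message.get(v, INF)) for v, d in dist.items()}
inc_work = relax_from(graph, inc, list(message))  # only the changed nodes
```

Both variants reach the same distances, but the batch variant revisits every reachable node while the incremental one touches only the nodes affected by the message.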

Exp-4: Compatibility. We also evaluated the compatibility of optimization strategies developed for sequential graph algorithms with GRAPE parallelization. For a query class Q, a sequential algorithm A, and its optimized version A* for Q, denote the speedup of the optimization as T(A)/T(A*), where T(A) and T(A*) are the running times of A and A*, respectively. Denote the running time of the GRAPE parallelization of A (respectively, A*) as T_p(A) (respectively, T_p(A*)) for a given number n of processors. Ideally, T(A)/T(A*) should be close to T_p(A)/T_p(A*); that is, GRAPE preserves the speedup from the optimization, and the impact of the optimization is not "dampened out" by parallelization overhead such as synchronization and message passing.
We make a case for graph simulation. We evaluated two sequential algorithms: the algorithm of Henzinger et al. [39] and an optimized version that employs indices to reduce candidates [25]. Using Sim queries over Friendster, we found that the average speedup of the sequential algorithms is 1.24. Varying n from 64 to 192, we report the speedup of the parallelized algorithms of GRAPE in Figure 17(b). The results on DBpedia are consistent (not shown). The results suggest that the speedup is close to its sequential counterpart. Such optimization cannot be easily encoded in vertex programs of Giraph and GraphLab or in the V-mode and B-mode programs of Blogel.
Summary. From the experimental results, we find the following.
(1) By simply parallelizing sequential algorithms, the performance of GRAPE is already comparable to state-of-the-art systems. Using from 64 to 192 processors over real-life graphs excluding traffic, GRAPE is on average 484, 36, and 15 times faster than Giraph, GraphLab, and Blogel for SSSP; 151, 6.8, and 16 times for Sim; 149.3, 34.2, and 9.6 times for SubIso; and 4.6, 2.6, and 12.4 for CF, respectively. For CC, it is 1,377 and 212 times faster than Giraph and GraphLab, respectively, and is comparable to the "optimal" case of Blogel although Blogel embeds the computation of CC in its graph partition phase. On traffic, for SSSP and CC, GRAPE is on average 4, 3 and 2 orders of magnitude faster than Giraph, GraphLab, and Blogel, respectively. The results on synthetic graphs are consistent.
(3) GRAPE demonstrates good scalability when using more processors since its incremental computation mitigates the impact of more border nodes and fragments.
(4) Incremental steps effectively reduce iterative recomputation. For Sim, they improve the response time by 9.6 times on average.
(5) GRAPE inherits the benefit of optimized sequential algorithms. For Sim, it is on average 20% faster by using the algorithm of Fan et al. [25] instead of the algorithm of Henzinger et al. [39].

CONCLUSION
We have proposed an approach to parallelizing sequential graph algorithms. Given a class Q of graph queries, users can devise existing sequential algorithms for Q with minor changes, without recasting the entire algorithms into a new model. GRAPE parallelizes the computation and guarantees to converge at correct answers under a monotonic condition, as long as the sequential algorithms are correct. Moreover, graph algorithms that are developed for existing parallel graph systems can be migrated to GRAPE without incurring extra complexity. We have verified that GRAPE achieves comparable performance to the state-of-the-art graph systems for various query classes and that (bounded) IncEval effectively reduces unnecessary recomputation and hence the cost of iterative graph computations. We hope that GRAPE will make parallel graph computations accessible to a large group of users who are more familiar with sequential algorithms.
A preliminary implementation of GRAPE is available at the GRAPE website [9]. We are in the process of implementing asynchronous message passing, based on Fan et al. [27]. We are also implementing a lightweight transaction controller, to support not only queries but also updates such as insertions and deletions of nodes and edges. When the update load is light, GRAPE adopts nondestructive updates that have proved useful in functional databases [64]. Otherwise, it switches to multiversion concurrency control [14] that keeps track of timestamps and versions, as adopted by existing distributed systems.
One topic for future work is to revise the asynchronous model of Fan et al. [27] to maximize the benefit of pipelined parallelism and data-partitioned parallelism. Another topic is to develop methods for incrementalizing graph algorithms with performance guarantees, extending other work [11,16,24].