GraphSet: High Performance Graph Mining through Equivalent Set Transformations

Graph mining is of critical use in a number of fields such as social networks, knowledge graphs, and fraud detection. As an NP-complete problem, improving computation performance is the main target of current optimizations. Due to their excellent performance, state-of-the-art graph mining systems mainly rely on pattern-aware algorithms. Despite previous efforts, the complex control flows introduced by pattern-aware algorithms bring significant overhead and also impede further acceleration on heterogeneous hardware. To address these challenges, we propose a set-based equivalent transformation approach to optimize pattern-aware graph mining applications, which leverages classic set properties to eliminate most control flows and reduce computation overhead exponentially. We further implement a high-performance pattern-aware graph mining system supporting both CPU and GPU, namely GraphSet, to automatically apply these transformations. Evaluation results show that GraphSet outperforms state-of-the-art cross-platform and hardware-specific graph mining frameworks by up to 3384.1× and 243.2× (18.0× and 10.2× on average), respectively.


INTRODUCTION
Graph data are widely used in many important fields including social networks [20], knowledge graphs [33,36,46,47], fraud detection [15], and so on. Graph processing algorithms can be categorized into two main types: graph computation and graph mining. While the former has received significant attention in recent decades, the latter, which is an NP-complete problem, still faces severe challenges.
Graph mining applications, such as Pattern Matching (PM) [21], Motif Counting (MC) [34], and Frequent Subgraph Mining (FSM) [6], aim to discover structural information in graphs and have been widely applied in various domains including community detection [4,26], protein function prediction [14,35], and image segmentation [52], to name a few. Figure 1 illustrates an example of graph mining, which aims to identify all subgraphs in a data graph that match a given pattern (rectangle). Typical graph mining algorithms share a common scheme: they generally expand from a vertex or an edge in a graph and explore much larger subgraphs step by step until finding all subgraphs that meet certain requirements. Therefore, several general graph mining frameworks have been developed [10-12, 18, 23, 29, 31, 32, 40, 43, 45].
As graph data size scales up and patterns grow complex, there is an increasing need for highly efficient graph mining frameworks. There are two typical approaches to implementing a graph mining algorithm: pattern-oblivious [43] and pattern-aware [32]. Due to significant performance advantages, state-of-the-art graph mining systems [10,12,29,31,32,40] mainly use pattern-aware algorithms. By analyzing input patterns and their structural information, pattern-aware algorithms construct a schedule (a.k.a. search order) and restrictions (a.k.a. symmetry order) [29,32,40] to achieve efficient mining. Pattern-aware algorithms have twofold advantages. From a computational perspective, they leverage the connectivity information of subgraphs and avoid a large number of subgraph isomorphism checks, an NP-complete problem in itself. From a memory perspective, they are generally implemented using a DFS (Depth First Search) scheme, thus avoiding memory overflow problems caused by storing an excessive amount of intermediate data.

Despite previous efforts, there are still two main limitations in current pattern-aware graph mining systems. First, complex control flows in pattern-aware algorithms consume too much time. Figure 2 shows an example of a pattern-aware algorithm that counts the number of occurrences of a given pattern (Cycle-6-Tri). An efficient pattern-aware implementation often involves numerous control flow statements, such as for loops (Lines 14 ∼ 16) and if statements (e.g., Lines 4 and 8), which are unfriendly to modern processors and accelerators that prefer regular computations. Such control flows make it challenging to fully utilize the underlying computing capability, especially for heterogeneous hardware like GPUs, which have powerful computing capability but are not good at processing control flows. Second, the time complexity of graph mining algorithms is very high. Most graph mining problems are NP-complete. As the number of vertices in searched patterns increases, the computation overhead of graph mining algorithms also increases exponentially. Even if we can fully utilize the computing capability of the underlying hardware, current graph mining frameworks still fail to handle complicated patterns due to the large computation complexity.
In this work, we observe that most control flows in pattern-aware graph mining algorithms can be transformed through equivalent set operations, leading to a significant reduction in computation complexity. For example, in Figure 2, the final result of cnt at Line 17 is not related to the specific values of the loop variables, but only to the number of times the nested loops execute. Thus we can calculate the iteration counts of the nested loops with set operations, such as the inclusion-exclusion principle [16], to avoid the actual execution of control flows.
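As a minimal illustration of this observation (a toy sketch, not GraphSet code; the sets below are made up), consider two nested loops whose innermost workload only increments a counter. Since the loop body never inspects the loop variables, the final count can be computed directly from set cardinalities:

```python
# Toy illustration: when the innermost workload only increments a counter,
# nested loops with a distinctness check can be replaced by a closed-form
# set computation.
loopD = {1, 2, 3, 4}
loopE = {3, 4, 5}

# Control-flow version: actually iterate and check vD != vE.
cnt = 0
for vD in loopD:
    for vE in loopE:
        if vD != vE:
            cnt += 1

# Set version: total pairs minus pairs where vD == vE (inclusion-exclusion).
cnt_set = len(loopD) * len(loopE) - len(loopD & loopE)

assert cnt == cnt_set  # both count 10 pairs
```

The same idea generalizes to deeper loop nests, where the pairwise-distinctness constraint is handled by the full inclusion-exclusion principle.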
However, building a pattern-aware graph mining system based on this observation still faces several challenges. First, computations in various graph mining applications have complex and diverse dependencies. The execution results of operations in the innermost loops of some graph mining applications may depend on the values of loop variables, such as in Frequent Subgraph Mining [6]. For these applications, we cannot simply calculate the number of loop iterations instead of performing the nested loops. Second, a large number of set operations can be introduced when transforming control flows. Although these transformations reduce the overhead of control flows, the set operations themselves also cause significant overhead. Therefore, highly efficient set operations are required. Third, it is complicated for ordinary users to manually analyze dependencies and perform transformations for different graph mining applications. An automatic transformation framework is necessary.
To address these challenges, we propose a novel set-based transformation approach for pattern-aware graph mining applications, which generates a transformation-friendly schedule and applies dependence-aware transformations to eliminate most control flows and reduce computation overhead exponentially. Furthermore, we perform set-based computation reduction to reduce the overhead of set operations according to the structural information of patterns and the properties of set operations. Finally, we implement a high-performance pattern-aware graph mining system, namely GraphSet, by integrating the above techniques. GraphSet provides flexible interfaces and automatically applies set-based transformations to various graph mining algorithms. To fully utilize the underlying computing capability of CPUs and GPUs, GraphSet also performs in-depth architecture-aware optimizations and generates efficient code for specific hardware.
We evaluate the performance of GraphSet with 4 commonly used graph mining applications across 7 real-world graphs on Intel CPUs and NVIDIA V100 GPUs. Compared with existing systems, GraphSet outperforms the state-of-the-art cross-platform framework Pangolin [11] and the GPU system G2Miner [12] by up to 3384.1× and 243.2× (18.0× and 10.2× on average), respectively. GraphSet also achieves nearly linear speedup in most tasks with up to 64 GPUs.
We summarize the main contributions of our work as follows:
• We propose set-based transformation techniques that transform control flows into set operations and reduce computation overhead exponentially for common graph mining algorithms (§4).

BACKGROUND

Graph mining aims to discover all distinct subgraphs (called embeddings) that are isomorphic to an input pattern in a large data graph. For better understanding, a pattern and an embedding can be regarded as a template and an instance, respectively. Figure 1 is a graph mining example: the rectangle pattern is isomorphic to the embeddings, which are subgraphs of the data graph.
Pattern-Aware Algorithms. Recent graph mining systems [10,29,31,32,40] use pattern-aware algorithms due to their computation and memory efficiency. Figure 2 gives a concrete example of embedding counting using a pattern-aware graph mining algorithm. In pattern-aware algorithms, the pattern is searched according to a specific order of its vertices, which is called a schedule. In Figure 2, an order of (vA, vB, vC, vD, vE, vF) is used to find all embeddings of a Cycle-6-Tri pattern in the data graph. Since vA is the first vertex to be searched in the schedule, it can be any vertex in the data graph (i.e., Line 2 in Figure 2). Then vertex vB is searched after vA in the schedule. Observing the connectivity information that there is an edge between vA and vB in the pattern, vB is required to be connected with vA in the data graph. Therefore, vB is searched in the neighborhood of vA (i.e., Line 3 in Figure 2). Similarly, since vC is connected with both vA and vB in the pattern, vC is searched in the intersection of the neighborhoods of vA and vB (i.e., Lines 6 and 7 in Figure 2). A pattern with n vertices has n! different schedules, and their performance differs greatly [40]. To achieve better performance, schedule selection methods based on experience [29], sampling [28], and modeling have been widely studied [31,32,40].

Inclusion-Exclusion Principle (IEP). IEP [2,16] is a counting technique used to calculate the size of a set that is the union of several sets. It is a fundamental concept in combinatorics.
Equation (1) shows the formula of IEP, where A_i represents the i-th set:

|A_1 ∪ A_2 ∪ ... ∪ A_n| = Σ_i |A_i| − Σ_{i<j} |A_i ∩ A_j| + Σ_{i<j<k} |A_i ∩ A_j ∩ A_k| − ... + (−1)^{n+1} |A_1 ∩ A_2 ∩ ... ∩ A_n|    (1)

We can use IEP to calculate the execution times of some nested loops. For example, for the nested loops shown in Lines 8 to 12 in Figure 4 ①, where the condition is that the loop variables vD, vE, and vF are pairwise different, the execution count of the innermost workload at Line 12 equals the number of pairwise-distinct triples (vD, vE, vF) drawn from loopD, loopE, and loopF, which can be calculated by IEP.
IEP optimization is also applicable to labeled graphs. For example, if the labels of pattern vertices vD and vE are different, then the intersection of loopD and loopE must be an empty set. Preprocessing such cases can further reduce the computation overhead of IEP.
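The triple-counting use of IEP can be sketched as follows (a minimal illustration with made-up sets, not GraphSet code): applying inclusion-exclusion to the three "bad" events vD = vE, vD = vF, and vE = vF yields a closed-form count that can be verified against actually executing the loops.

```python
from itertools import product

def distinct_triples_iep(S1, S2, S3):
    """Count triples (v1, v2, v3) with vi in Si and all values pairwise
    distinct, via inclusion-exclusion over the bad events v1 == v2,
    v1 == v3, and v2 == v3."""
    T = len(S1 & S2 & S3)
    # Each pairwise intersection of bad events (and their triple
    # intersection) forces all three values equal: -3T + T = -2T.
    bad = (len(S1 & S2) * len(S3)
           + len(S1 & S3) * len(S2)
           + len(S2 & S3) * len(S1)
           - 2 * T)
    return len(S1) * len(S2) * len(S3) - bad

# Brute-force check against actually executing the nested loops.
S1, S2, S3 = {1, 2}, {2, 3}, {3, 4}
brute = sum(1 for v1, v2, v3 in product(S1, S2, S3)
            if v1 != v2 and v1 != v3 and v2 != v3)
assert distinct_triples_iep(S1, S2, S3) == brute == 4
```

With all three sets equal to a set of size n, the formula collapses to n(n − 1)(n − 2), the number of ordered distinct triples, as expected.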

OVERVIEW
Figure 3 shows the workflow of GraphSet. GraphSet takes a user-defined graph mining application as input. The pattern-aware schedule generator first generates an efficient and transformation-friendly schedule according to the user-specified pattern (§4.1). To transform control flows into set operations, GraphSet then performs dependence-aware rescheduling and computation transformation (§4.2). After that, GraphSet performs set-based computation reduction to further reduce the number of IEP calculations and set operations (§4.3). Finally, GraphSet applies architecture-aware optimizations to the parallel strategy and set operation implementations, and generates high-performance code for both CPU and GPU from the computation-reduced schedule (§6).

SET-BASED TRANSFORMATIONS
In this section, we perform a series of set-based equivalent transformations to eliminate control flows and reduce computation overhead for pattern-aware graph mining algorithms. Figure 4 shows how a pattern-aware algorithm is transformed step by step. GraphSet first generates a transformation-friendly schedule according to the input pattern (Figure 4(a)).

Generating Transformation-Friendly Schedule
For pattern-aware graph mining, it is critical to specify a schedule for a given pattern. A pattern with n vertices has n! possible schedules, and the performance gaps between different schedules span a wide range for graph mining algorithms. Therefore, GraphSet first generates an efficient and transformation-friendly schedule with the most layers of perfectly nested loops for a given pattern.
To facilitate further optimization, we convert innermost nested loops into perfectly nested loops, where all computations are contained in the innermost loop. Figure 4 illustrates a set-based transformation of a pattern-aware algorithm step by step. Figure 4(a) and Listing 1 show two different schedules and their corresponding pseudocodes. For the second schedule (i.e., Listing 1), since loopE and loopF depend on the value of vC, only the innermost two loops can be converted into perfectly nested loops. In contrast, the schedule in Figure 4(a) has three layers of perfectly nested loops.
Since our optimization target is to eliminate control flows in perfectly nested loops, the more loop layers there are, the better the optimization effect. Therefore, GraphSet generates a schedule with the most layers of perfectly nested loops. Suppose that in a given pattern, there are at most k vertices such that no two of them are directly connected. If the last k vertices of a schedule are not directly connected in the input pattern, then this schedule is one with the most layers of perfectly nested loops. We can explain this rule from the perspective of intersection: the last k searched vertices are not directly connected in the pattern, which means that there is no intersection operation in the innermost k loops. When the numbers of layers of perfectly nested loops in multiple schedules are identical, we adopt a performance model [40] to select the optimal one.
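Since patterns are tiny (typically well under ten vertices), finding the largest set of pairwise non-adjacent pattern vertices to place last in the schedule can be done by exhaustive search. A sketch under that assumption (the helper name and brute-force strategy are ours, not GraphSet's actual schedule generator):

```python
from itertools import combinations

def max_non_adjacent(vertices, edges):
    """Brute-force search for the largest set of pattern vertices with no
    edge between any two of them; placing such vertices last in the
    schedule maximizes the layers of perfectly nested loops."""
    edge_set = {frozenset(e) for e in edges}
    for size in range(len(list(vertices)), 0, -1):
        for subset in combinations(vertices, size):
            if all(frozenset(p) not in edge_set
                   for p in combinations(subset, 2)):
                return set(subset)
    return set()

# A 6-cycle pattern: alternating vertices are pairwise non-adjacent,
# so three loop layers can be made perfectly nested.
cycle6 = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
assert len(max_non_adjacent(range(6), cycle6)) == 3
```

Exponential enumeration is acceptable here because it runs once per pattern, not per data-graph vertex.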

Transforming Control Flows into Set Operations
After getting the perfectly nested loops, we have the opportunity to transform the control flow in the loops into set operations. In this section, we propose a general transformation approach for graph mining applications.

Basic Transformation.
In this section, we transform the control flow introduced by perfectly nested loops. Since the number of iterations of the loops affects the result, we design a statistical approach based on set operations to calculate the number of iterations. By using the inclusion-exclusion principle (IEP), we can count the number of iterations instead of iterating through all nested loops. We use the pseudocode for counting embeddings at Lines 2 ∼ 7 of Figure 5 to illustrate the basic idea of the transformation. For convenience, we call the computation inside the innermost loop (e.g., cnt = cnt + 1) the original function. Obviously, the execution result of the original function is not related to the values of the loop variables but only depends on the number of iterations of the perfectly nested loops. If we can get the number of iterations of the loops, assuming it is n, we can execute cnt = cnt + n instead of the loops and the original function.
Since graph mining applications require that the ids of vertices in an embedding are different (i.e., Line 6 in Figure 5), we cannot directly calculate the number of iterations by multiplying the length of each loop. Fortunately, we can use the inclusion-exclusion principle (IEP) in combinatorial mathematics to calculate the number of iterations (i.e., the function times() at Line 9 of Figure 5). The calculation process of IEP consists of intersection and difference operations on loop sets. In this way, we successfully transform the control flow of the perfectly nested loops into set operations.
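To make the basic transformation concrete, the sketch below counts embeddings of a simplified stand-in pattern (a 3-star: one center with three distinct neighbors; this toy pattern and graph are ours, not the paper's Cycle-6-Tri). Because the three innermost loop sets are all N(v0), the IEP count collapses to n(n − 1)(n − 2), and the loop version and the transformed version must agree:

```python
def count_star_loops(adj):
    """Count ordered embeddings of a 3-star (center v0 plus three pairwise
    distinct neighbors) by actually executing all nested loops."""
    cnt = 0
    for v0 in adj:
        for v1 in adj[v0]:
            for v2 in adj[v0]:
                for v3 in adj[v0]:
                    if v1 != v2 and v1 != v3 and v2 != v3:
                        cnt += 1
    return cnt

def count_star_times(adj):
    """Same count with the three innermost loops replaced by an IEP-style
    closed form: all three loop sets equal N(v0), so the number of
    pairwise-distinct triples is n * (n - 1) * (n - 2)."""
    cnt = 0
    for v0 in adj:
        n = len(adj[v0])
        cnt += n * (n - 1) * (n - 2)
    return cnt

adj = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
assert count_star_loops(adj) == count_star_times(adj) == 24
```

The transformed version does constant work per outer vertex, whereas the loop version is cubic in the degree.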

Dependence-Aware Transformation. The basic transformation above cannot be directly applied to general graph mining algorithms, as the execution result of an original function may be related to the vertex ids of the perfectly nested loops. For example, if the original function is func(vE): setE = setE ∪ {vE}, which is used for Frequent Subgraph Mining in §5.2, the execution result of func(vE) is related to vE. Therefore, we cannot just count the number of iterations and ignore the value of vE. To address this, GraphSet performs dependence-aware transformation for general graph mining algorithms.
We present the dependence-aware transformation in two steps. The first step is dependence-aware reschedule. It analyzes the dependence between the original function and the loop variables, and extracts the loop variables that the original function depends on from the perfectly nested loops. Thus the original function no longer depends on the variables in the inner layers of the perfectly nested loops. The second step is computation transformation. It transforms the control flow in the perfectly nested loops into set operations by counting the number of loop iterations, assuming the number is n, and tries to execute the original function n times with fixed parameters in a more efficient way. For convenience, we formally define the iteration vertex and iteration space of the i-th loop in the k layers of perfectly nested loops as v_i and S_i (i ∈ [1, k]), respectively.

Dependence-Aware Reschedule. If the value of an iteration vertex affects the execution result of an original function, we say that the function has a dependence on the vertex. The core idea is to convert perfectly nested loops with function-dependent vertices into perfectly nested loops without function-dependent vertices.
We analyze the dependence between the original function and the loops. For an iteration vertex v_i, if the original function is commutative and has a dependence on v_i, we extract v_i and its corresponding loop from the perfectly nested loops. If the original function can be split into multiple parts with different dependencies (like FSM in §5.2), we place each part separately inside a perfectly nested loop, perform a dependence analysis, and extract the iteration vertices.
As shown in Figure 4(b), if the original function func(vE) satisfies the commutative law, we can extract the iteration vertex vE along with loopE from the nested loops, so that func(vE) no longer has a dependence on the remaining two-layer perfectly nested loop (i.e., Lines 5 ∼ 7 in Figure 4(b)).

Computation Transformation. After dependence-aware rescheduling, the original function no longer depends on the perfectly nested loops, so we get the opportunity to eliminate the control flow in the loops. Similar to the basic transformation in §4.2.1, we transform the nested loops into set operations and calculate the number of loop iterations, which is also the number of times the original function executes.
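The two steps can be sketched together on the FSM-style original function func(vE): setE = setE ∪ {vE} (a toy reenactment with made-up sets; the names loopD/loopE/loopF follow the paper, everything else is ours). The vE loop is extracted, and the remaining two loops are replaced by an IEP count:

```python
def pair_count(A, B):
    # Iterations of two nested loops over A and B with vD != vF (IEP).
    return len(A) * len(B) - len(A & B)

loopD, loopE, loopF = {1, 2, 3}, {2, 3, 4}, {3, 4, 5}

# Before reschedule: func(vE) runs inside all three nested loops.
setE_ref = set()
for vD in loopD:
    for vE in loopE:
        for vF in loopF:
            if vD != vE and vD != vF and vE != vF:
                setE_ref.add(vE)

# After reschedule: the vE loop (which func depends on) is extracted, and
# the inner two loops are replaced by a count with vE excluded from both sets.
setE = set()
for vE in loopE:
    n = pair_count(loopD - {vE}, loopF - {vE})
    if n > 0:            # n identical set insertions collapse to one
        setE.add(vE)

assert setE == setE_ref
```

Note that the extracted loop still executes, but the cubic nest has become a linear loop over loopE plus constant-size set arithmetic.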
After getting the execution count, we try to reduce the repeated execution of the original function. Letting the overhead of the computation func(vE) be c, we define a computation as reducible if and only if there is an algorithm whose overhead is less than c × n and that yields the same result as calling the original function n times repeatedly. For example, if there is a composition function func(n, vE) that is equivalent to executing func(vE) n times, we can replace the workload at Lines 3 ∼ 8 in Figure 4(b) with one call to func(n, vE) to reduce the computation, as shown in Figure 4(c).
In the worst case, we cannot find an efficient composition function, but we can still simply execute the original function n times. Doing so has twofold advantages. First, we have transformed the control flow into computation-friendly set operations by calculating the number of iterations, which already brings some optimization. Second, repeating the original function n times with the same parameters has better locality and more parallelism opportunities.
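Two hypothetical composition functions illustrate the reducibility definition (the function names are ours; the correctness criterion is that one composed call equals n original calls with the same argument):

```python
# Counter update: n additions of 1 collapse to one addition of n.
def orig_add(state, vE):
    state["cnt"] += 1

def comp_add(n, state, vE):
    state["cnt"] += n

# Set insertion (the FSM case): n identical insertions collapse to one,
# provided n > 0.
def orig_insert(state, vE):
    state["seen"].add(vE)

def comp_insert(n, state, vE):
    if n > 0:
        state["seen"].add(vE)

# Verify the composition property for several values of n.
for n in (0, 1, 5):
    a, b = {"cnt": 0}, {"cnt": 0}
    for _ in range(n):
        orig_add(a, 7)
    comp_add(n, b, 7)
    assert a == b

    a, b = {"seen": set()}, {"seen": set()}
    for _ in range(n):
        orig_insert(a, 7)
    comp_insert(n, b, 7)
    assert a == b
```

Both composed calls run in constant time regardless of n, so each computation is reducible in the sense defined above.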

Set-Based Computation Reduction
After applying dependence-aware transformations, the overhead in perfectly nested loops changes from the cost of control flow and a large number of set difference operations to the cost of a fixed and small number of intersection and difference operations introduced by the calculation of IEP. We further reduce the computation overhead from two perspectives: 1) directly reducing the number of IEP calculations by loop-invariant IEP reduction, and 2) reducing the number of set operations in each IEP calculation by common intersection elimination and pattern-aware difference simplification.

Loop-Invariant IEP Reduction. Since the main computational overhead after the transformation of control flow is introduced by IEP calculation, we first consider reducing the number of IEP calculations (i.e., the calls to the function times()). Observing that the results of some IEP calculations do not change with the iteration of the outer loop, we search for and eliminate these loop-invariant results.
According to the calculation process of IEP, as long as the input sets of the function times() do not change, the result will not change. Taking Figure 4(c) as an example, the input sets are loopD - {vE} and loopF - {vE}. Based on the properties of set operations, if vE ∉ loopD ∪ loopF, then loopD - {vE} = loopD and loopF - {vE} = loopF, which means that no matter what the value of vE is, the results of times() are the same. Therefore, we can calculate this common result in advance and reuse it when vE does not belong to the two sets, as shown in Lines 5 ∼ 6 in Figure 4(d). Although this optimization introduces a check at Line 5 of whether each vE is in the union or not, it is still beneficial because each vE has a high probability of not being in union_DF, thus avoiding executing times() with its O(2^k) set operations.

Common Intersection Elimination. To reduce the number of intersection operations in IEP calculation, we utilize the properties of set intersection and eliminate the common calculation of intersections. Specifically, we expand the intersection operations of the IEP formula, rewrite them in a form where all the operands are neighborhoods (i.e., N(v)), and then extract the common intersections.
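The loop-invariant IEP reduction can be sketched as follows (a toy two-set version of times() with made-up sets; union_DF follows the paper's naming): the result for vE outside loopD ∪ loopF is computed once and reused.

```python
def times(A, B):
    # Iterations of two nested loops over A and B with distinct variables.
    return len(A) * len(B) - len(A & B)

loopD, loopE, loopF = {1, 2, 3}, {10, 11, 2}, {3, 4, 5}

# times(loopD - {vE}, loopF - {vE}) is loop-invariant whenever vE lies
# outside loopD ∪ loopF, so that common result is computed in advance.
union_DF = loopD | loopF
invariant = times(loopD, loopF)

for vE in loopE:
    if vE not in union_DF:
        n = invariant                           # reuse precomputed result
    else:
        n = times(loopD - {vE}, loopF - {vE})   # rare slow path
    # Sanity check: the cached value matches a fresh computation.
    assert n == times(loopD - {vE}, loopF - {vE})
```

In the real system the saving is larger, since each avoided times() call costs O(2^k) set operations rather than one subtraction.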
We use a more complicated IEP calculation over three sets in Figure 6 to illustrate the reduction of set operations in IEP calculation more clearly. To illustrate the basic idea, we substitute the formula of IEP into the function times() and obtain the expanded form shown in Figure 6.

GRAPHSET-BASED GRAPH MINING
This section illustrates how GraphSet can be used to optimize pattern-aware graph mining applications. We first describe the programming interfaces, then present several case studies.

Programming Interfaces
GraphSet provides easy-to-use and flexible programming interfaces, which enable users to apply set-based transformations for different applications. The workflow of graph mining algorithms can generally be divided into two steps: specifying the patterns to be mined and searching for embeddings. Accordingly, we design two types of interfaces in GraphSet.

Pattern Specification. Listing 2 shows some Pattern Specification APIs. Users can utilize these APIs to describe the structure of the patterns they are interested in.

Searching for Embeddings. After specifying the patterns to be mined, GraphSet starts searching for embeddings and performs post-processing (e.g., a composition function) on them. All the optimizations proposed in the set-based transformation (§4) can be performed by GraphSet in a fully automated way, except for the generation of composition functions. GraphSet provides two ways to get the composition function of an algorithm: 1) users can utilize the operators built into GraphSet and let GraphSet automatically generate the composition function, or 2) they can directly provide the composition function. Listing 3 presents the relevant interfaces. Users need to inherit Orig_UDF or Comp_UDF and overload the functions.

For some commonly used operators, GraphSet natively supports the automatic set-based transformation of the original functions described by these operators. Table 1 shows the operators supported by GraphSet. For each operator, there is a corresponding interface provided by GraphSet. All interfaces take an in-place computation, i.e., x = f(x, y). Users can directly use these operators in orig_func(v_id) to describe original functions. The argument v_id is the vector of IDs of the iteration vertices in the schedule. For example, the ID of vE is v_id[4]. GraphSet can automatically analyze dependencies based on the computation workload related to v_id and perform dependence-aware reschedule.
GraphSet also provides an interface comp_func(n, v_id) for user-defined composition functions, to cope with situations where original functions are too complex to be described by the operators GraphSet provides. The argument n indicates the execution count of the original function.
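A Python mirror of this interface shape may help (GraphSet's actual API is not reproduced here; only the names Orig_UDF, Comp_UDF, orig_func, and comp_func come from the text, and the MiniEngine-style driver below is purely illustrative):

```python
class Orig_UDF:
    def orig_func(self, v_id):
        """Called once per innermost iteration; v_id is the vector of
        iteration-vertex IDs."""
        raise NotImplementedError

class Comp_UDF(Orig_UDF):
    def comp_func(self, n, v_id):
        """Must equal n calls to orig_func with the same v_id."""
        raise NotImplementedError

class CountUDF(Comp_UDF):
    """Embedding counting: the composition of n add(cnt, 1) operations
    is a single add(cnt, n)."""
    def __init__(self):
        self.cnt = 0
    def orig_func(self, v_id):
        self.cnt += 1
    def comp_func(self, n, v_id):
        self.cnt += n

# The transformed engine calls comp_func once instead of executing
# orig_func for each of the n innermost iterations.
naive, transformed = CountUDF(), CountUDF()
for _ in range(42):
    naive.orig_func([0, 1, 2])
transformed.comp_func(42, [0, 1, 2])
assert naive.cnt == transformed.cnt == 42
```

The inheritance split mirrors the two usage modes described above: overloading only orig_func lets the system derive the composition automatically for supported operators, while overloading comp_func supplies it directly.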

Set-Based Transformation Case Studies
In this section, we use three common graph mining applications to illustrate how to apply set-based transformations.

Counting Applications. Many graph mining problems, such as Clique Counting and Motif Counting [34], only need to count the number of embeddings. We have discussed the basic transformation for counting embeddings in §4.2.1, and this transformation is also applicable to other counting applications. As shown in Listing 4, users can utilize the operator add(cnt, 1) to describe the original function cnt = cnt + 1.

Frequent Subgraph Mining (FSM) [6]. FSM is another common graph mining application. Note that although the vertices of patterns in FSM are labeled, to simplify the pseudocode, we treat vertices as unlabeled. Listing 5 is an example of mining the Cycle-6-Tri pattern in FSM with the minimum node image (MNI) [6] support measure. Listing 6 shows the corresponding usage of GraphSet's interfaces.

Pattern Existence Query [29]. It aims to determine whether a pattern exists in a data graph. Obviously, a pattern exists in a data graph if and only if its occurrence count is greater than 0. Therefore, the pattern existence query can be optimized to the form shown in Listing 7 after set-based transformations.
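In the same spirit as the existence-query optimization, a search can return as soon as the first embedding is found instead of completing the full count. A toy sketch for triangle existence (the graphs and helper are ours, not Listing 7):

```python
def triangle_exists(adj):
    """Existence query: stop at the first embedding rather than counting
    all of them."""
    for u in adj:
        for v in adj[u]:
            # Any common neighbor of u and v closes a triangle; in a
            # simple graph it is automatically distinct from u and v.
            if adj[u] & adj[v]:
                return True
    return False

tri = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
path = {0: {1}, 1: {0, 2}, 2: {1}}
assert triangle_exists(tri) is True
assert triangle_exists(path) is False
```

Combined with the set-based count, the check reduces to testing whether the first nonzero times() result appears, after which the whole search can terminate.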

ARCHITECTURE-AWARE OPTIMIZATIONS
In this section, we describe the main techniques of GraphSet's efficient cross-platform implementations on both CPU and GPU.

Two-Level Parallel Strategy. To adapt to different architectures and parallel computing capabilities, such as CPU and GPU, we propose a two-level parallel strategy: intra- and inter-group parallelism. We partition execution threads into different groups. All threads within a group process a set operation in parallel to achieve intra-group parallelism, and different groups traverse different branches of a DFS tree with a work-stealing algorithm to achieve inter-group parallelism.
Since a CPU has few cores and its hardware parallelism is far lower than the inter-group parallelism of an application, a single thread per group is sufficient to achieve workload balance. In the CPU implementation of GraphSet, each thread forms its own group for inter-group parallelism, and sequential algorithms are used within a group. To achieve better workload balance, different groups traverse edge sets instead of the vertex sets commonly used in other CPU graph mining systems [31,32,40]. A GPU, with much higher hardware parallelism, needs a suitable parallelization scheme to fully utilize its computing capability. Simply using a parallel model similar to the CPU implementation would result in low warp utilization (about 4.4%) and low global memory bandwidth utilization (about 13%) [38]. GraphSet therefore uses a warp (32 threads) as a group. Since the average set cardinality in graph mining applications is usually smaller than 32, the number of threads in a warp, a warp is generally able to provide sufficient parallelism for set operations.
When both CPU and GPU are available for computation, GraphSet chooses between them according to the parallelism of a task. For most graph mining problems, which provide enough parallelism, the GPU's powerful parallel computing capability can be effectively utilized, and the CPU only needs to do some lightweight work, such as preprocessing data. But when the parallelism is insufficient, the GPU cannot be fully utilized because its performance is limited by latency rather than bandwidth. Therefore, when the parallelism provided by a workload is not greater than the number of CPU cores, our system directly uses the CPU for computation, which has higher execution efficiency and saves the overhead of GPU data transfers and kernel launches.

Set Operation Optimizations. Set operations, especially intersection, account for most of the execution time in graph mining, so an efficient set operation strategy is important for performance. Based on the two-level parallel strategy, we select the serial merging algorithm for CPU intersection and the merge path algorithm [22] for GPU intersection. Since the merge path algorithm needs scan operations before storing intersections, significant extra overhead is introduced by multiple synchronizations. Therefore, if we only need the cardinality of an intersection instead of storing it, the inefficient scan operations can be avoided. GraphSet analyzes the usage of intersections during preprocessing. If an intersection is not used for intersecting with other sets, the scan operation is not performed, and only the cardinality of the intersection is stored. This optimization is very effective for the set-based transformation in §4 when calculating the execution times of original functions.

Efficient Code Generation. Most graph mining systems [10,11,18,29] use adaptable code to run algorithms, which means that the same code is used to process arbitrary input patterns. Since patterns can differ at each execution, such adaptable code lacks opportunities to optimize the implementation for each input pattern. To address this problem, GraphSet automatically generates high-performance graph mining code on CPU and GPU for original functions and composition functions according to user input patterns. The generated code only keeps the necessary parts, thus reducing the number of loops and conditional statements, as well as the usage of hardware resources such as registers. These optimizations are particularly effective on GPU. For example, if the occupancy of a GPU kernel is bounded by the number of available registers, reducing the registers used can improve occupancy and program performance.
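The cardinality-only intersection described under Set Operation Optimizations can be sketched as a serial merge over sorted adjacency lists (a CPU-style sketch in Python for clarity; GraphSet's implementation is in native CPU/GPU code):

```python
def intersect_count(a, b):
    """Merge-style intersection over two sorted adjacency lists that only
    returns the cardinality, skipping the write-out (and, on GPU, the
    scan passes) needed to materialize the result set."""
    i = j = cnt = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            cnt += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return cnt

a, b = [1, 3, 5, 7, 9], [3, 4, 5, 9, 10]
assert intersect_count(a, b) == len(set(a) & set(b)) == 3
```

Returning only the count is exactly what the transformed times() computation needs, since the execution count of the original function is a cardinality, never a materialized set.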

EVALUATION

Methodology
Platforms. Our CPU experiments are performed on a single machine with an Intel Xeon Platinum 8259CL CPU @ 2.50GHz, 1 socket (16 cores, hyper-threading disabled) and 256 GB memory. Our GPU experiments are evaluated on NVIDIA Tesla V100 (32GB memory) GPUs for PCIe with CUDA 11. We exclude graph loading, preprocessing, and compilation time in all systems.

Graph Mining Applications. We evaluate GraphSet on four popular graph mining applications. All patterns in the evaluation are edge-induced. Pattern Matching (PM) aims at discovering all embeddings of input patterns in a graph; we adopt the 6 patterns shown in Figure 8. k-Motif Counting (k-MC) aims at counting all connected patterns of size k. k-Clique Counting (k-CC) aims at counting the fully-connected edge-induced pattern with k vertices. k-Frequent Subgraph Mining (k-FSM) aims at listing all frequent labeled patterns with k edges in a graph; a pattern is frequent if its support is not less than a given threshold. We choose the minimum node image (MNI) [6] support measure, similar to previous systems [11,29]. Note that 3-FSM in our evaluations is the same as 4-FSM in the evaluations of Pangolin, Sandslash, and G2Miner, because their definitions of k-FSM differ from other systems.

Datasets. We use 7 real-world graphs as shown in Table 2. The numbers of vertices and edges range from 96.6K to 65.6M and from 1.1M to 3.6B, respectively. The Patents dataset has two versions, and the labeled version is only used in FSM.

Overall Performance
Pattern Matching. We compare GraphSet with Sandslash, Peregrine, GraphPi, G2Miner, GPSM, and PBE for Pattern Matching experiments using 6 patterns on four real-world graphs. We do not compare GraphSet with Pangolin, because we fail to run pattern matching using its open-source code. Figure 9 compares the performance using a log scale. GraphSet-CPU outperforms Sandslash, Peregrine, and GraphPi by up to 66.1×, 42509×, and 8.38×, respectively (4.25×, 148.47×, and 3.17× on average), for the 6 patterns on different graphs. Although GraphPi and Peregrine also apply some optimizations when counting embeddings, their optimizations are not as general and thorough as GraphSet's set-based transformations. For GPU systems, GraphSet-GPU outperforms GPSM, PBE, and G2Miner by up to 412081×, 3774×, and 243×, respectively (2354×, 186×, and 10.23× on average).
To demonstrate the generalizability of GraphSet, we compare GraphSet with GSI [51] and cuTS [48] on the 6 datasets and 33 queries used in cuTS's evaluation. To run the three systems successfully, we extend the directed edges of these datasets and queries into undirected edges. As shown in Table 3, on the datasets roadNet-PA, roadNet-TX, and roadNet-CA, GraphSet performs comparably to cuTS, with execution times ranging from 1 to 6 milliseconds, and outperforms GSI by 248× on average. On the other datasets, GSI fails, and GraphSet outperforms cuTS by 781× on average.
k-Motif Counting. For k-Motif Counting, we perform experiments with GraphSet, Sandslash, Peregrine, G2Miner, and Pangolin on four real-world graphs, where k is the pattern size. As shown in Table 4, GraphSet-CPU outperforms Sandslash, Peregrine, and Pangolin-CPU by up to 23.22×, 5.00×, and 456×, respectively (11.77×, 3.78×, and 114× on average), and GraphSet-GPU outperforms Pangolin-GPU and G2Miner by up to 16.44× and 32.80×, respectively (15.86× and 9.20× on average). Pangolin uses a pattern-oblivious algorithm, which incurs heavy computation overhead and results in low performance. GraphSet-CPU is only 3.78× faster than Peregrine on average because the patterns used in motif counting are relatively small and their computation complexity is not high enough; GraphSet has more transformation opportunities for patterns with more vertices.
k-Clique Counting. We perform 4-CC and 5-CC with GraphSet, GraphPi, Sandslash, Peregrine, Pangolin, G2Miner, and PBE on all the unlabeled graphs. As shown in Table 5, GraphSet-CPU outperforms GraphPi, Sandslash, Peregrine, and Pangolin-CPU by up to 9.15×, 2.29×, 42.9×, and 21.7× (4.05×, 1.18×, 8.07×, and 8.80× on average). GraphSet is slower than Sandslash on some graphs because Sandslash uses a special bitmap-based optimization for clique counting; this optimization does not carry over to other graph mining applications because its memory overhead is too large. GraphSet-GPU outperforms G2Miner,
Pangolin-GPU and PBE by up to 15.37×, 58.25×, and 243.8× (4.65×, 14.19×, and 69.83× on average).
k-Frequent Subgraph Mining. We perform k-Frequent Subgraph Mining experiments with GraphSet, Sandslash, Peregrine, G2Miner, and Pangolin on three labeled real-world graphs. The performance of 2-FSM and 3-FSM is shown in Table 6. GraphSet-CPU outperforms Sandslash, Peregrine, and Pangolin-CPU by up to 105.6×, 1300×, and 3384×, respectively (11.8×, 101.8×, and 62.39× on average), and GraphSet-GPU outperforms G2Miner and Pangolin-GPU by up to 4.55× and 5.26× (2.46× and 2.02× on average). Graph mining systems like Pangolin, which use pattern-oblivious algorithms with a BFS implementation, are well suited to FSM; on GPUs, BFS implementations have an obvious advantage due to their higher parallelism. Pattern-aware systems like Peregrine, Sandslash, G2Miner, and GraphSet do not save intermediate subgraph results, so the same subgraph is searched many times, which introduces substantial repeated computation. Even so, GraphSet achieves performance comparable to Pangolin. Moreover, because Pangolin stores intermediate data, it crashes due to lack of memory when the search space is large (e.g., 3-FSM on all three graphs), whereas GraphSet easily handles large search spaces.
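The clique-counting workloads above boil down to repeated neighborhood intersections. The following is a toy pattern-aware 4-clique counter in that spirit (a sketch of the general set-intersection approach, not GraphSet's generated code):

```python
# Count 4-cliques via set intersections. adj maps each vertex to the set
# of its neighbors; vertices are extended in increasing id order so each
# clique is counted exactly once.
def count_4_cliques(adj):
    count = 0
    for a in adj:
        na = {b for b in adj[a] if b > a}   # higher-id neighbors of a
        for b in na:
            nab = na & adj[b]               # candidates adjacent to both a and b
            for c in nab:
                if c > b:
                    # survivors adjacent to a, b, and c complete a 4-clique
                    count += len({d for d in nab & adj[c] if d > c})
    return count

# K5 (complete graph on 5 vertices) contains C(5, 4) = 5 distinct 4-cliques.
k5 = {i: {j for j in range(5) if j != i} for i in range(5)}
print(count_4_cliques(k5))  # -> 5
```

Real systems replace the Python sets with sorted adjacency lists and merge-based intersections, but the structure of the computation is the same.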

Multi-GPU Scalability
We evaluate the scalability of GraphSet with up to 64 NVIDIA Tesla V100 GPUs. Figure 11 shows the speedup of different applications with different numbers of GPUs on LiveJournal. GraphSet achieves near-linear speedup with 16 GPUs for all tasks. However, some tasks do not continue to scale linearly on 32 or 64 GPUs because their workloads are small (e.g., 3-MC takes only 76 ms on 64 GPUs), so the overhead of kernel launches and other operations becomes non-negligible.

Evaluating Set-Based Transformations
We optimize applications through set-based transformations. To verify their effectiveness, we compare the performance of GraphSet with and without set-based transformations for each application. Because the patterns used in FSM are relatively small and cannot reflect the effectiveness of this optimization, we calculate the MNI supports of the patterns P1 ∼ P6 instead of the supports of the patterns in the original FSM algorithm. As shown in Figure 10 and Figure 12, GraphSet-CPU with set-based transformations outperforms GraphSet-CPU without them by up to 1839×, 1.44×, and 3667×, respectively (399×, 1.19×, and 505× on average) for Pattern Matching, Motif Counting, and Frequent Subgraph Mining, and GraphSet-GPU with set-based transformations outperforms GraphSet-GPU without them by up to 9654×, 3.66×, and 82027×, respectively (1539×, 2.96×, and 11944× on average).

Effectiveness of Code Generation
GraphSet generates efficient code for graph mining applications. Compared with adaptable code that can handle arbitrary patterns, our generated code creates opportunities for compilation techniques to reduce register usage and improve GPU occupancy. As shown in Figure 13, our generated code outperforms the adaptable code by up to 2.44× (1.52× on average) for pattern matching on GPU. We further analyze the register usage of the different codes: the adaptable code uses 71 registers, while our generated code uses 32 ∼ 41 registers depending on the pattern. With fewer registers in use, GPU occupancy improves and GPU computing resources are better utilized, improving the overall performance of the GPU graph mining algorithm.

RELATED WORKS
Pattern-oblivious algorithms. Graph mining algorithms can be categorized into two types: pattern-aware and pattern-oblivious. Early systems mainly use pattern-oblivious algorithms [7,11,18,43,45]. As a representative, Arabesque [43] creatively proposes the concept of "think like an embedding" (TLE) and defines a high-level filter-process computational model. It searches subgraphs in a breadth-first search (BFS) manner: at each exploration step, it saves a large number of subgraphs, expands them by adding an edge or a vertex, and then prunes illegal subgraphs based on user-defined filter functions. GraphSet uses pattern-aware algorithms instead of pattern-oblivious ones, since the former have better computation and memory efficiency.
Graph mining accelerators. Some hardware accelerators are designed to fully utilize computing and memory resources [8,9,17,38,39,41]. DIMMining [17] combines software pre-processing and hardware architecture to increase set operation throughput. NDMiner [41] mainly focuses on better utilization of memory bandwidth. Fingers [8] fully exploits fine-grained parallelism to overcome the inefficiencies and imbalance of hardware utilization.
Application-specific algorithms. Some algorithms are proposed for a specific graph mining application [1,5,24,25,27,37,42,49]. These algorithms are designed with optimizations tailored to the characteristics of their intended applications, resulting in excellent performance, but the optimizations are not applicable to other applications. GraphSet, as a cross-platform graph mining framework, mainly focuses on optimizations common to widely used graph mining applications. GraphPi introduces IEP for counting embeddings, but it can only accelerate such counting applications. In contrast, GraphSet can handle cases where the innermost workload has complex dependencies, such as FSM. For GraphSet, IEP is merely a tool used to calculate the number of repeated computations and is just one of many steps in the set transformations.
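The BFS-style expand-and-filter loop described above can be illustrated with a toy sketch (a deliberate simplification, not Arabesque's actual filter-process implementation; the increasing-id check stands in for its canonicality filter):

```python
# Pattern-oblivious BFS expansion: every intermediate embedding is
# materialized at each level, which is why memory grows quickly compared
# with pattern-aware depth-first search.
def bfs_expand(adj, size):
    frontier = [(v,) for v in adj]                 # level 1: single vertices
    for _ in range(size - 1):
        next_frontier = []
        for emb in frontier:
            # expand by any neighbor of the current embedding
            for u in set().union(*(adj[v] for v in emb)):
                if u > emb[-1]:                    # cheap stand-in for a canonicality filter
                    next_frontier.append(emb + (u,))
        frontier = next_frontier                   # all embeddings kept in memory
    return frontier

triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
print(len(bfs_expand(triangle, 3)))  # -> 1 (the single canonical embedding (0, 1, 2))
```

A user-defined filter function would be applied to each embedding in `next_frontier` before the next level is expanded.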

CONCLUSION
In this work, we present GraphSet, a high-performance pattern-aware graph mining framework based on equivalent set transformations that supports both CPU and GPU. Results show that GraphSet significantly outperforms state-of-the-art graph mining systems.

Figure 1 :
Figure 1: An example of graph mining.

Figure 2 :
Figure 2: The pattern-aware pseudocode for counting the number of occurrences of the Cycle-6-Tri pattern in a data graph. V_G is the vertex set of the input data graph. N(v) returns v's neighbors (more details in Section §2).
(a)). With dependence analysis, GraphSet performs rescheduling to extract loop variables with dependence out of nested loops (Figure 4(b)). Then, GraphSet transforms control flows in nested loops into the computation of counting iteration times (Figure 4(c)). To reduce the overhead of counting iteration times, GraphSet further performs loop invariant reduction (Figure 4(d)). For the remaining computation overhead of counting iteration times (Figure 4(e)), GraphSet optimizes set operations by utilizing the structural information of input patterns and the properties of set operations (Figure 4(f)).
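The key step, turning control flow into the computation of counting iteration times, can be illustrated with a toy example (illustrative sets only, not the pseudocode from Figure 4): an innermost loop that merely increments a counter over a candidate set collapses into the size of a set expression.

```python
adj = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}}

# Before: explicit innermost loop counting common neighbors of vertices 0 and 1.
count_loop = 0
for w in adj[0]:
    if w in adj[1]:
        count_loop += 1

# After: the loop and its branch collapse into one set-size computation.
count_set = len(adj[0] & adj[1])

print(count_loop, count_set)  # -> 2 2
```

Eliminating the branch-heavy loop in favor of a single set operation is what removes most of the control-flow overhead, especially on GPUs.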

Figure 4 :
Figure 4: An example of set-based transformation (coef is the coefficient in IEP and can be calculated during preprocessing.).
(a), which is mainly composed of set operations. If we expand the operation loopD ∩ loopE and utilize the commutative, associative, and idempotent laws of set intersection, we get the equation loopD ∩ loopE = (N(v_x) ∩ N(v_y) − S) ∩ (N(v_y) ∩ N(v_z) − S) = (N(v_x) ∩ N(v_y) ∩ N(v_z)) − S and substitute it into the formula of IEP. (Figure 6 panels: (a) IEP calculation; (b) common intersection elimination; (c) pattern-aware difference simplification.)

Figure 6 :
Figure 6: The workflow of set operations reduction. coef is the coefficient in IEP and can be calculated during preprocessing. In the same way, we can get loopD ∩ loopF = (N(v_x) ∩ N(v_y) − S) ∩ (N(v_y) ∩ N(v_z) − S) = (N(v_x) ∩ N(v_y) ∩ N(v_z)) − S = loopD ∩ loopE. After simplifying loopE ∩ loopF and loopD ∩ loopE ∩ loopF in the same way, we find that their results are the same expression as the previous two intersections. Since these intersection results are the same, we can extract and eliminate the common intersection and reuse its result in the IEP calculation. Figure 6(b) shows the optimized pseudocode after common intersection elimination.
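The set identity behind common intersection elimination can be sanity-checked numerically (arbitrary example sets, not the neighborhoods from the paper): (A ∩ B − S) ∩ (B ∩ C − S) = (A ∩ B ∩ C) − S, by the commutative, associative, and idempotent laws of intersection.

```python
# Numeric check of the identity used for common intersection elimination.
A = {1, 2, 3, 4, 5}
B = {2, 3, 4, 6}
C = {3, 4, 5, 7}
S = {4}

lhs = ((A & B) - S) & ((B & C) - S)   # two differences, one extra intersection
rhs = (A & B & C) - S                 # one shared intersection, one difference
print(lhs == rhs, sorted(rhs))        # -> True [3]
```

Because the right-hand side appears in several IEP terms, computing it once and reusing it is exactly the elimination Figure 6(b) performs.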

Figure 7 :
Figure 7: Awareness of the structural information of the Cycle-6-Tri pattern.
Pattern-Aware Difference Simplification. We further reduce the number of difference operations by utilizing the structural information of a pattern, based on the following observations: 1) if there is an edge between two vertices in a pattern, then their corresponding vertices in an embedding must belong to each other's neighborhoods. Taking Figure 7 as an example, vertices v_a and v_b are connected, so v_a ∈ N(v_b) and v_b ∈ N(v_a). 2) since the id of each vertex in an embedding is different, we can delete each vertex v from its own neighborhood N(v) during preprocessing without affecting the correctness of the algorithm; that is, we enforce v ∉ N(v). Based on these properties, we can deduce the equations |N(v_x) ∩ N(v_y) − {v_a, v_b, v_c}| = |N(v_x) ∩ N(v_y) − {v_c}| = |N(v_x) ∩ N(v_y)| − 1 and |N(v_x) ∩ N(v_y) ∩ N(v_z) − {v_a, v_b, v_c}| = |N(v_x) ∩ N(v_y) ∩ N(v_z)|. By using the common intersection elimination and pattern-aware difference simplification optimizations, we obtain the pseudocode (Figure 6(c)) that minimizes the number of set intersection and difference operations in the IEP calculation.
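A small numeric check of the difference simplification (toy sets, not the pattern's actual neighborhoods): when preprocessing guarantees that all but one of the excluded vertices cannot appear in the intersection, removing them is a no-op, and removing the one guaranteed member just subtracts 1 from the cardinality.

```python
# Na, Nb stand in for two neighborhoods with each vertex's own id removed.
Na = {1, 2, 5, 9}
Nb = {2, 5, 7, 9}
excluded = {0, 3, 9}   # plays the role of {v_a, v_b, v_c}; only 9 lies in Na ∩ Nb

inter = Na & Nb        # {2, 5, 9}
print(len(inter - excluded) == len(inter - {9}) == len(inter) - 1)  # -> True
```

This is why the explicit set-difference operations can be replaced by a constant adjustment to the intersection size.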

Listing 6:
Core pseudocode of the optimized FSM algorithm.
# Initialize and describe the composition function

Listing 7:
Core pseudocode of the optimized existence query. comp_func and the pseudocode optimized by set-based transformation. Compared to the original function, the composition function has only one more statement (i.e., if n > 0), because the result is the same whether the set difference operation is executed once or several times.
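The property invoked here, that a set difference applied once or several times yields the same result, is easy to verify with toy sets (illustrative only; not the listing's actual code):

```python
# Repeated set difference is idempotent in effect: for an existence query,
# knowing the difference was applied at all (n > 0) is as good as knowing
# how many times it was applied.
A = {1, 2, 3, 4}
B = {2, 4}

once = A - B
many = ((A - B) - B) - B
print(once == many, sorted(once))  # -> True [1, 3]
```

This is why the existence-query composition function needs only the single extra `if n > 0` check rather than the full repeat count.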

Figure 8 :
Figure 8: Patterns used in our evaluation.

Figure 9 :
Figure 9: The overall performance of pattern matching.
Table 2: Graph datasets.

Figure 10 :
Figure 10: Performance of pattern matching with and without set-based transformation. "ST" denotes set-based transformation.

Figure 13 :
Figure 13: Performance of pattern matching with and without code generation.
# Initialize and describe the original function

Table 4 :
The execution time (seconds) of k-Motif Counting. "C" means corruption.

Table 5 :
The execution time (seconds) of k-Clique Counting. "C" means corruption.

Table 6 :
The execution time (seconds) of k-Frequent Subgraph Mining. "C" means corruption. "T" means the execution time exceeds 12 hours.