Exploiting Fine-Grained Redundancy in Set-Centric Graph Pattern Mining

Graph Pattern Mining (GPM) applications are memory intensive as they require a tremendous number of edge checks. In recent years, the "set-centric" abstraction has gained attention for its expressive power. By leveraging relational algebra, set-centric systems optimize algorithms with methods such as matching orders, early termination, automorphism breaking, and result reuse to reduce redundancy. However, these approaches primarily address coarse-grained redundancy arising from identical set formulas, neglecting that the data graph's inherent locality may lead to fine-grained duplicated edge checks. In fact, even unrelated set operations may check the same pair of vertices. This paper introduces the set union operation into the set-centric abstraction to fuse duplicated edge checks into one. It retains the expressive power of relational algebra and previous optimizations while effectively avoiding fine-grained redundancy in GPM tasks. Compared to state-of-the-art methods, our approach achieves significant speedup on a V100 GPU cluster, demonstrating up to 305× faster performance than the state-of-the-art GPM system G2Miner.


Introduction
Applications that use a graph to model data are becoming increasingly important and widespread. Graph pattern mining (GPM) is an emerging class of graph problems that discovers subgraphs matching given patterns in a data graph and has seen plenty of adoption in many fields. By revealing the structure and relationships of a graph, GPM can expose trends and correlations, making it a vital technique for discovering structures and patterns in networks, as well as correlations between real-world entities. It is used in a wide range of applications such as predicting user behavior [3], detecting anomalies [2,21], social networks [20,22], chemoinformatics [11,16,23], classification [9], and recommending items [12,17].
Graph pattern mining (GPM) is essentially a search problem in which a pattern graph needs to be found within a large data graph. The search process starts by matching one vertex of the pattern graph in the data graph and iteratively expands to its neighbors, checking whether the formed subgraphs satisfy the pattern constraints. The key computational step in this process is the edge check, which determines whether two vertices are connected. GPM algorithms often involve a search space that is exponential in the pattern size. With the increasing computational and storage demands of advanced graph mining algorithms, computing power is often exhausted.
Recently, advanced graph pattern mining systems have adopted the set-centric paradigm to improve search efficiency. The core idea is to exploit the topological information of the pattern graph to filter out intermediates that will not lead to a valid final match. Set-centric GPM systems often express the searching process as nested loops. At each iteration, the searching process is formalized using expressions composed solely of set intersections (∩) and set subtractions (−). For example, an expression like S(v2) = N(v0) ∩ N(v1) indicates that a valid candidate for v2 must be connected to both v0 and v1.
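As a concrete illustration, the nested-loop style above can be sketched in a few lines of Python for triangle counting, where the candidate set S(v2) = N(v0) ∩ N(v1) is a literal set intersection. This is a toy sketch with dictionary-of-sets adjacency, not the GPU code of any of the cited systems:

```python
# Set-centric nested-loop sketch for triangle counting (illustrative only).
# adjacency is a dict mapping each vertex to its neighbor set N(v).
def triangle_count(adj):
    count = 0
    for v0 in adj:                      # match v0
        for v1 in adj[v0]:              # match v1 from S(v1) = N(v0)
            if v1 >= v0:                # symmetry breaking: require v1 < v0
                continue
            # S(v2) = N(v0) ∩ N(v1): candidates connected to both v0 and v1
            for v2 in adj[v0] & adj[v1]:
                if v2 < v1:             # symmetry breaking: require v2 < v1
                    count += 1
    return count

# Example graph: a 4-vertex clique on {0,1,2,3} minus the edge (0,3),
# which contains exactly the two triangles {0,1,2} and {1,2,3}.
adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2}}
```

The symmetry-breaking comparisons (v1 < v0, v2 < v1) play the role of the automorphism-breaking rules mentioned above, so each triangle is enumerated exactly once.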
The set-centric paradigm enables plentiful optimizations by harnessing the equivalence transformations of relational algebra. For example, AutoMine [19] presents a compiler that generates optimal matching orders with minimal search space by estimating the cost of set formulas. GraphPi [24] leverages the principle of set inclusion-exclusion to directly calculate the result of the counting task. GraphZero [18] introduces a combination of symmetry-breaking rules that eliminates redundant searches. SumPA [14] extracts common set formulas from set-centric expressions to reuse intermediates, while Cyclosa [13] caches the intermediates for some vertices and reuses them in a set dataflow.
Most of these systems pursue the "pattern-aware" paradigm, in which they transform the topological information of the pattern graph into set expressions and analyze them using relational algebra.

Background
Graphs. A graph can be represented as G = (V, E), where V is a finite set of vertices and E ⊆ V × V is a set of edges. In real life, E is often very sparse, which means |E| ≪ |V| × |V|. N(v) denotes the set of all neighbors of v. In addition, u ∼ v means u is connected to v, and u ≁ v is the opposite. For the convenience of discussion, we only consider unlabeled and undirected patterns and graphs, and we assume all data fits in the aggregated device memory.
Graph pattern mining algorithm. A graph pattern mining (GPM) algorithm A finds or counts all distinct instances (Figure 1 (c)) in a data graph (Figure 1 (a)), where an instance I_k is an embedding of a subgraph that is isomorphic to the given pattern P_k (Figure 1 (b)) on k (k > 2) vertices, e.g., clique counting and motif counting. Checking the connectivity of a given vertex pair is the essential kernel in GPM algorithms. For convenience, ⟨u_i, u_j⟩ denotes an edge check, where u_i, u_j ∈ V. ⟨u_i, u_j⟩ = 1 indicates that there is an edge between u_i and u_j (the connected type); otherwise ⟨u_i, u_j⟩ = 0 (the disconnected type). A typical GPM algorithm follows a "generate-check" procedure, which first generates a candidate and then checks whether it passes the above checks. The algorithm can proceed in BFS order, materializing partial instances at high parallelism, or in DFS order to reduce memory consumption.
Set-centric abstraction. Given a k-vertex pattern P_k and a data graph G, a matching order Π is a permutation of the pattern vertices, and v_i ≺ v_j means v_i precedes v_j. Different matching orders lead to different nested loops, which vary significantly in execution time. Various cost models [7,14,15,18,19,24] have been proposed to predict the best-performing order before mining. The set-centric GPM algorithm incrementally considers a vertex v_i (i < k) and generates its candidate set S(v_i), which contains all the vertices that can extend an instance I_i into I_{i+1}, where I_i is a partial instance of P_k. A function F_i(I_i) is introduced to generate the candidate set S(v_i) = F_i(I_i), and F_i(I_i) can be implemented using only set intersection and subtraction [19]. For example, for the pattern in Figure 1 (b): P(v1) = S(v1) = N(v0); S(v2) = P(v1) ∩ N(v1); P(v3) = N(v1) − N(v0) − N(v2).
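Since edge checks over sorted adjacency lists are the essential kernel described above, a minimal sketch may be helpful. The CSR layout and binary-search lookup shown here are standard practice in GPM systems such as [7]; the function names are our own:

```python
import bisect

# Hypothetical CSR (compressed sparse row) layout:
# row_ptr[v] .. row_ptr[v+1] indexes v's sorted neighbor list inside col_idx.
def build_csr(edges, n):
    adj = [[] for _ in range(n)]
    for u, v in edges:            # undirected graph: store both directions
        adj[u].append(v)
        adj[v].append(u)
    row_ptr, col_idx = [0], []
    for v in range(n):
        col_idx.extend(sorted(adj[v]))
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx

def edge_check(row_ptr, col_idx, u, v):
    """<u, v>: return 1 if u ~ v, else 0, via binary search in u's neighbor list."""
    lo, hi = row_ptr[u], row_ptr[u + 1]
    i = bisect.bisect_left(col_idx, v, lo, hi)
    return 1 if i < hi and col_idx[i] == v else 0
```

Each edge check costs O(log deg(u)); the rest of the paper is about how often this kernel is invoked redundantly.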

Fine-Grained Redundancy
Loop and edge checks. In the set-centric paradigm, a GPM task can be represented as a nested loop, in which the vertices of the pattern graph are matched one by one according to the given matching order Π. At each iteration of the GPM procedure, we typically perform set operations to narrow down the candidates for each pattern vertex. Consider the pattern graph and the data graph in Figure 1 (a) and (b); the nested-loop pseudo-code can be written as in Figure 1 (c). During each iteration, we need to calculate two sets, P(v_i) and S(v_i), without any user definition. P(v_i) represents the valid mappings of v_i, consisting of all vertices that satisfy the constraints ⟨v_i, v_j⟩ for v_j with j < i. S(v_i) represents the candidates of v_{i+1}, consisting of all vertices that satisfy the constraints ⟨v_{i+1}, v_j⟩ for v_j with j < i. Since the calculation of P(v_i) does not depend on the current loop variable, it can be computed outside the iteration in most GPM systems for reusability. The objective of each round of computation is to consider the relationship between vertices in P(v_i) and S(v_i) to determine S(v_{i+1}). In this process, we need to enumerate each pair of vertices from P(v_i) and S(v_i) and check whether they are connected; each such check is referred to as an edge check. To generate the candidate set S(v_i), the algorithm enumerates each vertex u from S(v_{i−1}) and performs P(v_{i−1}) ⊙ N(u), where ⊙ represents ∩ or −; each such computation is denoted a loop operation l. As shown in Figure 1 (d), GPM generates many candidate sets S(v_i) on the i-th iteration for different partial instances, e.g., {8, 9, 10} and {8, 9, 11}. For convenience, we use A for P(v_i), B for S(v_i), and l = {A, B} for one loop operation on the i-th iteration, and we denote the m loop operations on the i-th iteration as L_i = {l_1, l_2, ..., l_m}.
Redundancy. We introduce the term fine-grained redundancy, which considers redundancy at the edge-check granularity.
In contrast, previous works are coarse-grained, addressing redundancy at the set-operation granularity.
Coarse-grained redundancy. From the perspective of set-operation granularity, redundancy occurs only between completely identical set operations, that is, those with the same operands and operators. In Figure 1 (c), there is no coarse-grained redundancy, since all set operations are distinct from each other.
Fine-grained redundancy. But if we dive down to the edge-check level, as shown in Figure 1 (d), we can observe that many edge checks, e.g., ⟨1, 5⟩ and ⟨2, 5⟩, are checked twice across the I2-I3 iterations. This fine-grained redundancy stems from characteristics of the data graph rather than the pattern, e.g., the locality of the data graph.
This hidden redundancy remains concealed within the set-centric paradigm. Previous work primarily focuses on redundancy across iterations, identifying redundancy only when encountering exactly matching set formulas, while neglecting redundancy within an iteration. Interestingly, fine-grained redundancy can even arise between entirely different set operations. Suppose we use a redundancy index to quantify the degree of redundancy of each loop. From the coarse-grained view, the redundancy index can only be 0 (no identical set operations) or 1 (100% identical set operations). From the fine-grained view, the redundancy index is a real number between 0 and 1, depending on the edge distribution of the data graph.
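The fine-grained redundancy just described can be measured directly by replaying the edge checks of each loop operation. The following sketch is illustrative; the tuple representation of loop operations is our own simplification:

```python
from itertools import product

# Each loop operation l_j = (A_j, B_j) enumerates every pair in A_j x B_j
# as an edge check.
def redundancy_index(loop_ops):
    """Fraction of edge checks that repeat an already-checked pair
    (the fine-grained view)."""
    total, seen, repeated = 0, set(), 0
    for A, B in loop_ops:
        for pair in product(A, B):
            total += 1
            if pair in seen:
                repeated += 1
            seen.add(pair)
    return repeated / total

# Two *distinct* set operations (so no coarse-grained redundancy) that still
# re-check the pairs built from the shared operand {1, 2} and the shared
# candidate 6 -- fine-grained redundancy invisible at the formula level.
ops = [({1, 2}, {5, 6}), ({1, 2}, {6, 7})]
```

Here the pairs ⟨1, 6⟩ and ⟨2, 6⟩ are each checked twice, giving a redundancy index of 2/8 = 0.25 even though the two set operations are not identical.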
These problems are pervasive in various GPM tasks and significantly impact performance. Figure 1 (e) illustrates the searching behaviors observed on three patterns and two graphs (i.e., LiveJournal (LJ) and MiCo (MI)). For some patterns, individual vertex pairs are checked over ten thousand times repeatedly, leading to severe performance degradation. Furthermore, 80% of set operations exhibit fine-grained redundancy, despite being classified as free of coarse-grained redundancy in previous works. Figure 2 depicts the percentage of duplicate edge checks over the total number of edge checks across different patterns and datasets. From the perspective of fine-grained redundancy, an average of 89.3% of edges are repeatedly checked. To sum up, the previous "pattern-aware" set-centric paradigm ignores this fine-grained redundancy caused by the data graph, and the ∩ and − operations alone are not enough to expose it. It is time to rethink the set-centric paradigm and consider the duplicated low-level intra-iteration edge checks.

Fused Matching
The common edge checks shared by multiple set operations are invisible when set operations are processed one by one. To deal with the fine-grained redundancy caused by common edge checks, we raise the execution granularity of set operations by introducing the set union operation: we first union the operands of a batch of set operations and then perform the edge checks on the union operands once (Sec 4.2). The result introduces a few spurious edge checks, and we apply a lightweight filter process to remove them. We prove the correctness of the algorithm by set algebra (Sec 4.1).
An edge check may be performed in each l_j = {A_j, B_j}, resulting in at most m repeated checks. By introducing the set union ∪, we can merge the m loop operations into a single loop operation L = {A, B}, where A = ∪_j A_j and B = ∪_j B_j. Since the union operation naturally eliminates common vertices across set operands, each edge check is performed at most once in the merged L, as shown in Figure 3. All edge checks contained in the original loop operations are included in the merged set operation, but the merge also introduces a few spurious edge checks that do not occur in the original matching, i.e., over-estimation. These incorrect edge checks arise when the two vertices of an edge check belong to different set operations, i.e., x ∈ A_j but y ∉ B_j. We then propose a correction algorithm to remove these incorrect edge checks.
The principle of fused matching can be explained by Equation 1:

⋃_{j=1}^{m} (A_j × B_j) = (A × B) − ⋃_{j=1}^{m} (A_j × (B − B_j))    (1)

Correctness. The fused matching algorithm finds all correct results obtained by the traditional procedure without any additional erroneous matches.
Proof. Given an edge check ⟨x, y⟩ ∈ (A_j × B_j), we have x ∈ A_j ∧ y ∈ B_j. Then we can deduce that x ∈ A ∧ y ∈ B, which means ⟨x, y⟩ ∈ A × B. Since (B − B_j) ∩ B_j = ∅, any vertex y ∈ B_j satisfies y ∉ (B − B_j), so ⟨x, y⟩ ∉ A_j × (B − B_j). From these two conclusions, ⟨x, y⟩ must belong to A × B − A_j × (B − B_j). To demonstrate that fused matching does not lead to incorrect results, we adopt a proof by contradiction: if the theorem were wrong, there would exist an edge check ⟨x, y⟩ that belongs to the right side of Equation 1 but not to the left side. Under this premise, we can deduce that x ∈ A ∧ y ∈ B. That means there exist A_j and B_k such that x ∈ A_j ∧ y ∈ B_k. Consider two cases: (a) j = k and (b) j ≠ k. In case (a), x ∈ A_j and y ∈ B_j, so ⟨x, y⟩ ∈ (A_j × B_j), which contradicts the premise. In case (b), x ∈ A_j ∧ y ∈ B_k ∧ y ∉ B_j, so ⟨x, y⟩ ∈ A_j × (B − B_j); then this edge check cannot belong to the right side of Equation 1, which contradicts the premise too. Therefore, there are no wrong edge checks in fused matching. □
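The argument can also be exercised numerically in the counting form that the correction step relies on: for every pair ⟨x, y⟩ with y ∈ B, the number of loop operations containing the pair equals the frequency-based over-estimate minus the spurious checks from A_j × (B − B_j). The following brute-force check over random sets is our own test harness, not the paper's code:

```python
import random

def check_identity(loop_ops):
    """Verify, pair by pair, the counting form of Equation 1:
    #{j : x in A_j and y in B_j}
      = #{j : x in A_j} - #{j : x in A_j and y in (B - B_j)}  for every y in B."""
    A = set().union(*(a for a, _ in loop_ops))
    B = set().union(*(b for _, b in loop_ops))
    for x in A:
        freq = sum(1 for a, _ in loop_ops if x in a)   # global over-estimate for x
        for y in B:
            lhs = sum(1 for a, b in loop_ops if x in a and y in b)
            spurious = sum(1 for a, b in loop_ops if x in a and y in (B - b))
            assert lhs == freq - spurious
    return True

# Random loop operations over a small vertex universe.
random.seed(0)
V = range(12)
ops = [(set(random.sample(V, 4)), set(random.sample(V, 5))) for _ in range(6)]
```

The identity holds for any family of loop operations, because for y ∈ B the events "y ∈ B_j" and "y ∈ B − B_j" partition the operations containing x.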

Algorithm Details
During the actual matching process, repeated edge checks are produced from different partial instances. So we need to trace the corresponding partial instances to achieve the goal of "check-one-fit-all". We first introduce the algorithm details and then explain how the set union operation is implemented. Finally, we provide an example for illustration.
Fused matching consists of three steps: (1) Global matching. Perform edge checks in A and B.
(2) Instance recovery. Use the frequency information of vertices to estimate the occurrence counts of edge checks and recover the matching results from them.
(3) Correction. Remove the over-estimated edge checks, i.e., those in A_j × (B − B_j), to obtain the exact results.
Algorithm 1 details the implementation of our fused matching algorithm. It takes the iteration i and a QueryType (e.g., counting or listing) as input and outputs the actually matched checks. We first compute the union candidates A and B before execution (line 1), and we perform edge checks within these two candidates without duplicated results (lines 2-11). We store the matching results in CheckList[u] and the number of edge checks in CheckCount[u] for the counting task. On lines 12-21, we enumerate the vertices in each A_j and store parents and frequencies (for the counting-only task) in FreqList[u] and FreqCount[u]. The upper bound of edge checks is estimated on lines 20-21. Finally, we enumerate each A_j and the complement set of B_j to remove false results (lines 22-31).
Fast Set Union Operation. A straightforward approach to obtaining A and B is to actually perform union operations among the m sets, which is computationally complex and incurs high overhead. In fact, we can find opportunities to compute the unions from the perspective of the P and S set formulas. The m candidate sets on the i-th iteration, corresponding to m partial instances, actually share the same parent instance, i.e., {u_0, u_1, ..., u_{i−2}}, but differ in u_{i−1}. For any two sets X and Y, X − Y ⊆ X and X ∩ Y ⊆ X. Therefore, we can deduce that the union of the candidate sets is contained in P(v_{i−1}), which can be computed and preserved before the i-th iteration executes.
Example. Consider the fused matching algorithm working on the pattern in Figure 1 (b) with the data graph in Figure 1 (a); Figure 4 describes the concrete steps.
In Step 1 (global matching), fused matching enumerates ⟨v_i, v_j⟩ pairs with id(v_j) < id(v_i), over-counting ⟨6, 1⟩, ⟨6, 2⟩, and ⟨6, 4⟩. Step 2: Instance recovery (marked as ②). By traversing all the vertices that appear in each A_j, we obtain the frequencies of vertices {1, 2, 3, 4} as {2, 2, 2, 2}. We then infer that each edge check associated with vertex 1 appears twice, since FreqCount[1] = 2, i.e., FreqCount(⟨1, 5⟩) = FreqCount(⟨1, 6⟩) = FreqCount(⟨1, 7⟩) = 2. Based on FreqCount[u_i] and CheckCount[u_i] for u_i ∈ {1, 2, 3}, we can estimate the number of possible checks, which may be an over-estimate. For example, the estimated total associated with vertex 1 is 2 × 3 = 6, but the actual number is 5 (⟨1, 6⟩ is counted once too often).
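The three steps can be condensed into a small counting sketch. The CheckList/FreqCount structures here are simplified stand-ins for those in Algorithm 1, and the code is our illustration rather than the system's implementation:

```python
from collections import Counter

def naive_count(loop_ops, adj):
    """Baseline: every loop operation performs its own edge checks."""
    return sum(1 for A, B in loop_ops for x in A for y in B if y in adj[x])

def fused_count(loop_ops, adj):
    # Step 1: global matching -- check each union pair exactly once.
    A = set().union(*(a for a, _ in loop_ops))
    B = set().union(*(b for _, b in loop_ops))
    check = {x: {y for y in B if y in adj[x]} for x in A}   # CheckList analogue
    # Step 2: instance recovery -- frequency-based over-estimate.
    freq = Counter(x for a, _ in loop_ops for x in a)       # FreqCount analogue
    est = sum(freq[x] * len(check[x]) for x in A)
    # Step 3: correction -- subtract spurious checks from A_j x (B - B_j).
    over = sum(len(check[x] - b) for a, b in loop_ops for x in a)
    return est - over
```

Both routines return the same count, but the fused version performs each distinct edge check only once, deferring per-instance attribution to the cheap recovery and correction passes.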

Dynamic Matching
Fine-Grained Redundancy Index. Traditional matching traverses all loop operations l and performs set operations one by one within each iteration; we call this non-fused matching. For each l_j, we perform set operations, e.g., A_j ⊙ N(u), |B_j| times. The total number of edge checks is Σ_{j=0}^{m−1} |A_j| · |B_j|. Such a process generates repeated edge checks: e.g., x_0 ∈ A_j ∩ A_k and y_1 ∈ B_j ∩ B_k cause ⟨x_0, y_1⟩ to be checked twice. Naturally, we use the ratio of repeated edge checks to the total as the fine-grained redundancy index, defined in Equation 2:

FR_i = (Σ_{j=0}^{m−1} |A_j| · |B_j| − |⋃_{j=0}^{m−1} (A_j × B_j)|) / (Σ_{j=0}^{m−1} |A_j| · |B_j|)    (2)

The fine-grained redundancy is primarily determined by the number of common vertices among different candidate sets. The fused matching algorithm introduces a union operation to merge multiple candidate sets into one, thus avoiding repeated edge checks. Given the m loop operations on the i-th iteration, the fused matching algorithm performs a number of edge checks that is independent of FR_i and m. The higher FR_i is, the more repeated edge checks there are, and the more efficient fused matching is expected to be. However, fused matching is not always advantageous because FR_i changes at runtime: when FR_i is low, there are few duplicated edge checks, and fused matching becomes inefficient.
FR Influencing Factors. For the best overall performance of GPM, we need to analyze the influencing factors of fine-grained redundancy. We simplify the problem using a random graph of n vertices in which any pair of vertices are neighbors with probability p. The expected sizes of A ∩ N(u) and A − N(u) are then |A| · p and |A| · (1 − p). That means subtraction results stay close to the original A due to the low value of p. Consider A_j and A_k generated respectively from A by the set operation A ⊙ N(u). Two candidate sets generated by subtraction tend to share common vertices, since they stay closer to A than those generated by intersection. As shown in Figure 5 (c), when finding the pattern of Figure 1 (b), the candidate sets with low FR_i produced by set intersections on iteration 2 lead to poor performance of fused matching, while fused matching achieves significant gains on iteration 3, whose sets are generated by subtraction. These redundancies can be amplified in a nested loop for various patterns; fused matching performs better on sparse patterns than on dense patterns, e.g., the iterations of the dense pattern 4-clique are all generated by set intersections, as shown in Figure 5 (a). From these observations, we empirically choose a tuning parameter α (α > 1 for subtraction, α < 0.5 for intersection) to control the algorithm selection. On the other hand, considering the impact of the data graph, the dense graph MI, which corresponds to a high p, may hurt the performance of fused matching. Figure 5 (b) shows that fused matching does not perform as well as traditional matching when matching a sparse pattern on the dense graph MI. So we take d = |E|/|V| as another parameter representing the sparsity of the data graph. Additionally, it is worth mentioning that altering the matching order does not completely eliminate fine-grained redundancy, because fine-grained redundancy occurs among multiple set operations within one iteration. Although the matching order may influence the FR_i value, our fused matching is independent of the choice of matching order.
Lightweight Runtime Model. Computing the exact fine-grained redundancy index incurs non-negligible overhead at runtime. We can approximate the redundancy index FR by the complexity mentioned in Algorithm 1 with the parameters α and p. We use Σ_{j=0}^{m−1} |A_j| · |B_j| as the traditional matching workload, denoted W_T; for the fused matching workload W_F, we take the product of the union candidate sizes, which can be obtained with low overhead. To make up for the insufficiency of fused matching at low fine-grained redundancy, we use a linear model W_F ≤ W_T · α · (1 − p) to select whether to use fused matching at runtime for diverse datasets, patterns, and iterations.
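The runtime selection can be sketched as a simple comparison of workload estimates. The exact form of the linear rule and the parameter values below are our assumptions, loosely following the paper's linear selection model; they are not the tuned values of the real system:

```python
def choose_matching(loop_ops, alpha, p):
    """Pick fused vs. traditional matching from coarse workload estimates.
    alpha (tuning weight) and p (graph-density proxy) are illustrative
    stand-ins for the runtime-model parameters."""
    A = set().union(*(a for a, _ in loop_ops))
    B = set().union(*(b for _, b in loop_ops))
    w_traditional = sum(len(a) * len(b) for a, b in loop_ops)  # W_T
    w_fused = len(A) * len(B)                                  # W_F estimate
    return "fused" if w_fused <= w_traditional * alpha * (1 - p) else "traditional"
```

Overlapping candidate sets shrink W_F relative to W_T and push the decision toward fused matching; disjoint sets do the opposite, mirroring the FR_i intuition above.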

Implementation And Optimization
GraphFold enhances the parallelism of graph pattern mining by improving the workload balance among GPU threads. In detail, graph pattern mining involves extensive binary-search-based edge checks [7], leading to an uneven distribution of tasks among GPU threads and irregular memory access patterns. Leveraging our dynamic matching algorithm, GraphFold merges a significant number of duplicate edge checks, effectively reducing the need for binary searches and achieving a better balance among GPU threads. GraphFold is designed to run on multiple GPUs and incorporates various GPU optimization techniques for set and edge-checking primitives. At the GPU thread level, we propose the GBS edge-checking technique, which utilizes GPU shared memory to enhance parallelism. Additionally, we develop a novel warp-stealing technique to balance the load distribution among GPU warps. In the following sections, we elaborate on GraphFold from two perspectives: the GPU parallel optimizations and the overall GraphFold system design, including user interfaces.
GPU-friendly edge checks. During graph pattern mining, multiple candidate lists are generated with a large number of edge checks to be performed for connectivity. Traditional edge checking is implemented through binary search [7], which ignores the characteristics of the graph data. Real-world graphs often exhibit two important characteristics: sparsity and locality. Sparsity refers to the fact that the average degree of a graph is typically much smaller than the number of vertices, and locality describes the pattern that the IDs of neighboring vertices are often very close. We seize this opportunity to develop a more efficient edge-checking method on GPUs that batches edge checks instead of enumerating them one by one, namely GBS edge checking. Given a sampling scale s, or sampling granularity g = |V|/s, we maintain the block connectivity information of each vertex v as a bitmap bm(v): the k-th bit of the bitmap is 1 if and only if v has a neighbor in block B_k. We apply two-phase checking. We first conduct vertex-block checking between each vertex v_i in A and each block B_k derived from the block list. If the bit for B_k is 0, vertex v_i is disconnected from all vertices in B_k, and the number of disconnected pairs equals |B_k|. If the bit is 1, we load the neighbors of v_i within the range of B_k and conduct vertex-vertex checking one by one. To utilize GPU hardware features, we cache the bitmap in shared memory so that all threads in a warp can quickly check vertex-block connectivity. For collecting results, we use the warp primitive __activemask to indicate whether a match is found, then call __popc to compute the storage index.
Load Balance. GraphFold adopts a dynamic warp mapping strategy, called warp stealing, for single-GPU workload balance. G2Miner statically maps E edges to W warps, ensuring that each warp processes a fixed E/W edges. However, such a warp mapping strategy cannot guarantee load balance between warps due to the large variation in the candidate subgraphs generated by each edge. To alleviate this problem, we maintain a task queue on the GPU, allowing warps to dynamically take edges from the queue for processing instead of following a fixed mapping. Warps that finish their current tasks earlier can thus do more work in a competitive manner. Although this strategy causes the number of edges processed by each warp to differ, it alleviates the load-balancing problem between warps.
System APIs. GraphFold offers a user programming interface similar to G2Miner's and follows the set-centric programming paradigm of AutoMine [19]. Figure 7 illustrates the primary APIs for using GraphFold. We have implemented various popular applications, including clique counting, and crafted multiple programs designed for specific patterns. Additionally, configuration options allow users to specify parameters such as the matching method or the number of GPUs.
System architecture. (i) Single-GPU parallelism: For higher parallelism and less load imbalance, we choose edge parallelism (i.e., mapping each per-edge processing task to a warp) instead of vertex parallelism as the basic parallelism strategy. Each edge processing task is assigned to a GPU warp for efficient expansion, the same as in state-of-the-art GPM systems, e.g., G2Miner [7]. Each warp produces all the candidate subgraphs expanded from one edge, while all the threads in a warp cooperate to execute the set operation on list pairs. (ii) Multi-GPU parallelism: We adopt the same multi-GPU task scheduling strategies as G2Miner, namely Even-Split and Round-Robin. The core idea is to divide the workload evenly or approximately evenly, statically assigning tasks to different GPUs. In this way, GPUs can execute their own tasks independently without communication.
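The two-phase GBS idea can be mimicked on the CPU to show the control flow. The bitmap layout and function names below are our assumptions; the real GBS kernel runs in CUDA shared memory with warp primitives:

```python
def build_block_bitmaps(adj, g):
    """bm[v] has bit k set iff v has a neighbor whose ID is in block
    B_k = [k*g, (k+1)*g)."""
    bm = {}
    for v, nbrs in adj.items():
        bits = 0
        for u in nbrs:
            bits |= 1 << (u // g)
        bm[v] = bits
    return bm

def gbs_edge_check(adj, bm, g, v, u):
    """Phase 1: vertex-block check via the bitmap; on a miss, the whole block
    is known disconnected. Phase 2: exact vertex-vertex check."""
    if not (bm[v] >> (u // g)) & 1:
        return 0              # block miss: skip the exact lookup entirely
    return 1 if u in adj[v] else 0
```

A phase-1 miss rules out up to g candidate pairs with a single bit test, which is where the batching saves work on sparse, locality-friendly graphs.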

Evaluation
We evaluate the performance of GraphFold by answering the following research questions: Q1: How efficient is GraphFold compared to other systems? Q2: How well does GraphFold scale on multi-GPU platforms? Q3: Can GraphFold work with different matching orders? Q4: How effective is GraphFold's fused matching? Q5: How effective is GraphFold's dynamic matching? Q6: How does GraphFold achieve its performance?
We compiled all the GPU programs using NVIDIA's nvcc compiler (version 10.1.0) with the -O3 flag, and all experiments of the tested systems passed verification, producing the same results as a single-threaded CPU-based standard implementation. To be fair, the measured results in all experiments exclude I/O and output time, and we have tried our best to achieve the best performance of all tested systems.
Baseline. GraphFold is compared with several state-of-the-art programmable GPU-based graph processing frameworks/libraries, i.e., PEREGRINE [15], Pangolin [8], CuTS [27], and G2Miner [7]. According to the latest results in [7], G2Miner is available and delivers relatively better performance than the others owing to its continuous evolution, so we mainly compare GraphFold with G2Miner. We evaluate GraphFold on a wide range of applications: triangle counting (TC), k-motif counting (k-MC), and k-clique counting (k-CC); and we select 8 typical query patterns for subgraph matching (SM) to cover the test cases used in [7] and [15].

Main Results
Exp-1: Overall performance. To answer Q1, we compare GraphFold with four state-of-the-art GPM systems: PEREGRINE, Pangolin, CuTS, and G2Miner. Since G2Miner is also a GPU-based GPM system and delivers consistently better performance, we report the runtime of individual patterns for GraphFold and G2Miner to highlight GraphFold's efficiency.
Comparison with G2Miner. Table 2 shows the performance of G2Miner and GraphFold on various query patterns and datasets. For a fair comparison, GraphFold adopts exactly the same cost model as G2Miner to determine the matching order before applying our proposed optimization for reducing fine-grained redundancy. We find that: (i) GraphFold performs much better than G2Miner in most cases, with an average speedup of 18.7× and a maximum speedup of 305× (GW on P2). Additionally, GraphFold runs 4.6× faster on a simple pattern (P4) and 23.5× faster on a complex pattern (P7). (ii) As the pattern gets more complex, the runtime increases, but the increase for GraphFold is smaller than for G2Miner: from P4 to P7, G2Miner's runtime increases 139×, while GraphFold's increases only 56.2×. This is because for more complex patterns, the number of false candidates grows much faster than the number of true candidates, and GraphFold avoids these futile edge checks. (iii) For some large graphs or complex patterns (LJ on P2 and P3), G2Miner failed to complete the task, while GraphFold produced results. GraphFold outperforms G2Miner because it performs fewer edge checks: the fused matching algorithm calculates the number of instances with a correction-after-estimation method, skipping redundant binary searches and thus saving precious memory bandwidth.

Exp-2: Scalability. To answer Q2, our results, shown in Figure 9, reveal the following observations: (i) With 8 GPUs, the average speedup of G2Miner was only 3.2×, while GraphFold achieved an average speedup of 6.6×. Moreover, on some complex patterns such as P7, adding more GPUs brought no performance gain to G2Miner and even caused performance drops. (ii) As we increased the number of GPUs, the speedup of GraphFold over G2Miner grew. This is because GraphFold has better load balance: it utilizes balancing algorithms that assign tasks to
each GPU with a fine-grained strategy, which narrows the load gap between different GPUs. In contrast, hub vertices can cause severe load imbalance and introduce many false candidates in G2Miner. As a result, GraphFold delivers a better degree of scalability, making it an effective solution for graph pattern mining applications on large graphs with more GPUs.

Exp-3: Matching order. To answer Q3, Figure 10 (b) demonstrates that our method consistently improves performance under different orders when identifying pattern P5. Regardless of whether matching order o1 or o2 is employed, fine-grained redundancy arises during the GPM process. Compared with G2Miner, GraphFold achieves 9.68× and 2.60× speedups on matching orders o1 and o2, respectively. This means our dynamic matching works well irrespective of the chosen matching order. Both G2Miner and GraphFold perform worse under matching order o2 than under o1 across these datasets, with an average 3.3× performance drop, because the search space of the GPM process is smaller under o1. To maximize performance, we first select the optimal matching order and then apply our dynamic matching technique to address fine-grained redundancy.
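The dynamic matching mentioned above chooses between the fused and non-fused paths at runtime. A minimal sketch of how such a dispatch might look is given below; the redundancy metric and the 0.5 threshold are illustrative assumptions, not GraphFold's actual cost model.

```cpp
#include <cassert>

// Hypothetical dispatch: when the estimated fraction of duplicated edge
// checks is high, the fused (estimate-then-correct) path is cheaper;
// otherwise fall back to plain per-vertex edge checks.
enum class MatchKind { Fused, NonFused };

MatchKind choose_matching(int duplicated_checks, int total_checks,
                          double threshold = 0.5) {
    // Fraction of edge checks that would be repeated across set
    // operations; high values favor fusing them into one pass.
    double redundancy =
        total_checks ? static_cast<double>(duplicated_checks) / total_checks
                     : 0.0;
    return redundancy >= threshold ? MatchKind::Fused : MatchKind::NonFused;
}
```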

Micro benchmarks
Exp-4: Fused matching algorithm. To answer Q4, we show the effectiveness and runtime breakdown of the fused matching algorithm.

Effectiveness. As shown in Figure 11 (a), varying data graphs and query patterns, we observe that fused matching reduces edge-check time by nearly 80% in all cases, and by more than 90% in most cases. In the best case, it reduces edge checks by 99.9%, since numerous false candidates are avoided. Fused matching can shrink the number of edge checks because, when the fine-grained redundancy index is high, the candidate set is in many cases nearly equal to the original neighbor set or to the empty set, leaving much less work for the correction step than counting candidates one by one.

Runtime breakdown. As shown in Figure 11 (b), instance recovery takes less time because it only needs to scan the candidate list once and only loads/stores data in shared memory. In most cases, error correction takes a significant part of the runtime because it needs to load an adjacency list for binary searching, which consumes valuable memory bandwidth. The time spent on global matching mainly goes to calculating the statistical information of a vertex's adjacency list, e.g., counting the number of disconnected edge checks. This part does not depend on the query pattern and thus can be preprocessed to further improve performance.

Exp-5: Dynamic matching algorithm. To answer Q5, we show the effectiveness of dynamic matching below. As shown in Table 6 (a), varying data graphs and query patterns, we observe that dynamic matching outperforms direct use of either non-fused or fused matching in all cases; compared to directly applying fused matching, dynamic matching runs 1.5× faster on average. In most cases, directly applying fused matching is superior to non-fused matching, providing a 7.3× speedup on average. In some cases, the performance of fused matching on graph Mi is lower than
non-fused matching because it exhibits lower fine-grained redundancy when processing sparse patterns on dense graph Mi, which incurs the correction phase of fused matching algorithm taking more time to complete.Dynamic matching can make up for the poor performance of the fused matching under low redundancy by employing non-fused matching instead.Exp-6: Incremental speedups.To answer Q6, we illustrate the incremental speedups on five graphs (CP, LJ, YO, PO, and LF) with 4-MC in Figure 11 (c).To investigate the individual contribution of each optimization, we disable three main optimizations of GraphFold, then add optimizations one by one.Base means the version without any optimization, Load-balance means we apply our load balance strategy (Section 6), Hybrid RM is the dynamic matching algorithm(Section 4), and GPU opt is the GPU friendly kernel added to accelerate edge checks (Section 6).We find that our load-balance strategy can gain a 15% performance benefit on average.Our fused matching algorithm and GPU opt can improve performance by 2.3 × and 1.5 × on average, respectively.We also observed that the effectiveness of fused matching and GPU opt became better in larger graphs (e.g., LJ).Overall, the high performance of GraphFold mainly comes from our fused matching and GBS algorithms, which significantly reduce the number of false candidates.Moreover, fused matching does not rely on special hardware of GPU or massive parallelism, which means this algorithm is generic and can also be applied to CPU-based GPM systems.

Related Work
We summarize the related work to clarify common features with other work and to highlight GraphFold's superiority in implementing highly efficient graph mining.

GPM systems. Many works have developed high-performance graph pattern mining frameworks on GPU and CPU platforms. Arabesque [25] is the first distributed GPM system and proposes the "think like an instance" abstraction. It adopts BFS to generate partial instances incrementally, which leads to high memory overhead; it therefore designs the ODAG data structure to store instances, at the cost of introducing more false candidates. RStream [26] is a single-machine graph mining system that streams edges for out-of-core graph pattern mining with relational algebra; it also materializes partial results, which demands tremendous memory. Peregrine [15] processes large graphs on a single node by directly computing the final results to save memory. It is a pattern-aware graph mining system that directly explores the subgraphs of interest while avoiding the exploration of unnecessary subgraphs, and it simultaneously bypasses expensive computations throughout the mining process. GraphMineSuite [4] proposes the first benchmark for GPM systems and mentions set union for clique-counting expressions. Pangolin [8] first accelerates graph mining applications with GPUs and proposes an "extend-reduce-filter" abstraction. G2Miner [7] automatically selects the optimal kernels and generates code for high-performance applications.
Redundancy. Many graph pattern mining systems introduce methods to reduce redundancy. AutoMine [19] formalizes graph pattern mining problems as set problems, so high-level GPM algorithms can be implemented with set operations like intersection and subtraction. It then generates an optimal query plan based on relational algebra and changes the processing order to prune false candidates at an early stage while reusing intermediate results. CECI [5] adopts BFS-based filtering and reverse-BFS-based refinement to prune unpromising candidates early on, leverages set intersection to accelerate edge checking and prune false candidates, and uses search-cardinality-based cost estimation to detect and divide large instance clusters in advance. EmptyHeaded [1] searches for the best possible execution plans for GPM. CFL-Match [6] decomposes a query pattern into three substructures with core-forest-leaf decomposition and then conducts matching in a substructure-by-substructure manner. GraphPi [24] utilizes a new algorithm based on 2-cycles in group theory to generate multiple sets of asymmetric restrictions, each of which can eliminate false candidates completely; it then designs a performance model to determine the optimal matching order and asymmetric restriction set for efficient processing.
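As a concrete instance of the set-centric formulation these systems share, triangle counting reduces to a chain of set intersections: the triangles closed by an edge (u, v) are exactly N(u) ∩ N(v). A minimal CPU sketch of this formulation follows; it is an illustration of the abstraction, not any particular system's implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

using AdjList = std::vector<std::vector<int>>;

// Count triangles by summing |N(u) ∩ N(v)| over every undirected edge
// (u, v); each triangle is found once per edge, hence the division by 3.
// Adjacency lists must be sorted for std::set_intersection.
std::size_t count_triangles(const AdjList& adj) {
    std::size_t total = 0;
    for (int u = 0; u < static_cast<int>(adj.size()); ++u)
        for (int v : adj[u])
            if (u < v) {  // visit each undirected edge once
                std::vector<int> common;
                std::set_intersection(adj[u].begin(), adj[u].end(),
                                      adj[v].begin(), adj[v].end(),
                                      std::back_inserter(common));
                total += common.size();
            }
    return total / 3;
}
```

Every element comparison inside the intersection is an edge check; the fine-grained redundancy this paper targets arises when different intersections repeat the same vertex-pair comparisons.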

Conclusion
GPM applications are constrained by their memory-intensive nature, mainly due to extensive edge checks. The "set-centric" abstraction offers powerful expressive capabilities and represents GPM tasks using set intersection and subtraction operations. However, existing algorithms based on relational algebra overlook the fine-grained duplicated edge checks arising from the data graph's inherent locality. To address this, we present GraphFold, the first multi-GPU graph pattern mining (GPM) system designed to solve the fine-grained redundancy problem. By introducing the set union operation into the set-centric abstraction, we fuse duplicated edge checks into one, maintaining the expressive power and previous optimizations while significantly reducing fine-grained redundancy in GPM tasks. GraphFold achieves remarkable speedups on a V100 GPU cluster, demonstrating up to 305× faster performance than the state-of-the-art GPM system G2Miner. This advancement significantly improves the efficiency of GPM algorithms and opens new possibilities for the "set-centric" approach in big data applications.
Figure 2. The ubiquity of fine-grained redundancy among various patterns and datasets.


Figure 5. The runtime of traditional and fused matching varying patterns, data graphs, and iterations.

Figure 7. GraphFold programming interface.

GBS edge checking. Edge checking is implemented through binary search [7], which ignores the characteristics of the graph data. Real-world graphs often exhibit two important characteristics: sparsity and locality. 'Sparsity' refers to the fact that the average degree of a graph is typically much smaller than the number of vertices, and 'locality' describes the pattern that the labels or IDs of neighboring vertices are often very close. We seize this opportunity to develop a more efficient edge-check method on GPUs that batches edge checks instead of enumerating them one by one, namely GBS edge checking. Given a sampling scale s, the sampling granularity is g = |V|/s, and we maintain the block connectivity information of each vertex v as a bitmap B(v): the i-th bit of B(v) is 1 if v has a neighbor in block b_i. We apply two-phase checking. We first conduct vertex-block checking between the candidate vertices and each block b_i derived from the block list. If the i-th bit of B(v) is 0, vertex v is disconnected from all vertices in b_i, and the number of disconnected pairs equals |b_i|. If the bit is 1, we load N(v) in the b_i range and conduct vertex-vertex checking one by one. To utilize GPU hardware features, we cache the bitmap in shared memory so that all threads in a warp can quickly check vertex-block connectivity. For collecting results, we use the warp primitive __activemask to indicate whether a match is found and then call __popc to compute the storage index.

Load balance. GraphFold adopts a dynamic warp mapping strategy, called warp-stealing, for single-GPU workload balance. G2Miner statically maps E edges to W warps, ensuring that each warp processes a fixed E/W edges. However, such a warp mapping strategy cannot guarantee load balance between warps due to the large difference in the candidate subgraphs
generated by each edge. To alleviate this problem, we maintain a task queue on the GPU, allowing warps to dynamically take edges from the queue for processing instead of using a fixed mapping. Warps that finish their current tasks earlier can thus do more work in a competitive manner. Although this warp mapping strategy makes the number of edges processed by each warp differ, it alleviates the load-balance problem between warps.

System APIs. GraphFold provides a user programming interface similar to G2Miner's and follows the set-centric programming paradigm of AutoMine [19]. Figure 7 illustrates the programming interface.

Figure 8. Evaluated query patterns in this paper.

Figure 11. GraphFold microbenchmarks.

Table 1. Representative graphs for benchmarking.

Table 2. The runtime of GraphFold vs. G2Miner. Runtime (s) [Lower is better].

Table 3. GraphFold vs. other GPM systems on TC.

Table 4. GraphFold vs. other GPM systems on 4-MC. Runtime (s) of 4-Motif Counting [Lower is better]. *CuTS does not provide an implementation of 4-MC.

Table 5. GraphFold vs. other GPM systems on 5-CL. Runtime (s) of 5-Clique Counting [Lower is better].

The reason is that complex patterns often generate more false candidates, which can be filtered out by our fused matching algorithms.

Comparison with the hand-tuned GPM algorithm.

Table 6. Efficiency of dynamic matching.