Differentiating Set Intersections in Maximal Clique Enumeration by Function and Subproblem Size

Clique finding problems such as maximal clique enumeration are computationally challenging due to the combinatorial number of vertex sets that need to be evaluated. Despite prior advances, the majority of execution time is still spent in set intersections, which limits the speed of the algorithms and their application to increasingly larger graphs. To address this issue, we propose to differentiate the set intersection algorithm according to its function and the subproblem size, leading to several innovations: (i) a new primitive “intersect-size-exceeds” which short-cuts the calculation of the size of an intersection if it does not exceed a pre-determined threshold; (ii) a new data structure that supports efficient hash-based intersection on constantly evolving sets; and (iii) efficient specialized variants of clique search and intersection for “small” and “tiny” subproblems. Compared to the state-of-the-art maximal clique search algorithms by Blanusa et al (VLDB 2020) and Besta et al (MICRO 2021), we observe median speedups of 4.14× and 10.91× on AMD Zen version 2, and median speedups of 3.44× and 8.26× on Intel Sapphire Rapids, over 14 challenging graph datasets.


INTRODUCTION
The field of graph mining includes many challenging problems, many of which are NP-hard [16,27]. Execution times can grow dramatically and can become intractable in practice for hard problem instances. Set intersections are a common operation across many graph mining algorithms [7] and are often responsible for the majority of the execution time. As such, they are critical to increasing the performance of graph mining. This work explores optimizations for set intersections in the context of the Maximal Clique Enumeration (MCE) problem, which is prototypical for NP-hard problems.
The Bron-Kerbosch algorithm [12] is the main MCE algorithm. Based on back-tracking search, it recursively grows cliques by selecting vertices one by one from a candidate set. The candidate set is updated on each step to include only neighbors of the currently constructed clique. This is achieved through set intersection, by intersecting the set of candidate vertices with the set of neighbors of the selected vertex. The majority of the execution time is taken by set intersections like those between the candidate set and the adjacency list of a vertex.
As the set intersection algorithm plays such a dominant role, it has been thoroughly studied and optimized. A variety of set data structures, such as arrays of elements, bitsets and compressed bitsets, and hash sets, and their corresponding intersection algorithms have been evaluated. The efficiency of each algorithm varies strongly depending on the difference in size of the intersected sets, and the sparsity pattern of bitsets and compressed bitsets [11,14,33,39,43].
Another approach to optimize MCE is to reduce the number of set intersections. Pivoting filters out certain candidate vertices based on a "pivot" vertex [12,47], which reduces the number of recursive calls of the Bron-Kerbosch algorithm. Determining the pivot requires many intersection operations itself, but has the potential to reduce the number of future intersections as there will be fewer recursive calls. Another common optimization is to cut out induced subgraphs of subproblems [22] encountered during the recursive search. The cut-out subgraph only contains candidate vertices, and their neighbor lists are shorter as only the neighbors that are also candidates are retained. Cutting out an induced subgraph again requires more set intersections, but has the potential of speeding up future intersections.
To date, research on MCE has assumed uniform treatment of sets across all functions: candidate intersection, pivoting and cutting out subgraphs. Moreover, there is a very large dynamic range in set sizes, which requires solving intersections on pairs of large sets, as well as pairs of small sets and pairs of sets with highly imbalanced sizes. Prior work has sought to solve these with the same algorithms. This paper takes a different stance and proposes to optimize the intersection algorithm with respect to (i) the function of the intersection within the algorithm, and (ii) the problem size. Our key contributions are to: (1) Propose a new efficient set operation that accelerates the critical pivoting optimization, targeting large subproblems and leveraging large differences in intersected set sizes. (2) Design a novel hashed data structure that supports efficient hash intersections on fast-changing sets.
(3) Apply a radically different optimization approach for small subproblems using bitset intersection, leveraging hardware support for small bitsets through vector ISAs [29]. While bitsets and compressed bitsets have been evaluated for clique problems [3,8,40,41,49], performance increases most when constraining the problem size to fit in hardware registers. (4) Leverage cheap intersections in small subproblems to analyze the constructed sets and reduce recursive calls. The same optimizations have a disadvantageous cost/benefit trade-off for large subproblems. (5) Optimize tiny subproblems by avoiding all recursion and set intersection, which provides dramatic speedups for graphs dominated by low-degree vertices.
Experimental evaluation of our proposed algorithm on 14 challenging graph data sets and 2 processor architectures demonstrates a median speedup of ∼4×, and a maximum speedup of over 30×, compared to Blanusa's state-of-the-art algorithm [9]. A median speedup of ∼8× and a maximum of over 30× are achieved compared to Besta et al [8].
The remainder of this paper is organized as follows: Section 2 presents the Bron-Kerbosch algorithm and characterizes its key performance issues. Section 3 presents novel solutions for MCE, and we present the results of our experimental evaluation in Section 4. Section 5 discusses further related work and Section 6 concludes the paper.

BACKGROUND
Let G = (V, E) be a graph where V is a set of vertices and E ⊆ V × V is a set of edges. We consider simple undirected graphs, where (u, v) ∈ E iff (v, u) ∈ E.

Maximal Clique Enumeration
The Bron-Kerbosch algorithm [12] is a recursive search algorithm (Algorithms 1 and 2). Three sets are maintained in each recursive call: a partially constructed clique R, a set of candidate vertices P, and a set of excluded vertices X. During each recursive call, the members of P are tried for addition to R one by one. They are then moved from P to X. For each tried candidate, a recursive call is made with R including the tried candidate, and with all non-neighbors of the candidate removed from the sets P and X. The recursion stops when P is empty, and a unique maximal clique is found when X is also empty. The clique is then given by R. The algorithm has been shown to visit each maximal clique once [22].
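The recursion just described can be sketched as follows. This is our own minimal illustration, without pivoting or degeneracy ordering, using sorted vectors and names of our choosing:

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

using Set = std::vector<int>;  // kept sorted

// Sorted-set intersection helper.
static Set inter(const Set& a, const Set& b) {
    Set out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(out));
    return out;
}

// Minimal Bron-Kerbosch: grow clique R from candidates P, with X holding
// already-tried vertices; report R when both P and X are empty.
void bron_kerbosch(Set R, Set P, Set X,
                   const std::vector<Set>& adj,  // sorted adjacency lists
                   std::vector<Set>& cliques) {
    if (P.empty() && X.empty()) { cliques.push_back(R); return; }
    while (!P.empty()) {
        int v = P.front();
        R.push_back(v);
        // Recurse with all non-neighbors of v removed from P and X.
        bron_kerbosch(R, inter(P, adj[v]), inter(X, adj[v]), adj, cliques);
        R.pop_back();
        P.erase(P.begin());                                    // move v from P ...
        X.insert(std::lower_bound(X.begin(), X.end(), v), v);  // ... to X
    }
}
```

Note that both intersections in the recursive call are against the adjacency list of the tried candidate, which is where the execution time concentrates.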
The Bron-Kerbosch algorithm distinguishes between a vertex's left neighborhood (those neighbors whose numeric ID is smaller than its own) and the right neighborhood (higher-numbered neighbors). The X set is comprised of left neighbors, while the P set is comprised of right neighbors. This avoids redundant enumeration of maximal cliques that are permutations of the same list of vertices.
Pivoting reduces the search process based on co-occurrence of neighbouring vertices in maximal cliques [12]. The observation is that every maximal clique must contain either the pivot vertex or one of its non-neighbours, so only candidates outside the pivot's neighbourhood need to be tried in each call.
Efficiency can be improved by sorting vertices by increasing degeneracy [22]. This limits the size of the right neighborhood of a vertex (and thus P) to the degeneracy of the vertex, while the left neighbourhood (and thus X) can be as large as the difference between degree and degeneracy of the vertex. Degeneracy [35] is tightly linked with cliques: a vertex with degeneracy d can be a member of at most a (d + 1)-clique. Degeneracy ordering requires a one-time cost to (i) calculate the degeneracy of each vertex, and (ii) relabel the graph such that vertices are labeled in non-decreasing degeneracy order. Both steps are in practice much faster than MCE itself, especially for hard problem instances of MCE. Whereas some authors assume sequential algorithms for degeneracy [9,50], parallel computation of degeneracy is well documented [6,20,23].
A further optimization is to create graph cutouts in each recursive call. Each call only analyzes vertices in P ∪ X and the edges between them. As we progress to deeper levels of recursive calls, P and X become smaller, hence the cutout graphs become smaller too. This reduces the length of adjacency lists and accelerates some set intersection algorithms. The cutout need not contain edges between pairs of X vertices as these edges would never be queried [18].

Set Intersections
The execution time of MCE is dominated by set intersection. Algorithms for set intersection assume that elements of the sets are numbered from 0 to u − 1, where u is the size of the universe of elements we might place in a set.
Merge intersection represents sets as sorted arrays. It advances an index for each array based on the relative order of the pointed-to array values, identifying common elements in the process. It requires O(m + n) steps to intersect two sets of sizes m and n. Galloping search [5] improves the time complexity to O(m log(n/m)) [4]. Zheng et al extend this notion by repeatedly reducing the possible intersection range and applying merge intersection on shorter ranges [57]. Vectorized merge intersection reduces branch mispredictions [28].
Hash intersection represents the larger set as a hash set and looks up each element of the smaller set in the hash set. Hash intersection completes in O(min(m, n)) steps, assuming that the hashed representation can be reused and that hash lookups complete in O(1) steps (amortized).
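As a concrete illustration (our own sketch, not code from the literature), the hashed representation of the larger set is built once and then probed with each element of the smaller set, giving the O(min(m, n)) behaviour per intersection:

```cpp
#include <cassert>
#include <unordered_set>
#include <vector>

// Hash intersection: probe a prebuilt hash set of the larger input with
// each element of the smaller input. With O(1) amortized lookups, each
// intersection costs O(min(m, n)) once the hash set exists and is reused.
std::vector<int> hash_intersect(const std::vector<int>& small,
                                const std::unordered_set<int>& large_hashed) {
    std::vector<int> out;
    for (int v : small)
        if (large_hashed.count(v)) out.push_back(v);  // element in both sets
    return out;
}
```

The one-time cost of building the hash set is linear in the larger set's size, which is why reuse across many intersections is essential.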
Vertical vectorization of hash intersection looks up a distinct value in each lane of the vector [43]. Horizontal vectorization searches a single key in multiple buckets using vectorized match-any [11,13,26,39]. We find vertical hashing to perform best for MCE as the algorithm performs a large number of independent hash lookups.
Bitset intersection represents sets as strings of u bits and performs a bitwise "AND" on the bit strings in O(u) steps. These approaches are, however, neither straightforward nor generic. If all adjacency lists are represented this way, as assumed by some authors [42,49], effectively a dense binary adjacency matrix is stored. This is not scalable to large sparse graphs and results in O(|V|) steps per intersection, which is worse than merge and hash intersection. Sparse or compressed bitsets can be used [14,15,33,41]. However, such representations are variable-length and require variable time to operate on [33]. Setting a bit, or retrieving the position of the next set bit, become non-trivial operations.
Hash intersections [9] and compressed bitsets [8] have been shown to provide state-of-the-art performance for the MCE problem, although no direct comparison between the two is available in the literature.
The set intersection algorithms interact with other performance optimizations in different ways. For instance, cutting out subgraphs accelerates merge intersection as adjacency lists become shorter [22]. Indeed, the time complexity of merge intersection is linear in the size of the adjacency list. However, it does not improve performance for hash intersection [9]. When the adjacency list is hashed, the time complexity depends on the length of the candidate list and is not reduced by cutting out a subgraph.

Challenges
We analyze Blanusa's many-core algorithm for maximal clique enumeration [9] to gain insight into key open challenges. This algorithm is the best-performing existing MCE algorithm for shared-memory parallelism that we identified. The sets intersected during pivoting are themselves not needed; they only serve to identify the vertex with the largest such set. As such, it is common to use an optimized operation "intersect-size" that calculates the size of an intersection without materialising the intersected set. We see a further optimization opportunity: besides the set contents, we also do not need to know the exact set size, in particular in those cases where the intersection is small. We address this with a new primitive operation, "intersect-size-exceeds", that can short-cut the computation early.

Hash Intersection On Non-Constant Sets.
Hash-based intersection dictates that the larger of the two sets should be stored in a hash set. If not, the time complexity is O(max(m, n)) instead of the more favorable O(min(m, n)). In the Bron-Kerbosch algorithm, intersections are between an adjacency list and P or X. Figure 2 shows the fraction of intersections where P and X are the larger sets, and their average length, in the case of the topcats and wormnet graphs.
The size of the X set is bounded by the maximum degree of a vertex and can be very large in graphs with skewed degree distributions. However, very often it contains at most one element. We observed that X sets tend to be very large during the first few recursive calls, after which they quickly reduce in size due to the intersections. While the X set is not frequently larger than an adjacency list, the size difference is significant when X is larger.
The size of the P set is bounded more strictly, by the degeneracy of the graph, as it strictly contains right-neighbours of the vertices in R. Nonetheless, P sets contain the vertices still to be processed and tend to be larger than X sets on average (Figure 2 (right)). P sets are, however, always smaller than adjacency lists during candidate intersections, as they are intersected with adjacency lists of their left-neighbours.
P sets used during pivoting are on average larger than those used during intersections in the main recursive call (Figure 2 (right)). Here, P is also intersected with the adjacency lists of right-neighbours, which tend to be smaller. Moreover, the number of intersections performed during pivoting is proportional to the size of the P set. As such, larger P sets are used more frequently in intersections, and it is important to perform intersections where P is hashed.
The P and X sets are, however, constantly changing, which precludes creating hash sets in support of intersections, and Blanusa et al always hash the adjacency list. This choice is, however, suboptimal as illustrated here. We require a bespoke data structure to store the P and X sets that does not require re-hashing as the sets are modified.

Most (But Not All) Sets Are Small.
Set intersection operations are performed on lists with varying lengths (Figure 2 (right)). Note that while adjacency lists are usually larger than the P and X sets, the relevant part of the adjacency list adj(v) is constrained to adj(v) ∩ (P ∪ X) and is thus not larger than the combined size of P and X. This relevant part can be identified by cutting out the induced graph of P ∪ X on a recursive call [22]. As such, it suffices to represent sets not larger than |P ∪ X|.
As the absolute size of sets used in intersection is small, significant acceleration is possible by representing sets as bitsets and using hardware instruction set extensions for Single Instruction Multiple Data (SIMD) parallelism. These hardware extensions provide native logical and arithmetic operations on long sequences of bits, e.g., 128 bits in SSE, 256 bits in AVX and AVX2, 512 bits in AVX-512 [29], and up to 2048 bits in ARM SVE [45]. Operations such as bitwise logical operations, which enact set operations, often complete in a single processor cycle.
Representing adjacency lists as bitsets effectively implies storing the adjacency structure as a dense binary matrix. For most graph datasets, P and X sets almost always fit in 64 bits (Figure 2); however, sets can be substantially larger. The construction of the matrices requires a graph cutout, which itself introduces performance overhead [9]. As such, a carefully designed approach is required that amplifies the benefits of SIMD instruction sets without introducing problematic time or space overhead.

METHOD
Given the large breadth of subproblem sizes encountered during MCE, we optimize MCE search for the specific challenges of subproblems at different scales.

Optimizations for Large Subproblems
Large subproblems are those solved with sets represented as variable-length arrays and hash sets.We introduce two optimizations for this regime.
3.1.1 Pivoting: Intersect-Size-Exceeds. Pivoting searches for the vertex that has the largest number of neighbours in common with the candidate list. As such, a running "best vertex" is maintained as all vertices are tried, and we are only interested in vertices that improve over the running best. To accelerate pivoting, we devise an intersection algorithm that bails out early in case the current vertex cannot improve over the running best vertex. This algorithm, intersect-size-exceeds(A, B, τ), returns the size of the intersection of sets A and B; however, if |A ∩ B| ≤ τ, it is allowed to return −1.
The key insight is that it is easier to detect during the intersection operation that |A \ B| reaches |A| − τ as we evaluate each element of A, than it is to determine that |A ∩ B| > τ. Consider the sets A = {1, 3, 5, 6, 8, 9, 13} and B = {3, 4, 7, 8, 9, 10, 12, 14} and threshold τ = 4. The hash intersection tests individual elements of A for presence in B. After processing 1, 3, 5, 6, just one element (3) is in the intersection and 3 elements are not. As such, |A \ B| ≥ 3 and therefore |A ∩ B| ≤ |A| − 3 = 4, which implies the threshold τ cannot be exceeded. By counting the elements not included in the intersection, we conclude that |A ∩ B| ≤ τ by testing only 4 elements of A.
Algorithm IntersectSizeExceeds (Algorithm 4) demonstrates this idea for a scalar hash intersection, where set B is hashed. The maximum number of missing elements allowed is |A| − τ if we are looking for |A ∩ B| > τ. As such, we initialize the count h at |A| − τ and subtract one each time an element of A is missing from B; as soon as h ≤ 0, the algorithm exits early and returns −1.

Algorithm 4 IntersectSizeExceeds: returns the size of the intersection of sets A and B, and may return −1 when the size is τ or less. A is provided as an array of length m and B as a hash table with n elements.

The IntersectSizeExceeds algorithm is a valid optimization for pivoting, as the pivoting algorithm will still find the same pivot vertex. Indeed, the optimization only reduces execution time for those intersections that result in a less-than-optimal intersection. Note also that the choice of pivot has no bearing on the correctness of MCE.
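A minimal scalar sketch of this primitive, assuming the hashed set is a `std::unordered_set` and using names of our own choosing:

```cpp
#include <cassert>
#include <unordered_set>
#include <vector>

// Sketch of the intersect-size-exceeds primitive: returns |a ∩ b|, but may
// return -1 as soon as it can prove |a ∩ b| <= tau. 'b' plays the role of
// the hashed set.
int intersect_size_exceeds(const std::vector<int>& a,
                           const std::unordered_set<int>& b, int tau) {
    int misses_left = static_cast<int>(a.size()) - tau;  // tolerated misses
    int size = 0;
    for (int v : a) {
        if (b.count(v)) {
            ++size;                    // element is in the intersection
        } else if (--misses_left <= 0) {
            return -1;                 // |a ∩ b| can no longer exceed tau
        }
    }
    return size;
}
```

On the example above (τ = 4), the function bails out after testing only the elements 1, 3, 5 and 6.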

XPSet Data Structure.
A critical aspect of hash-based intersection is that the larger of the two sets being intersected should be placed in a hash set. Reuse of this hash set is important, as construction of the hash set takes time linear in the size of the larger set. However, the sets P and X are modified for every candidate vertex tried (Algorithm 2, line 7). Re-hashing the sets before each intersection is inefficient.
We design a data structure that circumvents these problems by keeping all elements of P and X in a fixed location while accounting for the gradual movement of elements from P to X. The purpose of the data structure is to support hash intersections with P or X either as hashed set or as iterated set, which requires the ability to: (1) Query membership of P and X in constant time.
(2) Efficiently iterate over all members of P and/or X.
(3) Move a single element of P to X in minimal steps without reconstructing the hashed representation.
Our solution maintains two arrays. The array H provides hashed access. Array E lists the elements of X and P at consecutive indices, as proposed by Bron and Kerbosch [12]. The elements of X appear at indices 0, . . ., x − 1, followed by the elements of P. In the baseline Bron-Kerbosch algorithm, every element of P is tried in turn, and moving it to X can be effected by incrementing x, assuming we move elements in the order stored in the array [12]. Pivoting, however, complicates the situation as the neighbours of the pivot are not iterated over, and not moved. As such, incrementing x is not sufficient to maintain P and X. However, we can address this by separating P into P \ adj(u) and P ∩ adj(u), where u is the selected pivot vertex. Only the elements of P \ adj(u) need to be moved to X in any call of BKSearch. As such, they are placed immediately after the elements of X in the joint list, and incrementing x effects the move of those elements between P and X.
The array H maps each vertex to its index in the E array. This way, we can determine if a vertex v belongs to X (H[v] < x), or to P (H[v] ≥ x), or to neither (H[v] is invalid). The array H has length n, where n is the total number of vertices in the graph cut-out. It is constructed as follows: for every element v inserted in the set, we store in H[v] the array index at which the element was placed in E. This provides the following invariants: H[E[i]] = i for every occupied index i of E, and E[H[v]] = v for v ∈ P ∪ X. Non-initialised indices of H violate these constraints, similar to fast arrays [24].
Elements are inserted in an XPSet by appending them to the E array and recording their position in H. Hash-based access is demonstrated in Algorithm 5. The invariants on H are tested first to determine if the queried vertex is a member of either set X or P. For present elements, the method returns the position of the element in E, which can be used immediately for remapping vertex IDs in a cut-out. Sequential iteration through X and/or P is trivial using array E and knowledge of x and p.
An important property of the XPSet is that it supports efficient vertical vectorization. Where vertical hashing with linear probing requires multiple gather accesses to determine the presence of an element [11,43], the XPSet answers each contains query with exactly two gather operations: one in the H array, and one in the E array.
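The structure described above can be sketched as follows. This is our own minimal illustration, not the paper's implementation: we mark invalid H entries with a sentinel rather than leaving them uninitialised, and the method names are ours:

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Sketch of the XPSet idea: array E stores the elements of X (indices
// 0..x-1) followed by the elements of P, and H maps a vertex ID to its
// index in E.
struct XPSet {
    static constexpr uint32_t kInvalid = UINT32_MAX;
    std::vector<uint32_t> E;  // elements of X, then elements of P
    std::vector<uint32_t> H;  // H[v] = index of v in E, or kInvalid
    uint32_t x = 0;           // boundary: |X|

    explicit XPSet(uint32_t n) : H(n, kInvalid) {}

    void insert_p(uint32_t v) {  // append a candidate to P
        H[v] = static_cast<uint32_t>(E.size());
        E.push_back(v);
    }
    bool in_x(uint32_t v) const { return H[v] != kInvalid && H[v] < x; }
    bool in_p(uint32_t v) const { return H[v] != kInvalid && H[v] >= x; }

    // Move one element of P to X by swapping it to the boundary position
    // and advancing x; the hashed representation is patched, not rebuilt.
    void move_to_x(uint32_t v) {
        uint32_t i = H[v], j = x;
        std::swap(E[i], E[j]);
        H[E[i]] = i;
        H[E[j]] = j;
        ++x;
    }
};
```

A membership query touches exactly one entry of H and, to validate it, one entry of E, which is what makes the two-gather vectorized lookup possible.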

Optimizations for Small Subproblems
3.2.1 Bitsets. Bitset representations of adjacency lists allow for efficient intersection, as many set operations can be completed in a single or a few processor clock cycles. However, this equates to representing a dense binary adjacency matrix, which limits the approach to small subproblems. For instance, a 512 by 512 dense binary matrix requires 32 KiB of storage.
To constrain the memory footprint, we construct dense binary matrices in conjunction with cutting out the induced graph of P ∪ X. We create these adjacency matrices only when the recursive search encounters sufficiently small sub-problems, such that each of the P and X sets fits in a hardware register. Let p = |P| and x = |X| at the time of taking the cut-out. We consider two implementations of dense binary matrices.
A dense binary matrix is simply a binary matrix of dimension w × w. We use a dense binary matrix if p + x ≤ w, where w is the smallest of 32, 64, 128, 256 or 512 that meets this constraint. This coincides with the use of 32- and 64-bit scalar operations, SSE vectors, AVX2 vectors and AVX-512 vectors, respectively (we do not use the 512 × 512 option on the Zen2 architecture as in practice it is less efficient).
A blocked binary matrix supports larger sub-problems where matrix rows span two hardware registers. As edges among X vertices are not recorded in a cut-out, the adjacency matrix has a contiguous sub-block of size x × x that is zero. We create two binary matrices: one of size (p + x) × p that stores for each vertex its neighbours in the P set, and a second dense matrix of size p × x that stores the X neighbours of the vertices in P. The latter matrix is the mirror image of the first; however, we store both to support efficient vectorized access. Blocked binary matrices are created when p and x are each no larger than the longest hardware-supported vector and p + x is larger than it.

Bitset Cut-Out Construction.
It is time-consuming to construct bitset cut-outs [22], so an efficient implementation is important. We describe the algorithms for cutting out dense binary matrices; blocked matrices are handled similarly. Bitset cut-outs represent the induced subgraph of P ∪ X, with vertices relabelled to the range 0, . . ., c − 1 where c = |P ∪ X|. We use the order in which vertices are listed in the XPSet as their new IDs to maintain consistency with the main algorithm. The adjacency information is stored in a c × c dense binary matrix M. The position M_{i,j} = 1 if there is an edge between vertices E[i] and E[j] in the main graph, and 0 otherwise.
We ignore edges among X vertices because the Bron-Kerbosch algorithm never reads these edges [18]. Hashed adjacency list (Figure 3): We iterate over the elements of the XPSet and perform hash lookups into the hashed adjacency list adj(v). This determines all elements of row i of the dense matrix (except the columns that would record edges among X vertices, as those are zero). We use vertically vectorized [43] hash set lookup: each lane of a vector contains a subsequent element of the XPSet, and the hash set lookup is vectorized across those elements. The result is a vector of boolean values that indicate whether the corresponding bit in the adjacency matrix should be set. We convert the vector to a bit mask with one bit per lane and populate the adjacency bitset 8 bits (AVX2) or 16 bits (AVX-512) at a time, assuming a 32-bit vertex identifier.
Hashed XPSet (Figure 4): We iterate over adj(v) and perform hash lookups in the XPSet using the mechanisms of the XPSet. The goal now is to populate sparsely selected elements of a row of the dense matrix. We again vectorize the computation using vertical vectorization; however, in this instance vector lanes map to row-wide bitmasks instead of singular bits.
If the row width is no larger than 64 bits, we can construct a one-hot bitset in each lane by means of a vector shift of 1 by the new vertex ID, which equals the position of the vertex in the E array. We clear the lane if the element is not present in the XPSet. These bitsets are then OR-reduced to a single row. This process is repeated to include all elements of adj(v).
Note that by design of the XPSet data structure, just two vector gather operations (loading an element in each lane from unrelated locations), one in each of the arrays H and E, suffice to complete the hash lookup. In contrast, hash sets potentially require more lookups.
Rows of the matrix cannot be constructed this way if they are longer than 64 bits (c > 64). In this case, the rows of the matrix are built up in multiple steps, operating on a subset of the lanes. For example, when c = 128 on AVX2, we can perform 8-way vertically vectorized lookup in the XPSet, followed by constructing 8 one-hot singleton sets in 128-bit subvectors, vectorized 2 sets at a time. This results in 4 vectors of 2×128 bits, which are bitwise OR-ed and subsequently OR-reduced to obtain the 128-bit matrix row representation of those 8 elements.

Pivoting Optimizations in Binary Matrices.
We define specific optimizations for pivot selection in binary matrices. The purpose of these optimizations is to accelerate progression towards an empty P set, which is the condition to terminate recursion. We apply these rules only in binary matrix cut-outs, where the overhead of executing additional intersections is minimal.
In contrast, we aim to clean up as many vertices of P as possible while searching for the pivot. This reduces the size of P, and as such any recursive call made may operate on a smaller set P, resulting in fewer recursive calls overall. Note that application of the rules shrinks P during pivot selection. This helps to trigger the rules for other vertices too, which is an effect not available in Naudé's approach. Algorithm 6 demonstrates how these optimizations are applied. The function one_hot(i) produces a bitset with only the bit at position i set to 1. popcount and tzcnt provide the population count and trailing zero count of a bitset. Lines 3-4 implement the termination condition for the recursion. Note that when enumerating the clique R, elements of R need to be mapped back to IDs in the original graph and merged with the temporary clique at the point of taking the cutout. Lines 5-17 select the pivot. Lines 19-21 are the tail recursion optimization case (optimization 3 above). Finally, lines 23-28 implement the parallel iteration over the elements of P \ adj(u).
The first pivoting optimization is trivially valid and can be considered as an inlining optimization.The validity of the second and third optimizations is proven by Naudé [38].

Optimizations for Tiny Subproblems
Tiny subproblems are those where we visit a vertex v with at most three neighbours. For these problems, the possible cliques are explicitly enumerated through if-then-else statements by checking which neighbours are members of P vs X, and what edges connect them. It is also necessary to check whether the neighbours precede v in degeneracy order, in which case they should be excluded and the clique should not be accepted.
Consider the examples in Figure 5. When vertex v in case (i) is visited, we observe it has neighbours u and w, where the order is u < v < w. Two cliques are enumerated: {v, u} and {v, w}. In case (ii), the clique {v, u} is not enumerated.

Putting It All Together
We design an algorithm for the MCE problem that utilises multiple data representations and enumeration algorithms. The top-level algorithm (Algorithm 7) iterates over all vertices v and constructs the initial X and P sets based on the left (X) and right (P) neighbours of v. It then considers one of three cases: (i) tiny subproblems where |P| + |X| ≤ 3 (Section 3.3); (ii) small subproblems where P and X fit in a natively supported SIMD word (Section 3.2.1); (iii) other subproblems, which are evaluated using the large-problem search optimized with intersect-size-exceeds (Section 3.1.1) and the XPSet data structure (Section 3.1.2). The algorithm for large subproblems (Algorithm 8) again makes a choice between these three options on every recursive step. The search algorithm for a dense binary matrix is listed in Algorithm 6, discussed in Section 3.2.3. The blocked case differs slightly as it handles two binary matrices and the P set is represented by two bitsets.

EVALUATION
Experimental results are collected on two processor architectures: an AMD EPYC 7702 (Zen version 2), which supports AVX2 (256-bit vectors), and an Intel Xeon Gold 6438Y+ (Sapphire Rapids, SR), which supports AVX-512 (512-bit vectors). Both are dual-socket machines with a sub-NUMA configuration totalling 8 (Zen2) and 4 (SR) NUMA domains. The SR machine has 256 GiB of main memory and the Zen2 machine has 1 TiB.
Sapphire Rapids supports vector instructions for per-lane population count (AVX512-VPOPCNTDQ). Mula's algorithm [37] is used on AVX2. Our code also makes heavy use of the scalar trailing zero count (the TZCNT instruction in BMI1). TZCNT across a vector is achieved by finding the least significant non-zero lane and applying TZCNT to this lane. The use of binary matrices should be efficient on other vector instruction sets as well, provided these key operations, along with bitwise AND, OR, and NOT, are available on vector registers. All code is C++, compiled with gcc 10.2 on Zen2 and gcc 13.2.1 on SR. Intel Cilk is used for parallel execution. The source code directly calls into the Cilk runtime's parallel loop execution methods, bypassing the need for a custom Cilk compiler. The runtime itself is compiled with gcc 7.2.0.
The same hash function as Blanusa's [9] is used for open-addressed hash sets. The hash set occupancy is kept below 50%. Performance results reported are the average of 10 measurements using 14 graph datasets (Table 1).

Performance
Table 2 shows the performance of Blanusa's algorithm [9], the GraphMineSuite (GMS) [8] implementation of Eppstein's algorithm, and the algorithm proposed here, which we refer to as Differentiated Set Intersections (DSI), on two processor architectures. Results are presented for the best variant provided in GMS, which uses Roaring bitmaps [33] to represent sets and applies an approximate degeneracy ordering algorithm. Both Blanusa's algorithm and GMS improve over Das et al's algorithm [17] by roughly an order of magnitude. Das's algorithm was already shown to outperform prior work by Eppstein et al [22] and Tomita et al [47]. Hence, no comparisons are made against older algorithms.
Maximal clique enumeration time is presented without pre-processing overhead, as is customary in the literature. Speedups vary over a large range, from 1.44× to 37.10× over Blanusa's and from 1.69× to 38.38× over GMS on the Zen2 processor. The only instance where DSI is not fastest is warwiki, where GMS is 2-2.5× faster than DSI. GMS ran out of memory in some cases (marked OOM in Table 2). Table 2 also lists the standard deviation of the execution times, as measured over 10 repeated executions. Standard deviation is often below 5% but can be substantial for all algorithms. On the Sapphire Rapids processor we observe up to 20.43% for Blanusa's (warwiki), 7.96% for GMS (yahoo) and 22.05% for DSI (HS-CX). On the Zen2 processor, Blanusa's has variability of up to 43.81% (warwiki) and GMS up to 40.01% (patents). We attribute this variability to the parallel execution, which is discussed in Section 4.4.

Preprocessing Overhead
A substantial amount of time can be spent in pre-processing steps (Figure 6). From here on, results are reported for the Sapphire Rapids architecture. Even though a parallel and optimized k-core decomposition algorithm following [20] is used, it is not unusual for this step to take 40% of the total processing time. The new sort order is currently determined sequentially, as it incurs little overhead except in the road networks and the patents graph. Relabeling of the graph is performed in parallel and takes relatively little time. The road networks stand out as different: due to their very low vertex degrees, degeneracy ordering and hash intersection bring limited benefits and may be omitted.
GMS incurs broadly the same preprocessing overhead as DSI, with a small efficiency benefit from using an approximate degeneracy algorithm. Preprocessing in Blanusa's code is not state of the art, as it calculates the degeneracy ordering sequentially.

Code Specialization
The time spent in each of the generic, dense, blocked and tiny algorithms varies strongly across data sets (Figure 7). This demonstrates the need for optimization of all code variants. Eight of the graphs spend a substantial amount of time in the generic code path that handles large P and X sets: talk, stackoflow, yahoo, warwiki, topcats, friendster, higgs and orkut. These are the graphs with the highest maximum degree (Table 1), indicating a need to use the generic variant due to the size of the P set. The dense variant (binary matrices from 32 × 32 to 512 × 512) is common across all graphs. The patents data set is nearly entirely handled using this problem type, in line with its very small maximum degree and low degeneracy. The blocked variant for larger binary matrices is infrequently used. However, this variant is frequently used in wormnet when restricting dense matrices to 256 × 256, as is done on the Zen2 architecture in line with the AVX2 vector length. The tiny code variant is unusually important for USAroad and CAroad. These graphs have maximum degree 9 and 12, respectively, and the majority of their vertices have a degree of three or less.
Whereas creating a cutout in a specialized data structure is beneficial in absolute terms, it generally takes longer to create a cutout than to enumerate the maximal cliques within it (Figure 8). Only in the wormnet graph is more time spent processing a cutout than constructing it. This occurs notwithstanding the optimized cutout construction (Section 3.2.2). Cutouts are created from relatively large P and X sets, as these barely meet the size constraints of the dense or blocked algorithms, so hashing the P and X sets to support hash lookups during cutout construction is important for overall performance. Figure 9 shows the slowdown incurred when not hashing the XPSet during cutout construction. The substantial slowdowns illustrate the importance of the XPSet data structure in speeding up cutout construction.

Parallel Scalability
Figure 10 shows how performance changes as the number of parallel threads is increased. Here, timings are normalized to single-threaded execution of the same algorithm. Each of the algorithms may show high parallel scalability for some data sets and poor scalability for others. Such sensitivity of performance to the data set is common for NP-hard problems. In this case, load imbalance and the granularity of parallel work have been identified as possible causes. Load imbalance can generally be solved by creating small enough sequential parts of work and spreading these across threads using work stealing [10]. Usually this works well when the execution time of these sequential parts can be predicted such that equal amounts of work can be created and distributed across threads. This is, however, very hard for NP-hard problems due to the very large difference between easy and hard subproblems, where hard subproblems take substantially longer. We analyzed the longest-running subproblem over all choices of top-level vertex. This longest-running subproblem takes over 75% of the whole execution time for talk, stackoflow, warwiki, HS-CX, higgs and wormnet. As each large subproblem executes in parallel using recursive parallel decomposition, this clearly demonstrates a load imbalance in the amount of work required per top-level vertex. It also shows that completion of the program depends on completing these largest subproblems.
Load imbalance does not explain the scalability of graphs such as USAroad and CAroad, where only very small subproblems exist. For these graphs, we conjecture that parallel loop control imposes an important performance overhead. These graphs only require parallelism in the outer loop, whereas other graphs require parallelism at each level of recursion. Deciding when to use sequential or parallel code for a subproblem is again a hard problem. Some graphs, like friendster, do not contain any hard subproblems, but their size implies a large amount of computation. In this case, the speedup over sequential execution is 93.5× for DSI.

RELATED WORK
Two lines of algorithms have been researched: those based on the Bron-Kerbosch algorithm [12] and those based on Kose's algorithm [31]. Akkoyunlu's algorithm [2] is a third type, based on algebraic expressions of sets. This approach has, however, not been studied much.
Kose's algorithm generates maximal cliques by generating and assembling smaller cliques [31]. The algorithm is constrained by the large number of intermediate cliques that are generated, as the number of cliques in a graph is vastly larger than the number of maximal cliques [30]. Consequently, distributed technologies like map-reduce are employed to manage the large amount of intermediate data, to extend cliques, and to filter duplicates.
Various improvements to Kose's algorithm have been proposed. CliqueEnumerator [56] seeks to manage the large number of intermediate cliques using compressible bitmap indices and fast bitwise operations. Wu et al use the Map-Reduce framework [19] to manage intermediate data sets [54]. PECO [46] avoids duplication in the generation of cliques using a vertex ordering, which is reminiscent of the left and right neighborhoods in Bron-Kerbosch. PECO also explores various vertex ordering heuristics to improve load balancing, but excludes degeneracy order. CMC-bit encodes candidate sets as bitsets across all actively explored cliques [55]. Lessley et al [34] apply data-parallel primitives in a shared memory system and accelerate clique matching by hashing cliques.
The Bron-Kerbosch algorithm is the most promising algorithmic line to date. Key improvements include Tomita's pivot selection [47] and Eppstein's degeneracy ordering and graph cutout [22]. Du et al prune the search by checking whether the current clique is contained within any previously enumerated maximal clique [21]. Naudé [38] short-cuts pivot selection when a pivot can be found that leaves only two vertices to explore; this approach is reported to perform best on dense graphs. Selecting the highest-degree vertex has been shown to accelerate pivot selection by avoiding set intersections and to improve overall runtime [40].
Das et al present a shared-memory parallel MCE algorithm that is provably work-efficient and has provably low parallel depth [17]. Blanusa et al's parallel algorithm uses vectorized hash intersection and addresses load balance issues [9].
Bitset data structures have been explored for clique problems. San Segundo et al store the graph as a dense binary matrix, which constrains the maximum graph size that can be supported. They explore this for the maximum clique problem [42] and MCE [40]. Storage size is improved using a sparse bitset [41], which is a basic type of compressed bitmap [14]. Roaring bitmaps [33] were used in [8]. Contrary to hash intersection, compressed bitmaps require graph cutouts due to their worst-case time complexity.
PBAM stores only the columns of the binary adjacency matrix that correspond to P vertices [18]. Our blocked binary format also stores the rows corresponding to X vertices. It is not clear how PBAM performs efficient intersections on the X set without these rows.
Dynamic Hashed Blocks (DHB) is a data structure for dynamic graphs [48], which require random insertion and deletion of vertices and edges. DHB supports this with fast membership queries using a dual data structure consisting of a hashed set and a listing of elements. Similar to the XPSet, entries in the DHB hash table point to entries in the listing; however, the XPSet additionally makes use of the idea of fast arrays to reduce initialization and lookup time [24]. This is possible due to cutting out subgraphs.

CONCLUSION
The majority of execution time in maximal clique enumeration is spent in set intersections. Contrary to previous work that has proposed the use of optimized set intersection algorithms such as hash sets, bitsets and compressed bitsets, this work proposes to differentiate the set representation and intersection algorithm depending on their function and the subproblem size. To this end, new algorithms (intersect-size-exceeds), data structures (XPSet) and specialized algorithms for "tiny" and "small" subproblems are defined. Experimental evaluation on 14 challenging graph datasets and two processor architectures demonstrates substantial performance improvements over state-of-the-art algorithms. Our approach is applicable across a broad range of graph mining and NP-hard problems, as set intersection is a major performance bottleneck in this domain.

Figure 3 :
Figure 3: Cutout process for vectorized hashed lookup in the adjacency list. The cutout matrix contains the elements of the P set and has dimension 8 × 8 in this example.

Figure 4 :
Figure 4: Cutout process for vectorized hashed lookup in   with access to the  and  arrays of the  .

Figure 6 :
Figure 6: Fraction of time spent in preprocessing steps, compared to actual MCE search on the Sapphire Rapids processor.

Figure 9 :
Figure 9: Performance impact of disabling XPSet hashing in cut-out construction for generic, blocked and dense cutouts.
N(v) denotes the adjacency list of vertex v ∈ V in graph G and represents the neighbors of v. A clique C is a fully connected subgraph of G, i.e., C ⊂ V and ∀v ∈ C: C ⊂ N(v) ∪ {v}. A clique is maximal if no other clique exists that contains it. The MCE problem consists of enumerating all maximal cliques.
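The maximal cliques defined above are typically enumerated with the Bron-Kerbosch recursion using Tomita's pivot selection. A minimal sketch follows (illustrative `std::vector` sets; the paper's implementation uses the specialized representations discussed earlier instead):

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

using Set = std::vector<uint32_t>; // sorted vertex set

// Sorted-set intersection.
static Set intersect(const Set& a, const Set& b) {
    Set out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(out));
    return out;
}

// Bron-Kerbosch with pivoting: R is the current clique, P the candidates,
// X the excluded vertices; adj[v] is the sorted adjacency list of v.
void bron_kerbosch(const std::vector<Set>& adj, Set& R, Set P, Set X,
                   std::vector<Set>& out) {
    if (P.empty() && X.empty()) { out.push_back(R); return; } // maximal
    // Tomita's rule: pick the pivot u in P ∪ X maximizing |P ∩ N(u)|.
    uint32_t pivot = 0; size_t best = 0; bool have = false;
    for (const Set* s : {&P, &X})
        for (uint32_t u : *s) {
            size_t c = intersect(P, adj[u]).size();
            if (!have || c > best) { pivot = u; best = c; have = true; }
        }
    // Only vertices outside N(pivot) need to be expanded.
    Set cand;
    std::set_difference(P.begin(), P.end(),
                        adj[pivot].begin(), adj[pivot].end(),
                        std::back_inserter(cand));
    for (uint32_t v : cand) {
        R.push_back(v);
        bron_kerbosch(adj, R, intersect(P, adj[v]), intersect(X, adj[v]), out);
        R.pop_back();
        P.erase(std::find(P.begin(), P.end(), v)); // move v from P to X
        X.insert(std::lower_bound(X.begin(), X.end(), v), v);
    }
}
```

Each recursive call performs the intersections P ∩ N(v) and X ∩ N(v), which is exactly where this paper differentiates the set representation and intersection algorithm.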
Note that we terminate immediately if |A| − t ≤ 0, i.e., when even a full overlap of the smaller set A cannot exceed the threshold t. It is most beneficial to count missing elements in the smaller of the two input sets, as this implies a smaller starting value of h. The IntersectSizeExceed concept applies trivially to (vectorized) merge intersection and to vectorized hash intersection.
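A scalar sketch of the intersect-size-exceeds idea (illustrative names; the paper's variants are vectorized merge and hash intersections): scan the smaller set A, keep a budget h = |A| − t of elements that may miss, and stop as soon as either the threshold is exceeded or the budget is exhausted.

```cpp
#include <cstdint>
#include <unordered_set>
#include <vector>

// Sketch: return true iff |A ∩ B| > t, short-cutting the count.
// A is scanned; B supports O(1) expected membership tests.
bool intersect_size_exceeds(const std::vector<uint32_t>& a,
                            const std::unordered_set<uint32_t>& b,
                            long t) {
    long h = static_cast<long>(a.size()) - t; // budget of allowed misses
    if (h <= 0) return false; // even full overlap cannot exceed t
    long count = 0;
    for (uint32_t v : a) {
        if (b.count(v)) {
            if (++count > t) return true;   // threshold exceeded: stop early
        } else if (--h <= 0) {
            return false;                   // too many misses: |A ∩ B| ≤ t
        }
    }
    return false;
}
```

Scanning the smaller set gives the smaller starting budget h, so either exit condition fires sooner than a full intersection would complete.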
(1) If a candidate pivot u ∈ P is evaluated where |P ∩ N(u)| = 0, then visiting u results in a terminal recursive call (P is empty). We enumerate the current clique extended with u if X ∩ N(u) is empty, and move u from P to X before continuing the pivot search. (2) If during pivot selection a vertex u ∈ P is encountered where P ⊆ N(u), then u will be included in every maximal clique encountered in further recursive calls. We add u to the current clique, moving it from P to R, and update X = X ∩ N(u) (note that P = P ∩ N(u) is a no-op). (3) If, after selecting a pivot p, the set of vertices to iterate over is a singleton (|P \ N(p)| = 1), we perform tail recursion optimization, i.e., we add the remaining vertex v to the clique, move it from P to R, and update P = P ∩ N(v) and X = X ∩ N(v). We then re-run pivot selection. The purpose of this optimization is to limit stack growth and function call overhead. Naudé also presents the second and third observations; however, they are used differently [38]. Their goal is to accelerate pivot selection: the search is stopped as soon as a vertex p is found where |P \ N(p)| ≤ 2. If no such vertex is found, the vertex that maximizes |P ∩ N(p)| becomes the pivot. By stopping the search early, they reduce the time spent pivoting, which dominates the execution time. However, they still need to perform the recursive calls after this.

Table 2 :
Execution time in seconds of Blanusa's algorithm, GMS and this work (DSI: Differentiated Set Intersections) on Sapphire Rapids and Zen2, both with 128 threads. "Dev" shows the standard deviation of execution time relative to the average. Speedups of DSI over Blanusa's and GMS are also shown. OOM: out of memory.