Fast Parallel Algorithms for Enumeration of Simple, Temporal, and Hop-constrained Cycles

Cycles are one of the fundamental subgraph patterns and being able to enumerate them in graphs enables important applications in a wide variety of fields, including finance, biology, chemistry, and network science. However, to enable cycle enumeration in real-world applications, efficient parallel algorithms are required. In this work, we propose scalable parallelisation of state-of-the-art sequential algorithms for enumerating simple, temporal, and hop-constrained cycles. First, we focus on the simple cycle enumeration problem and parallelise the algorithms by Johnson and by Read and Tarjan in a fine-grained manner. We theoretically show that our resulting fine-grained parallel algorithms are scalable, with the fine-grained parallel Read-Tarjan algorithm being strongly scalable. In contrast, we show that straightforward coarse-grained parallel versions of these simple cycle enumeration algorithms that exploit edge- or vertex-level parallelism are not scalable. Next, we adapt our fine-grained approach to enable the enumeration of cycles under time-window, temporal, and hop constraints. Our evaluation on a cluster with 256 CPU cores that can execute up to 1,024 simultaneous threads demonstrates a near-linear scalability of our fine-grained parallel algorithms when enumerating cycles under the aforementioned constraints. On the same cluster, our fine-grained parallel algorithms achieve, on average, one order of magnitude speedup compared to the respective coarse-grained parallel versions of the state-of-the-art algorithms for cycle enumeration. The performance gap between the fine-grained and the coarse-grained parallel algorithms increases as we use more CPU cores.


INTRODUCTION
Graphs are widely adopted as a data representation tool across many domains [47,50,70,71]. A common method of analysing graph-based data is to enumerate subgraph patterns, such as cycles, cliques, and motifs, in graphs [3].
However, enumerating subgraph patterns often leads to long execution times because the number of subgraph patterns that can exist in a graph [4,20] can be orders of magnitude greater than the number of graph vertices. As a result, fast subgraph enumeration algorithms are required that can exploit the parallelism of modern multi-core processors. In this paper, we focus on enumerating simple cycles in directed graphs and introduce scalable parallel algorithms for that problem. A simple cycle is a sequence of edges that starts and ends with the same vertex and visits other vertices at most once. Enumerating simple cycles has important applications in several domains. For example, in electronic design automation, combinatorial loops in circuits are typically forbidden [28,56], and such loops can be detected by enumerating simple cycles. In a software bug tracking system, a dependency between two software bugs requires one bug to be addressed before the other [63]. Circular bug dependencies are undesirable and can be detected by finding simple cycles. Other applications include detecting feedback loops in biological networks [37,41] and detecting unstable relationships in social networks [25,74].

Table 2. Capabilities of the related work versus our own. Competing algorithms either fail to exploit fine-grained parallelism or do it on top of asymptotically inferior algorithms.
Sections 5 and 6 introduce our fine-grained parallel versions of the Johnson and the Read-Tarjan algorithms, respectively. Our general framework for parallelising temporal and hop-constrained cycle enumeration algorithms is presented in Section 7. In Section 8, we provide an experimental evaluation of our fine-grained parallel algorithms. Finally, we conclude our work in Section 9.

RELATED WORK
Simple cycle enumeration algorithms. Enumeration of simple cycles of graphs is a classical computer science problem [2,9,27,35,44,46,60,66-68,72]. The backtracking-based algorithms by Johnson [35], Read and Tarjan [60], and Szwarcfiter and Lauer [66] achieve the lowest time complexity bounds for enumerating simple cycles in directed graphs. These algorithms implement advanced recursion tree pruning techniques to improve on the brute-force Tiernan algorithm [68]. Section 3.4 covers such pruning techniques in further detail. A cycle enumeration algorithm that is asymptotically faster than the aforementioned algorithms [35,60,66] has been proposed by Birmelé et al. [9]; however, it is applicable only to undirected graphs. Simple cycles can also be enumerated by computing the powers of the adjacency matrix [19,36,55] or by using circuit vector space algorithms [24,46,73], but the complexity of such approaches grows exponentially with the size of the cycles or the size of the input graphs.

Time-window, temporal ordering, and hop constraints. It is common to search for cycles under additional constraints. For instance, in temporal graphs, it is common to search for cycles within a sliding time window, as in Kumar and Calders [39] and Qiu et al. [58]. In addition, temporal ordering constraints can be imposed when searching for cycles in temporal graphs, as in Kumar and Calders [39]. Furthermore, the maximum number of hops in cycles or paths can be constrained, as in Gupta and Suzumura [29] and Peng et al. [54]. Note that hop-constrained simple cycles can also be enumerated using incremental algorithms, such as that of Qiu et al. [58]. However, this algorithm is based on the brute-force Tiernan algorithm [68], which makes it slower than non-incremental algorithms that use recursion tree pruning techniques [54]. Additionally, because incremental algorithms maintain auxiliary data structures, such as paths, to be able to construct cycles incrementally, they are not as memory-efficient as non-incremental algorithms [54].
Table 2 offers comparisons between the capabilities of these methods and ours.
Parallel and distributed algorithms for cycle enumeration. Cui et al. [18] proposed a multi-threaded algorithm for detecting and removing simple cycles of a directed graph. The algorithm divides the graph into its strongly-connected components, and each thread performs a depth-first search on a different component to find cycles. However, the sizes of the strongly-connected components in real-world graphs can vary significantly [49], which leads to workload imbalance.
Rocha and Thatte [61] proposed a distributed algorithm for simple cycle enumeration based on the bulk-synchronous parallel model [69], but it searches for cycles in a brute-force manner. Qing et al. [57] introduced a parallel algorithm for finding length-constrained simple cycles. It is the only other fine-grained parallel algorithm we are aware of, in the sense that it can search for cycles starting from the same vertex in parallel. However, the way this algorithm searches for cycles is similar to the way the brute-force Tiernan algorithm [68] works. To our knowledge, we are the first to introduce fine-grained parallel versions of asymptotically-optimal simple cycle enumeration algorithms, which do not rely on a brute-force search, as we show in Table 2. Distributed algorithms for detecting the presence of cycles in graphs readily exist [6,23,51]. However, our focus is on discovering all simple cycles of a graph rather than detecting whether a graph has a cycle or not.

BACKGROUND
This section introduces the main theoretical concepts used in this paper and provides an overview of the most prominent simple cycle enumeration algorithms. The notation used is given in Table 3.

Preliminaries
We consider a directed graph G(V, E) having a set of vertices V and a set of directed edges E = {v → w | v, w ∈ V}.
The set of neighbours of a given vertex v is defined as N(v) = {w | v → w ∈ E}. We refer to the vertex v of an edge v → w as its source vertex and to the vertex w as its destination vertex. An outgoing edge of a given vertex v is defined as v → w and an incoming edge is defined as u → v, where v → w, u → v ∈ E. A path between the vertices v_0 and v_k, denoted as v_0 → v_1 → . . . → v_k, is a sequence of vertices such that there exists an edge between every two consecutive vertices of the sequence. A simple path is a path with no repeated vertices. A simple path is maximal if the last vertex of the path has no neighbours or all of its neighbours are already in the path [21]. A cycle is a path of non-zero length from a vertex v to the same vertex v. A simple cycle is a cycle with no repeated vertices except for the first and last vertex. The number of maximal simple paths and the number of simple cycles in a graph are denoted as s and c, respectively (see Table 3). Note that s can be exponentially larger than c [67]. A path or a cycle is said to satisfy a hop constraint L if the number of edges in that path or cycle is less than or equal to L. The goal of simple cycle enumeration is to compute all simple cycles of a directed graph G, ideally without computing all of its maximal simple paths.
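To make these definitions concrete, the following is a minimal, illustrative Python sketch of brute-force simple cycle enumeration in the spirit of the Tiernan algorithm discussed later. The adjacency-dictionary representation and all names are our own, not the paper's; each cycle is found exactly once by only extending paths through vertices larger than the start vertex.

```python
from typing import Dict, List, Set

def enumerate_simple_cycles(graph: Dict[int, List[int]]) -> List[List[int]]:
    """Brute-force enumeration: extend simple paths from each start vertex
    and report a cycle whenever an edge closes back to the start vertex."""
    cycles = []

    def extend(start: int, path: List[int], on_path: Set[int]) -> None:
        for w in graph.get(path[-1], []):
            if w == start:
                cycles.append(path + [start])      # closed a simple cycle
            elif w > start and w not in on_path:   # keep the path simple
                on_path.add(w)
                extend(start, path + [w], on_path)
                on_path.remove(w)

    for v in graph:
        extend(v, [v], {v})
    return cycles
```

For example, the graph `{0: [1], 1: [0, 2], 2: [0]}` contains exactly the two simple cycles 0 → 1 → 0 and 0 → 1 → 2 → 0.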
A temporal graph is a graph whose edges are annotated with timestamps [53]. Such a graph might contain parallel edges, which are edges with the same source and destination vertices [7]. An example of a temporal graph with parallel edges is given in Fig. 2. In temporal graphs, a temporal cycle is a simple cycle in which the edges appear in increasing order of their timestamps. A simple cycle or a temporal cycle of a temporal graph occurs within a time window [t1 : t2] if every edge of that cycle has a timestamp ti such that t1 ≤ ti ≤ t2. Fig. 2 shows the simple cycles of a temporal graph that occur within two different time windows of size δ = 5. This graph contains two simple cycles in the time window [2 : 7] (Fig. 2a), which are also temporal cycles, and two simple cycles in the time window [10 : 15] (Fig. 2b), neither of which is a temporal cycle. Note that the existence of parallel edges in temporal graphs makes it possible to have several simple cycles that contain the same sequence of vertices, as shown in Fig. 2a. The union of several cycles that contain the same sequence of vertices is called a cycle bundle [39].
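The time-window and temporal-ordering constraints above can be checked with a few lines of code. The sketch below is our own (the paper defines no such helpers); it assumes a cycle is represented by the list of its edge timestamps and that "increasing order" means strictly increasing.

```python
from typing import List

def within_window(cycle_ts: List[int], t1: int, t2: int) -> bool:
    """Every edge timestamp of the cycle lies within [t1, t2]."""
    return all(t1 <= t <= t2 for t in cycle_ts)

def is_temporal_cycle(cycle_ts: List[int]) -> bool:
    """Edges appear in strictly increasing timestamp order
    (assumption: 'increasing' is strict)."""
    return all(a < b for a, b in zip(cycle_ts, cycle_ts[1:]))
```

With these helpers, a cycle with timestamps [2, 4, 7] lies within [2 : 7] and is temporal, whereas [10, 15, 12] lies within [10 : 15] but is not temporal.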

Task-level parallelism
The parallel algorithms described in this paper can be implemented using shared-memory parallel processing frameworks, such as TBB [38], Cilk [13], and OpenMP [59]. These frameworks enable the decomposition of a program into tasks that can be independently executed by different software threads. In our setup, tasks are created and scheduled dynamically.
A parent task can spawn several child tasks. The depth of a task is the number of its direct ancestors. A dynamic task management system assigns the tasks created to the work queues of the available threads. Furthermore, a work-stealing scheduler [13,14,38] enables a thread that is not executing a task to steal a task from the work queue of another thread.
Stealing tasks enables dynamic load balancing and ensures full utilisation of the threads when there are sufficiently many tasks.
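As an illustration of dynamic task creation, the following Python sketch runs tasks that may spawn children, with idle threads picking up work from a shared queue. This is our own simplification, not the paper's scheduler: real work-stealing runtimes such as TBB or Cilk use per-thread deques with stealing rather than one shared queue, but the load-balancing effect is analogous.

```python
import threading
import queue

def run_tasks(root_task, num_threads=4):
    """Execute a dynamically growing set of tasks. Each task is a callable
    returning (value, children); children are new tasks it spawns."""
    tasks = queue.Queue()
    tasks.put(root_task)
    results = []
    lock = threading.Lock()
    pending = [1]  # tasks created but not yet finished

    def worker():
        while True:
            with lock:
                if pending[0] == 0:   # all work done: terminate
                    return
            try:
                task = tasks.get(timeout=0.01)
            except queue.Empty:
                continue              # other threads still hold tasks
            value, children = task()
            with lock:
                results.append(value)
                pending[0] += len(children) - 1
            for child in children:
                tasks.put(child)      # spawned tasks become stealable work

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

For instance, a task tree of depth 3 where every task spawns two children produces 15 task results in total, regardless of which thread executes which task.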

Work efficiency and scalability
We use the notions of work efficiency and scalability to analyse parallel algorithms [12]. We refer to the time to execute a parallel algorithm on a problem of size n using p threads as T_p(n). The size of a graph is determined by the number of vertices n as well as the number of edges e, but we refer only to n for simplicity. The depth of an algorithm is the length of the longest sequence of dependent operations in the algorithm. The time required to execute such a sequence is equal to the execution time of the parallel algorithm using an infinite number of threads, denoted by T_∞(n). Furthermore, the work performed by a parallel algorithm on a problem of size n using p threads, denoted as W_p(n), is the sum of the execution times of the individual threads. Work efficiency and scalability are formally defined as follows.
Informally, a work-efficient parallel algorithm performs the same amount of work as its serial version, within a constant factor.Scalability implies that, for sufficiently large inputs, increasing the number of threads increases the speedup of the parallel algorithm with respect to its serial version.
We also define the notion of strong scalability as follows [32].
Definition 3. (Strong scalability) A parallel algorithm is strongly scalable if and only if the speedup T_1(n)/T_p(n) is in Θ(p) for sufficiently large n. Whereas Definition 2 implies that the speedup T_1(n)/T_p(n) achieved by a parallel algorithm with respect to its serial execution is infinite when the number of threads p is infinite, Definition 3 implies that the speedup is always in the order of p. Another related concept is weak scalability, which requires the speedup to be in the order of p when the input size per thread is constant. Note that both strong scalability and weak scalability imply scalability.

Simple cycle enumeration algorithms
The following algorithms for simple cycle enumeration perform recursive searches to incrementally update simple paths that can lead to cycles. Each algorithm iterates the vertices or edges of the graph and independently constructs a recursion tree to enumerate all the cycles starting from that vertex or edge.

The Johnson algorithm [35] maintains a set of blocked vertices Blk and, for each vertex w, a list Blist[w] of vertices whose unblocking depends on w; when a vertex w is unblocked, the vertices in Blist[w] are also unblocked. This unblocking process is performed recursively until no more vertices can be unblocked, which we refer to as the recursive unblocking procedure.
A vertex v is blocked (i.e., added to Blk) when visited by the algorithm. If a cycle is found after recursively exploring every neighbour of v that is not blocked, the vertex v is unblocked. However, v is not immediately unblocked if no cycles are found after exploring its neighbours. Instead, the Blist data structure is updated to enable unblocking of v in a later step by adding v to the list Blist[w] of every neighbour w of v. This delayed unblocking of the vertices enables the Johnson algorithm to discover each cycle in O(n + e) time in the worst case. Because this algorithm requires O(n + e) time to determine that there are no cycles, its worst-case time complexity is O(n + e + c(n + e)) [66]. Note that because s can be exponentially larger than c [67], the Johnson algorithm is asymptotically faster than the Tiernan algorithm.
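The blocking and delayed-unblocking bookkeeping described above can be sketched as follows. This is a simplified, single-start-vertex rendition with our own names, not the paper's implementation; the usual duplicate-avoidance across start vertices (e.g., restricting the search to vertices no smaller than the start) is omitted for brevity.

```python
def unblock(v, blocked, blist):
    """Recursive unblocking: unblocking v also unblocks, transitively,
    every vertex that was waiting on v in blist[v]."""
    blocked.discard(v)
    for w in blist.pop(v, []):
        if w in blocked:
            unblock(w, blocked, blist)

def johnson_circuit(v, start, graph, path, blocked, blist, cycles):
    """One recursive call of a Johnson-style search for cycles through
    `start`. Returns True if a cycle was found in this subtree."""
    found = False
    path.append(v)
    blocked.add(v)
    for w in graph.get(v, []):
        if w == start:
            cycles.append(path + [start])
            found = True
        elif w not in blocked:
            found |= johnson_circuit(w, start, graph, path,
                                     blocked, blist, cycles)
    if found:
        unblock(v, blocked, blist)      # v may lie on further cycles
    else:
        for w in graph.get(v, []):      # delayed: a neighbour unblocks v later
            blist.setdefault(w, []).append(v)
    path.pop()
    return found
```

On the graph `{0: [1], 1: [0, 2], 2: [0]}`, a search from vertex 0 finds the two cycles through 0.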
In the example shown in Fig. 3a, every simple path Π that starts from v_0 is explored in a strictly depth-first order. This strict depth-first exploration gives the Johnson algorithm a high pruning efficiency, but it also makes scalable parallelisation of the Johnson algorithm extremely challenging, as we discuss in Section 5.
The Read-Tarjan algorithm [60] also has a worst-case time complexity of O(n + e + c(n + e)). This algorithm maintains a current path Π between a starting vertex and a frontier vertex. A recursive call of this algorithm iterates the neighbours of the current frontier vertex and performs a depth-first search (DFS). Assume that v_0 is the starting vertex and v_1 is the frontier vertex of Π (see Fig. 3a). From each neighbour of v_1, a DFS tries to find a path extension back to v_0 that would form a simple cycle when appended to Π. In the example shown in Fig. 3a, the algorithm finds two path extensions, one indicated as π and one that consists of the edge v_1 → v_0. The algorithm then explores each path extension by iteratively appending the vertices from it to the path Π. For each vertex added to Π, the algorithm also searches for an alternate path extension from that vertex to v_0 using a DFS. In the example given in Fig. 3a, the algorithm iterates through the vertices of the path extension π and finds an alternate path extension π′ branching from π at a neighbour of v_2. If an alternate path extension is found, a child recursive call is invoked with the updated current path Π, which is v_0 → v_1 → v_2 in our example. Otherwise, if all the vertices in π have already been added to the current path Π, Π is reported as a simple cycle. In our example, the Read-Tarjan algorithm explores both the π and π′ path extensions, and each leads to the discovery of a cycle.
The Read-Tarjan algorithm also maintains a set of blocked vertices Blk for recursion-tree pruning. However, differently from the Johnson algorithm, Blk only keeps track of the vertices that cannot lead to new cycles when exploring the current path extension within the same recursive call. The vertices in Blk are avoided while searching for additional path extensions that branch from the current path extension. For instance, the left subtree of the recursion tree shown in Fig. 3b demonstrates the exploration of the path extension π shown in Fig. 3a. During the exploration of π, the vertices u_1, . . ., u_k are added to Blk immediately after being visited for the first time, and they are not visited again while exploring π.
However, when exploring another path extension π′ in the right subtree, the vertices u_1, . . ., u_k are visited once again (see the dotted path of the right subtree). As a result, the Read-Tarjan algorithm visits u_1, . . ., u_k twice instead of just once. As we show in Section 6, this drawback becomes an advantage when parallelising the Read-Tarjan algorithm because it enables independent exploration of different subtrees of the recursion tree.
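The path-extension search at the heart of the Read-Tarjan algorithm is an ordinary DFS that avoids the vertices already on the current path as well as the blocked vertices. The following is a hedged Python sketch with our own naming and representation, not the paper's code:

```python
def find_path_extension(frontier, start, graph, on_path, blocked):
    """DFS from the current frontier vertex for a path back to `start`
    that avoids vertices on the current path and blocked vertices.
    Returns the extension as a vertex list (excluding the frontier,
    ending with `start`), or None if no extension exists."""
    stack = [(frontier, [])]
    seen = set()
    while stack:
        v, ext = stack.pop()
        for w in graph.get(v, []):
            if w == start:
                return ext + [start]          # extension closes a cycle
            if w not in on_path and w not in blocked and w not in seen:
                seen.add(w)
                stack.append((w, ext + [w]))
    return None
```

For example, with current path 0 → 1 in the triangle `{0: [1], 1: [2], 2: [0]}`, the DFS from frontier 1 returns the extension [2, 0].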
Fig. 4: The recursion tree of the Johnson algorithm for n = 6, constructed when the algorithm starts from v_0. Whereas a coarse-grained parallel algorithm explores the complete recursion tree using a single thread, our fine-grained parallel algorithms can explore different regions of the recursion tree in parallel using several threads.

COARSE-GRAINED PARALLEL METHODS
The most straightforward way of parallelising the Johnson and the Read-Tarjan algorithms is to search for cycles that start from different vertices in parallel. Each such search can then be executed by a different thread that explores its own recursion tree. This approach is beneficial because it is work-efficient and can be implemented using one of the existing graph processing frameworks, such as Pregel [45], in a manner similar to the method used by Rocha and Thatte [61].
We refer to this parallelisation approach as the coarse-grained parallel approach.
The coarse-grained approach can expose more parallelism if each thread searches for cycles that start from a different edge rather than a different vertex, because graphs typically have more edges than vertices. Nevertheless, the coarse-grained approach is not scalable, as we prove next.
Proposition 1. The coarse-grained parallel Johnson and Read-Tarjan algorithms are work-efficient.
The proof of Proposition 1 is trivial, and we omit it for brevity.
Theorem 1. The coarse-grained parallel Johnson and Read-Tarjan algorithms are not scalable.
Proof. In this case, the depth T_∞(n) represents the worst-case execution time of a search for cycles that starts from a single vertex or edge, and it depends on the number of cycles found during this search. In the worst case, a single recursive search can discover all cycles of a graph. An example of such a graph is given in Fig. 4a, where each vertex v_i, with i ∈ {1, . . ., n − 1}, is connected to v_0 and to every vertex v_j such that j > i. In that graph, any subset of the vertices v_2, . . ., v_{n−1} defines a different cycle. Therefore, the total number of cycles in this graph is equal to the number of all such subsets, c = 2^{n−2}. Before the search for cycles, both the Johnson and the Read-Tarjan algorithms find all vertices that start a cycle, which is only v_0 in this case. Therefore, the search for cycles will be performed by only one thread. Because both the Johnson and the Read-Tarjan algorithms require O(n) time to find each cycle, the depth of the coarse-grained parallel algorithms grows linearly with the number of cycles c, and, thus, these algorithms are not scalable. □

Theorem 1 shows that the main drawback of the coarse-grained parallel algorithms is their limited scalability. This limitation is apparent for the graph shown in Fig. 4a, which has a number of cycles that is exponential in n. When using a coarse-grained parallel algorithm on this graph, all the cycles will be discovered by a single thread, and, thus, the depth of this algorithm grows linearly with c, as shown in Table 4. Because only one thread can be effectively utilised, increasing the number of threads will not reduce the overall execution time of the coarse-grained parallel algorithm. Fig. 1a shows the workload imbalance exhibited by the coarse-grained parallel algorithms in practice.
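The worst-case construction used in the proof is easy to reproduce and check numerically. The sketch below builds the graph of Fig. 4a as we read the proof (we assume a single edge v_0 → v_1 leaves v_0, which the cycles described in the proof require; every other v_i has edges to v_0 and to all v_j with j > i) and counts the simple cycles through v_0 by exhaustive search:

```python
def worst_case_graph(n):
    """Graph in the style of Fig. 4a: edge 0 -> 1, and every vertex i
    (1 <= i <= n-1) has edges to 0 and to every j with j > i."""
    edges = {0: [1]}
    for i in range(1, n):
        edges[i] = [0] + list(range(i + 1, n))
    return edges

def count_cycles_through(start, graph):
    """Count simple cycles through `start` by exhaustive DFS."""
    count = 0

    def dfs(v, on_path):
        nonlocal count
        for w in graph.get(v, []):
            if w == start:
                count += 1            # closed a simple cycle
            elif w not in on_path:
                dfs(w, on_path | {w})

    dfs(start, {start})
    return count
```

For n = 6 this yields 2^{6−2} = 16 cycles, all discovered by the single search that starts from v_0, matching the claim that one thread performs all the work under the coarse-grained approach.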
Section 8 demonstrates the limited scalability of coarse-grained parallel algorithms in further detail.

FINE-GRAINED PARALLEL JOHNSON
To address the load imbalance issues that arise in the coarse-grained parallel Johnson algorithm, we introduce the fine-grained parallel Johnson algorithm. The main goal of our fine-grained algorithm is to enable several threads to explore a recursion tree concurrently, as shown in Fig. 4b, where each thread executes a subset of the recursive calls of this tree. However, enabling concurrent exploration of a recursion tree conflicts with the sequential depth-first exploration required by the Johnson algorithm to achieve a high pruning efficiency.
In this section, we first discuss the challenges that arise when parallelising the exploration of a recursion tree of the Johnson algorithm. Then, we introduce the copy-on-steal mechanism used to address these challenges and present our fine-grained parallel Johnson algorithm. Finally, we theoretically analyse our algorithm and show that it is scalable.

Fine-grained parallelisation challenges
The requirement of sequential depth-first exploration makes it challenging to efficiently parallelise the Johnson algorithm in a fine-grained manner. This requirement is enforced by maintaining a set of blocked vertices Blk throughout the exploration of a recursion tree. If the threads exploring the same recursion tree simply shared the same set of blocked vertices Blk, the parallel algorithm could produce incorrect results. For example, considering the graph given in Fig. 5a, a thread exploring the path Π = v_0 → v_1 → u_1 → u_2 visits and blocks the vertex v_4 because v_4 cannot participate in a simple cycle that begins with Π. Because the threads exploring this graph share the blocked vertices, another thread attempting to discover the cycle v_0 → v_1 → v_4 → v_2 → v_0 would fail to do so because v_4 is blocked. Therefore, this approach might not discover all cycles in a graph.

Algorithm 1: A task of the fine-grained parallel Johnson algorithm. Input: v, the current vertex; v_0, the starting vertex; d, the depth of this task. InOut: T1, the thread that created this task, and Mutex_T1. Output: true if a cycle was found. The thread T2 executing this task maintains Π_T2, Blk_T2, Blist_T2, and Mutex_T2.
To enable several threads to correctly find all cycles while exploring the same recursion tree, the algorithm could instead forward a new copy of the Blk and Blist data structures when invoking each child recursive call. However, this approach would redundantly explore many paths in the graph. The reason is that a recursive call would be unaware of the vertices visited and blocked by other calls that precede it in the depth-first order, except for its direct ancestors in the recursion tree. When enumerating the simple cycles of the graph shown in Fig. 5a starting from v_0, this approach explores all 4 × 2^{k−1} + 3 maximal simple paths (where k is the number of vertices w_1, . . ., w_k of Fig. 5a) instead of just the seven that the Johnson algorithm would explore. Hence, this approach exhaustively explores all maximal simple paths in the graph and is identical to the brute-force solution of Tiernan (see Section 3.4). Next, we propose a fine-grained parallel algorithm that addresses the aforementioned parallelisation challenges.

Copy-on-steal
To enable different threads to concurrently explore the recursion tree in a depth-first fashion while also taking advantage of the powerful pruning capabilities of the Johnson algorithm, each thread executing our fine-grained parallel Johnson algorithm maintains its own copy of the Π, Blk, and Blist data structures. These data structures are copied between threads only when those threads attempt to explore the same recursion tree. To achieve this behaviour, our fine-grained parallel Johnson algorithm implements each recursive call of the Johnson algorithm as a separate task. The pseudocode of this task is given in Algorithm 1, where a data structure X maintained by the thread T_i is denoted as X_Ti (see Table 3).
If a child task and its parent task are executed by the same thread T_i, the child task reuses the Π_Ti, Blk_Ti, and Blist_Ti data structures of the parent task. However, if a child task has been stolen, i.e., it is executed by a thread other than the thread that created it, the child task allocates a new copy of these data structures (line 2 of Algorithm 1). We refer to this mechanism as copy-on-steal.
The problem with copying data structures between different threads upon task stealing is that the thread that created the stolen task (i.e., the victim thread) can modify its data structures before this task is stolen by another thread (i.e., the stealing thread). This problem can be observed in the example shown in Fig. 6. There, the victim thread T1 and the stealing thread T2 explore the same recursion tree given in Fig. 6b while searching for cycles that start with Π_1 = v_0 → v_1 → v_2 and Π_2 = v_0 → v_1 → v_7, respectively. In this case, T2 steals a task created by T1 that explores v_7, as indicated in Fig. 6b, and receives a copy of the blocked vertices Blk_T1 = {v_4, v_5, v_6} discovered by T1. The thread T1 blocked these vertices because they cannot participate in any simple cycle that begins with Π_1.
If T2 simply uses a copy of these blocked vertices Blk_T1 without modifications, T2 will be unable to find the cycle that contains v_4. Therefore, a method for unblocking vertices after copy-on-steal is required to correctly find all cycles.
We explore two solutions for this problem: (i) Copy-on-steal with complete unblocking. To enable the threads of our algorithm to find cycles after performing copy-on-steal, the stealing thread could unblock all vertices that the victim thread blocked after creating the stolen task. In our example given in Fig. 6, the stealing thread T2 unblocks all vertices Blk_T1 = {v_4, v_5, v_6} it received from the victim thread T1. Although this approach enables T2 to correctly find cycles, it fails to take advantage of the information collected by T1 to reduce the redundant work of T2. For instance, in Fig. 6, T2 visits v_5 and v_6, even though T1 already concluded that these vertices cannot participate in any simple cycle that begins with π = v_0 → v_1, where π is the largest common prefix of all the paths explored by T1 and T2. As a result, T2 redundantly visits the dotted part of the recursion tree given in Fig. 6b.

Algorithm 2: FGJ_copyOnSteal(d, T1, T2). Input: d, the depth of the task executing this function. InOut: T1, the victim thread, and T2, the stealing thread.

Algorithm 3: The fine-grained parallel Johnson algorithm. Input: G, the input graph with vertices V and edges E. A parallel loop over the edges of E creates the root tasks, where the thread T0 executing a loop iteration maintains Π_T0, Blk_T0, Blist_T0, and Mutex_T0.
(ii) Copy-on-steal with recursive unblocking. In this approach, the stealing thread capitalises on the information already discovered by the victim thread. The stealing thread T2 can reuse a subset B ⊂ Blk_T1 of the blocked vertices discovered by T1 if the vertices in B cannot participate in simple cycles that begin with π, where π is the largest common prefix of all the paths explored by T1 and T2. Because any path discovered by T2 begins with π, T2 can avoid visiting the vertices from B. Thus, to correctly find simple cycles, it is sufficient for T2 to unblock the vertices from Blk_T1 \ B.
To achieve this behaviour, T2 invokes the recursive unblocking procedure of the Johnson algorithm for every vertex v ∈ Π_T1 \ π, as shown in Algorithm 2, where Π_T1 is the path T1 is exploring at the moment of task stealing. The vertices in B can only be unblocked by a recursive unblocking invoked for a vertex of B; hence, the vertices in B remain blocked. In the example given in Fig. 6, T2 invokes the recursive unblocking procedure for Π_T1 \ π = {v_2}, which results in the unblocking of v_4. Thus, T2 is able to discover a cycle that contains v_4. The vertices B = {v_5, v_6} will not be unblocked because they cannot take part in any simple cycle that begins with π = v_0 → v_1. Therefore, thread T2 avoids visiting the dotted part of the recursion tree given in Fig. 6b.
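The copy-on-steal bookkeeping described above can be sketched as follows. The state representation (a `(path, blocked, blist)` triple) and the function name are our own simplifications of Algorithm 2, not the paper's code:

```python
import copy

def copy_on_steal(victim_state, common_prefix_len):
    """The stealing thread receives a deep copy of the victim's
    (path, blocked, blist) and recursively unblocks every vertex of the
    victim's current path beyond the largest common prefix of the two
    threads' paths; vertices blocked only for the victim's deeper
    exploration thereby become available again, while the rest of the
    blocked set is reused."""
    path, blocked, blist = copy.deepcopy(victim_state)

    def unblock(v):
        blocked.discard(v)
        for w in blist.pop(v, []):
            if w in blocked:
                unblock(w)

    for v in path[common_prefix_len:]:
        unblock(v)
    return path[:common_prefix_len], blocked, blist
```

Replaying the Fig. 6 example schematically: if the victim's path is [v_0, v_1, v_2] with blocked set {v_2, v_4, v_5, v_6}, v_4 waiting on v_2 in blist, and a common prefix of length 2, then the steal unblocks v_2 and, recursively, v_4, while v_5 and v_6 stay blocked.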
Without countermeasures, our algorithm can suffer from race conditions because its data structures can be accessed concurrently by different threads. For instance, a stealing thread T2 can copy the data structures of a victim thread T1 while T1 performs a recursive unblocking, in which case T2 could receive a vertex set Blk_T1 that is only partially unblocked.
When using copy-on-steal with recursive unblocking, T2 may not be able to continue the interrupted unblocking of Blk_T1, causing the algorithm to miss certain cycles. To avoid this problem, we define critical sections in lines 15-21 of Algorithm 1 and in lines 1-3 of Algorithm 2 using coarse-grained locking, maintaining one mutex per thread. However, such a locking mechanism is not required when using copy-on-steal with complete unblocking because T2 can correctly unblock vertices in Blk_T1 simply by removing all vertices inserted into Blk_T1 after the stolen task was created. Thus, it is sufficient to enable thread-safe operations on Π, Blk, and Blist using fine-grained locking. As a result, the critical sections are shorter when the copy-on-steal with complete unblocking approach is used.
Nevertheless, we opt to use the copy-on-steal with recursive unblocking approach in our fine-grained parallel Johnson algorithm because this approach leads to less redundant work and rarely suffers from synchronisation overheads.The pseudocode of our fine-grained parallel Johnson algorithm is given in Algorithm 3.

Theoretical analysis
We now show that the fine-grained parallel Johnson algorithm is scalable but not work-efficient.
Theorem 2. The fine-grained parallel Johnson algorithm is not work-efficient.
Proof. According to Lemma 3 presented by Johnson [35], a vertex cannot be unblocked more than once unless a cycle is found, and once a vertex is visited, it can be visited again only after being unblocked. Thus, the Johnson algorithm visits each vertex and edge at most c times. In the fine-grained parallel Johnson algorithm executed using p threads, each thread maintains a separate set of data structures used for managing blocked vertices. Because the threads are unaware of each other's blocked vertices, each edge is visited at most c · p times, c times by each thread.
Additionally, an edge cannot be visited more than s times because, in the worst case, each maximal simple path of the graph is explored by a different thread, and during each simple path exploration, an edge is visited at most once. Therefore, the maximum number of times an edge can be visited by the fine-grained parallel Johnson algorithm is min{c · p, s}.
Because the algorithm executes in O(n + e) time if there does not exist a cycle or a path in the input graph, the work performed by the fine-grained parallel Johnson algorithm is W_p(n) = O((min{c · p, s} + 1)(n + e)). When c > 0, p > 1, and s > c, the work W_p(n) performed by the fine-grained parallel Johnson algorithm is greater than the execution time T_1(n) of the sequential Johnson algorithm. Thus, this algorithm is not work-efficient. □

The work inefficiency of our fine-grained parallel Johnson algorithm occurs if more than one thread performs the work that the sequential Johnson algorithm would perform between the discovery of two cycles. This behaviour can be illustrated using the graph from Fig. 5a, which contains c = 4 cycles and s = 4 × 2^{k−1} + 3 maximal simple paths, each starting from vertex v_0. When discovering each cycle, our fine-grained algorithm explores an infeasible region of the recursion tree, as shown in Fig. 5b, in which the vertices w_1, . . ., w_k are visited. If this infeasible region is explored using a single thread, each vertex w_i, with i ∈ {1, . . ., k}, will be visited exactly once. However, if p threads are exploring the same infeasible region of the recursion tree, the vertices w_1, . . ., w_k will be visited up to p times because the threads are unaware of each other's blocked vertices. In this case, the fine-grained parallel Johnson algorithm performs more work than necessary, and, thus, it is not work-efficient. Additionally, each infeasible region of the recursion tree that visits the vertices w_1, . . ., w_k can be executed by at most s/c = 2^{k−1} threads because there are 2^{k−1} maximal simple paths in such a region.

For the fine-grained parallel Johnson algorithm to be scalable, it is sufficient for the depth T_∞(n) to increase sublinearly with T_1(n).
Even though this algorithm is scalable, strong or weak scalability is not guaranteed due to its work inefficiency. Nevertheless, our experiments show that this algorithm is strongly scalable in practice (see Fig. 18).

Summary
Our relaxation of the strictly depth-first-search-based recursion-tree exploration reduces the pruning efficiency of the Johnson algorithm. In the worst case, the fine-grained parallel Johnson algorithm could perform as much work as the brute-force Tiernan algorithm does, i.e., O(s(n + e)). However, in practice, this worst-case scenario does not occur (see Section 8). In addition, our fine-grained parallel Johnson algorithm can suffer from synchronisation issues in some rare cases (see Section 8) because our copy-on-steal mechanism can lead to long critical sections. In the next section, we introduce a fine-grained parallel algorithm that is scalable, work-efficient, and less prone to synchronisation issues.

FINE-GRAINED PARALLEL READ-TARJAN
In this section, we first introduce several optimisations that reduce the number of unnecessary vertex visits performed by the sequential Read-Tarjan algorithm. Then, we present our fine-grained parallel Read-Tarjan algorithm, which includes these optimisations. Finally, we show that our parallel algorithm is work-efficient and strongly scalable.

Improvements to the pruning efficiency
To improve the pruning efficiency of the sequential Read-Tarjan algorithm, we include the following optimisations: (i) Blocked vertex set forwarding enables a recursive call of the Read-Tarjan algorithm to reuse the vertices blocked by its parent call, resulting in fewer vertex visits. The original Read-Tarjan algorithm discards blocked vertices after each recursive call [60], even though this information could be reused later. In this optimisation, the algorithm forwards the blocked vertices Blk of a recursive call to its child recursive calls, preventing those child calls from unnecessarily visiting the vertices in Blk again. For example, in Fig. 7, the vertex v8 is blocked the first time the algorithm visits it while exploring the path extension E1. This optimisation prevents the algorithm from visiting v8 again when exploring the same extension E1 or another extension E3 that branches from E1. As a result of this optimisation, the algorithm can avoid the dotted part of the recursion tree.
(ii) Path extension forwarding prevents recomputation of the path extension E found by a parent recursive call by forwarding this path extension to its child recursive calls. In this way, each child recursive call performs one fewer DFS invocation than in the original Read-Tarjan algorithm [60].
(iii) Blocking on a successful DFS is another mechanism for discovering vertices to be blocked. As a reminder, the Read-Tarjan algorithm searches for path extensions using a DFS. In the original algorithm, a vertex is blocked only if it is visited during an unsuccessful DFS invocation, which fails to discover a path extension. However, successful DFS invocations could also visit some vertices that have all their neighbours blocked. Such vertices cannot lead to the discovery of new cycles and, thus, can also be blocked. The pseudocode of the DFS function that includes this optimisation is given in Algorithm 4. In our example given in Fig. 7, a successful DFS invoked from v3 finds a path extension E3 and discovers that the only neighbour v8 of v7 is blocked. The algorithm then blocks v7, which enables it to avoid visiting v7 again when exploring E3. Therefore, fewer vertices are visited during the execution of the algorithm.
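The blocking rule of optimisation (iii) can be sketched as follows. This is a simplified illustration with hypothetical names (a graph as an adjacency dictionary), not the paper's Algorithm 4; a vertex is blocked when none of its outgoing branches can reach the target, which also blocks dead-end side branches visited by an otherwise successful DFS:

```python
def find_path_extension(graph, v, target, on_path, blocked):
    """Search for a simple path (a "path extension") from v to target that
    avoids vertices already on the current path and blocked vertices."""
    for w in graph.get(v, ()):
        if w == target:
            return [v, target]                 # successful DFS: extension found
        if w in on_path or w in blocked:
            continue
        ext = find_path_extension(graph, w, target, on_path | {v}, blocked)
        if ext is not None:
            return [v] + ext
    # Every branch from v failed, so v cannot lead to a new cycle: block it.
    # This fires for unsuccessful DFS invocations and, notably, for dead-end
    # side branches explored during a successful one (optimisation iii).
    blocked.add(v)
    return None
```

In the actual algorithm, the blocked set would additionally be forwarded between recursive calls (optimisation i) rather than recomputed per exploration.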

Fine-grained parallelisation
Although the optimisations presented in Section 6.1 eliminate some of the redundant work performed by the Read-Tarjan algorithm, this algorithm typically performs more work than the Johnson algorithm (see Section 3.4). However, it is this redundancy that makes it possible to parallelise the Read-Tarjan algorithm in a scalable and work-efficient manner.
Because the Read-Tarjan algorithm allocates a new Blk set for each path extension exploration, a recursive call can explore different path extensions in an arbitrary order. In addition, the discovery of a new path extension E results in the invocation of a single recursive call, and these calls can be executed in an arbitrary order. As a result, several threads can concurrently explore different paths of the same recursion tree constructed by the Read-Tarjan algorithm for a given starting edge. There are neither data dependencies nor ordering requirements between different calls, apart from those that exist between a parent and a child. To exploit the parallelism available during the recursion-tree exploration, we execute each path extension exploration of each recursive call as a separate task, all of which can be independently executed. Examples of such tasks are shown in Fig. 7. We refer to the resulting algorithm as the fine-grained parallel Read-Tarjan algorithm.

Algorithm 5: FGRT_task(v, v0, E, d, T1)
Input: v - the current vertex, v0 - the starting vertex, E - the path extension from v to v0, d - the depth of this task
InOut: T1 - the thread that created this task ⊲ T1 maintains Π_T1 and Blk_T1
1 T2 = the thread executing this task; ⊲ T2 maintains Π_T2 and Blk_T2
Our implementation shown in Algorithm 5 performs only a single path extension exploration in a recursive call and uses all the optimisations we introduced in Section 6.1. We execute each such recursive call as a separate task using a dynamic thread scheduling framework (see Section 3.2). To find all cycles of a graph, we execute a parallel for loop iteration for each edge v0 → v that uses Algorithm 4 to search for a path extension E from v to v0, as shown in Algorithm 6. If such an extension E exists, a task is created using v, v0, and E as its input parameters. This task then recursively creates new tasks, as shown in lines 14 and 19 of Algorithm 5, until all cycles that start with the edge v0 → v have been discovered.
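To illustrate why recursion-tree nodes can be treated as independent tasks, the following toy sketch (not Algorithm 5; a plain DFS-based enumeration with a hypothetical task queue) drains a deque of tasks in arbitrary order. A work-stealing scheduler such as TBB could hand the same tasks to different threads:

```python
from collections import deque

def enumerate_cycles_as_tasks(graph, v0):
    """Each task extends one simple path starting at v0; tasks share no
    mutable state, so their execution order is irrelevant."""
    cycles = []
    tasks = deque([(v0, [v0])])        # task = (current vertex, path so far)
    while tasks:
        v, path = tasks.pop()          # a scheduler could run this on any thread
        for w in graph.get(v, ()):
            if w == v0:
                cycles.append(path + [v0])        # cycle closed
            elif w not in path:
                tasks.append((w, path + [w]))     # spawn an independent child task
    return cycles
```

Swapping the deque for a per-thread work-stealing pool changes only where tasks run, not what they compute, which is the property the fine-grained parallelisation relies on.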
To prevent different threads from concurrently modifying Π and Blk, each task allocates and maintains its own Π and Blk sets. A task can receive a copy of Π and Blk directly from its parent task at the time of task creation. However, it is possible to minimise the copy overheads by copying these sets only when a task is stolen. For this purpose, we use the copy-on-steal with complete unblocking approach described in Section 5.2, which has shorter critical sections than the copy-on-steal with recursive unblocking approach used by our fine-grained parallel Johnson algorithm.
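The copy-on-steal idea reduces to the following sketch (a hypothetical class, not the paper's implementation): a task shares its creator's Π and Blk and pays the copy cost only when the executing thread is not the creator:

```python
class CopyOnStealTask:
    def __init__(self, state, creator_id):
        self.state = state              # Π and Blk of the creating thread, shared
        self.creator_id = creator_id

    def run(self, thread_id):
        if thread_id != self.creator_id:
            # Stolen by another thread: copy the state now, inside what would
            # be a short critical section in the real scheduler.
            self.state = dict(self.state)
        return self.state
```

Because most tasks are executed by the thread that created them, the common case avoids the copy entirely.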

Theoretical analysis
We now show that the fine-grained parallel Read-Tarjan algorithm is both work-efficient and strongly scalable.

Proof. Because each task of our fine-grained parallel Read-Tarjan algorithm either discovers a cycle or creates at least two child tasks, our algorithm is executed using O(c) tasks. Each task performs several unsuccessful DFS invocations and one successful DFS per child task it creates. All unsuccessful DFS invocations explore at most e edges in total because they share the same set of blocked vertices. In the worst case, each edge is visited twice per task, once by a successful DFS and once by one of the unsuccessful DFS invocations. Thus, this algorithm performs O(e) work per task. Because this algorithm performs O(n + e) work if there are no cycles in the graph, the total amount of work this algorithm performs is T_P(n) = O(n + e + ce). Hence, this algorithm is work-efficient based on Definition 1. □

The work-efficiency of our fine-grained parallel Read-Tarjan algorithm can be demonstrated using the example given in Fig. 5a. In this example, the threads of this algorithm independently explore four different path extensions. A thread exploring a path extension invokes a DFS from v2, which explores the vertices v1, . . ., vk at most once and fails to find any other path extension. Therefore, the amount of work the fine-grained parallel Read-Tarjan algorithm performs does not increase compared to its single-threaded execution.
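Using n for vertices, e for edges, and c for cycles (symbols reconstructed from context, so treat this as a back-of-the-envelope restatement rather than the paper's exact derivation), the work bound tallies as:

```latex
T_P(n) \;=\; \underbrace{O(n+e)}_{\text{base cost, no cycles}}
       \;+\; \underbrace{O(c)}_{\text{number of tasks}} \cdot \underbrace{O(e)}_{\text{work per task}}
       \;=\; O(n + e + ce)
```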
Lemma 2. The depth T_∞(n) of the fine-grained parallel Read-Tarjan algorithm is in O(n(n + e)).
Proof. In the worst case, a thread executing this algorithm creates a task for each vertex of its longest simple cycle, which has a length of at most n. Before invoking its first child task, a task executes a sequence of unsuccessful DFS invocations in O(n + e) and a successful DFS invocation, also in O(n + e). Thus, the depth of this algorithm is O(n(n + e)). □ The worst-case depth of our algorithm can be observed when this algorithm is executed on the graph given in Fig. 4a.
This graph has c = 2^(n−2) cycles, and the length of its longest cycle v0 → . . . → v_(n−1) → v0 is n. The algorithm creates a task for each vertex of the cycle and performs a successful DFS in each such call, which leads to T_∞ ∈ O(n(n + e)).

Summary
The work of our fine-grained parallel Read-Tarjan algorithm does not increase after fine-grained parallelisation. This parallel algorithm performs T_P(n) ∈ O(n + e + ce) work: the same as the work performed by its serial version. Our optimisations introduced in Section 6.1 do not reduce the work T_P(n) performed by our parallel algorithm in the worst case. However, these optimisations significantly improve its performance in practice (see Section 8.4). In addition, the synchronisation overheads of the fine-grained parallel Read-Tarjan algorithm are not as significant as those of the fine-grained Johnson algorithm because of its shorter critical sections. Furthermore, this algorithm is the only asymptotically-optimal parallel algorithm for cycle enumeration for which we are able to prove strong scalability.

PARALLELISING CONSTRAINED CYCLE SEARCH
This section describes the methods for adapting our parallel algorithms to search for simple cycles under various constraints. Because the state-of-the-art algorithms for temporal and hop-constrained cycle enumeration are extensions of the Johnson algorithm [39, 54], our parallelisation approach described in Section 5 is also applicable to these algorithms.
In this section, we describe the changes to the fine-grained parallel Johnson algorithm needed for enumeration of temporal and hop-constrained cycles.We also introduce modifications to the cycle enumeration algorithms required for finding time-window-constrained cycles.

Cycles in a time window
Cycle enumeration algorithms require minimal modifications to support time-window constraints. Such constraints restrict the search for simple, temporal, and hop-constrained cycles to those that occur within a time window of a given size δ, as illustrated in Fig. 2. To find time-window-constrained cycles that start with an edge that has timestamp t0, only the edges with timestamps that belong to the time window [t0, t0 + δ] are visited. To avoid reporting the same cycle several times, another edge with the same timestamp t0 is visited only if the source vertex of that edge has an ID smaller than the ID of the vertex from which the search for cycles was started. Overall, imposing time-window constraints reduces the number of cycles discovered, which results in a more tractable problem.
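The window filter and the duplicate-avoidance rule can be sketched as follows (a hypothetical helper, not the paper's code; edges are (source, destination, timestamp) triples with integer vertex IDs):

```python
def edges_in_window(edges, start_vertex, t0, delta):
    """Keep only edges visitable when enumerating cycles whose starting edge
    leaves start_vertex at timestamp t0, within a window of size delta."""
    kept = []
    for src, dst, t in edges:
        if not (t0 <= t <= t0 + delta):
            continue                      # outside the window [t0, t0 + delta]
        if t == t0 and src > start_vertex:
            continue                      # same start timestamp: allow only
                                          # smaller source IDs, so each cycle
                                          # is reported exactly once
        kept.append((src, dst, t))
    return kept
```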
Because temporal graphs may contain parallel edges, several simple cycles that contain the same sequence of vertices may exist in a temporal graph (see Fig. 2a). As a result, such cycles may be discovered simultaneously, which could accelerate the search for cycles in temporal graphs. For that purpose, we use a method similar to that of Kumar and Calders [39], in which several simple cycles that contain the same sequence of vertices and whose edges belong to the same time window are explored together.
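The grouping can be sketched as follows (a simplified representation in which each cycle is a vertex sequence plus the timestamps of one edge combination; not the paper's data structure):

```python
from collections import defaultdict

def bundle_cycles(vertex_cycles):
    """Group cycles that share the same vertex sequence into bundles; each
    bundle can be explored in one pass, and its size is the number of
    parallel-edge combinations it represents."""
    bundles = defaultdict(list)
    for vertex_seq, timestamps in vertex_cycles:
        bundles[tuple(vertex_seq)].append(timestamps)
    return dict(bundles)
```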

When a thread steals a task, it restores the closing time of each vertex removed from the current path to its original value recorded in the data structure PrevLocks. Note that the original closing time of v3 was previously set by T1 while exploring the path v0 → v1 → v3.
The recursive unblocking that T2 invokes for v3 unblocks only the edge v6 → v3 because it is the only incoming edge of v3 with a timestamp smaller than the closing time 9 of the vertex v3. Without recording the previous closing times, T2 could instead unblock all incoming edges of v3 by invoking recursive unblocking for v3 with a closing time ∞, which also unblocks the edges v6 → v7 and v7 → v3. However, because there is no temporal cycle that contains these two edges and starts with v0, T2 would unnecessarily visit them in this case. Thus, restoring the closing time of v3 to its original value 9 prevents T2 from performing this redundant work.
We also adapt the Read-Tarjan algorithm and its fine-grained and coarse-grained versions to enumerate temporal cycles using closing times. The necessary changes to the algorithm are trivial, and we omit them for brevity.
To reduce the number of vertices visited during the search for temporal cycles, we use a method similar to the SCC-based technique discussed in Section 7.1. Instead of computing an SCC for each edge e, we compute a cycle-union that represents the intersection of the temporal ancestors and the temporal descendants of e. The temporal descendants and the temporal ancestors of e are the vertices that belong to the temporal paths in which e is the first edge and the last edge, respectively. Defined as such, a cycle-union contains only the vertices that participate in temporal cycles that have e as their starting edge. Thus, the search for temporal cycles that start with e can be limited to those vertices.
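Ignoring the temporal ordering for brevity, the cycle-union of an edge can be sketched with plain reachability (a hypothetical helper; `rgraph` is the reversed adjacency list). The actual method restricts both searches to temporal paths:

```python
def cycle_union(graph, rgraph, edge):
    """Vertices reachable from the edge's head AND able to reach its tail;
    only these can lie on a cycle whose first edge is `edge`."""
    src, dst = edge

    def reach(g, s):
        # Iterative DFS collecting every vertex reachable from s in g.
        seen, stack = {s}, [s]
        while stack:
            v = stack.pop()
            for w in g.get(v, ()):
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        return seen

    return reach(graph, dst) & reach(rgraph, src)
```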

Hop-constrained cycles
An efficient algorithm for enumerating hop-constrained cycles and paths, called BC-DFS [54], replaces the blocked-vertex set of the Johnson algorithm with per-vertex barrier values, which are relaxed through recursive unblocking: when the barrier of a vertex v is reduced to a value bar′, this procedure also sets the barrier of any vertex u that can reach v in k hops to bar′ + k if the previous barrier of u was greater than bar′ + k. Maintaining barriers in such a way minimises redundant vertex visits when searching for hop-constrained cycles.

To parallelise BC-DFS in a fine-grained manner, we use the same technique as that used for the fine-grained parallelisation of the Johnson algorithm (Section 5) and the 2SCENT algorithm (Section 7.2). In this case, the threads exploring a recursion tree of BC-DFS maintain separate data structures, such as the current path Π and the barrier values of the vertices, and use the copy-on-steal with recursive unblocking approach to copy these data structures among threads. Similarly to our algorithm from Section 7.2, each thread also maintains a data structure PrevLocks that records the original barrier value of each vertex v on Π. When a thread steals a task, it performs a recursive unblocking procedure for each vertex v removed from Π using its original barrier value obtained from PrevLocks, as shown in Algorithm 7. This procedure reduces the barrier values of the vertices that can reach v, enabling the stealing thread to visit those vertices. We refer to the resulting algorithm as the fine-grained parallel hop-constrained Johnson algorithm.
The modified copy-on-steal with recursive unblocking approach given in Algorithm 7 enables a stealing thread of the aforementioned fine-grained parallel algorithm to reuse barriers discovered by other threads. This behaviour can be observed in the example given in Fig. 9. In that example, the thread T1 first visits the vertices v2, v6, v7, and v8 and sets the barrier value of each visited vertex to L − |Π| + 1 (values in red shown in Fig. 9a) because it was not able to find a cycle of length L = 6 [54]. Here, |Π| denotes the length of Π at the moment of the exploration of each vertex. When the thread T2 steals the task indicated in Fig. 9b from T1, the copy-on-steal mechanism executed by T2 performs a recursive unblocking of the vertex v1 using the original barrier value 0 of v1 obtained from PrevLocks. This recursive unblocking reduces the barrier value of v2 from 4 to 1, which enables T2 to find the cycle that contains v2. The barrier values of the vertices v6, v7, and v8 are not modified, and, thus, the thread T2 avoids visiting these vertices unnecessarily.
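The barrier-lowering step of the recursive unblocking can be sketched as follows (a hypothetical helper over a reversed adjacency list; the real Algorithm 7 additionally restores values from PrevLocks). Lowering v to bar′ propagates bar′ + k to every vertex that reaches v in k hops, but only where this improves the stored barrier, which is why v6, v7, and v8 keep their barriers in the example above:

```python
def lower_barrier(rgraph, barrier, v, new_bar):
    """Lower v's barrier to new_bar and propagate: a vertex u with an edge
    u -> v can reach the starting vertex in at most new_bar + 1 hops via v."""
    if barrier.get(v, 0) <= new_bar:
        return                       # existing barrier is already at least as good
    barrier[v] = new_bar
    for u in rgraph.get(v, ()):      # u reaches v in one hop
        lower_barrier(rgraph, barrier, u, new_bar + 1)
```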

Summary
In this section, we described a method to adapt cycle enumeration algorithms, such as our fine-grained algorithms introduced in Sections 5 and 6, to search for cycles under time-window constraints. In addition, we introduced a modified version of our copy-on-steal with recursive unblocking approach, introduced in Section 5, that supports fine-grained parallelisation of the temporal and hop-constrained cycle enumeration algorithms [39, 54] derived from the Johnson algorithm. As a result, our fine-grained parallel algorithms can enumerate cycles under time-window, temporal, and hop constraints.

EXPERIMENTAL EVALUATION
This section evaluates the performance of our fine-grained parallel algorithms for simple, temporal, and hop-constrained cycle enumeration. As Table 2 shows, we are the only ones to offer fine-grained parallel versions of the asymptotically-optimal cycle enumeration algorithms, such as the Johnson and the Read-Tarjan algorithms. However, the methods covered in Table 2 can be parallelised using the coarse-grained approach covered in Section 4. Thus, we use the coarse-grained approach as our main comparison point.
The experiments are performed using two different clusters: Intel KNL [64] and Intel Xeon Skylake [26]. The details of these two clusters are given in Table 5. We developed our code on the Intel KNL cluster and ran most of the analyses there; yet, for completeness, we also provide comparisons to competing implementations on the Intel Xeon Skylake cluster available in Google Cloud's Compute Engine [26]. Scalability experiments are conducted on the Intel KNL cluster. In these experiments, the data points that use 64 threads or fewer are executed on a single Intel KNL processor; two processors are used to execute the data points that use 128 threads; and all four processors are used otherwise.
Furthermore, we use more than one thread per core only if the number of threads used is greater than 256.
We use the Threading Building Blocks (TBB) [38] library to parallelise the algorithms on a single processor.We distribute the execution of the algorithms across multiple processors using the Message Passing Interface (MPI) [17].
When using distributed execution, each processor stores a copy of the input graph in its main memory and searches for cycles starting from a different set of graph edges. The starting edges are divided among the processors such that, when the edges are ordered in ascending order of their timestamps, consecutive edges in that order are assigned to different processors in a round-robin fashion. Each processor then uses its own dynamic scheduler to balance the workload across its hardware threads. In this setup, workload imbalance across processors may still occur, but its impact is limited in our experiments because we use at most five processors.
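The distribution of starting edges can be sketched as a round-robin deal over timestamp-sorted edges (a hypothetical helper, not the MPI code):

```python
def assign_starting_edges(edges, n_procs):
    """Sort (src, dst, t) edges by timestamp and deal them out round-robin,
    spreading temporally adjacent starting edges across processors."""
    by_time = sorted(edges, key=lambda e: e[2])
    return {rank: by_time[rank::n_procs] for rank in range(n_procs)}
```

Since temporally adjacent starting edges tend to have similar search costs, dealing them to different ranks keeps the per-processor workloads roughly balanced.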
We perform the experiments using the graphs listed in Table 6. The TR, FR, and MS graphs are from Harvard Dataverse [33], the NL graph is from Konect [40], the AML graph is from the AML-Data repository [5], and the rest are from SNAP [42]. Except for BA and BO, all of the graphs have parallel edges, as shown in Table 6. To make cycle enumeration problems tractable, we use time-window constraints in all of our experiments. The time window sizes used in our experiments are given in the figures next to the graph names. We stop the execution of an algorithm if it takes more than 24 h on the Intel KNL cluster or more than 6 h on the Intel Xeon Skylake cluster.

Temporal cycle enumeration
The goal of temporal cycle enumeration is to find all simple cycles with edges ordered in time. Here, we evaluate the performance of our fine-grained parallel algorithms for this problem introduced in Section 7.2. Our main comparison points are the coarse-grained parallel versions of the state-of-the-art algorithms. We refer to the backtracking phase of the state-of-the-art 2SCENT algorithm [39] for temporal cycle enumeration as the temporal Johnson algorithm and parallelise it in a coarse-grained manner for the experiments. We do not parallelise the entire 2SCENT algorithm because the preprocessing phase of 2SCENT is strictly sequential and has a time complexity in the order of the complexity of its backtracking phase. We also provide direct comparisons with the 2SCENT algorithm. Fig. 10 shows that our fine-grained parallel algorithms achieve an order of magnitude speedup compared to the coarse-grained algorithms on the Intel KNL cluster. For the NL graph, this speedup reaches up to 40×. Because the Intel Xeon Skylake cluster contains fewer physical cores than the Intel KNL cluster, the speedup between our fine-grained and the coarse-grained parallel Johnson algorithms is smaller on the former cluster. As can be observed in Fig. 11, this speedup increases as we increase the time window size used in the algorithms. Note that enumerating cycles in longer time windows is more challenging because longer time windows contain a larger number of cycles.

Fig. 12 shows the number of temporal cycles enumerated in the experiments shown in Fig. 10 and their frequency distribution for the given cycle length. The execution time of the cycle enumeration algorithms typically depends on the number of cycles discovered. However, due to the existence of parallel edges, many cycles may consist of the same sequence of vertices and can be explored simultaneously by grouping such cycles into a cycle bundle [39]. For example, in the cases of the CO, TR, and MS graphs, a cycle bundle discovered by our algorithms contains more than 10 M cycles on average. For this reason, despite discovering several orders of magnitude more cycles in the CO, TR, and MS graphs than in the other graphs, the execution time of our fine-grained algorithms on the CO, TR, and MS graphs is comparable to their execution time on the other graphs. In addition, in the cases of BA, BO, FR, and NL, where one temporal cycle per cycle bundle is discovered, our fine-grained algorithms are more time-consuming on the NL graph because more cycles are discovered in the NL graph than in the BA, BO, and FR graphs. Thus, the execution time of our fine-grained algorithms depends more on the number of cycle bundles explored than on the number of cycles.
The scalability evaluation of the parallel temporal cycle enumeration algorithms is given in Fig. 13. We also report the performance of the sequential 2SCENT algorithm in the same figure. The performance of our fine-grained parallel algorithms improves linearly until 256 threads, after which it becomes sublinear due to simultaneous multithreading. As a result, our fine-grained versions of the Johnson and the Read-Tarjan algorithms reach 435× and 470× speedups, respectively, compared to their serial versions. Additionally, when using 1024 threads, our fine-grained Johnson algorithm is on average 260× faster than 2SCENT when 2SCENT completes in 24 hours. On the other hand, the coarse-grained Johnson algorithm does not scale as well as the fine-grained algorithms. As a result, the performance gap between the fine-grained and the coarse-grained algorithms increases as we increase the number of threads. Overall, the fastest algorithm for temporal cycle enumeration that we tested is our fine-grained Johnson algorithm, which is, on average, 60% faster than our fine-grained Read-Tarjan algorithm. When using 1024 threads, both fine-grained algorithms are an order of magnitude faster than their coarse-grained counterparts. Moreover, our fine-grained parallel algorithms, executed on the Intel KNL cluster using 1024 threads, are two orders of magnitude faster than the state-of-the-art algorithm 2SCENT [39].

Hop-constrained cycle enumeration
In hop-constrained cycle enumeration, we search for all simple cycles in a graph that are shorter than the specified hop constraint. Here, we compare our fine-grained parallel hop-constrained Johnson algorithm, introduced in Section 7.3, with the state-of-the-art algorithms BC-DFS and JOIN [54] for this problem. For this evaluation, we parallelised BC-DFS and JOIN in the coarse-grained manner. Because adapting the Read-Tarjan algorithm to enumerate hop-constrained cycles is not trivial, we do not report the performance of the fine-grained and coarse-grained versions of this algorithm. We also omit the performance results for the MS graph because our fine-grained algorithm did not finish under 12 h when using the smallest time window size.

Fig. 14. Performance of parallel algorithms for hop-constrained simple cycle enumeration on (a) the Intel KNL cluster using 1024 threads and (b) the Intel Xeon Skylake cluster using 480 threads. The values above the bars show the execution time of the coarse-grained parallel algorithms relative to that of our fine-grained parallel algorithm. The values that contain the symbol > represent the experiments that did not finish within the given time limit. Larger hop constraints increase the performance gap between the two algorithms.

Fig. 14 shows that our fine-grained parallel algorithm is, on average, more than 10× faster than the coarse-grained parallel BC-DFS algorithm for the two largest hop constraints tested. When using a hop constraint less than or equal to ten, the coarse-grained parallelisation approach is able to achieve workload balance across cores, and thus its performance is similar to that of our fine-grained approach in this case. As we increase the hop constraint, the probability of encountering deeper recursion trees also increases. Exploring such trees using the coarse-grained approach leads to workload imbalance (see Section 4). Our fine-grained algorithm is designed to resolve this problem by exploring a recursion tree using several threads. Therefore, increasing the hop constraint increases the speedup of our fine-grained algorithm with respect to the coarse-grained algorithm.
According to Fig. 15, the number of cycles increases exponentially as the hop constraint is increased. Thus, increasing the hop constraint can lead to an exponential increase in the execution time of our fine-grained parallel algorithm for hop-constrained cycle enumeration, which can be observed in Fig. 14. Note that Figs. 12 and 19 indicate that the frequency distributions of the cycles have a bell shape. As a result, the increase in the number of cycles with increasing hop constraints shown in Fig. 15 may not be exponential when the hop constraint is increased beyond 20.

Simple cycle enumeration
Here, we also compare against a fine-grained parallelisation of the Tiernan algorithm, which serves as a stand-in for the previous algorithm by Qing et al. [57] (see Table 2 and Section 2). We parallelise the Tiernan algorithm in a fine-grained manner by wrapping each recursive call of this algorithm into a task and by using a dynamic task scheduler to balance the workload across the threads. Note that the algorithm by Qing et al. [57] uses a static load balancing mechanism, which makes it less efficient than our fine-grained parallelisation of the Tiernan algorithm.
As we can see in Fig. 17, our fine-grained parallel algorithms achieve an order of magnitude average speedup compared to the coarse-grained parallel algorithms on two different platforms. The reason for this speedup is the better scalability of our fine-grained algorithms, which we demonstrate in Fig. 18. Similarly to the cases of temporal and hop-constrained cycle enumeration (see Figs. 13 and 16), our fine-grained parallel algorithms scale linearly with the number of physical cores used, whereas the coarse-grained parallel Johnson algorithm does not scale as well. Thus, the speedup of the fine-grained over the coarse-grained algorithms increases as more threads are utilised.
Fig. 19 shows the number of simple cycles enumerated in the experiments shown in Fig. 17 and their frequency distribution for the given cycle length. Similarly to temporal cycle enumeration, the execution time mainly depends on the number of cycle bundles explored rather than on the number of cycles enumerated. For example, each cycle bundle explored in the BA, BO, SU, FR, and NL graphs contains only two or fewer simple cycles on average, and the execution time of our fine-grained Johnson algorithm is the longest for the NL graph, which also has the most reported cycles (see Fig. 19). Note that simple cycles tend to be longer than temporal cycles, despite the use of a smaller time window for simple cycle enumeration (see Fig. 12). Furthermore, despite BA and BO having similar sizes, the execution time of our fine-grained Johnson algorithm is an order of magnitude longer for BO than for BA. The reason for this difference is that more cycles were discovered in the BO graph than in the BA graph for the time window sizes given in Fig. 17. As a result, the execution time of simple cycle enumeration can vary significantly, even for graphs of similar sizes. The synchronisation overheads caused by the recursive unblocking of our fine-grained parallel Johnson algorithm (see Section 5.2) are visible only in the case of AML. In this case, the fine-grained parallel Johnson algorithm performs 60% fewer edge visits than the fine-grained parallel Read-Tarjan algorithm; however, it is 25% slower. These synchronisation overheads can be explained by a very low cycle-to-vertex ratio. Because a vertex is blocked if it cannot take part in a cycle, the probability of a vertex being blocked is higher when the cycle-to-vertex ratio is lower (see Table 6 and Fig. 19).
In consequence, more vertices are unblocked during the recursive unblocking of the fine-grained parallel Johnson algorithm, which increases its synchronisation overheads.

Thirdly, we have shown that, whereas our fine-grained parallel Read-Tarjan algorithm is work-efficient, our fine-grained parallel Johnson algorithm is not. In general, the former is competitive against the latter because of the new pruning methods we introduced, yet the latter outperforms the former in most experiments. In some rare cases, our fine-grained parallel Johnson algorithm can suffer from synchronisation overheads. In such cases, our fine-grained parallel Read-Tarjan algorithm offers a more scalable alternative.

Fig. 1. Per-thread execution time of (a) the coarse-grained Johnson algorithm vs. (b) our fine-grained Johnson algorithm using the WT graph and a 12 h time window. Thanks to perfect load balancing, our fine-grained method is 3× faster on a 64-core CPU.

Fig. 2. Two snapshots of a temporal graph associated with two different time windows of size δ = 5. The solid arrows indicate the edges that belong to the respective time windows.

Fig. 3. (a) An example graph and (b) the recursion tree constructed when searching for cycles that start from v0. The nodes of the recursion tree represent the recursive calls of the depth-first search. The dotted path of the right subtree is explored only by the Read-Tarjan algorithm.

Fig. 4. (a) A graph with an exponential number of simple cycles. (b) The recursion tree of the Johnson algorithm for n = 6, constructed when the algorithm starts from v0. Whereas a coarse-grained parallel algorithm explores the complete recursion tree using a single thread, our fine-grained parallel algorithms can explore different regions of the recursion tree in parallel using several threads.

Fig. 5. (a) An example graph and (b) the recursion tree of our fine-grained Johnson algorithm when enumerating cycles that start from v0. Each thread of our fine-grained Johnson algorithm explores the vertices v1, . . ., vk at most once.

Fig. 6. (a) An example graph and (b) the recursion tree of our fine-grained Johnson algorithm when enumerating simple cycles that start from v0. Here, X_Ti denotes a data structure X of the thread Ti. The thread T2 can prune the dotted part of the tree by avoiding v5 and v6, which the thread T1 has blocked after creating the task stolen by T2.

Fig. 7. (a) An example graph and (b) the recursion tree of our fine-grained parallel Read-Tarjan algorithm when enumerating cycles that start from v0. The nodes of the recursion tree represent the recursive calls of the depth-first search. Tasks shown in (b) can be executed independently of each other.

Theorem 3. The fine-grained parallel Johnson algorithm is scalable when s increases sublinearly with the number of threads T.

Theorem 4. The fine-grained parallel Read-Tarjan algorithm is work-efficient.
Fig. 8. (a) An example graph and (b) the recursion tree of our fine-grained temporal Johnson algorithm when enumerating temporal cycles that start from v_0. The thread T_2 can avoid the dotted part of the tree by reusing the blocked edges v_6 → v_7 and v_7 → v_3 discovered by T_1.
replaces the set of blocked vertices Blk in the Johnson algorithm with barriers. A barrier value bar of a vertex v indicates that the starting vertex v_0 of a cycle cannot be reached within bar hops from v. As a result, v is blocked if the length of the current path Π when the algorithm attempts to visit v is greater than or equal to L − bar, where L is the hop constraint. BC-DFS modifies the recursive unblocking of the Johnson algorithm so that it reduces the barrier bar of v to a specified value bar′ < bar.
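The barrier-based blocking test described above can be sketched in a few lines. This is a minimal illustration of the two rules stated in the text, not the full BC-DFS implementation; the function names and the dictionary representation of bar are assumptions.

```python
def is_blocked(v, path_len, bar, L):
    """BC-DFS blocking test: v is blocked when the current path length
    is >= L - bar[v], i.e., v_0 cannot be reached within the remaining
    hop budget."""
    return path_len >= L - bar[v]

def lower_barrier(v, bar, new_bar):
    """Relaxed unblocking: reduce bar[v] to a specified smaller value."""
    if new_bar < bar[v]:
        bar[v] = new_bar

# Hop constraint L = 6; vertex "b" has a barrier of 3.
bar = {"a": 0, "b": 3}
print(is_blocked("a", 4, bar, 6))  # False: 4 < 6 - 0
print(is_blocked("b", 4, bar, 6))  # True:  4 >= 6 - 3
lower_barrier("b", bar, 1)
print(is_blocked("b", 4, bar, 6))  # False after lowering: 4 < 6 - 1
```

Lowering a barrier (rather than fully unblocking) lets the search revisit a vertex only when a shorter prefix path makes reaching v_0 feasible again.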
Fig. 9. (a) An example graph and (b) the recursion tree of our fine-grained hop-constrained Johnson algorithm when enumerating cycles of length L = 6 that start from v_0. Barrier values of unmarked vertices are 0. Copy-on-steal enables the thread T_2 to reuse barriers discovered by the thread T_1 and to avoid exploring the dotted part of the tree.

Fig. 10. Performance of parallel algorithms for temporal cycle enumeration on (a) the Intel KNL cluster using 1,024 threads and (b) the Intel Xeon Skylake cluster using 480 threads. The values above the bars show the execution time of each algorithm relative to that of our fine-grained parallel temporal Johnson algorithm for the same benchmark. The values that contain the symbol > represent the experiments that did not finish within the given time limit.

Fig. 11. Longer time windows increase the performance gap between the algorithms. The algorithms are executed on the Intel KNL cluster using 1,024 threads. The numbers above the bars show the execution time of the coarse-grained algorithm relative to that of the fine-grained algorithm.

Fig. 12. (a), (b) Frequency distribution of temporal cycles for different cycle lengths and (c) the total number of temporal cycles discovered during the experiments shown in Fig. 10. The number of temporal cycles discovered is several orders of magnitude greater than the number of vertices or edges of a graph.

Fig. 13. Scalability evaluation of parallel temporal cycle enumeration algorithms executed on the Intel KNL cluster. The baseline is our parallel temporal Johnson algorithm. The relative performance of 2SCENT [39] is shown when it completes within 24 hours. Note that the 2SCENT implementation is single-threaded, and single-threaded execution results are not available for all graphs.

Fig. 15. (a), (b) Frequency distribution of hop-constrained cycles for different cycle lengths and (c) the total number of hop-constrained cycles discovered during the experiments shown in Fig. 14 using a hop constraint of 20. In most cases, the number of cycles increases exponentially with the hop constraint.

Fig. 17. Performance of parallel algorithms for simple cycle enumeration on (a) the Intel KNL cluster using 1,024 threads and (b) the Intel Xeon Skylake cluster using 480 threads. The values above the bars show the execution time of each algorithm relative to that of our fine-grained parallel Johnson algorithm for the same benchmark. The values that contain the symbol > represent the experiments that did not finish within the given time limit.

Fig. 19. (a), (b) Frequency distribution of simple cycles for different cycle lengths and (c) the total number of simple cycles discovered during the experiments shown in Fig. 17. Simple cycles tend to be longer than temporal cycles, despite using a smaller time window for simple cycle enumeration (see Fig. 12).


Fig. 20. Comparison of our fine-grained parallel algorithms for simple cycle enumeration with the fine-grained parallel Tiernan algorithm [68] on the Intel KNL cluster using 1,024 threads. The values above the bars show the execution time of each algorithm relative to that of our fine-grained parallel Johnson algorithm for the same benchmark. Our fine-grained parallel Johnson algorithm is up to 7× faster than the fine-grained parallel Tiernan algorithm.

Fig. 20 presents the comparison of our fine-grained parallel Johnson and Read-Tarjan algorithms with the Tiernan algorithm [68] parallelised in a fine-grained manner. Our fine-grained parallel Johnson algorithm is up to 7× faster than the fine-grained parallel Tiernan algorithm. The main reason for this performance gap is that the Tiernan algorithm performs more redundant work than the Johnson algorithm (see Section 3.4). In the case of NL, the fine-grained parallel Read-Tarjan algorithm is slower than the fine-grained parallel Tiernan algorithm because the redundant work performed by the algorithms is limited, and the Tiernan algorithm performs less work per visited edge than the Read-Tarjan algorithm. However, the fine-grained parallel Read-Tarjan algorithm can be up to 5.3× faster than the fine-grained parallel Tiernan algorithm for other benchmarks. Therefore, our fine-grained parallel Johnson and Read-Tarjan algorithms are preferable to parallel formulations of the Tiernan algorithm, such as the algorithm by Qing et al. [57].

Table 3. Summary of the notation used in the paper. A data structure X maintained by the thread T_i is denoted by X^{T_i}.
The difference between these algorithms is to what extent they reduce the redundant work performed during the recursive search, which we discuss next. The Tiernan algorithm [68] enumerates simple cycles using a brute-force search. It recursively extends a simple path Π by appending a neighbour u of the last vertex v of Π, provided that u is not already in Π. A clear downside of this algorithm is that it can repeatedly visit vertices that can never lead to a cycle. When searching for cycles in the graph shown in Fig. 3a starting from the vertex v_0, this algorithm would explore the path containing v_1, . . ., v_k 2k times. From each vertex a_i and b_i, with i ∈ {1, . . ., k}, the Tiernan algorithm would explore this path only to discover that it cannot lead to a simple cycle. As noted by Tarjan [67], the Tiernan algorithm explores every simple path and, consequently, all maximal simple paths of a graph. Exploring a maximal simple path takes O(e) time because it requires visiting each edge of the graph in the worst case. Given a graph with μ maximal simple paths (see Table 3), the worst-case time complexity of the Tiernan algorithm is O(μe). The Johnson algorithm [35] improves upon the Tiernan algorithm by avoiding the vertices that cannot lead to simple cycles when appended to the current simple path Π. For this purpose, the Johnson algorithm maintains a set of blocked vertices Blk that are avoided during the search. In addition, a list of vertices Blist[v] is stored for each blocked vertex v. Whenever a vertex v is unblocked (i.e., removed from Blk) by the Johnson algorithm, the vertices in Blist[v] are recursively unblocked as well. In Fig. 3a, the path containing v_1, . . ., v_k is a maximal simple path, and, thus, it cannot lead to a simple cycle. The Johnson algorithm would block v_1, . . ., v_k immediately after visiting this sequence once and then keep these vertices blocked until it finishes exploring the neighbours of v_2. As a result, the Johnson algorithm visits the vertices v_1, . . ., v_k only once, rather than the 2k times the Tiernan algorithm would.
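The blocking mechanism described above can be sketched as a compact sequential enumerator. This is an illustration of the Blk/Blist bookkeeping, not the paper's parallel algorithm; the adjacency-dict representation and function names are assumptions.

```python
# Sketch of Johnson-style simple cycle enumeration with blocked
# vertices (Blk) and block lists (Blist). Cycles are reported with
# their minimal vertex first to avoid duplicates.

def simple_cycles(adj, n):
    cycles = []

    def unblock(v, blk, blist):
        blk.discard(v)
        while blist[v]:                 # recursively unblock dependents
            w = blist[v].pop()
            if w in blk:
                unblock(w, blk, blist)

    for v0 in range(n):
        blk, blist = set(), {v: [] for v in range(n)}
        path = [v0]

        def dfs(v):
            found = False
            blk.add(v)
            for u in adj[v]:
                if u < v0:              # only cycles whose minimal vertex is v0
                    continue
                if u == v0:
                    cycles.append(path[:])
                    found = True
                elif u not in blk:
                    path.append(u)
                    found |= dfs(u)
                    path.pop()
            if found:
                unblock(v, blk, blist)  # v may lead to a cycle again
            else:
                for u in adj[v]:        # stay blocked until a neighbour unblocks
                    if u >= v0 and v not in blist[u]:
                        blist[u].append(v)
            return found

        dfs(v0)
    return cycles

adj = {0: [1], 1: [2], 2: [0, 3], 3: [0]}
print(sorted(simple_cycles(adj, 4)))  # [[0, 1, 2], [0, 1, 2, 3]]
```

A vertex stays blocked until some vertex on a path from it to v_0 becomes unblocked, which is exactly what prevents the repeated exploration of maximal simple paths that makes the Tiernan algorithm slow.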

Table 4. Work and depth of the coarse- and fine-grained parallel algorithms.
the coarse-grained algorithms are not scalable based on the maximal simple paths that can be explored in each infeasible region. In this case, each vertex v_i, with i ∈ {1, . . ., k}, is visited up to 2k times, and, thus, the fine-grained parallel Johnson algorithm behaves as the Tiernan algorithm (see Section 3.4).

Lemma 1. The depth D_∞(n) of the fine-grained parallel Johnson algorithm is in O(e).

Proof. The worst-case depth of this algorithm occurs when a thread performs copy-on-steal and explores a maximal simple path. A thread explores such a path in O(e) time because it visits at most e edges. As a result, Π and Blk contain at most n vertices, and Blist contains at most e pairs of vertices. Therefore, copy-on-steal requires O(e) time to copy Π, Blk, and Blist, and to unblock the vertices in Blk. As a result, the depth of this algorithm is D_∞(n) ∈ O(e).

⊲ Operations on Π and Blk are thread-safe
4   while Π^{T_2}.back() ≠ v do Π^{T_2}.pop();
5   Remove vertices from Blk^{T_2} inserted at depth d′ ≥ d;
6   found = false;
7   while E ≠ ∅ do    ⊲ Exploration of the path extension E
9       Π^{T_2}.push(v); Blk^{T_2} = Blk^{T_2} ∪ {v};
10      foreach u : N(v) s.t. u.id > v_0.id do
11          if u ≠ E.front() ∧ u ∉ Blk^{T_2} then
12              E′ = FGRT_DFS(u, v_0, Blk^{T_2}, Vis = ∅);
16          else Blk^{T_2} = Blk^{T_2} ∪ Vis;
17      if found then break;
18  if E = ∅ then report cycle Π^{T_2};
19  else spawn FGRT_task(v, v_0, E, d + 1, T_2);    ⊲ Create a child task
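The state adjustment a thief thread performs in copy-on-steal (the path truncation and depth-based unblocking at the start of a stolen FGRT task, visible in the fragment above) can be sketched as follows. This is an illustrative sequential model, not the concurrent implementation; the function name and the depth-annotated dictionary for Blk are assumptions.

```python
# Sketch: copy-on-steal state adjustment. The thief copies Π and Blk,
# truncates the path back to the vertex v at which the task was
# created, and removes blocked vertices inserted at depth >= d.

def adjust_stolen_state(path, blk, v, d):
    """path: list of vertices (Π); blk: dict vertex -> insertion depth."""
    path = path[:]                      # thief works on private copies
    blk = dict(blk)
    while path and path[-1] != v:       # while Π.back() != v: Π.pop()
        path.pop()
    for u in [u for u, depth in blk.items() if depth >= d]:
        del blk[u]                      # undo blocking done below depth d
    return path, blk

path = [0, 1, 2, 3]
blk = {1: 1, 2: 2, 3: 3, 4: 3}
print(adjust_stolen_state(path, blk, 2, 3))  # ([0, 1, 2], {1: 1, 2: 2})
```

Copying and trimming this state is what bounds the per-steal cost in the depth analysis of Lemma 1.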

Table 5. Hardware platforms used in the experiments. Here, P, C/P, and T/C represent the number of processors, the number of cores per processor, and the number of hardware threads per core, respectively.

Table 6. Temporal graphs used in the experiments. In this table, Δ_avg and Δ_max refer to the average and maximum values of Δ, respectively, where Δ is the number of outgoing edges of a vertex, i.e., the vertex degree. Similarly, the corresponding avg and max columns report the average and maximum numbers of parallel edges for a given pair of source and destination vertices. Time span refers to the difference between the maximum and minimum timestamps in a graph.