SPORE: Combining Symmetry and Partial Order Reduction

Symmetry reduction (SR) and partial order reduction (POR) aim to scale up model checking by exploiting the underlying program structure: SR avoids exploring executions equivalent up to some permutation of symmetric threads, while POR avoids exploring executions equivalent up to reordering of independent instructions. While both SR and POR have been well studied individually, their combination in the context of stateless model checking has remained an open problem. In this paper, we present SPORE, the first stateless model checker that combines SR and POR in a sound, complete and optimal manner. SPORE can leverage not only symmetries in the client program itself, but also internal symmetries in the underlying implementation (i.e., idempotent operations), a novel symmetry notion we introduce in this paper. Our experiments confirm that SPORE explores drastically fewer executions than tools that solely employ SR/POR, thereby greatly advancing the state-of-the-art.


INTRODUCTION
Stateless model checking (SMC) [Godefroid 1997] verifies a concurrent program by enumerating all of its executions. SMC is quite popular in concurrent program verification as (a) it can be used by programmers without any expertise in formal methods, (b) it can handle programs in full-fledged programming languages like C, C++ and Java, and (c) it can reason about the effects of the underlying weak memory model (e.g., C/C++11 [Lahav et al. 2017]). On the downside, however, SMC only supports verification of bounded programs, and often does not scale well enough to handle client programs with a sufficient number of threads to provide strong confidence in the correctness of a given implementation.
There are two sound techniques that can be employed to increase the scalability of SMC. Symmetry reduction (SR) [Clarke et al. 1996; Emerson and Wahl 2005] exploits symmetries in the threads of the program under test (e.g., all threads running the same code) and avoids considering all the ways in which symmetric threads interleave, as the order in which such threads execute is clearly irrelevant. As an example of SR, consider the fais program where n symmetric threads perform an atomic "fetch-and-increment" operation on x:

fetch_add(x, 1) || ... || fetch_add(x, 1)    (fais)

While naive SMC explores n! executions for this program, SR only explores 1 execution. Dynamic partial order reduction (DPOR) [Abdulla et al. 2014; Flanagan and Godefroid 2005] reduces the program state space by not exploring executions that are equivalent up to some permutation of independent instructions (e.g., instructions accessing different variables). For instance, consider the program below where 26 (non-symmetric) threads write different parts of an array:

a[1] := 1 || a[2] := 2 || ... || a[26] := 26    (array)

For array, naive SMC would again explore 26! executions while DPOR would only explore 1, as it notices that all threads access different parts of memory, and hence their relative order is irrelevant.
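To make the n!-vs-1 gap concrete, the following minimal sketch (an illustration, not SPORE itself) enumerates all interleavings of fais for n = 4 and then quotients them by thread permutation; the encoding of executions as value maps is our own choice for this example:

```python
from itertools import permutations
from math import factorial

n = 4

def interleavings(n):
    # Each thread performs a single atomic fetch_add(x, 1), so an
    # interleaving is just an order in which the n threads take their step.
    return list(permutations(range(n)))

def run(schedule):
    # Execute fetch_add(x, 1) in the given order; fetch_add returns the
    # old value of x, which we record as the result of each thread.
    x, reads = 0, {}
    for t in schedule:
        reads[t] = x
        x += 1
    return reads

# Raw executions: which thread read which value (naive SMC view).
raw = {tuple(sorted(run(s).items())) for s in interleavings(n)}
# Symmetry-reduced view: forget thread IDs, keep only the multiset of values.
sym = {tuple(sorted(run(s).values())) for s in interleavings(n)}

assert len(raw) == factorial(n)  # naive SMC: n! distinct executions
assert len(sym) == 1             # SR: all of them are symmetric
```

Forgetting thread IDs is exactly the quotient SR takes: any two runs differ only by a renaming of the symmetric threads.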
A common way to view both SR and DPOR is via the equivalence partitioning they induce on the program state space. Indeed, SR groups together executions that can be obtained from one another by changing the ID of symmetric threads, while DPOR groups together executions that can be obtained from one another by changing the order of non-conflicting instructions.
Observe, however, that even for symmetric programs, SR and DPOR are not equivalent, and neither approach subsumes the other. This can be seen with the fais+array example, which combines the two programs above: each of n symmetric threads performs a fetch_add(x, 1) and then writes to its own array cell. While DPOR explores n! executions for fais+array (due to the conflicting fetch_adds), SR explores (2n − 1)!! executions (the double factorial of odd numbers). This discrepancy arises because in SMC, after each thread has executed its fetch_add, symmetry "breaks", as each thread reads a different value.
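For intuition, the two counts can be tabulated for small n (a quick arithmetic check, not part of the paper's formal development):

```python
from math import factorial

def odd_double_factorial(n):
    # (2n - 1)!! = 1 * 3 * 5 * ... * (2n - 1)
    result = 1
    for k in range(1, 2 * n, 2):
        result *= k
    return result

# Executions explored for fais+array: DPOR (n!) vs SR ((2n - 1)!!)
table = {n: (factorial(n), odd_double_factorial(n)) for n in range(1, 6)}

assert table[3] == (6, 15)     # already at n = 3, neither count matches the other
assert table[5] == (120, 945)  # and the gap widens quickly
```

Since (2n − 1)!! grows faster than n!, DPOR wins on this particular program, yet SR wins on fais — which is precisely why neither technique subsumes the other.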
Even though SR and DPOR are both effective when applied, existing SR/DPOR approaches have two major limitations. First, they are incompatible: despite years of research on each of SR and DPOR, no algorithm manages to successfully combine the two, so employing one of them precludes the usage of the other. Second, both SR and DPOR fail to leverage internal symmetries, i.e., idempotent operations of the underlying implementation. One case of internal symmetry is the quintessential helping pattern, where some operation observes an ongoing, incomplete operation of the same type, and then tries to complete the ongoing operation before performing its own. SR fails to exploit internal symmetries because the threads performing the operations do not share the same code, while DPOR fails to do so because the two operations are considered conflicting.
In this paper, we present Spore (Symmetry and Partial Order Reduction Explorer), a novel algorithm that combines SR and DPOR, and overcomes both limitations above. Spore resolves thread-level symmetries by restricting the coherence order of symmetric conflicting operations to agree with their thread order, and internal symmetries with a novel memory-model axiomatization that equates executions differing only in the order of the locally symmetric operations. The resulting algorithm is sound, complete and optimal under the combined equivalence partitioning, and achieves exponential reductions in verification time over the state-of-the-art. Spore is also parametric in the choice of the underlying (weak) memory model.
Our contributions can be summarized as follows.
§2 We (informally) describe why the combination of DPOR and SR is non-trivial, as well as how Spore exploits thread-level and internal symmetries.
§3 We present Spore in detail and prove its correctness.
§4 We implement Spore in a tool for C/C++ programs, and empirically demonstrate that it is orders of magnitude faster than the state-of-the-art.

SPORE: INFORMAL DESCRIPTION
We develop Spore by adding SR on top of a DPOR algorithm (as opposed to the other way around), since DPOR underpins most modern SMC solutions [Abdulla et al. 2018; Aronis et al. 2018; Chalupa et al. 2017; Kokologiannakis et al. 2019b, 2022; Norris and Demsky 2013]. As such, we begin this section by explaining the basics of DPOR (§2.1), and then describe why the combination of DPOR and symmetry reduction is non-trivial and how Spore achieves it (§2.2). We end the section by demonstrating how Spore handles internal symmetries (§2.3).

Dynamic Partial Order Reduction
Modern DPOR algorithms, such as TruSt [Kokologiannakis et al. 2022], represent program executions up to the reordering of independent accesses in a structure called execution graph [Alglave et al. 2014], and verify a given program by constructing its associated execution graphs in an incremental fashion.
Each execution graph G comprises: (a) a set of events E (graph nodes), modeling the instructions of the program, and (b) a few relations on events (graph edges), modeling various interactions between the instructions. In the following, we consider three such relations: the program order (po), which orders instructions of the same thread; the reads-from relation (rf), which relates each read event r in G to the write event w in G from which r obtains its value; and the coherence order (co), which totally orders the writes at each memory location.

The exploration proceeds in a depth-first manner: DPOR adds the events of the program from left to right, one by one, and whenever a read has more than one place to read from, DPOR initiates a recursive subexploration. For instance, when the read of T2 is added, it can read both 0 and 1 (both options are consistent according to SC), and thus DPOR initiates subexplorations B and C. DPOR proceeds in a similar manner, until all events of the program have been added to the graph.
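The graph components can be encoded concretely; below is a minimal sketch (our own hypothetical encoding, not the tool's) of one execution graph of the w+r+r program from §2.2, with po induced from thread IDs and serial numbers:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    tid: int   # thread identifier (0 reserved for the initialization event)
    idx: int   # serial number within the thread
    kind: str  # "W" or "R"
    loc: str
    val: int = 0

def po(a, b):
    # init (tid 0) is po-before everything; same-thread events
    # follow their serial numbers.
    return (a.tid == 0 and b.tid != 0) or (a.tid == b.tid and a.idx < b.idx)

# The w+r+r program: T1 writes x := 1; T2 and T3 read x.
init = Event(0, 0, "W", "x", 0)
w    = Event(1, 1, "W", "x", 1)
r2   = Event(2, 1, "R", "x")
r3   = Event(3, 1, "R", "x")

rf = {r2: w, r3: init}  # reads-from: each read -> the write it reads from
co = [(init, w)]        # coherence: total order on the writes to x

assert po(init, w) and not po(w, r2)  # po relates init and same-thread events only
assert rf[r2].val == 1 and rf[r3].val == 0
```

This particular rf choice corresponds to one of the graphs DPOR enumerates; the others differ only in which write each read reads from.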

Conventions
Following standard conventions in the weak memory model literature, we (1) treat rf as a relation from the write to the read event; (2) assume a special initialization event init, which initializes every location with 0 and is thus po-before all other events and co-before all other write events; and (3) do not draw co edges from init to other writes (as it is trivially co-before them). In explorations, we use letters to refer to intermediate executions, numbers to refer to full executions, and red to denote executions that will not be explored.
Revisits. The exploration in Example 1 was largely straightforward, but there is still one aspect of DPOR we have not discussed: revisiting. For exposition purposes, suppose we add the events of w+r+r from right to left. When we encounter the reads, they cannot yet read 1, because the corresponding write does not exist in the graph. Therefore, whenever a write is added to a graph, DPOR also revisits existing same-location reads to see if they can read from the newly added write.
Whenever DPOR revisits a read r from a write w, it restricts the graph by removing some of the events added to the graph after r, since they may depend on the value read by r. (If not, they will be re-added in subsequent steps of the exploration.) The most common choice for restricting the graph is to keep only the events that were added before r and those causally before w (i.e., in its porf ≜ (po ∪ rf)+ prefix). For instance, in the right-to-left exploration of w+r+r, if W(x, 1) revisits the read of T3, the resulting graph does not contain the read of T2, because it was added after T3's read and is not porf-before W(x, 1).
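The porf-prefix computation used to restrict the graph can be sketched as a closure over po and rf edges (event names here are illustrative strings, not the paper's notation):

```python
def porf_prefix(event, po, rf):
    """All events porf-before `event`, i.e., its (po ∪ rf)+ predecessors.
    po: set of (a, b) pairs; rf maps read -> write, so rf edges run write -> read."""
    edges = set(po) | {(w, r) for r, w in rf.items()}
    prefix, stack = set(), [event]
    while stack:
        e = stack.pop()
        for a, b in edges:
            if b == e and a not in prefix:
                prefix.add(a)
                stack.append(a)
    return prefix

# Shape of the w+r+r revisit: T1's write W(x,1) ("Wx1") revisits T3's read ("R3").
po = {("init", "Wx1"), ("init", "R2"), ("init", "R3")}
rf = {"R3": "Wx1"}  # after the revisit, T3 reads from T1's write

assert porf_prefix("R3", po, rf) == {"init", "Wx1"}
# T2's read ("R2") is not porf-before R3, so the restriction deletes it.
```

Everything outside this prefix (here, T2's read) is exactly what the restriction removes and what later steps may re-add.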
The restriction due to revisits may lead to duplicate explorations, as we demonstrate below.
Example 2 Consider the following variation of w+r+r.

Fig. 1. Revisit opportunities
Adding the events from left to right, observe that there are two subexplorations where W(x, 2) has the chance to revisit the read of T2: when the latter reads 0, and when it reads 1. These subexplorations are shown in Fig. 1. If W(x, 2) performs the revisit in both, the exact same graph will be created.
There are two ways DPOR can avoid such duplication. Abdulla et al. [2014] and Kokologiannakis et al. [2019b] simply save all encountered executions (more precisely: the ones created by revisits), and drop subsequent revisits that yield an already encountered execution. Storing executions, however, leads to exponential memory consumption in the size of the program under test.
Avoiding Duplication with Maximal Extensions. A better solution, adopted by TruSt [Kokologiannakis et al. 2022], is to impose a revisiting condition so that a given revisit only takes place once among all possible subexplorations. The key observation is that whenever DPOR encounters two graphs that will yield the same graph immediately after a revisit, then in both cases the revisit happens from the same write w to the same read r, and the graphs only differ in the sets of events that were affected by the revisit (namely, r itself and all the events deleted by the revisit).
TruSt therefore constrains the events affected by the revisit (i.e., the read being revisited and the deleted events) to form a maximal extension: to be added co-maximally w.r.t. the porf-prefix of the revisiting write. Maximal extensions are better understood with an example.
Example 3 Consider the rev-ex below along with its SC-consistent execution graphs.
A DPOR run producing these executions can be seen below.
Assuming that DPOR adds events in a left-to-right manner, after adding the events of the first two threads, it then adds W(x, 1), which can either revisit R(x) or not (graphs C and B, respectively).
Following the respective subexplorations, W(y, 1) is encountered in both cases: in exploration B immediately, and in exploration C after adding the events under the conditional of T2. Similarly to W(x, 1), in both subexplorations W(y, 1) has the opportunity to either revisit R(y) or not.
Revisiting R(y) in both cases, however, leads to duplication, as the same graph (graph F) would be obtained twice. Maximal extensions dictate that the revisit only takes place from execution E, as there all the affected events are added maximally w.r.t. W(y, 1). To see why, it is helpful to think "backwards": starting from the graph obtained from the revisit without the write and read participating in it (W(y, 1) and R(y)), if all the affected events are added in a co-maximal manner (i.e., reads reading from the co-latest write, and writes added last in co), we obtain graph E, which is the graph from which the revisit takes place.
To define maximal extensions, we first introduce an auxiliary definition about execution graphs. A write event w is co-maximal in a set of events S if w ∈ S and it does not have a co-successor in S (i.e., there is no w′ ∈ S with ⟨w, w′⟩ ∈ co).
Definition 2.1. An event e in a graph G is added maximally w.r.t. a write event w in G if the following conditions hold, where S is the set of all events e′ added before e or in w's porf-prefix (i.e., ⟨e′, w⟩ ∈ porf): if e is a write, then e is co-maximal in S; and if e is a read, then the write it reads from is co-maximal in S.

Observe that non-write/read events are always added maximally w.r.t. a revisiting write. Maximal extensions also have the following useful property, which we will use in some of our examples below.

Proposition 2.2. If a write w revisits a read r resulting in a graph G, the porf-prefix of r will not be removed in any of the subsequent subexplorations starting from G [Kokologiannakis et al. 2022].
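A plausible reading of Definition 2.1, following the "backwards" intuition above (writes added last in co, reads reading the co-latest write), can be sketched as follows; the tuple encoding of events is illustrative only:

```python
def is_co_maximal(w, S, co):
    """A write w is co-maximal in S if w is in S and no co-successor of w is in S."""
    return w in S and all(not (a == w and b in S) for (a, b) in co)

def added_maximally(e, S, rf, co):
    # A write must itself be co-maximal in S; a read must read from a write
    # that is co-maximal in S; other events are always maximal.
    if e[0] == "W":
        return is_co_maximal(e, S, co)
    if e[0] == "R":
        return is_co_maximal(rf[e], S, co)
    return True

# Three writes to the same location, ordered init -> w1 -> w2 in co,
# and a read r reading from the co-latest write w2.
co = [(("W", "init"), ("W", "w1")), (("W", "w1"), ("W", "w2"))]
S  = {("W", "init"), ("W", "w1"), ("W", "w2"), ("R", "r")}
rf = {("R", "r"): ("W", "w2")}

assert not is_co_maximal(("W", "w1"), S, co)   # w2 is a co-later write in S
assert added_maximally(("R", "r"), S, rf, co)  # r reads the co-latest write
```

A revisit is then allowed only when every affected event passes this check against the revisiting write's prefix.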
2.2 Spore: Thread-Level Symmetries
Consider again the w+r+r example, where T2 and T3 share their code.
T1: x := 1    T2: a2 := x    T3: a3 := x    (w+r+r)
We say that executions 2 and 3 among its consistent executions (see Example 1) are symmetric, because one can be obtained by permuting the symmetric threads of the other.
2.2.1 Distinguishing Among Symmetric Executions. To avoid exploring both graphs, we pick a representative execution among them, and instrument DPOR to drop non-representative symmetric executions. Spore achieves this using thread IDs: we deem as representative the graph where a symmetric thread only reads values that are at least as "recent" (in terms of co) as the ones read by its symmetric predecessor. In the w+r+r example, this means that graph 2 is the representative one, as in graph 3 the read of T2 reads a value that is co-after the one read by T3.
Let us formalize this intuition. We say that two events e, e′ in an execution graph G are prefix-matching (and write prefix-matching(e, e′)) if they originate from threads with the same code and have matching po-prefixes, i.e., all events po-before them are either not memory accesses or reads that pairwise read from the same write. Note that two writes can be prefix-matching, but any po-later pair of events cannot be: writes break matching prefixes because they are co-ordered.
Spore picks as representative graphs the ones where the thread order of prefix-matching events does not contradict an extension of co called the extended coherence order: eco ≜ (co ∪ rf ∪ rb)+, where rb ≜ rf⁻¹ ; co is the reads-before order, denoting that a read reads from a write whose value is later overwritten. Observe that, due to the definition of prefix-matching events above, any eco path between two prefix-matching events will involve co.
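The eco relation can be computed directly from its definition; the sketch below (with illustrative event names) checks that graph 3 of w+r+r is indeed non-representative:

```python
def closure(rel):
    # naive transitive closure over a set of pairs
    rel = set(rel)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(rel):
            for (c, d) in list(rel):
                if b == c and (a, d) not in rel:
                    rel.add((a, d))
                    changed = True
    return rel

def eco(co, rf):
    """eco = (co ∪ rf ∪ rb)+, with rb = rf⁻¹ ; co (rf maps read -> write)."""
    rf_rel = {(w, r) for r, w in rf.items()}
    rb = {(r, d) for r, w in rf.items() for (c, d) in co if c == w}
    return closure(set(co) | rf_rel | rb)

# Graph 3 of w+r+r: T2's read ("R2") reads x = 1 (from "W"),
# while T3's read ("R3") reads x = 0 (from "init").
co = {("init", "W")}
rf = {"R2": "W", "R3": "init"}
e = eco(co, rf)

# An rb ; rf path orders R3 eco-before R2, contradicting thread order T2 < T3,
# so graph 3 is not the representative.
assert ("R3", "R2") in e
```

In graph 2 the rf edges are swapped, the eco path runs from T2's read to T3's read, and the thread order is respected.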
Given this notion of representative graphs, in the w+r+r example above, graph 2 in Example 1 is the representative, because eco agrees with the thread order (there is an rb; rf path from T2 to T3), whereas graph 3 is not, as eco contradicts the thread order.

Problem #1:
The Interaction Between Representative and Maximal Executions. This solution, however, does not work that easily, due to revisiting (§2.1). The problem is that SR avoids exploring certain graphs (i.e., the non-representative ones), whose exploration DPOR might require for a given revisit to happen. Put differently, maximal extensions can be non-representative graphs.
Example 4 To illustrate the problem, consider the following variation of w+r+r (again, T2 and T3 share their code), and suppose we are interested in the executions where a3 = 1.
Similarly to w+r+r, graphs 2 and 3 are symmetric, and graph 2 is the representative one.
We now present a (partial) DPOR exploration of this program, with the objective of showing that the combination of DPOR and SR is not guaranteed to be correct. Concretely, we will show that execution 1 will not be generated if DPOR explores the program threads in a peculiar order.
Suppose DPOR first adds the read of T3, and then proceeds with the events of T1. When it adds W(x, 1), it can either revisit R(x) (top exploration tree) or not (bottom exploration tree). Since we are interested in generating execution 1, let us disregard the top exploration tree (where T3 reads 1) and focus on the bottom one. (The reason we discard the top one is that DPOR does not "undo" revisits: since W(x, 1) revisits the R(x) of T3, T3 keeps reading 1 in all subsequent subexplorations; see Prop. 2.2.) At the next step, the algorithm adds the read of T2, which can either read 1 (from T1) or 0 (the initial value). DPOR, however, will only consider the exploration where the read reads 0, and not the one where it reads 1, as the latter is not the representative among the symmetric executions. (The one where T2 reads 0 and T3 reads 1 is.) At the final step, the algorithm adds the W(x, 1) event of T4, and considers revisiting the R(x). With the maximal extension condition of §2.1, however, this revisit is doomed to fail, since the read of T2 is not added co-maximally w.r.t. W(x, 1). Hence, DPOR will not generate execution 1.
As the w+r+r-rev example demonstrates, the problem when combining DPOR and SR is that the resulting algorithm might deem the graphs on which TruSt's maximal extension condition enables a certain revisit as non-representative (and therefore drop them).
There are two potential solutions to this problem.
The first is to modify the maximal extension condition to hold only for representative graphs. Unfortunately, this approach does not work, because of the atomicity condition of read-modify-write (RMW) operations. In our technical appendix [Kokologiannakis et al. 2024b], we show that it is impossible to define a maximality condition purely at the level of execution graphs without consulting the program.
The second solution is to keep the maximal extension condition intact, but restrict the exploration order so that representative executions always form maximal extensions. To see why restricting the exploration order is a promising solution, let us consider again Example 4. The reason why a maximal extension was created in a non-representative execution was that T3 was added before T2 (i.e., against thread order), and T2 had co-later options available to it (T1 was added after T3 but before T2). By fixing the exploration order, we essentially try to "force" co to agree with the thread order.

Problem #2:
Fixing the Exploration Order is Inadequate. Given the above, a natural choice is to maintain a left-to-right scheduling among threads that share their code. Even though this simple modification mitigates the issue in w+r+r-rev, it does not restore correctness in general.
Example 5 To see why, consider the r+rww+rww program below, where T2 and T3 share their code, along with one of its representative executions.
Assuming that we schedule all threads in a left-to-right manner, execution 4 cannot be generated by the procedure described so far. The first point where the algorithm has more than one choice to consider is the addition of R(x) of T3. The case where R(x) reads from T2's W(x, 1) cannot lead to 4, because the restriction of the graph upon the revisit of R(y) will preserve the rf-edge of the R(x) read. Therefore, we are left with the case where R(x) reads from init (graph K below).
When the W(x, 1) of T3 is added to K, there are three options:
L: W(x, 1) is added co-after T2's W(x, 1). This execution is explored by DPOR, but cannot lead to graph 4: when W(y, 1) is added in T3, it will be unable to revisit R(y), because the W(x, 1) of T2 is not maximally added w.r.t. T3's W(y, 1): it is co-before T3's W(x, 1), which is in T3's porf-prefix.
M: W(x, 1) is co-before T2's W(x, 1). This execution is dropped because co contradicts the thread order of symmetric events.
N: W(x, 1) revisits the R(x) of T2. This execution is also dropped because it is not a representative one (T2 is reading a co-earlier value than T3).
As the r+rww+rww example above clearly demonstrates, fixing the scheduling policy is insufficient to guarantee completeness. Essentially, the issue described in §2.2.2 persists: execution 4 could not be produced because a maximal extension (graph M) was dropped in favor of the representative one (graph L). In turn, in the representative execution L, a co-edge from a symmetric thread to the porf-prefix of the revisiting write precluded the revisit.
This last observation is key in marrying DPOR and SR: since a revisit fails due to an event of a symmetric thread being added non-maximally, Spore's solution is to consider symmetric events part of the revisiting write's prefix. In the case of r+rww+rww, when Spore considers the revisit between the W(y, 1) of T3 and the R(y) of T1, the prefix of W(y, 1) will include not just the events porf-before it, but also the porf-prefixes of symmetric events (namely, of the W(x, 1) of T2). As such, graph 4 will be generated from L, because all the affected events (namely, T1's R(y) and T2's W(x, 1)) are added maximally w.r.t. the new prefix of W(y, 1).
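This extended prefix can be sketched as a closure computation; the event names (R3, W3a, etc.) are illustrative, and the one-pass loop below is a simplification of a proper fixpoint:

```python
def porf_closure(e, edges):
    # all (po ∪ rf)+ predecessors of e; edges is a set of (a, b) pairs
    pre, stack = set(), [e]
    while stack:
        x = stack.pop()
        for a, b in edges:
            if b == x and a not in pre:
                pre.add(a)
                stack.append(a)
    return pre

def sprefix(e, edges, sym):
    """Spore's extended prefix: the porf-prefix of e, plus every event
    symmetric to one of those events (or to e), with its own porf-prefix.
    sym maps an event to the set of its symmetric counterparts."""
    pre = porf_closure(e, edges)
    for x in set(pre) | {e}:
        for s in sym.get(x, ()):
            pre |= {s} | porf_closure(s, edges)
    return pre

# Shape of r+rww+rww: in T3, the read precedes two writes (R3 -> W3a -> W3b),
# and T3's first write W3a is symmetric to T2's first write W2a.
po = {("R3", "W3a"), ("W3a", "W3b")}
sym = {"W3a": {"W2a"}}

# The revisiting write W3b now has T2's symmetric write in its prefix.
assert "W2a" in sprefix("W3b", po, sym)
```

With W2a inside the prefix, the maximality check of §2.1 is evaluated against it as well, which is exactly what re-enables the revisit in graph L.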

Problem #3:
Handling po ∪ rf ∪ co Cycles. Changing the notion of a prefix is instrumental in restoring completeness, but comes with a caveat. In DPOR, a write can never revisit events in its own prefix. So, by introducing a new notion of prefix (henceforth sprefix) in Spore, do we lose any executions? Is it possible that this novel notion of prefix precludes some revisit that does not create a causal cycle, thereby rendering Spore incomplete?
The answer depends on the underlying memory model. First, we can show that sprefix cycles boil down to po ∪ rf ∪ co cycles. (Our full argument is presented in §3.) Strong models, such as SC, TSO [SPARC International Inc. 1994], and SRA [Lahav et al. 2016], forbid (po ∪ rf ∪ co)+ cycles, and so it is never possible for a read to read from a write in its sprefix.
In weaker models, such as RC11 [Lahav et al. 2017], however, it can be the case that an event is in its own sprefix but not in its own porf-prefix. Such a scenario is shown below.
Example 6 Consider the sp-cyc program, where T2 and T3 share their code.
In the execution of Example 6, W(x, 1) is in its own sprefix: W(x, 1) is read by the R(x) of T2, which is symmetric to the R(x) of T3, which is in turn in the prefix of W(x, 1). It is, however, not in its own porf-prefix (there is no porf cycle).
To restore completeness, Spore therefore checks that no consistent execution graph has a po ∪ rf ∪ co cycle. This condition typically holds: a po ∪ rf ∪ co cycle implies that there exist two writes that are not porf-ordered, and such unordered concurrent writes are rare in realistic implementations [Abdulla et al. 2019; Kokologiannakis et al. 2019b]. As we show in §4, Spore is directly applicable to realistic libraries of concurrent data structures.
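Checking this side condition amounts to a cycle check on the union relation; a standard DFS-based sketch:

```python
from collections import defaultdict

def has_cycle(edges):
    """Three-color DFS cycle detection over a relation given as (a, b) pairs,
    here standing in for po ∪ rf ∪ co."""
    adj = defaultdict(list)
    nodes = set()
    for a, b in edges:
        adj[a].append(b)
        nodes |= {a, b}
    color = {n: 0 for n in nodes}  # 0 = unvisited, 1 = on stack, 2 = done

    def dfs(u):
        color[u] = 1
        for v in adj[u]:
            if color[v] == 1 or (color[v] == 0 and dfs(v)):
                return True
        color[u] = 2
        return False

    return any(color[n] == 0 and dfs(n) for n in nodes)

acyclic = {("init", "w"), ("w", "r")}
cyclic = acyclic | {("r", "init")}
assert not has_cycle(acyclic)
assert has_cycle(cyclic)
```

For strong models this check is vacuous (the model itself forbids such cycles); for weaker models it is performed on the explored graphs.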

Spore: Internal Symmetries
We now switch gears and present how Spore exploits internal symmetries. We first present some examples of such symmetries (§2.3.1), and then discuss Spore's treatment of them (§2.3.2). We end this section by discussing how internal and thread-level symmetries interact (§2.3.3).
The DGLM queue is a lock-free queue comprising two pointers, head and tail. At the end of each enqueue operation, the enqueuer advances the tail pointer to point to the last element of the queue. If, however, a concurrent enqueuer or dequeuer detects that the tail pointer is lagging behind (i.e., tail.next ≠ NULL), it tries to advance tail on behalf of the incomplete enqueue.
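The tail-advancing step can be sketched as follows — a sequential simulation of a Michael–Scott/DGLM-style enqueue, where `cas` is a hypothetical stand-in for an atomic compare-and-swap, so only the control flow of the helping pattern is illustrated:

```python
class Node:
    def __init__(self, val=None):
        self.val, self.next = val, None

class Queue:
    def __init__(self):
        self.head = self.tail = Node()  # dummy sentinel node

def cas(obj, field, expected, new):
    # sequential stand-in for an atomic compare-and-swap on a field
    if getattr(obj, field) is expected:
        setattr(obj, field, new)
        return True
    return False

def enqueue(q, val):
    node = Node(val)
    while True:
        tail, nxt = q.tail, q.tail.next
        if nxt is not None:
            cas(q, "tail", tail, nxt)       # helping: advance a lagging tail
        elif cas(tail, "next", None, node):
            cas(q, "tail", tail, node)      # main: swing tail to the new node
            return

q = Queue()
enqueue(q, 1)
enqueue(q, 2)
vals, n = [], q.head.next
while n:
    vals.append(n.val)
    n = n.next
assert vals == [1, 2]
```

The two tail-swinging CASes (main and helping) are exactly the idempotent pair discussed below: at most one of them succeeds, and the resulting state is the same either way.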
RDCSS is a double-compare-single-swap operation that takes as argument a descriptor d containing two addresses a1, a2 with their expected values o1, o2, and a new value n2. If both addresses contain their expected values, then the new value n2 is stored at the second address a2. To perform the double comparison atomically, RDCSS first tries to place its descriptor at the a2 address, and then reads a1 to determine whether to replace the descriptor with the new value n2 or restore the old value o2. If another thread encounters the descriptor, it tries to complete the ongoing RDCSS call.
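The protocol's control flow (descriptor planting plus helping) can be sketched single-threadedly, with shared memory simulated by a dict; the names a1/o1/a2/o2/n2 follow the description above, and the code is an illustration, not a faithful concurrent implementation:

```python
class Descriptor:
    def __init__(self, a1, o1, a2, o2, n2):
        self.a1, self.o1, self.a2, self.o2, self.n2 = a1, o1, a2, o2, n2

mem = {}

def cas(addr, expected, new):
    # stand-in for an atomic compare-and-swap on simulated shared memory
    if mem[addr] == expected:
        mem[addr] = new
        return True
    return False

def complete(d):
    # replace the descriptor with n2 iff a1 still holds its expected value
    v = d.n2 if mem[d.a1] == d.o1 else d.o2
    cas(d.a2, d, v)

def rdcss(d):
    while True:
        if cas(d.a2, d.o2, d):  # plant the descriptor at a2
            complete(d)
            return
        cur = mem[d.a2]
        if isinstance(cur, Descriptor):
            complete(cur)       # help an ongoing RDCSS, then retry
        else:
            return              # a2 no longer holds the expected value

mem.update({"a1": 1, "a2": 10})
rdcss(Descriptor("a1", 1, "a2", 10, 20))
assert mem["a2"] == 20
```

The `complete` call is the idempotent step: whether the original caller or a helper executes it, exactly one descriptor-replacing CAS succeeds, and the final memory state is identical.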
Both algorithms employ the textbook helping pattern [Herlihy 1991; Herlihy and Shavit 2008], where some operation A observes an ongoing, incomplete operation B, and tries to complete B before performing its own. This helping pattern appears in widely used concurrent libraries, including libcds [Khizhinsky n.d.], folly [Facebook n.d.] and ckit [Bahra n.d.], as well as in most algorithms described by Herlihy and Shavit [2008].

Observe that in both cases, the highlighted main and helping operations are idempotent: one of the CASes succeeds, and all the others fail without changing the state. Moreover, their result is the same irrespective of which operation succeeds, and the program cannot distinguish which operation succeeded. Indeed: (i) both operations execute exactly the same code, (ii) their return value is not checked by the program, and (iii) swapping which of the operations succeeded preserves consistency and does not mask any error. As we will shortly see, these three conditions enable Spore to exploit internal symmetries and drastically reduce the state space. (In contrast, thread-level symmetries are inapplicable, because the main and the helping operations have different execution prefixes.)

2.3.2 Exploiting Idempotent Operations. Spore exploits idempotent operations by only exploring executions where the main operation succeeds. To this end, Spore changes the underlying memory model and treats helping operations as no-ops, which have no incoming/outgoing rf or co edges. To do so, Spore requires assistance from the user: the user annotates helping operations in the program (as in Fig. 2), and then Spore automatically treats them as no-ops and reduces the state space to be searched.
Annotations bring us to a major challenge that needs to be resolved: ensuring annotation correctness. If a user incorrectly annotates a function as helping, the annotation might mask an existing error in the user program. As such, Spore uses a dummy event in the place of the function to check whether certain (sufficient) conditions hold. If they do not, Spore reports an annotation error to the user.
Some minimal preconditions that need to hold for a function f_h to be considered as helping w.r.t. a function f_m have already been stated in §2.3.1: (i) f_h and f_m execute the same code, (ii) the return values of f_h and f_m are not checked by the program, and (iii) replacing an execution where f_m fails and f_h succeeds with one where f_m succeeds and f_h is treated as a no-op preserves consistency and the presence of any error.
Let us now go over these conditions in more detail. The first two conditions lie at the heart of idempotency, and are what allows Spore to treat f_h as a no-op: no code uses the result of f_h, and it is thus safe to disregard it. Had f_h and f_m been different (or had their results been used), then annotating one of them as helping would mask errors in programs, as in the example below, where f_m and f_h are functions comprising a single CAS operation, but the result of f_m is used (i.e., f_h is incorrectly annotated as helping). If we treat f_h as a dummy event, the execution above (where the failed CAS generates a single read event, and the successful one two events annotated with an excl flag) will not be explored, and the error will be missed.
Condition (iii) is a bit more intricate. To ensure it, we need to guarantee that in any execution where f_h succeeds, f_m has already observed (in a synchronizing manner) the operations of f_h. If reading from writes in f_m can imply less synchronization with the rest of the program, then it is possible that reading from f_h results in an error, but reading from f_m does not (and thus, treating f_h as dummy can mask errors). We demonstrate this point with the following example. If the CAS in T2 succeeds and T1's read reads from it, then T1 will necessarily read 1 afterwards. If, however, the CAS in T3 succeeds and T1 reads from it (as shown in the graph above), T1 can subsequently read 0 and violate its assertion.
To fix this last issue, Spore imposes four more conditions on the user annotations:
(1) f_m and f_h have no writes other than a final CAS;
(2) f_m has a preceding source event whose value it uses as the compare operand;
(3) f_m is immediately preceded by a write, which is observed in a synchronizing manner before f_h; and
(4) all writes to the location of f_m's CAS are part of read-modify-write (RMW) operations.
These conditions are formalized in §3. As we prove in §3, these conditions are sufficient to detect erroneously annotated helping patterns.

2.3.3 The Interaction Between Internal and Thread-Level Symmetries. Before moving on to our formal discussion of Spore, it is worth noting that idempotent operations facilitate SR. Consider an example with two symmetric threads performing a helping CAS. Assuming that the threads are symmetric up until the CASes, treating the CASes as RMW operations breaks the symmetry, whereas treating them as dummy events preserves it.

SPORE: FORMAL DESCRIPTION
In this section, we describe the theoretical basis of Spore. In particular, we explain: (§3.1) the representation of executions as execution graphs; (§3.2) how Spore can be represented as a memory model; (§3.3) Spore's exploration algorithm; and (§3.4) why Spore is correct, i.e., why it explores exactly one graph per combined DPOR/SR equivalence class, and does not mask any errors.

Execution Graphs
An execution graph comprises a set of events (nodes), and a few relations on these events (edges).

Definition 3.1. An event, e ∈ Event, is either the initialization event init, or a thread event ⟨t, i, l⟩, where t ∈ Tid is a thread identifier, i ∈ Idx is a serial number (denoting the index of the event within its thread), and l ∈ Lab is a label that takes (at least) one of the following forms:
• Write label: W(loc, v), denoting a write of value v to location loc; or
• Read label: R(loc), denoting a read from location loc.
Read and write attributes include the exclusivity flag excl for RMWs, and the access mode for RC11-style models. (Additional kinds of events exist for memory allocations, deallocations, assertion violations, etc., but these do not affect the model checking algorithm in any meaningful way.)

Having defined events, we define execution graphs as follows.
Definition 3.2. An execution graph G ∈ EXEC comprises the following components:
(1) a set of events E that includes init and does not contain multiple events with the same thread identifier and serial number;
(2) rf : E ∩ R → E ∩ W, called the reads-from function, mapping each read event to a same-location write from which it gets its value;
(3) co ⊆ ⋃_{l ∈ Loc} (W_l × W_l), where W_l ≜ {init} ∪ {⟨t, i, lab⟩ ∈ E | lab = W(l, _)}, called the coherence order, a strict partial order that is total on W_l for every location l ∈ Loc; and
(4) ≤, a total order on E that represents the order in which events were incrementally added to the graph.
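Definition 3.2's well-formedness conditions translate directly into a checker; the sketch below uses a simplified event representation of our own choosing:

```python
def check_graph(events, rf, co):
    """Well-formedness per Definition 3.2 (sketch).
    events: name -> (tid, idx, kind, loc); rf: read name -> write name;
    co: loc -> list of write names in coherence order (init first)."""
    ids = [(e[0], e[1]) for e in events.values()]
    if len(ids) != len(set(ids)):
        return False  # (1) no two events share a thread identifier and serial number
    for r, w in rf.items():
        er, ew = events[r], events[w]
        if not (er[2] == "R" and ew[2] == "W" and er[3] == ew[3]):
            return False  # (2) rf maps each read to a same-location write
    for loc, order in co.items():
        ws = {n for n, e in events.items() if e[2] == "W" and e[3] == loc}
        if set(order) != ws or len(order) != len(set(order)):
            return False  # (3) co totally orders the writes at each location
    return True

g = {
    "init": (0, 0, "W", "x"),
    "w":    (1, 1, "W", "x"),
    "r":    (2, 1, "R", "x"),
}
assert check_graph(g, {"r": "w"}, {"x": ["init", "w"]})
assert not check_graph(g, {"r": "r"}, {"x": ["init", "w"]})  # rf target must be a write
```

The addition order ≤ of component (4) would simply be a list of event names alongside these components; it plays no role in well-formedness.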

Conventions
We write G.E, G.rf, G.co and ≤_G to project the various components of an execution graph G.
We assume that init ∈ W, and omit the ∅ for read/write labels with no attributes. The functions tid, idx, loc, mod and arg respectively return the thread identifier, serial number, location, access mode and function arguments of an event, when applicable. We write G.W for G.E ∩ W (and similarly for other sets), and use superscripts and subscripts to restrict label sets.

Observe that G does not have an explicit program order (po) component. We induce po from our representation of events: init is po-before all other events, and events of the same thread are po-ordered according to their serial numbers.

In our technical appendix [Kokologiannakis et al. 2024b], we define two mappings from programs to sets of execution graphs: (1) ⟦·⟧, which ignores function annotation labels, and simply generates an event with an M label before the events corresponding to the function body; and (2) ⟦·⟧_Annot, which, in the case of functions annotated with help, generates only the M_help event and does not generate any events for the body of the function call. Both mappings keep the rf and co components of graphs completely unconstrained. These components will be constrained by the memory model.

Consistency and Error Detection
A memory model, M, comprises three components: (a) a causal prefix relation, cb_M; (b) a consistency predicate, consistent_M(G), that determines whether an execution graph G is consistent; and (c) an IsErroneous_M(G) predicate, prescribing whether G contains an error (e.g., an invalid memory access) according to M.
The consistency predicate is used to constrain the semantics of a program. The annotation-ignoring (resp. annotation-aware) semantics of a program P under a memory model M, denoted ⟦P⟧_M (resp. ⟦P⟧Annot_M), is given by the set of execution graphs in ⟦P⟧ (resp. ⟦P⟧Annot) that are M-consistent.
Annotation Correctness. To ensure annotation correctness, Spore first checks that for each e_h ∈ G.M_help, there exists a (unique) e_m ∈ G.M_main with the same arguments, that these functions do not return any results (cf. conditions (i) and (ii) of §2.3.2), and that they are well-formed: they comprise a (possibly empty) sequence of reads followed by a CAS operation, with a possible data dependency from the reads to the CAS (no other dependencies are allowed, so that the locations accessed can be deduced from the arguments of e_m/e_h).
Assuming both functions have the proper form, Spore now has to ensure that (iii) holds, i.e., that their synchronization is the same. Since the definition of synchronization differs among memory models, for simplicity we provide here a definition that works for SC and RA 4. In what follows, we lift loc/exp to return the location/expected-value of the CAS read following an e_m ∈ G.M_main.
Our definition uses the notion of a source write w at location loc(e_m), which is observed before e_m (i.e., it is either po-before e_m or read po-before e_m), and which writes the value exp(e_m). We also require that the immediate po-predecessor of e_m be observed before e_h, which ensures that e_h has synchronized with everything in e_m's prefix, and that all writes to loc(e_m) after w are RMWs and do not write the same value as w. The latter condition ensures that e_m and e_h cannot both succeed, and that if e_h succeeds, then e_m observes its update.
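To build intuition for why these conditions preclude both CASes from succeeding, consider the following toy model (ours, purely illustrative, not Spore's checker): a location holds the source value, and the main and helper CASes both expect it. Since every later write to the location is an RMW that does not re-write the expected value, whichever CAS runs first succeeds and the other necessarily fails.

```python
class Cell:
    """A memory location supporting compare-and-swap."""
    def __init__(self, value):
        self.value = value

    def cas(self, expected, new):
        """Atomically replace `expected` by `new`; return whether it succeeded."""
        if self.value == expected:
            self.value = new
            return True
        return False

# Source write: the location holds the expected value exp(e_m) = 0.
loc = Cell(0)

# Both the helper CAS (e_h) and the main CAS (e_m) expect 0.
# Once one succeeds, no subsequent write restores 0, so the other fails
# and observes the update instead.
helper_ok = loc.cas(0, 1)   # e_h runs first and succeeds
main_ok = loc.cas(0, 1)     # e_m then fails, observing e_h's update
```

Dropping the "no later write re-writes the same value" condition breaks this argument: an ABA-style intervening write of 0 would let both CASes succeed.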

Exploration Algorithm
Let us now proceed by showing how Spore enumerates all SYM-consistent execution graphs of a program P. The algorithm, shown in Algorithm 1, constructs the consistent graphs incrementally, recording the event addition order in the graphs' ≤_G component. Spore is optimal in the sense that it only explores consistent execution graphs and never explores two execution graphs that differ only in their ≤_G components. Spore verifies the input program P under a memory model M by calling Explore with the initial graph G_∅ containing only the initialization event init.
First, Explore(G) checks whether the current graph contains an error (Line 2). Note that errors are checked against Spore's memory model: they include not only errors under the underlying memory model M, but also user annotation errors.
In addition, recall that Spore's errors include the existence of po ∪ rf ∪ co cycles. Such a check is necessary to justify why exploring cb_SYM-acyclic execution graphs suffices: any (po ∪ rf ∪ co)-acyclic graph where the symmetry-before order does not contradict the eco order is also cb_SYM-acyclic.
Algorithm 1 Spore: An optimal combination of DPOR and SR

procedure Explore_P(G)
    if IsErroneous_SYM(G) then exit("Error")
    ...
    for w ∈ G.W_loc(e) do ExploreIfConsistent_P(SetRF(G, e, w))
    ...
    for w' ∈ G.W_loc(e) do ExploreIfConsistent_P(SetCO(G, w', e))
    ...

If the graph is error-free, Explore extends it by one event e from the program by calling AddNextEvent (Line 3). If there are no events to add, then a full execution of P has been explored, and Explore returns.
If e is a read, then Explore recursively explores all consistent rf options for that read: for each same-location write w, Explore recursively calls itself (via the helper function ExploreIfConsistent) on the graph that results if e reads from w (Line 5). ExploreIfConsistent checks whether the graph is consistent (Line 15), and if so calls Explore recursively. (Recall that consistency also requires that the graph does not violate our SR principle.) If e is a write, Spore proceeds with the non-revisit case and the revisit case in turn. For the non-revisit case, Explore checks all possible placements of the newly added write in co by means of ExploreCOs (Line 7).
For the revisit case, Spore also checks whether any of the existing same-location reads can be revisited to read from e: since e was not present when their possible reads-from options were examined, Explore explores these additional rf options now. Thus, for each same-location read r that does not precede e, if revisiting r will not lead to a duplicate exploration (checked by ShouldRevisit 5), Explore calls ExploreCOs on the graph that results if all the events added after r are deleted, excluding e and its predecessors (Line 11).
Observe, however, that as motivated earlier in §2.2.4, Spore only explores cb_SYM-acyclic execution graphs. As such, Spore never revisits reads that are cb_SYM-before e (as opposed to cb_M-before e), as revisiting such reads would create cb_SYM cycles (the cb_SYM-prefix of a revisiting write is always preserved).
If e has any other type (Line 13), Explore simply calls itself recursively.
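The recursion just described can be sketched in a few lines. The toy explorer below (our own simplification, not Spore's implementation) adds events in a fixed order and branches over the rf options of each read; to sidestep revisiting, it adds the write first, whereas the real algorithm handles writes that appear after the reads via the revisit case. Consistency and co enumeration are also elided.

```python
# Toy program: each entry is (thread, kind, location); a "W" writes 1.
# The write is added first so reads see all their rf options up front
# (Spore instead revisits earlier reads when a write is added later).
PROGRAM = [("T1", "W", "x"), ("T2", "R", "x"), ("T3", "R", "x")]

def explore(i, writes, rf, results):
    """Recursively enumerate all rf choices, one event at a time."""
    if i == len(PROGRAM):            # no events left: full execution explored
        results.append(dict(rf))
        return
    tid, kind, loc = PROGRAM[i]
    if kind == "W":                  # non-revisit case, co elided
        writes.setdefault(loc, ["init"]).append(tid)
        explore(i + 1, writes, rf, results)
        writes[loc].pop()
    else:                            # read: branch over same-location writes
        for w in writes.get(loc, ["init"]):
            rf[tid] = w
            explore(i + 1, writes, rf, results)
        del rf[tid]

results = []
explore(0, {}, {}, results)
# two rf options per read -> four executions, matching w+r+r's four graphs
```

Each element of `results` records one reads-from assignment, i.e., one execution graph of the w+r+r program of Example 1.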
Remark 1. Observe that, with the exception of annotation errors, Spore does not take any special care for method annotation labels M. Indeed, these are handled implicitly by the interpreter: Line 3 adds events according to our annotated semantics ⟦P⟧Annot. When the interpreter encounters a function annotated with main, it yields an M_main(a) event (which is not treated specially) as well as the events of the function, while for a function annotated with help it yields only an M_help(a) event.
Remark 2. We assume that the AddNextEvent procedure (Line 3) always picks the leftmost thread among the ones that are symmetric, i.e., those whose next events are prefix-matching. This is necessary for the algorithm's correctness, which demands that when an event e is added, its cb_SYM-prefix already be present in the graph.

Soundness, Completeness and Optimality
3.4.1 Soundness of Internal Symmetries. We show that if a program P is erroneous under its standard interpretation ⟦P⟧ (which ignores annotations), then it is also erroneous under the annotated interpretation ⟦P⟧Annot (which encodes annotated functions with dummy events). See [Kokologiannakis et al. 2024b] for how programs are mapped to sets of execution graphs.
Theorem 3.4. Let P be an annotated program and G ∈ ⟦P⟧_M such that IsErroneous_M(G). Then, there exists G' ∈ ⟦P⟧Annot_M such that IsErroneous_SYM(G').

Proof sketch. It suffices to show that there exists a corresponding execution G' (where every e_h is treated as a (single) dummy event M_help(...)) such that (1) IsErroneous_M(G') holds, or (2) G' is incorrectly annotated (see Def. 3.3). The lack of an annotation error is essential in showing that changing G' so that e_m succeeds instead of e_h does not affect consistency.
The conditions of Def. 3.3 essentially enforce that in any execution where e_h would succeed: (a) there is an e_m running the same code; (b) e_m fails (there can only be one write that writes the expected value); (c) e_m reads from the CAS of e_h, or from a co-later write (due to coherence and the presence of the source event), and therefore there is a porf-path from the CAS of e_h to the CAS of e_m (all writes to the CAS location are part of an RMW, and thus such a co path is also a porf path); and (d) e_m is preceded by a write that was observed by the thread of e_h. This guarantees that swapping the events of e_m with those of e_h, and replacing the events of e_h with a no-op, adds no synchronization to the execution, and therefore preserves both consistency and the presence of an error.
If any of the previous conditions fails, we show that there exists an execution with e_h treated as a no-op that is not correctly annotated. □

3.4.2 Correctness of Spore. To state our desired result, we first need to formally define which execution graphs are considered equivalent up to symmetry. Given a program P with N threads, a valid thread permutation π is a bijection {1, ..., N} → {1, ..., N} such that threads π(t) and t share the same code for all 1 ≤ t ≤ N. We say that two executions G1 and G2 are symmetric, denoted G1 ≈ G2, if there exists a valid thread permutation π such that π(G1) = G2, where π(G1) applies the permutation to all the thread IDs in the events of G1.
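This equivalence is easy to state operationally. The sketch below (our own illustrative encoding, not Spore's) represents an execution as a set of (tid, idx, label) events and checks symmetry by trying every code-preserving permutation of thread IDs:

```python
from itertools import permutations

def symmetric(g1, g2, code):
    """g1, g2: sets of (tid, idx, label) events; code maps each tid to
    the code its thread runs. The executions are symmetric iff some
    valid (code-preserving) thread permutation maps g1 onto g2."""
    tids = sorted(code)
    for perm in permutations(tids):
        pi = dict(zip(tids, perm))
        # valid permutation: pi(t) and t must share the same code
        if any(code[pi[t]] != code[t] for t in tids):
            continue
        if {(pi[t], i, l) for (t, i, l) in g1} == set(g2):
            return True
    return False

# Two symmetric threads that run the same code; swapping their IDs
# maps one execution onto the other.
code = {1: "write_x", 2: "write_x"}
g1 = {(1, 1, "W(x,1)-first"), (2, 1, "W(x,1)-second")}
g2 = {(2, 1, "W(x,1)-first"), (1, 1, "W(x,1)-second")}
```

Brute-forcing all N! permutations is of course exponential; the point of Spore's symb order and the leftmost-thread scheduling discipline is precisely to avoid ever materializing this search.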
The following proposition demonstrates that the class of M-consistent execution graphs up to symmetry corresponds (one-to-one) to the class of SYM-consistent execution graphs.

Proposition 3.5. Given a program P and an execution graph G ∈ ⟦P⟧Annot_M, there is a unique execution graph G' ∈ ⟦P⟧Annot_SYM such that G ≈ G'.

Proof. To obtain G' from G, sort the threads running the same function by the eco of their respective events (lexicographically, in po order). It is easy to see that this ordering is well-defined (there are no cycles) and unique (any possibly eco-unordered threads are in fact equal), and that the constructed graph G' satisfies irreflexive(symb; eco). □

Correctness of the exploration algorithm follows by adapting the proof of Awamoche [Kokologiannakis et al. 2023] and is captured by the following proposition.

Proposition 3.6 (Algorithmic Correctness and Optimality).
Termination holds because either a revisit step is performed and the part of the graph that cannot change grows, or a non-revisit step is performed and the execution graph grows. Soundness holds by construction, because consistency is checked before every recursive call. Completeness is more elaborate: it holds because all possible rf/co options are considered for each newly added event, and moreover previous reads can be revisited in their maximal extension (which always exists and is consistent). Optimality holds because there cannot be two steps leading to the same graph; in the case of revisits, this is precluded by the uniqueness of maximal extensions.
We next show that if ⟦P⟧Annot_SYM includes a cb_SYM-cyclic execution, which the algorithm would not explore, then it also includes a cb_SYM-acyclic execution with a po ∪ rf ∪ co cycle, which the algorithm would explore and report.

Proposition 3.7 (cb_SYM cycle). If there is an execution G ∈ ⟦P⟧Annot_SYM with a G.cb_SYM cycle, then there is an execution G' ∈ ⟦P⟧Annot_SYM such that irreflexive(G'.cb_SYM) and G' has a po ∪ rf ∪ co cycle.

Combining Prop. 3.5, Prop. 3.6(3), and Prop. 3.7, we obtain our completeness result.

Theorem 3.8 (Completeness). If there exists G ∈ ⟦P⟧Annot_SYM such that IsErroneous_SYM(G), then Explore_P(G_∅) will report an error. Otherwise, for each G ∈ ⟦P⟧Annot_M, Explore_P(G_∅) will explore an execution G' ∈ ⟦P⟧Annot_SYM such that G ≈ G'.

Combining Prop. 3.5 and Prop. 3.6(4), we obtain our optimality result.

Theorem 3.9 (Optimality). For any two executions G and G' explored by Explore_P(G_∅), G ≉ G'.

EVALUATION
We implemented Spore as a tool for C/C++ programs on top of the open-source GenMC stateless model checker, which implements the TruSt algorithm for DPOR. We reused GenMC's infrastructure for interpreting programs and for constructing and maintaining execution graphs, but replaced GenMC's consistency-checking and error-detection mechanisms with the ones described in §3.1. We also modified the notion of a prefix used in graph construction to use cb_SYM, and made GenMC's scheduler respect cb_SYM when encountering symmetric threads.

Goals
We evaluate Spore on a set of real-world implementations with two goals: (1) show that Spore scales well enough to verify useful implementations (and determine its scalability limit), and (2) determine to what extent its scalability should be attributed to internal vs thread-level symmetries.
To attain these goals, we run Spore on a set of representative real-world clients and benchmarks. The clients evaluate the effectiveness of the SR algorithm, while the benchmarks evaluate the effectiveness of Spore's modeling of internal symmetries. To further study how internal and thread-level symmetries contribute to Spore's performance, we compare Spore against (a) plain SMC enhanced with SR (SR), (b) a baseline TruSt implementation (TruSt), (c) Spore without thread-level symmetries (DPOR+IS), and (d) Spore without internal symmetries (DPOR+SR). Our evaluation is performed under RC11.
As we show, Spore yields a huge improvement over the state-of-the-art, as it can gracefully scale up to 6 threads (and often to many more), and both internal and thread-level symmetries are crucial for its scalability to more threads.
Experimental Setup. We conducted all experiments on a Dell PowerEdge R6525 system running a custom Debian-based distribution with 2 AMD EPYC 7702 CPUs (256 cores @ 2.80 GHz) and 2TB of RAM. We set the timeout limit to 30 minutes. All times are in seconds.
We also ran some of our benchmarks against the DPOR implementation of Nidhugg [Abdulla et al. 2014], which obtained similar and/or worse results than TruSt (see [Kokologiannakis et al. 2024b]).

Benchmarks
To evaluate the effectiveness of thread-level symmetries, we used three different clients:
• Multiset(N): N/2 threads insert elements into a data structure and N/2 threads remove elements from it; the client checks whether each removed element was previously inserted.
• Empty(N): N threads insert an element and subsequently remove an element; the client ensures each removal succeeds.
As can be seen, the clients become progressively more challenging, in the sense that the number of operations per thread increases, which hinders symmetry reduction.
To demonstrate that Spore is applicable to non-data-structure benchmarks as well, we used two other clients (Fig. 4):
• Mutex(N): N threads perform a lock followed by an unlock operation.
• RDCSS(N): N threads perform an RDCSS call followed by an RDCSS/read call, and 2 threads perform a single RDCSS call.
To evaluate the effectiveness of internal symmetries, we used some representative benchmarks, both with and without idempotent operations:
• msqueue [Michael and Scott 1998], dglmqueue [Doherty et al. 2004], folqueue [Fober et al. 2001] and rdcss [Harris et al. 2002] all employ idempotent operations.
• treiber [Treiber 1986], ttaslock [Herlihy and Shavit 2008, §7.2] and twalock [Dice and Kogan 2019] do not employ idempotent operations.
These benchmarks exercise different aspects of internal symmetries, so that the individual effects of each symmetry type are more visible.
We also note that we have identified idempotent operations in various widely used concurrency libraries (e.g., libcds [Khizhinsky n.d.], folly [Facebook n.d.], ckit [Bahra n.d.]).Even though Spore's support for C++ precluded us from using libcds and folly as benchmarks, we did manage to run certain benchmarks from ckit, with similar performance gains.

Results
Our results are summarized in Fig. 3. First, as explained in §1, SR alone is inadequate for scalability, and using a combination of DPOR and SR is crucial: with the exception of a few benchmarks, SR does not scale, as idempotent operations break symmetry, thereby leading to state-space explosion. Spore, on the other hand, runs lickety-split: it explores a single execution when the client is fully symmetric (up to 4 threads), and a small number of executions otherwise (modeling the different ways insertions interfere with deletions). As the number of dequeuers increases, Spore explores more executions, as there are more ways for deletions to interfere with insertions.
Moving on to folqueue and treiber, we can make observations similar to the ones for the previous benchmarks, albeit a bit toned down.In the case of folqueue, thread-level symmetries have a limited effect, as each thread uses a distinct (global) location to dispose pointers, which breaks symmetry among threads early: Spore performs similarly to DPOR+IS, while TruSt performs similarly to DPOR+SR.Analogously, in treiber, internal symmetries have no effect, as the code has no idempotent operations: Spore performs just as well as DPOR+SR, while DPOR+IS performs just as well as TruSt.
Generally, we observe that DPOR+IS performs better than DPOR+SR in the multiset client when both thread-level and internal symmetries are present, implying that internal symmetries carry more weight when it comes to scaling to more threads. This should not come as a surprise: idempotent operations might be performed more than once per thread, while thread-level symmetry breaks after the first non-symmetric operation. As such, since the number of idempotent operations is greater than the number of threads, internal symmetries offer a greater reduction.
Next, we move on to the other two clients. In a similar fashion, Spore scales much better than TruSt (which only manages to terminate within the time limit for two or three configurations), although it does not manage to finish within the time limit for all configurations, since these clients are not completely symmetric (unlike the multiset one). As expected, Spore performs better in the LIFO/FIFO client (where it can better leverage the symmetry in the client), and DPOR+IS performs better than DPOR+SR whenever there are internal symmetries, for the same reasons as in the multiset client. (Note that Spore performs similarly to DPOR+IS for the first configuration of each benchmark in the LIFO/FIFO client, as SR requires at least two symmetric threads to have any effect.)

Finally, in Fig. 4 we compare all tools on some non-data-structure benchmarks. The two locking benchmarks do not employ idempotent operations, and thus Spore coincides with DPOR+SR, which has an exponentially smaller state space than plain DPOR. In contrast, rdcss makes heavy use of idempotent operations, and so Spore manages to scale far better than plain DPOR.

RELATED WORK
As far as symmetry reduction is concerned, it has mostly been explored in the context of stateful model checking [Clarke et al. 1996; Emerson and Wahl 2005; Wahl and Donaldson 2010]. In that setting, the main challenge is to identify when two threads are symmetric, which is computationally as hard as the graph isomorphism problem. By contrast, Spore is able to detect when two threads are symmetric on-the-fly, though in principle the reductions it achieves are not as good as the ones in stateful model checking.
As far as internal symmetries are concerned, even though a lot of effort has been devoted to making DPOR algorithms more efficient and scalable during the past few years (e.g., [Abdulla et al. 2015, 2017, 2018; Aronis et al. 2018; Chalupa et al. 2017; Chatterjee et al. 2019; Kokologiannakis et al. 2017, 2019b, 2022; Nguyen et al. 2018; Norris and Demsky 2013; Rodríguez et al. 2015]), most works focus on improving the core of DPOR and do not take the programs under test into consideration. SAVer [Kokologiannakis et al. 2021] and LAPOR [Kokologiannakis et al. 2019a] extend DPOR for programs that have spinloops and locks, respectively, while constrained-DPOR [Albert et al. 2018] takes programmer annotations into account in order to consider certain atomic operations non-conflicting.
In a different context, there has been a large body of work on static verification of concurrent programs, with techniques such as bounded model checking (BMC) or abstraction-based techniques (e.g., [Clarke et al. 2004;Elmas et al. 2009;Flanagan et al. 2005;Gavrilenko et al. 2019]).We expect that-at least for SAT/SMT-based techniques-both thread-level and internal symmetries could be exploited in a similar fashion to reduce the size of the resulting SAT formula and speed up the verification.

CONCLUSION
We presented Spore, a novel model checking algorithm that combines DPOR with symmetry reduction, and also exploits internal symmetries of C/C++ concurrent data structures.Our experiments confirm that Spore outperforms the state-of-the-art by a wide margin.
There are several ways this work could be extended. First, we would like to see whether Spore can handle other classes of programs in related domains, namely distributed algorithms and/or persistent programs, where similar symmetries appear. It remains to be seen whether those domains exhibit patterns that can be exploited in a similar fashion to enhance the applicability of automated verification techniques. Second, it would also be interesting to see whether Spore can be applied to models like ARMv8 [Flur et al. 2016] and POWER [Alglave et al. 2014] that, unlike TruSt's supported models, do allow po ∪ rf cycles in consistent executions (which Spore does not currently produce). Finally, Spore could also be combined with testing techniques, so that only representative executions are produced when obtaining traces of a concurrent program.

Example 1
Consider the w+r+r program below.

T1: x := 1    T2: r2 := x    T3: r3 := x    (w+r+r)

Under sequential consistency (SC) [Lamport 1979], the program has four executions, G1-G4, which model the four equivalence classes into which the 3! = 6 thread interleavings are partitioned. These graphs can be produced by the following DPOR exploration starting from the initial graph G_Init through the intermediate graphs G_A, G_B, and G_C.
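The count of equivalence classes can be checked by brute force: enumerate all 3! interleavings of the write and the two reads and group them by the values the reads observe. The script below is our own illustration, not part of Spore.

```python
from itertools import permutations

def run(order):
    """Execute one interleaving of w+r+r; x starts at 0."""
    x, r2, r3 = 0, None, None
    for step in order:
        if step == "w":       # T1: x := 1
            x = 1
        elif step == "r2":    # T2: r2 := x
            r2 = x
        else:                 # T3: r3 := x
            r3 = x
    return (r2, r3)

# 6 interleavings collapse into 4 observable outcomes,
# one per execution graph G1-G4: (0,0), (0,1), (1,0), (1,1)
outcomes = {run(p) for p in permutations(["w", "r2", "r3"])}
```

Each outcome corresponds to one choice of rf for the two reads (init or T1's write), which is exactly how the four graphs G1-G4 differ.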