Concurrent Immediate Reference Counting

Memory management for optimistic concurrency in unmanaged programming languages is challenging. Safe memory reclamation (SMR) algorithms help address this, but they are difficult to use correctly. Automatic reference counting provides a simpler interface, but it has been less efficient than SMR algorithms. Recently, there has been a push to apply the optimizations used in garbage collectors for managed languages to elide reference count updates from local references. Notably, Fast Reference Counter, OrcGC, and Concurrent Deferred Reference Counting use SMR algorithms to protect local references by deferring decrements or reclamation. While they show a significant performance improvement, their use of deferral may result in growing memory usage due to slow reclamation of linked structures, and suboptimal performance in update-heavy workloads. We present Concurrent Immediate Reference Counting (CIRC), a new combination of SMR algorithms with reference counting. CIRC employs deferral like other modern methods, but it avoids their problems with novel algorithms for (1) immediately reclaiming linked structures recursively by tracking the reachability of each object, and (2) applying decrements immediately and deferring only the reclamation. Our experiments show that CIRC's memory usage does not grow over time and is only slightly higher than the underlying SMR's. Moreover, CIRC further narrows the performance gap with the underlying SMR, positioning it as a promising solution to safe automatic memory management for highly concurrent data structures in unmanaged languages.


INTRODUCTION
The performance and scalability of modern software systems often depend on highly concurrent non-blocking data structures. These data structures can optimistically access memory, which makes manual memory management challenging: knowing when memory will no longer be accessed and is safe to reuse is difficult.
To address this challenge, algorithms for safe memory reclamation (SMR) have been developed for non-blocking data structures. There are many forms of SMR, such as hazard pointers (HP) [Michael 2002b, 2004], pass-the-buck [Herlihy et al. 2005], read-copy-update (RCU) [McKenney and Slingwine 1998], and epoch-based reclamation (EBR) [Fraser 2004]. The interfaces to these systems require a deep understanding of both the client data structure and the SMR algorithm. This makes their correct usage challenging. In fact, Anderson et al. [2021] report several usage bugs in the benchmark suites of several SMR algorithms that lead to use-after-free and memory leaks.
An alternative to SMR is automatic reference counting. With C++20, concurrent reference counting is exposed in the standard library as atomic<shared_ptr>, which provides a considerably easier programming model. Unfortunately, in the presence of optimistic memory access, getting the implementation correct and efficient is tricky. Standard implementations typically involve either locks or split reference counts [Williams 2019]. These implementations negate many of the scaling benefits of non-blocking data structures.
On the other hand, developments in garbage collectors (GCs) for managed languages have shown that reference counting can be fast. A key optimization in high-performance reference counting GC is to avoid eagerly counting the references from the local variables (i.e., stacks and registers) and let the collector check them during the collection routine. There are two seminal approaches to this optimization. In Deutsch and Bobrow [1976]'s method, objects that reach zero count (i.e., no references from other heap objects) are first added to the zero-count table (ZCT), deferring their reclamation. The GC occasionally pauses all the other threads, scans the local variables to temporarily mark objects referenced by them (e.g., by incrementing), and reclaims all unmarked objects in the ZCT. In Bacon et al. [2001]'s method, decrements are deferred by logging them in a thread-local buffer. The GC pauses each thread one by one to fetch their logs and scan their local variables. Then the GC temporarily increments the referents of scanned local variables and executes the decrements logged sufficiently long ago, reclaiming the objects that reach zero count.
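The ZCT scheme of Deutsch and Bobrow can be sketched as a single-threaded simulation. All names below (DeferredRc, alloc, collect, etc.) are illustrative inventions for this sketch, not part of any real collector: only references from heap objects are counted, zero-count objects are parked in the ZCT, and a collection pass reclaims ZCT entries that no local root references.

```rust
use std::collections::{HashMap, HashSet};

// Minimal single-threaded sketch of Deutsch-Bobrow deferred reference
// counting: heap references are counted eagerly, local (stack) references
// are not. Objects whose heap count is zero sit in the zero-count table
// (ZCT) instead of being freed immediately.
struct DeferredRc {
    heap_count: HashMap<u32, usize>, // object id -> count of heap references
    zct: HashSet<u32>,               // zero-count table
}

impl DeferredRc {
    fn new() -> Self {
        DeferredRc { heap_count: HashMap::new(), zct: HashSet::new() }
    }

    fn alloc(&mut self, id: u32) {
        // A fresh object has no heap references yet, so it starts in the ZCT.
        self.heap_count.insert(id, 0);
        self.zct.insert(id);
    }

    fn inc(&mut self, id: u32) {
        *self.heap_count.get_mut(&id).unwrap() += 1;
        self.zct.remove(&id); // no longer zero-count
    }

    fn dec(&mut self, id: u32) {
        let c = self.heap_count.get_mut(&id).unwrap();
        *c -= 1;
        if *c == 0 {
            self.zct.insert(id); // defer: a stack may still reference it
        }
    }

    // Collection: scan local roots; reclaim ZCT entries not referenced by them.
    fn collect(&mut self, local_roots: &HashSet<u32>) -> Vec<u32> {
        let dead: Vec<u32> = self
            .zct
            .iter()
            .copied()
            .filter(|id| !local_roots.contains(id))
            .collect();
        for id in &dead {
            self.zct.remove(id);
            self.heap_count.remove(id);
        }
        dead
    }
}
```

The concurrent algorithms discussed next replace the stop-the-world root scan of this sketch with an SMR scheme's protection scan.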
Recently, there has been a push to apply such optimizations to concurrent reference counting for unmanaged languages. Modern algorithms such as Fast Reference Counter (FRC) [Tripp et al. 2018], OrcGC [Correia et al. 2021], and Concurrent Deferred Reference Counting (CDRC) [Anderson et al. 2021, 2022] use SMR to protect uncounted local references, replacing the GC's role of scanning local references. Specifically, OrcGC delays the reclamation of the zero-count object protected by hazard pointers, following Deutsch and Bobrow [1976]; FRC delays decrements and temporarily increments objects protected by hazard pointers, following Bacon et al. [2001]; and CDRC generalizes the deferred decrement approach to other SMRs by delaying decrements to an object until it is no longer protected by the SMR.
Problems of Deferral-Based Methods

SMR-based deferral has enabled efficient concurrent reference counting in unmanaged languages. However, it is also a weakness.
Slow progress of reclamation. The delay between the release of the last reference and reclamation can lead to the algorithm not being able to keep up with the application's rate of creating garbage. For instance, Fig. 1a shows that the memory usage of a linked-list-based concurrent queue using CDRC grows over time. To see why this occurs, consider dequeuing a series of nodes n_1 to n_k from the queue, where each node n_i references n_{i+1} (Fig. 1b). Only after n_i is reclaimed can n_{i+1} be considered for reclamation. But reclaiming n_{i+1} requires checking the protected local references. Since this process is executed in batches by the underlying SMR, the rate of collection may fall behind the rate of dequeuing, resulting in a buildup of garbage backlog. Other modern reference counting methods face the same issue, as they all utilize some form of deferred processing. For example, Correia et al. [2021] observed that the memory footprint of the lock-free skiplist [Fraser 2004; Shavit et al. 2011] using OrcGC can be very large, because the multiple levels of links increase the likelihood of forming garbage chains. Several solutions exist, but they compromise either the performance or the safe interface. The first, straightforward solution is to eagerly attempt to execute the deferred tasks. For example, OrcGC scans all the hazard pointer slots whenever an object reaches zero count. However, this incurs a large overhead. In fact, Anderson et al. [2021] report that OrcGC is consistently outperformed by CDRC, more than twice slower in some cases. Our evaluation of a linked-list-based queue with an eager variant of CDRC shows a similar increase in overhead (Fig. 8a). And yet, this solution does not completely fix the memory usage problem, as observed in OrcGC's skiplist benchmark and our queue benchmark (Fig. 8b).
The second solution, employed by OrcGC and called poisoning, is to manually apply preemptive decrements to the successors of garbage objects, eliminating the dependency between the reclamation of linked objects. For example, after dequeuing n_i, it immediately decrements n_{i+1}. However, for safety, the links from the poisoned objects must not be followed. This is done by marking the links as poisoned when applying preemptive decrements, and restarting the operations that encounter poisoned links. This not only disallows uninterrupted traversal of linked structures, but also makes its application as difficult as manual SMR schemes. Specifically, programmers must ensure that they poison only the detached objects in order to maintain the correctness of data structure operations.
Finally, we believe the deferred decrement approaches (FRC and CDRC) could enable prompt recursive reclamation by temporarily incrementing the protected local references and immediately applying recursive decrements during the collection routine, similarly to the original algorithm by Bacon et al. [2001]. However, this is not compatible with the fastest variants of CDRC, because they are based on RCU/EBR-like SMR schemes that do not announce each local reference. They instead use critical sections that protect all references inside them, which is the key factor in their superior performance. Adding per-pointer protection would negate this performance advantage.
Performance overhead of deferred decrement. CDRC may incur significant performance overhead when objects have many counted references and are updated frequently. Since CDRC schedules a deferred task for each decrement, it creates a large number of deferred tasks in such cases. Frequent scheduling of deferred tasks imposes a non-negligible burden on the underlying SMR, as it increases the frequency of scanning local protections. In addition, it would also increase global synchronization overhead if CDRC were implemented on top of real-world SMR implementations such as Folly's HP and RCU [Meta 2023] and Crossbeam's EBR [Crossbeam Developers 2023], because these use shared data structures to distribute the reclamation workload and to be transparent [Nikolaev and Ravindran 2021], e.g., supporting dynamic (un)registration of threads.

Our Solution for Fast and Safe Recursive Reclamation
We present Concurrent Immediate Reference Counting (CIRC), a new combination of an SMR scheme with reference counting. CIRC employs deferral like other modern methods, but it avoids their problems without incurring significant overhead or resorting to an unsafe interface. In throughput, CIRC generally outperforms CDRC and incurs little to modest overhead over the underlying SMR. The key idea of CIRC is to track when each object was last reachable, so that it can immediately and recursively reclaim the objects that have been unreachable for a sufficiently long time. CIRC realizes this idea in a safe interface and an efficient implementation, underpinned by two novel algorithms.
First, CIRC automatically tracks reachability without inputs from the programmer. To do so, CIRC divides time into epochs [Bacon et al. 2001; Fraser 2004], and attaches a few-bit representation of the epoch number to pointer fields and reference counts, which are updated along with pointer writes and immediate decrements, respectively. The updates are done in such a way that when an object reaches zero count, its reference count epoch is the epoch at which the object was last reachable. Specifically, this reachability information is propagated through recursive decrements: in Fig. 1b, if n_i was dequeued long ago and is being destroyed now, n_{i+1} could not have been accessed through n_i. This knowledge allows immediate recursive destruction of n_{i+1}.
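The idea of attaching a few-bit epoch to a count can be sketched as bit packing within a single word. The 2-bit width, the layout, and all names here are illustrative assumptions for the sketch, not CIRC's actual representation.

```rust
// Sketch: pack a few-bit epoch tag next to a reference count in one word.
// Layout assumption (ours): 2 epoch bits in the low bits of a u64.
const EPOCH_BITS: u32 = 2;
const EPOCH_MASK: u64 = (1u64 << EPOCH_BITS) - 1;

fn pack(count: u64, epoch: u64) -> u64 {
    (count << EPOCH_BITS) | (epoch & EPOCH_MASK)
}

fn count_of(word: u64) -> u64 {
    word >> EPOCH_BITS
}

fn epoch_of(word: u64) -> u64 {
    word & EPOCH_MASK
}

// An immediate decrement also records the (truncated) current epoch, so that
// when the count reaches zero, the word remembers when the object was last
// reachable.
fn decrement_with_epoch(word: u64, current_epoch: u64) -> u64 {
    pack(count_of(word) - 1, current_epoch)
}
```

Because both fields live in one word, a single atomic CAS or FAA can update the count and the epoch tag together, which is what makes the tagging compatible with lock-free counting.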
Second, CIRC follows the immediate decrement style (i.e., deferred reclamation) and handles concurrency in a simple yet efficient way. Zero-count objects must be checked for local references, but concurrency complicates the problem. If a zero-count object is incremented away from zero and decremented back to zero while a collector is running concurrently, its destruction must be canceled because there can be new local references missed by the collector. To detect such cases, OrcGC uses an additional sequence number to track how many updates have been applied to the reference count. Our approach is surprisingly simple: if an increment moves the count away from zero, then it must apply a second increment. Thus, when the delayed destruction is processed, if the reference count is still zero, then it must be safe to destruct the object. Otherwise, the destruction is canceled, and the additional reference count is removed.
The combination of the two key ideas allows CIRC to reclaim memory considerably more quickly than CDRC. Fig. 1a shows that CIRC's memory usage is only slightly higher than the underlying SMR's and, more importantly, does not grow with time. At the same time, CIRC performs comparably to EBR in read-mostly workloads, and in the worst case of heavily contended workloads with large amounts of reference count updates, CIRC performs within 35% of EBR. In addition, the HP version of CIRC without recursive destruction shows up to 55% throughput improvement over its CDRC counterpart thanks to the reduced number of deferred tasks.
The rest of the paper is structured as follows. §2 provides the background on SMR algorithms and the basic structure of deferral-based concurrent reference counting algorithms. §3 shows how to immediately apply decrements, and §4 presents the algorithm that allows immediately applying recursive destruction. §5 extends CIRC with support for weak references to handle reference cycles. §6 presents an experimental evaluation of CIRC against the underlying SMRs and CDRC. §7 discusses related work in detail and concludes with future work.

Fig. 2. The race between reference creation and reclamation. Thread T1 is attempting to get a reference to n1. At the same time, thread T2 detaches n1, decrements it, and reclaims it.

BACKGROUND
A naive implementation of reference counting does not work in non-blocking concurrent programs because of the race between reference creation and reclamation. Consider the scenario illustrated in Fig. 2, where two threads access a shared non-blocking concurrent linked list. The first thread T1 reads the root pointer root to obtain a reference to the first node n1. But just before T1 increments the count of n1, another thread T2 unlinks n1 from root and decrements the count, which reaches 0, thus destructing and deallocating n1. Then it is not safe for T1 to increment n1, as doing so leads to use-after-free. Note that making the increment atomic, e.g., using the fetch-and-add (FAA) instruction, is not enough to prevent this issue, because the issue stems from the fact that obtaining a pointer value and incrementing the pointed-to object's count are not atomic.
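The problematic interleaving of Fig. 2 can be replayed deterministically as a sketch, with the heap modeled as an Option so that a free is observable. All names here are illustrative; this models only the ordering of events, not real memory.

```rust
// Deterministic replay of the racy interleaving in Fig. 2. The object n1 is
// modeled as Option<count>: None means it has been destructed and freed.
#[derive(PartialEq, Debug)]
enum Outcome {
    Ok,
    UseAfterFree,
}

fn replay_race() -> Outcome {
    // n1 is live with count 1 (the reference from root).
    let mut n1_count: Option<u32> = Some(1);

    // Step 1 (T1): read root and obtain a raw pointer to n1 (not yet counted).
    let t1_saw_n1 = true;

    // Step 2 (T2): unlink n1 from root, decrement its count to 0, and free it.
    if let Some(c) = n1_count {
        if c - 1 == 0 {
            n1_count = None; // destructed and deallocated
        }
    }

    // Step 3 (T1): try to increment n1's count -- but n1 is already freed.
    if t1_saw_n1 {
        match n1_count.as_mut() {
            Some(c) => {
                *c += 1;
                Outcome::Ok
            }
            None => Outcome::UseAfterFree,
        }
    } else {
        Outcome::Ok
    }
}
```

An atomic FAA at step 3 would not help: the hazard is that steps 1 and 3 are not atomic as a pair, which is exactly the gap that SMR protection or a protected increment must close.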
In this section, we review how manual concurrent reclamation algorithms handle such problems (§2.1) and the common aspects of modern concurrent reference counting algorithms for unmanaged languages that utilize the manual reclamation methods under the hood (§2.2).

Manual Concurrent Reclamation Schemes
Manual memory reclamation methods for non-blocking concurrency, often called safe memory reclamation (SMR) schemes, require some manual effort to defer the reclamation until it is safe. In particular, they provide an interface consisting of a function to protect a local reference from the reclamation of its referent, and a function to retire a pointer, i.e., schedule its reclamation, so that it can be reclaimed later when it is no longer protected. There are two classic approaches for implementing this interface: pointer-based methods represented by hazard pointers (HP) [Michael 2004] and critical-section-based approaches represented by read-copy-update (RCU) [McKenney and Slingwine 1998].
Hazard pointers. In HP, each thread owns hazard pointer slots in which it announces (stores) a pointer to protect. For example, in Fig. 2, T1 should write n1 to its hazard slot before accessing it. On the other hand, T2 calls the retire() function with n1 after unlinking it from root. The retire() function occasionally triggers the reclamation procedure, which takes a pointer from the retired pointer list, checks if it is protected by any of the hazard pointer slots, and if not, reclaims it. Reclamation is usually done in batches (or incrementally) to amortize (or distribute) the cost of scanning the protection slots.
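The reclamation step can be sketched as a single-threaded simulation: a retired pointer is freed only if no hazard slot announces it. The names (HpDomain, reclaim) and the use of plain collections instead of per-thread atomic slots are simplifying assumptions of this sketch.

```rust
use std::collections::HashSet;

// Single-threaded sketch of the hazard-pointer reclamation step. Object
// identities are modeled as u32 ids rather than raw pointers.
struct HpDomain {
    hazard_slots: Vec<Option<u32>>, // announced (protected) object ids
    retired: Vec<u32>,              // retired, not yet reclaimed
}

impl HpDomain {
    fn reclaim(&mut self) -> Vec<u32> {
        // Snapshot the protections once, so a whole batch of retired
        // pointers can be tested against it (amortizing the scan cost).
        let protected: HashSet<u32> =
            self.hazard_slots.iter().flatten().copied().collect();
        let (freed, keep): (Vec<u32>, Vec<u32>) = self
            .retired
            .iter()
            .copied()
            .partition(|id| !protected.contains(id));
        self.retired = keep; // still protected: try again next time
        freed
    }
}
```

Pointers that remain protected simply stay on the retired list for a later pass, which is why HP reclamation is naturally batched.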
One important subtlety in HP is that writing to a hazard slot by itself does not guarantee the safety of access. For example, in Fig. 2, T2 may retire and reclaim n1 just before T1 protects n1, leading to the same race problem discussed above. To ensure safety, the protector should validate that the pointer has not yet been retired. Since a pointer is retired only after it is detached (i.e., made unreachable) from the data structure's entry points, validation can be done by checking the reachability of the object. For example, T1 should check that root still points to n1 after writing to a hazard slot.
Read-copy-update. RCU provides protection based on critical sections. A critical section protects all references that can be obtained inside it. More precisely, if a pointer had not been retired before the beginning of a critical section, then it is protected in that critical section. For example, in Fig. 2, after T1 starts a critical section, it can freely traverse the list nodes even if some of them are detached and retired while T1 is traversing.
RCU can be implemented with an epoch-based technique [Fraser 2004], which maintains a monotonically increasing epoch counter that represents time. When a thread enters a critical section, it announces the current epoch. When an object is retired, the current epoch is recorded in the object. A retired object can be reclaimed if its retirement epoch is smaller than the minimum epoch among active critical sections. Such algorithms are called epoch-based reclamation (EBR), and they are usually the fastest among the SMR schemes.
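The EBR bookkeeping can be sketched similarly: each retired object records its retirement epoch, and a pass frees every object whose epoch is below the minimum announced epoch. The names (EbrDomain, reclaim) and plain-vector bookkeeping are assumptions of this sketch, not a real EBR implementation.

```rust
// Single-threaded sketch of epoch-based reclamation bookkeeping.
struct EbrDomain {
    active_epochs: Vec<u64>,  // epochs announced by active critical sections
    retired: Vec<(u32, u64)>, // (object id, retirement epoch)
}

impl EbrDomain {
    fn reclaim(&mut self) -> Vec<u32> {
        // An object is safe once its retirement epoch is below the minimum
        // epoch of all active critical sections: no active section could
        // have obtained a reference to it.
        let min_active = self.active_epochs.iter().copied().min().unwrap_or(u64::MAX);
        let (freed, keep): (Vec<(u32, u64)>, Vec<(u32, u64)>) = self
            .retired
            .iter()
            .copied()
            .partition(|&(_, epoch)| epoch < min_active);
        self.retired = keep;
        freed.into_iter().map(|(id, _)| id).collect()
    }
}
```

Note that a single stalled critical section pins min_active and thus blocks all later reclamation, which is EBR's well-known robustness weakness.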
Difficulties in using manual schemes. While the manual schemes are more performant than other methods, such as traditional implementations of concurrent reference counting, they are known to be difficult to use even for experienced programmers. Correctly applying them requires a deep understanding of the client data structure, and sometimes non-trivial changes must be made. For instance, manual methods require retiring all and only the objects that have been globally detached from the data structure. This is difficult in data structures such as lock-free skiplists [Fraser 2004; Shavit et al. 2011], where a logically deleted node can be physically inserted back. To decide the safety of retirement, the C++ and Rust implementations of lock-free skiplists we are aware of incorporate manual reference counting alongside an SMR method. For another example, HP requires validation and handling of its failure, which is not compatible with many concurrent data structures [Brown 2015]. Anderson et al. [2021] report usage bugs in the benchmark suites of several manual schemes that lead to use-after-free and memory leaks.

Basics of Deferral-Based Concurrent Reference Counting
FRC [Tripp et al. 2018], OrcGC [Correia et al. 2021], CDRC [Anderson et al. 2021, 2022], and our new algorithm CIRC use SMR schemes to implement efficient concurrent reference counting with a safe interface. They share the same key idea: using the SMR to protect objects from being reclaimed while incrementing them (Fig. 2), and exposing the SMR's local reference protection to allow short-lived accesses without updating reference counts. Algorithm 1 shows the interface and implementations that are common to them in a pseudocode with a Rust-style ownership type system. Our presentation largely follows the generalized version of CDRC [Anderson et al. 2022].
Preliminaries. Object<T> (line 1) represents objects of type T managed by reference counting. It extends T with a count for strong (normal) references and another for weak references. (Weak references are discussed in §5.) Rc<T> (line 4) is a smart pointer type for a reference-counted pointer to an object of type T, and Atomic<Rc<T>> (line 6) represents a mutable field that contains an Rc<T>.
A Snapshot<T> (line 8) is a local reference protected by the backend SMR scheme. It consists of the pointer value and a Guard that represents the per-pointer protection from the backend SMR, provided by the generalized interface called acquire-defer (AD, lines 11 to 16). For example, a guard in HP is a pointer to the hazard pointer slot, and in RCU it is the zero-sized unit type. To account for RCU-style protection, acquire-defer provides the begin_CS and end_CS functions to manage critical sections. Critical sections must be active throughout a client operation, and in particular Snapshots must not escape from the critical section they are created in. For HP, the critical section functions are no-ops. We collectively call the per-pointer protection and the critical section snapshot protections.

Algorithm 1. Interface and implementation common to deferral-based concurrent reference counting libraries. The parts that require algorithm-specific implementation are highlighted.
The acquire function creates a guard and protects the pointer loaded from src. For HP, it obtains a hazard slot, announces the loaded pointer, and validates the protection by checking that src has not changed (if changed, it repeats). This validation is always safe in the context of reference counting (unlike in HP), because a pointer loaded from Atomic<Rc> is backed by a reference count, which ensures that the object is live. This also means that a Snapshot can be acquired directly from an Rc without validation, using the acquire_raw function. For RCU, acquire does nothing other than loading the pointer. A guard is destroyed by the release function.
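The HP-style announce-then-validate loop can be sketched directly with atomics. The function and parameter names are ours; in particular, a real hazard slot lives in a shared registry that reclaimers scan, which this sketch elides.

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

// Sketch of HP-style acquire: announce the loaded pointer in a hazard slot,
// then re-read the source to validate the announcement. If src changed
// concurrently, retry with the newly observed pointer.
fn acquire<T>(src: &AtomicPtr<T>, hazard_slot: &AtomicPtr<T>) -> *mut T {
    let mut ptr = src.load(Ordering::Acquire);
    loop {
        hazard_slot.store(ptr, Ordering::SeqCst); // announce
        let cur = src.load(Ordering::Acquire);
        if cur == ptr {
            return ptr; // validated: src still points here, protection holds
        }
        ptr = cur; // src changed before the announcement was visible: retry
    }
}
```

In the reference counting setting described above, the re-read of src can only observe a live object, so the loop terminates as soon as src is stable.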
The defer function (renamed from "retire" to avoid confusion) schedules a task associated with ptr. A deferred task is executed after all the snapshot protections acquired before its scheduling are released. CDRC uses this function to schedule decrements, while OrcGC and CIRC schedule the reclamation of zero-count objects.
High-level implementation. Atomic<Rc>::get_snapshot (line 21) is a thin wrapper around acquire. Given a snapshot, Rc::from_snapshot (line 24) increments the count, the implementation of which differs across the different reference counting algorithms. For example, CDRC simply increments the count with FAA, while it is more involved in CIRC and OrcGC. Atomic<Rc>::load (line 27) is a combination of get_snapshot and from_snapshot, but it releases the guard at the end.
Atomic<Rc>::cas (line 32) implements the atomic compare-and-swap (CAS) operation with the CAS for the underlying raw pointer. The expected raw pointer value is taken from an Rc or a Snapshot. If the CAS is successful, the desired Rc is forget-ed to prevent running its destructor, so that the ownership of its count is transferred to the Atomic<Rc>; and the old pointer value is returned as an Rc reference, receiving the ownership of the count. If the CAS fails, the current raw pointer value and the input Rc are returned.
The store function (line 36) atomically swaps the new pointer value with the old value, and applies the implementation-specific decrement method to the old value. For example, CDRC schedules a deferred decrement, while OrcGC and CIRC immediately apply a decrement. Similarly, the destructor of Atomic<Rc> (line 39) loads the pointer value and requests its decrement.
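The "swap, then decrement the old value" shape of store can be sketched with raw atomics. The types and names here are ours, this elides snapshot protection entirely, and the immediate-decrement policy shown is the CIRC/OrcGC choice (CDRC would instead defer the decrement).

```rust
use std::sync::atomic::{AtomicPtr, AtomicUsize, Ordering};

// A reference-counted object header; the payload is elided in this sketch.
struct Obj {
    strong: AtomicUsize,
}

// Sketch of Atomic<Rc>::store: atomically publish the new pointer, take
// ownership of the old one, and immediately decrement the old object's
// strong count. Returns the old object's remaining count, if any.
fn store(field: &AtomicPtr<Obj>, new: *mut Obj) -> Option<usize> {
    // swap atomically exchanges the field, so exactly one thread ever owns
    // (and decrements) each detached old pointer.
    let old = field.swap(new, Ordering::AcqRel);
    if old.is_null() {
        return None;
    }
    let prev = unsafe { (*old).strong.fetch_sub(1, Ordering::SeqCst) };
    // prev == 1 means the count reached zero; a real implementation would
    // now schedule the deferred reclamation task (elided here).
    Some(prev - 1)
}
```

The unsafe dereference is sound here only because the caller held a counted reference through the swap; this is exactly the invariant the full algorithms maintain.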

IMMEDIATE DECREMENT
CIRC utilizes deferral like other modern methods, but it attempts to apply operations immediately to resolve the problems of the other methods (§1.1). Specifically, CIRC (1) follows the immediate decrement style that (in most cases) schedules a single task after the count reaches zero; and (2) identifies a chain of garbage objects that can be immediately reclaimed. This section focuses on the first aspect, immediate decrement. Its core design challenge is coordinating snapshots and the zero count: even if the count has become zero, it should be possible to create an Rc reference out of a Snapshot reference. Algorithm 2 presents the immediate decrement algorithm for CIRC without weak references. (Our implementation uses SeqCst memory ordering for accesses to reference counts, i.e., loads, FAAs, and CASes, though we believe many of them can be relaxed; our hazard pointer implementation uses asymmetric fences [Dice et al. 2001; Goldblatt 2022].)

Fig. 3. The "inc" and "dec" transitions are normal increments and decrements. The "destruct" transition corresponds to the "if" branch of try_reclaim (Algorithm 2) and try_destruct (Algorithm 3). The "re-defer" and "cancel" transitions correspond to the "else" branch. The "resurrect" transition corresponds to resurrection. "Fail inc" happens only in Algorithm 3, when attempting to increment an already destructed object.

Immediate decrement. The decrement_strong function (line 51) is used for functions such as Atomic<Rc>::drop and Atomic<Rc>::store. It first decrements the count with the FAA instruction (which returns the old value). If the count has reached zero, it schedules a deferred execution of the try_reclaim function (line 59), which is invoked after all the existing snapshot protections are withdrawn. To ensure that the deferred try_reclaim is eventually invoked in scenarios such as Fig. 1b, it occasionally triggers the execution of deferred tasks (line 54). However, it is not safe to directly reclaim the object when the deferred function is invoked, because a new Rc reference could have been created from a (now destroyed) Snapshot reference. So, try_reclaim checks if the count is still zero, ensuring that there is no new Rc. If so, there cannot be a new Snapshot, either. Therefore, it is safe to reclaim the object.
Resurrection. But if the count has increased, the reclamation process should be canceled and retried later. Naively re-scheduling a deferred execution of try_reclaim is not correct, because just before it is invoked again (after checking snapshot protections), a new Snapshot might be derived from an Rc. If all the Rc references are removed after the creation of the Snapshot and before the invocation of try_reclaim, the re-invoked try_reclaim sees a zero count, triggering the reclamation. So, using the new Snapshot reference may result in use-after-free.
To fix this, we let the increment_strong function (line 56) increment twice if the count was zero, and we let try_reclaim call decrement_strong if the count was non-zero. In increment_strong, what actually grants a count for a new Rc is the second increment (line 58). Intuitively, the first increment (from zero to one, line 57) tells the pending try_reclaim that the object has been resurrected and thus must be checked again for new snapshot protections. Since increment_strong is called with a snapshot protection, the pending try_reclaim is invoked only after increment_strong completes. Then, try_reclaim will remove this resurrection count, and re-schedule itself if the count has become zero. The try_reclaim function cannot immediately reclaim the object even if it has decremented the count to zero, because a new Snapshot could have been created from a (now destroyed) Rc.
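The resurrection rule can be modeled as a sequential state machine over the count. This sketch captures only the counter arithmetic of the rule just described; the real algorithm performs these steps with FAA/CAS under snapshot protection, and all names here are ours.

```rust
// Sequential model of the resurrection rule: an increment that finds the
// count at zero adds an extra "resurrection" count, and try_reclaim
// destructs only if the count is still exactly zero.
struct Count {
    n: u64,
    destructed: bool,
    pending: bool, // a try_reclaim has been deferred
}

impl Count {
    fn increment_strong(&mut self) {
        if self.n == 0 {
            // Resurrection count: tells the pending try_reclaim to re-check.
            self.n += 1;
        }
        self.n += 1; // the count actually granted to the new Rc
    }

    fn decrement_strong(&mut self) {
        self.n -= 1;
        if self.n == 0 {
            self.pending = true; // defer try_reclaim
        }
    }

    fn try_reclaim(&mut self) {
        self.pending = false;
        if self.n == 0 {
            // No new Rc appeared, hence no new Snapshot: safe to destruct.
            self.destructed = true;
        } else {
            // Resurrected: remove the resurrection count; if that brings the
            // count back to zero, defer another try_reclaim.
            self.n -= 1;
            if self.n == 0 {
                self.pending = true;
            }
        }
    }
}
```

The extra count is what prevents ordinary increments and decrements from ever moving the count back to zero while a try_reclaim is pending, so at most one try_reclaim is outstanding at a time.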
Fig. 3 summarizes the algorithm with a state transition diagram of the count. The count starts in state 1, and it moves along the states in the upper row by increments and decrements. When it becomes zero, it enters the Pending(0) state, waiting for the invocation of try_reclaim. It moves along the Pending(k) states with k ≥ 0 by increments from increment_strong and decrements from the usual decrement_strong not called by try_reclaim. These transitions cannot move the count to the Pending(0) state, because of the extra resurrection increment of increment_strong from Pending(0). Only try_reclaim can move the count from Pending(1) to Pending(0). This guarantees that there is no concurrent execution of try_reclaim. For Pending(k) with k ≥ 2, try_reclaim returns the count to one of the normal states. If try_reclaim is invoked in Pending(0), the count enters the final Destructed state.

Fig. 4. n1, n2, and n3 are detached from the list one by one, concurrently. n1 is decremented to zero at time t1, and n2 is decremented to one at time t2.

IMMEDIATE RECURSIVE DESTRUCTION
The design up to §3 still suffers from the slow reclamation of long links due to deferred destruction (§1.1). For example, suppose consecutive nodes n1, n2, and n3 are detached from a linked list one at a time, as depicted in Fig. 4. After n1 is destructed and n2's count is decremented to zero, try_reclaim(n2) is scheduled for deferred execution. To immediately destruct n2, we should immediately check if n2 has snapshot references. However, this cannot be done efficiently. In the HP version, one can scan the hazard slots, but doing so degrades performance. Alternatively, one can temporarily increment the referents of the snapshots and immediately decrement them during collection, but this is not possible in the RCU version, since it does not track the protection of each individual object.
In this section, we develop the second component of CIRC: immediate recursive destruction based on epoch-based RCU. We first introduce a generic idealized algorithm that tracks the time each object was last reachable (§4.1) and then refine it into an efficient algorithm leveraging epochs (§4.2).

Key Idea
In the above example, we observe that checking the safety of destructing n2 can be split into two parts based on the time of snapshot creation.
(Snap-Old) Check if there are old snapshots for n2 by reusing the result of the protection scan performed just before the invocation of try_reclaim(n1). SMR schemes usually cache the scan result to test multiple retired objects in batches. For example, HP first copies the set of protected pointers, and epoch-based RCU computes the minimum epoch of active critical sections. If this check started at time t, then the cache tells us whether the snapshots created before t are gone.

(Snap-New) Check if n2 has new snapshots that were created after t. (Such a snapshot could have been created from root when it was pointing to n2.)

If n2 passes these checks and is destructed, n3 is decremented to zero and undergoes the same procedure with the same protection cache. This procedure continues until reaching a node with a non-zero count after the decrement.
However, Snap-New is difficult to implement efficiently. For example, naively recording the time of each snapshot creation would incur significant overhead, defeating the purpose of snapshot references. Also, relying on user-provided information such as explicit retirement is not an option, since such an interface is inherently unsafe, defeating the purpose of automatic reference counting.
Fig. 5. Installing an Rc reference to a zero-count node. (a) root initially points to n1. Thread T1 gets a Snapshot to n1, creates a new node n2 (getting an Rc), and gets a Snapshot to n2. Thread T2 updates root to null, decrementing n1 to zero at time t1. (b) Thread T1 installs the Rc reference to n2 on n1 at time t2. When n1 is destructed, we want to know whether n2 can be immediately destructed.

Tracking the upper bound of snapshot creation time. To tackle this challenge, we first consider how to track the upper bound of the snapshot creation time, which can be used to implement Snap-New by checking whether this upper bound is smaller than t. We aim to maintain an object timestamp for each object such that, when the object's count reaches zero, its timestamp is the upper bound of the snapshot creation time. Specifically, the object timestamp should adhere to the following invariant:
(Invariant) The object timestamp is the upper bound of the time at which a snapshot might have been derived from the object's destroyed Rc references.
When destroying an Rc reference, i.e., decrementing a count, Snapshot references can no longer be derived from that Rc reference. Therefore, object timestamps are updated as follows:

(Dec-Base) When an object is non-recursively decremented (e.g., in Fig. 4, decrementing n1 after redirecting root from n1 to n2), its object timestamp is updated to the current time.
Applying Dec-Base to recursive decrements (e.g., decrementing n2 due to the destruction of n1) precludes immediate recursive destruction, because the recursive decrements are performed after the scan (at t). Therefore, recursive decrements need special treatment in order to precisely track the snapshot time.
The key observation is that if no one has had access to the Rc being destroyed since long ago, then no one could have created a snapshot from it recently. We proceed by case analysis on how a thread that created a snapshot to n2 was accessing the Rc from n1 to n2 that is being recursively destroyed.
• The snapshot is created after the Rc reference is stored in an Atomic<Rc> field of n1. In this case, creating the snapshot requires a reference to n1, either an Rc or a Snapshot. Therefore, the time at which a snapshot to n2 can be derived this way is bounded by the time by which all the references to n1 are destroyed. Since the destruction times of Rc references to n1 are tracked by its object timestamp, we are left with the obligation of tracking the destruction time of the snapshots to n1.
• The snapshot is created before the Rc is stored in n1, e.g., when it is directly owned by a thread. Fig. 5 depicts such a scenario: n1 is decremented to zero at t1, a thread that holds an Rc reference to n2 creates a snapshot to n2, and this thread installs the Rc reference into n1 at t2. The upper bound of the time when a snapshot to n2 can be derived this way is t2.
Summing up the above analysis, the following rules enforce the object timestamp invariant.
(Link) Each mutable pointer field (Atomic<Rc>) is associated with a link timestamp. Whenever a reference is written, the link timestamp is also updated to the current time.

(Dec-Rec) When an object n is decremented due to the destruction of a predecessor p, n's object timestamp is updated to max(t_p, t'_p, t_{p→n}, t_n), where t_p is p's object timestamp, t'_p is the time by which all snapshots to p are destroyed, t_{p→n} is the link timestamp of the link from p to n that is currently being destroyed, and t_n is the current object timestamp of n.
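The Link, Dec-Base, and Dec-Rec rules reduce to simple timestamp max-updates. The following is a minimal Rust sketch of that bookkeeping; `Obj`, `LinkField`, and plain `u64` timestamps are hypothetical stand-ins for CIRC's actual metadata and clock, not the paper's implementation:

```rust
// Hypothetical bookkeeping for the timestamp rules; `now` stands in
// for a read of a global monotonic clock.
struct Obj { obj_ts: u64 }
struct LinkField { link_ts: u64 }

// (Link) Writing a reference also records the current time.
fn write_link(field: &mut LinkField, now: u64) {
    field.link_ts = now;
}

// (Dec-Base) A non-recursive decrement sets the object timestamp to
// the current time.
fn dec_base(o: &mut Obj, now: u64) {
    o.obj_ts = now;
}

// (Dec-Rec) A recursive decrement through the link p -> n takes the
// max of p's object timestamp, the time by which all snapshots to p
// are destroyed, the link timestamp, and n's current timestamp.
fn dec_rec(n: &mut Obj, p_obj_ts: u64, p_snap_ts: u64, link_ts: u64) {
    n.obj_ts = n.obj_ts.max(p_obj_ts).max(p_snap_ts).max(link_ts);
}
```

The max-update in Dec-Rec only ever raises the timestamp, which matches the intent of the rules: an over-approximation merely delays destruction and never compromises safety.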
Problems. There are two main problems in implementing these rules.
(1) In Dec-Rec, how do we know when all the snapshots are destroyed (i.e., t'_p)? For efficiency, this should not rely on additional scanning or on tracking individual snapshots.
(2) How do we atomically get the current time, update the count (resp. link), and update the object (resp. link) timestamp? If these operations are done non-atomically, the updated object (resp. link) timestamps may be outdated, compromising correctness.

An Efficient Epoch-Based Algorithm
We design an efficient epoch-based algorithm that addresses the problems discussed above.
Epoch-based RCU. Our algorithm builds on a variant of the epoch-based RCU (EBR) algorithm by Parkinson et al. [2017]. This scheme maintains an invariant that the difference between the epochs of overlapping critical sections is at most 1.

Finally, we consider recursively decrementing an object n due to the destruction of a predecessor p. The problematic part of Dec-Rec in §4.1 was t'_p, the maximum destruction time of snapshots to p. Note that it suffices to track the maximum destruction time of Snapshots that could have been created before p's destruction. Thanks to the EBR version of the Invariant, if p reaches a zero count with object epoch e_p, it is guaranteed that the maximum snapshot epoch of p is e_p + 1. Therefore, we have the following rule:

(Dec-Rec) When an object n is decremented due to the destruction of a predecessor p, n's object epoch is updated to max(e_p, e_{p→n}, e_n), where e_p is p's object epoch, e_{p→n} is the epoch of the link from p to n that is currently being destroyed, and e_n is the current object epoch of n.

Truncating epochs. The algorithm so far assumes a dedicated field for each object epoch and link epoch. To reduce the overhead of an extra word, epochs can be truncated into a few bits and packed into the reference count for an object epoch, and into the most significant bits (MSBs) of the pointer value for a link epoch.
The key ideas are that it is safe to over-approximate the object and link epochs, because over-approximation only delays destruction; and that if the current epoch is c, then the upper bound of any epoch value in the whole system is c + 1, which bounds the over-approximation. Specifically, for a k-bit truncated epoch t, i.e., the k least significant bits (LSBs) of the real epoch, if the current untruncated epoch is c, we define the over-approximation function ceil_{k,c}(t). We then define the two operators required by our algorithm: the expired_{k,c}(t) predicate, which tests whether the real untruncated epoch can have no remaining snapshot; and the max_{k,c} function, which over-approximates the maximum of the over-approximations of truncated epochs. Formally, they should satisfy the following properties.
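One possible realization of these operators is the following Rust sketch. It is our illustration, not the paper's code, and it assumes the window property stated above: when the current epoch is c, every real epoch in the system lies in (c + 1 − 2^k, c + 1], and c + 1 ≥ 2^k. We pick k = 3 here (k must be at least 3 for `expired` to ever hold, since the window must be wide enough to contain epochs at distance 3 from c):

```rust
// Sketch of k-bit truncated-epoch operators under the window
// assumption: every live epoch lies in (c + 1 - 2^k, c + 1].
const K: u64 = 3;          // number of bits kept (our choice)
const M: u64 = 1 << K;     // modulus 2^k

// Truncate a real epoch to its k least significant bits.
fn truncate(e: u64) -> u64 {
    e & (M - 1)
}

// ceil_{k,c}(t): the unique value congruent to t (mod 2^k) inside
// the window (c + 1 - 2^k, c + 1]; an over-approximation of the
// real epoch the truncation came from.
fn ceil(c: u64, t: u64) -> u64 {
    (c + 1) - ((c + 1).wrapping_sub(t) & (M - 1))
}

// expired_{k,c}(t): no snapshot derived at that epoch can remain.
// Snapshots from epoch e are gone once the current epoch reaches
// e + 3: the max snapshot epoch is e + 1, and a snapshot at epoch x
// is destroyed before x + 2.
fn expired(c: u64, t: u64) -> bool {
    ceil(c, t) + 3 <= c
}

// max_{k,c}: over-approximate the maximum of two truncated epochs
// by comparing their over-approximations.
fn max_trunc(c: u64, t1: u64, t2: u64) -> u64 {
    if ceil(c, t1) >= ceil(c, t2) { t1 } else { t2 }
}
```

Over-approximation is visible here: a real epoch that has fallen out of the window maps to a larger epoch inside it, which can only postpone (never wrongly permit) destruction.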

SUPPORTING WEAK REFERENCES
In this section, we add weak references to CIRC to support data structures that contain reference cycles. Our algorithm largely follows the approach of CDRC [Anderson et al. 2022], but it is adapted to our immediate decrement algorithm and incorporates Parkinson et al. [2023]'s optimization. Algorithm 3 presents the immediate decrement algorithm extended with weak references.
Background. Normally, automatic reference counting cannot reclaim cyclic structures due to the cyclic dependency of reclamation. Breaking the dependency requires at least one edge in the cycle to be an uncounted reference. To handle this with a safe interface, reference counting schemes usually come with weak references (represented by the Weak<T> type) and associated weak counts. Weak references make reclaiming an object a two-step process: when an object has no incoming strong references, it can be destructed, which removes all of its outgoing references; and when an object has no incoming weak (or strong) references, its memory block can be deallocated. This allows an object pointed to only by weak references in a cycle to initiate destruction, which eventually deallocates it after the destruction of the cycle removes its weak references. At the same time, programmers can safely check whether the referent of a weak reference has not yet been destructed and, if so, obtain a dereferenceable strong reference (called upgrading).
The standard strategy for implementing the two-stage reclamation for concurrent reference counting is to give an implicit weak count to undestructed objects.That is, the weak count of an object is the number of weak references plus one if the strong count is non-zero.This allows detecting the absence of both strong and weak references atomically.
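The implicit-weak-count convention can be sketched as follows. This is a minimal illustration with hypothetical names; the real algorithm folds this logic into Algorithm 3's deferral machinery:

```rust
use std::sync::atomic::{AtomicU64, Ordering::SeqCst};

// Sketch of the implicit-weak-count convention:
// weak = (#Weak references) + 1 while the object is not yet
// destructed, so a single atomic counter detects the absence of
// both strong and weak references.
struct Counts {
    weak: AtomicU64,
}

impl Counts {
    // A fresh, undestructed object carries the implicit +1.
    fn new() -> Self {
        Counts { weak: AtomicU64::new(1) }
    }

    // Creating a Weak reference bumps the weak count.
    fn downgrade(&self) {
        self.weak.fetch_add(1, SeqCst);
    }

    // Dropping a Weak reference; returns true when the memory block
    // may be deallocated (this drop observed the count reach zero).
    fn drop_weak(&self) -> bool {
        self.weak.fetch_sub(1, SeqCst) == 1
    }

    // Destruction removes the implicit weak count; deallocation
    // happens here only if no Weak references remain.
    fn on_destruct(&self) -> bool {
        self.drop_weak()
    }
}
```

Because the implicit +1 and the explicit weak count share one atomic word, whichever of "last weak drop" and "destruction" happens second observes zero and can deallocate, with no separate "both counts are zero" check.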
Algorithm 3. CIRC with weak references. Notable differences from strong-only CIRC and the corresponding parts in CDRC are highlighted in green.

Managing weak counts. A Weak reference to an object is constructed from an Rc reference by the Rc::downgrade function (line 71). Similarly to Rc, a Weak can be stored to and loaded from an Atomic<Weak>. Updating the weak count is done in the same manner as the strong count in strong-only CIRC, but using weakAD, the instance of acquire-defer that protects the object from deallocation but not destruction (omitted in the algorithm). For example, the increment_weak function (omitted) resurrects the count if the count was zero, and the decrement_weak function (omitted) schedules a deferred execution of the try_dealloc function (omitted), which deallocates the memory block if the count was not resurrected.
The try_reclaim function from Algorithm 2 is renamed to try_destruct (line 78) and modified to remove the implicit weak count by invoking decrement_weak when the count has not been resurrected. However, the resurrection check must be modified to account for the interaction between weak references and strong references. We discuss this change below.
Increment-if-not-destructed. Weak::upgrade (line 74) creates an Rc reference by atomically incrementing the strong count if the referenced object has not been destructed yet. Therefore, increment_strong should be modified to fail (return false) when the object is already destructed. Traditional reference counting methods implement this operation (sometimes called increment-if-not-zero or a sticky counter) as a simple CAS loop that tries to add one to the count while the count is non-zero. However, this approach is not compatible with Algorithm 2, as it results in spurious failures. This is because the count's physical value becoming zero is not the linearization point of the destruction of the object. Even if the count is zero, there can be snapshots preventing the destruction, and the count can even be resurrected. In other words, it is not possible to distinguish the Pending(0) state from the Destructed state in Fig. 3 just by looking at the count value. Therefore, we need a new increment-if-not-destructed operation.
We adopt the idea of stealing a bit from the count to indicate the count's state, used in CDRC [Anderson et al. 2022] and in Parkinson et al. [2023]'s wait-free increment-if-not-zero algorithm. The DESTRUCTED bit indicates whether the count is in the Destructed state. try_destruct tries to set the DESTRUCTED bit with a CAS from 0 (line 80). If successful, the count transitions from the Pending(0) state to the Destructed state, allowing it to proceed to destruction. If the CAS fails, it must be due to resurrection, because try_destruct cannot be invoked concurrently. In that case, it calls decrement_strong as in the strong-only version.
The modified increment_strong function first increments the count with FAA and then checks the DESTRUCTED bit (line 86). If the bit is set, the operation fails. This corresponds to the self-transition of the Destructed state in Fig. 3, which changes the physical value but not the logical state. If DESTRUCTED was not set and the count value was zero (line 87), then the previous increment has resurrected the count, so the count is incremented again.
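A minimal sketch of this protocol follows. The bit position and names are our assumptions, and the deferral and SMR-protection machinery of Algorithm 3 is omitted; this only shows the count-word transitions:

```rust
use std::sync::atomic::{AtomicU64, Ordering::SeqCst};

// Sketch of the DESTRUCTED-bit protocol on the strong count word.
const DESTRUCTED: u64 = 1 << 63;

struct StrongCount(AtomicU64);

impl StrongCount {
    fn new() -> Self {
        StrongCount(AtomicU64::new(1))
    }

    // increment-if-not-destructed: FAA first, then check the bit.
    fn increment_strong(&self) -> bool {
        let prev = self.0.fetch_add(1, SeqCst);
        if prev & DESTRUCTED != 0 {
            // Self-transition of the Destructed state: the physical
            // value changes but the logical state does not.
            return false;
        }
        if prev == 0 {
            // The FAA resurrected a zero count, so the count is
            // incremented once more (the resurrection increment).
            self.0.fetch_add(1, SeqCst);
        }
        true
    }

    // try_destruct: CAS the count from 0 to DESTRUCTED. A failed
    // CAS means a concurrent resurrection happened.
    fn try_destruct(&self) -> bool {
        self.0.compare_exchange(0, DESTRUCTED, SeqCst, SeqCst).is_ok()
    }

    fn decrement_strong(&self) {
        self.0.fetch_sub(1, SeqCst);
    }
}
```

Note that a failed increment leaves the physical value above DESTRUCTED; this is harmless precisely because, once the bit is set, the word no longer encodes a logical count.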
Loading a Snapshot from Atomic<Weak>. With the interface introduced so far, obtaining a dereferenceable reference from a mutable weak pointer field (Atomic<Weak>) involves at least two increments: one in Atomic<Weak>::load and at least one in Weak::upgrade. Following CDRC's weak snapshots, CIRC supports the Atomic<Weak>::get_strong_snapshot function (line 89), which creates a Snapshot if and only if the object is not destructed, without updating the counts in most cases. The get_strong_snapshot function starts by acquiring a protection in weakAD to prevent deallocation of the referent. Then it uses the acquire_raw function to initiate the acquisition of a protection in strongAD (corresponding to writing to a hazard slot in HP, and a no-op in RCU). The protection is validated (line 93) by checking that the object's strong count is not in the Destructed state, using the is_destructed function (line 100).
For this validation to be sound, i.e., for the object not to get destructed while the snapshot is active, the is_destructed function must ensure that the scheduled try_destruct (if any) will fail when is_destructed returns false. To this end, if the strong count is in the Pending(0) state, it resurrects the count with a CAS from zero to one (line 103). If the CAS fails, it re-checks whether the count has transitioned to the Destructed state (line 106).
A validation failure in get_strong_snapshot does not necessarily mean that the object pointed to by the source Atomic<Weak> is destructed, because the source may now point to another object. Therefore, instead of unconditionally returning null (which would not be linearizable), it retries from the beginning if the source points to a different object (lines 90 and 98).
Optimization for objects without weak references. Algorithm 3 introduces non-negligible overhead from the deferred execution of try_dealloc when the weak count becomes zero. However, deferred deallocation is not necessary if the object has never had a weak reference, because then the object cannot have been acquired in weakAD. Following Parkinson et al. [2023], we optimize such cases by taking another bit from the count to indicate whether the object has ever had a weak reference. The application is straightforward: the increment_weak function attempts to set this bit if it is not already set, and try_destruct can immediately deallocate the memory if it is not set. We present the full algorithm in the appendix [Jung et al. 2024].
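The essence of this optimization can be sketched in a few lines; the bit position and names are our assumptions, and only the flag logic is shown:

```rust
use std::sync::atomic::{AtomicU64, Ordering::SeqCst};

// Sketch of the "ever had a weak reference" flag bit.
const WEAKED: u64 = 1 << 62;

struct Count(AtomicU64);

impl Count {
    fn new() -> Self {
        Count(AtomicU64::new(1))
    }

    // The first downgrade marks the object as having had a weak
    // reference.
    fn mark_weaked(&self) {
        self.0.fetch_or(WEAKED, SeqCst);
    }

    // At destruction: if the bit was never set, no weakAD
    // protection can exist, so the memory block may be freed
    // immediately instead of scheduling a deferred try_dealloc.
    fn can_dealloc_immediately(&self) -> bool {
        self.0.load(SeqCst) & WEAKED == 0
    }
}
```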
Interaction with immediate recursive destruction. The immediate recursive destruction algorithm (§4) is modified to work with the try_destruct function. Note that the deferred executions of try_dealloc do not cause the garbage chain problem, since they do not depend on one another.

EXPERIMENTAL EVALUATION
We implemented CIRC as a Rust library and evaluated it on a synthetic benchmark suite to demonstrate that it is resistant to the long garbage chain problem and introduces only a minor overhead over the underlying SMR schemes, while keeping the simple interface of reference counting.
The benchmark suite includes the following reclamation schemes: NR: the baseline that does not reclaim memory; EBR: epoch-based RCU; HP: hazard pointers with the asymmetric fence optimization [Dice et al. 2001; Goldblatt 2022]; CIRC-EBR: the full CIRC with EBR; CIRC0-HP: the HP flavor of CIRC without immediate recursive destruction; and CDRC-EBR and CDRC-HP: the EBR and HP flavors of CDRC. CIRC and CDRC share the same code for the underlying reclamation schemes. In the implementation of the reclamation schemes, configuration parameters are tuned to adequately balance throughput and memory usage.
The benchmark suite consists of the following lock-free map and queue data structures, which we believe represent most use cases of atomic<shared_ptr<T>> libraries. HMList: the Harris-Michael linked list [Michael 2002a], as an example of long sequences of read-only operations; HashMap: a chaining hash table using HMList [Michael 2002a], as an example of a large number of root locations with short link chains; NMTree: the Natarajan-Mittal tree [Natarajan and Mittal 2014]; SkipList: a skiplist [Shavit et al. 2011], as an example of a complex data structure with high node indegree; and DoubleLink: a doubly-linked queue [Ramalhete and Correia 2017b], as an example of a long reclamation dependency chain and of the use of weak pointers for back references. The CIRC-EBR-based NMTree, SkipList, and DoubleLink are about 500, 400, and 140 lines of code, respectively.
The benchmark suite was compiled with Rust nightly-2023-04-21 with default optimizations and link-time optimization. We used jemalloc [Evans 2006] as the memory allocator. We conducted experiments on two machines: AMD64T: a single-socket AMD EPYC 7543 (2.8 GHz, 32 cores, 64 threads) with 8×32 GiB DDR4 DRAMs (256 GiB), and INTEL96T: a dual-socket Intel Xeon Gold 6248R (3.0 GHz, 48 cores, 96 threads) with 12×32 GiB DDR4 DRAMs (384 GiB). The machines run Ubuntu 22.04 and Linux 5.15 with the default configuration. The results from the two machines exhibit a similar trend, so we mainly discuss the AMD64T results here. For the full results, see the appendix [Jung et al. 2024]. The benchmark suite is available at the project website [Jung et al. 2024].

Methodology. For map data structures, each thread repeatedly calls the get(), insert(), and remove() methods at random. We measured throughput (operations per second) and peak memory usage for (1) a varying number of threads: 1, 8, 16, 24, ..., 128 (twice the number of hardware threads); (2) three types of workloads: write-heavy (50% inserts and 50% removes), read-write (50% reads and 50% writes), and read-most (90% reads and 10% writes); and (3) a fixed duration of 10 seconds. The key ranges for HMList are 1K and 10K, and the key ranges for the others are 100K and 100M. The data structures are pre-filled to 50%. Figs. 6 and 7 show representative results from this benchmark.
For DoubleLink, each thread repeatedly enqueues an element and then dequeues an element. We measured the throughput (pairs of operations per second) and peak memory usage for (1) a varying number of threads: 1, 2, 4, 8, 16, 24, ..., 128 (twice the number of hardware threads); and (2) a fixed duration of 10 seconds. For this benchmark, we additionally evaluate a variant of CDRC, CDRC-EBR-Flush, which flushes its thread-local deferred tasks after dequeuing an element. Fig. 8 shows representative results from this benchmark.
Throughput. CIRC adds only a moderate performance overhead to the backend SMR scheme, and it outperforms CDRC in write-heavy workloads thanks to the reduced number of deferred tasks. The additional overhead of maintaining object and link epochs for recursive destruction was negligible.
The read-most HMList benchmark (Fig. 7) compares the overhead of traversing long data structures. For the EBR backend, CIRC and CDRC introduce negligible overhead over EBR, showing equal throughput. For the HP backend, CIRC and CDRC can be slower than HP. Since a Snapshot in the HP backend is two words (a pointer value and a Guard, i.e., a pointer to a hazard pointer), the cost of swapping Snapshots for hand-over-hand acquisition is higher. CIRC0-HP is slightly slower than CDRC-HP in this benchmark, but slightly faster on INTEL96T (see the appendix).
In the high-throughput, low-indegree data structure benchmarks (HashMap and NMTree, Figs. 6a and 6b), CIRC and CDRC introduce up to 20% and 30% overhead over the underlying SMR scheme, respectively. In SkipList, a high-throughput, high-indegree data structure (Fig. 6c), CIRC-EBR, CIRC0-HP, and CDRC-EBR introduce up to 35% performance overhead over the underlying SMR scheme. CDRC-HP shows a significantly larger overhead, up to 58%. Note that CDRC exhibits a much larger memory footprint (discussed below), suggesting that its performance would substantially degrade if this problem were fixed.
In DoubleLink (Fig. 8a), CIRC-EBR is up to 35% slower than EBR due to the extra deferred task required for weak references. CDRC-EBR shows better throughput than EBR, because its progress in reclamation is very slow (explained below).
In the map benchmarks with large key ranges (see the appendix), the throughput gap is smaller due to reduced contention: the gap among EBR, CDRC-EBR, and CIRC-EBR is within 10%, and the gap among HP, CDRC-HP, and CIRC-HP is within 30%.
The INTEL96T benchmarks show a similar trend (see the appendix), but the throughput gap between the RC schemes and the underlying SMR is generally larger than on AMD64T (at most 45% between EBR and CIRC-EBR). We believe that reference count updates incur more overhead on multi-socket machines.
Memory usage.CIRC exhibits a memory footprint similar to its underlying SMR scheme, because it schedules only a single deferred task for reclaiming an object in the common case, and it can recursively destruct a long chain of unreachable objects without scheduling a deferred task.
While CDRC shows a similar trend in HMList, HashMap, and NMTree (Figs. 6a, 6b and 7), the SkipList (Fig. 6c) and DoubleLink (Fig. 8b) benchmarks demonstrate that CDRC cannot promptly reclaim long linked structures. DoubleLink is a linked list where elements are dequeued from the head and enqueued at the tail, which naturally forms a long chain of detached objects (§1.1). In SkipList,

RELATED AND FUTURE WORK
In this section, we compare CIRC and related algorithms in detail, introduce other related work, and conclude with future work.
Reference counting GCs.Reference counting is a key component in some modern high-performance GCs such as LXR [Zhao et al. 2022].As discussed in §1, one of the key optimizations is to avoid eagerly counting the local references and defer the job to the GC.
Deutsch and Bobrow [1976]'s method defers reclamation of zero-count objects by putting them into a zero-count table (ZCT). The GC occasionally initiates a stop-the-world pause in which all the other threads are stopped, scans all local variables, temporarily marks the objects referenced by them, and reclaims all unmarked objects in the ZCT. Pausing all the threads at once is crucial for correctness. For example, if the GC pauses and scans each thread one by one, it may miss the local references newly created by a thread that was already scanned, and objects decremented to zero after the creation of such a local reference can be prematurely reclaimed.
On the other hand, Bacon et al. [2001]'s method does not require a stop-the-world pause, because it enforces the invariant that zero-count objects have no incoming references by deferring decrements to the next round of the collection cycle. To do so, time is divided into epochs, and reference count updates (including increments) are first logged in a thread-local buffer with the current epoch number. The GC interrupts each thread one at a time to fetch its logs, scan its local variables, and increment its epoch. After checking all the threads, the GC applies the increments from the current epoch, temporarily increments the counts of the scanned local variables, and executes the decrements from the previous epoch, reclaiming the objects that reach a zero count. Deferring increments is not crucial for correctness, but it allows the designated GC thread to update the counts non-atomically, avoiding expensive synchronization.
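The phased application of logged updates can be modeled with a toy log-and-collect sketch. This is our simplification, not Bacon et al.'s actual system (which uses per-thread buffers and a concurrent collector); it only shows why decrements lag increments by one epoch:

```rust
// Toy model of buffered reference counting: updates carry an epoch
// number; the collector applies increments from the current epoch
// but decrements only from the previous epoch, so an object whose
// count reaches zero has no unprocessed incoming increment.
#[derive(Clone, Copy)]
enum Op {
    Inc(usize), // increment this object's count
    Dec(usize), // decrement this object's count
}

fn collect(log: &[(u64, Op)], counts: &mut [i64], cur_epoch: u64) -> Vec<usize> {
    // Apply increments logged in the current epoch first.
    for &(e, op) in log {
        if let Op::Inc(o) = op {
            if e == cur_epoch {
                counts[o] += 1;
            }
        }
    }
    // Then apply decrements from the previous epoch; objects that
    // reach zero here are reclaimable.
    let mut reclaimable = Vec::new();
    for &(e, op) in log {
        if let Op::Dec(o) = op {
            if e + 1 == cur_epoch {
                counts[o] -= 1;
                if counts[o] == 0 {
                    reclaimable.push(o);
                }
            }
        }
    }
    reclaimable
}
```

A decrement logged in the current epoch stays buffered until the next collection, which is exactly the deferral that lets zero-count objects be reclaimed without a global pause.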
Another prominent optimization is coalescing, by Levanoni and Petrank [2001]. Essentially, this approach considers only the difference between the heap states at each epoch boundary. This eliminates redundant reference count updates for the intermediate referents of frequently modified heap objects. The idea is implemented by logging the first update to each field. The GC collects the logs from each thread one by one, without stop-the-world. To resolve inconsistencies in the data due to concurrency, the GC checks each thread multiple times. In addition, the GC snoops new references while it is running by letting each thread record the objects whose references are written to other heap objects. This information is also used for correctly handling the ZCT. CIRC's resurrection mechanism can be considered a variant of snooping, where only increment-from-zero is recorded. CIRC and the other concurrent reference counting methods for unmanaged languages that we are aware of do not utilize coalescing.
These optimizations are not easy to apply to unmanaged languages. Since unmanaged languages use raw pointers that are ambiguous with non-pointer values, automatic garbage collectors for them usually resort to conservative methods [Boehm and Weiser 1988; Shahriyar et al. 2014]. Furthermore, scanning the local variables still requires stopping the thread. Recent work such as FRC, OrcGC, and CDRC, and our algorithm CIRC, use an SMR method's local reference protection to replace the GC's role of scanning local references.
Deferred decrement in unmanaged languages. FRC [Tripp et al. 2018] builds on the buffered reference counting method by Bacon et al. [2001]: the processing of decrements is deferred to the collector, and the collector scans and temporarily increments the local references. After loading a local reference, it should be explicitly announced, similarly to hazard pointers (HP) [Michael 2004].
CDRC uses deferred decrements too, but the time at which the decrements are applied is governed by an underlying SMR. The initial version of CDRC [Anderson et al. 2021] was based on HP, but it was generalized to utilize any standard SMR as its backend [Anderson et al. 2022]. Conceptually, the underlying SMR protects the count of a given object. A deferred decrement is implemented with the retire function, modified to decrement the object when its count is no longer protected. Since the SMR defers decrements, the object can be immediately reclaimed when the count hits zero, and the local references do not need to be temporarily incremented during the collection routine. This allows using critical-section-based protection SMR methods such as read-copy-update (RCU) [McKenney and Slingwine 1998] or epoch-based reclamation (EBR) [Fraser 2004] as the backend, which are usually the fastest SMRs.
An advantage of CDRC over CIRC is that it trivially allows a conditional store of a Snapshot with lazy increment, i.e., passing a Snapshot to Atomic<Rc>::cas and incrementing the count only after a successful CAS. This is not allowed in CIRC because its Snapshot does not guarantee a non-zero count, so lazy increment may lead to a negative count, which is not compatible with the resurrection mechanism. For example, before Atomic<Rc>::cas increments the count, another thread may overwrite the pointer (e.g., with store) and decrement the count to -1. Therefore, CIRC only takes an Rc as the argument of Atomic<Rc>::cas, and as a result, it may have to perform a redundant increment-decrement pair when the CAS fails. This affects operations that make an object point to another object in a data structure, e.g., removing a node from a linked list. On the other hand, CDRC's Snapshot guarantees a non-zero count because it protects the count itself, allowing lazy increment. However, our evaluation (§6) shows that the advantage of immediate decrements usually outweighs the disadvantage of lacking this feature when implementing concurrent data structures. Such CAS failures are less common, and defer is called only 1 + (the number of resurrections) times in CIRC, whereas it is called 1 + (the number of all increments) times in CDRC.
Deferred reclamation in unmanaged languages. OrcGC [Correia et al. 2021] is closer to CIRC and to Deutsch and Bobrow [1976]'s ZCT-based approach in that decrements are immediately applied and zero-count objects enter a special state that handles their potential reclamation. A variant of HP called pass-the-pointer is used for managing this state: hazard pointers protect local references, and the retire function is modified to scan the hazard pointers and reclaim the unprotected zero-count objects.
As discussed above, algorithms following this style must handle concurrency carefully. For example, suppose a zero-count object is incremented and decremented back to zero while the reclamation procedure for the object is running. There are two problems: (1) the reclamation process must not be invoked again for that object, to avoid a double free; and (2) if a new local reference is created during that period, the reclamation must be canceled, since the scan may have missed the new reference. For (1), OrcGC uses a bit in the count to indicate that the collector is checking the object. For (2), each reference count is combined with a version number that is increased whenever the count is updated. This allows detecting when a reference count has remained zero for a period of time. If there was no Snapshot during this period, the object is safe to reclaim.
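The count-plus-version idea can be sketched as a single packed atomic word. The 32/32 layout and names below are our assumptions for illustration, not OrcGC's actual representation:

```rust
use std::sync::atomic::{AtomicU64, Ordering::SeqCst};

// Sketch of a count+version word: the version half is bumped on
// every update, so a count that bounced through zero is
// distinguishable from one that stayed zero.
struct VersionedCount(AtomicU64);

const ONE: u64 = 1;        // +1 to the count half (low 32 bits)
const VER: u64 = 1 << 32;  // +1 to the version half (high 32 bits)

impl VersionedCount {
    fn new(count: u32) -> Self {
        VersionedCount(AtomicU64::new(count as u64))
    }
    fn incr(&self) {
        self.0.fetch_add(ONE + VER, SeqCst);
    }
    // Assumes the count is currently positive: adding 2^32 - 1
    // decrements the count half, and the carry bumps the version.
    fn decr(&self) {
        self.0.fetch_add(VER - ONE, SeqCst);
    }
    fn snapshot(&self) -> u64 {
        self.0.load(SeqCst)
    }
    fn count(word: u64) -> u32 {
        word as u32
    }
    // The count remained zero for the whole interval iff the packed
    // word (count and version together) is unchanged.
    fn stayed_zero(before: u64, after: u64) -> bool {
        Self::count(before) == 0 && before == after
    }
}
```

Because both halves live in one atomically updatable word, two reads bracketing a scan suffice to decide whether the count was continuously zero; this also makes concrete the trade-off mentioned below, since bits spent on the version are taken from the count.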
This combination of methods also tolerates negative counts and thus allows conditional store of Snapshot with lazy increment.We believe CIRC's resurrection mechanism can be modified to tolerate negative counts by using bit flags instead of additional increments.
Unlike the reference count epochs we use in §4, an overflow in OrcGC's version numbers can lead to unsoundness: incorrectly believing that the count has remained zero. As the version must be packed into a single atomically updatable location together with the count, there is a trade-off between the risk of unsoundness and the number of possible incoming edges to an object.

Parkinson et al. [2023] identified a bug in Anderson et al.'s algorithm. Multiple threads may decrement the count to zero, attempt to set the closed bit, and, if successful, decrement the implicit weak count. If the thread that successfully sets the closed bit also deallocates the object, the other threads attempting to set the bit will access the freed memory. Parkinson et al.'s solution is to associate the implicit weak count with the strong count's physical value being zero instead of with it being closed. When a weak reference holder increments from zero, it should also increment the weak count, because its weak reference's weak count is logically converted to the implicit weak count. This protects the other threads from deallocation.
In CIRC, the implicit weak count is still associated with the object's destruction, but it does not suffer from this problem because its resurrection mechanism guarantees try_destruct is exclusive.In fact, CIRC's resurrection count plays a similar role as Parkinson et al.'s implicit weak count.
CIRC's Atomic<Weak>::get_strong_snapshot is based on CDRC's weak snapshot. CDRC's weak snapshot has the same goal, but it is weaker than a normal Snapshot in that creating an Rc reference out of it may fail. To implement weak snapshots, CDRC uses two acquire-defer instances related to destruction: strongAD defers decrements, and disposeAD defers destruction after the count reaches zero. In CDRC, the protection in strongAD cannot be validated by checking that the count is non-zero (line 93), because even if the count value is currently non-zero, there might be a deferred decrement scheduled earlier. Therefore, the count can be closed after a weak snapshot is obtained, after which no new Rc reference can be created. On the other hand, validating the disposeAD protection with a non-zero count does guarantee that the object is not destructed, because a non-zero count implies that the destruction has not been scheduled yet.
Conclusion and future work. We have designed Concurrent Immediate Reference Counting (CIRC), a safe concurrent reference counting method for unmanaged languages that promptly reclaims long linked structures without additional per-pointer announcements. Our evaluation shows that CIRC performs competitively with the fastest manual methods for highly concurrent data structures.
As future work, we would like to explore recursive destruction algorithms that support other SMR techniques. CIRC adds EBR to the list of SMRs supporting recursive destruction (previously only HP, as discussed in §1.1), but EBR-based reference counting has the limitation that it does not bound memory usage. To achieve high performance and bounded memory usage at the same time, a recursive destruction algorithm for hybrid SMRs such as HE and IBR is needed.
We also plan to formally verify CIRC; apply our techniques to GCs for managed languages; and adopt more optimizations from the GC literature such as coalescing to unmanaged languages.

Fig. 3. The state machine of the strong reference count in CIRC. The numbers denote the physical value of the counter. The "inc" and "dec" transitions are normal increments and decrements. The "destruct" transition corresponds to the "if" branch of try_reclaim (Algorithm 2) and try_destruct (Algorithm 3). The "re-defer" and "cancel" transitions correspond to the "else" branch. The "resurrect" transition corresponds to resurrection. "fail inc" happens only in Algorithm 3, when attempting to increment an already destructed object.

(Invariant) If the object epoch is e, then the upper bound of the epoch of snapshots derived from the object's destroyed Rc references is e + 1.

This bound allows destructing a zero-count object with epoch e once the current epoch is at least e + 3. The rules for updating object and link epochs should not simply overwrite the epoch with the current epoch, because an epoch e can co-exist with e − 1 and e + 1. For example, if an object has epoch e, the thread that decrements it could be at epoch e − 1. Updating the object epoch to e − 1 would violate the Invariant, because there can be a snapshot to the object at epoch e + 1. Therefore, the rules must ensure that they never decrease the object and link epochs:

(Dec-Base) When an object with epoch e_o is non-recursively decremented by a thread at epoch e, the object's epoch is updated to max(e_o, e).

(Link) When a reference is written to a mutable pointer field with link epoch e_l by a thread at epoch e, the link epoch is updated to max(e_l, e).
Replacing timestamps with epochs. We use epochs in place of the timestamps from the idealized algorithm outlined in §4.1. Specifically, each object is associated with an object epoch, and each mutable pointer field is associated with a link epoch. The flavor of EBR we use enables this algorithm because a snapshot at epoch e is guaranteed to be destroyed before epoch e + 2, and the staleness of the epoch is bounded by one even if the read of the current epoch and the update of the object and link epochs are done non-atomically. We start with the object epoch invariant, which leverages the EBR invariant: if an Rc is destroyed at epoch e, then a thread at epoch e + 2 cannot access it. This leads to the following invariant.