VERLIB: Concurrent Versioned Pointers

Recent work has shown how to augment any CAS-based concurrent data structure to support taking a snapshot of the current memory state. Taking the snapshot, as well as performing loads and CAS (compare-and-swap) operations, takes constant time. Importantly, such snapshotting can be used to easily implement linearizable queries, such as range queries, over any part of a data structure. In this paper, we make two significant improvements over this approach. The first improvement removes a subtle and hard-to-reason-about restriction that was needed to avoid a level of indirection on pointers. We introduce an approach, which we refer to as indirection-on-need, that removes the restriction yet almost always avoids indirection. The second improvement is to efficiently support snapshotting with lock-free locks. This requires supporting an idempotent CAS. We show a particularly simple solution to the problem that leverages the data structures used for snapshotting. Based on these ideas we implemented an easy-to-use C++ library, verlib, centered around a versioned pointer type. The library works with lock-based (standard or lock-free) and CAS-based algorithms, or any combination. Converting existing concurrent data structures to use the library takes minimal effort. We present results for experiments that use verlib to convert state-of-the-art data structures for ordered maps (a B-tree), radix-ordered maps (an ART-tree), and unordered maps (an optimized hash table) to be snapshottable. The snapshottable versions perform almost as well as the original versions and far outperform any previous implementations that support atomic range queries.


Introduction
The ability to query a concurrent data structure atomically across multiple locations has many applications, such as searching for all keys within a range. Supporting such multi-point queries has therefore garnered significant interest over the past decade [1,2,4,6,14,16,18,27,28,37,45,47,51,62,64]. Multi-point queries can be supported with atomic snapshots of the memory state. Recent work [62] (henceforth the WBB+ approach) has shown how to efficiently support such snapshots for concurrent data structures that use loads and compare-and-swaps (CASs) on shared memory. The approach uses version lists [53] and maintains all the (asymptotic) time bounds of the data structures it is applied to. It is reported to be efficient in practice [62], outperforming prior methods.
In this paper, we make two significant improvements over the WBB+ approach: the first avoids a subtle restriction on how pointers are used that was needed to avoid a level of indirection, and the second supports multiversioning with lock-free locks [8]. The first is difficult because of sharing of meta-data on objects being pointed to, and the second because the previous technique for lock-free locks did not support a CAS operation, while CAS is at the heart of the WBB+ approach. Furthermore, we abstract the ideas into a simple library interface for supporting snapshotting, almost always without indirection, and for easily combining the approach with lock-free locks.
Avoiding indirection with multiversioning is critical for performance since it can avoid an extra cache miss on every access. The difficulty of avoiding indirection in a general and lock-free manner is inherent to almost all concurrent approaches that are based on version lists. In particular, a version list maintains a historic list of values stored at a location, and each link in the list (a version) contains the value and two pieces of meta-data: a timestamp of when that value was written, and a pointer to the previous version (see Figure 1a). Accessing even the most recent value therefore requires first reading the head of the list and then, indirectly, the value itself. Storing the most recent value directly is difficult in a lock-free setting because of the need to update the version (value, timestamp and previous pointer) atomically.
Figure 1b: Avoiding indirection in a binary tree by putting version links on the node being pointed to. Structures in orange are tree nodes, with key k; those in green are version links, where prev is the previous-version pointer and ts is the timestamp. The problem with removing indirection is sharing. For example, the nodes with k = 4 and k = 8 both point to the node with k = 5, and seem to need different timestamps and prev pointers. In this case it is OK, and the timestamp 7 with prev pointing to k = 6 is correct since (ts = 0) < (ts = 7) < (ts = 8).

Leveraging earlier work on a specific concurrent binary tree data structure [28], WBB+ suggest an alternative to get around this problem when values are always pointers, which is to store the meta-data on the object being pointed to. However, this means that, in general, an object cannot over time have two different pointers point to it, since then the meta-data could be different for each (see Figure 1b). They show, however, that in some cases this sharing is not problematic and define a property, called recorded-once, which limits the use of pointers to avoid improper sharing of meta-data. WBB+ point out that any data structure can be converted into a recorded-once form. Unfortunately, the recorded-once requirement is subtle, and many algorithms require non-trivial changes to make them recorded-once, often requiring extra copying. All the experiments they reported were for recorded-once variants of data structures (the changes were small but subtle, requiring an expert to make them). Our first contribution greatly simplifies avoiding indirection by supplying an interface for snapshotted (multiversioned) pointers that does not require the recorded-once condition. The approach avoids indirection in most cases, and when indirection is added, gets rid of it quickly via a shortcut. This is all done under the hood and is invisible to the user. The key ideas here are (1) a light-weight check for when it is safe to avoid indirection, which is possible in the majority of cases, and (2) detection of when indirection is no longer needed and safe removal of the indirection at that time. We refer to this approach as indirection-on-need.
The second improvement we make is with respect to using locks, and in particular lock-free locks, i.e., implementations of locks that guarantee progress even if processes stall or fail. The idea of a lock-free lock is to have processes help each other run critical sections when they need a lock that is taken. The idea dates back 30+ years [5,60] but has only been made practical recently [8]. The recent work has shown that lock-free locks can far outperform standard locks when a machine is oversubscribed, i.e., when there are more software threads than hardware threads.
The difficulty with lock-free locks is that when helping, multiple processes can be running the same code, but the code needs to appear as if it ran exactly once: a property referred to as idempotence [15,21]. Recent research has shown how to implement idempotence cheaply [8], but the approach does not support an idempotent CAS. CAS is difficult because it is hard to tell whether a CAS succeeded: it could have failed either due to another process running the same instance of a critical section, in which case, if any one succeeds, they should all appear to succeed, or due to a different instance. It is known, theoretically, how to implement an idempotent CAS [3,10], but with significant practical cost, and requiring a double-word-width CAS.
In this paper we show, perhaps surprisingly, that in conjunction with multiversioning an idempotent CAS can be implemented simply and at almost no additional cost. In particular, the method for timestamping in the WBB+ approach can be overloaded to keep track of who succeeded on the CAS, allowing all helpers to see the same result. The approach only requires a single-word CAS.
Based on these ideas we have developed and implemented an easy-to-use C++ library called verlib. The library revolves around a versioned pointer type, which can be used for both lock-based and CAS-based concurrent data structures, as well as any combination. As with atomic locations in many programming languages, the versioned pointer supports atomic loads, stores, and CASes. The user can convert their existing concurrent data structure to use verlib with only a couple of changes: (1) replacing atomic locations holding pointers that need to be part of the snapshotted state with versioned pointers, and (2) inheriting a "versioned" class in any objects pointed to by such pointers. Then the user can wrap a collection of loads in a with_snapshot and all the loads will see an atomic view, i.e., the state of the versioned pointers at some fixed point in time.
Once a concurrent data structure is modified to use verlib, compiler flags can be set to run it either with or without multiversioning, and either with lock-free versions of the locks or standard versions. If used without multiversioning, then loads within a with_snapshot are not atomic. The library also supports different timestamping techniques, including both hardware and software approaches.
We have converted several state-of-the-art concurrent data structures to use this approach, including a doubly linked list, a hash table, an adaptive radix tree (ART) [40], and a B-tree. All but the hash table are taken from the flock library [8] (a library for lock-free locks), and the hash table uses array bucket copying [20]. We believe our baseline implementations are the fastest, or competitive with the fastest, current implementations for sorted lists, sorted sets, radix-sorted sets and unsorted sets. In the paper, we present several experimental results comparing the different data structures, with the different settings of the flags mentioned above, and under a variety of workloads. The workloads include various mixes of updates (inserts and deletes), finds, range queries, and multi-finds. We also vary the data structure sizes and the skewness of the key distribution using a Zipfian distribution. We then compare performance to some existing data structures that directly support range queries.
The experiments demonstrate several points. They show that the cost of versioning is typically small. They show that indirection-on-need is much more efficient than using indirection, while not requiring changes to algorithms to make them recorded-once. They show that combining multiversioning with lock-free locks is efficient, performing much better than standard locks when oversubscribed. And they show that software approaches to timestamping are almost as good as hardware timestamps.
The contributions of the paper include:
• A new indirection-on-need approach for version lists that mostly avoids indirection, while not requiring that objects are recorded-once.
• Efficient and full support of versioned pointers inside of both blocking and lock-free locks. This includes a new mechanism to support an idempotent CAS.
• An easy-to-use portable library, verlib, for adding versioning to existing or new concurrent data structures.
• The first B-tree we know of that is both lock-free and versioned. It is also significantly faster than previous data structures that support linearizable range queries.
• The first versioned radix tree, whether lock-free or not.
• A collection of experiments demonstrating the various tradeoffs of our approaches, including the first comparison we know of among a variety of timestamping approaches.

Related Work
Multiversioning using version lists dates back to the 70s [53] and is often used for efficiently supporting read-only transactions in databases [11, 17, 22, 25, 30, 38, 41, 46, 48-50, 52-54, 66]. None of this database work considers multiversioning concurrent data structures, and the only one that is lock-free [30] sequentializes commits. More recent work has considered efficient multi-point read-only operations in the context of concurrent data structures. These techniques most often support atomic single-point updates and atomic multi-point queries, and are mostly data structure specific. Range queries on ordered sets (maps) have been studied extensively. Brown and Avni [16] gave an obstruction-free range query for k-ary search trees. Avni, Shavit and Suissa [4] described how to support range queries on skip lists. Basin et al. [6] described a concurrent implementation of a key-value map that supports range queries. Fatourou, Papavasileiou and Ruppert [28] gave a persistent implementation of a binary search tree with wait-free range queries. The last two both use version lists. Winblad, Sagonas and Jonsson [64] also gave a concurrent binary search tree that supports range queries.
Researchers have also taken steps towards the design of general techniques for supporting multi-point queries that can be applied to classes of data structures. Petrank and Timnat [51] described how to add a non-blocking scan operation to non-blocking data structures, such as linked lists and skip lists, that implement a set abstract data type; scan returns the state of the entire data structure. Updates and scan operations must coordinate carefully using auxiliary snap-collector objects. Agarwal et al. [1] discussed what properties a data structure must have in order for this technique to be applied. Chatterjee [18] adapted Petrank and Timnat's algorithm to support range queries. Arbel-Raviv and Brown [2] described how to implement range queries for concurrent set data structures that use epoch-based memory reclamation.
As described in Section 4, WBB+ describe a general approach to support snapshots for any concurrent algorithm that uses CASes and loads to access shared memory. It introduces the idea of set-stamp helping. Nelson-Slivon, Hassan and Palmieri [45] describe a technique for supporting range queries on a variety of ordered data structures (e.g. linked lists, skip lists and binary search trees). Kobus, Kokociński and Wojciechowski [37] describe a linked-list data structure that supports arbitrary snapshots as well as atomic batch updates. Sheffi, Ramalhete and Petrank [57] present lock-free data structures supporting linearizable range queries that also bound memory usage by aborting long-lived queries that force the system to hold onto too many old versions. All of these use version lists, and the last one also uses set-stamp helping.
Several works have studied removing a level of indirection in transactional memory systems, where the main purpose is to avoid extra cache misses. Harris et al. [32] reduce the two levels of indirection required by DSTM [35] to one in the common case when there is no ongoing transaction involving a location. Marathe et al. [43] improve this to one level in all cases. Both approaches are obstruction-free. The Cicada system [41] completely removes indirection in certain cases, but requires locks. Furthermore, it can only avoid indirection for the first value written to a location (we do not have this restriction). None of these systems stores the meta-data on the target of a pointer, and hence none has to deal with the issue of sharing meta-data when there are multiple pointers to an object, which we need to handle. Shortcutting of indirection is also supported by Cicada, again only under a lock. We note that the Cicada approach does have the advantage that it works for arbitrary values, while ours is just for pointers.
The idea of lock-free locks was introduced by Turek, Shasha and Prakash [60] and independently by Barnes [5]. Both approaches use helping and allow arbitrary nesting of locks, and, as long as there are no lock cycles, ensure that the code runs in a lock-free manner, i.e., that the system will make progress under any schedule. The approaches were widely considered to be impractical due to their approach to idempotence, which requires effectively a context switch on every read and write. Therefore, most lock-free data structures have instead used custom approaches for helping [9,23,26,29,31,34,56,61,65]. Ben-David, Blelloch and Wei [8] developed a much more efficient approach to idempotence, outlined in Section 4. We know of no work prior to this paper that combines lock-free locks and multiversioning.
Researchers have studied reducing the memory required by multiversioning [9,13,42,57,63]. In this paper, we use a simple epoch-based collector, but we expect these approaches can also be applied.

verlib
Here we present the rather minimal verlib interface. Although presented and implemented in C++, it should not be hard to embed the ideas in libraries for other programming languages. Our implementation is available in the following repository: https://github.com/cmuparlay/verlib.
The interface is listed in Figure 2. It consists of two classes:
• A versioned_ptr<T> class, which is used to store versioned pointers to objects of type T.
• A versioned class, which must be inherited by every type T that is used in a versioned_ptr<T>. It has no user-accessible fields.
The library also supports the function with_snapshot(f), which takes a thunk (i.e. a function without arguments) f and runs it such that all calls to load() on a versioned pointer return values at a fixed point in the linearized order of updates, which falls between the invocation and response of the with_snapshot. In other words, the loads in f on versioned pointers appear as if they ran on a snapshot of memory. The with_snapshot(f) function returns the value returned by f.

If the structure is to be used with lock-free locks (not required), then it must use flock locks, meaning that all std::atomic<T> types (i.e. mutable shared locations holding values of type T) must be replaced with flck::atomic, and any code that allocates or frees within a lock must use the flock idempotent memory management routines. Note that if not using lock-free locks, any safe memory reclamation scheme can be used. Currently verlib implements a store with a load-and-cas. This means that concurrent stores and CASes to the same location will not necessarily linearize.

Cost Bounds. In the following discussion, the term "steps" refers to the number of machine instructions executed. The store and cas operations each take a constant number of steps. The load operation outside a with_snapshot takes a constant number of steps; inside, the number of steps is at most proportional to the number of store and cas operations on the same versioned pointer that are concurrent with the containing with_snapshot. The overhead of the with_snapshot is a constant additive number of steps, and if using optTS (one of our timestamping schemes) then the thunk f in a with_snapshot(f) might be run twice.

verlib Example: Doubly Linked List
As an example of how to use the interface, we present the code for a doubly-linked sorted list [8] that supports snapshots. In addition to insertions, deletions and finds, the snapshots allow for atomic range queries and any other queries involving a snapshot of the state of the list. We present the code for insertions and range queries in Algorithm 3. Usages of the verlib library are marked in red. The code supports lock-free locks using flock, but could also use standard locks, in which case all the code in blue can be replaced with generic versions (e.g., std::mutex::try_lock, std::atomic, and any safe memory reclamation scheme).
Each node of the list holds a key and value, previous and next pointers, and a flag indicating whether the node has been removed. The versioned_ptr on line 3 indicates that the next pointer should be versioned (since it is used in an atomic snapshot). The versioned class needs to be inherited by any class X that is used as versioned_ptr<X> (line 1).
The range function implements an atomic range query from key k1 (inclusive) to key k2 (exclusive). It finds the first key greater or equal to k1 using find_node, and then continues traversing the list while pushing keys onto result until finding a key greater or equal to k2. We assume the list has a sentinel infinite key at the end. The with_snapshot takes as its only argument a thunk f, i.e., a lambda with no arguments (in C++, "[=] { body }" creates a lambda with no arguments where the free variables of the body are captured by value), and runs it such that all its loads see an atomic view of the memory state (i.e. of all versioned pointers). The range query will therefore be atomic (i.e., linearizable with updates). Note that only the next pointer needs to be versioned since only it is followed in the query.
The insert function searches for the first node next with a key greater or equal to k and tries to acquire a lock on its previous node (prev). If the lock is successfully acquired, prev has not been removed, and prev->next still points to next, then the algorithm allocates a new node and splices it in. Otherwise it makes another attempt by repeating the while loop. The lck->try_lock(f) from the flock library attempts to take the lock and, if successful, runs the thunk f. It returns true if the lock was acquired and the thunk returned true.

Background on WBB+
Here we review the WBB+ approach for snapshotting [62] since we build on it. The approach is designed to support taking atomic snapshots of the state of the memory of concurrent algorithms. It can be applied to any concurrent algorithm that accesses shared memory through locations supporting loads and compare-and-swaps (CASs). Stores can be implemented with a load and then a CAS. To support snapshots, the user replaces all locations they want included in a snapshot with "versioned" locations. Versioned locations support a read_snapshot operation that, given a snapshot handle, returns the value of that location for that snapshot. The interface then supplies a take_snapshot operation that returns a handle to a snapshot of the state of all versioned locations.
The approach is implemented using version lists. It keeps a global timestamp, and each versioned location keeps a reverse time-ordered list of all the successful CAS operations applied to it. A versioned location points to the head of its list. Each link in the list contains a value, a timestamp and a previous pointer (Figure 1a). In the following discussion we use primcas to indicate a machine-level CAS and cas to indicate a user-level CAS on a versioned location. The complete C++ code for the approach, modified to support the verlib interface, is given in Algorithm 4.
The cas operation appends a new version onto the front of the version list by allocating a new version link (Line 42), pointing its previous pointer to the current version, and using a primcas to try to install it in the head pointer (Line 43). If there are concurrent cas operations, only one will succeed. The tricky part is installing the timestamp on the link. Installing the stamp before or after the primcas can lead to incorrect results. To fix this, WBB+ introduce a technique, which we will refer to as set-stamp helping, that has all operations help set the timestamp at the head of a version list. This technique is a simplification over previous solutions that use double-compare-single-swap [2]. Set-stamp helping is implemented by initially setting the new version's timestamp to a special TBD (to be determined) value (Line 42). The cas then links in this new version with a primcas. If successful, it sets the new version's timestamp by reading the current global timestamp and using a primcas to update the new version's timestamp from TBD (Line 47). To ensure the timestamp for the previous version is set before appending the new version, the cas operation also helps set the previous version's timestamp (Line 39). Any read or read_snapshot also helps set the timestamp if it encounters a TBD. A read_snapshot is implemented by following the version list of the object to the first link with a timestamp at or earlier than the one requested (Line 27). The take_snapshot operation simply increments the global timestamp, returning the old value (Line 10).
The time for read, cas and take_snapshot is constant. Therefore the asymptotic time of a concurrent algorithm is not affected by using a versioned location. The time for a read_snapshot is at most proportional to the number of cas operations on the location between when the take_snapshot was applied and the read_snapshot is called on that timestamp (i.e. the depth of the desired version in the version list).
The with_snapshot is simply a wrapper function that calls take_snapshot and stores the resulting snapshot handle in a thread-local variable called local_stamp. Note that in this code, the structure versioned is empty. Fields will be added to support indirection-on-need.
The WBB+ approach as just described, and as in the code, requires a level of indirection through a version link to read each location. The WBB+ paper describes an optimization that avoids the indirection. The idea is to store the timestamp and the pointer to the previous version directly in each data structure object, i.e. the objects a location points to, such as a node of a tree or linked list. The versioned location then points directly to the object instead of indirectly through a version link (Figure 1b). The problem is that two pointers to the same location will share the timestamp and previous-version data. WBB+ observed that such sharing is safe if the data structure is recorded-once, meaning that each pointer can be used as the new field of a cas at most once [62]. Any data structure can be converted so that all objects are recorded-once, but this conversion is often non-trivial and can cause extra cost. Our goal is to remove the restriction.

Indirection-on-need
In this section, we introduce a new mechanism, which we call indirection-on-need, for avoiding the level of indirection added by the baseline WBB+ approach. As in the WBB+ approach, we augment each data structure object with additional fields to store version list meta-data, which consists of a timestamp field and a pointer to the previous version (see Figure 1b). This can be done through our library by inheriting from the vp:versioned class (e.g. Line 1 of Algorithm 3). Our mechanism differs from the one in WBB+ in two main ways. First, when a versioned pointer is written to, our library automatically determines if indirection can be avoided, and does so whenever possible. Second, we show that any indirection that is added is only needed temporarily, and our library automatically removes it when it is no longer needed.
For the first contribution, we identify two cases where indirection can be avoided. Suppose o is a newly allocated object and we wish to change a versioned pointer p to point to it for the first time. In this case, our library sees that the two meta-data fields of o are unused and initializes them to store version list meta-data for p (Lines 45 and 48 of Algorithm 5). Then it changes p to point directly to o (Line 49). For this to work, we make one reasonable restriction, which is that when an object o is allocated by a process, a versioned pointer to it must be written using a store or cas before any other process can see it, i.e., no side channels can be used to communicate the pointer to o. This is to avoid races among processes each trying to be the first to write a pointer to a newly allocated object.
In the second case, even if o is not a newly allocated object and its meta-data fields were already set by another versioned pointer, we can still initialize a new versioned pointer p to point directly to o. This is because o is the oldest version in p's version list, so o just needs to contain enough information to indicate this. In particular, we just need to make sure any read_snapshot operation (Line 25) on p never tries to follow o's prev_version pointer. This is the case because o's timestamp was set before p was initialized, so any call to read_snapshot on p will have a timestamp greater than or equal to o's timestamp. Therefore, the read_snapshot will never traverse past o, and it will treat o as if it were the end of the version list. Therefore, it is safe to not add any indirection when initializing versioned pointers, effectively allowing multiple versioned pointers to share the same meta-data.
If neither of the previous two cases holds when writing an object o to a versioned pointer, then we create a ver_link object as before and have the versioned pointer point indirectly to o via this ver_link. Our library steals a bit from the pointer to distinguish between direct and indirect versioned pointers. See the cas operation in Algorithm 5 for the full implementation of this approach.
This approach is very effective in practice because in many commonly used concurrent data structures [26,33,44], indirection is only added when deleting a node, since inserts always write newly allocated nodes. Furthermore, each update operation usually only performs a small number of pointer swings, and most of the writes are initialization writes, which do not add any indirection with this approach. However, the indirect version links added by deletes eventually build up, and we need an efficient strategy for shortcutting them out.

Shortcutting. To identify indirect ver_links that can be safely shortcutted out, the idea is to make use of the memory reclamation scheme. Memory reclamation is essentially the problem of determining when an object or a version link is safe to deallocate. If a versioned pointer is stored indirectly, and all of the versions in its version list are safe to deallocate except the current one, then it is safe to shortcut out the version list by storing the versioned pointer directly. This is done by the shortcut procedure in Algorithm 5.
In the discussion that follows, we will assume a shared done_stamp is maintained that is guaranteed at all times to be no greater than the minimum of the local_stamps of any ongoing with_snapshots, as well as the global stamp. This ensures that no current or future read_snapshot will ask for a version older than done_stamp. In the full version of the paper, we describe how to maintain the done_stamp with epoch-based memory reclamation (EBR).
Since all ongoing with_snapshots have timestamps no less than done_stamp (by assumption), we can determine whether a version list is no longer needed by checking if the timestamp of the current version is no more than done_stamp. This check is performed by the shortcut function (Line 22). If the check passes, then no ongoing or future with_snapshot will access any of the old versions from this list, so it safely shortcuts out the version link v (Line 23). This causes the versioned pointer p to point directly to some object o. Effectively, this sets o as the tail of p's version list. One complication is that o might have a different timestamp than v. We can show that o's timestamp must be strictly less than v's timestamp, because an indirect ver_link is only created if o's timestamp is already set. Since all active and future with_snapshots have timestamps greater than or equal to v's timestamp, none of them can distinguish between o's timestamp and v's timestamp, so it is safe to use o's timestamp instead. Shortcutting is another situation that results in multiple versioned pointers safely sharing the same version list meta-data.
The shortcut function is called each time an indirect versioned pointer is loaded, and also at the end of each store and cas. If there are no concurrent with_snapshots, then a store/cas will immediately shortcut out any indirect nodes that it creates, in which case indirect nodes are only reachable for a brief moment of time. Shortcutting adds an additional write to each store/cas, but we see in our experiments that the benefits almost always outweigh the cost.
This shortcutting technique requires us to make some additional changes to the versioned pointer's cas operation. The primcas on Line 49 can fail not only because another cas succeeded, but also because an indirect ver_link got shortcutted out. In the latter case, the value of the versioned pointer did not change, so we need to retry the primcas on Line 52. Another subtlety is that the cas needs to know if it overwrote an indirect pointer and is thus responsible for retiring it. This check and the subsequent retire are done on Line 59.

Snapshots with Lock-Free Locks
The WBB+ approach to snapshotting works for lock-based code, as does the indirection-on-need approach just described. At the end of this section we describe a slightly more efficient version of store that is specialized for the case where there are no write-write races, which is typically true when using locks. The WBB+ approach, however, does not work with the lock-free lock technique of Ben-David, Blelloch, and Wei [8] (flock). This is because flock does not allow CAS to be used inside a lock-free lock's critical section, yet CAS is crucial for the WBB+ approach. In this section, we cover how to combine snapshots and lock-free locks, including a technique for implementing an idempotent CAS that is safe to use with lock-free locks, optimizations for maintaining timestamps, and a specialized implementation of store.
In the flock approach, a lock request takes two arguments: the requested lock and a thunk containing the critical code to run when the lock is acquired. A thunk is a closure containing both the code pointer for the critical section and the values of captured free variables. When a process acquires a lock, it leaves a pointer to the thunk on the lock so others can help run it. When attempting to acquire a lock that is already locked, a process will help run the thunk stored on the lock and help reset the lock back to the unlocked state.
The difficult part is helping, since multiple threads might be running the same thunk concurrently, which semantically should run only once. flock ensures the code is idempotent, guaranteeing it appears as if it ran exactly once. The library-based approach replaces operations on shared memory (loads, stores, cams, allocation, and deallocation) with idempotent versions; a cam is a limited form of cas that does not return whether it succeeded. The approach uses a log for each thunk, and ensures that (1) for all loads, the original thunk and all helpers read the same value, (2) for all stores and cams, only the first among the original and helpers has an effect, and (3) memory allocations and deallocations happen once.
Importantly, the approach does not support a general cas, since it is difficult to determine, in a general and idempotent way, whether a cas succeeded. Using a cam followed by a load to check for success does not work, since another cam could succeed between the two operations, making it appear that the first failed. It is possible to implement an idempotent cas using a double-word-wide regular cas [10], but this is impractical since it would require all versioned pointers to be maintained as double words. Furthermore, that approach is also quite complicated.
Here we describe a simple technique to implement an idempotent cas that works within the flock framework when used with snapshotting. The code is shown in Figure 6 with the redefinition of primcas. The code is deceptively simple, and its correctness is somewhat subtle. The implementation relies on the fact that when a pointer is updated with a versioned cas, the timestamp of the new version is initially set to TBD, and before the pointer can be updated again, the timestamp must be set to a real timestamp. This means that a versioned cas succeeded if and only if either the value in the location is the same as the new value of the cas, or the new value's timestamp was set (i.e., the conditions on Lines 22 and 23). This is shown more formally by the following theorem.

Theorem 6.1. The primcas on Line 20 of Algorithm 6, as used on Lines 49 and 52 of Algorithm 5, implements a linearizable cas.
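As a sketch of the idea, the following stands in for flock's idempotent cam with a plain compare-and-swap whose result is discarded. The names and layout are hypothetical, the example is single-threaded, and per-thunk logging (which makes the real cam idempotent) is omitted.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

constexpr uint64_t TBD = UINT64_MAX;

struct VerLink {
  // Set to a real timestamp before the link can be overwritten again.
  std::atomic<uint64_t> stamp{TBD};
};

// cam: compare-and-modify, a cas that does not report whether it succeeded.
void cam(std::atomic<VerLink*>& loc, VerLink* expected, VerLink* desired) {
  loc.compare_exchange_strong(expected, desired);
}

// primcas succeeded iff our new value is still installed, or it was installed
// and has since been overwritten -- in which case its stamp must have been set.
bool primcas(std::atomic<VerLink*>& loc, VerLink* old_v, VerLink* new_v) {
  cam(loc, old_v, new_v);
  return loc.load() == new_v || new_v->stamp.load() != TBD;
}
```

Note that no concurrent primcas can be writing the same new value (each freshly allocates its version link), which is what makes the two-part success check sound.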
Proof. In the following, we use Line x.y to indicate Line y of Algorithm x. We first note that two concurrent primcas operations must have different new values, since each just allocated a new object on Line 5.47. Therefore, if the cam on Line 6.21 failed, then the check on Line 6.22 will also fail, since no concurrent primcas can be writing the same value. Now consider two concurrent cas operations, and for now just the first primcas on Line 5.49. Without loss of generality, assume the cam (Line 6.21) of the first cas linearizes first and succeeds. If the cam of the second cas linearizes after the first cas runs the check on Line 6.22, then the first cas passes the check and correctly reports success. However, if the cam of the second cas succeeds and linearizes before the first runs the check, then the first will fail the check on Line 6.22. This is exactly the problem with trying to implement a cas as a cam followed by a load. But in this case, for the second cas to succeed on its cam, it must have loaded the result of the first cam into old_v on Line 5.40, and loading a versioned value sets its timestamp; hence the first primcas still detects its success through the timestamp check and properly reports it. A similar argument can be made for the second primcas on Line 5.52, except that here old_v comes from Line 5.51; in this case the timestamp on old_v must already be set since it is an indirect value. □

There is a second problem we found with using lock-free locks with snapshotting: using idempotent operations when accessing timestamps causes a significant bottleneck, because the timestamp is heavily contended. We show that although using a non-idempotent load and cam to update the timestamp can lead to a non-idempotent execution (different helpers on the same thunk can see different timestamps), this does not affect correctness. We can therefore use a non-idempotent atomic for the global timestamp (Line 4) and also a non-idempotent cas in set_stamp (Line 11). Note that the timestamp within each version link (Line 2) still needs to be idempotent, since the load on Line 28 needs to be idempotent. The use of non-idempotent global timestamps is justified by the following theorem, together with the fact that with lock-free locks, helpers run in the same epoch as the original [62]. The proof is in the full version of the paper.

Theorem 6.2. Any call to set_stamp on Line 9 of Algorithm 6 can be repeated any number of times by helper operations without affecting the correctness of versioning.
Another similar optimization is to use a non-idempotent cas and a non-idempotent retire for shortcutting (Lines 17-18). This is safe since shortcutting is a "helping" step anyway (many processes might attempt the same shortcut at the same time, and it has to appear to happen once), so it is already effectively idempotent.
Finally, we show how to directly implement store on Lines 25-34, which avoids several steps that would be required if the store were implemented with a load followed by a cas. This version, however, assumes there are no write-write races, i.e., that locks prevent two processes from storing to the same location concurrently. We therefore refer to it as store_norace. The code for this direct store is less than half as many lines as for the cas.
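A minimal sketch of what such a direct store can look like under the no-write-write-race assumption. The names (store_norace, set_stamp, VerLink) and the layout are hypothetical simplifications; memory reclamation and shortcutting are omitted.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

constexpr uint64_t TBD = UINT64_MAX;
using Val = int;  // stand-in for the application's value type

struct VerLink {
  Val* value;
  VerLink* prev;                   // older version
  std::atomic<uint64_t> stamp{TBD};
};

std::atomic<uint64_t> global_stamp{1};

// Set the stamp at most once: only the first CAS out of TBD wins, so all
// racing setters agree on a single value.
void set_stamp(VerLink* l) {
  uint64_t tbd = TBD;
  uint64_t ts = global_stamp.load();
  l->stamp.compare_exchange_strong(tbd, ts);
}

// store_norace: the caller holds a lock, so no other writer can race with us
// and a plain store suffices instead of a cas retry loop.
void store_norace(std::atomic<VerLink*>& loc, Val* v) {
  VerLink* old_l = loc.load();
  set_stamp(old_l);                // old version's stamp must be set first
  VerLink* new_l = new VerLink{v, old_l};
  loc.store(new_l);                // no write-write race, so no cas needed
  set_stamp(new_l);                // make the new version orderable by queries
}
```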

Optimistic Timestamps
One of the bottlenecks of multiversioning is incrementing the shared global timestamp. For this reason, there has been significant work on reducing the cost of maintaining timestamps [11,24,36,41,55,59,62,67]. On x86-based machines, the RDTSC instruction implements a hardware timestamp that is synchronized across cores [36,55]. This, however, is not portable, and manufacturers do not guarantee the clock will be synchronized across processing nodes on future platforms. We present a software optimistic timestamping technique called optTS, which simplifies the low-contention version clock proposed in TL2 [24]. Our variant never increments the timestamp during updates and only sometimes increments it for queries.
The optTS technique first runs each query optimistically and only increments the global timestamp if the optimistic execution aborts. Specifically, the execution only needs to abort if the query comes across a timestamp equal to its own. The approach runs the query at most twice, since the second run is guaranteed to see a consistent snapshot.
The code for the approach is given in Algorithm 7. It uses the global_stamp defined in Algorithm 4. The approach modifies read_snapshot so that after locating the version with the largest timestamp less than or equal to the current local stamp, it checks whether that stamp is equal to the local stamp (Line 5). If so, and if running optimistically, it sets the abort flag. The approach also modifies with_snapshot(f) so that it first runs the query f without incrementing the stamp (Line 11). It then checks whether the query aborted and, if so, increments the stamp and reruns (Line 15). The second run is guaranteed to produce a linearizable return value because it is essentially the same as the old with_snapshot implementation in Algorithm 4. Note that this technique requires f to be safe to run twice. This is a natural requirement since f is a read-only query on the data structure. As an optimization, queries passed to with_snapshot can periodically check the abort flag and finish early if they see it set.
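The control flow of optTS can be sketched as follows. The names and the simplified stamping protocol are assumptions for illustration, version lists are elided, and the early-exit optimization on the abort flag is omitted.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

std::atomic<uint64_t> global_stamp{1};
thread_local uint64_t local_stamp;
thread_local bool speculative;   // true on the first, optimistic run
thread_local bool aborted;

// Called by read_snapshot after finding the latest version with
// stamp <= local_stamp: an equal stamp is ambiguous, so the optimistic
// run must abort.
void check_version_stamp(uint64_t version_stamp) {
  if (version_stamp == local_stamp && speculative) aborted = true;
}

template <typename F>
auto with_snapshot(F f) {
  local_stamp = global_stamp.load();
  speculative = true;
  aborted = false;
  auto r = f();                             // optimistic run, no increment
  if (!aborted) return r;
  local_stamp = global_stamp.fetch_add(1);  // increment only on abort
  speculative = false;
  return f();                               // rerun sees a consistent snapshot
}
```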

Experimental Evaluation
We apply verlib to several concurrent set data structures to add support for linearizable range queries and for groups of find operations that act atomically (multi-finds). Our goal is to (1) measure the overhead verlib adds to the original data structures ... which reduces the number of increments. Finally, we implemented our own simpler variant of the TL2 approach (optTS), described in Section 7.

Results
Indirection-on-need. Figure 8 compares the performance of the versioned pointer algorithms presented in this paper. Indirect represents the algorithm from Section 6 without the indirection-on-need optimization, NoShortcut uses indirection-on-need but without shortcutting, and IndOnNeed uses shortcutting and is the default implementation in verlib. We also implemented a variant of versioned pointers, called RecOnce, which never uses indirect nodes and only works for recorded-once data structures (as in the WBB+ experiments). We applied this to our recorded-once variant of btree. All these variants use lock-free locks and hardware timestamping (hwTS). To measure the overhead of versioned pointers, we also show the original non-versioned data structure (Non-versioned) in the graphs. Multi-finds on this data structure are not linearizable (each find can linearize at its own point).
Overall, across the wide variety of workloads and data structures shown in Figure 8, the overhead of applying versioned pointers (with all optimizations) to a Non-versioned data structure is generally low. It is higher on lists, since everything fits in cache and traversing a list is very cheap; hence, the cost of the extra checks is more significant. For arttree, indirection-on-need improves the performance over Indirect versioned pointers by almost 2x on the left side of Figure 8c. The shortcutting optimization also consistently helps on these data structures, especially for read-mostly workloads, although not as much as removing indirection in the first place. For btree, IndOnNeed versioned pointers achieve essentially the same performance as RecOnce, while not requiring the data structure to be recorded-once. Overall, indirection avoidance is more important for larger data structures that do not fit in cache and for read-heavy workloads.
In Figure 8d, we vary the amount of contention by drawing keys from the Zipfian distribution and varying its parameter. The relative performance of the versioned pointer implementations generally stays the same across all contention levels, although at high contention shortcutting no longer helps.
The remaining experiments use IndOnNeed as the default implementation of versioned pointers.

Timestamps. Figure 9 compares the five timestamp techniques: queryTS, updateTS, hwTS, optTS, and tl2-TS. As a baseline, it also compares with No-Stamp, which never increments the global timestamp, resulting in non-linearizable snapshots. We applied all six to our lock-free versioned hashtable. In this experiment, the update rate varies from 0-100%, with all other operations being multi-finds of size 16. To the best of our knowledge, this is the first apples-to-apples comparison among a wide range of timestamp techniques. Across these experiments, hwTS tends to perform fastest because our machine supports a very lightweight rdtsc instruction for reading the hardware clock. Optimistic timestamping (optTS) achieves almost the same performance as hwTS, indicating that optimistic executions of multi-find often succeed without having to increment the global timestamp. optTS is slightly faster than tl2-TS because it is simpler and more optimized for this setting with only read-only transactions. queryTS and updateTS perform poorly in multi-point-query-heavy and update-heavy workloads, respectively, due to high contention when incrementing the timestamp. optTS outperforms hwTS at high update rates because it does almost no work.

Direct Stores. Section 6 described how to replace a load-then-cas with a store, avoiding some checks and updates. We ran experiments with and without this optimization. On workloads with 50% updates we saw up to an 8% improvement in performance (e.g., on B-trees with 100K keys and a uniform distribution). On workloads with 5% updates the improvement was negligible, as might be expected since the optimization only affects the performance of updates.

Range queries. Figure 10 compares our versioned btrees with state-of-the-art data structures supporting linearizable range queries. Updates and especially range queries on our versioned B-trees are significantly faster because of the
increased cache locality due to the large fanout at internal nodes and the batching of keys in each leaf. Among the other range-queryable data structures, only LFCA and VcasChromaticTree store a batch of keys in each leaf, and their internal nodes still only have fanout 2. Developing a general and easy-to-apply library allowed us to apply versioning to faster baseline data structures than those used in previous work.

Scalability. The previous experiments were run with verlib in lock-free mode; the graphs in Figure 11 also show its performance in blocking mode. Consistent with previous experiments on lock-free locks [8], blocking mode tends to be slightly faster before oversubscription, but drops severely in performance after oversubscription. This motivates the importance of supporting both versioning and lock-free locks.
We also plot the performance of LSKN-arttree and SB-abtree, which are state-of-the-art concurrent radix trees and B-trees, respectively. They both use blocking locks, so they also slow down after oversubscription. Our verlib arttrees and btrees perform competitively with these data structures while also being lock-free and supporting linearizable range queries. When oversubscribed, the performance of SB-abtree does not degrade as much as that of our blocking btree. We believe this is because SB-abtree takes finer-grained locks at the leaves instead of at the parent.

Space. Figure 12 gives some numbers for space in terms of bytes per entry. The space overhead for versioning is particularly small for btrees, since each node is large (up to 512 bytes) and the versioning metadata (a next pointer and a timestamp) is only needed once per node. The other structures have smaller nodes, and hence the space overhead of the metadata is larger.

Conclusion
In conclusion, we present an efficient implementation of concurrent versioned pointers that is compatible with both blocking and lock-free locks and is optimized to avoid indirection whenever possible. It is significantly easier to apply than previous work on versioned CAS objects [62], which often requires users to modify their data structures in nontrivial ways to get good performance. We package these ideas in the verlib library and apply it to several data structures to support linearizable range queries. Experiments show that these data structures are significantly faster than existing concurrent, range-queryable data structures.

Figure 1. Avoiding indirection in a binary tree by putting version links on the node being pointed to. Structures in orange are tree nodes, each with a key, and structures in green are version links, each with a previous pointer and a timestamp. The problem with removing indirection is sharing. For example, the nodes with keys 4 and 8 both point to the node with key 5, and seem to need different timestamps and previous pointers. In this case, sharing is OK: the timestamp 7, with the previous pointer pointing to the node with key 6, is correct since 0 < 7 < 8.

Figure 8. Comparing different versioned pointer implementations. Unless otherwise specified, the workload consists of 128 threads performing 20% updates and 80% multi-finds of size 16, with keys drawn from the uniform distribution. The default size for lists is 1000 and for all other data structures is 10M. dlist(10x) indicates that its throughput was scaled up by 10x.

Figure 10. Range queries. Comparing various data structures supporting linearizable range queries. Run with 100 threads (5 update threads, 95 range-query threads), keys drawn from the uniform distribution, and a data structure size of 10M.

Figure 11. Scalability. Comparing various arttree and btree implementations. Solid lines are used for data structures that support linearizable range queries, dotted lines otherwise. Run with 10M keys, 5% updates, 95% lookups, and keys drawn from a Zipfian distribution (parameter 0.99). The dotted vertical line indicates the number of cores on our machine.
Algorithm 2. The verlib interface. C++ template declarations for F and T are left out.
Algorithm 6. Combining snapshotting with lock-free locks. Only changes from Algorithm 5 are shown.