RTT-UAF: Reuse Time Tracking for Use-After-Free Detection

Memory safety continues to be a critical challenge in modern computing, with approximately 70% of the vulnerabilities reported each year attributed to memory-related issues. Among these, Use-After-Free (UAF) vulnerabilities, where a program accesses memory through a dangling pointer, pose a significant threat. Existing UAF detection methods, such as Key-And-Lock (KAL) mechanisms, incur notable performance overhead because they explicitly propagate the key and the lock address (metadata). We identify that approximately 67% of KAL's performance overhead is introduced by this metadata propagation. This paper introduces RTT, which significantly reduces performance overhead by cutting the number of memory accesses related to metadata propagation. RTT achieves an average performance overhead of 170%, substantially lower than existing KAL methods, and incurs only an 8% memory overhead on SPEC CPU 2017. Experimental evaluations on real-world UAF bugs further demonstrate that RTT's detection rate is equivalent to that of other KAL methods.


INTRODUCTION
In recent years, memory safety challenges have evolved alongside computing, becoming more complex as systems and software grow in sophistication. This long-standing issue demands continued attention and innovation in security measures to tackle emerging vulnerabilities. Microsoft and Google report that around 70% of annual vulnerabilities are memory safety related. Mozilla echoes this, noting that 32 of 34 critical bugs are memory issues. Among these, bugs caused by temporal safety violations are a key concern, commonly referred to as Use-After-Free (UAF) vulnerabilities or UAF bugs. In C/C++ terminology, when a program frees a buffer, the pointers associated with that buffer become dangling pointers.
A UAF bug happens when a program reads or writes memory through a dangling pointer. This can lead to unpredictable behavior (e.g., a system crash) or give an attacker the chance to craft an exploit (e.g., unauthorized code execution [10, 23]). Exploits based on UAF bugs can have significant impact: at the 2015 Pwn2Own contest, UAF exploits took down major browsers including Internet Explorer, Firefox, Chrome, and Safari. Even with constant effort in developing new detectors, recent UAF bugs can still lead to browser crashes, as reported by Google. This underscores the critical need to address temporal memory safety. Various detection methods have been developed to identify UAF bugs. One of the most popular approaches checks whether the accessed pointer is a dangling pointer before every memory access [3, 8-10, 21, 26]. To achieve this, the detector assigns each allocated buffer a tag/version ID, with all associated pointers sharing this ID. When the program frees the buffer, the detector resets the buffer's tag/version ID to another value. The detector can then identify a dangling pointer by checking whether the buffer's and the pointer's tags/version IDs mismatch.
The Key-And-Lock-based detector (KAL) is one of the most effective tag/version-ID-based detectors [10, 19, 20, 28]. The version ID used in KAL is 64 bits wide to ensure it never repeats [19], yielding a high detection rate. As illustrated in Fig. 1, KAL calls the version ID of the pointer the "key" and the version ID of the buffer the "lock", which are stored separately. To check that the key and the lock match before each memory access, KAL has to perform additional operations to propagate the 64-bit key and the 64-bit lock address (collectively referred to as 128-bit temporal metadata) along with each pointer. For example, during a function call that takes a pointer argument, KAL appends the temporal metadata to the end of the argument list, increasing the number of arguments. This propagation leads to additional memory accesses, as it pushes/pops more data to the stack when entering a function, resulting in significant performance overhead. To understand the details of the propagation overhead, we break down the performance overhead of UAFSan [10], the most recent open-source implementation of KAL.
Our measurements show that the performance overhead caused by metadata propagation accounts for 67% of the total, as depicted in Fig. 2. The remaining overhead comes from assigning keys and locks (which is negligible) and checking the key-lock pairs. In this work, we therefore aim to address the primary source of performance overhead, metadata propagation, while maintaining the same security guarantee. To achieve this goal, we consider reducing the overhead of propagating the key and the lock address separately.
Reducing key propagation overhead In modern computer systems, a pointer typically occupies 48 bits within a 64-bit word, leaving the upper 16 bits unused. If we can reduce the version ID to 16 bits and embed it in those unused upper bits, the key naturally propagates with the pointer, and we no longer need to instrument the code to propagate the key explicitly. However, a 16-bit version ID reduces the detection rate, because the version ID repeats once a program allocates buffers at the same address more than 2^16 times. A potential solution is to limit the number of times the program can allocate buffers at an address to 2^16. The version ID then never grows beyond 2^16 - 1, effectively preventing repetition. To achieve this, we can monitor the reuse time of each address and freeze any address that has been reused 2^16 times. However, most allocators merge or split buffers when allocating or freeing to reduce memory usage. This raises the first challenge: how to accurately and efficiently track the reuse time of each address.
Reducing lock address propagation overhead If we eliminate the explicit propagation of the lock address, the program knows only the pointer value. To conduct a UAF check, it must determine both the buffer to which the pointer points and the address of the lock associated with that buffer. Thus, the second challenge is figuring out the address of the associated lock from the pointer value alone.
To overcome these challenges, we introduce RTT, leveraging the concept of the binning allocator [5, 6, 12, 13]. This allocator divides the heap into a user-defined number of equally sized bins. All buffers in the same bin have the same size, and each starting address is divisible by the buffer size. This brings two key properties that help achieve our goal: 1) a limited number of possible starting (base) addresses; 2) the starting address is encoded in the pointer value itself. With the first property, the number of possible starting addresses is bounded, and the reuse time of each buffer is identical to the reuse time of its starting address; RTT addresses the first challenge by monitoring only the reuse time of each potential starting address. With the second property, given a pointer value, the program can recover the starting address of the buffer through a series of arithmetic operations. RTT further introduces a shadow memory with a one-to-one mapping to each potential starting address to store the lock. Consequently, RTT can retrieve the lock address from the pointer value alone through a sequence of arithmetic operations, tackling the second challenge by saving the lock at the address mapped one-to-one to the possible starting address.
We evaluate the performance and memory overhead of RTT across 13 real-world benchmarks from SPEC CPU 2017 [2] and eight open-source programs collected from GitHub [10]. We also evaluate the success rate of using RTT to detect real-world UAF bugs. RTT performs remarkably well in comparison to other KAL methods, showing an average performance overhead of 170% on SPEC CPU 2017 relative to the native run. This is significantly lower than the 600% and 1900% average overheads observed in the other two open-source KAL-based methods. In addition, RTT incurs negligible memory overhead, measured at 8% compared to the native run. Furthermore, RTT demonstrates a UAF bug detection rate equivalent to that of other KAL methods.
Our contributions can be summarized as follows:
• We identify that a substantial portion of the performance overhead (67%) in KAL originates from propagating metadata.
• We introduce RTT, a novel approach that effectively removes the excessive memory accesses arising from propagating the metadata in KAL.

BACKGROUND

2.1 Use-After-Free (UAF)
When the program frees a buffer, the associated pointer becomes a dangling pointer. UAF occurs when the program reads or writes through a dangling pointer. An attacker can exploit this behavior to execute malicious code, creating a UAF exploit (as will be illustrated in Fig. 5). First, the attacker allocates a buffer that overlaps with the freed buffer. The two buffers can overlap in various ways: only in the lower address part; only in the higher address part; only in the middle; or over the entire address range. The attacker then replaces the data saved in the overlapped region with malicious code or function pointers. Finally, when the program later dereferences the dangling pointer, it jumps to execute the malicious code. In summary, a UAF exploit depends on two behaviors: 1) the program dereferences a dangling pointer and 2) the attacker overwrites the freed buffer pointed to by that dangling pointer. We demonstrate a detailed example of UAF exploitation in Listing 1. It is essential to recognize that this example is only a simplified illustration. Real-world exploitation is considerably more complicated [27], often involving unpredictable user input, function calls, or branching. It is challenging to detect all dangling pointers with static analysis alone.

Original Code
In line 1, the program defines a structure, X, which contains only one element: a pointer to a predefined function func(). In line 4, the program calls malloc to allocate a buffer storing X (buffer 1), returning pointer p1. The program then frees buffer 1 in line 5, after which p1 becomes a dangling pointer. However, in line 12, the program dereferences p1 to call func().
Attacker Inserted Code In response to this code, the attacker can insert malicious code after the program frees buffer 1 and before the program dereferences the dangling pointer p1, highlighted in lines 6 to 11. In line 8, the attacker allocates another buffer, buffer 2. The attacker can modify any data saved at the addresses covered by buffer 2. If these addresses overlap with those of buffer 1, the attacker can also modify the structure X previously saved in buffer 1. In this way, the attacker can replace the pointer to func() with a pointer to a malicious function in line 10. When the program calls func() via the dangling pointer p1 in line 12, it is redirected to execute the malicious function.

Key-And-Lock (KAL)
Before each memory access, a tag/version-ID-based detector checks whether the tag/version ID of the pointer and the buffer match. Among the methods in this category, KAL stands out as a prominent one, demonstrating a higher detection rate. KAL uses a 64-bit version ID to ensure there is no version ID collision [19]. KAL calls the version ID of the pointer the 'key' and the version ID of the buffer the 'lock'. Listing 2 illustrates an example of KAL, with all shaded code instrumented by KAL. In line 10, the program allocates a buffer. In response, KAL assigns this buffer and the associated pointer a unique version ID in line 13. This version ID starts from "1" and increases by "1" after each allocation, ensuring no two buffers have identical version IDs. KAL then finds a disjoint memory space to save the lock in lines 15 and 16, and saves the key and the lock address in a hashtable in line 17. After freeing a buffer, KAL resets the lock to a specific value, such as "0", while keeping the key and lock address unchanged, as seen in line 19. Before each pointer dereference, KAL checks whether the key and lock match; any mismatch is reported as a UAF bug. To ensure that the detector knows the key and lock during each UAF check, KAL has to instrument the code to explicitly propagate the key and lock address. To check whether pointer 1 is a dangling pointer, KAL verifies that the key matches the lock in lines 4 and 5. However, the key and lock address would be lost after entering function f. To address this, KAL appends the key and lock address associated with pointer 1 to the end of the argument list, as shown in line 2. This instrumentation for propagating metadata introduces additional memory accesses, consequently increasing performance overhead.

Binning Allocator
The binning allocator constrains the starting address and size of each buffer, exhibiting two key properties that can help reduce the propagation operations in KAL: (➊) a finite number of possible starting addresses and (➋) the buffer's starting address and size are encoded directly in the pointer value, irrespective of the offset within the buffer to which the pointer refers.
A finite number of possible starting addresses The binning allocator divides the heap region into a specific number of equal-sized bins. All buffers saved in the same bin have the same size. Additionally, buffers must start at an address that is a multiple of the buffer size. To determine the appropriate bin for storing a buffer, the binning allocator employs a size table containing as many distinct sizes as there are bins. During allocation, the binning allocator pads a buffer to match the nearest larger size in this table, then places it in the corresponding bin. Consider an example with a binning allocator configured with 3 bins, as shown in Fig. 3. Each bin occupies 96 bytes, starting at addresses 0x00, 0x60, and 0xc0, respectively. Bin 1 stores 16-byte buffers, bin 2 stores 32-byte buffers, and bin 3 stores 48-byte buffers. The binning allocator also constructs a size table containing three entries: "16 bytes", "32 bytes", and "48 bytes". When the program wants to allocate a 17-byte buffer, the allocator first checks the size table. Since 17 falls within the range of 16 to 32, it selects a size of 32 bytes. The binning allocator then pads the buffer to 32 bytes and puts it in bin 2. Since there are only three addresses divisible by 32 in bin 2 (0x60, 0x80, and 0xa0), the binning allocator can only use one of them as the starting address. Finally, the allocation function returns the pointer address 0x60.
Encoding the buffer starting address and buffer size within the pointer value Listing 3 demonstrates how to calculate a buffer's starting address and size from the pointer value itself. First, the allocator calculates the bin number (bin ID) by dividing the pointer address by the predefined bin size, in line 3. It then uses the bin ID as an index to retrieve the buffer size from the size table, in line 4. Finally, the allocator divides the address by the buffer size, takes the integer part, and multiplies it by the buffer size, as demonstrated in line 5. Taking Fig. 3 as an example, when a pointer points to address 0x65, we get the bin ID as 0x65/0x60 + 1 = 2. By checking the size table, we find that the buffer size is 32 bytes. We can then compute the starting address of the buffer pointed to by this pointer as (0x65/0x20) * 0x20 = 0x60.

RTT

3.1 Problem Characterization
Performance Overhead of KAL In Section 2.2, we emphasize that KAL must propagate the key and lock address alongside the pointer to conduct UAF checks. This propagation introduces additional memory accesses, leading to significant performance overhead. There are multiple scenarios in which KAL needs to propagate the metadata: entering a function call with pointer arguments; leaving a function call that returns a pointer; and assigning one pointer to another. In Fig. 2, we present a breakdown of the performance overhead of UAFSan, a recent open-source KAL implementation, on SPEC CPU 2017 (under optimization level -O0). We classify the overhead into three components: assigning metadata in each allocation/free function, propagating the metadata, and checking for UAF bugs before each memory access. Our analysis reveals that a significant portion (67%) of the performance overhead stems from propagating the key and lock address. Eliminating this propagation instrumentation has the potential to reduce the overall performance overhead. At the same time, we do not want to undermine the security guarantee of KAL.
Elimination of the Explicit Key Propagation Notably, in C/C++ a pointer occupies 48 bits while being stored as 64-bit data, leaving the upper 16 bits unused. If we reduce the version ID from 64 bits to 16 bits, we can embed the key in those unused upper bits and propagate it together with the pointer without any instrumentation. However, given that there are only 2^16 possible values for a 16-bit version ID, and the ID increases sequentially, allocating buffers at the same address more than 2^16 times causes the version ID to repeat, undermining the detection accuracy of KAL. To avoid this, we can track the reuse time of each address and freeze any address that has held more than 2^16 allocations. However, the buffer splitting and merging scheme in modern allocators imposes a significant challenge here: the program may allocate buffers with various starting addresses and sizes over the same address range. Tracking the reuse time of a buffer becomes challenging unless the reuse time of each byte covered by the buffer is counted and saved, and adopting this approach directly introduces substantial memory and performance overhead. Regarding memory overhead, the default allocation granularity is 8 bytes, so tracking the reuse time of each 8-byte unit with a 2-byte counter incurs at least 25% memory overhead. Regarding performance overhead, when the program allocates a large buffer, it must 1) collect the reuse time of each address covered by the buffer; 2) select the maximum among them; and 3) update the reuse time of every covered address with that maximum. The first challenge lies in accurately tracking the reuse time of each address without incurring high performance or memory overhead.
Elimination of the Explicit Lock Address Propagation Eliminating the explicit propagation of the lock address requires that the detector derive the necessary information solely from the pointer value. A pointer may point to any offset within a buffer rather than to its starting address. To determine the lock address, we must first know the starting address and size (or end address) of each buffer, then identify which buffer covers the given pointer, and finally determine the location of the lock associated with that buffer. The second challenge lies in calculating the associated lock address using only the pointer value itself.

Overview
We find that the intrinsic properties of the binning allocator make it possible to solve both challenges. As mentioned in Section 2.3, the binning allocator places buffers of the same size in the same bin and makes the starting address of each buffer divisible by the buffer size. With this design, when the program frees a buffer and later allocates a new one, there are only two possible cases: they either (➊) completely overlap or (➋) do not overlap at all; they cannot partially overlap. Taking Fig. 5 as an example, bin 1 can only allocate buffers of 16 bytes and spans 0x0000 to 0x0300. The program first allocates buffer 1 (0x0100 to 0x0200) and frees it. When the program then allocates a new 16-byte buffer, the binning allocator can only place it starting at (➊) 0x0000, (➋) 0x0100, or (➌) 0x0200. In cases ➊ and ➌, the new buffer does not overlap with buffer 1 at all. In case ➋, the new buffer overlaps buffer 1 with exactly the same starting and ending addresses (buffer 6 in Fig. 5). The other buffers (buffer 2 to buffer 5) are not allowed by the binning allocator: buffer 2 is larger than 16 bytes; buffer 3 and buffer 4 do not start at addresses divisible by 16; buffer 5 is smaller than 16 bytes. This property ensures that there is no buffer merging or splitting and that there is only a limited number of possible starting addresses; the reuse time of every byte inside a buffer is always the same. To solve the first challenge, we therefore only need to track the reuse time of each starting address. Since the reuse time is strictly increasing, we can also use it as the lock/key at the same time. We then save the reuse time in a shadow memory that has a fixed one-to-one address mapping with each starting address. Given any pointer value, we first obtain the buffer's starting address with a combination of arithmetic operations, as shown in Section 2.3, and then obtain the address where the reuse time is saved through an extra address-mapping calculation. For the second challenge, we can thus derive the pointer's associated lock address with a series of arithmetic operations: the starting-address calculation followed by the mapping calculation between the starting address and shadow memory.
Fig. 4 demonstrates the overall workflow of RTT, which consists of two main steps: static instrumentation and dynamic heap management. During static instrumentation, RTT accepts C/C++ source code as input and replaces every allocation/deallocation function with the binning allocator's version. Before each memory access, RTT instruments the code to conduct a UAF check. During runtime management, RTT allocates/deallocates buffers with the binning allocator and manages the keys and locks. During each UAF check, it dynamically calculates the lock address and compares the key saved in the pointer's upper bits with the lock saved at that address.

Details
In this section, we first introduce how RTT builds the shadow memory with a fixed address mapping between each buffer's starting address and the shadow memory. We then introduce how RTT manages keys and locks at runtime. After that, we introduce an optional optimization to reduce memory consumption in extreme cases. Finally, we depict how RTT works as a whole.
3.3.1 Lock Address Calculation. Recall that the binning allocator organizes buffers of the same size into the same bin and enforces that each buffer's starting address is aligned to the padded size. This alignment means the starting address of each buffer is inherently encoded within the pointer itself. At the same time, a buffer's position within its bin, called the buffer index (buffer ID), is also encoded in the pointer. These insights strongly advocate for shadow memory as the most intuitive way to save the lock: by establishing a fixed mapping between the buffer ID and its lock address, we create a direct and efficient connection between a buffer's address and its corresponding lock address. The calculation of this mapping is demonstrated in Listing 4. RTT saves the starting address of each bin's locks in an array/table named "lock_table_base", whose number of entries equals the number of bins; the address in each entry, the starting address of that bin's locks, is predefined at compile time. RTT first computes the bin ID in line 4 and the size in line 5, then derives the buffer ID in line 6. Since the buffer ID and the lock address in shadow memory are aligned, the buffer's lock address = (buffer ID) * (lock size) + (starting address of the current bin's locks), as in line 7. Taking Fig. 3 as an example, the bin ID, size table, and offset table are aligned, with each bin spanning 96 bytes. Given a pointer to address 0x82, we first get its bin ID as (0x82/0x60) + 1 = 2. Checking the size table, the buffer size is 32 bytes. We then calculate the buffer ID as floor((0x82 - 0x60)/0x20) = 1. The starting address of the second bin's locks is saved in the second entry of the offset table: 0xff30. Finally, we get the lock address of this buffer: 0xff30 + 1 * lock_size (0x02) = 0xff32.

3.3.2 Key and Lock Management.
At the beginning of the program, each element in the shadow memory is set to "0", meaning no buffer has been allocated at that address. Once the program allocates a buffer at an address, the lock starts from "1". After the program frees this buffer, RTT increases the lock by "1" instead of resetting it to "0". When the program allocates a new buffer at the same address, the buffer directly inherits the current lock. Note that this does not undermine the UAF detection rate: since the lock strictly increases, any dangling pointer has a key value less than the lock. At the same time, each distinct starting address increases its lock value independently, and RTT can assign the same lock value to adjacent buffers, since UAF incidents only occur between buffers allocated at the same starting address. We maintain the assumption of no spatial memory safety violations [19][10], ensuring every memory access stays within the valid address range of its buffer; any UAF bug resulting from accessing a buffer beyond its valid bounds is considered a spatial memory safety violation. In Section 5, we evaluate the latency and memory overhead of RTT with and without spatial memory protection, as in [19].
Whenever the program has allocated a buffer at the same address 2^16 - 1 times, RTT freezes that address against any further allocation. Over time, every address within a bin may be allocated 2^16 - 1 times, at which point the bin is considered "used up". To resolve this and facilitate ongoing allocation, we propose expanding the predefined size table. For instance, if the table initially comprises 64 predefined sizes, we reserve 128 entries instead, leaving the additional 64 entries unassigned initially. When a bin becomes used up, we employ a two-step solution. First, RTT copies the size of the used-up bin to one of the empty entries in the expanded table. Then, RTT creates a new bin to accommodate all subsequently allocated buffers of that size and redirects allocation requests of that size to the new bin. After this, the used-up bin is no longer permitted to allocate any buffer of that size, while retaining the capability to free or read/write existing buffers.

3.3.3 Memory Freeing.
RTT lets each virtual address be reallocated up to 2^16 - 1 times. Once this limit is reached, the virtual address is restricted from allocating additional buffers. However, in C/C++ the associated physical memory is not explicitly freed, leading to potential growth in memory consumption. To mitigate memory usage in such extreme cases, we could sweep the heap and release the physical memory behind virtual addresses that can no longer be used, but this method lacks efficiency. RTT presents an alternative solution that focuses on each buffer's lock. When the program frees a buffer whose address is no longer allowed to allocate, RTT sets its version ID to the unique value "0". With this special value, instead of sweeping the entire heap, RTT sweeps the lock table: it identifies continuous pages whose every associated lock is "0" and returns these pages to the system, releasing physical memory. While this approach effectively mitigates memory consumption, the sweep operation introduces performance overhead, and the sweep frequency determines how much. To find a suitable trade-off between latency and memory overhead, RTT accumulates the size of memory that can be freed, called the no-allocation region. From this value, RTT calculates the maximum number of continuous pages this region can cover, labeling them no-allocation pages. A predefined threshold of T no-allocation pages is established; once the accumulated count of no-allocation pages surpasses T, RTT executes the memory-freeing operation.

DISCUSSION & LIMITATION
RTT focuses on protecting heap buffers allocated through the malloc function family, replacing the default allocator with a binning allocator. The underlying concept can also be adapted to other allocation functions, such as the mmap family, providing a flexible and extensible approach. As outlined in the previous section, RTT freezes an address once more than 2^16 - 1 buffers have been allocated at that location. This restriction affects only the virtual address space, which offers 2^48 available bytes. Given that UAF detectors are mainly used during the software testing phase, where programs typically run for a limited duration, it is uncommon to exhaust all available virtual addresses when using RTT. Additionally, the RTT prototype is designed to be thread-safe and can operate effectively in parallel programs, ensuring its applicability and reliability under concurrent execution.

EVALUATION

5.1 Methodology
We implement our work with Clang 4.0.0 and LLVM 4.0.0, on top of LowFat [5], a prototype of the binning allocator proposed for spatial memory safety protection. We modify the allocation function and insert additional UAF detection instructions through LLVM IR. To optimize performance, we leverage the LLVM GoldPlugin tool to inline all instrumented code, following the approach of previous work (e.g., CETS [19]). In this paper, we evaluate RTT against four other open-source UAF detectors (sanitizers): CETS [19], UAFSan [10], AddSan [21], and TSan [22]. CETS and UAFSan are both KAL prototypes; AddSan is a widely used sanitizer that detects UAF bugs probabilistically; and TSan is designed to detect UAF bugs related to data races but applies to general UAF bugs. We compare RTT with these four detectors from three perspectives:
• Performance: We measure the latency of each detector and normalize it to the latency of the native run. Our objective is to determine whether RTT achieves lower performance (latency) overhead than the other UAF detectors.
• Memory: We measure the memory consumption of each detector and normalize it to that of the native run. Our objective is to determine whether RTT achieves lower memory overhead than the other UAF detectors.
• Security: We evaluate the detection rate of each detector on real-world UAF bugs. We aim to determine whether RTT achieves a higher detection rate than the other UAF detectors.
We consider two types of real-world benchmarks that differ in code size (KLOC) and application area, providing a diverse test set, as shown in Table 1:
• SPEC CPU 2017 benchmarks: We focus on the C/C++ subset of the SPEC CPU 2017 benchmarks, which are widely used for assessing performance and memory overhead in related work [7, 11, 15, 29]. We select a subset comprising eight integer and five floating-point benchmarks.
• Open-source programs collected from GitHub: We select eight widely used open-source programs from GitHub with real-world UAF vulnerabilities [10]. We evaluate performance, memory, and security on this benchmark.
To measure latency, we use the built-in timer on the SPEC CPU 2017 benchmarks and the Linux built-in timer on the open-source programs collected from GitHub. To measure memory consumption, we use the peak Resident Set Size (RSS), representing the maximum physical memory occupied during program execution, as done in previous work [6, 10, 15]. To evaluate security, we use the UAF bugs found in the collected GitHub programs that are recorded in the Common Vulnerabilities and Exposures (CVE, https://cve.mitre.org/) list. We exclude the GoHttp program when evaluating performance and memory: GoHttp is a web application that keeps running without stopping and thus has no meaningful end-to-end latency.

Experiment Setup
Our experiments are conducted on a four-core Intel i7-6700 CPU (3.40GHz) with 32GB RAM, running Ubuntu 18.04. All programs are compiled at optimization level 0 (the "-O0" flag).

Performance
To ensure a fair comparison, we compare RTT separately against detectors that require different settings. To compare RTT with CETS and AddSan, we add spatial protection to RTT (provided by LowFat) and inline the instrumented code; this is because both CETS and AddSan utilize code inlining, and we cannot disable their spatial protection. To compare RTT with UAFSan and TSan, we consider only UAF detection and do not inline the instrumented code, because TSan and UAFSan do not support inline mode.
As depicted in Fig. 6a, the performance overhead of RTT with inlining ranges from 83% to 413%, averaging 174%. In comparison, the performance overheads of CETS and AddSan are 620% and 100%, respectively. Although AddSan has a lower performance overhead, it comes with higher memory overhead and a lower detection rate, as detailed in the subsequent sections. Examining Fig. 6b, the performance overhead of RTT without inlining spans from 149% to 548%, with an average of 275%. At the same time, TSan and UAFSan show substantially higher performance overheads (e.g., 833% for TSan), because their metadata propagation and addressing introduce extra memory access operations. In contrast, the metadata addressing in RTT relies only on a combination of arithmetic operations and does not involve any memory access. Fig. 7a and Fig. 7b present comparisons across the methods on seven real-world open-source benchmarks from GitHub. Among these, CETS encounters failures on gravity, mxml, and lua, while UAFSan fails to run on gravity. RTT with inlining shows the lowest average performance overhead of 145%, outperforming both CETS and AddSan, which incur 375% and 565%, respectively. RTT without inlining has an average overhead of 186%, outperforming TSan and UAFSan, whose overheads are 1120% and 645%, respectively.

Memory
Unlike the previous section, we compare the memory consumption of RTT with spatial protection against CETS, AddSan, TSan, and UAFSan together. As inlining and spatial protection do not impact memory consumption, comparing all methods together is clearer. On SPEC CPU 2017, as illustrated in Fig. 8, RTT shows the lowest average memory overhead of 7.6%. CETS, AddSan, TSan, and UAFSan have average memory overheads of 288%, 337%, 420%, and 114%, respectively. On the seven real-world programs gathered from GitHub, RTT, CETS, AddSan, TSan, and UAFSan show average memory overheads of 77%, 208%, 144%, 144%, and 142%, respectively.
Compared to the other methods, the reduced memory overhead of RTT mainly stems from its smaller metadata size and delayed quarantine. Compared with CETS and UAFSan, which employ an eight-byte lock for each buffer and 16 bytes of metadata (key and lock address) per pointer, RTT requires only a two-byte lock per buffer. Compared with AddSan, which quarantines each buffer upon every deallocation, RTT quarantines an address only after it has been reused 2^16 times. This delayed quarantine significantly reduces memory consumption. Compared with TSan, RTT uses a simpler data structure to store metadata: TSan is designed for data races and traces the dependencies of each read and write operation across different threads, requiring a complex and extensive dependency graph.

Security
As shown in Table 2, we test RTT and the related methods on real-world UAF bugs collected from the eight real-world programs from GitHub. These UAF bugs include buffer reallocation and double-free. Both RTT and UAFSan successfully identify all the UAF bugs that they can compile and execute (UAFSan fails to compile on the UAF bug collected from the gravity benchmark). In contrast, CETS achieves a lower detection rate, successfully identifying only five UAF bugs. This is mainly because the prototype of CETS is implemented in LLVM 3.4, which is outdated and thus fails to instrument the code properly. AddSan, which relies on quarantining the address of a freed buffer without applying any label to the pointer, fails to detect four UAF bugs. This failure is caused by the potential escape of a dangling pointer from UAF checks after the quarantine period. TSan, designed primarily for data race detection rather than general UAF bugs, fails to detect three UAF bugs.

RELATED WORK 6.1 UAF Detection
UAF detectors [11,17,18,21] track the lifetime information (metadata) of each pointer/buffer and explicitly check if the accessed address is valid before each memory access.
Fat Pointer In fat pointer-based approaches [17], the pointer is made wider. The widened part stores the pointer's tag/ID, the buffer's tag/ID address, etc., so that metadata propagates with the pointer naturally. However, this may require the detector to modify the hardware, such as expanding the register size, which leads to additional memory and performance overheads. It may also require the detector to modify the representation of each pointer, which introduces compatibility issues with unprotected code and incurs extra performance overhead.
Non-fat Pointer In contrast to the fat pointer-based approach, another line of work [3,11,15] stores metadata in the upper unused bits of the pointer, placing a tag, ID, or hash-table index there. Due to the limited number of unused bits, non-fat pointer-based approaches suffer from tag/ID repetition or hash-table collision problems. Although RTT also uses the pointer's upper unused bits, it does not have these problems: RTT explicitly tracks the reuse time of each address and freezes any address that has been reused 2^16 times. The version ID in RTT never overflows, so the same address is never assigned a repeated version ID. This gives RTT a much higher UAF detection rate than the other non-fat pointer-based approaches.

Garbage Collector
The garbage collector [1,7,9,21,26,27] quarantines the addresses covered by freed buffers until these addresses are considered safe to be reused. The program cannot allocate buffers at addresses under quarantine. The garbage collector treats each memory unit (e.g., every 8 bytes) as a potential pointer. If any potential pointer points to an address covered by a freed buffer, that freed buffer is considered unsafe and is kept under quarantine. The freed buffer is only considered safe to reuse when no potential pointers point to it. To identify all potential pointers, this method has to periodically scan the entire memory region, which increases both performance and memory overhead. AddSan can also be considered a garbage collector-based method with a slight difference: instead of sweeping the entire memory periodically, AddSan considers an address safe to reuse once the total memory occupied by the quarantined addresses reaches a predefined threshold. However, AddSan may fail to detect UAF bugs that occur after the addresses associated with the freed buffer are removed from quarantine.
Table 2: The detection rate on the real-world UAF bugs: ✓ denotes successful bug detection; ✗ denotes failed bug detection; a blank entry means the method fails to compile or run.

Secure Allocator
The secure allocator [4,25] aims to prevent any address from being reused. Once a buffer is freed, its virtual address space becomes unavailable for all subsequent allocations, so the program cannot allocate buffers at the same address. This approach eliminates the need to sweep the entire memory region since no addresses are ever reused. However, it may lead to virtual address exhaustion in long-running programs, because the virtual address space is finite.

Hardware Extensions
The hardware extension-based approach optimizes memory safety by introducing additional hardware components such as memory units [11,17,24,29,30], calculation units [14], or logic units [6]. AOS [11] and PACMem [15] store metadata in a separate hash table. Utilizing the ARM PAC extension [16], they store the index into the hash table in the pointer's upper unused bits. AOS goes a step further by introducing a cache table to accelerate metadata retrieval. However, these methods suffer a high hash-table index collision rate when there are a large number of allocations. C3 [14], on the other hand, uniquely encodes each buffer using its starting and ending addresses, and provides additional protection against unauthorized reads by incorporating encryption and decryption pipelines. No-Fat [29] is a hardware implementation of LowFat that replaces the software size table with a hardware table located alongside the CPU core, offering a faster pipeline for metadata retrieval. However, implementing these hardware-based approaches in real-world applications poses significant challenges compared to implementing software detectors.

CONCLUSION
In conclusion, this paper identifies metadata propagation as the primary source of performance overhead in KAL methods. To address this challenge, we propose RTT, which leverages the inherent characteristics of the binning allocator to reduce performance overhead by eliminating the memory accesses introduced by metadata propagation. We extensively evaluate RTT on real-world benchmarks, including SPEC CPU 2017 and collected GitHub programs, and demonstrate that RTT outperforms other KAL methods with significantly lower performance and memory overhead. Specifically, RTT achieves an average performance overhead of 170% and a minimal 8% memory overhead on SPEC CPU 2017, while exhibiting 145% performance overhead and 77% memory overhead on the collected GitHub programs. Importantly, RTT maintains a detection rate equivalent to other KAL methods.

Figure 1: A diagram of Key-And-Lock-based UAF detection

Figure 2: Performance overhead breakdown of UAFSan on SPEC CPU 2017

Figure 3: Buffer Allocation and Version ID/Tag Management

Figure 4: The whole workflow of RTT

Figure 5: Overlap of addresses between freed and reallocated buffers.

3.3.4 An Example of RTT. As shown in Listing 5, RTT no longer needs to instrument the code to explicitly propagate the key or lock address. There are only three kinds of instrumentation in RTT: (➊) set the lock and key after each allocation (lines 10-11); (➋) modify the lock after each free (line 16); (➌) get the key and lock to conduct the UAF check before each memory access (lines 13 and 3-4). As shown in Listing 4, lines 11 to 17, setting the lock after allocation/free and getting the lock for each UAF check use the same addressing function.

Figure 6: Performance overhead comparison on SPEC CPU 2017 (non-inlined and temporal protection only)

Figure 7: Performance overhead comparison on real-world open-source programs