ReCon: Efficient Detection, Management, and Use of Non-Speculative Information Leakage

In a speculative side-channel attack, a secret is improperly accessed and then leaked by passing it to a transmitter instruction. Several proposed defenses effectively close this security hole by either delaying the secret from being loaded or propagated, or by delaying dependent transmitters (e.g., loads) from executing when fed with tainted input derived from an earlier speculative load. This results in a loss of memory-level parallelism and performance. A security definition proposed recently, in which data already leaked in non-speculative execution need not be considered secret during speculative execution, can provide a solution to the loss of performance. However, detecting and tracking non-speculative leakage carries its own cost, increasing complexity. The key insight of our work that enables us to exploit non-speculative leakage as an optimization to other secure speculation schemes is that the majority of non-speculative leakage is simply due to pointer dereferencing (or base-address indexing), which is essentially what many secure speculation schemes prevent from taking place speculatively. We present ReCon, which: i) efficiently detects non-speculative leakage by limiting detection to pairs of directly-dependent loads that dereference pointers (or index a base-address); and ii) piggybacks non-speculative leakage information on the coherence protocol. In ReCon, the coherence protocol remembers and propagates the knowledge of what has leaked and therefore what is safe to dereference under speculation.
To demonstrate the effectiveness of ReCon, we show how two state-of-the-art secure speculation schemes, Non-speculative Data Access (NDA) and Speculative Taint Tracking (STT), leverage this information to enable more memory-level parallelism both in a single-core scenario and in a multicore scenario: NDA with ReCon reduces the performance loss by 28.7% for SPEC2017, 31.5% for SPEC2006, and 46.7% for PARSEC; STT with ReCon reduces the loss by 45.1%, 39%, and 78.6%, respectively.

CCS CONCEPTS
• Computer systems organization → Superscalar architectures; • Security and privacy → Hardware-based security protocols.


INTRODUCTION
Since the discovery of speculative side-channel attacks [24,29], a wide variety of transient execution attacks have been found [10,22,23,25,30,32,41]. These attacks vary in attack method, being able to leak information through port contention [10], micro-op caches [35], and reorder buffer contention [3], and even to break ARM Pointer Authentication [32]. To mitigate speculative side-channel attacks, a slew of speculative execution defenses have been proposed [4,5,14,28,38,39,47,52,53,56]. These defenses differ in the threat model they operate under, their performance overhead compared to an unsafe baseline, and the number of modifications they introduce to the system.
Early works, such as InvisiSpec [53] and Ghost Loads [38], as well as later proposals such as Muontrap [5], provide solutions to block speculative side-channel attacks in the cache hierarchy, but do not eliminate many other side-channels [9].
Later works, such as Non-speculative Data Access (NDA) [52] and Speculative Taint Tracking (STT) [56], improve on previous work by securing more speculative side-channels. Their common characteristic is that they focus on delaying potentially dangerous, speculative transmitter instructions. STT outperforms NDA through a method called taint tracking, in which potential secrets are tracked, and only possible transmitters are delayed. Still, STT is unable to issue a load that depends on a speculative load, since the second, dependent load is considered a transmitter. This results in a loss of instruction- and memory-level parallelism and a corresponding loss in performance.
To recover some of the performance lost to these defenses, several optimizations have been proposed. They rely either on compiler support [48,58] or on modifications of the core microarchitecture and of the memory hierarchy [4,55]. We take a different approach to the optimization of secure speculation schemes by proposing to efficiently detect non-speculative leakage and use this information to lift security mechanisms for safe speculative loads.
Our approach is based on a new security definition concerning the exposure of secrets, proposed by Choudhary et al. in Speculative Privacy Tracking (SPT) [14]. Under this definition, "any data that can leak through the program's non-speculative execution should not be treated as secret during the program's speculative execution" [14].
SPT leverages this security definition to provide comprehensive mitigation for speculative side-channels. More specifically, SPT proposes continuous taint tracking that spans all execution (both non-speculative and speculative) to detect and track non-speculative leakage. Assisted by a sophisticated forward and backward untaint mechanism, SPT maximizes the amount of non-speculative leakage detected. However, this requires changes of significant complexity in the core. Moreover, SPT protects secrets loaded in registers prior to speculative execution, albeit at a relatively steep performance cost. Register protection, however, is not a requirement to eliminate universal read gadgets that can leak all memory [14], which is the main focus of our work. Besides, non-speculatively accessed secrets can be protected by other means [54], which renders such a high-cost register protection mechanism less appealing for our purposes.
Building on the SPT security definition, we make the following observation: A large part of the information leakage prevented by secure speculation schemes, such as STT and NDA, simply comes from load pairs that dereference pointers or index base-addresses. In the interest of conciseness, for the remainder of this paper, we will liberally use the term pointer to refer to either an (8-byte) address loaded from memory or an (8-byte) integer index that is loaded from memory and added to a (constant) base address. These two cases are indistinguishable in our approach. Similarly, we use the gerund dereferencing to encompass both pointer dereferencing and base-address indexing.
Moreover, trying to prevent exactly this type of leakage is what causes a significant performance loss in secure schemes. While program execution may leak (non-speculatively) in many different ways and through many different side-channels, we focus exclusively on exploiting the non-speculative leakage due to load pairs.
The focus on load pairs enables us to distill the detection and management of non-speculative leakage (that is highly complex in the general case) into two simple functions: i) observe a load pair dereferencing pointers, and ii) mark this particular pointer's value as safe to dereference under speculation. If two loads execute non-speculatively, where the second load dereferences the value of the first load, then our approach, called ReCon, reveals the value of the first load as public. If a value has not been revealed or is changed, it must be concealed as secret. In ReCon, we mark the address that contains a revealed or concealed value, correspondingly, as such.
Consider an example with four loads, PC1 through PC4, where PC1 loads from address 0x13. PC2 dereferences the value of the PC1 load (i.e., uses the value stored at address 0x13 as an address), revealing this information to the memory hierarchy. Committing this non-speculative access (PC2), as part of the intended program execution, means that the value stored at 0x13 cannot be considered a secret any longer. ReCon marks the memory location 0x13 as revealed. This holds true until the value at 0x13 is updated by a store instruction, in which case the memory location 0x13 would be marked as concealed. Speculatively reading from the revealed memory location 0x13 (e.g., by the PC3 load) means that there is no need to apply speculative defenses, and the dependent PC4 load can dereference the value and execute without leaking anything that is not already public.
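The reveal/conceal life cycle described above can be illustrated with a small software model. This is only a sketch under simplifying assumptions: the class name `ReconMemory` and its methods are hypothetical, one reveal bit is kept per address, and speculation is reduced to a flag returned alongside the loaded value.

```python
class ReconMemory:
    """Toy memory with one reveal bit per address (hypothetical model, not the hardware design)."""

    def __init__(self, contents):
        self.contents = dict(contents)  # address -> value
        self.revealed = set()           # addresses whose current value has leaked non-speculatively

    def store(self, addr, value):
        # A store writes a value that has not leaked yet, so the address is concealed again.
        self.contents[addr] = value
        self.revealed.discard(addr)

    def commit_load_pair(self, first_addr):
        # Models PC1 (load from first_addr) followed by PC2 (dereference of that value),
        # both committing non-speculatively: the value at first_addr is now public.
        pointer = self.contents[first_addr]
        self.revealed.add(first_addr)
        return self.contents.get(pointer)

    def speculative_load(self, addr):
        # Models PC3: returns the value and whether a dependent load (PC4)
        # may dereference it without the underlying defense stepping in.
        return self.contents[addr], addr in self.revealed


mem = ReconMemory({0x13: 0x40, 0x40: 7})
before = mem.speculative_load(0x13)   # concealed: defenses stay in place
mem.commit_load_pair(0x13)            # PC1 + PC2 commit
after = mem.speculative_load(0x13)    # revealed: PC4 may proceed
mem.store(0x13, 0x50)                 # new value, concealed again
after_store = mem.speculative_load(0x13)
```

In this model, only the committed load pair flips the bit to revealed, and any store flips it back, which mirrors the two simple functions the text distills the problem into.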
A further insight that drives our work concerns the preservation and propagation of the conceal/reveal information throughout the cache hierarchy: it is possible to propagate the information efficiently via the coherence protocol. We tag cache lines with reveal/conceal information, and we piggyback this information on the coherence transactions of a standard directory MESI protocol (or similar). Coherence handles the updates to the value of the pointer, at the same time resetting the leakage information as needed. We explain this in Section 5.3, where we show that without modifications to the base coherence protocol (only adding conceal/reveal information on top of it), we can effectively manage leakage information throughout a multicore cache hierarchy for both single- and multi-threaded workloads.
The overall approach, called ReCon (Reveal/Conceal), is an efficient, low-complexity technique for detecting revealed addresses from non-speculative pointer dereferencing, for tracking revealed addresses throughout a multicore cache hierarchy, and for optimizing secure speculation schemes (Section 4 and Section 5).
In this work, we apply ReCon on STT and NDA and evaluate the resulting schemes with benchmarks from the SPEC2017 and SPEC2006 benchmark suites (Section 6). Our results show that ReCon reduces the overhead over the unsafe baseline from 13.2% to 9.4% for NDA and from 8.9% to 4.9% for STT on SPEC2017, and from 10.4% to 7.2% for NDA and from 8.1% to 5% for STT on SPEC2006. We also evaluate parallel benchmarks from the PARSEC benchmark suite, and we observe a 46.7% and 78.6% reduction in the overhead incurred in total execution time, for NDA and STT, respectively.
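The overhead figures above relate to the relative reductions quoted in the abstract through a simple formula: reduction = (overhead before - overhead after) / overhead before. The short script below applies it to the rounded overheads from this paragraph; the function and dictionary names are illustrative only. Because the inputs are rounded, the results land within about one percentage point of the abstract's numbers rather than matching them exactly.

```python
def relative_reduction(before, after):
    """Share of the original overhead that is removed, in percent."""
    return (before - after) / before * 100.0


# (scheme, suite) -> (overhead without ReCon, overhead with ReCon), in percent,
# taken from the rounded values quoted in the text.
overheads = {
    ("NDA", "SPEC2017"): (13.2, 9.4),
    ("STT", "SPEC2017"): (8.9, 4.9),
    ("NDA", "SPEC2006"): (10.4, 7.2),
    ("STT", "SPEC2006"): (8.1, 5.0),
}

for (scheme, suite), (before, after) in overheads.items():
    reduction = relative_reduction(before, after)
    print(f"{scheme} on {suite}: {reduction:.1f}% of the overhead removed")
```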

BACKGROUND
In this section, we describe the previous work that ReCon leverages to enable safe execution of loads that access revealed addresses.

Non-speculative Data Access (NDA)
Non-speculative Data Access (NDA) [52] defends by blocking the propagation of secrets at the earliest possible stage. While NDA proposes several variations of its central defense mechanism, we focus on the strategy labeled permissive propagation.
In permissive propagation, potential secrets can be acquired by speculative load instructions, but the secrets are not allowed to propagate to dependent instructions, i.e., broadcast is delayed until the originating instruction is non-speculative. Hence, there is no possible way to expose potential secrets. NDA does not require any extra modification to handle either explicit or implicit channels, as the potential secrets are never released into the rest of the processor core in any way. This is in contrast to STT [56] and SPT [14], which have to use taint tracking to keep track of the propagation of potential secrets to dependent instructions, and need to handle both explicit and implicit channels when inputs are tainted.
While NDA limits instruction-level parallelism by not propagating the results of load instructions, it achieves the same amount of memory-level parallelism as STT [56] (dependent loads are blocked in both) while proposing a far simpler scheme. However, blocking all non-transmitting instructions that depend on speculative load values incurs larger performance penalties.

Speculative Taint Tracking (STT)
Speculative Taint Tracking (STT) [56] is a state-of-the-art safe speculative execution scheme due to its performance and versatility. STT relies on two fundamental principles. First, all instructions that do not have a dependency originating from a load instruction are allowed to execute as normal, including loads. Second, it delays speculative transmitter instructions whose execution depends on a speculatively loaded value until the value is confirmed to be non-speculative and therefore not a secret.
STT uses a taint tracking mechanism, similar to dynamic information flow tracking (DIFT) [45], to prevent the execution of transmitting instructions that depend on speculatively loaded values. STT taints the output register of a speculative load instruction, or of any instruction dependent on tainted data. STT automatically untaints the destination register and all tainted registers that originate from this destination register as soon as the corresponding load instruction becomes safe, i.e., when it becomes non-speculative.
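The taint and untaint rules just described can be sketched in a few lines. The model below is a simplification and not STT's actual hardware mechanism: it tracks, per register, the set of speculative loads the value derives from, and all names (`TaintTracker`, `load_becomes_safe`, and so on) are hypothetical.

```python
class TaintTracker:
    """Register-level taint model: each register maps to the speculative loads it derives from."""

    def __init__(self):
        self.taint = {}  # register name -> set of speculative load ids

    def load(self, dst, load_id, speculative):
        # The output of a speculative load is tainted by that load itself.
        self.taint[dst] = {load_id} if speculative else set()

    def alu(self, dst, *srcs):
        # Any instruction depending on tainted data produces tainted data.
        roots = set()
        for src in srcs:
            roots |= self.taint.get(src, set())
        self.taint[dst] = roots

    def load_becomes_safe(self, load_id):
        # Once the load is non-speculative, everything derived only from it is untainted.
        for roots in self.taint.values():
            roots.discard(load_id)

    def may_transmit(self, *srcs):
        # A transmitter (e.g., a dependent load) may execute only with untainted operands.
        return all(not self.taint.get(src) for src in srcs)


tracker = TaintTracker()
tracker.load("r1", "ld1", speculative=True)
tracker.alu("r2", "r1", "r0")            # r2 inherits the taint of r1
blocked = not tracker.may_transmit("r2")
tracker.load_becomes_safe("ld1")         # ld1 reaches the point where it cannot be squashed
released = tracker.may_transmit("r2")
```

Keeping a set of originating loads per register makes the automatic untaint a single pass over the table in this sketch; a hardware design would use a more compact encoding.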
STT also provides an extensive analysis of explicit and implicit channels, which can be used to leak secrets directly and indirectly, respectively, such as in some Spectre variants [24] (explicitly) and SmotherSpectre [10] (implicitly). To prevent the use of explicit channels, STT delays the execution of any transmit instruction whose operands are tainted. This means that any such instruction with tainted input will not be executed until its inputs are no longer tainted. To prevent the use of implicit channels, STT ensures that the program control flow is not influenced by tainted data. This means that branch predictions can still occur as normal, but the resolution of branch predictions, whether correctly or wrongly predicted, is delayed until the branch inputs are untainted.

Speculative Privacy Tracking (SPT)
Speculative Privacy Tracking (SPT) [14] is a defense mechanism that offers even more comprehensive protection against speculative side-channels, compared to NDA and STT. SPT shares similarities with STT, as they are both schemes based on DIFT [45]. However, SPT protects against all leakage under speculation, including register values that were set pre-speculation and have not leaked their values (non-speculative secrets). SPT refers to the taint tracking mechanism of STT as s-taint, to differentiate it from its own proposed taint tracking mechanism. While s-taint focuses on speculative tracking through registers, SPT proposes a global, continuous taint tracking mechanism that also propagates taint/untaint information through the memory hierarchy. It also introduces novel ideas to efficiently taint and untaint instructions. The taint tracking mechanism proposed by SPT covers both non-speculative and speculative execution and is able to protect secrets that reside in registers before speculation. Also, the taint tracking mechanism enables SPT to leverage its key insight: "any data that can leak through the program's non-speculative execution should not be treated as secret during the program's speculative execution". While SPT can maintain taints in both non-speculative and speculative execution, it can dynamically untaint and thus execute instructions normally under speculative execution. Whenever SPT identifies leaked data, its untainting mechanism untaints both older and younger dependent tainted instructions. SPT protects against the leakage of all non-speculative secrets, i.e., addresses that have never leaked their values non-speculatively.

THREAT MODEL
In the following section, we outline ReCon's threat model, specifically how it integrates with the STT and NDA threat models, and the modifications ReCon makes to the visibility of potential secrets.

STT and NDA Integration
ReCon is a performance optimization that is applied on top of secure speculation mechanisms, e.g., NDA or STT. The underlying secure mechanism maintains its entire threat model except for values that have previously been made public and are, hence, not guaranteed to be protected under speculation by the secure mechanism. More specifically: In both NDA and STT, secrets are defined as values that the core should not be able to access, and is only able to access as a result of erroneous speculation, i.e., potential secrets are speculatively accessed values that might be squashed. ReCon, similarly to SPT [14], relaxes this definition by excluding values that have already been made public through non-speculative execution.
Regarding the threat models of STT and NDA with ReCon, we note the following:
Register Protection: For both STT and NDA, values that reside in registers before the point of speculation are not considered to be potential secrets, as these are accessible without speculation, and are therefore part of normal execution. ReCon does not affect this property, i.e., does not add protection for registers.
Explicit and Implicit Channel Protection: STT delays potential transmitters to defend against explicit channels. For implicit channel mitigation, STT ensures that the control flow is not influenced by tainted data. NDA delays the propagation of the value of a speculative load instruction, and thus it does not need special treatment for explicit and implicit channels, as they cannot be formed. ReCon does not affect either of them (for secrets that have not been made public).
Speculative Model: STT introduces two speculative models: Spectre, which considers only speculation cast by control flow instructions, and Futuristic, which considers an instruction speculative until it is guaranteed not to be squashed. ReCon can operate under any speculative model, and it is not affected by the model choice. For evaluation, we use only speculation cast by control flow and store instructions; we elaborate on this in Section 6.

Conditional Security Guarantee
As described, ReCon removes protection from values that are already public information. The lack of protection for public information enables some new forms of execution patterns that were not previously possible in the underlying threat model. For example, load-load pairs that have previously been observed non-speculatively would be allowed to execute speculatively, as the information is considered public under the threat model. Programs that utilize secret-dependent non-speculative behavior are considered to have made their secrets public, even if such accesses were attempted to be obfuscated. Take the following example. The secret value, selector, is made public at line 5, in which observable program behavior is dependent on the selector value. This means that the threat model does not consider future leakage of this value as insecure, since it has been made public. Thus, it would be possible to create speculative replay attacks to expose the value of selector, leveraging a speculative gadget (similar to lines 4 and 5 in the example above) elsewhere in the code. The example attempts to obfuscate which key is selected through accessing all possible keys, but such methods are not considered secure in general.
Speaking generally, this means that programs that avoid secret-dependent behavior, as recommended by modern secure programming techniques, in their non-speculative execution will not have their security premise changed under the ReCon threat model. Observe a secure version of the previous example. This version employs constant-time programming principles that ensure that the secret, selector, is never part of observable program execution. Such programming principles are endorsed by industry leaders such as Intel [20] for secure applications.
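To make the contrast concrete, the sketch below shows one secret-dependent access pattern and one masked, secret-independent pattern. It is an illustration with hypothetical names (`select_leaky`, `select_masked`, `keys`), not a reproduction of the paper's listings, and Python itself gives no timing guarantees: the point is only which memory addresses each variant touches.

```python
def select_leaky(keys, selector):
    # The accessed address depends on the secret selector. Executed non-speculatively,
    # this is secret-dependent behavior, so the selector counts as public under ReCon.
    return keys[selector]


def select_masked(keys, selector):
    # Every key is read in the same order regardless of the selector, and the result
    # is assembled with masks instead of a secret-dependent index or branch.
    result = 0
    for index, key in enumerate(keys):
        mask = -int(index == selector)  # all ones when index matches, zero otherwise
        result |= key & mask
    return result


keys = [0x11, 0x22, 0x33, 0x44]  # non-negative integers standing in for key material
chosen = select_masked(keys, 2)
```

In the masked variant the sequence of loads is identical for every selector value, so no load address is derived from the secret and there is nothing for a load pair to reveal.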
Sandboxing: ReCon does not inherently support isolation between multiple sandboxes in the same process (address space). For example, consider two sandboxes operating in the same process. As public information can be accessed speculatively with ReCon, one sandbox can access and speculatively "observe" public information of the other sandbox. This "observation" would normally be prevented by NDA or STT. Today, this is of less practical concern, as sandboxes are separated in their own process, e.g., in web browsers (Site Isolation) [34] or in the cloud [42].

RECON
In the following section, we describe the core premise of ReCon. We describe i) how we use non-speculative leakage to improve the performance of secure speculation schemes, ii) how we can detect and capture this non-speculative leakage (reveal), and iii) how we can ensure that new secrets are not accidentally leaked (conceal).

Overview
ReCon uses knowledge of prior non-speculative execution to alter the execution of specific load instructions under speculation. The goal is to improve performance by lifting speculative side-channel defenses for loaded values that are safe.
ReCon identifies pairs of dependent loads, in which the second address is entirely dependent on the output value of the first load (see ① in Figure 1), and marks this value as revealed when the second load commits ②. For now, assume that a bit associated with each word in the cache hierarchy marks the word as revealed or concealed. We discuss the details of the storage and transmission of this information in Section 5.
This means that the value of the first load has leaked (non-speculatively) as an address via a side-channel and should not be considered a secret as defined by the SPT threat model [14]. Protecting this value under speculation is superfluous for security, but harmful to performance.
ReCon essentially marks the address accessed by the first load (LD1) as containing a revealed value, and leverages this knowledge to disable security mechanisms for any load to this address ③, as these security mechanisms are detrimental to performance. The value is only safe as long as it has not been changed, so ReCon marks the address of stores as concealed ⑤. ReCon tracks direct dependence load pairs (Section 4.3 and Section 5.1), preserving the address of the first load (LD1) until the non-speculative commit of the second load (LD2), and then marking the corresponding word in the cache as revealed. Whenever a load is performed to a revealed value, special handling appropriate to the underlying security mechanism ensures that the revealed value is treated as though it is non-speculative ④, e.g., it does not cause taints for STT, and can propagate immediately for NDA.

What Non-Speculative Leakage to Capture?
Non-speculative information leakage occurs as a result of changes to the microarchitectural state that are visible to an attacker, such as timing differences in execution. When disregarding approaches that focus on energy usage or contention, microarchitectural changes that leak information can only occur in two possible ways: (1) through a data dependency between a load and a following transmitter instruction, known as an explicit channel; (2) through a control dependency where the different possible paths result in different microarchitectural states, known as an implicit channel.
ReCon efficiently captures non-speculative leakage to improve the performance of an underlying secure speculation scheme. For such a case, leakage from explicit channels is of the greatest interest, because most of the performance loss in secure speculation schemes comes from the reduction in memory-level parallelism, as a consequence of preventing explicit channels [56]. Leakage from control dependencies is harder to detect and causes less performance degradation under secure speculation schemes [56].
Within explicit channel leakage, we focus solely on leakage caused by dependent loads (load pairs), which is common through pointer dereferencing. As we analyze in Section 6.2, this case constitutes the majority of leakage caused by load instructions. Although this leaves a subset of leakage undetected, this is not a correctness issue for security; it only affects the attainable benefit we can achieve. Comprehensive protection is provided by the underlying secure speculation scheme that we are optimizing.

Direct Dependence Loads
Non-speculative leakage due to data dependencies can occur, e.g., when a load passes its value directly to another load, or when a load gets its value manipulated by a sequence of instructions that passes it to a subsequent load. Consider an example with five instructions, PC1 through PC5. Values are loaded (PC1 & PC2) from two addresses (0x13 and 0x7), manipulated (PC3), and passed as an address argument to a third load (PC4). PC4 is a transmitter instruction that leaks the addresses 0x13 and 0x7 through indirect dependencies. The value from address 0x13 (loaded by PC1) is also passed as an address argument to a fourth load (PC5). PC5 is also a transmitter instruction that leaks address 0x13 through a direct dependence (meaning no other instruction intervenes between the two loads). In this example, the value stored in 0x13 is leaked both indirectly (through a dependence tree that involves other instructions and, importantly, may involve other loads) and directly (through a load-load pair without any dependent intervening instructions). ReCon limits its detection to only the leakage of load pairs with a direct dependence.
ReCon associates a leaked value with the address it is stored at (e.g., 0x13 in the example). Establishing this association for an arbitrarily large indirect-dependence tree requires dynamic information flow tracking [45] that extends arbitrarily far into the past and may involve multiple loads and addresses (e.g., PC1 and PC2, 0x13 and 0x7, in the example above). In contrast, with exactly two loads (PC1 and PC5) it is straightforward to unambiguously associate the leakage to the address (0x13) in a simple and effective way.
Load addresses that are derived from a load with an offset, i.e., an immediate, still create a valid load pair, as the introduction of an offset does not affect ReCon's overall security guarantee. Such an offset is by definition always present, and calculating the value of the secret is trivial. Consider a load pair in which the first load carries an offset of 0x10: instead of revealing address 0x13, the pair reveals address 0x13+0x10. If, instead, the second load has an offset, the load pair now reveals address 0x13, as the offset is known.
ReCon is secure regardless of the proportion of total non-speculative information leakage it captures, as security is guaranteed by the underlying secure speculative execution scheme, and ReCon is only using already leaked information to reduce its performance overhead. As such, ReCon has a trade-off between the design cost of capturing information leakage and the performance gain from using this leakage. ReCon uses this trade-off to capture most leakage at a low cost by focusing on loads whose address is directly dependent on another load without intermediaries, which comprises a majority of the total leakage of a program (Section 6.2). Detecting the presence of such direct dependence load pairs is achieved in a simple manner by checking to see if there is a dependence between the destination register of a preceding first load and the source register of a following second load.
We describe such an implementation in detail in Section 5.1. Then, once the second load is committed, the leakage of the value is known to be non-speculative, and the address of the value is marked as revealed in the cache, which enables following load instructions to that address to lift security mechanisms.
Insofar as CISC instructions are internally decoded into RISC micro-operations, instructions such as the x86 arithmetic instructions with a memory-fetched input operand are broken into two (or more) micro-operations where one of them is a load. This load can participate in the formation of a load pair.

Concealing New Secrets
Once an address is revealed, it is guaranteed to be safe as long as its contents do not change. A store to an address that has been revealed breaks this condition, making the address unsafe again. In this section, we describe how ReCon ensures that revealed addresses become concealed when they are changed. Importantly, conceal operations work at any granularity (e.g., byte, sub-word, word, ...), while revealing and using revealed data works only for aligned addresses and at word granularity. This means that if any part of a revealed word changes, the whole word becomes concealed.
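The asymmetry between reveal and conceal granularity can be sketched as follows. The sketch assumes 8-byte words (matching the 8-byte pointers used throughout) and assumes that a store straddling two words conceals both; the class and function names are hypothetical.

```python
WORD_BYTES = 8  # assumption: one reveal bit per aligned 8-byte word


def word_of(addr):
    """Address of the aligned word that contains addr."""
    return addr & ~(WORD_BYTES - 1)


class RevealBits:
    def __init__(self):
        self.revealed_words = set()

    def reveal(self, addr):
        # Reveal works only for aligned, word-sized accesses.
        if addr % WORD_BYTES != 0:
            return False
        self.revealed_words.add(addr)
        return True

    def conceal(self, addr, size):
        # A store of any size conceals every word it touches (assumed to include
        # both words when the store crosses a word boundary).
        first = word_of(addr)
        last = word_of(addr + size - 1)
        for word in range(first, last + WORD_BYTES, WORD_BYTES):
            self.revealed_words.discard(word)

    def is_revealed(self, addr):
        return addr % WORD_BYTES == 0 and addr in self.revealed_words


bits = RevealBits()
bits.reveal(0x10)
bits.conceal(0x13, 1)          # a single-byte store inside the word at 0x10
still_revealed = bits.is_revealed(0x10)
```

Erring on the side of concealing more than strictly necessary costs only performance, which is why the coarse word-level conceal is an acceptable simplification.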

Performed Stores.
When a committed store instruction writes to its target address in the cache hierarchy (i.e., when the store is performed), the new contents at this address have not been observed non-speculatively (i.e., revealed through a committed load instruction) (⑤ in Figure 1). For this reason, a store marks the address as concealed (no longer revealed), because the address now contains a new secret. This prevents future speculative loads from passing the loaded value as an address to other instructions ⑥. Upon commit of a dependent load that uses the contents of the concealed address, the address is marked anew as revealed ⑦.

In-Flight Stores.
We consider committed stores that reside in the store buffer (SB) as not yet performed; stores in the store queue (SQ) are in-flight stores that have not committed. Younger loads receive their value forwarded from stores in the SB or in the SQ, rather than from the L1 cache.
In ReCon, a store conceals its output in the SQ/SB (⑧ in Figure 1). Thus, a load always receives concealed data from store forwarding and defenses cannot be lifted ⑨. There may be a period where the same data are known as concealed inside the core and revealed outside. This is in line with memory models that relax the W → R (store-to-load) order and are read-own-write-early multi-copy atomic (rMCA) [49], e.g., x86-TSO [43] or weaker memory models. The memory location is concealed outside the core when the store exits the SB ⑩.
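The window in which data is concealed inside the core but still revealed outside can be seen in a small model of store forwarding. This is a sketch only: the SQ and SB are merged into a single queue, and the class and method names are hypothetical.

```python
from collections import deque


class CoreWithStoreBuffer:
    def __init__(self, memory, revealed):
        self.memory = dict(memory)       # contents outside the core
        self.revealed = set(revealed)    # reveal bits outside the core
        self.store_buffer = deque()      # (address, value) pairs, oldest first

    def store(self, addr, value):
        # The store waits in the buffer; nothing outside the core changes yet.
        self.store_buffer.append((addr, value))

    def load(self, addr):
        """Return (value, revealed). Forwarded data is always treated as concealed."""
        for buffered_addr, value in reversed(self.store_buffer):  # youngest match wins
            if buffered_addr == addr:
                return value, False
        return self.memory[addr], addr in self.revealed

    def drain_one(self):
        # The oldest store is performed: memory is updated and the location is concealed.
        addr, value = self.store_buffer.popleft()
        self.memory[addr] = value
        self.revealed.discard(addr)


core = CoreWithStoreBuffer({0x13: 0x40}, revealed={0x13})
core.store(0x13, 0x50)
forwarded = core.load(0x13)              # concealed inside the core
outside_view = 0x13 in core.revealed     # still revealed outside until the store drains
core.drain_one()
```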

Store-to-Load Forwarding
Store-to-load forwarding forms an implicit branch that can potentially leak speculatively accessed secrets. The following section describes how we ensure that ReCon does not leak potential secrets through such implicit branches.

Without Memory Dependence Speculation.
An unresolved store forms a resolution-based implicit channel that is handled by delaying loads until the store address is non-speculatively resolved [56]. In this case, ReCon has no effect.

With Memory Dependence Speculation.
For a pair of dependent loads (PC3 and PC4), two implicit channels are formed in the presence of an older unresolved store (PC2), as shown in Figure 2. With memory dependence speculation [15,31], two memory dependence predictions take place. The implicit channels also become prediction-based channels that are mitigated by updating the predictor with non-speculative values [56].
As outlined in Table 1, there are four cases to consider: Either of the two loads can be predicted to depend on the previous store (STF) or on memory (MEM). The key takeaway is that what is observable under STT and what is observable under ReCon differs only in the first case (assuming that address [r4] is revealed; otherwise there is no difference). This is expected, because if ld [r4] (PC3) is observed and [r4] is revealed, ReCon allows ld [r5] (PC4) to also be observed. This does not leak anything more than what has leaked before. In case 2, the store forwarding passes a concealed value to ld [r5] (PC4), preventing it from being observed both in STT and ReCon. Similarly, for cases 3 and 4, the store passes a concealed value to ld [r4] (PC3), effectively reverting ReCon to STT. Thus, in both STT and ReCon, the observable loads are independent of the speculatively loaded secret (PC1), and the only information that leaks is the memory dependence predictions, which are independent of the secret. A similar argument holds for NDA permissive propagation [52].

IMPLEMENTATION
In this section, we describe how ReCon can be implemented. We first describe how load pairs are detected, then how revealed addresses are tracked in the cache hierarchy, and finally how revealed addresses are exploited to lift security defenses for NDA and STT.

Detecting Non-Speculative Load Pairs
An address that is used by a direct dependence load pair is only safe once the second load (LD 2 in Figure 1) becomes non-speculative.From a strict security perspective, the earliest that LD 2 becomes non-speculative is when the load has reached its visibility point [56], i.e., it is bound to commit.Although a load can reach its visibility point anywhere in the pipeline, for simplicity, we opt for implementing load-pair detection in the commit stage.This has no impact on security, as it only delays the revealing of the address and the earliest point when security defenses can be lifted.ReCon detects load pairs by including a table with, at most, as many entries as physical integer registers (smaller tables are possible and are evaluated in Section 6.6).Each table entry consists of an active (A) bit and an address field.We call this table the loadpair table (LPT), and it is accessed using the indices of the source and destination physical registers of a committing instruction.Detecting load pairs at commit, using physical registers, relieves us from the burden of establishing the correct dependence when multiple dynamic instances of the same load pair exist in the pipeline.
When a load commits, the LPT is accessed using both the destination register and the source register. For the destination register, if the load address has not already been revealed, then the active bit is set and the load's address is written to the corresponding entry (see ① in Figure 3). At the same time, the load checks the active bit of its source address register (②). If the active bit is set, then a load pair has been detected and the load is the second load of the pair (LD2). The committing load then marks the address that is stored in the LPT entry (i.e., the address of the first load) as revealed. Revealed addresses are tracked in the cache hierarchy, as described in the next section. If the active bit is not set, then no further operations are performed. When any instruction other than a load commits, the active bit of its destination register(s) is cleared.
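The commit-time detection described above can be sketched in a few lines of Python. This is an illustrative model under stated assumptions, not the hardware design: the names `LoadPairTable`, `commit_load`, and `commit_other` are hypothetical, the `revealed` set stands in for the reveal bits kept in the cache hierarchy, and clearing the destination entry when the address is already revealed is an assumption the text does not spell out.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass
class LPTEntry:
    active: bool = False  # the "A" bit
    address: int = 0      # address accessed by the load that wrote this register


class LoadPairTable:
    """Toy model of the LPT, indexed by physical register number."""

    def __init__(self, num_phys_regs: int) -> None:
        self.entries: List[LPTEntry] = [LPTEntry() for _ in range(num_phys_regs)]
        self.revealed: Set[int] = set()  # stand-in for per-word reveal bits

    def commit_load(self, src_reg: int, dst_reg: int, address: int) -> None:
        # Source lookup: an active entry means this load is LD2 of a pair,
        # so the address accessed by LD1 is marked as revealed.
        src = self.entries[src_reg]
        if src.active:
            self.revealed.add(src.address)
        # Destination update: record the address only if it is not yet revealed.
        dst = self.entries[dst_reg]
        if address not in self.revealed:
            dst.active, dst.address = True, address
        else:
            dst.active = False  # assumption: nothing left to detect for this register

    def commit_other(self, dst_regs: Iterable[int]) -> None:
        # Any committing non-load instruction clears the active bit of its destination(s).
        for reg in dst_regs:
            self.entries[reg].active = False


lpt = LoadPairTable(num_phys_regs=8)
lpt.commit_load(src_reg=1, dst_reg=2, address=0x1000)  # LD1: p2 <- [p1]
lpt.commit_load(src_reg=2, dst_reg=3, address=0x2000)  # LD2: p3 <- [p2], reveals 0x1000
```

The source entry is read before the destination entry is written, mirroring the "at the same time" access in the text.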

Multi-Source Load Instructions
Up to this point, we have described how ReCon tracks load pairs where the second load has a single direct dependence on an older load, i.e., the second load has a single source register. Let us now consider more complex instructions that might have multiple source registers, as commonly found in the x86 instruction set. Consider two mov instructions in which the second, mov (%rdx,%rax,8),%rax, reads the register %rax written by the first. At first sight, one might conclude that the two instructions form a load pair, as the second mov instruction has a direct dependence on the first mov through register %rax. However, the actual behavior depends on the underlying microarchitecture and its micro-operations. Such complex instructions are commonly decoded into multiple simpler micro-operations. For the given example, the second mov could be decoded into three micro-operations, as follows:

    # mov (%rdx,%rax,8),%rax
    mul  %rax,8,%r1      # src1, src2, dst
    add  %rdx,%r1,%r2    # src1, src2, dst
    load (%r2),%r3       # src, dst

In this case, ReCon would not detect a load pair since there is no direct dependence between two load operations. If the underlying microarchitecture instead supports multi-source operations, such that the second mov is decoded into a single micro-operation, then a load pair would be detected. In the general case, the second load micro-operation can have as many direct load dependencies as it has source operands, and a load pair can be detected for each operand. This requires a lookup in the load-pair table for each source operand to detect whether the source register is written by a load operation. Each load pair can then be revealed by sending a reveal message to the L1. This is not on the critical path, and such messages can be dropped, since not revealing an address is only a potential performance loss and is always secure. For this paper, we focus on evaluating the case where there exists only a single direct load dependence and leave multi-source operations for future work.

Tracking Revealed/Concealed Addresses
ReCon tracks revealed addresses by associating a bit vector with each cache line to mark the data words that have been revealed by committed loads. The vector has as many bits as the total number of words that fit in a cache line (e.g., eight bits for eight 8-byte words in a 64-byte cache line; see Section 6.7 for details).
A newly fetched cache line from memory has all its words marked as concealed. When the second load (LD2) of a load pair commits, the address accessed by the first load (LD1) is marked as revealed. LD2 sends a reveal request to the L1 cache to reveal the address by setting the bit of the corresponding word in the cache line. Similarly, upon the commit of a store instruction, the bit of the corresponding word is cleared, indicating that a new value exists in the data word that has not yet been revealed.
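As a rough illustration of the per-line bit vector, the following Python sketch models one 64-byte line with eight 8-byte words. The class and method names (`RevealVector`, `reveal`, `conceal`) are hypothetical and chosen only for this example.

```python
LINE_BYTES = 64  # cache line size used in the paper's example
WORD_BYTES = 8   # granularity of one reveal bit


class RevealVector:
    """Reveal/conceal bits for the words of a single cache line."""

    def __init__(self) -> None:
        self.bits = 0  # a newly fetched line has every word concealed

    @staticmethod
    def word_index(address: int) -> int:
        # Which 8-byte word of the line the address falls into.
        return (address % LINE_BYTES) // WORD_BYTES

    def reveal(self, address: int) -> None:
        # Issued when LD2 of a load pair commits (address accessed by LD1).
        self.bits |= 1 << self.word_index(address)

    def conceal(self, address: int) -> None:
        # Issued when a store to this word commits: the new value has not leaked.
        self.bits &= ~(1 << self.word_index(address)) & 0xFF

    def is_revealed(self, address: int) -> bool:
        return bool(self.bits >> self.word_index(address) & 1)


line = RevealVector()
line.reveal(0x1008)   # word 1 of the line becomes revealed
line.conceal(0x1008)  # a committed store conceals it again
```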

ReCon Coherence
A similar bit-vector approach for maintaining reveal/conceal information per address is followed in SPT to keep taints/untaints, but instead of keeping the bit-vector with every cache line in the L1, a separate structure mirroring the L1 holds the bit-vectors [14]. While the SPT approach has the advantage of i) bounding the absolute storage cost, and ii) not changing the L1, it has two important disadvantages when it comes to using non-speculative leakage information for optimization. First, leakage information is private to each core and cannot be shared. Today, several important workloads are multithreaded (e.g., browsers) to be able to extract performance out of multicores. In ReCon, we want to take advantage of the information gained in one core to optimize the execution of another core, as the security model applies to a whole process (subject to the restrictions discussed in Section 3). Second, naïvely propagating this information to a shared LLC, e.g., via evictions, is not coherent and may result in a loss of information.
A contribution of ReCon is to solve this problem by treating the non-speculative leakage information as meta-information that is carried and maintained by the coherence protocol. For this work, we assume a standard directory MESI protocol. A coherent version of the ReCon bit-vector is kept with each directory entry. The bit-vector of a cache line is transferred with the standard coherence transactions of the protocol between the directory and the private caches and between the private caches themselves.
Consider a single cache line, shared by threads of the same process (same address space) running on two different cores with private L1 caches. Each L1 receives a copy of the directory bit-vector, initially all set to concealed. Each thread can independently reveal words in this cache line in its L1 bit-vector, without knowledge of the other thread's bit-vector state. At this point, the revealed information can only be used locally by each core. However, upon eviction, the directory needs to be notified that the particular L1 is no longer a sharer (for this cache line). It is at this point that the ReCon bit-vector is transferred back to the directory and is logically OR-ed with the bit-vector that exists there. OR-ing the L1 bit-vector with the directory bit-vector guarantees that information is preserved across consecutive evictions from different L1s. Any core that reads the cache line from the directory now learns of all the revealed addresses accumulated in the directory bit-vector.
Consider, now, what happens when a core conceals an address. Recall that to conceal an address, its contents must change, i.e., the address must be written by a store. For this, the L1 needs to have write permission to its cache line. If the cache line is not already in state M (Modified), the core must ask the directory to grant write permission and invalidate any other sharers. In ReCon, the writer assumes control of the directory bit-vector and owns the only coherent copy of the ReCon bit-vector until it either: i) writes it back to the directory (overwriting the directory bit-vector with its own copy); ii) writes it back to the directory and passes it on to a new reader on a downgrade; or iii) passes it on to the next writer on an invalidation (from the new writer). Until the writer gives up its write permission: i) it can reveal as many addresses in the cache line as it wants (which no other core can do at the same time); and ii) it is the only core that can supply a valid bit-vector to (a request from) a new reader or a new writer.
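The two directory-side rules above (OR on a sharer's eviction, overwrite on a writer's writeback) can be illustrated with a small Python sketch. It models only the bit-vector of one cache line and ignores MESI states, messages, and races; `DirectoryEntry` and its method names are hypothetical.

```python
class DirectoryEntry:
    """Directory-side ReCon bit-vector for one cache line (eight words)."""

    def __init__(self) -> None:
        self.reveal_bits = 0  # all words start concealed

    def fill(self) -> int:
        # A requesting L1 receives a copy of the directory bit-vector.
        return self.reveal_bits

    def sharer_eviction(self, l1_bits: int) -> None:
        # Readers may only have added reveals, so merging with OR loses nothing.
        self.reveal_bits |= l1_bits

    def writer_writeback(self, l1_bits: int) -> None:
        # The writer holds the only coherent copy (it may have concealed words),
        # so its copy overwrites the directory's.
        self.reveal_bits = l1_bits


d = DirectoryEntry()
core0, core1 = d.fill(), d.fill()  # two readers share the line
core0 |= 0b0001                    # core 0 reveals word 0 locally
core1 |= 0b0100                    # core 1 reveals word 2 locally
d.sharer_eviction(core0)
d.sharer_eviction(core1)           # directory now holds 0b0101

writer = d.fill()                  # a writer obtains the line and the vector
writer &= ~0b0001 & 0xFF           # a committed store conceals word 0
writer |= 0b1000                   # the writer reveals word 3
d.writer_writeback(writer)         # directory now holds 0b1100
```

Overwriting rather than OR-ing on a writeback matters here: an OR would silently re-reveal word 0, whose new value has never leaked.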

Using Revealed Addresses
Load instructions that perform a cache access eventually return the corresponding ReCon bit in the cache hierarchy for that word. If the value has already been revealed, then the core can immediately disable any applied speculative restrictions associated with the loaded value. For NDA [52], the loaded value of revealed addresses is immediately propagated to dependent instructions. Similarly, for STT [56], any load that receives a revealed value untaints its destination register, which enables the value to be used by any transmitting instruction. Both techniques benefit from increased instruction-level and memory-level parallelism.
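A minimal sketch of this decision, assuming a simplified view in which a loaded value is either protected (tainted under STT, not propagated under NDA) or not. The function names are hypothetical, and the real schemes track considerably more state than a single flag.

```python
def output_is_protected(load_is_speculative: bool, word_revealed: bool) -> bool:
    """Whether the loaded value must stay protected by the underlying scheme.

    Without ReCon, every speculative load produces a protected value.
    With ReCon, a value whose word is already revealed is released at once.
    """
    return load_is_speculative and not word_revealed


def dependent_load_may_issue(address_operand_protected: bool) -> bool:
    # A dependent load is a transmitter: it may execute only if the value
    # feeding its address is not protected.
    return not address_operand_protected


# A speculative load of a revealed word frees its dependent load to execute.
protected = output_is_protected(load_is_speculative=True, word_revealed=True)
may_issue = dependent_load_may_issue(protected)
```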

EVALUATION
In this section, we describe our methodology, we characterize the non-speculative leakage, and we present ReCon's overall results.

Methodology
We implement the evaluated security schemes on the latest version of the gem5 [12] simulator (version 22) using Ruby and SLICC to model the memory system with a three-level MESI coherence protocol (with an in-cache directory), on an infrastructure that shares implementations for NDA, STT, and ReCon. We use GARNET [1] to model the interconnect.
Speculation state is tracked through shadows [38,39]. We evaluate only speculation that is triggered by control and store instructions [52], similarly to other speculative side-channel threat models [14,53,55,56], which only track control instructions, as this is the type of speculation leveraged by Spectre attacks [10,22,24,41]. We do not consider speculation triggered by load instructions or by instructions that can raise an exception, for the following reasons: no attack has been discovered under speculation caused by load instructions, and recovering the performance lost from this type of speculation is a solved problem [36,48,57]; speculation caused by instructions that can raise an exception belongs to a different class of attacks [29]. Thus, we are closer to the Spectre threat model that tracks only control shadows [56] than to the Futuristic threat model that considers all instructions speculative until they reach their visibility point [56].
We use the SPEC2017 speed [17] and SPEC2006 [16] CPU benchmark suites as representatives of single-threaded applications, and the PARSEC [11] benchmark suite to evaluate parallel benchmarks. For the SPEC2017 benchmarks, we run detailed full system (FS) simulations using the out-of-order (OoO) processor model, extracting simulation phases with the use of simpoints [44]. We collect simpoints for the first 100 billion instructions (under FS simulation). We take up to five simpoints per benchmark for an interval of 100 million instructions. We start gathering statistics after running 50 million instructions of a detailed warm-up, so that our mechanism that is implemented in OoO is also included in the warm-up. For the SPEC2006 benchmarks, we run in system emulation (SE) mode using the out-of-order (OoO) processor model. We warm up the processor for 3 billion instructions, and we gather statistics for the next 1 billion instructions. For the parallel benchmarks, we fast-forward into the region of interest (ROI) in system emulation (SE) and run for 100 million instructions. We present instructions per cycle (IPC) as a performance metric for SPEC2017 and SPEC2006, and ROI execution time for PARSEC [6].
We also use Clueless [13], an open-source tool that measures the exposure of a program's memory to cache side-channels by applying a global DIFT mechanism, to better understand the general behavior of non-speculative leakage. Clueless is a trace-based tool that does not model speculative execution, and thus models the non-speculative behavior of the program. Clueless tracks dynamic instruction dependencies (through registers and memory) and detects data values that are turned into addresses. These are considered leaked values and their address in memory is tagged as a leakage point. Newly written values revert the address back to a non-leaked status. Thus, Clueless dynamically records the portion of memory that has leaked at any specific moment.
Because non-speculative leakage due to direct-dependence load pairs (Section 4.3) is a subset of the leakage captured by DIFT, we modify Clueless to also provide statistics specifically for direct-dependence load-to-load dereferencing. We also modify it from being pin-based to trace-based, and we use SPEC2017 and SPEC2006 traces provided by the ChampSim [19] simulator for general studies.

Leakage Breakdown
To understand the direct-dependence load-to-load leakage, we report results from Clueless. In Figure 4, we show the average percentage of memory addresses that are identified as leakage points. We show the results both for all captured leakage (DIFT) and for direct dependencies (load-load pairs). We observe that across the SPEC2017 and SPEC2006 benchmark suites, on average, 53% of the address space leaks its content (when we capture leakage by DIFT), while direct load-to-load dereferencing is responsible for 32% of the address space that has leaked its content. In other words, direct dependencies cover 60% of the total leakage. In fact, we find that in some cases there is negligible additional leakage occurring if we measure it with DIFT, and the program's leakage is solely due to direct-dependence load-to-load dereferencing (e.g., gcc, imagick, mcf, and xalancbmk from SPEC2017).

Performance Results
Figure 5 and Figure 6 show the performance results of NDA/STT and ReCon for single thread performance as instructions per cycle (IPC) normalized to the unsafe baseline processor.
The stricter NDA introduces a 13.2% performance degradation on average, while STT introduces a degradation of 8.9% compared to the unsafe baseline across the SPEC2017 benchmarks. For the SPEC2006 benchmark suite, NDA introduces an overhead of 10.4%, while STT introduces an overhead of 8.1%, on average.
ReCon's purpose is to reduce the number of tainted (STT) or not-propagated (NDA) load instructions, and thus increase performance.
Figure 7 shows the reduction in tainted load instructions for ReCon normalized against the total number of tainted loads for STT. We see significantly fewer tainted loads with ReCon (43.8% fewer on average). This is a natural consequence of the mechanism, as ReCon untaints the destination register, and thus does not cause dependent loads to be tainted. ReCon's improvements for NDA are nearly identical, since both STT and NDA apply their defenses to the same loads (i.e., loads depending on speculatively loaded values), and ReCon applies its optimization to the same set of loads. We have therefore omitted the data for NDA to simplify the figure.
For SPEC2017, ReCon incurs a performance overhead of 9.4% with NDA and 4.9% with STT over the unsafe baseline processor, reducing the overhead by 28.7% for NDA and 45.1% for STT. For SPEC2006, we observe similar results, with an overhead of 7.2% and 5%, which translates to a reduction of 31.5% and 39% for NDA and STT, respectively.
Notice that some benchmarks have a very small absolute number of tainted loads (i.e., bwaves, imagick, and lbm from SPEC2017) and do not suffer a performance degradation with either NDA or STT (Figure 5 and Figure 6), leaving no room for ReCon to boost performance. It is also notable that some benchmarks have a similar absolute number of tainted loads (i.e., leela and nab), yet the former suffers a performance reduction of 6.8% while the latter only 2.7% (on STT). This observation shows that some tainted loads are more critical than others, and reducing the number of tainted loads does not guarantee analogous performance gains. For example, ReCon recovers a significant amount of performance (from 64.1% to 88.5%) by reducing the number of tainted loads by 61% for xalancbmk (SPEC2017) as compared to STT, while for perlbench (SPEC2017), ReCon improves the performance from 94.6% to 96.4% by reducing the number of tainted loads by a similar amount (59%). The former reduces the overhead by 67.9% while the latter by 34.5%, yet the reduction in tainted loads is similar (with the former having a slightly larger reduction). This can be seen in Figure 7.
For the PARSEC benchmark suite, NDA increases the total execution time by 9.7% and STT by 4.4%, as shown in Figure 8. ReCon reduces the execution time overhead by 46.7% and 78.6%, resulting in a slowdown of 5.2% and 1% over the unsafe baseline, respectively.
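The relative overhead reductions quoted above follow from the absolute overheads as (baseline overhead − ReCon overhead) / baseline overhead. The short Python check below recomputes them from the rounded averages given in the text; because those inputs are rounded, the results differ from the reported reductions by up to roughly one percentage point, and the reported figures were presumably derived from unrounded data. The `REPORTED` table and the function name are constructs of this example.

```python
def overhead_reduction(scheme_overhead: float, recon_overhead: float) -> float:
    """Relative reduction (in %) of the overhead over the unsafe baseline."""
    return 100.0 * (scheme_overhead - recon_overhead) / scheme_overhead


# (suite, scheme): (overhead without ReCon, overhead with ReCon, reported reduction), all in %
REPORTED = {
    ("SPEC2017", "NDA"): (13.2, 9.4, 28.7),
    ("SPEC2017", "STT"): (8.9, 4.9, 45.1),
    ("SPEC2006", "NDA"): (10.4, 7.2, 31.5),
    ("SPEC2006", "STT"): (8.1, 5.0, 39.0),
    ("PARSEC", "NDA"): (9.7, 5.2, 46.7),
    ("PARSEC", "STT"): (4.4, 1.0, 78.6),
}

for (suite, scheme), (before, after, reported) in REPORTED.items():
    computed = overhead_reduction(before, after)
    print(f"{suite:8s} {scheme}: computed {computed:5.1f}%  reported {reported:5.1f}%")
```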

Leakage/Performance Correlation
To understand how detected leakage from load pairs correlates with the performance gains of ReCon, we analyze benchmarks that experience at least a 5% performance degradation with STT for SPEC2017, namely cactuBSSN, deepsjeng, mcf, leela, omnetpp, perlbench, and xalancbmk. The 5% performance degradation limit reduces noise from benchmarks where STT and ReCon marginally affect performance.
Figure 9 illustrates the correlation between non-speculative leakage (as observed by Clueless) and performance (as observed by ReCon). The figure shows the ratio of leakage captured by direct-dependence load pairs to all leakage captured by Clueless' global DIFT mechanism. A perfect ratio means that all leakage is captured by load pairs and would be represented by a full column. Benchmarks are sorted from higher overhead reduction to lower overhead reduction (left to right). We see that ReCon successfully recovers performance when the leakage is highly dependent on load pairs. The lower the ratio of load pairs to total leakage, the lower the performance gain (e.g., cactuBSSN and deepsjeng). Moreover, the size of the performance gain depends on two things: i) the rate at which pointers are reused (i.e., previously seen pointer dereferencing should be repeated), and ii) the phase in which the pointers are reused: ReCon requires the program to experience speculative execution when reusing them (and thus the underlying secure speculation scheme is applied).

L1 and L2 bound ReCon
ReCon is a flexible optimization that can be applied to multiple cache levels. While the default design applies the mechanism to all levels (L1, L2, and LLC), we examine the behavior when it is applied to only the first cache level, and to the first and second cache levels, thus introducing a low implementation overhead.
Figure 10 shows the evaluation for STT with the SPEC2017 benchmark suite. We observe that some benchmarks, such as cactuBSSN and leela, recover the majority of the performance loss only by using the L1 cache, while others, such as gcc, mcf, omnetpp and xalancbmk, need to cover a larger working set size and thus leverage the L2 and LLC.
Overall, applying ReCon only to the L1 data cache reduces the overhead introduced by STT from 8.9% to 7.3%, and applying ReCon to the L1 and L2 reduces the overhead further to 6.3%.

Load-Pair Table Sensitivity Analysis
The load-pair table (LPT), as described in Section 5.1, uses as many entries as physical registers to store the address accessed by a load instruction. While the number of registers is architecture-specific, modern architectures commonly have around 200 integer registers. More specifically, Intel Skylake has 180 integer registers [21], and AMD Zen 3 and Zen 4 have 192 [7] and 224 [8] integer registers, respectively. For that many registers, the LPT would translate to a size slightly bigger than 1KiB (we elaborate more on the implementation overhead in Section 6.7). As direct-dependence load pairs are usually near each other (in program order and inside the pipeline), just a few entries are enough to capture the majority of load pairs.
Figure 11 shows the results of a sensitivity study where we successively reduce the LPT size by a factor of two. The table is still indexed by the destination and source registers, but now conflicts are possible, as different physical registers map to the same entry. To ensure correctness, we tag LPT entries with the physical register index. The results show that the only benchmark that is significantly affected by the LPT size, experiencing increasingly many conflicts with every reduction, is mcf. We evaluate all the configurations between LPT/2 and LPT/64. They are consistent with the trend shown in the figure; thus, we omit them to simplify the presentation.
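One way to picture the tagged, reduced-size table is the following Python sketch. The modulo mapping from register index to entry is an assumption made for illustration (the text does not specify the mapping), and the names `TaggedLPT`, `record`, `lookup`, and `clear` are hypothetical. The tag check makes a conflict cost only a missed load pair, never a wrongly revealed address.

```python
from typing import List, Optional


class TaggedLPT:
    """Reduced LPT: several physical registers share one entry, disambiguated by a tag."""

    def __init__(self, num_entries: int) -> None:
        self.num_entries = num_entries
        self.tags: List[Optional[int]] = [None] * num_entries  # None means inactive
        self.addresses: List[int] = [0] * num_entries

    def _index(self, reg: int) -> int:
        return reg % self.num_entries  # assumed mapping, not taken from the paper

    def record(self, dst_reg: int, address: int) -> None:
        # A committing load claims the entry, evicting any conflicting register.
        i = self._index(dst_reg)
        self.tags[i], self.addresses[i] = dst_reg, address

    def lookup(self, src_reg: int) -> Optional[int]:
        # Returns LD1's address only if the entry really belongs to src_reg.
        i = self._index(src_reg)
        return self.addresses[i] if self.tags[i] == src_reg else None

    def clear(self, dst_reg: int) -> None:
        # A committing non-load deactivates the entry of its destination register.
        i = self._index(dst_reg)
        if self.tags[i] == dst_reg:
            self.tags[i] = None


small = TaggedLPT(num_entries=4)
small.record(dst_reg=2, address=0x1000)
small.record(dst_reg=6, address=0x2000)  # register 6 conflicts with register 2
```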
Overall, we observe that reducing the LPT size only marginally affects performance. This behavior confirms our assumption that load pairs consist of loads that are close to each other.

Implementation Overhead
ReCon is a low-complexity approach, and its implementation overhead is primarily a storage overhead in the cache hierarchy and directory. More specifically, ReCon makes changes in the core and the cache hierarchy as follows.
In the core, ReCon adds a load-pair table in the commit stage to propagate the accessed address of committed loads. The load-pair table (LPT) consists of an address (48 bits) and a valid bit (1 bit) per entry. For example, 180 registers (Intel Skylake [21]) would require a 1.1KiB LPT, while 224 registers (AMD Zen 4 [8]) would require a 1.37KiB LPT. As explained in Section 6.6, this could be further reduced to 641 bytes and 798 bytes, respectively, by halving the number of entries and adding an extra eight bits per entry (tag for register).
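The LPT sizes above can be reproduced with simple arithmetic: entries × bits per entry / 8. The helper below is an illustrative calculation only; its name and parameters are not from the paper.

```python
def lpt_bytes(entries: int, address_bits: int = 48, valid_bits: int = 1, tag_bits: int = 0) -> float:
    """Storage of a load-pair table in bytes."""
    return entries * (address_bits + valid_bits + tag_bits) / 8


full_skylake = lpt_bytes(180)               # 180 entries x 49 bits = 1102.5 bytes
full_zen4 = lpt_bytes(224)                  # 224 entries x 49 bits = 1372.0 bytes
half_skylake = lpt_bytes(90, tag_bits=8)    # 90 entries x 57 bits = 641.25 bytes
half_zen4 = lpt_bytes(112, tag_bits=8)      # 112 entries x 57 bits = 798.0 bytes
```

Halving the table does not halve the storage exactly, because each remaining entry grows by the eight tag bits.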
ReCon works with aligned 8-byte memory locations, to limit the total number of reveal bits required per cache line. It does not do misaligned or sub-8-byte reveal operations and keeps the values concealed in such cases. In the cache hierarchy, ReCon adds a byte per 64-byte cache line in the private caches and in the directory to track the revealed/concealed state of memory locations (at most eight revealed memory locations can fit in a 64-byte cache line). We evaluate an in-cache directory, which makes the storage cost of the directory bit-vectors proportional to the LLC size. This translates to an overhead of less than 1.5% of the total cache storage (private caches and LLC, considering the storage cost of data + tags + coherence state). For a high-performance system, one can consider a decoupled directory that is, e.g., 2× or 4× overprovisioned compared to the aggregate size of the private caches. In that case, the storage cost of the directory bit-vectors becomes proportional to the aggregate size of the private caches.
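To see where the "less than 1.5%" figure can come from, the sketch below relates the eight reveal bits to the bits already stored per line. The tag and coherence-state widths used in the example (30 and 4 bits) are assumed values for illustration only and are not taken from the paper; the alignment check restates the 8-byte rule from the text.

```python
LINE_DATA_BITS = 64 * 8  # 64-byte cache line
REVEAL_BITS = 8          # one bit per aligned 8-byte word


def reveal_overhead(tag_bits: int, state_bits: int) -> float:
    """Reveal bits relative to data + tag + coherence-state bits of one line."""
    return REVEAL_BITS / (LINE_DATA_BITS + tag_bits + state_bits)


def is_revealable(address: int, size_bytes: int) -> bool:
    # Only aligned, full 8-byte accesses take part in reveal operations.
    return size_bytes == 8 and address % 8 == 0


data_only = reveal_overhead(tag_bits=0, state_bits=0)    # 1/64, about 1.56%
with_meta = reveal_overhead(tag_bits=30, state_bits=4)   # about 1.47% (assumed widths)
```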

RELATED WORK
Several approaches have been proposed to defend against speculative side-channels. As already mentioned, NDA [52] and STT [56] use the same principles, with NDA being stricter by not allowing potential secrets to propagate to any dependent instructions, at the cost of reduced instruction-level parallelism, while STT applies a taint tracking mechanism to propagate secrets and delay only dependent transmitting instructions. DoM [39,40], instead of tracking potential secrets and blocking transmitters, delays all loads that miss in the (L1) data cache, as hits do not produce timing effects. This eliminates all observable cache timing differences. Mechanisms such as InvisiSpec [53], Ghost Loads [38], MuonTrap [5], and GhostMinion [4] focus on hiding speculative execution by using speculative buffers that temporarily store speculative information, and modifying the memory system to comply with this invisibility. CleanupSpec [37] focuses on restoring microarchitectural state after a misspeculation is detected, effectively scrubbing potential secrets from the observable state. There have also been several attacks [2,3,9,27] that target such schemes, but ReCon does not affect their effectiveness, as they can be applied independently.
The above solutions introduce varying performance overheads and implementation complexities. To recover some of the lost performance, several optimizations have been proposed.
Speculative Data-Oblivious execution (SDO) [55] is an optimization to STT that uses prediction to make speculative execution independent of speculatively accessed values. In contrast to ReCon, SDO focuses on STT, and cannot be readily combined with other schemes such as NDA. NDA does not propagate the secret, and thus SDO would be unable to predict the cache level hit for dependent loads. SDO on STT provides a 44.4% reduction in overhead with the Spectre threat model and 36.3% with the Futuristic threat model, when protecting against memory side-channels (load instructions). While our evaluation differs (e.g., we use SPEC2017 speed benchmarks instead of rate, and with an entirely different set of simpoints), we report a 45% reduction in overhead for a threat model that lies between the Futuristic and the Spectre threat model (36.3% and 44.4% reduction in SDO, respectively). Moreover, the two optimization mechanisms are orthogonal and can cooperate, as ReCon applies to untainted loads to untaint their dependent loads, while SDO applies to tainted loads to predict their cache level hit. Thus, they can both be applied at the same time, with ReCon reducing the number of tainted load instructions and SDO recovering performance by predicting tainted loads.
Other optimizations: InvarSpec [58] detects load instructions that are guaranteed to commit regardless of the outcome of speculation, lifting their protection while still speculative. InvarSpec operates together with secure speculation schemes that protect against all speculative data leakage (e.g., InvisiSpec [53]). While it can also be adapted for schemes like STT and NDA, the performance gains are unknown, as those schemes already exploit some memory-level parallelism by allowing independent loads to happen while speculative. This is a major performance bottleneck for DoM [39], for example, which delays all loads that miss in the first-level cache, and InvarSpec [58] enables the execution of some of those specific misses.
Clearing the Shadows [48] focuses on instruction re-ordering to eliminate speculation as early as possible. InvarSpec and Clearing the Shadows are optimizations that leverage compilers and hardware/software co-design, unlike our approach, which only affects the hardware implementation.
Pinned Loads [57] focuses on speculation and the overhead caused by memory re-ordering, proposing a mechanism to resolve memory violations as early as possible, to enable the execution of protected loads much earlier. Both Clearing the Shadows [48] and Pinned Loads [57] focus on eliminating speculation and thus boosting performance by assigning less work to the underlying mitigation. This is different from our work, where we optimize the existing schemes themselves.
Doppelganger Loads [26] is an optimization that also leverages non-speculative information, but instead of directly connecting it to leakage (load pairs), it uses the addresses accessed by committed loads to train an address predictor and safely predict the address of subsequent speculative loads.

CONCLUSION
We propose ReCon, an efficient, low-complexity approach to leverage knowledge of non-speculative leakage for the purpose of relaxing defenses in secure speculation mechanisms, such as NDA and STT, that would otherwise protect data that have already leaked.
Based on the observation that an address accessed by a load and transmitted by a second dependent load leaks the value at this address, ReCon focuses exclusively on detecting non-speculative, direct-dependence load pairs, shedding all the complexity of a general dynamic information flow tracking (DIFT) mechanism proposed previously. Furthermore, ReCon leverages the existing cache coherence infrastructure (including the directory) to store, share, transmit, and keep coherent the non-speculative-leakage state of addresses. ReCon, depending on the underlying secure speculation mechanism, enables the execution of load instructions that would otherwise be delayed. For example, under STT, ReCon untaints the output register of the load instruction that accesses an address known to have leaked non-speculatively.
ReCon successfully reduces the overhead for NDA by 28.7% and 31.5%, and the overhead for STT by 45.1% and 39%, on average, for SPEC2017 and SPEC2006, respectively. For the PARSEC benchmark suite, ReCon reduces the overhead incurred in the total execution time by 46.7% and 78.6% for NDA and STT, respectively.

Figure 2: Implicit channels of store-to-load forwarding with ReCon.

Figure 4: Percentage breakdown of leakage out of all address space.

Figure 7: Amount of tainted loads on SPEC2017 of STT (full column) and ReCon (hatched part), normalized to STT.

Figure 8: Normalized execution time of parallel benchmarks.

Figure 9: Correlation between percentage of captured leakage (direct load pairs / all leakage) and overhead reduction. (SPEC2017 benchmarks with more than 5% performance degradation in STT shown.)

Figure 11: Normalized IPC of STT+ReCon with various sizes of Load-Pair Table (LPT).

Table 1: Memory dependence prediction cases for the store forwarding example of Figure 2.