Turn on, Tune in, Listen up: Maximizing Side-Channel Recovery in Time-to-Digital Converters

Voltage fluctuation sensors measure minute changes in an FPGA power distribution network, allowing attackers to extract information from concurrently executing computations. Previous voltage fluctuation sensors make assumptions about the co-tenant computation and require that the attacker have a priori access or system knowledge to tune the sensor parameters statically. We present the open-source design of the Tunable Dual-Polarity Time-to-Digital Converter, which introduces dynamically tunable parameters that optimize signal measurement, including the transition polarity, sample window, frequency, and phase. We show that a properly tuned sensor improves co-tenant classification accuracy by 2.5× over prior work and increases the ability to identify the co-tenant computation and its microarchitectural implementation. Across 13 varying applications, our techniques yield an 80% classification accuracy that generalizes beyond a single board. Finally, our sensor improves the ability of a correlation power analysis attack to rank correct subkey values by 2×.


INTRODUCTION
Cloud providers offer FPGAs as a service. FPGAs' versatility makes them efficient compute engines for neural networks [8], genome sequencing [3], secure database transactions [1], networking [31], and homomorphic encryption [29]. These applications have strict requirements for data confidentiality and computational integrity.
FPGA cloud providers use strict time-sharing schemes where a user rents the entire FPGA. This can leave the FPGA under-utilized. FPGA virtualization maximizes utilization by supporting multiple concurrent users [41]. It can reduce costs and increase efficiency, making it an attractive option for cloud service providers.
Unfortunately, FPGA virtualization introduces a side channel observable by an attacker implementing a voltage fluctuation sensor within their programmable logic. Voltage fluctuation sensors measure minute voltage changes in the power distribution network that expose details about co-tenant computations. Voltage fluctuation sensors are used as a covert channel [35,42] or a side channel to extract cryptographic keys of co-located encryption cores [35,42].
Time-to-Digital Converters (TDCs) are a common voltage fluctuation sensor that measure the propagation delay through a linear array of logic elements, which is a function of the power distribution network (PDN) voltage. A slower propagation indicates that the PDN is stressed by some computation. These voltage fluctuations over time can be measured with consecutive TDC output captures. Figure 1 shows our open-source pipeline. In Stage 1, an attacker co-locates temporospatially with a victim user. The attacker measures the shared power distribution network in Stage 2. Our open-source Tunable Dual-Polarity TDC allows the attacker to dynamically tune its sensing parameters, including transition polarity, sample window, frequency, and phase. Previously proposed TDC sensors are statically tuned in one or more of these parameters, which requires detailed knowledge of the computational environment and target computation. Utilizing these techniques in Stage 3, we demonstrate that a well-tuned sensor can improve classification accuracy by 2.5× over a statically-tuned sensor that incorrectly characterizes its environment or target computation. After successfully classifying an AES computation, we demonstrate in Stage 4 that proper sensor calibrations increase the ability to correctly rank subkey values by 2× in a Correlation Power Analysis (CPA) attack.
Figure 1: (2) The attacker instantiates our Tunable Dual-Polarity TDC. (3) Our dynamic tuning techniques improve the ability to classify victim co-tenant computation by 2.5×. (4) After recognizing a cryptographic core, dynamic tuning increases the effectiveness of correlation power analysis by 2.2×.
Figure 2: (1) An attacker is given access to a remote multi-tenant FPGA and programs it with a voltage fluctuation sensor. (2) The sensor readings are gathered and sent to the attacker for analysis. (3) The attacker tunes the sensor parameters to better extract co-tenant information. This paper studies the impact of that tuning.
The contributions of this work are:
• An Open-Source Tunable Dual-Polarity TDC sensor for performing side-channel attacks on FPGAs
• A study of three metrics for measuring the propagation distance of rising and falling transitions
• A technique for maximizing channel information by adjusting capture window duration
• A method for tuning to the unknown phase of a co-tenant computation and isolating it from the environment
• A study characterizing the impact of these parameters on a 13-application, cross-board classification problem
• An application of our tuning methods to a multi-tenant Correlation Power Analysis attack
The paper is organized as follows: Section 2 presents the threat model. Section 3 describes our Tunable Dual-Polarity TDC and its tuning abilities. Section 4 experimentally verifies the tuning optimizations presented in the previous section, and then shows how this can be leveraged to perform our classification attack as well as a Correlation Power Analysis. We conclude in Section 6.

THREAT MODEL
Figure 2 describes the proposed threat model. The attacker is provided access to a cloud FPGA. The attacker has a design with a voltage fluctuation sensor and deploys it on the FPGA. We assume the system provides logical separation of the tenants [16,21] and the attacker is restricted to system-defined interfaces, e.g., those provided by a shell. The attacker gathers the sensor readings, determines if a targeted co-tenant is present, and extracts confidential information from them. The attack is performed entirely remotely.
The attacker is a malicious adversary that aims to extract information from spatiotemporal co-tenants. This could be as simple as whether a co-tenant is currently using the FPGA, e.g., to know when to launch a fault attack [14,19,37]. The attacker could classify whether a specific type of computation is occurring on the shared FPGA, e.g., is the co-tenant performing encryption? It could infer details about the co-tenant's design, e.g., are they using a soft processor? Is it a RISC-V processor? The attacker could also learn information about the data being computed upon, e.g., extracting a cryptographic key [13,34,42], and leverage the architectural details learned about the implementation to increase recovery speeds.
The attacker can implement a voltage fluctuation sensor. Our voltage fluctuation sensor is a variant of a TDC sensor [43]. We assume the sensor will pass bitstream analysis techniques that detect remote attacks [20]. As discussed later, our sensor passes the checks performed by Amazon AWS F1 instances.
We do not make any assumptions about where the sensor is placed, e.g., the victim computation does not need to have one of its wires running through it [12,30,32]. However, the sensors are more sensitive to computations that are spatially closer [17,30], so as proximity decreases, the need for the sensor tuning described in this work increases. We consider only attacks within the same programmable logic. However, similar attacks have been shown from the FPGA to a CPU on the same die [42], across dies on a 2.5D integrated package [10], and across chips on the same board [11,35].

TUNABLE DUAL-POLARITY TDC
Our Tunable Dual-Polarity TDC (see Footnote 1) has four key features: 1) it captures both rising and falling transition polarities (Dual-Edge); 2) it provides real-time adjustment of the sample window duration; 3) it provides real-time phase adjustment of the sample clock relative to the target computation; 4) it provides real-time frequency adjustment of the sample clock. We use these features to tune the sensor to the voltage fluctuations of the PDN caused by the target.
Figure 3 shows the Tunable Dual-Polarity TDC architecture. The sensor's core is a pulse generator that induces rising and falling transitions through a delay line at a configurable sampling frequency. A single pulse contains a positive (0 → 1) and a negative (1 → 0) pulse edge. Positive and negative pulse edges are issued sequentially in the Launch Clock domain. Pulse edges cause falling and rising transitions to propagate through a linear array of delay line elements to the Output capture register, which is controlled by the Capture Clock. The launch-to-capture phase offset is the phase difference between the Launch Clock and the Capture Clock, i.e., the time between the pulse launch and the subsequent capture of the transition in the output registers. When this offset is set correctly, a transition will be propagating through the delay line when the output registers are clocked, and the registers record a metastable transition. The propagation distance is the number of delay elements the transition has passed through. An example output sequence from two consecutive pulses is shown at the bottom of Figure 3.
Footnote 1: The sensor architecture and the sensor implemented alongside a PicoRV core executing AES have been open-sourced for the PYNQ-Z2. Additionally, we provide an easy-to-install PYNQ package for interacting with the sensor and an example Jupyter Notebook studying how phase shifting can be used for isolating relevant computation on the PicoRV running AES. All this is available at: https://github.com/KastnerRG/Tunable-TDC.
Each pulse causes a falling and rising transition to be captured at the output. Rising Transition 0 shows that the 0 → 1 transition reached Output[38]. Falling Transition 0 shows that the 1 → 0 transition propagated to somewhere between Output[21] and Output[23], with some metastability between the two points. In the next pulse, Rising Transition 1 propagates differently; the 0 → 1 transition propagates to between Output[36] and Output[39]. Similarly, Falling Transition 1 propagates to between Output[20] and Output[23]. These changes reflect PDN voltage fluctuations that change the delay line propagation. The variations provide potential information about the operation of the FPGA, including computation by co-tenants.
The sampling frequency is dictated by the length of the delay line and the speed of the underlying FPGA logic. If a higher effective sampling frequency is needed, multiple launch/capture clock pairs with a known phase offset can be generated by the clock generator as is done in related work [4,36,39].
Pulse Generator: The pulse generator produces positive (0 → 1) and negative (1 → 0) pulse edges that cause falling and rising transitions, respectively, in the delay line. Each sample produces a rising and a falling transition on the capture registers. A trace is a series of samples. The pulse generator has two configurable run-time parameters: the sampling frequency, which is an integer fraction of the launch clock frequency, and the number of pulses. Figure 3 demonstrates a trace of length 2, i.e., two rising and two falling transitions. We show that both transitions contain useful information in Section 4.

Programmable Clock Generators:
The Tunable Dual-Polarity TDC has two programmable clock generators, each implemented using a Xilinx Mixed-Mode Clock Manager (MMCM). The first MMCM (Figure 3, ①) controls the input clock to the TDC and the phase relationship between the target clock and the sensor clock. Section 4.4 discusses the importance of tuning this phase relationship to better capture relevant information about a co-located computation.
A second MMCM (Figure 3, ②) generates the launch and capture clocks with a programmable phase offset between them. Changing this offset modifies when the pulse generator generates an edge and when the capture clock fires and records the location of the subsequent transition in the output registers. Section 4.3 demonstrates the importance of tuning this offset.
During compilation, the TDC sensor is configured to pass timing checks. The phase relationship is unconstrained, and the launch-to-capture phase offset is set to 2π. This means that the TDC sensor cannot be detected by tools that check for timing violations [20].
Delay Line and Capture Registers: The delay line in Figure 3 is a series of combinational logic elements that propagate the rising and falling transitions caused by the pulse generator. The delay elements are constructed from identical digital circuit elements that aim to provide a linear propagation delay. The delay elements of the TDC should be placed and routed with uniform spacing to ensure consistent delay between each element and a uniform delay through the entire chain.
The Tunable Dual-Polarity TDC uses the fast look-ahead CARRY primitives in Xilinx FPGAs to create the delay line. The CARRY logic provides a relatively linear delay between each output bit within a single CARRY primitive. The carry logic is configured to compute Output = 65'h0_ffff_ffff_ffff_ffff + input so that when input changes from 65'h0 → 65'h1 on a positive pulse edge, the output of the delay line is a transition with falling polarity from Output = 65'h0_ffff_ffff_ffff_ffff to Output = 65'h1_0000_0000_0000_0000. A transition with rising polarity is produced on the negative pulse edge.
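The adder configuration above can be illustrated with a small behavioral model. This is a minimal sketch, not the released hardware description: the function names and the idealized capture (no metastability, a clean boundary at a chosen propagation distance) are assumptions made for illustration.

```python
# Behavioral sketch of the 65-bit carry-chain delay line described above.
# Assumption: an idealized capture in which the low-order bits have already
# switched and the high-order bits still hold their previous value.

WIDTH = 65
MASK = (1 << WIDTH) - 1
ADDEND = (1 << 64) - 1  # 65'h0_ffff_ffff_ffff_ffff


def carry_chain_output(pulse_bit: int) -> int:
    """Settled adder output for a 1-bit pulse input."""
    return (ADDEND + pulse_bit) & MASK


def captured_word(old: int, new: int, propagation_distance: int) -> int:
    """Word seen by the capture registers when the carry has rippled through
    `propagation_distance` low-order bits (hypothetical helper)."""
    low_mask = (1 << propagation_distance) - 1
    return (new & low_mask) | (old & ~low_mask & MASK)


before = carry_chain_output(0)  # 65'h0_ffff_ffff_ffff_ffff
after = carry_chain_output(1)   # 65'h1_0000_0000_0000_0000

# Falling-polarity transition captured after the carry passed 38 bits:
# bits 0..37 have fallen to 0, bits 38..63 are still 1.
sample = captured_word(before, after, 38)
print(f"{before:#019x} -> {after:#019x}, captured {sample:#019x}")
```

In this model a lower PDN voltage would correspond to a smaller propagation distance at the same capture time, which is the quantity the sensor tracks from sample to sample.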
The interplay between the number of bits in the delay line and the launch-to-capture phase offset is also a critical TDC design consideration. The delay line length limits the maximum value of this offset and the sampling frequency. A delay line that is too short may not capture all of the PDN variations induced by a target, but a long delay line increases resource consumption. Characterizing how a target computation affects the PDN and finding the offset value that best measures those variations are crucial for tuning the sensor to provide the most information.
The capture registers shown in Figure 3 record the output of each bit of the carry delay line in the capture clock domain. The path from the pulse generator to the high-order bit of the output meets timing constraints in the FPGA toolchain, and the launch clock and capture clock are configured to be in phase during compilation. This means that the TDC sensor cannot be detected by tools that check for timing violations [20]. Tuning is the process of searching for the phase settings at which power variations maximize the extracted information.
Phase Tuning: The programmable clock generators allow the Tunable Dual-Polarity TDC to tune its parameters to optimize the information sampled from the PDN. Figure 4 defines the relationship between the target clock, the capture clock, and the launch clock using the two phase offsets, the effect of the target computation on the PDN voltage, and the effect of varying each offset on the extracted information. The PDN voltage varies in response to the target's rising clock edge. The upper right graph in Figure 4 demonstrates the effect of varying the launch-to-capture phase offset from 0 to 2π. Increasing this offset provides more time for the pulse to propagate through the delay line; as it increases, the transition bit index increases. Section 4.3 experimentally demonstrates the importance of tuning this offset.
The lower right graph in Figure 4 demonstrates the effect of varying the phase between the target clock and the sensor clock from −π to π. Changing this phase moves the sampling window with respect to the target computation. When the sampling window is correctly positioned relative to the target clock, the sensor output will maximally change in response to the variations in current drawn by the target. This will cause an increase in the information measured at the sensor. Section 4.4 demonstrates how our TDC sensor enables this phase to be tuned to ensure that the sample window is optimized with respect to the target computation.
Propagation Metric: When the launch-to-capture phase offset is tuned correctly, the capture clock will record how far the signal has propagated through the delay elements. The signal propagation distance can be measured as the index in the capture register. The least significant bits generally have their post-transition value, and the most significant bits typically have their pre-transition value. This imprecise definition reflects the metastability around the transition point that can cause multiple bit flips. This metastability may contain useful information, and ignoring these flips could reduce the side-channel information. This behavior is shown in rising/falling Transition 1 of Figure 3.
We examine three propagation metrics: First Index, Last Index, and Binary Hamming Distance.

RESULTS
We now report results on the impact of the two phase offsets and the propagation metrics, as applied to the classification experiment to determine if the co-tenant is a cryptographic core and, if so, perform a correlation power analysis. The sensor, classification data on 13 applications, and classifier network are released as open source.

Experimental Setup
Our experimental platforms are Amazon Web Services (AWS) EC2 F1 instances with Xilinx UltraScale+ XCVU9P-FLGB2104 FPGAs and six PYNQ-Z2 boards with Xilinx ZYNQ XC7Z020-1CLG400C FPGAs. On the PYNQ systems, the device is programmed with our sensor and test designs through the Python Productivity for Zynq (PYNQ) infrastructure. The AWS EC2 F1 instances are launched through the EC2 interface and programmed with the unique AGFI identifier associated with our sensor designs. The AGFI is generated by Amazon's unmodified compilation flow with the design checkpoint we provide. Our sensor has passed all design analysis techniques performed by AWS.
A 64-bit Tunable Dual-Polarity TDC is instantiated on PYNQ-Z2 and a 256-bit Tunable Dual-Polarity TDC on AWS. The launch and capture clock domains operate at 100 MHz. This results in a sampling rate of 25 MHz. MMCM ①, which allows for phase shifting relative to the target clock, produces a 100 MHz output clock. The internal VCO frequency is maximized for the two MMCMs so that the phase step of both phase offsets is as fine as possible, with a step size of 11.16 ps on AWS and 14.88 ps on PYNQ.
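The stated step sizes can be cross-checked with a short calculation. This sketch assumes that one MMCM fine phase-shift step equals 1/56 of the VCO period, and that the VCO runs at 1600 MHz on the UltraScale+ device and 1200 MHz on the Zynq-7000 device; neither VCO frequency is stated in the text, and both are chosen here because they reproduce the reported values.

```python
# Sketch: relate MMCM VCO frequency to the phase-step size and to the number
# of steps in one rotation of the 25 MHz sampling clock.
# Assumptions (not from the text): 56 fine phase steps per VCO period,
# VCO at 1.6 GHz (AWS) and 1.2 GHz (PYNQ-Z2).

STEPS_PER_VCO_PERIOD = 56


def phase_step_ps(vco_hz: float) -> float:
    """Duration of one fine phase-shift step in picoseconds."""
    return 1e12 / (STEPS_PER_VCO_PERIOD * vco_hz)


def steps_per_rotation(sample_hz: float, vco_hz: float) -> int:
    """Number of phase steps spanning one full period of the sampling clock."""
    return round(STEPS_PER_VCO_PERIOD * vco_hz / sample_hz)


aws_step = phase_step_ps(1.6e9)    # approximately 11.16 ps
pynq_step = phase_step_ps(1.2e9)   # approximately 14.88 ps
pynq_steps = steps_per_rotation(25e6, 1.2e9)

print(f"AWS step:  {aws_step:.2f} ps")
print(f"PYNQ step: {pynq_step:.2f} ps")
print(f"PYNQ steps per 40 ns rotation: {pynq_steps}")
```

Under these assumptions, one full rotation of the 25 MHz clock on PYNQ takes 2688 steps, which matches the number of phase shifts used in the classification experiment.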

Applications
Our experiments use our Tunable Dual-Polarity TDC to classify the characteristics of a co-tenant. We have 13 unique applications containing IP cores using different architectural features. The application IP core and the sensor are implemented on the same FPGA. They are logically and physically isolated. The characteristics of the applications are described in the following paragraphs.
Sensor Only: The primary goal of the sensor-only design is to model the lack of another co-tenant. This design only contains the voltage fluctuation sensor and associated data collection logic. This mimics a scenario where only the attacker is present on the FPGA.
Ring Oscillators: Ring Oscillators are a malicious circuit with the sole purpose of aggressively consuming power. These are implemented as banks of combinational loops, resulting in rapid switching and power consumption as the circuit cannot settle on a single output value. Such a circuit can cause voltage disruptions in the power distribution network and can be used as a covert channel or to induce faults [14,19,37].
Arithmetic-Heavy: FPGAs are particularly well suited for high-intensity signal processing tasks with arrays of digital signal processors (DSPs). As an approximation of these structures, we implement arrays of DSPs performing a pipelined fused multiply-add operation. All DSPs operate in a single clock domain and compute upon data generated by a randomly-seeded, linear-feedback shift register.
Cryptographic Cores: We study ten different implementations of cryptographic computations consisting of two algorithms (AES, PRESENT) implemented on five different architectures (Custom HLS IP core and as software running on Orca, MicroBlaze, PicoRV, and ARM CortexM3 soft processors).

Launch-to-Capture Phase Tuning and Metric Selection
As shown in Figure 4, the launch-to-capture phase offset is the phase difference between the launch and capture clocks and dictates how long a transition is allowed to propagate through the delay line. It plays two important roles: first, it determines the position of the transition in the output and can be used to avoid undesirable behavior caused by discontinuities in the FPGA architecture; second, it defines the duration of the sampling window, during which the delay line is measuring PDN variations. Figure 5 demonstrates the effect that varying this offset has on the transition index as measured by the First Index, Last Index, and Binary Hamming Distance metrics across both falling and rising transition polarities. These experiments are performed on the PYNQ-Z2 Sensor Only and AWS Sensor Only designs. In the experiment, the offset is increased from 0 ps with a step size of 11.16 ps on AWS and 14.88 ps on PYNQ, as determined by the maximum frequency for the family and device speed grade. At each offset value, a trace of 2^14 samples is captured, where a sample is one rising and one falling transition. This process is repeated until the transition index exceeds 64 bits, the maximum length of the delay line for our PYNQ-Z2 implementation. Next, we calculate the transition index using the First Index, Last Index, and Binary Hamming Distance metrics for each offset value, for each trace, and for both rising and falling transition polarities. The average value of the trace at each offset value is plotted. The error bar at each point is the standard deviation of the respective trace. Standard deviation, as we will show, is a good measure of the sensitivity of the sensor to voltage changes. The rising/falling transition polarities are shown in blue/orange for AWS and red/green for PYNQ. The three sub-graphs correspond to the three propagation metrics from Section 3, which are studied in the following sections.
As in the example output of Figure 3, neither prior metric is able to discern between 4b'0101 and 4b'0111, potentially missing important information.
The data in Figure 5(c) demonstrates that there are few plateaus when using the Binary Hamming Distance metric, and that the standard deviation is relatively consistent across the delay line.
Tuning the launch-to-capture phase offset provides the ability to choose where in the delay line a transition falls, and therefore the ability to avoid the plateaus observed in this section. For the remainder of the paper we use the Binary Hamming Distance metric for measuring rising and falling polarities due to its improved characteristics.
The delays of the carry outputs do not monotonically increase due to the use of carry lookahead adders in the FPGA architecture. Permuting the outputs allows the timing to be maintained [13]. This would change the behavior of the First Index and Last Index metrics, making them more linear. It does not affect the Binary Hamming Distance.
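The difference between the three metrics can be made concrete with the 4b'0101 versus 4b'0111 example. The definitions below are assumptions chosen to be consistent with that example (First Index as the lowest set bit, Last Index as the highest set bit, Binary Hamming Distance as the number of bits that differ from a reference word); they are a sketch, not necessarily the exact definitions used by the sensor package.

```python
# Sketch of three propagation metrics on a captured word.
# Assumed definitions (hypothetical, for illustration only):
#   first_index: index of the lowest-order set bit
#   last_index:  index of the highest-order set bit
#   binary_hamming_distance: number of bits differing from a reference word


def first_index(word: int) -> int:
    """Index of the lowest-order set bit, or -1 if no bit is set."""
    return (word & -word).bit_length() - 1


def last_index(word: int) -> int:
    """Index of the highest-order set bit, or -1 if no bit is set."""
    return word.bit_length() - 1


def binary_hamming_distance(word: int, reference: int = 0) -> int:
    """Count of bit positions where `word` differs from `reference`."""
    return bin(word ^ reference).count("1")


a, b = 0b0101, 0b0111
for name, metric in [("first", first_index),
                     ("last", last_index),
                     ("hamming", binary_hamming_distance)]:
    print(f"{name:8s} {metric(a)} {metric(b)}")
```

With these definitions the two index metrics return the same value for both words, while the Hamming-style metric separates them, because it also counts the metastable bit in the middle of the transition. It also does not depend on bit order, which is consistent with the observation that permuting the carry outputs leaves it unchanged.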

Target Phase Tuning and Background Subtraction
The target phase offset is the phase relationship between the target clock and the launch clock of the sensor. Our Tunable Dual-Polarity TDC can dynamically adjust this phase to tune to the target clock and maximize measured information. This provides the ability to reliably isolate where the information channel between the co-tenant and sensor is maximized, which has a significant impact on the side-channel information.
To demonstrate this, we sweep the target phase offset through two complete phase rotations (4π). For a sampling frequency of 25 MHz, this corresponds to 80 ns. This process is performed twice: once as a measure of the background environment when the computation is disabled, and again when a co-tenant has been enabled. At each phase position, two traces of 1024 samples are captured. One trace records the rising transition polarity (↑) with the launch-to-capture offset set where it maximizes the standard deviation of the rising-transition samples, and the other trace records the falling transition polarity (↓) with the offset set where it maximizes the standard deviation of the falling-transition samples. The Binary Hamming Distance is computed for each of these transition types. The average (μ) as well as standard deviation (σ) of each trace is calculated. Figures 6(a), 6(b), and 6(c) demonstrate the result of sweeping over the range of 4π on three different designs: AWS Sensor Only, PYNQ-Z2 Sensor Only, and PYNQ-Z2 PicoRV AES. The first and fifth rows in each subfigure plot the zero-centered trace average for the rising transition (↑ Δμ) and falling transition (↓ Δμ). The raw offset in the Binary Hamming Distance is unimportant, so we consider the deviations from the average across all phase values. The blue line is the data recorded when the computation is off (Background), and the red line is the data recorded when the target was on (if applicable). The second and sixth rows plot the pointwise difference between the red and the blue line in their respective preceding plots. The third and seventh rows in each subfigure plot the trace Binary Hamming Distance standard deviation for the rising transition (↑ σ) and falling transition (↓ σ).
The fourth and eighth rows plot the pointwise difference between the red and blue lines in their respective preceding plots.
AWS Sensor Only: Figure 6(a) shows the Binary Hamming Distance of both edges on both background sweeps (rows 1 and 5, ↑ Δμ and ↓ Δμ). When the difference of the two sweeps is taken, the Binary Hamming Distance (↑ Δμ and ↓ Δμ), as well as the standard deviation (↑ σ and ↓ σ), is reduced to a flat line.
The results demonstrate that there is significant background noise that has an effect on both the Binary Hamming Distance as well as the standard deviation of a trace. 20 peaks of equal amplitude appear over a range of 80 ns within the Binary Hamming Distance, which implies the existence of 250 MHz logic on the FPGA, likely the AWS shell logic. Using background subtraction techniques [27], it can be removed to isolate the target.
Figure 6: Background subtraction is critical to isolate the variance caused by a target from other information sources on the system and determine the phase value that maximizes the information leakage (yellow). The rising (↑) and falling (↓) transition variance maxima are offset from one another.
Figure 7: Our Tunable Dual-Polarity TDC is employed in a 13-way classification task where an attacker extracts the type of co-located computation. The ability to distinguish co-tenant computations is a measure of side-channel information contained in the sensor's traces. 7(a) represents the worst case, where a TDC cannot reconfigure either phase offset and achieves 32% accuracy. In 7(b) the TDC can tune the launch-to-capture phase offset and improves to 51% accuracy. In 7(c) both phase offsets have been tuned with background subtraction to isolate co-tenant information and achieve 75% accuracy, a 2.3× improvement.
PYNQ-Z2 Sensor Only: Figure 6(b) demonstrates the same experiment on the PYNQ-Z2 platform. We now observe background peaks that indicate 100 MHz synchronous logic. As on AWS, this information is consistent across multiple sweeps. When the background is subtracted, all variation in Binary Hamming Distance and standard deviation is reduced to a noisy flat signal.
PYNQ-Z2 PicoRV AES: Figure 6(c) demonstrates the same experiment performed on the PYNQ-Z2 platform when the PicoRV AES design is operating at 25 MHz. In contrast to the previous two experiments, we take a single background sweep of the target phase offset with the PicoRV core deactivated, then another sweep with the processor activated. The difference between the deactivated/activated sweeps produces a peak (yellow) that highlights the correct tuning.
Background subtraction produces a single distinct peak over a range of 2π in the Binary Hamming Distance (Δμ) and standard deviation (σ) plots. We attribute this single peak to the PicoRV AES core running at 25 MHz. This behavior is consistent across designs, algorithms, and architectures. This phase position represents not just where standard deviation is maximized (which may be muddied by the presence of background information), but where the channel contains maximum information about the co-tenant. We show in Section 4.5 that this is the best location for tuning the sensor and recovering side-channel information.
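The selection procedure described in this section can be sketched as follows. The sketch assumes that each sweep is available as a list of metric traces, one per phase position, and the data is synthetic; the helper names are hypothetical and are not part of the released PYNQ package.

```python
# Sketch: pick the phase position by background subtraction.
# Synthetic data; helper names are hypothetical.
from statistics import pstdev


def sweep_sigma(traces):
    """Standard deviation of the propagation metric at each phase position."""
    return [pstdev(trace) for trace in traces]


def background_subtracted_peak(sigma_background, sigma_target):
    """Return (best_index, difference) where difference is the pointwise
    target-minus-background standard deviation."""
    diff = [t - b for t, b in zip(sigma_target, sigma_background)]
    best = max(range(len(diff)), key=lambda i: diff[i])
    return best, diff


# Six phase positions. Positions 1 and 4 carry periodic background activity
# in both sweeps; the target only adds variation at position 3.
background_traces = [[20, 20, 20, 20], [18, 22, 18, 22], [20, 20, 20, 20],
                     [20, 20, 20, 20], [18, 22, 18, 22], [20, 20, 20, 20]]
target_traces = [[20, 20, 20, 20], [18, 22, 18, 22], [20, 20, 20, 20],
                 [19, 21, 19, 21], [18, 22, 18, 22], [20, 20, 20, 20]]

sigma_bg = sweep_sigma(background_traces)
sigma_tg = sweep_sigma(target_traces)

raw_peak = max(range(len(sigma_tg)), key=lambda i: sigma_tg[i])
tuned_peak, diff = background_subtracted_peak(sigma_bg, sigma_tg)
print("raw maximum at position", raw_peak)
print("background-subtracted maximum at position", tuned_peak)
```

In this toy sweep the raw standard deviation is largest where the background logic is active, whereas the subtracted sweep peaks at the position where only the target contributes, which is the distinction the section draws between maximum standard deviation and maximum co-tenant information.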

Effects of Tuning on Classification
As a precursor to cryptographic key recovery attacks, like a Correlation Power Analysis, an attacker must be able to determine which cryptographic core is executing and when. We fill this void by demonstrating an attack where we accurately classify a co-tenant computation in a multi-tenant system.
Setup: As described in Section 2, we assume an attacker uploads a voltage fluctuation sensor to a remote multi-tenant FPGA environment to extract the architecture and algorithm of co-tenant computation through the comparison of captured power traces to a known body of labeled training data. This training data can be generated in two possible ways. First, a malicious actor, utilizing two separate user accounts, can instantiate a voltage fluctuation sensor with one user and attempt to co-locate with the second user, which instantiates a known design. This would allow an attacker to build a data set on the same architecture where the attack will be performed. The second option is to create a data set using local boards of the same type as the cloud environment. Because this choice depends on the implementation details of the multi-tenant model, we will not consider this in our analysis. Such an attack serves as a violation of the application anonymity guaranteed by such a multi-tenant system. The attack is performed on each of the 13 applications on 5 PYNQ-Z2 platforms.
Launch-to-Capture Phase Tuning: In the following experiment we consider four configurations of the launch-to-capture phase offset: a maximum and a minimum setting for each of the rising (↑) and falling (↓) transition polarities. We sweep the offset through the 64-bit delay line and record a trace's average and standard deviation at each point. The position where the standard deviation is maximized for a particular rising/falling transition polarity is the ↑/↓ maximum setting, and the position where the standard deviation is minimized is the ↑/↓ minimum setting.
Target Phase Tuning: The sensor's target phase offset is configured in three ways: at the position of absolute maximum standard deviation, at the position of absolute minimum standard deviation, and at the position of maximum standard deviation under the background subtraction process of Section 4.4. Just as in Section 4.4, the target phase is shifted 2688 times in 14.88 ps increments, capturing a trace of 128 samples at each point. The sweep is performed once to capture background noise and again once the target application has been enabled. In the tuning process, the maximum (minimum) setting is the position of maximum (minimum) standard deviation of the sweep for a given transition type. The position of maximum standard deviation after background subtraction is the position that maximizes the difference between the standard deviations of the two sweeps.
Table 1: Summary of Prior Work and Quantitative Comparison. Prior works implement a subset of our sensor's capabilities, which can be summarized as a tuning tuple ( , ) of our sensor. Each tuning tuple is tested in our classification experiment to determine accuracy and loss and perform a CPA Attack on a soft processor running AES to report the GSR @ 50K, PSR (Min, Avg, Max) @ 50K, and PGE (Min, Avg, Max) @ 50K traces. No prior experiments have measured how information learned on one board generalizes to other boards (called Cross-Board), a crucial consideration for cloud attacks. Our tuning methodology and TDC Sensor improve co-tenant classification accuracy by 2.5× and increase the rate that correct subkey values are ranked as most likely (PSR) in a CPA attack by 2.2× relative to an un-tuned sensor.
Data Collection: The target design and sensor are loaded onto the device. Then, the launch-to-capture phase offset is positioned at one of its four settings. Finally, the target phase offset is configured to one of its three settings. We examine the following tuning combinations of ( , ): (↑ , ) emulates the worst case of a non-tunable TDC. (↓ , ) introduces launch-to-capture phase tuning to demonstrate how the mitigation of carry-chain non-linearity improves the sensor's ability to resolve co-tenant information. (↓ , ) demonstrates the significance of target phase tuning on classification accuracy. (↓ , ) demonstrates how background subtraction improves our ability to optimize the co-tenant side channel. (↑ , ) is used to determine which transition polarity captures the most information, as it can be directly compared against (↓ , ). After the sensor is configured, the target computation is launched, and a trace of 2^16 samples is gathered. This process is repeated 100 times on each application for a total of 1300 traces per tuning combination per board.
Post-processing: For a group of 1300 traces from a single tuning configuration ( , ) on a single board, we randomly segment each trace into ten sub-traces of 2^13 samples. Each sub-trace is de-trended to remove the DC offset. The Fourier transform of the processed trace is then computed. From an original set of 1300 traces, we are left with 1000 rising transition Fourier transforms and 1000 falling transition Fourier transforms for each application, amounting to 26000 Fourier transforms per board per configuration.
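The post-processing steps can be sketched on a toy trace. The sketch assumes that random segmentation means drawing sub-traces at random start offsets and that de-trending means subtracting the sub-trace mean; both are interpretations, and a plain discrete Fourier transform stands in for whatever FFT routine the released pipeline uses. Sizes are reduced so the example runs quickly.

```python
# Sketch of the post-processing: random sub-traces, DC removal, spectrum.
# Assumptions: random start offsets, mean subtraction as de-trending,
# and a direct O(n^2) DFT in place of an FFT library.
import cmath
import math
import random


def random_subtraces(trace, count, length, rng):
    """Draw `count` sub-traces of `length` samples at random start offsets."""
    starts = [rng.randrange(len(trace) - length + 1) for _ in range(count)]
    return [trace[s:s + length] for s in starts]


def remove_dc(samples):
    """Subtract the mean so the DC offset does not dominate the spectrum."""
    mean = sum(samples) / len(samples)
    return [v - mean for v in samples]


def dft_magnitude(samples):
    """Magnitude of the one-sided discrete Fourier transform."""
    n = len(samples)
    return [abs(sum(v * cmath.exp(-2j * math.pi * k * i / n)
                    for i, v in enumerate(samples)))
            for k in range(n // 2 + 1)]


# Toy trace: a metric value around 10 with a ripple every 8 samples.
trace = [10 + math.sin(2 * math.pi * i / 8) for i in range(256)]
rng = random.Random(0)
spectra = [dft_magnitude(remove_dc(sub))
           for sub in random_subtraces(trace, count=10, length=32, rng=rng)]
peak_bins = [max(range(len(s)), key=lambda k: s[k]) for s in spectra]
print(peak_bins)
```

Every sub-trace peaks in the same frequency bin regardless of where it starts, since the start offset only changes the phase of the ripple; this is one reason a magnitude spectrum is a convenient input for a classifier.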
Network Architecture: We train a simple neural network with only one fully connected layer, a fast and simple starting point. We evaluate the classification accuracy (how accurately our network can classify among the 13 classes of computation) and cross-entropy loss (how well our network generalizes to unseen data) on all configurations of training on four boards and testing on a fifth.
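A single fully connected layer followed by a softmax is equivalent to multinomial logistic regression, so the architecture and both metrics can be sketched without a deep-learning framework. The code below is only an illustration: the function names, learning rate, epoch count, and plain gradient-descent optimizer are assumptions, and the source does not state which framework or optimizer was used.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def cross_entropy(probs, labels):
    # Mean negative log-probability assigned to the correct class.
    return float(-np.log(probs[np.arange(labels.size), labels] + 1e-12).mean())


def train_single_layer(x, labels, n_classes, lr=0.1, epochs=200):
    """Fit one dense layer (weights w, bias b) with batch gradient descent."""
    n, d = x.shape
    w = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(epochs):
        probs = softmax(x @ w + b)
        grad = (probs - onehot) / n  # gradient of the mean cross-entropy w.r.t. logits
        w -= lr * (x.T @ grad)
        b -= lr * grad.sum(axis=0)
    return w, b


def evaluate(x, labels, w, b):
    """Return (accuracy, cross-entropy loss) on a held-out set."""
    probs = softmax(x @ w + b)
    accuracy = float((probs.argmax(axis=1) == labels).mean())
    return accuracy, cross_entropy(probs, labels)
```

In the leave-one-board-out setting described above, `train_single_layer` would be fit on spectra from four boards and `evaluate` called on spectra from the fifth, with `n_classes=13`.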
Classification Results: The results of our experiments are shown in Table 1, and select confusion matrices from our 13-way experiment are shown in Figure 7. The results are summarized below. The baseline dataset exhibits predictably poor performance in our classification task, with 32% accuracy as shown in Table 1. The confusion matrix in Figure 7(a) demonstrates that the classifier struggles across all applications.
(↓ , ): With the introduction of tuned to the maximum position, we see an immediate improvement in classification accuracy from 32% to 51% in Table 1, with the confusion matrix shown in Figure 7(b). This shows that proper tuning to avoid plateaus increases the measured information, as reflected in the improved ability to distinguish between soft processors and their applications.
(↓ , ): With the introduction of tuning, accuracy improves to 75% in Table 1. The confusion matrix for this data set, shown in Figure 7(c), indicates that the classifier robustly determines the co-tenant application.
(↓ , ): To evaluate the effects of background subtraction, we report our network's average accuracy and loss for (↓ , ) and (↓ , ). As seen in Table 1, the network performs 0.236% better without background subtraction; however, background subtraction decreases the loss (0.733 vs. 0.834). This indicates that our network generalizes better with background subtraction. We expand our cross-validation configuration to investigate this result and understand how well the network generalizes to boards it has not trained on, i.e., cross-board generalization. We train on all possible 5 * (5 − ) configurations, where ∈ [0, 5] is the number of training boards of data, and test on the remaining (5 − ) boards on both (↓ , ) and (↓ , ). We also train and test on the same board, as is standard in prior work [15]. The results in Figure 8 show that when training and testing on the same board as in 'S', the network fits to artifacts of the dataset rather than the computation itself and does not generalize beyond the training board. As the number of training boards increases, the median accuracy increases, and the median loss decreases, demonstrating increased generalization.

Figure 8: Testing always occurs on data from a separate board, except for when data from the same board is used (denoted "S"). Background subtraction decreases the cross-board accuracy's interquartile range (IQR) by 2.3× and the loss's IQR by 5.8×. Multi-board training and background subtraction greatly improve cross-board generalization.
The distributions of accuracy and loss as we train on more boards behave differently when the network trains on data with and without background subtraction. As seen in Figure 8(b), the interquartile range (IQR) decreases when background subtraction is added. When we train on four boards and test on a fifth, the loss IQR without background subtraction is 0.429, whereas the IQR with background subtraction is 0.074, an improvement of 5.8×. A narrower distribution means the network is more likely to generalize to unseen boards. This is also reflected in the accuracy distribution in Figure 8(a). In the same four-training-board setup, the IQR of the accuracy with background subtraction is 2.3× smaller than without background subtraction.
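For reference, the IQR comparison above reduces to a quartile difference and a ratio. The helper below is a small sketch with a hypothetical function name; only the two loss IQR values are taken from the text, and the per-split losses that produced them are not available here.

```python
import numpy as np


def iqr(values):
    """Interquartile range: 75th percentile minus 25th percentile."""
    q1, q3 = np.percentile(values, [25, 75])
    return float(q3 - q1)


# Loss IQRs reported in the text for the four-train, one-test setup.
loss_iqr_without = 0.429
loss_iqr_with = 0.074
improvement = loss_iqr_without / loss_iqr_with  # roughly 5.8
```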
(↑ , ): The use of the rising transition increases the accuracy from 75% to 80% and decreases the loss from 0.733 to 0.626. This indicates that the rising and falling transitions contain different information and that both transitions, when properly tuned, perform well in this classification task.

Effects of Tuning on CPA
After recognizing a cryptographic core with our classification procedure, we launch a Correlation Power Analysis (CPA) attack [2] to extract the key values. Because the key values affect power consumption during encryption [34], a sensor tuned to capture more of that signal decreases the number of traces needed to extract the key.
Setup: We perform our CPA attack on the PYNQ-Z2 Orca AES application and consider the configurations (↑ , ) for the well-optimized sensor and (↑ , ) as a worst-case un-tunable TDC comparison. The attack is repeated 50 times for each tuning strategy. We randomly generate a 128-bit AES key each time the attack is performed. This key is used by the Orca AES application to encrypt 50,000 randomly generated plaintexts known to the attacker. During the encryption of each plaintext, the attacker collects a trace of length 8192, aligned by the measurement setup so that the beginning of the trace coincides with the beginning of the encryption.
A large body of work exists on reducing the number of traces needed [5], alignment methods [9,24,28,38], and filtration methods [25,33]. We have kept our CPA attack as conventional as possible for a fair comparison of the different sensor configurations.
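A conventional CPA attack on one subkey byte can be sketched as below. This is a generic textbook formulation, not the authors' code: the function names are hypothetical, the Hamming weight of the first-round S-box output is an assumed leakage model (the source does not state which model was used), and the 256-entry substitution table must be supplied by the caller.

```python
import numpy as np


def hamming_weight(values):
    """Number of set bits in each byte of a 1-D array."""
    as_bytes = np.asarray(values, dtype=np.uint8)[:, None]
    return np.unpackbits(as_bytes, axis=1).sum(axis=1)


def cpa_rank_subkey(traces, plaintext_bytes, sbox):
    """Rank the 256 guesses for one subkey byte, most likely first.

    traces: (n_traces, n_samples) sensor readings.
    plaintext_bytes: (n_traces,) uint8, the plaintext byte for this subkey.
    sbox: (256,) uint8 substitution table supplied by the caller.
    """
    traces = np.asarray(traces, dtype=float)
    plaintext_bytes = np.asarray(plaintext_bytes, dtype=np.uint8)
    sbox = np.asarray(sbox, dtype=np.uint8)

    centered = traces - traces.mean(axis=0)
    trace_norm = np.sqrt((centered ** 2).sum(axis=0))

    scores = np.empty(256)
    for guess in range(256):
        # Hypothetical power consumption under this key guess.
        model = hamming_weight(sbox[plaintext_bytes ^ np.uint8(guess)]).astype(float)
        model -= model.mean()
        denom = np.sqrt((model ** 2).sum()) * trace_norm
        # Pearson correlation per sample point; constant columns score zero.
        corr = (model @ centered) / np.where(denom == 0, np.inf, denom)
        scores[guess] = np.abs(corr).max()

    return np.argsort(-scores)
```

Running this once per subkey byte yields the 16 ranked lists from which the evaluation metrics are derived.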
Following prior work, we analyze the results of the CPA attacks using multiple metrics [5]. After processing some traces, the CPA method returns a list for each subkey that ranks the possible subkey values from most likely to least likely. Partial Guessing Entropy (PGE) is the position of the correct subkey value in the list, where lower is better. Partial Success Rate (PSR) is the frequency with which the correct subkey value is ranked as most likely. Global Success Rate (GSR) is the frequency with which all correct subkey values are ranked as most likely. We also consider the mean PGE as a function of the number of traces. This is frequently used to compare the performance of CPA attacks [22,23,26]. Lower average PGE indicates that the attack is performing better, as the correct subkey values are ranked as more likely after processing fewer traces. Our tuning method has a significant impact on key recovery, lowering PGE by 2.2× at 50,000 traces.
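The three metrics defined above can be computed directly from the ranked lists. The sketch below is illustrative: the function name and array layout are hypothetical, and PGE is reported as a zero-based position (0 means the correct value is ranked first), which may differ from the convention used in Table 1.

```python
import numpy as np


def cpa_metrics(rankings, true_keys):
    """Compute PGE, PSR, and GSR from repeated attack trials.

    rankings: (n_trials, n_subkeys, n_candidates) integer array; each row
        along the last axis lists candidate values from most to least likely.
    true_keys: (n_trials, n_subkeys) correct subkey values.
    """
    rankings = np.asarray(rankings)
    true_keys = np.asarray(true_keys)

    # PGE: position of the correct value in each ranked list.
    pge = (rankings == true_keys[..., None]).argmax(axis=-1)

    top_ranked = pge == 0
    psr = top_ranked.mean(axis=0)               # per-subkey frequency across trials
    gsr = float(top_ranked.all(axis=1).mean())  # trials with every subkey ranked first
    return pge, psr, gsr
```

Taking the minimum, mean, and maximum of `psr` and of the trial-averaged `pge` over the subkeys gives (Min, Avg, Max) summaries of the kind reported in Table 1.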
CPA Attack Results: Our results demonstrate that the optimized sensor configuration (↑ , ) outperforms the worst-case of an un-tunable sensor (↑ , ). Qualitative results are given in Figure 9, which show that the traces obtained from (↑ , ) exhibit lower PGE on average. This indicates that fewer traces are needed to recover the key when using the (↑ , ) configuration, lowering the overall cost of the attack.
Numerical (Min, Avg, Max) results are given in Table 1. The GSR statistic shows that the optimized sensor configuration (↑ , ) was able to recover all 16 subkeys in a single trial. In contrast, a poorly tuned sensor (↑ , ) never recovered the entire key. The higher PSR values for the well-tuned sensor demonstrate that individual subkeys were recovered around 2× more frequently given the same number of traces as with a poorly tuned sensor. These results show that optimized sensor configuration is crucial for identifying co-tenant computation and significantly increases the rate at which a cryptographic key can be recovered.

RELATED WORK
TDCs for Power Side-Channels: The Tunable Dual-Polarity TDC allows us to rapidly compare the performance of our tuning techniques to different "classes" of prior work. Table 1 summarizes prior and related work as configurations of our sensor.
The majority of these efforts never considered either tuning or tuning [14,15,34,35]. Such sensors cannot achieve the signal resolution gains observed in Section 4. These sensors will not apply well to cloud-FPGA environments or across different FPGAs because they assume a random and on each device and therefore do not extract general computation elements.
Some prior work has introduced the concept of tuning [6,7,13,17,43]. However, many of these are limited in ability and applicability to the cloud-FPGA environment. For example, the method proposed in [7] involves re-configuring the number of delay elements to change where the transition reaches in the delay line. This does not generalize well to the cloud-FPGA model, as the delay elements need to be compiled into the design or updated with partial reconfiguration, making it difficult to respond quickly to changing conditions in a multi-tenant FPGA.
The authors of [17] consider another primitive variant of tuning, where they connect the pulse generator to several different places in the delay line through a set of multiplexers. This allows a user to shift by configuring the input location to the delay line. This is more runtime configurable but adds complexity to the design as the clock input must be duplicated and significantly limits the range of configurability as each position of needs to be predefined.
A more configurable approach for tuning is considered in [6], which leverages partial reconfiguration for modifying the routing between each delay element. This relies on that feature being exposed to the end user, which is potentially not an option in a cloud environment. This will be slower than our approach, which makes it difficult to adjust to rapidly changing conditions in cloud FPGAs (i.e., another user's design is allocated to the board).
The authors of [43] propose that if calibration of the TDC is required, the clock to the delay line can be phase-shifted using a programmable clock generator, as we have in this work. They do not expand on this nor consider how tuning can be leveraged to avoid irregularities in the delay line or adjust to the PDN load created by other users' designs.
The classification experiment we examined in this work is an essential consideration, as multi-tenant FPGA side-channel attacks often presuppose when and what computation is running alongside the attacker, e.g., the attacker assumes that a victim is performing a cryptographic operation. The first work to propose this [15] fails to generalize to FPGAs on which training data was not collected, making it incompatible with the cloud model. It cannot be generalized because it does not consider and , so it leans heavily on the architectural features of the device for its classification. Our work rectifies this limitation and addresses a fundamental optimization step that must be taken with power fluctuation sensors. The classification network used in [15], ResNet50, is significantly more complex yet performs worse than the simple single-layer network used in this paper. This matters if the network is to be implemented in hardware for fast recognition, so that a CPA attack can be launched as soon as a cryptographic core is recognized.
To the best of our knowledge, there is no prior work demonstrating as we have done in Section 4.4. Alternative solutions that increase the sampling frequency of TDCs can reduce the importance of tuning (by increasing the likelihood a transition falls when there is activity in the PDN); however, this approach remains imprecise and limited as co-tenant frequency increases.
Mitigations: Physical isolation of co-tenants on the FPGA programmable logic [16,21] mitigates attacks that require close physical access [12,32]. Many remote attacks do not have such constraints on sensor placement [34,42]. Our work reduces the benefits of physical isolation with a sensor designed to improve the signal-to-noise ratio through and optimization.
Active fences surround the co-tenant IP core with ring oscillators or other heavy-power-draw circuits [18], which induce noise in the PDN, making it harder to extract the signal. These techniques increase power and area consumption. Our sensor improves the signal-to-noise ratio, making attacks more effective, through and optimization, as demonstrated in our results.
Krautter et al. [20] describe techniques that check the design for structures that resemble side-channel sensors. They focus on detecting sensors using ring oscillators, those that induce timing violations, data-to-clock paths, and high fanouts, which they argue are indicative of circuits used in these threat models. Our sensor is resistant to these detection techniques as it has low fanout, no timing violations, and no combinational loops.

CONCLUSION
We present the Tunable Dual-Polarity TDC, which enables a first-of-its-kind pipeline for recognizing co-tenant computation, maximizing recovered leaked information, and effectively extracting confidential information from a victim co-tenant. In a classification experiment with 13 applications, our techniques yield an 80% classification accuracy on 5-board, leave-one-out cross-validation, a 2.5× improvement over prior work. In addition, our sensor and tuning methodology improve the rate at which correct subkey values are ranked as most likely by 2.2× in a CPA attack.

ACKNOWLEDGMENTS
This material is based upon work supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE-2038238. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.