Abstract
The availability of FPGAs in cloud data centers offers rapid, on-demand access to reconfigurable hardware compute resources that users can adapt to their own needs. However, the low-level access to the FPGA hardware and associated resources such as the PCIe bus, SSD drives, or DRAM modules also opens up threats of malicious attackers uploading designs that are able to infer information about other users or about the cloud infrastructure itself. In particular, this work presents a new, fast PCIe-contention-based channel that is able to transmit data between FPGA-accelerated virtual machines (VMs) by modulating the PCIe bus usage. This channel further works with different operating systems and achieves bandwidths reaching 20 kbps with 99% accuracy. This is the first cross-FPGA covert channel demonstrated on commercial clouds and has a bandwidth which is over 2000× larger than prior voltage- or temperature-based cross-board attacks. This article further demonstrates that the PCIe receivers are able not only to receive covert transmissions, but also to perform fine-grained monitoring of the PCIe bus, including detecting when co-located VMs are initialized, even prior to their associated FPGAs being used. Moreover, the proposed mechanism can be used to infer the activities of other users, or even slow down the programming of the co-located FPGAs as well as other data transfers between the host and the FPGA. Beyond leaking information across different virtual machines, the ability to monitor the PCIe bandwidth over hours or days can be used to estimate the data center utilization and map the behavior of the other users. The article also introduces further novel threats in FPGA-accelerated instances, including contention due to network traffic, contention due to shared NVMe SSDs, as well as thermal monitoring to identify FPGA co-location using the DRAM modules attached to the FPGA boards.
This is the first work to demonstrate that it is possible to break the separation of privilege in FPGA-accelerated cloud environments, and highlights that defenses for public clouds using FPGAs need to consider PCIe, SSD, and DRAM resources as part of the attack surface that should be protected.
1 INTRODUCTION
Public cloud infrastructures with FPGA-accelerated virtual machine (VM) instances allow for easy, on-demand access to reconfigurable hardware that users can program with their own designs. The FPGA-accelerated instances can be used to accelerate machine learning, image and video manipulation, or genomic applications, for example [5]. The potential benefits of the instances with FPGAs have resulted in numerous cloud providers including Amazon Web Services (AWS) [14], Alibaba [3], Baidu [20], Huawei [37], and Tencent [59], giving public users direct access to FPGAs. However, providing users low-level access to upload their own hardware designs has resulted in serious implications for the security of cloud users and the cloud infrastructure itself. Several recent works have considered the security implications of shared FPGAs in the cloud, and have demonstrated covert-channel [29] and side-channel [33] attacks in this multi-tenant setting. However, today’s cloud providers, such as AWS with their F1 instances, only offer “single-tenant” access to FPGAs. In the single-tenant setting, each FPGA is fully dedicated to the one user who rents it, while many other users may be in parallel using their separate, dedicated FPGAs which are within the same server. Once an FPGA is released by a user, it can then be assigned to the next user who rents it. This can lead to temporal thermal covert channels [61], where heat generated by one circuit can be later observed by other circuits that are loaded onto the same FPGA. Such channels are slow (less than \( 1 \,b/\mathrm{s} \)), and are only suitable for covert communication since they require the two parties to coordinate and keep being scheduled on the same physical hardware one after the other. Other means of covert communication in the single-tenant setting do not require being assigned to the same FPGA chip. 
For example, multiple FPGA boards in servers share the same power supply, and prior work has shown the potential for such shared power supplies to leak information between FPGA boards [30]. However, the resulting covert channel was slow (less than \( 10 \,b/\mathrm{s} \)) and was only demonstrated in a lab setup.
Another single-tenant security topic that has been previously explored is that of fingerprinting FPGA instances using Physical Unclonable Functions (PUFs) [60, 62]. Fingerprinting allows users to partially map the infrastructure and get some insights about the allocation of FPGAs (e.g., how likely a user is to be re-assigned to the same physical FPGA they used before), but fingerprinting by itself does not lead to information leaks. A more recent fingerprinting-related work explored mapping FPGA infrastructures using PCIe contention to find which FPGAs are co-located in the same Non-Uniform Memory Access (NUMA) node within a server [63]. However, no prior work has successfully launched a cross-VM covert- or side-channel attack in a real cloud FPGA setting.
By contrast, our work shows that shared resources can be used to leak information across separate VMs running on the FPGA-accelerated F1 instances in AWS data centers. In particular, we use the contention of the PCIe bus to not only demonstrate a new, fast covert channel (reaching up to \( 20 \,\mathrm{k}b/\mathrm{s} \)) that persists across different operating systems but also to identify patterns of activity based on the PCIe signatures of different Amazon FPGA Images (AFIs) used by other users. This includes detecting when co-located VMs are initialized or performing an interference attack that can slow down the programming of other users’ FPGAs, or more generally degrade the transfer bandwidth between the FPGA and the host VM. Our attacks do not require special privileges or potentially malicious circuits such as Ring Oscillators (ROs) or Time-to-Digital Converters (TDCs), and thus cannot easily be detected through static analysis or Design Rule Checks (DRCs) that cloud providers may perform. We further introduce three new methods of finding co-located instances that are in the same physical server: (a) through reducing the network bandwidth via PCIe contention, (b) through resource contention of the Non-Volatile Memory Express (NVMe) SSDs that are accessible from each F1 instance via the PCIe bus, and (c) through the common thermal signatures obtained from the decay rates of each FPGA’s DRAM modules. Our work, therefore, shows that single-tenant attacks in real FPGA-accelerated cloud environments are practical, and demonstrates several ways to infer information about the operations of other cloud users and their FPGA-accelerated VMs or the data center itself.
1.1 Contributions
In summary, the contributions of this work are:
1.2 Responsible Disclosure
Our findings and a copy of this article have been shared with the AWS security team.
1.3 Article Organization
The remainder of the article is organized as follows. Section 2 provides the background on today’s deployments of FPGAs in public cloud data centers and summarizes related work. Section 3 discusses typical FPGA-accelerated cloud servers and PCIe contention that can occur among the FPGAs, while Section 4 evaluates our fast, PCIe-based, cross-VM channel. Using the ideas from the covert channel, Section 5 investigates how to infer information about other VMs through their PCIe traffic patterns, including detecting the initialization of co-located VMs, long-term PCIe monitoring of data center activity, and slowing down PCIe traffic on adjacent instances. Section 6 then presents alternative sources of information leakage due to network bandwidth contention, shared SSDs, and thermal signatures of DRAM modules. The article concludes in Section 7.
2 BACKGROUND AND RELATED WORK
This section provides a brief background on FPGAs in public cloud computing data centers, with a focus on the F1 instances from AWS [14] that are evaluated in this work. It also summarizes related work in the area of FPGA cloud security.
2.1 AWS F1 Instance Architecture
AWS has offered FPGA-accelerated VM instances to users since late 2016 [4]. These so-called F1 instances are available in three sizes:
Each FPGA board is attached to the server over a x16 PCIe Gen 3 bus. In addition, each FPGA board contains four DDR4 DRAM chips, totaling \( 64 \,\mathrm{G}{\rm i}B \) of memory per FPGA board [14]. These memories are separate from the server’s DRAM and are directly accessible by each FPGA. The F1 instances use Virtex UltraScale+ XCVU9P chips [14], which contain over \( 1.1 \) million lookup tables (LUTs), \( 2.3 \) million flip-flops (FFs), and \( 6.8 \) thousand Digital Signal Processing (DSP) blocks [69].
As has recently been shown, each server contains 8 FPGA boards, which are evenly split between two NUMA nodes [63]. The AWS server architecture deduced by Tian et al. [63] is shown in Figure 1 and is consistent with publicly available information on AWS instances [12, 14]. AWS servers containing FPGAs have two Intel Xeon E5-2686 v4 (Broadwell) processors, connected through an Intel QuickPath Interconnect (QPI) link. Each processor forms its own NUMA node with its associated DRAM and four FPGAs attached as PCIe devices. Due to this architecture, an FPGA primarily contends for PCIe bandwidth with the other FPGAs in its own NUMA node.
Fig. 1. Prior work suggested that AWS servers contain 8 FPGAs divided between two NUMA nodes [63].
2.2 Programming AWS F1 Instances
Users utilizing F1 instances do not retain entirely unrestricted control of the underlying hardware, but instead, need to adapt their hardware designs to fit within a predefined architecture. In particular, user designs are defined as “Custom Logic (CL)” modules that interact with external interfaces through the cloud-provided “Shell”, which hides physical aspects such as clocking logic and I/O pinouts (including for PCIe and DRAM) [29, 62]. This restrictive Shell interface further prevents users from accessing identifier resources, such as eFUSE and Device DNA primitives, which could be used to distinguish between different FPGA boards [29, 62]. Finally, users cannot directly upload bitstreams to the FPGAs. Instead, they generate a Design Checkpoint (DCP) file using Xilinx’s tools and then provide it to Amazon to create the final bitstream (Amazon FPGA Image, or AFI), after it has passed a number of DRCs. The checks, for example, include prohibiting combinatorial loops such as ROs as a way of protecting the underlying hardware [28, 29], though alternative designs bypassing these restrictions have been proposed [29, 57].
2.3 Related Work
Since the introduction of FPGA-accelerated cloud computing about five years ago, a number of researchers have been exploring the security aspects of FPGAs in the cloud. A key feature differentiating such research from prior work on FPGA security outside of cloud environments is the threat model, which assumes remote attackers without physical access to or modifications of the FPGA boards. This section summarizes selected work that is applicable to the cloud setting, leaving traditional FPGA security topics to existing books [38] or surveys [26, 39, 50, 73].
2.3.1 PCIe-Based Threats.
The Peripheral Component Interconnect Express (PCIe) standard provides a high-bandwidth, point-to-point, full-duplex interface for connecting peripherals within servers. Existing work has shown that PCIe switches can cause bottlenecks in multi-GPU systems [21, 25, 27, 55, 56], leading to severe stalls due to their scheduling policy [44]. In terms of PCIe contention in FPGA-accelerated cloud environments, prior work has shown that different driver implementations result in different overheads [66] and that changes in PCIe bandwidth can be used to co-locate different instances on the same server [63]. In parallel to this work, PCIe contention was used for side-channel attacks which can recover the workload of GPUs and NICs via changes in the delay of PCIe responses [58]. Our work is similar but presents the first successful cross-VM attacks using PCIe contention on a real public cloud. Moreover, by going beyond just PCIe, our work is able to deduce cross-NUMA-node co-location using the DRAM thermal fingerprinting approach.
2.3.2 Power-Based Threats.
Computations that cause data-dependent power consumption can result in information leaks that can be detected even by adversaries without physical access to the device under attack. For example, it is known that a shared power supply in a server can be used to leak information between different FPGAs, where one FPGA modulates power consumption and the other measures the resulting voltage fluctuations [30]. However, such work results in low transmission rates (below \( 10 \,b/\mathrm{s} \)), and has only been demonstrated in a lab environment.
Other work has shown that it is possible to develop stressor circuits which modulate the overall power consumption of an FPGA and generate substantial heat, for instance by using ROs or transient short circuits [1, 2, 35]. These large power draws can be used for fault attacks [40], or as Denial-of-Service (DoS) attacks [42] which simply make the hardware unavailable for an extended period of time. Such attacks could also prematurely age FPGAs due to sustained excessive heat [19]. Our work has instead focused on information leaks and non-destructive reverse-engineering of the cloud infrastructure.
2.3.3 Thermal-Based Threats.
It is now well-known that it is possible to implement temperature sensors suitable for thermal monitoring on FPGAs using ROs [23], whose frequency drifts in response to temperature variations [45, 46, 65, 72]. A receiver FPGA could thus use an RO to observe the ambient temperature of a data center. For example, existing work [61] has explored a new type of temporal thermal attack: heat generated by one circuit can be later observed by other circuits that are loaded onto the same FPGA. This type of attack is able to leak information between different users of an FPGA who are assigned to the same FPGA over time. However, the bandwidth of temporal attacks is low (less than \( 1 \,b/\mathrm{s} \)), while our covert channels can reach a bandwidth of up to \( 20 \,\mathrm{k}b/\mathrm{s} \).
2.3.4 DRAM-Based Threats.
Recent work has shown that direct control of the DRAM connected to the FPGA boards can be used to fingerprint them [62]. This can be combined with existing work [63] to build a map of the cloud data centers where FPGAs are used. Such fingerprinting does not by itself, however, help with cross-VM covert channels, as it does not provide co-location information. By contrast, our PCIe, NIC, SSD, and DRAM approaches are able to co-locate instances in the same server and enable cross-VM covert channels and information leaks.
2.3.5 Multi-Tenant Security.
This work has focused on the single-tenant setting, where each user gets full access to the FPGA, and thus reflects the current environment offered by cloud providers. However, there is also a large body of security research in the multi-tenant context, where a single FPGA is shared by multiple, logically (and potentially physically) isolated users. For example, several researchers have shown how to recover information about the structure [64, 74] or inputs [51] of machine learning models or cause timing faults to reduce their accuracy [24, 54]. Other work in this area has shown that crosstalk due to routing wires [28] and logic elements [31] inside the FPGA chips can be used to leak static signals, while voltage drops due to dynamic signals can lead to covert-channel [29], side-channel [33, 36], and fault [52] attacks. Several works have also tried to address such issues to enable multi-tenant applications, proposing static checks [41, 43], voltage monitors [34, 48, 53], or a combination of the two [42]. Our work on PCIe, SSD, and DRAM threats is orthogonal to such work but is directly applicable to current cloud FPGA deployments.
3 PCIE CONTENTION IN CLOUD FPGAS
The user’s CL running on the FPGA instances can use the Shell to communicate with the server through the PCIe bus. Users cannot directly control the PCIe transactions, but, instead, perform simple reads and writes to predefined address ranges through the Shell. These memory accesses get translated into PCIe commands and PCIe data transfers between the server and the FPGA. Users may also set up Direct Memory Access (DMA) transfers between the FPGA and the server. By designing hardware modules with low logic overhead, users can generate transfers fast enough to saturate the PCIe bandwidth. In fact, because of the shared PCIe bus within each NUMA node, these transfers can create interference and bus contention that affects the PCIe bandwidth of other users. The resulting performance degradation can be used for detecting co-location [63], or, as we show in this work, for fast covert- and side-channel attacks, breaking the isolation between otherwise logically and physically separate VM instances.
In our covert-channel analysis (Section 4), we show that the communication bandwidth is not identical for all pairs of FPGAs in a NUMA node. In particular, this suggests that the 4 PCIe devices are not directly connected to each CPU, but instead likely go through two separate switches, forming the hierarchy shown in Figure 2, improving the deduced model of prior work [63]. Although not publicly confirmed by AWS, this topology is similar to the one described for P4d instances, which contain 8 GPUs [7]. As a result, even though all 4 FPGAs in a NUMA node contend with each other, the covert-channel bandwidth is highest amongst those sharing a PCIe switch, due to the bottleneck imposed by the shared link (Section 4).
Fig. 2. The newly-deduced PCIe configuration for F1 servers is based on the experiments in this work: each CPU has two PCIe links, each of which provides connectivity to two FPGAs, an NVMe SSD, and an NIC through a PCIe switch.
We also expand on the model to show that the PCIe switches provide connectivity to an NVMe SSD drive and a Network Interface Card (NIC), thereby identifying additional sources of PCIe contention that enlarge the attack surface. Finally, as we show in Section 4.5, how effectively these PCIe links can be saturated also depends on the operating system and kernel configuration, not just on the user-level software and the underlying hardware architecture.
4 CROSS-VM COVERT CHANNELS
In this section, we describe our implementation for the first cross-FPGA covert-channel on public clouds (Section 4.1) and discuss our experimental setup (Section 4.2). We then analyze bandwidth vs. accuracy tradeoffs (Section 4.3), before investigating the impact of receiver and transmitter transfer sizes on the covert-channel accuracy for a given covert-channel bandwidth (Section 4.4). We finish the section by discussing differences in the covert-channel bandwidth between VMs using different operating systems (Section 4.5). Side channels and information leaks based on PCIe contention from other VMs are discussed in Section 5.
4.1 Covert-Channel Implementation
Our covert channel is based on saturating the PCIe link between the FPGA and the server, so, at their core, both the transmitter and the receiver consist of (a) an FPGA image that interfaces with the host over PCIe with minimal latency in accepting write requests or responding to read requests, and (b) software that attaches to the FPGA and repeatedly writes to (or reads from) the mapped Base Address Register (BAR). These requests are translated to PCIe transactions, transmitted over the data and physical layers, and then relayed to the CL hardware through the shell (SH) logic as AXI reads or writes. The transmitter stresses its PCIe link to transmit a 1, but remains idle to transmit a bit 0, while the receiver keeps measuring its own bandwidth during the transmission period (the receiver is thus identical to a transmitter that sends a 1 during every measurement period). The receiver then classifies the received bit as a 1 if the bandwidth \( B \) has dropped below a threshold \( T \) and as 0 otherwise.
The two communicating parties need to have agreed upon some minimal information prior to the transmissions: the specific data center to use (region and availability zone, e.g.,
Fig. 3. Example cross-VM covert communication: The transmitter (Alice) sends the ASCII byte “H”, represented as 01001000 in binary, to the receiver (Bob) in 8 intervals by stressing her PCIe bandwidth to transmit a 1 and remaining idle to transmit a 0. If Bob’s FPGA bandwidth \( B \) drops below a threshold \( T \) , he detects a 1, otherwise, a 0 is detected. To ensure no residual effects after each transmission, the time difference \( \delta \) between successive measurements is slightly larger than the transmission duration \( d \) .
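The threshold-based decoding illustrated in Figure 3 can be sketched as follows. This is a simulation of the decision logic only (not the authors' code), and the bandwidth values are illustrative: samples near 900 MB/s stand in for a contended link, samples near 3100 MB/s for an idle one.

```python
# Hedged sketch: threshold-based decoding of the covert channel.
# Bandwidth samples (MB/s) below the threshold T indicate the
# transmitter was stressing the shared PCIe bus (bit 1).

def decode_bits(bandwidth_samples, threshold):
    """Classify each per-interval bandwidth measurement as a bit."""
    return [1 if b < threshold else 0 for b in bandwidth_samples]

def bits_to_byte(bits):
    """Pack 8 bits (MSB first) into one byte value."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value

# Example mirroring Figure 3: the ASCII byte "H" (01001000).
samples = [3100, 900, 3050, 3120, 900, 3080, 3090, 3110]
bits = decode_bits(samples, threshold=2000)
assert bits == [0, 1, 0, 0, 1, 0, 0, 0]
assert chr(bits_to_byte(bits)) == "H"
```

The receiver only ever observes its own bandwidth, so this classification requires no cooperation from the hypervisor or any privileged interface.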
Before they can communicate, the two parties (Alice and Bob in the example of Figure 3) first need to ensure that they are co-located on the same NUMA node within the server. To do so, they can launch multiple instances at or near an agreed upon time and attempt to detect whether any of their instances are co-located by sending handshake messages and expecting a handshake response, using the same setup information as for the covert channel itself (i.e., the time \( t^{\prime } \) to start the communication, the measurement duration \( \delta \), and location information such as the data center region and availability zone). They additionally need to have agreed on the handshake message \( H \), which determines the per-handshake measurement duration \( \Delta \). This co-location process is summarized in Figure 4. Note that as prior work has shown [63], by launching multiple instances, the probability for co-location is high, but the two parties would have to agree on a “timeout” approach. For instance, they could have a maximum number of handshake attempts \( M \), after which they re-launch instances at time \( t^{\prime }+M\cdot \Delta \), or launch additional instances for every unsuccessful handshake attempt (e.g., after attempt 1, Alice and Bob both launch a new instance, while Alice terminates instance 1).
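Because both parties derive the schedule from the same pre-agreed parameters, each can compute the attempt start times and the re-launch deadline independently. The sketch below illustrates this timing arithmetic under the assumptions stated in the text (a start time \( t^{\prime } \), per-bit duration \( \delta \), handshake message \( H \), and a maximum of \( M \) attempts); all function and variable names are illustrative.

```python
# Hedged sketch of the handshake timing described above. One attempt
# lasts Delta = len(H) * delta; after M failed attempts, both parties
# re-launch instances at t' + M * Delta.

def handshake_schedule(t_prime, delta, handshake_bits, max_attempts):
    """Return the start time of each handshake attempt and the re-launch time."""
    per_attempt = len(handshake_bits) * delta   # Delta: one full message
    starts = [t_prime + i * per_attempt for i in range(max_attempts)]
    relaunch_at = t_prime + max_attempts * per_attempt
    return starts, relaunch_at

H = [0, 1] * 64                  # a 128-bit pre-agreed message
starts, relaunch = handshake_schedule(t_prime=0.0, delta=0.005,
                                      handshake_bits=H, max_attempts=4)
# Each attempt takes 128 * 5 ms = 0.64 s; re-launch happens at M * Delta.
assert len(starts) == 4
assert abs(starts[1] - 0.64) < 1e-9
assert abs(relaunch - 2.56) < 1e-9
```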
Fig. 4. The process to find a pair of co-located f1.2xlarge instances using PCIe contention uses the covert-channel mechanism to check for pre-agreed handshake messages: Alice transmits the handshake message with her first FPGA, and waits to see if Bob acknowledges the handshake message. In parallel, Bob measures the bandwidths of all his FPGAs. In this example, Bob detects the contention in his seventh FPGA during the fourth handshake attempt. Note that Alice and Bob can rent any number of FPGAs for finding co-location, with five and seven shown in this figure as an example.
In our experiments, we typically launch 5 instances per user at the same time in the same region and availability zone, have a 128-bit handshake message \( H \), and consider the co-location attempt successful if the message was recovered with \( \ge \!\!80\% \) accuracy. Other fixed parameters, such as the measurement duration or transfer sizes, were informed by early manual experiments and the work in [63] to ensure we can reliably detect co-location. Note that these parameters can be different from those used after the co-location has been established. For instance, co-location detection can use low-bandwidth transfers (e.g., \( 200 \,b/\mathrm{s} \)) that are reliable across all NUMA node setups, and can be increased as part of the setup process, once co-location has been established.
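The success criterion above can be made concrete with a short sketch: co-location is declared when the received bits match the pre-agreed 128-bit message with at least 80% accuracy. The names below are illustrative, and the example message is an arbitrary stand-in for \( H \).

```python
# Hedged sketch: deciding whether a handshake attempt succeeded.

def handshake_accuracy(received, expected):
    """Fraction of bit positions where the received message matches."""
    matches = sum(r == e for r, e in zip(received, expected))
    return matches / len(expected)

def is_colocated(received, expected, min_accuracy=0.80):
    return handshake_accuracy(received, expected) >= min_accuracy

H = [1, 0, 1, 1] * 32            # illustrative 128-bit handshake message
noisy = H[:]
for i in range(20):              # flip 20 of 128 bits (84.4% accuracy)
    noisy[i] ^= 1
assert is_colocated(noisy, H)    # above the 80% threshold

random_like = [0] * 128          # uncorrelated measurements fail the check
assert not is_colocated(random_like, H)
```

The 80% cutoff tolerates individual mis-read bits while still rejecting the roughly 50%-accuracy matches that uncorrelated bandwidth noise would produce.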
During the co-location process, the two communicating parties can also establish what the threshold \( T \) should be and whether the communication bandwidth should be increased. As shown in [63], the PCIe bandwidth of an instance drops from over \( 3000 \,\mathrm{M}B/\mathrm{s} \) to under \( 1000 \,\mathrm{M}B/\mathrm{s} \) when there is an external PCIe stressor. As a result, this threshold \( T \) could be simply hardcoded (at, say, \( 2000 \,\mathrm{M}B/\mathrm{s} \)), or be adaptive, as the mid-point between the minimum \( b_m \) and maximum \( b_M \) bandwidths recorded during a successful handshake. The latter is the approach we use in our work: if the two instances are not co-located, \( b_m\approx b_M \), and the decoded bits will be random, and hence will not match the expected handshake message \( H \). If the two instances are co-located, \( b_M\gg b_m \) (assuming \( H \) contains at least one 0 and at least one 1), so any bit 1 will correspond to a bandwidth \( b_1\approx b_m \ll (b_m+b_M)/2=T \) and any bit 0 will result in bandwidth \( b_0\approx b_M \gg (b_m+b_M)/2=T \).
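The adaptive threshold described above reduces to a midpoint computation over the handshake measurements, sketched here with illustrative bandwidth values (in MB/s):

```python
# Hedged sketch of the adaptive threshold: T is the midpoint of the
# minimum (b_m) and maximum (b_M) bandwidths recorded during a
# successful handshake.

def adaptive_threshold(handshake_bandwidths):
    b_min = min(handshake_bandwidths)
    b_max = max(handshake_bandwidths)
    return (b_min + b_max) / 2

# Co-located case: stressed intervals drop to ~900 MB/s while idle ones
# stay near 3100 MB/s, so T lands cleanly between the two clusters.
T = adaptive_threshold([3100, 900, 3050, 950, 3080])
assert T == (900 + 3100) / 2
assert 950 < T < 3050
```

If the instances are not co-located, all samples cluster together (\( b_m\approx b_M \)), the midpoint sits inside the noise band, and the decoded bits fail to match \( H \), which is exactly what makes the handshake double as a co-location test.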

4.2 Experimental Setup
For the majority of our experiments, we use VMs with AWS FPGA Developer Amazon Machine Image (AMI) [17] version 1.8.1, which runs CentOS 7.7.1908, and develop our code with the Hardware and Software Development Kit (HDK/SDK) version 1.4.15 [8]. We vary the operating systems used for the transmitters and receivers and significantly improve the covert-channel bandwidth in Section 4.5. For our FPGA bitstream, we use the unmodified
Unless otherwise noted, we primarily perform experiments with “spot” instances in the
4.3 Bandwidth vs. Accuracy Tradeoffs
Using our co-location mechanism, we are able to easily find 4 distinct
The results of our experiments, shown in Figure 5, indicate that we can create a fast covert channel between any two FPGAs in either direction: at \( 200 \,b/\mathrm{s} \) and below, the accuracy of the covert channel is 100%, with the accuracy at \( 250 \,b/\mathrm{s} \) dropping to 99% for just one pair. At \( 500 \,b/\mathrm{s} \), three of the six possible pairs can communicate at 100% accuracy, while one pair can communicate with 97% accuracy at \( 2 \,\mathrm{k}b/\mathrm{s} \) (and sharply falls to 70% accuracy even at \( 2.5 \,\mathrm{k}b/\mathrm{s} \)—though in Section 4.5 we show that bandwidths of \( 20 \,\mathrm{k}b/\mathrm{s} \) at 99% accuracy are possible). It should be noted that, as expected, the bandwidth within any given pair is symmetric, i.e., it remains the same when the roles of the transmitter and the receiver are reversed. As the VMs occupy a full NUMA node, there should not be any impact from other users’ traffic. The variable bandwidth between different pairs is therefore likely due to the PCIe topology.
Fig. 5. Bandwidth and accuracy for covert-channel transmissions between any pair of FPGAs, among the four FPGAs in the same NUMA node. Each FPGA pair is color-coded, with transmitters indicated through different markers and receivers through different line styles. For any given pair, the bandwidth is approximately the same in each direction, i.e., the bandwidth from FPGA \( X \) to FPGA \( Y \) is approximately the same as the bandwidth from FPGA \( Y \) to FPGA \( X \) . Communication is possible between any two FPGAs in the NUMA node, but the bandwidths for different pairs diverge.
4.4 Transfer Sizes
In this set of experiments, we fix \( d=4 \,\mathrm{m}\mathrm{s} \), \( \delta =5 \,\mathrm{m}\mathrm{s} \) (i.e., a covert-channel bandwidth of \( 200 \,b/\mathrm{s} \)), and vary the transmitter and receiver transfer sizes. Figure 6 first shows the per-pair channel accuracy for different transmitter sizes. The results show that at \( 4 \,\mathrm{k}B \) and above, the covert-channel accuracy is 100%, while it becomes much lower at smaller transfer sizes. This is because sending smaller chunks of data over PCIe results in lower bandwidth due to the associated PCIe overhead of each transaction. For example, in one 4 ms transmission, the transmitter completes 140301 transfers of \( 1 \,B \) each, corresponding to a PCIe bandwidth of only \( 1 \,B\times 140301/4 \,\mathrm{m}\mathrm{s}=33.5 \,\mathrm{M}B/\mathrm{s} \). However, at the same time, a transmitter can complete 1890 transfers of \( 4 \,\mathrm{k}B \), for a PCIe bandwidth of \( 4 \,\mathrm{k}B\times 1890/4 \,\mathrm{m}\mathrm{s}=1.8 \,\mathrm{G}B/\mathrm{s} \).
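The arithmetic above can be checked directly. The sketch below reproduces both figures from the transfer counts quoted in the text; note that the quoted values correspond to binary units (MiB/s and GiB/s).

```python
# Hedged sketch reproducing the bandwidth arithmetic above: effective
# PCIe bandwidth as a function of per-transfer chunk size, given the
# number of transfers completed in one 4 ms transmission window.

def effective_bandwidth(chunk_bytes, transfers, window_s):
    """Bytes per second moved during one measurement window."""
    return chunk_bytes * transfers / window_s

MiB, GiB = 2**20, 2**30
small = effective_bandwidth(chunk_bytes=1, transfers=140301, window_s=0.004)
large = effective_bandwidth(chunk_bytes=4096, transfers=1890, window_s=0.004)
# 1-byte transfers reach only ~33.5 MiB/s; 4 kB transfers reach ~1.8 GiB/s,
# a >50x gap caused purely by per-transaction PCIe overhead.
assert round(small / MiB, 1) == 33.5
assert round(large / GiB, 1) == 1.8
```

This overhead-dominated regime is why sub-\( 4 \,\mathrm{k}B \) transmitter chunks cannot saturate the link and therefore cannot create enough contention for the receiver to detect.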
Fig. 6. Covert-channel accuracy for different transmitter transfer sizes. Each chunk transmitted over PCIe needs to be \( \ge \!\!4 \,\mathrm{k}B \) to ensure an accuracy of 100% at \( 200 \,b/\mathrm{s} \) between any two FPGAs in the NUMA node.
The results of the corresponding experiments for receiver transfer sizes are shown in Figure 7. Similar to the transmitter experiments, very small transfer sizes are unsuitable for the covert channel due to the low resulting bandwidth. However, unlike in the transmitter case, large receiver transfer sizes are also problematic, as the number of transfers completed within each measurement interval is too small to be able to distinguish between external transmissions and the inherent measurement noise.
Fig. 7. Covert-channel accuracy for different receiver transfer sizes. Chunks between \( 64 \,B \) and \( 4 \,\mathrm{k}B \) are suitable for 100% accuracies, but sizes outside this range result in a drop in accuracy for at least one pair of FPGAs in the NUMA node.
4.5 Operating Systems
Starting with FPGA AMI version 1.10.0, Amazon has provided AMIs based on Amazon Linux 2 (AL2) [18] alongside AMIs based on CentOS [17] (both using the Xilinx-provided XOCL PCIe driver). AL2 uses a Linux kernel that has been tuned for the AWS infrastructure [15], and may therefore impact the performance of the covert channel. Since the attacker does not have control over the victim’s VM, it is necessary to explore the effect of the operating system on our communication channel, and thus experiment with both types of operating systems as receivers and transmitters. We use the co-location methodology of Section 4.1 to find different instances that are in the same NUMA node, and report the accuracy of our cross-VM covert channel from bandwidths as low as \( 0.1 \,\mathrm{k}b/\mathrm{s} \) to as high as \( 66.6 \,\mathrm{k}b/\mathrm{s} \). As described in Section 3 and shown in Figure 2, each NUMA node consists of 4 distinct
Fig. 8. Bandwidth and accuracy for covert-channel transmissions between any pair of four co-located instances, where three instances are running AL2 and the last one is running CentOS. Each FPGA pair is color-coded, with transmitters indicated through different markers and receivers through different line styles.
Fig. 9. Bandwidth and accuracy for covert-channel transmissions between any pair of four co-located instances, where two instances are running AL2 and the other two are running CentOS.
Fig. 10. Bandwidth and accuracy for covert-channel transmissions between any pair of four co-located instances, where only one instance is running AL2 and the remaining are running CentOS.
Figure 8 shows the covert channel bandwidths for all FPGA pairs, where one instance is running CentOS and the remaining three are running AL2. For any pair of AL2 instances, the covert-channel accuracy at \( 20 \,\mathrm{k}b/\mathrm{s} \) is over 90% (in fact, reaching 99%), and for a subset of those pairs remains above 80% at even \( 40 \,\mathrm{k}b/\mathrm{s} \). However, when a CentOS instance is involved, the bandwidth drops to \( 0.5 \,\mathrm{k}b/\mathrm{s} \), for either direction of communication.
Figures 9 and 10 show that, depending on where the instances are on the PCIe topology, the bandwidth can vary. Indeed, Figure 9 shows that the bandwidth for an AL2 transmitter and a CentOS receiver can reach \( 2.5 \,\mathrm{k}b/\mathrm{s} \) at 98% accuracy, but CentOS transmitters and AL2 receivers generally have bandwidths below \( 0.5 \,\mathrm{k}b/\mathrm{s} \), though in repeated individual experiments (outside of a full NUMA node), we have been able to get a channel at \( 5.9 \,\mathrm{k}b/\mathrm{s} \) at 95% accuracy. The CentOS-CentOS results of Figure 10 are consistent with those of Section 4.3, with bandwidths between \( 250 \,b/\mathrm{s} \) and \( 1.4 \,\mathrm{k}b/\mathrm{s} \) for all but the fastest pair of instances. Table 1 summarizes these results, while Table 2 compares the achieved bandwidths to prior work in cross-FPGA communications.
Table 1. Cross-VM Covert Channel Bandwidth for Different Receiver and Transmitter Operating Systems
Table 2. Cross-FPGA Covert Channel Bandwidth Achieved by Different Works
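The decision rule underlying these covert-channel measurements can be illustrated with a short Python sketch. The receiver decodes on-off-keyed bits by thresholding its per-window PCIe bandwidth between the contended and uncontended levels; the bandwidth values and threshold below are synthetic placeholders, not our measured figures:

```python
# Sketch of the receiver's decision rule for a PCIe on-off-keying channel:
# the transmitter saturates the bus for a '1' and stays idle for a '0';
# the receiver compares its measured bandwidth in each bit window against
# a threshold between the contended and uncontended levels.
# All numbers here are illustrative, not measurements.

def decode_ook(samples_mbps, threshold_mbps=1100.0):
    """Map per-window bandwidth measurements to bits: contention => '1'."""
    return [1 if bw < threshold_mbps else 0 for bw in samples_mbps]

# Uncontended windows near 1600 MB/s, contended windows near 600 MB/s.
trace = [1580.0, 620.5, 640.2, 1601.3, 598.7, 1575.9]
print(decode_ook(trace))  # -> [0, 1, 1, 0, 1, 0]
```

In practice, the threshold would have to be calibrated per instance pair, since the measured levels vary with the operating system and the position of the instances in the PCIe topology.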
5 CROSS-VM SIDE-CHANNEL LEAKS
In this section, we explore what kinds of information malicious adversaries can infer about computations performed by non-cooperating victim users that are co-located in the same NUMA node in different, logically isolated VMs. We first show that the PCIe activity of an off-the-shelf video-processing AMI from the AWS Marketplace leaks information about the resolution and bitrate properties of the video being processed, allowing adversaries to infer the activity of different users (Section 5.1). We then show that it is possible to detect when a VM in the same NUMA node is being initialized (Section 5.2), and more generally monitor the PCIe bus over a long period of time (Section 5.3). We finally show that PCIe contention can be used for interference attacks, including slowing down the programming of the FPGA itself, or of other data transfer communications between the FPGA and the host VM (Section 5.4). The attacks of this and the next section are summarized in Figure 11.
5.1 Inferring User Activity
To help users accelerate various types of computations on F1 FPGA instances, the AWS Marketplace lists numerous VM images created and sold by independent software vendors [16]. Users can purchase instances with pre-loaded software and hardware FPGA designs for data analytics, machine learning, and other applications, and deploy them directly on the AWS Elastic Cloud Compute (EC2) platform. AWS Marketplace products are usually delivered as AMIs, each of which provides the VM setup, system environment settings, and all the required programs for the application that is being sold. AWS Marketplace instances for FPGAs naturally use PCIe to communicate between the software and the hardware of the purchased instance. In this section, we first introduce an AMI we purchased to test as the victim software and hardware design (Section 5.1.1), and then discuss the recovery of potentially private information from the victim AMI’s activity by running a co-located receiver VM that monitors the victim’s PCIe activity (Section 5.1.2).
5.1.1 Experimental Setup.
Among the different hardware accelerator solutions for cloud FPGAs, in this section, we target video processing using the DeepField AMI, which leverages FPGAs to accelerate the Video Super-Resolution (VSR) algorithm to convert low-resolution videos to high-resolution ones [22]. The DeepField AMI is based on AL2, and sets up the system environment to make use of the proprietary, pre-trained neural network models [22]. To use the AMI, the VM software first loads the AFI onto the associated FPGA using the AWS FPGA management tools [8].
For our experiments, we first launch a group of f1.2xlarge instances and use our PCIe-contention handshake to identify instances that are co-located in the same NUMA node, designating one as the victim and another as the attacker.
The victim VM then runs the unmodified DeepField AMI to convert different lower-resolution videos to higher-resolution ones using the provided pre-trained neural network models, while the attacker monitors the PCIe bandwidth.
5.1.2 Leaking Private Information from Marketplace AMIs.
We now show that private information regarding the activities of co-located instances can be revealed through the PCIe bandwidth traces. Figure 12 shows the PCIe bandwidth measured by the attacker while the victim is running the DeepField AMI on an instance in the same NUMA node. The traces for input videos of different resolutions and frame rates differ in both their duration and their shape, allowing the attacker to infer properties of the video being processed.
Fig. 12. PCIe bandwidth traces collected by the attacker while the victim runs the DeepField AMI to perform VSR conversions with input videos of different resolutions and frame rates. Within each sub-figure, the red lines label the start and end of the VSR conversion on the FPGA.
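The inference described above reduces to extracting simple features from the monitored bandwidth trace, such as the duration and intensity of the conversion burst. The following is a minimal sketch, not the actual attack code; the trace values, idle threshold, and feature set are illustrative assumptions:

```python
# Sketch: extract coarse features from a PCIe bandwidth trace (MB/s per
# sampling window) that can distinguish victim workloads. Synthetic data.

def conversion_window(trace, idle=50.0):
    """Return (start, end) indices of the activity burst above the idle level."""
    active = [i for i, bw in enumerate(trace) if bw > idle]
    return (active[0], active[-1]) if active else None

def fingerprint(trace, idle=50.0):
    """Duration and mean bandwidth of the burst; both vary with the video
    resolution and frame rate being converted."""
    window = conversion_window(trace, idle)
    if window is None:
        return None
    start, end = window
    burst = trace[start:end + 1]
    return {"duration": end - start + 1,
            "mean_mbps": sum(burst) / len(burst)}

# Idle periods around 10 MB/s, an 8-window conversion burst around 400 MB/s.
print(fingerprint([10.0] * 5 + [400.0] * 8 + [10.0] * 5))
```

Features such as these can then be compared against reference traces collected by the attacker for known inputs, turning the side channel into a workload classifier.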
5.2 Detecting Instance Initialization
In the experiments of this work, we have thus far only focused on covert communication and side-channel information leakage between VM instances that have already been initialized. By contrast, in this section, we show for the first time that the instance initialization process can also be detected by monitoring the bandwidth of the PCIe bus. Indeed, on AWS, there is a time lag between when a user requests that an instance with a target AMI be launched and when it is provisioned, initialized, and ready for the user to connect to it over SSH. This process can take multiple minutes, and, as we show in this work, causes significant PCIe traffic that is measurable by co-located adversaries.
For our experiments, we first create an f1.2xlarge monitoring instance, INST-A, which continuously measures the PCIe bandwidth while five additional instances are launched in sequence.
Figure 13 plots the PCIe bandwidth of the monitoring instance INST-A, along with three reference lines for each of the five instance initializations:
Fig. 13. Detecting the VM initialization process for co-located f1.2xlarge instances by monitoring the PCIe traffic. In this experiment, five new instances are created in sequence, of which the last three happen to be co-located with the monitoring instance.
- “Create VM” denotes the request for initializing a new VM.
- “Finish Init” means that the VM has been initialized, which we define as being able to SSH into the VM instance.
- “Terminate VM” indicates the request for shutting down the VM.
For each VM, we load the PCIe transmitter AFI and software and attempt a handshake between the “Finish Init” and “Terminate VM” steps. The handshake results suggest that the last three instances are co-located with INST-A but the first two are not. Incidentally, the last three instances also cause large PCIe bandwidth drops (from \( 1600 \,\mathrm{M}B/\mathrm{s} \) to \( 600 \,\mathrm{M}B/\mathrm{s} \)) during their initialization process, as shown in Figure 13. The PCIe bandwidth stays stable for the first two instances, as they are not co-located with INST-A. Note that this bandwidth drop occurs before we can SSH into the instances, and therefore reflects the initialization process itself. Moreover, it is worth noting that the termination step is not reflected in the PCIe trace, indicating a potentially lazy termination process that does not require heavy data transfers. The ability to detect when other users are being allocated to the same NUMA node not only helps with the covert-channel handshaking process of Section 4.1, but can also alert non-adversarial users to potential interference from other users so that they can tweak their applications to expect slower transfers.
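Detecting initialization in this way amounts to finding sustained dips in the monitored bandwidth trace. A simplified detector along these lines is sketched below; the threshold and minimum run length are illustrative, not the values used in our measurements:

```python
# Sketch: flag sustained PCIe bandwidth dips as candidate co-located VM
# initializations. Thresholds and trace values are synthetic.

def detect_init_events(trace, threshold=1000.0, min_len=3):
    """Return (start, end) index pairs of runs of at least `min_len`
    consecutive samples below `threshold`."""
    events, run_start = [], None
    for i, bw in enumerate(trace + [float("inf")]):  # sentinel flushes final run
        if bw < threshold:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_len:
                events.append((run_start, i - 1))
            run_start = None
    return events

# One long dip (an initialization) and one short glitch that is ignored.
trace = [1600.0] * 4 + [600.0] * 5 + [1600.0] * 3 + [900.0] * 2 + [1600.0] * 2
print(detect_init_events(trace))  # -> [(4, 8)]
```

Requiring a minimum run length filters out short transfers by already-running neighbors, which cause brief dips rather than the minutes-long drops seen during provisioning.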
5.3 Long-Term PCIe Monitoring
In this section, we present the results of measuring the PCIe bandwidth for two on-demand f1.2xlarge instances over long periods of time, shown in Figures 14 and 15. The measured bandwidth varies over the course of hours, reflecting the aggregate PCIe activity of co-located users, and can therefore be used to estimate data center utilization and map the behavior of other users.
Fig. 14. Long-term PCIe-based data center monitoring between the evening of April 25 and the early morning of April 26, with \( d=4 \,\mathrm{m}\mathrm{s} \) and \( \delta =5 \,\mathrm{m}\mathrm{s} \) on an f1.2xlarge on-demand instance.
Fig. 15. Long-term PCIe-based data center monitoring on a different f1.2xlarge on-demand instance with \( d=18 \,\mathrm{m}\mathrm{s} \) and \( \delta =20 \,\mathrm{m}\mathrm{s} \) .
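The long-term traces in Figures 14 and 15 are gathered by repeatedly measuring for a duration \( d \) once every interval \( \delta \). A schematic of such a sampling loop is shown below, with the actual FPGA transfer probe abstracted behind a caller-supplied `measure` function (an illustrative interface, not our measurement code):

```python
# Sketch: periodic bandwidth sampling with a measurement burst of length d
# scheduled once every interval delta (both in seconds).
import time

def monitor(measure, d, delta, n):
    """Collect n samples, calling measure(d) at the start of each delta-long
    slot. `measure` stands in for the FPGA transfer probe; here it can be
    any callable returning a bandwidth figure."""
    samples = []
    next_slot = time.monotonic()
    for _ in range(n):
        samples.append(measure(d))
        next_slot += delta
        # Sleep out the remainder of the slot so sampling stays periodic
        # even if the measurement itself takes a variable amount of time.
        time.sleep(max(0.0, next_slot - time.monotonic()))
    return samples

# With d=4 ms and delta=5 ms (the parameters of Figure 14), three slots:
print(monitor(lambda d: 42.0, 0.004, 0.005, 3))
```

Keeping \( d \) slightly smaller than \( \delta \) leaves a guard interval between bursts, so one measurement never spills into the next slot.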
5.4 Interference Attacks
The PCIe contention mechanism we have uncovered can also be used to degrade the performance of co-located applications by other users. Indeed, as we have shown in a prior work [63], the bandwidth can fall from \( 3 \,\mathrm{G}B/\mathrm{s} \) to under \( 1 \,\mathrm{G}B/\mathrm{s} \) using just one PCIe stressor (transmitter), and to below \( 200 \,\mathrm{M}B/\mathrm{s} \) when using two stressors.
To exemplify how the reduced PCIe bandwidth can affect user applications, we again find a full NUMA node with four co-located VMs, but only use three of them. Specifically, the first VM is running the DeepField AMI VSR algorithm [22], and represents the victim user. The second VM is monitoring the PCIe bandwidth (similar to the experiments of Section 5.1), while the third acts as a PCIe stressor. The fourth one is unused and left idle, to avoid unintended interference. To further minimize any other external effects, the VSR computation in Figure 16 is repeated five times in sequence. As Figure 16 shows, the PCIe bandwidth measured by the monitoring instance drops from over \( 1950 \,\mathrm{M}B/\mathrm{s} \) to under \( 650 \,\mathrm{M}B/\mathrm{s} \), and the conversion time in the victim instance increases by 33%. In addition to slowing down the victim application, when using a stressor, the attacker can extract even more fine-grained information about the victim. Indeed, as Figure 16(b) shows, the boundary between the five repetitions becomes clear, aiding the AMI fingerprinting attacks discussed in Section 5.1.
Fig. 16. PCIe bandwidth traces collected by the monitoring instance while the victim instance runs the DeepField AMI to perform a VSR conversion of the same video five consecutive times, (a) without and (b) with the third instance acting as a PCIe stressor. Within each sub-figure, the red lines label the start and end of the VSR conversion on the FPGA.
Furthermore, one particular, and perhaps unexpected, consequence of the reduced PCIe bandwidth is a more time-consuming programming process that can, in some cases, be more than tripled. To investigate this effect, we measure the FPGA programming time in one of the instances (INST-A) under different conditions including:
(1) Whether a PCIe bandwidth-hogging application is running on a second instance, INST-B.
(2) Whether just the CL or both the CL and the FPGA shell (SH) are reloaded during reprogramming.
(3) The size of the loaded AFI in terms of the logic resources used (see Table 3). Because AWS uses partial reconfiguration [9], “the size of a partial bitstream is directly proportional to the size of the region it is reconfiguring” [68], with larger images therefore requiring more data transfers from the host to the FPGA device.
Table 3. Resources Used by the Three AFIs Tested
The results of our experiments are summarized in Figure 17, where three AFIs of different sizes are loaded onto INST-A with/without reloading the shell, and with/without PCIe contention on INST-B. As Figure 17(a) shows, PCIe contention slows down the FPGA programming of all AFIs, with the effect being more prominent for larger AFIs, where programming slows down from \( \approx \!\!7 \,\mathrm{s} \) to \( \approx \!\!12 \,\mathrm{s} \). When the shell is also reloaded (Figure 17(b)), the same pattern holds, but the effects are even more pronounced: even reloading the small AFI slows down from \( \approx \!\!\!7 \,\mathrm{s} \) to over \( 20 \,\mathrm{s} \), while the large AFI takes over \( 30 \,\mathrm{s} \) compared to \( \approx \!\!\!9 \,\mathrm{s} \) without PCIe stressing. The effect is likely not just due to the fact that the AFI needs to be transferred to the FPGA over PCIe using the management tools, but also due to other parts of the reprogramming process that depend on host-FPGA communication over the contended bus.
Fig. 17. The FPGA programming time can be slowed down by heavy PCIe traffic from co-located instances. In (a), only the user’s CL is reconfigured, while in (b), both the FPGA shell (SH) and the CL are reloaded onto the FPGA. Three AFIs with different numbers of logic resources are used.
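The observed slowdowns are consistent with a simple first-order model in which the partial-bitstream transfer time scales with the image size and inversely with the PCIe bandwidth left to the programming process. The function below is an illustrative model only, with made-up parameter values; it is not a fit to Figure 17:

```python
# Sketch: first-order model of FPGA programming time under PCIe contention.
# All parameters are illustrative assumptions, not measured values.

def programming_time(image_mb, pcie_mbps, overhead_s):
    """Transfer-bound component (image size / available bandwidth) plus a
    lumped overhead for configuration steps that are not bandwidth-bound."""
    return overhead_s + image_mb / pcie_mbps

# A hypothetical 1200 MB partial bitstream: contention that cuts the
# available bandwidth from 1600 MB/s to 400 MB/s lengthens programming.
print(programming_time(1200, 1600, 4.0))  # -> 4.75
print(programming_time(1200, 400, 4.0))   # -> 7.0
```

The model also explains why shell reloads suffer more: they reconfigure a larger region, so the bandwidth-dependent term dominates.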
6 OTHER CROSS-INSTANCE EFFECTS
In this section, we investigate how other aspects of the hardware that is present in F1 servers, namely NICs (Section 6.1), NVMe SSD storage (Section 6.2), and DRAM modules directly attached to the FPGAs (Section 6.3), leak information that can permeate the VM instance boundary and can be used to, for example, cause interference with other users, or determine that different VM instances belong to the same server. The NIC and SSD contention-based attacks are summarized in Figure 11(b).
6.1 Network-Based Contention
NIC cards provide connectivity between a VM and the Internet through external devices such as switches and routers. NIC cards are typically also connected to the host over PCIe, and therefore share the bandwidth with the FPGAs. To test whether the FPGA PCIe traffic has any effect on the network bandwidth, we rent three co-located f1.2xlarge instances and measure the network bandwidth of each instance using speedtest-cli [49], with and without the FPGA-based PCIe stressor running on one of the other instances.
The results for all six pairs of instances are identical: when the PCIe stressor is not running, the network bandwidth stays at its baseline, but it drops noticeably as soon as the FPGA-based PCIe stressor is enabled, confirming that the NIC and the FPGA compete for shared PCIe bandwidth.
6.2 SSD Contention
Another shared resource that can lead to contention is the SSD storage that F1 instances can access. The public specification of F1 instances notes that each instance comes with local NVMe SSD storage.
6.2.1 SSD-to-SSD Contention.
SSD contention is tested by measuring the bandwidth of the SSD using the hdparm tool [47], while a co-located instance stresses its own SSD with the stress utility. We find that stressing the SSD of one instance reduces the measured bandwidth of exactly one other instance in the NUMA node.
This non-uniform SSD behavior can be used for a robust covert channel with a bandwidth of \( 0.125 \,b/\mathrm{s} \) with 100% accuracy. Specifically, for a transmission of bit 1, the transmitter stresses its SSD for the duration of the bit period, while for a bit 0 it remains idle; the receiver recovers each bit by measuring its own SSD bandwidth.
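At \( 0.125 \,b/\mathrm{s} \), each bit occupies an 8-second window. The transmitter's side of such a channel can be sketched as generating a stress schedule from the message bits; the helper and its interface below are our own illustrative constructs, not the experiment code:

```python
# Sketch: SSD covert-channel transmit schedule at 0.125 b/s
# (one bit per 8-second window). Illustrative construct.

BIT_PERIOD_S = 8.0  # 0.125 b/s => one bit every 8 seconds

def tx_schedule(bits, t0=0.0):
    """For each '1' bit the transmitter stresses its SSD for the whole bit
    period; for a '0' it stays idle. Returns (start, stop, stress?) tuples
    that a driver loop would execute with the actual stress utility."""
    return [(t0 + i * BIT_PERIOD_S, t0 + (i + 1) * BIT_PERIOD_S, b == 1)
            for i, b in enumerate(bits)]

print(tx_schedule([1, 0]))  # -> [(0.0, 8.0, True), (8.0, 16.0, False)]
```

The receiver mirrors this schedule, sampling its own SSD bandwidth once per window and thresholding between the contended and idle levels.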
The same mechanism can be exploited to deteriorate the performance of other tenants. It can further co-locate instances on an even more fine-grained level than was previously possible. To accomplish this, we rent several f1.2xlarge instances and test which pairs of them exhibit SSD contention.
The fact that SSD contention only exists between two specific instances at a time suggests that pairs of f1.2xlarge instances share a single NVMe SSD, allowing co-location at a granularity finer than the NUMA node itself.
6.2.2 FPGA-to-SSD Contention.
To formalize the above observations, we use the methodology described in Section 4 to find four co-located f1.2xlarge instances in the same NUMA node, and measure the SSD bandwidth of each instance while, in turn, stressing the SSDs and the FPGA-based PCIe links of the others.
The results of these experiments are summarized in Figure 18. During idle periods, the SSD bandwidth is approximately \( 800 \,\mathrm{M}B/\mathrm{s}\,{\rm to}\, 900 \,\mathrm{M}B/\mathrm{s} \). However, for the two instances with SSD contention, i.e., pairs \( (A,D) \) and \( (B,C) \), the bandwidth drops to as low as \( 7 \,\mathrm{M}B/\mathrm{s} \) while the paired instance runs the stress utility. Moreover, when the FPGA-based PCIe stressor runs between seconds 120 and 150, the SSD bandwidth of all four instances decreases, since the SSD traffic itself traverses the shared PCIe infrastructure.
Fig. 18. NVMe SSD bandwidth for all transmitter and receiver pairs in a NUMA node, as measured by hdparm. Running stress between seconds 60 to 90 causes a bandwidth drop in exactly one other instance in the NUMA node, while running the FPGA-based PCIe stressor (between seconds 120 and 150) reduces the SSD bandwidth in all cases.
We further test for the opposite effect, i.e., whether stressing the SSD can cause a measurable difference to the FPGA-based PCIe performance. We again stress the SSD between seconds 60 and 90, and stress the FPGA between seconds 120 and 150. As the results of Figure 19 show, the PCIe bandwidth drops from almost \( 1.8 \,\mathrm{G}B/\mathrm{s} \) to approximately \( 500 \,\mathrm{M}B/\mathrm{s}\,{\rm to}\, 1000 \,\mathrm{M}B/\mathrm{s} \) when the FPGA-stressor is enabled, but there is no significant difference in performance when the SSD-based stressor is turned on. Similar to the experiments of Section 6.1, this is likely because the FPGA-based stressor can more effectively saturate the PCIe link, while the SSD-based stressor seems to be limited by the performance of the hard drive itself, whose bandwidth when idle (\( 800 \,\mathrm{M}B/\mathrm{s} \)) is much lower than that of the FPGA (\( 1.8 \,\mathrm{G}B/\mathrm{s} \)). In summary, using the FPGA as a PCIe stressor can cause the SSD bandwidth to drop, but the converse is not true, since there is no observable influence on the FPGA PCIe bandwidth as a result of SSD activity.
Fig. 19. FPGA PCIe bandwidth for all transmitter and receiver pairs in a NUMA node, as measured by our covert-channel receiver. Running stress between seconds 60 to 90 does not cause a bandwidth drop, but running the FPGA-based PCIe stressor (between seconds 120 and 150) reduces the bandwidth in all cases.
6.3 DRAM-Based Thermal Monitoring
DRAM decay is known to depend on the temperature of the DRAM chip and its environment [70, 71]. Since the FPGAs in cloud servers have direct access to the on-board DRAM, they can be used as sensors for detecting and estimating the temperature around the FPGA boards, supplementing PCIe-traffic-based measurements.
Figure 20 summarizes how the DRAM decay of on-board chips can be used to monitor thermal changes in the data center. When a DRAM module is being initialized with some data, the DRAM cells will become charged to store the values, with true cells storing logical 1s as charged capacitors, and anti-cells storing them as depleted capacitors. Typically, true and anti-cells are paired, so initializing the DRAM to all ones will ensure only half of the DRAM cells will be charged, even if the actual location of true and anti-cells is not known.
Fig. 20. By alternating between AFIs that instantiate DRAM controllers or leave them unconnected, the decay rate of DRAM cells can be measured as a proxy for environmental temperature monitors [62].
After the data has been written to the DRAM and the cells have been charged, the DRAM refresh is disabled. Disabling DRAM refresh in the server itself is not possible as the physical hardware on the server is controlled by the hypervisor, not the users. However, the FPGA boards have their own DRAMs. By programming the FPGAs with AFIs that do and do not have DRAM controllers, disabling of the DRAM refresh can be emulated, allowing the DRAM cells to decay [62]. Eventually, some of the cells will lose enough charge to “flip” their value (for example, data written as 1 becomes 0 for true cells, since the charge has dissipated).
DRAM data can then be read after a fixed time \( T_{decay} \), which is called the decay time. The number of flipped cells during this time depends on the temperature of the DRAM and its environment [71], and can therefore produce coarse-grained DRAM-based temperature sensors of F1 instances.
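Counting the decayed cells reduces to comparing the read-back contents against the written pattern. Since decays of both true cells (1 to 0) and anti-cells (0 to 1) appear as bit differences, a simple XOR-based population count suffices; the following is a sketch with synthetic byte strings rather than real DRAM dumps:

```python
# Sketch: count decayed DRAM cells by XOR-comparing the written pattern
# with the data read back after T_decay. Synthetic example data.

def count_flips(written, readback):
    """Bits written as 1 that read back as 0 (true cells), and bits written
    as 0 that read back as 1 (anti-cells), both show up as set bits in the
    XOR of the two byte strings."""
    return sum(bin(w ^ r).count("1") for w, r in zip(written, readback))

# Two bytes written as all-ones; after decay, two cells have flipped.
print(count_flips(b"\xff\xff", b"\xfe\x7f"))  # -> 2
```

Repeating this count at a fixed \( T_{decay} \) over time yields the per-FPGA decay traces used as temperature proxies in the next subsection.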
Prior work [63] and this article have so far focused on information leaks due to shared resources within a NUMA node, but did not attempt to co-locate instances that are in the same physical server, but belong to different NUMA nodes. In this section, we propose such a methodology that uses the boards’ thermal signatures, which are obtained from the decay rates of each FPGA’s DRAM modules. To collect these signatures, we use the method and code provided by Tian et al. [62] to alternate between bitstreams that instantiate DRAM controllers and ones that leave them unconnected to initialize the memory and then disable its refresh rate. When two instances are in the same server, the temperatures of all 8 FPGAs in that server are correlated, so the DRAM decay rates of their boards follow similar trends, allowing FPGAs in different NUMA nodes of the same server to be matched.
6.3.1 Setup and Evaluation.
Our method for co-locating instances within a server has two aspects to it: first, we show that we can successfully identify two FPGA boards as being in the same server with high probability using their DRAM decay rates, and then we show that by using PCIe-based co-location we can build the full profile of a server, and identify all eight of its FPGA boards, even if they are in different NUMA nodes. More specifically, we use the open-source software by Tian et al. [62] to collect DRAM decay measurements for several FPGAs over a long period of time and then find which FPGAs’ DRAM decay patterns are the “closest”.
To validate our approach, we rent three f1.16xlarge instances (24 FPGAs in total) and collect DRAM decay measurements from every FPGA slot over a period of 24 hours, comparing three candidate metrics: the raw decay counts, their normalized versions, and the differences between successive measurements, \( c^i_{\rm diff} \).
Fig. 21. DRAM decay traces from three f1.16xlarge instances (24 FPGAs in total) for a period of 24 hours, using the differences between successive measurements \( c^i_{\rm diff} \) as the comparison metric, which results in the highest co-location accuracy of 96%. Within each server, measurements from slots in the same NUMA node have been drawn in the same style.
The raw data metric has an accuracy of 75%, the normalized metric is 71% accurate, while the difference metric succeeds in correctly pairing all FPGAs except for one, for an accuracy of 96%. Shorter measurement periods still result in high accuracies. For example, using the DRAM data from the first 12 hours results in only one additional FPGA mis-identification, for an accuracy of 92%. We plot the classification accuracy for the three metrics as a function of time in Figure 22.
Fig. 22. Accuracy of classifying individual FPGAs as belonging to the right server as a function of measurement time using the three different proposed metrics.
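The pairing step itself can be sketched as computing the \( c_{\rm diff} \) signature of each FPGA and greedily matching the nearest signatures. The L1 distance and greedy strategy below are illustrative simplifications of the comparison described above, run here on synthetic decay counts:

```python
# Sketch: match FPGAs to servers by the similarity of their DRAM decay
# trends. Decay counts below are synthetic, not measured.

def diff_metric(counts):
    """c_diff: differences between successive decay counts, which track
    temperature *changes* and are robust to per-board offsets."""
    return [b - a for a, b in zip(counts, counts[1:])]

def l1(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def closest_pairs(traces):
    """Greedily pair FPGAs whose c_diff signatures are nearest in L1."""
    sigs = {k: diff_metric(v) for k, v in traces.items()}
    keys, pairs, used = sorted(sigs), [], set()
    for k in keys:
        if k in used:
            continue
        others = [o for o in keys if o != k and o not in used]
        if not others:
            break
        best = min(others, key=lambda o: l1(sigs[k], sigs[o]))
        pairs.append((k, best))
        used.update((k, best))
    return pairs

# A/B share one thermal trend, C/D another, despite different offsets.
traces = {"A": [100, 120, 110, 130], "B": [200, 221, 209, 231],
          "C": [150, 150, 180, 150], "D": [90, 91, 122, 90]}
print(closest_pairs(traces))  # -> [('A', 'B'), ('C', 'D')]
```

Because \( c_{\rm diff} \) subtracts out each board's baseline decay rate, boards in the same server cluster together even when their absolute counts differ, which matches the superiority of the difference metric in Figure 22.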
In the experiments of Figure 21, the \( c_{\rm diff} \) metric places all but one of the 24 FPGA slots in the correct server, with a single slot mis-attributed to a different server.
However, by using insights about the NUMA nodes that can be extracted through our PCIe-based experiments, the accuracy and reliability of this method can be further increased. For example, if the PCIe-based handshake shows that a slot shares a NUMA node with slots already attributed to a given server, a conflicting DRAM-based assignment can be rejected, correcting the remaining misclassification.
7 CONCLUSION
This article introduced a novel, fast covert-channel attack between separate users in a public, FPGA-accelerated cloud computing setting. It characterized how contention of the PCIe bus can be used to create a robust communication mechanism, even among users of different operating systems, with bandwidths reaching \( 20 \,\mathrm{k}b/\mathrm{s} \) with 99% accuracy. In addition to making use of contention of the PCIe bus for covert channels, this article demonstrated that contention can be used to monitor or disrupt the activities of other users, including inferring information about their applications, or slowing them down. This work further identified alternative co-location mechanisms, which make use of network cards, SSDs, or even the DRAM modules attached to the FPGA boards, allowing adversaries to co-locate FPGAs in the same server, even if they are on separate NUMA nodes.
More generally, this work demonstrated that malicious adversaries can use PCIe monitoring to observe the data center server activity, breaking the separation of privilege that isolated VM instances are supposed to provide. With more types of accelerators becoming available on the cloud, including FPGAs, GPUs, and TPUs, PCIe-based threats are bound to become a key aspect of cross-user attacks. Overall, our insights showed that low-level, direct hardware access to PCIe, NIC, SSD, and DRAM hardware creates new attack vectors that need to be considered by both users and cloud providers alike when deciding how to trade off performance, cost, and security for their designs: even if the endpoints of computations (e.g., CPUs and FPGAs) are assumed to be secure, the shared nature of cloud infrastructures poses new challenges that need to be addressed.
Footnotes
1 This article extends the work accepted at HOST 2021 [32] by (a) measuring and identifying differences in the covert-channel bandwidth across different operating systems, (b) detecting when co-located VM instances with FPGAs are initialized, (c) showing that malicious adversaries can use PCIe contention for slowing down the communication between the host and the FPGA, leading to slower FPGA programming times and applications, and (d) introducing a new method of instance co-location based on network bandwidth contention. Our new findings also allow us to update the deduced PCIe topology of F1 server architectures used by AWS.
2 Section 4.5 shows that different setups can result in even higher bandwidths exceeding \( 20 \,\mathrm{k}b/\mathrm{s} \).
3 A maximum transfer size of \( 1 \,\mathrm{M}B \) was chosen to ensure that multiple transfers were possible within each transfer interval without ever interfering with the next measurement interval.
4 Assuming that slots within a server are assigned randomly, the probability of getting instances with shared SSDs given that they are already co-located in the same NUMA node is 33%: out of the three remaining slots in the same NUMA node, exactly one slot can be in an instance that shares the SSD.
- [1] . 2014. Seven recipes for setting your FPGA on fire—A cookbook on heat generators. Microprocessors and Microsystems 38, 8(2014), 911–919.Google Scholar
Digital Library
- [2] . 2019. RAM-Jam: Remote temperature and voltage fault attack on FPGAs using memory collisions. In Proceedings of the Workshop on Fault Diagnosis and Tolerance in Cryptography.Google Scholar
Cross Ref
- [3] . 2022. Instance Families. Retrieved March 20, 2022 from https://www.alibabacloud.com/help/doc-detail/25378.html.Google Scholar
- [4] . 2016. Developer Preview—EC2 Instances (F1) with Programmable Hardware. Retrieved March 20, 2022 from https://aws.amazon.com/blogs/aws/developer-preview-ec2-instances-f1-with-programmable-hardware/.Google Scholar
- [5] . 2018. The Agility of F1: Accelerate Your Applications with Custom Compute Power. Retrieved March 20, 2022 from https://d1.awsstatic.com/Amazon_EC2_F1_Infographic.pdf.Google Scholar
- [6] . 2019. F1 FPGA Application Note: How to Use Write Combining to Improve PCIe Bus Performance. Retrieved March 20, 2022 from https://github.com/awslabs/aws-fpga-app-notes/tree/master/Using-PCIe-Write-Combining.Google Scholar
- [7] . 2020. Amazon EC2 P4d Instances Deep Dive. Retrieved March 20, 2022 from https://aws.amazon.com/blogs/compute/amazon-ec2-p4d-instances-deep-dive/.Google Scholar
- [8] . 2020. Official repository of the AWS EC2 FPGA Hardware and Software Development Kit. Retrieved March 20, 2022 from https://github.com/aws/aws-fpga/tree/v1.4.15.Google Scholar
- [9] . 2021. AWS FPGA - Frequently Asked Questions. Retrieved March 20, 2022 from https://github.com/aws/aws-fpga/blob/master/FAQs.md.Google Scholar
- [10] . 2021. AWS Shell Interface Specification. Retrieved March 20, 2022 from https://github.com/aws/aws-fpga/blob/master/hdk/docs/AWS_Shell_Interface_Specification.md.Google Scholar
- [11] . 2021. CL_DRAM_DMA Custom Logic Example. Retrieved March 20, 2022 from https://github.com/aws/aws-fpga/tree/master/hdk/cl/examples/cl_dram_dma.Google Scholar
- [12] . 2021. F1 FPGA Application Note: How to Use the PCIe Peer-2-Peer Version 1.0. Retrieved March 20, 2022 from https://github.com/awslabs/aws-fpga-app-notes/tree/master/Using-PCIe-Peer2Peer.Google Scholar
- [13] . 2021. Hello World CL Example. Retrieved March 20, 2022 from https://github.com/aws/aws-fpga/tree/master/hdk/cl/examples/cl_hello_world.Google Scholar
- [14] . 2022. Amazon EC2 Instance Types. Retrieved March 20, 2022 from https://aws.amazon.com/ec2/instance-types/.Google Scholar
- [15] . 2022. Amazon Linux 2 FAQs. Retrieved March 20, 2022 from https://aws.amazon.com/amazon-linux-2/faqs/.Google Scholar
- [16] . 2022. AWS Marketplace. Retrieved March 20, 2022 from https://aws.amazon.com/marketplace.Google Scholar
- [17] . 2022. FPGA Developer AMI. Retrieved March 20, 2022 from https://aws.amazon.com/marketplace/pp/prodview-gimv3gqbpe57k.Google Scholar
- [18] . 2022. FPGA Developer AMI (Amazon Linux 2). Retrieved March 20, 2022 from https://aws.amazon.com/marketplace/pp/prodview-iehshpgi7hcjg.Google Scholar
- [19] . 2014. Aging effects in FPGAs: An experimental analysis. In Proceedings of the International Conference on Field Programmable Logic and Applications.Google Scholar
Cross Ref
- [20] . 2022. FPGA Cloud Compute. Retrieved March 20, 2022 from https://cloud.baidu.com/product/fpga.html.Google Scholar
- [21] . 2017. TARUC: A topology-aware resource usability and contention benchmark. In Proceedings of the ACM/SPEC International Conference on Performance Engineering.Google Scholar
Digital Library
- [22] . 2021. DeepField-SR Video Super Resolution Hardware Accelerator. Retrieved March 20, 2022 from https://www.xilinx.com/products/acceleration-solutions/deepField-sr.html.Google Scholar
- [23] . 1997. Thermal monitoring on FPGAs using ring-oscillators. In Proceedings of the International Workshop on Field-Programmable Logic and Applications.Google Scholar
Cross Ref
- [24] . 2020. Neighbors from hell: Voltage attacks against deep learning accelerators on multi-tenant FPGAs. In Proceedings of the International Conference on Field-Programmable Technology.Google Scholar
Cross Ref
- [25] . 2010. The scalable heterogeneous computing (SHOC) benchmark suite. In Proceedings of the Workshop on General-Purpose Processing on Graphics Processing Units.Google Scholar
Digital Library
- [26] . 2021. A survey of recent attacks and mitigation on FPGA systems. In Proceedings of the IEEE Computer Society Annual Symposium on VLSI.Google Scholar
Cross Ref
- [27] . 2016. Topology-aware GPU selection on multi-GPU nodes. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops.Google Scholar
Cross Ref
- [28] . 2019. Measuring long wire leakage with ring oscillators in cloud FPGAs. In Proceedings of the International Conference on Field Programmable Logic and Applications.Google Scholar
Cross Ref
- [29] . 2019. Reading between the dies: Cross-SLR covert channels on multi-tenant cloud FPGAs. In Proceedings of the IEEE International Conference on Computer Design.Google Scholar
Cross Ref
- [30] . 2020. C3APSULe: Cross-FPGA covert-channel attacks through power supply unit leakage. In Proceedings of the IEEE Symposium on Security and Privacy.Google Scholar
Cross Ref
- [31] . 2020. Information leakage from FPGA routing and logic elements. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design.Google Scholar
Digital Library
- [32] . 2021. Cross-VM information leaks in FPGA-accelerated cloud environments. In Proceedings of the IEEE International Symposium on Hardware Oriented Security and Trust.Google Scholar
Cross Ref
- [33] . 2020. Are cloud FPGAs really vulnerable to power analysis attacks? In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition.Google Scholar
Cross Ref
- [34] . 2021. Shared FPGAs and the holy grail: Protections against side-channel and fault attacks. In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition.Google Scholar
Cross Ref
- [35] . 2017. Voltage drop-based fault attacks on FPGAs using valid bitstreams. In Proceedings of the International Conference on Field Programmable Logic and Applications.Google Scholar
Cross Ref
- [36] . 2021. Classifying computations on multi-tenant FPGAs. In Proceedings of the Design Automation Conference.Google Scholar
- [37] . 2022. FPGA Accelerated Cloud Server. Retrieved March 20, 2022 from https://www.huaweicloud.com/en-us/product/fcs.html.Google Scholar
- [38] . 2010. Handbook of FPGA Design Security (1st ed.). Springer.Google Scholar
Digital Library
- [39] . 2020. Security of cloud FPGAs: A survey. arXiv:2005.04867. Retrieved from https://arxiv.org/abs/2005.04867.Google Scholar
- [40] Jonas Krautter, Dennis R. E. Gnad, and Mehdi B. Tahoori. 2019. Mitigating Electrical-level Attacks towards Secure Multi-Tenant FPGAs in the Cloud. ACM Trans. Reconfigurable Technol. Syst. 12, 3, Article 12 (September 2019), 26 pages. Google Scholar
Digital Library
- [41] . 2019. Mitigating electrical-level attacks towards secure multi-tenant FPGAs in the cloud. ACM Transactions on Reconfigurable Technology and Systems 12, 3(2019).Google Scholar
Digital Library
- [42] . 2021. Denial-of-Service on FPGA-based cloud infrastructures – Attack and defense. Transactions on Cryptographic Hardware and Embedded Systems, 3(2021), 441–464.Google Scholar
Cross Ref
- [43] Tuan Minh La, Kaspar Matas, Nikola Grunchevski, Khoa Dang Pham, and Dirk Koch. 2020. FPGADefender: Malicious Self-oscillator Scanning for Xilinx UltraScale + FPGAs. ACM Trans. Reconfigurable Technol. Syst. 13, 3, Article 15 (September 2020), 31 pages. Google Scholar
Digital Library
- [44] . 2019. Priority-based PCIe scheduling for multi-tenant multi-GPU systems. IEEE Computer Architecture Letters 18, 2(2019), 157–160.Google Scholar
Digital Library
- [45] . 2000. Thermal testing on reconfigurable computers. IEEE Design & Test of Computers 17, 1 (
Jan. 2000), 84–91.Google ScholarDigital Library
- [46] . 2002. Dynamically inserting, operating, and eliminating thermal sensors of FPGA-based systems. IEEE Transactions on Components and Packaging Technologies 25, 4(2002), 561–566.Google Scholar
Cross Ref
- [47] . 2021. hdparm. Retrieved March 20, 2022 from https://sourceforge.net/projects/hdparm/.Google Scholar
- [48] . 2020. A quantitative defense framework against power attacks on multi-tenant FPGA. In Proceedings of the International Conference on Computer-Aided Design.Google Scholar
- [49] . 2021. speedtest-cli. Retrieved March 20, 2022 from https://github.com/sivel/speedtest-cli.Google Scholar
- [50] . 2019. Physical side-channel attacks and covert communication on FPGAs: A survey. In Proceedings of the International Conference on Field Programmable Logic and Applications.Google Scholar
Cross Ref
- [51] . 2021. Remote power side-channel attacks on BNN accelerators in FPGAs. In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition.Google Scholar
Cross Ref
- [52] G. Provelengios, D. Holcomb, and R. Tessier. 2020. Power distribution attacks in multitenant FPGAs. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 28, 12 (2020), 2685–2698.
- [53] George Provelengios, Daniel Holcomb, and Russell Tessier. 2021. Mitigating voltage attacks in multi-tenant FPGAs. ACM Trans. Reconfigurable Technol. Syst. 14, 2, Article 9 (June 2021), 24 pages.
- [54] . 2021. Deep-Dup: An adversarial weight duplication attack framework to crush deep neural network in multi-tenant FPGA. In Proceedings of the USENIX Security Symposium.
- [55] . 2009. Exploring the multiple-GPU design space. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops.
- [56] . 2011. Quantifying NUMA and contention effects in multi-GPU systems. In Proceedings of the Workshop on General-Purpose Processing on Graphics Processing Units.
- [57] . 2019. Oscillator without a combinatorial loop and its threat to FPGA in data centre. Electronics Letters 15, 11 (2019), 640–642.
- [58] . 2021. Invisible probe: Timing attacks with PCIe congestion side-channel. In Proceedings of the IEEE Symposium on Security and Privacy.
- [59] . 2022. FPGA Cloud Server. Retrieved March 20, 2022 from https://cloud.tencent.com/product/fpga.
- [60] . 2020. Cloud FPGA security with RO-based primitives. In Proceedings of the International Conference on Field-Programmable Technology.
- [61] . 2019. Temporal thermal covert channels in cloud FPGAs. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays.
- [62] . 2020. Fingerprinting cloud FPGA infrastructures. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays.
- [63] . 2021. Cloud FPGA cartography using PCIe contention. In Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines.
- [64] . 2021. Remote power attacks on the versatile tensor accelerator in multi-tenant FPGAs. In Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines.
- [65] . 2008. Modeling and observing the jitter in ring oscillators implemented in FPGAs. In Proceedings of the IEEE Workshop on Design and Diagnostics of Electronic Circuits and Systems.
- [66] X. Wang, Y. Niu, F. Liu, and Z. Xu. 2022. When FPGA meets cloud: A first look at performance. IEEE Transactions on Cloud Computing 10, 2 (2022), 1344–1357.
- [67] . 2014. stress. Retrieved March 20, 2022 from https://web.archive.org/web/20190502/https://people.seas.harvard.edu/~apw/stress/.
- [68] . 2021. 63419 - Vivado Partial Reconfiguration - What types of bitstreams are used in Partial Reconfiguration (PR) solutions? Retrieved March 20, 2022 from https://support.xilinx.com/s/article/63419.
- [69] . 2021. UltraScale+ FPGAs: Product Tables and Product Selection Guides. Retrieved March 20, 2022 from https://www.xilinx.com/support/documentation/selection-guides/ultrascale-plus-fpga-product-selection-guide.pdf.
- [70] . 2019. Spying on temperature using DRAM. In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition.
- [71] . 2016. Run-time accessible DRAM PUFs in commodity devices. In Proceedings of the Conference on Cryptographic Hardware and Embedded Systems.
- [72] . 2009. Temperature-aware cooperative ring oscillator PUF. In Proceedings of the IEEE International Workshop on Hardware-Oriented Security and Trust.
- [73] Jiliang Zhang and Gang Qu. 2019. Recent attacks and defenses on FPGA-based systems. ACM Trans. Reconfigurable Technol. Syst. 12, 3, Article 14 (September 2019), 24 pages.
- [74] Y. Zhang, R. Yasaei, H. Chen, Z. Li, and M. A. A. Faruque. 2021. Stealing neural network structure through remote FPGA side-channel analysis. IEEE Transactions on Information Forensics and Security 16 (2021), 4377–4388.