Abstract
With data-intensive artificial intelligence (AI) and machine learning (ML) applications surging rapidly, modern high-performance embedded systems, built on heterogeneous computing resources, critically demand low-latency, high-bandwidth data communication. As such, the emerging NVMe (Non-Volatile Memory Express) protocol, with its parallel queuing, access prioritization, and optimized I/O arbitration, is being widely adopted as a de facto fast I/O communication interface. However, effectively leveraging the potential of modern NVMe storage is nontrivial and demands fine-grained control, high processing concurrency, and application-specific optimization. Fortunately, modern FPGA devices, capable of efficient parallel processing and application-specific programmability, readily meet the underlying physical-layer requirements of the NVMe protocol, and therefore provide an unprecedented opportunity to implement a rich-featured NVMe middleware for modern high-performance embedded computing.
In this article, we present how to rethink existing access mechanisms for NVMe storage and devise hardware-assisted solutions that accelerate NVMe data access in high-performance embedded computing systems. Our key idea is to exploit the massively parallel I/O queuing capability of the NVMe storage system by leveraging FPGAs’ reconfigurability and native hardware computing power, operating transparently to the main processor. Specifically, our DirectNVM system aims to provide effective hardware constructs for high-performance, scalable userspace storage applications by (1) hardening all essential NVMe driver functionality, thereby avoiding expensive OS syscalls and enabling zero-copy data access from the application, (2) handling I/O communication control in hardware rather than through OS-level interrupts, which significantly reduces both total I/O latency and its variance, and (3) exposing application-specific weighted-round-robin I/O traffic scheduling to userspace.
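The weighted-round-robin arbitration exposed in point (3) follows the NVMe specification’s scheme of giving each submission queue a configurable weight, so that higher-weight queues may issue proportionally more commands per arbitration round. The credit-based selection logic can be sketched in software as follows (an illustrative sketch only; DirectNVM realizes this in FPGA logic, and the class and method names here are hypothetical):

```python
from collections import deque

class WrrScheduler:
    """Illustrative weighted-round-robin arbiter over NVMe-style submission queues.

    Each queue receives `weight` service credits per round, mirroring how an
    NVMe WRR arbiter lets higher-weight queues issue more commands per cycle.
    (Hypothetical software sketch, not DirectNVM's actual hardware design.)
    """

    def __init__(self, weights):
        # weights: {queue_id: commands allowed per arbitration round}
        self.weights = dict(weights)
        self.queues = {qid: deque() for qid in weights}
        self.credits = dict(weights)  # remaining credits in the current round

    def submit(self, qid, cmd):
        """Enqueue a command on submission queue `qid`."""
        self.queues[qid].append(cmd)

    def next_command(self):
        """Select the next command to issue, or None if all queues are empty."""
        for _ in range(2):  # at most one credit refill per selection
            for qid, q in self.queues.items():
                if q and self.credits[qid] > 0:
                    self.credits[qid] -= 1
                    return q.popleft()
            # Round exhausted: refill all credits and scan once more.
            self.credits = dict(self.weights)
        return None
```

For example, a latency-sensitive frontend queue with weight 2 and a backend queue with weight 1 would drain in the interleaving f, f, b, f, f, b, ..., giving the frontend roughly twice the command issue rate under contention.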
To validate our design methodology, we developed a complete DirectNVM system on the Xilinx Zynq MPSoC architecture, which incorporates a high-performance application processor unit (APU) equipped with DDR4 system memory and a hardened, configurable PCIe Gen3 block in its programmable logic. We then measured the storage bandwidth and I/O latency of both our DirectNVM system and a conventional OS-based system when executing the standard FIO benchmark suite [2]. Compared against the PetaLinux built-in kernel driver running on a Zynq MPSoC, DirectNVM achieves up to 18.4× higher throughput and up to 4.5× lower latency. To ensure a fair performance comparison, we also measured DirectNVM against Intel SPDK [26], a highly optimized userspace asynchronous NVMe I/O framework, running on an x86 PC system. Our experimental results show that DirectNVM, even running on an embedded ARM processor considerably less powerful than a full-scale AMD processor, achieves up to 2.2× higher throughput and 1.3× lower latency. Furthermore, through a multi-threaded test case, we demonstrate that DirectNVM’s weighted-round-robin scheduling can significantly optimize the bandwidth allocation between latency-constrained frontend applications and other backend applications in real-time systems. Finally, we develop a theoretical performance-modeling framework based on classic queuing theory that quantitatively relates a system’s I/O performance to its I/O implementation.
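The abstract summarizes the queuing-theoretic model only at a high level. As a rough illustration of the kind of relationship such a framework captures (our own sketch under an M/D/1 assumption, as in [20], not necessarily the authors’ exact formulation): with Poisson command arrivals at rate λ and a drive with deterministic service rate μ, the Pollaczek–Khinchine result gives a mean I/O sojourn time of W = 1/μ + ρ/(2μ(1−ρ)), where ρ = λ/μ:

```python
def md1_mean_latency(arrival_rate, service_rate):
    """Mean time an I/O request spends in an M/D/1 system (queueing + service).

    arrival_rate: lambda, requests per second (Poisson arrivals)
    service_rate: mu, requests per second the drive completes (deterministic)
    Illustrative model only; the article's framework may differ in detail.
    """
    rho = arrival_rate / service_rate  # drive utilization
    if rho >= 1.0:
        raise ValueError("unstable queue: arrival rate must be below service rate")
    # Pollaczek-Khinchine mean waiting time, specialized to deterministic service.
    waiting = rho / (2.0 * service_rate * (1.0 - rho))
    return 1.0 / service_rate + waiting
```

For instance, a drive completing 500K IOPS (service time 2 µs) driven at 50% utilization yields a mean latency of 3 µs, and the queueing term grows without bound as ρ approaches 1, which is why reducing per-I/O software overhead (raising the effective μ) lowers both latency and its variance.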
References
- [1] 2006. Linux Kernel 2.6.18 - Make CFQ the default IO scheduler. Retrieved from https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=b17fd9bceb99610f6dc7998c9a4ed6b71520be2b.
- [2] 2020. Flexible I/O. Retrieved from https://github.com/axboe/fio.
- [3] 2013. Linux block IO: Introducing multi-queue SSD access on multi-core systems. In Proceedings of the 6th International Systems and Storage Conference. 1–10.
- [4] 2020. FPGA Drive FMC. Retrieved from https://opsero.com/product/fpga-drive-fmc-dual/.
- [5] 2017. NVMe Specification 1.3. Retrieved from https://nvmexpress.org/wp-content/uploads/NVM_Express_Revision_1.3.pdf.
- [6] 2020. CAPI Storage, Network, and Analytics Programming (SNAP) Framework. Retrieved from https://developer.ibm.com/linuxonpower/capi/snap.
- [7] 2019. Analyzing, modeling, and provisioning QoS for NVMe SSDs. In Proceedings of the 11th IEEE/ACM International Conference on Utility and Cloud Computing. 247–256. DOI: https://doi.org/10.1109/UCC.2018.00033
- [8] 2018. It’s Time to Think Beyond Cloud Computing. Retrieved from https://www.wired.com/story/its-time-to-think-beyond-cloud-computing/.
- [9] 2015. Performance Benchmarking for PCIe and NVMe Enterprise Solid-State Drives. Retrieved from https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/performance-pcie-nvme-enterprise-ssds-white-paper.pdf.
- [10] 2020. Intel Optane Persistent Memory. Retrieved from https://www.intel.com/content/www/us/en/architecture-and-technology/optane-dc-persistent-memory.html.
- [11] 2020. Open Programmable Accelerator Engine. Retrieved from https://opae.github.io/latest/index.html.
- [12] 2013. Enabling cost-effective data processing with smart SSD. In Proceedings of the IEEE 29th Symposium on Mass Storage Systems and Technologies (MSST). IEEE, 1–12.
- [13] 2016. NVMeDirect: A user-space I/O framework for application-specific optimization on NVMe SSDs. In Proceedings of the 8th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage’16).
- [14] 2013. Introduction to Queueing Systems with Telecommunication Applications. Springer. DOI: https://doi.org/10.1007/978-1-4614-5317-8
- [15] 2017. I/O latency optimization with polling. In Proceedings of the Vault Linux Storage and Filesystems Conference.
- [16] 2019. K2: Work-constraining scheduling of NVMe-attached storage. In Proceedings of the IEEE Real-Time Systems Symposium (RTSS). IEEE, 56–68.
- [17] 2012. High-performance energy-efficient multicore embedded computing. IEEE Trans. Parallel Distrib. Syst. 23 (2012), 684–700. DOI: https://doi.org/10.1109/TPDS.2011.214
- [18] 2019. INSIDER: Designing in-storage computing system for emerging high-performance drive. In Proceedings of the USENIX Annual Technical Conference (USENIX ATC’19). 379–394.
- [19] 2020. Samsung 970 EVO Plus Specification. Retrieved from https://www.samsung.com/semiconductor/minisite/ssd/product/consumer/970evoplus/.
- [20] 2014. Explicit formulae for characteristics of finite-capacity M/D/1 queues. ETRI J. 36, 4 (2014), 609–616. DOI: https://doi.org/10.4218/etrij.14.0113.0812
- [21] 2019. Low Overhead & Energy Efficient Storage Path for Next Generation Computer Systems. Ph.D. Dissertation. University of Manchester.
- [22] 2018. FastPath: Towards wire-speed NVMe SSDs. In Proceedings of the 28th International Conference on Field Programmable Logic and Applications (FPL). IEEE, 170–1707.
- [23] 2019. DMA/Bridge Subsystem for PCI Express v4.1. Retrieved from https://www.xilinx.com/support/documentation/ip_documentation/xdma/v4_1/pg195-pcie-dma.pdf.
- [24] 2019. PetaLinux Tools Documentation. Retrieved from https://www.xilinx.com/support/documentation/sw_manuals/xilinx2019_2/ug1144-petalinux-tools-reference-guide.pdf.
- [25] 2019. UltraScale+ Devices Integrated Block for PCI Express v1.3. Retrieved from https://www.xilinx.com/support/documentation/ip_documentation/pcie4_uscale_plus/v1_3/pg213-pcie4-ultrascale-plus.pdf.
- [26] 2017. SPDK: A development kit to build high performance storage applications. In Proceedings of the IEEE International Conference on Cloud Computing Technology and Science (CloudCom). IEEE, 154–161.
- [27] 2020. DirectNVM. Retrieved from https://github.com/yu-zou/DirectNVM.
DirectNVM: Hardware-accelerated NVMe SSDs for High-performance Embedded Computing