Elastic-DF: Scaling Performance of DNN Inference in FPGA Clouds through Automatic Partitioning

Published: 06 December 2021
Abstract

Customized compute acceleration in the datacenter is key to the wider roll-out of applications based on deep neural network (DNN) inference. In this article, we investigate how to automatically maximize the performance and scalability of field-programmable gate array (FPGA)-based pipeline dataflow DNN inference accelerators (DFAs) on computing infrastructures consisting of multi-die, network-connected FPGAs. We present Elastic-DF, a novel resource partitioning tool and associated FPGA runtime infrastructure that integrates with the DNN compiler FINN. Elastic-DF allocates FPGA resources to DNN layers and layers to individual FPGA dies to maximize the total performance of the multi-FPGA system. In the resulting Elastic-DF mapping, the accelerator may be instantiated multiple times, and each instance may be segmented across multiple FPGAs transparently, whereby the segments communicate peer-to-peer through 100 Gbps Ethernet FPGA infrastructure, without host involvement. When applied to ResNet-50, Elastic-DF provides a 44% latency decrease on Alveo U280. For MobileNetV1 on Alveo U200 and U280, Elastic-DF enables a 78% throughput increase, eliminating the performance difference between these cards and the larger Alveo U250. Elastic-DF also increases operating frequency in all our experiments, on average by over 20%. Elastic-DF therefore increases performance portability between different sizes of FPGA and improves the critical throughput-per-cost metric of datacenter inference.
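The core allocation step described above — assigning contiguous runs of DNN layers to FPGA dies so that the pipeline stays balanced — can be illustrated with a minimal sketch. This is not Elastic-DF's actual partitioner (which also models multiple resource types, multiple accelerator instances, and network links between devices); it is a hypothetical load-balancing pass over a single pipeline, where `costs[i]` is an assumed resource cost for layer `i` and `num_dies` is the number of FPGA dies available:

```python
from itertools import combinations

def partition_layers(costs, num_dies):
    """Split a pipeline of per-layer resource costs into `num_dies`
    contiguous segments, minimizing the load on the most-loaded die.
    Exhaustive over cut positions; adequate for small layer counts."""
    n = len(costs)
    best_bounds, best_max = None, float("inf")
    # Choose num_dies-1 cut points between layers, keeping layer order.
    for cuts in combinations(range(1, n), num_dies - 1):
        bounds = (0,) + cuts + (n,)
        seg_loads = [sum(costs[a:b]) for a, b in zip(bounds, bounds[1:])]
        if max(seg_loads) < best_max:
            best_max, best_bounds = max(seg_loads), bounds
    # Materialize each segment as the list of layer indices it contains.
    segments = [list(range(a, b)) for a, b in zip(best_bounds, best_bounds[1:])]
    return segments, best_max
```

In a dataflow pipeline, end-to-end throughput is limited by the slowest segment, so minimizing the most-loaded die is a reasonable proxy for maximizing throughput; for example, `partition_layers([4, 3, 2, 6, 1], 2)` places the first two layers on one die and the remaining three on the other.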


Published in

ACM Transactions on Reconfigurable Technology and Systems, Volume 15, Issue 2 (June 2022), 310 pages
ISSN: 1936-7406; EISSN: 1936-7414; DOI: 10.1145/3501287
Editor: Deming Chen


Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 6 December 2021
• Accepted: 1 June 2021
• Revised: 1 April 2021
• Received: 1 January 2021


      Qualifiers

      • research-article
      • Refereed
