Abstract
Field Programmable Gate Arrays (FPGAs) are increasingly used in data centers and the cloud, both for their potential to accelerate certain workloads and for their architectural flexibility: they can be deployed as accelerators, smart NICs, or stand-alone processors. To meet the challenges posed by these new use cases, FPGAs are evolving quickly in their capabilities and organization. The integration of High Bandwidth Memory (HBM) into FPGA devices is one recent example of this trend. In this article, we study the potential of FPGAs equipped with HBM from a data analytics perspective. We consider three workloads common in analytics-oriented databases: range selection, hash join, and stochastic gradient descent (SGD) for linear model training. We implement them on an FPGA and show in which cases they benefit from HBM. We integrate our designs into a columnar database (MonetDB) and show the trade-offs that arise from the integration with respect to data movement and partitioning. We consider two possible configurations of the HBM: a single-clock design and a dual-clock design. With the right design, FPGA+HBM-based solutions surpass the highest performance provided by either a two-socket POWER9 system or a 14-core Xeon E5 by up to 5.9× (range selection), 18.3× (hash join), and 6.1× (SGD).
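For readers unfamiliar with the three workloads, the following minimal Python sketch shows what each one computes. It is a plain software reference for the operator semantics only, not the FPGA design studied in the article. The function names (`range_select`, `hash_join`, `sgd_linear`), the squared-loss objective, and the learning-rate and epoch defaults are assumptions chosen for illustration.

```python
from collections import defaultdict


def range_select(column, lo, hi):
    """Return the positions of all values v with lo <= v <= hi."""
    return [i for i, v in enumerate(column) if lo <= v <= hi]


def hash_join(build_keys, probe_keys):
    """Equi-join two key columns; returns (build_pos, probe_pos) pairs."""
    # Build phase: map each key to the positions where it occurs.
    table = defaultdict(list)
    for i, k in enumerate(build_keys):
        table[k].append(i)
    # Probe phase: look up every probe key and emit matching position pairs.
    return [(i, j) for j, k in enumerate(probe_keys) for i in table.get(k, ())]


def sgd_linear(samples, labels, lr=0.1, epochs=200):
    """Plain SGD on squared loss for a linear model y ~ w.x (no bias term)."""
    w = [0.0] * len(samples[0])
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            # Prediction error for one sample, then one gradient step.
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w
```

Range selection is a sequential scan and SGD streams over the training samples, while the join's probe phase performs data-dependent lookups into the hash table. This difference in access pattern is one reason the three operators may respond differently to additional memory bandwidth.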