Abstract
Ongoing climate change calls for fast and accurate weather and climate modeling. However, when solving large-scale weather prediction simulations, state-of-the-art CPU and GPU implementations suffer from limited performance and high energy consumption. These implementations are dominated by complex irregular memory access patterns and low arithmetic intensity that pose fundamental challenges to acceleration. To overcome these challenges, we propose and evaluate the use of near-memory acceleration using a reconfigurable fabric with high-bandwidth memory (HBM). We focus on compound stencils, which are fundamental kernels in weather prediction models. Using high-level synthesis techniques, we develop NERO, a field-programmable gate array (FPGA)+HBM-based accelerator connected through the Open Coherent Accelerator Processor Interface (OpenCAPI) to an IBM POWER9 host system. Our experimental results show that NERO outperforms a 16-core POWER9 system by \( 5.3\times \) and \( 12.7\times \) when running two different compound stencil kernels. NERO reduces the energy consumption by \( 12\times \) and \( 35\times \) for the same two kernels over the POWER9 system, achieving energy efficiencies of 1.61 GFLOPS/W and 21.01 GFLOPS/W, respectively. We conclude that employing near-memory acceleration solutions for weather prediction modeling is promising as a means to achieve both high performance and high energy efficiency.
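To make the notion of a "compound stencil" concrete, the following is a minimal, illustrative sketch (pure Python, not code from the paper): two chained 5-point Laplacian sweeps, loosely in the spirit of the horizontal-diffusion-style kernels found in weather models. The grid values and sizes are hypothetical examples.

```python
# Hypothetical illustration of a compound stencil: two dependent
# 5-point Laplacian sweeps over a 2D grid. This is NOT the paper's
# kernel, only a sketch of the access pattern class it targets.

def laplacian(grid):
    """One 5-point Laplacian sweep over the interior of a 2D grid.

    Boundary points are copied through unchanged. Each output point
    reads 5 inputs and does ~5 flops: low arithmetic intensity, so
    performance is bound by memory bandwidth, not compute.
    """
    n, m = len(grid), len(grid[0])
    out = [row[:] for row in grid]  # keep boundary values unchanged
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            out[i][j] = (grid[i - 1][j] + grid[i + 1][j]
                         + grid[i][j - 1] + grid[i][j + 1]
                         - 4.0 * grid[i][j])
    return out

def compound_stencil(grid):
    """Two chained sweeps: the second consumes the first sweep's output,
    so the effective stencil footprint (and halo) widens. Chains of
    dependent sweeps like this create the irregular, bandwidth-hungry
    access patterns that near-HBM acceleration aims to serve."""
    return laplacian(laplacian(grid))
```

Because the second sweep depends on the first, a naive implementation streams the whole intermediate grid through memory twice; avoiding that (e.g., by fusing sweeps on-chip) is the kind of optimization an FPGA dataflow design can express directly.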
Accelerating Weather Prediction Using Near-Memory Reconfigurable Fabric