
Accelerating Weather Prediction Using Near-Memory Reconfigurable Fabric

Published: 06 June 2022

Abstract

Ongoing climate change calls for fast and accurate weather and climate modeling. However, when solving large-scale weather prediction simulations, state-of-the-art CPU and GPU implementations suffer from limited performance and high energy consumption. These implementations are dominated by complex irregular memory access patterns and low arithmetic intensity that pose fundamental challenges to acceleration. To overcome these challenges, we propose and evaluate the use of near-memory acceleration using a reconfigurable fabric with high-bandwidth memory (HBM). We focus on compound stencils, which are fundamental kernels in weather prediction models. By using high-level synthesis techniques, we develop NERO, a field-programmable gate array (FPGA)+HBM-based accelerator connected through the Open Coherent Accelerator Processor Interface (OpenCAPI) to an IBM POWER9 host system. Our experimental results show that NERO outperforms a 16-core POWER9 system by \( 5.3\times \) and \( 12.7\times \) when running two different compound stencil kernels. NERO reduces the energy consumption by \( 12\times \) and \( 35\times \) for the same two kernels over the POWER9 system, with energy efficiencies of 1.61 GFLOPS/W and 21.01 GFLOPS/W, respectively. We conclude that employing near-memory acceleration solutions for weather prediction modeling is promising as a means to achieve both high performance and high energy efficiency.

  147. [147] Singh Gagandeep et al. 2019. NAPEL: Near-memory computing application performance prediction via ensemble learning. In DAC.Google ScholarGoogle Scholar
  148. [148] Singh Gagandeep, Alser Mohammed, Cali Damla Senol, Diamantopoulos Dionysios, Gómez-Luna Juan, Corporaal Henk, and Mutlu Onur. 2021. FPGA-based near-memory acceleration of modern data-intensive applications. In IEEE Micro.Google ScholarGoogle Scholar
  149. [149] Singh Gagandeep, Chelini Lorenzo, Corda Stefano, Awan Ahsan Javed, Stuijk Sander, Jordans Roel, Corporaal Henk, and Boonstra Albert-Jan. 2019. Near-Memory computing: Past, present, and future. In MicPro, Vol. 71. 102868.Google ScholarGoogle Scholar
  150. [150] Singh Gagandeep, Chelini Lorenzo, Corda Stefano, Awan Ahsan Javed, Stuijk Sander, Jordans Roel, Corporaal Henk, and Boonstra Albert-Jan. 2018. A review of near-memory computing architectures: Opportunities and challenges. In DSD.Google ScholarGoogle Scholar
  151. [151] Singh Gagandeep, Diamantopolous Dionysios, Gómez-Luna Juan, Stuijk Sander, Mutlu Onur, and Corporaal Henk. 2021. Modeling FPGA-based systems via few-shot learning. In FPGA.Google ScholarGoogle Scholar
  152. [152] Singh Gagandeep, Diamantopoulos Dionysios, Hagleitner Christoph, Gómez-Luna Juan, Stuijk Sander, Mutlu Onur, and Corporaal Henk. 2020. NERO: A near high-bandwidth memory stencil accelerator for weather prediction modeling. In FPL.Google ScholarGoogle Scholar
  153. [153] Singh Gagandeep, Diamantopoulos Dionysios, Hagleitner Christoph, Stuijk Sander, and Corporaal Henk. 2019. NARMADA: Near-memory horizontal diffusion accelerator for scalable stencil computations. In FPL.Google ScholarGoogle Scholar
  154. [154] Singh Gagandeep, Diamantopoulos Dionysios, Stuijk Sander, Hagleitner Christoph, and Corporaal Henk. 2019. Low precision processing for high order stencil computations. In Springer LNCS.Google ScholarGoogle Scholar
  155. [155] Strzodka Robert, Shaheen Mohammed, Pajak Dawid, and Seidel Hans-Peter. 2010. Cache oblivious parallelograms in iterative stencil computations. In ICS.Google ScholarGoogle Scholar
  156. [156] Stuecheli Jeffrey et al. 2018. IBM POWER9 opens up a new era of acceleration enablement: OpenCAPI. IBM JRD 62, 4/5 (2018), 8–1.Google ScholarGoogle Scholar
  157. [157] Stuecheli Jeffrey, Blaner Bart, Johns C. R., and Siegel M. S.. 2015. CAPI: A coherent accelerator processor interface. IBM JRD 59, 1 (2015), 7–1.Google ScholarGoogle Scholar
  158. [158] Sukhwani B., Roewer T., Haymes C. L., Kim K., McPadden A. J., Dreps D. M., Sanner D., Lunteren J. V., and Asaad S.. 2017. ConTutto—A novel FPGA-based prototyping platform enabling innovation in the memory subsystem of a server class processor. In MICRO.Google ScholarGoogle Scholar
  159. [159] Szustak Lukasz, Rojek Krzysztof, and Gepner Pawel. 2013. Using Intel Xeon Phi coprocessor to accelerate computations in MPDATA algorithm. In PPAM.Google ScholarGoogle Scholar
  160. [160] Tang Yuan, Chowdhury Rezaul Alam, Kuszmaul Bradley C., Luk Chi-Keung, and Leiserson Charles E.. 2011. The Pochoir stencil compiler. In SPAA.Google ScholarGoogle Scholar
  161. [161] Thaler Felix, Moosbrugger Stefan, Osuna Carlos, Bianco Mauro, Vogt Hannes, Afanasyev Anton, Mosimann Lukas, Fuhrer Oliver, Schulthess Thomas C., and Hoefler Torsten. 2019. Porting the COSMO weather model to Manycore CPUs. In PASC.Google ScholarGoogle Scholar
  162. [162] Thomas Llewellyn. 1949. Elliptic problems in linear differential equations over a network. In Watson Sci. Comput. Lab. Report, Columbia University.Google ScholarGoogle Scholar
  163. [163] Tsai Po-An et al. 2017. Jenga: Software-defined cache hierarchies. In ISCA.Google ScholarGoogle Scholar
  164. [164] Tullsen Dean M., Eggers Susan J., and Levy Henry M.. 1995. Simultaneous multithreading: Maximizing on-chip parallelism. In ISCA.Google ScholarGoogle Scholar
  165. [165] Lunteren Jan van, Luijten Ronald, Diamantopoulos Dionysios, Auernhammer Florian, Hagleitner Christoph, Chelini Lorenzo, Corda Stefano, and Singh Gagandeep. 2019. Coherently attached programmable near-memory acceleration platform and its application to stencil processing. In DATE.Google ScholarGoogle Scholar
  166. [166] Volkov Vasily and Demmel James W.. 2008. Benchmarking GPUs to tune dense linear algebra. In SC.Google ScholarGoogle Scholar
  167. [167] Wahib Mohamed and Maruyama Naoya. 2014. Scalable kernel fusion for memory-bound GPU applications. In SC.Google ScholarGoogle Scholar
  168. [168] Waidyasooriya Hasitha Muthumala and Hariyama Masanori. 2019. Multi-FPGA accelerator architecture for stencil computation exploiting spacial and temporal scalability. IEEE Access 7 (2019), 53188–53201.Google ScholarGoogle Scholar
  169. [169] Waidyasooriya H. M., Takei Y., Tatsumi S., and Hariyama M.. 2017. OpenCL-based FPGA-platform for stencil computation and its optimization methodology. In TPDS.Google ScholarGoogle Scholar
  170. [170] Wang Shuo and Liang Yun. 2017. A comprehensive framework for synthesizing stencil algorithms on FPGAs using OpenCL model. In DAC.Google ScholarGoogle Scholar
  171. [171] Wang Zeke, Huang Hongjing, Zhang Jie, and Alonso Gustavo. 2020. Shuhai: Benchmarking high bandwidth memory on FPGAs. In FCCM.Google ScholarGoogle Scholar
  172. [172] Wenzel Lukas, Schmid Robert, Martin Balthasar, Plauth Max, Eberhardt Felix, and Polze Andreas. 2018. Getting started with CAPI SNAP: Hardware development for software engineers. In Euro-Par.Google ScholarGoogle Scholar
  173. [173] Williams Samuel, Waterman Andrew, and Patterson David. 2009. Roofline: An insightful visual performance model for multicore architectures. In CACM.Google ScholarGoogle Scholar
  174. [174] Wu Lingxi, Sharifi Rasool, Lenjani Marzieh, Skadron Kevin, and Venkat Ashish. 2021. Sieve: Scalable In-situ DRAM-based accelerator designs for massively parallel k-mer matching. In ISCA.Google ScholarGoogle Scholar
  175. [175] Xu Jingheng, Fu Haohuan, Shi Wen, Gan Lin, Li Yuxuan, Luk Wayne, and Yang Guangwen. 2018. Performance tuning and analysis for stencil-based applications on POWER8 processor. ACM TACO 15, 4 (2018), 1–25.Google ScholarGoogle Scholar
  176. [176] Zhang Dongping, Jayasena Nuwan, Lyashevsky Alexander, Greathouse Joseph L., Xu Lifan, and Ignatowski Michael. 2014. TOP-PIM: Throughput-oriented programmable processing in memory. In HPDC.Google ScholarGoogle Scholar
  177. [177] Zhang Jun A., Marks Frank D., Sippel Jason A., Rogers Robert F., Zhang Xuejin, Gopalakrishnan Sundararaman G., Zhang Zhan, and Tallapragada Vijay. 2018. Evaluating the impact of improvement in the horizontal diffusion parameterization on hurricane prediction in the operational hurricane weather research and forecast (HWRF) model. In Weather and Forecasting.Google ScholarGoogle Scholar
  178. [178] Zhu Maohua, Zhuo Youwei, Wang Chao, Chen Wenguang, and Xie Yuan. 2018. Performance evaluation and optimization of HBM-enabled GPU for data-intensive applications. In VLSI.Google ScholarGoogle Scholar
  179. [179] Zohouri Hamid Reza, Podobas Artur, and Matsuoka Satoshi. 2018. Combined spatial and temporal blocking for high-performance stencil computation on FPGAs using OpenCL. In FPGA.Google ScholarGoogle Scholar

Published in

ACM Transactions on Reconfigurable Technology and Systems, Volume 15, Issue 4 (December 2022), 476 pages.
ISSN: 1936-7406. EISSN: 1936-7414. DOI: 10.1145/3540252.
Editor: Deming Chen.

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 1 July 2021
• Revised: 1 October 2021
• Accepted: 1 November 2021
• Online AM: 9 February 2022
• Published: 6 June 2022


          Qualifiers

          • research-article
          • Refereed
