DOI: 10.1145/3466752.3480125
Research article · Open access

Sanger: A Co-Design Framework for Enabling Sparse Attention using Reconfigurable Architecture

Published: 17 October 2021

Abstract

    In recent years, attention-based models have achieved impressive performance in natural language processing and computer vision by effectively capturing contextual knowledge from the entire sequence. However, the attention mechanism inherently contains a large number of redundant connections, imposing a heavy computational burden on model deployment. To this end, sparse attention has emerged as an attractive approach to reduce the computation and memory footprint; it involves both sampled dense-dense matrix multiplication (SDDMM) and sparse-dense matrix multiplication (SpMM), and therefore requires hardware that can effectively eliminate zero-valued operations. Existing techniques based on irregular sparsity patterns suffer from low hardware efficiency, while those based on regular but coarse-grained patterns save less computation.
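    To make the two kernels concrete, the sketch below (not from the paper; a dense-masked NumPy emulation) shows what SDDMM and SpMM compute in sparse attention: SDDMM evaluates the scaled Q·Kᵀ scores only at positions kept by a sparsity mask, and SpMM multiplies the resulting sparse score matrix with the dense value matrix. A real accelerator would skip the masked-out entries rather than compute and then discard them.

```python
import numpy as np

def sparse_attention(Q, K, V, mask):
    """Dense emulation of the two kernels in sparse attention.

    SDDMM: compute scaled Q @ K.T only where `mask` is True.
    SpMM : multiply the row-softmaxed sparse scores with dense V.
    `mask` is a boolean (seq_len, seq_len) array; each row is assumed
    to keep at least one position (e.g., the diagonal).
    """
    d = Q.shape[-1]
    # SDDMM (emulated): masked-out scores are set to -inf so they
    # vanish under softmax; hardware would simply not compute them.
    scores = np.where(mask, Q @ K.T / np.sqrt(d), -np.inf)
    # Numerically stable row-wise softmax over the surviving positions.
    scores = scores - scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    # SpMM: sparse attention probabilities times the dense value matrix.
    return probs @ V
```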
    This paper proposes Sanger, a framework that harvests sparsity in the attention mechanism through synergistic hardware and software co-design. The software part prunes the attention matrix into a dynamic structured pattern, and the hardware part features a reconfigurable architecture that exploits such patterns. Specifically, we dynamically sparsify vanilla attention based on a quantized prediction of the attention matrix, and then re-arrange the resulting sparse mask into structured blocks that are more amenable to hardware implementation. The hardware design of Sanger features a score-stationary dataflow that keeps sparse scores stationary in the PEs to avoid decoding overhead. Using this dataflow and a reconfigurable systolic array, we unify the computation of the SDDMM and SpMM operations; the PEs can be configured at runtime to support different data access and partial-sum accumulation schemes. Experiments on BERT show that Sanger can prune the model to 0.08–0.27 sparsity without accuracy loss, achieving 4.64×, 22.7×, 2.39×, and 1.47× speedups over a V100 GPU, an AMD Ryzen Threadripper 3970X CPU, and the state-of-the-art attention accelerators A3 and SpAtten, respectively.
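    As a rough illustration of the software side, the sketch below (an assumption-laden approximation, not Sanger's exact algorithm) predicts the attention mask from a cheap low-bit estimate of the score matrix and keeps only entries whose predicted probability clears a threshold. The bit width, quantizer, and threshold value are illustrative choices, and the subsequent packing of the mask into hardware-friendly structured blocks is omitted.

```python
import numpy as np

def predict_sparse_mask(Q, K, n_bits=4, threshold=0.02):
    """Sketch of threshold-based mask prediction from quantized scores.

    Bit width, quantizer, and threshold are illustrative assumptions,
    not the paper's exact configuration.
    """
    def quantize(x):
        # Uniform symmetric quantization to n_bits (illustrative).
        scale = np.abs(x).max() / (2 ** (n_bits - 1) - 1)
        return np.round(x / scale) * scale

    d = Q.shape[-1]
    approx = quantize(Q) @ quantize(K).T / np.sqrt(d)   # cheap score estimate
    approx = approx - approx.max(axis=-1, keepdims=True)  # stable softmax
    probs = np.exp(approx)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    return probs >= threshold  # boolean mask consumed by the SDDMM/SpMM kernels
```

In the actual design, such a mask would still be re-arranged into structured blocks before being dispatched to the reconfigurable systolic array, as described in the abstract.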

    References

    [1]
    Jonathan Bachrach, Huy Vo, Brian Richards, Yunsup Lee, Andrew Waterman, Rimas Avižienis, John Wawrzynek, and Krste Asanović. 2012. Chisel: constructing hardware in a scala embedded language. In Proceedings of the Design Automation Conference (DAC).
    [2]
    Iz Beltagy, Matthew E Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150 (2020).
    [3]
    Yoshua Bengio, N. Léonard, and Aaron C. Courville. 2013. Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation. arXiv preprint arXiv:1308.3432 (2013).
    [4]
    James Bennett and Stan Lanning. 2007. The netflix prize. In Proceedings of KDD cup and workshop.
    [5]
    Shijie Cao, Lingxiao Ma, W. Xiao, Chen Zhang, Yunxin Liu, L. Zhang, L. Nie, and Z. Yang. 2019. SeerNet: Predicting Convolutional Neural Network Feature-Map Sparsity Through Low-Bit Quantization. In Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR).
    [6]
    Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. 2014. DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. ACM SIGPLAN Notices (2014).
    [7]
    Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. 2014. DaDianNao: A machine-learning supercomputer. In Proceedings of the International Symposium on Microarchitecture (MICRO).
    [8]
    Yu-Hsin Chen, Tushar Krishna, Joel S Emer, and Vivienne Sze. 2016. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits (2016).
    [9]
    Yu-Hsin Chen, Tien-Ju Yang, Joel Emer, and Vivienne Sze. 2018. Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices. arXiv preprint arXiv:1807.07928 (2018).
    [10]
    Sharan Chetlur, C. Woolley, Philippe Vandermersch, J. Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. cuDNN: Efficient Primitives for Deep Learning. arXiv preprint arXiv:1410.0759 (2014).
    [11]
    Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509 (2019).
    [12]
    Gonçalo M. Correia, Vlad Niculae, and André F. T. Martins. 2019. Adaptively Sparse Transformers. In Proceedings of Conference on Empirical Methods in Natural Language Processing/International Joint Conference on Natural Language Processing.
    [13]
    Baiyun Cui, Y. Li, Ming Chen, and Z. Zhang. 2019. Fine-tune BERT with Sparse Self-Attention Mechanism. In Proceedings of Conference on Empirical Methods in Natural Language Processing/International Joint Conference on Natural Language Processing.
    [14]
    Alberto Delmas Lascorz, Patrick Judd, Dylan Malone Stuart, Zissis Poulos, Mostafa Mahmoud, Sayeh Sharify, Milos Nikolic, Kevin Siu, and Andreas Moshovos. 2019. Bit-tactical: A software/hardware approach to exploiting value and bit sparsity in neural networks. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
    [15]
    Chunhua Deng, Siyu Liao, Yi Xie, Keshab K Parhi, Xuehai Qian, and Bo Yuan. 2018. PermDNN: Efficient Compressed DNN Architecture with Permuted Diagonal Matrices. In Proceedings of the International Symposium on Microarchitecture (MICRO).
    [16]
    J. Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics – Human Language Technologies.
    [17]
    Zidong Du, Robert Fasthuber, Tianshi Chen, Paolo Ienne, Ling Li, Tao Luo, Xiaobing Feng, Yunji Chen, and Olivier Temam. 2015. ShiDianNao: Shifting vision processing closer to the sensor. In Proceedings of SIGARCH Computer Architecture News.
    [18]
    Cong Guo, Bo Yang Hsueh, Jingwen Leng, Yuxian Qiu, Yue Guan, Zehuan Wang, Xiaoying Jia, Xipeng Li, Minyi Guo, and Yuhao Zhu. 2020. Accelerating sparse DNN models without hardware-support via tile-wise sparsity. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
    [19]
    Tae Jun Ham, Sung Jun Jung, Seonghak Kim, Young H Oh, Yeonhong Park, Yoonho Song, Jung-Hun Park, Sanghee Lee, Kyoung Park, Jae W Lee, et al. 2020. A3: Accelerating Attention Mechanisms in Neural Networks with Approximation. In Proceedings of the International Symposium on High Performance Computer Architecture (HPCA).
    [20]
    Song Han, Junlong Kang, Huizi Mao, Yiming Hu, Xin Li, Yubin Li, Dongliang Xie, Hong Luo, Song Yao, and Yu Wang. 2017. ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA. In Proceedings of the International Symposium on Field Programmable Gate Arrays (FPGA).
    [21]
    Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A Horowitz, and William J Dally. 2016. EIE: efficient inference engine on compressed deep neural network. In Proceedings of the International Symposium on Computer Architecture (ISCA).
    [22]
    Weizhe Hua, Yuan Zhou, Christopher De Sa, Zhiru Zhang, and G Edward Suh. 2019. Boosting the performance of CNN accelerators with dynamic fine-grained channel gating. In Proceedings of the 52nd International Symposium on Microarchitecture (MICRO).
    [23]
    Intel. 2021. oneapi-src/oneDNN. https://github.com/oneapi-src/oneDNN.
    [24]
    Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina Giannoula, Roknoddin Azizi, Skanda Koppula, Nika Mansouri Ghiasi, Taha Shahroodi, Juan Gomez Luna, and Onur Mutlu. 2019. SMASH: Co-designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations. In Proceedings of the International Symposium on Microarchitecture (MICRO).
    [25]
    Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. 2020. Reformer: The Efficient Transformer. arXiv preprint arXiv:2001.04451 (2020).
    [26]
    HT Kung, Bradley McDanel, and Sai Qian Zhang. 2019. Packing sparse convolutional neural networks for efficient systolic array implementations: Column combining under joint optimization. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
    [27]
    HT Kung, Bradley McDanel, Sai Qian Zhang, Xin Dong, and Chih Chiang Chen. 2019. Maestro: A memory-on-logic architecture for coordinated parallel use of many systolic arrays. In Proceedings of International Conference on Application-specific Systems, Architectures and Processors (ASAP).
    [28]
    Hyoukjun Kwon, Ananda Samajdar, and Tushar Krishna. 2018. Maeri: Enabling flexible dataflow mapping over dnn accelerators via reconfigurable interconnects. In SIGPLAN Notices, Vol. 53. 461–475.
    [29]
    Alberto Delmás Lascorz, Sayeh Sharify, Isak Edo, Dylan Malone Stuart, Omar Mohamed Awad, Patrick Judd, Mostafa Mahmoud, Milos Nikolic, Kevin Siu, Zissis Poulos, et al. 2019. Shapeshifter: Enabling fine-grain data width adaptation in deep learning. In Proceedings of the International Symposium on Microarchitecture (MICRO).
    [30]
    Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2020. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. Proceedings of Annual Conference of the Association for Computational Linguistics (2020).
    [31]
    Bingbing Li, Santosh Pandey, Haowen Fang, Yanjun Lyv, Ji Li, Jieyang Chen, Mimi Xie, Lipeng Wan, Hang Liu, and Caiwen Ding. 2020. FTRANS: energy-efficient acceleration of transformers using FPGA. In Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED).
    [32]
    Yun Liang, Liqiang Lu, Yicheng Jin, Jiaming Xie, Ruirui Huang, Jiansong Zhang, and Wei Lin. 2021. An Efficient Hardware Design for Accelerating Sparse CNNs with NAS-based Models. Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD) (2021).
    [33]
    Yun Liang, Liqiang Lu, and Jiaming Xie. 2020. OMNI: A framework for integrating hardware and software optimizations for sparse CNNs. Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD) (2020).
    [34]
    Daofu Liu, Tianshi Chen, Shaoli Liu, Jinhong Zhou, Shengyuan Zhou, Olivier Teman, Xiaobing Feng, Xuehai Zhou, and Yunji Chen. 2015. Pudiannao: A polyvalent machine learning accelerator. In Proceedings of SIGARCH Computer Architecture News.
    [35]
    Liqiang Lu, Naiqing Guan, Yuyue Wang, Liancheng Jia, Zizhang Luo, Jieming Yin, Jason Cong, and Yun Liang. 2021. TENET: A Framework for Modeling Tensor Dataflow Based on Relation-centric Notation. In Proceedings of the International Symposium on Computer Architecture (ISCA).
    [36]
    Liqiang Lu and Yun Liang. 2018. SpWA: an efficient sparse winograd convolutional neural networks accelerator on FPGAs. In Proceedings of the Design Automation Conference (DAC).
    [37]
    Liqiang Lu, Jiaming Xie, Ruirui Huang, Jiansong Zhang, Wei Lin, and Yun Liang. 2019. An efficient hardware accelerator for sparse convolutional neural networks on FPGAs. In Proceedings of International Symposium on Field-Programmable Custom Computing Machines (FCCM).
    [38]
    Julian McAuley and Jure Leskovec. 2013. Hidden factors and hidden topics: understanding rating dimensions with review text. In Proceedings of the Conference on Recommender Systems (RecSys).
    [39]
    NVIDIA. 2021. NVIDIA/DeepLearningExamples. https://github.com/NVIDIA/DeepLearningExamples.
    [40]
    Subhankar Pal, Jonathan Beaumont, Dong-Hyeon Park, Aporva Amarnath, Siying Feng, Chaitali Chakrabarti, Hun-Seok Kim, David Blaauw, Trevor Mudge, and Ronald Dreslinski. 2018. Outerspace: An outer product based sparse matrix multiplication accelerator. In Proceedings of the International Symposium on High Performance Computer Architecture (HPCA).
    [41]
    Angshuman Parashar, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan, Brucek Khailany, Joel Emer, Stephen W Keckler, and William J Dally. 2017. Scnn: An accelerator for compressed-sparse convolutional neural networks. In Proceedings of SIGARCH Computer Architecture News.
    [42]
    Adam Paszke, S. Gross, Francisco Massa, A. Lerer, J. Bradbury, G. Chanan, Trevor Killeen, Z. Lin, N. Gimelshein, L. Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, B. Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In NeurIPS.
    [43]
    Wajahat Qadeer, Rehan Hameed, Ofer Shacham, Preethi Venkatesan, Christos Kozyrakis, and Mark A Horowitz. 2013. Convolution engine: balancing efficiency & flexibility in specialized computing. In Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA).
    [44]
    Eric Qin, Ananda Samajdar, Hyoukjun Kwon, Vineet Nadella, Sudarshan Srinivasan, Dipankar Das, Bharat Kaul, and Tushar Krishna. 2020. SIGMA: A Sparse and Irregular GEMM Accelerator with Flexible Interconnects for DNN Training. In Proceedings of the International Symposium on High Performance Computer Architecture (HPCA).
    [45]
    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog (2019).
    [46]
    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, 2019. Language models are unsupervised multitask learners. OpenAI blog (2019).
    [47]
    Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of Conference on Empirical Methods in Natural Language Processing.
    [48]
    Aurko Roy, M. Saffar, Ashish Vaswani, and David Grangier. 2020. Efficient Content-Based Sparse Attention with Routing Transformers. Transactions of the Association for Computational Linguistics (TACL).
    [49]
    Shaden Smith and George Karypis. 2015. Tensor-matrix products with a compressed sparse tensor. In Proceedings of the Workshop on Irregular Applications: Architectures and Algorithms.
    [50]
    Mingcong Song, J. Zhao, Y. Hu, Jiaqi Zhang, and Tao Li. 2018. Prediction Based Execution on Deep Neural Networks. In Proceedings of the International Symposium on Computer Architecture (ISCA).
    [51]
    Nitish Srivastava, Hanchen Jin, Shaden Smith, Hongbo Rong, David Albonesi, and Zhiru Zhang. 2020. Tensaurus: A versatile accelerator for mixed sparse-dense tensor computations. In Proceedings of the International Symposium on High Performance Computer Architecture (HPCA).
    [52]
    Yi Tay, Dara Bahri, L. Yang, Donald Metzler, and D. Juan. 2020. Sparse Sinkhorn Attention. In Proceedings of International Conference on Machine Learning (ICML).
    [53]
    Yi Tay, M. Dehghani, Samira Abnar, Y. Shen, Dara Bahri, Philip Pham, J. Rao, Liu Yang, Sebastian Ruder, and Donald Metzler. 2020. Long Range Arena: A Benchmark for Efficient Transformers. arXiv preprint arXiv:2011.04006 (2020).
    [54]
    Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. 2020. Efficient transformers: A survey. arXiv preprint arXiv:2009.06732 (2020).
    [55]
    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS).
    [56]
    Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In Proceedings of the Workshop: Analyzing and Interpreting Neural Networks for NLP.
    [57]
    Hanrui Wang, Zhekai Zhang, and Song Han. 2021. SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning. In Proceedings of the International Symposium on High Performance Computer Architecture (HPCA).
    [58]
    Meiqi Wang, Siyuan Lu, Danyang Zhu, Jun Lin, and Zhongfeng Wang. 2018. A high-speed and low-complexity architecture for softmax function in deep learning. In Proceedings of 2018 Asia Pacific Conference on Circuits and Systems.
    [59]
    X. Wang, Ross B. Girshick, A. Gupta, and Kaiming He. 2018. Non-local Neural Networks. In Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR).
    [60]
    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2020. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of Conference on Empirical Methods in Natural Language Processing.
    [61]
    Qingcheng Xiao, Size Zheng, Bingzhe Wu, Pengcheng Xu, Xuehai Qian, and Yun Liang. 2021. HASCO: Towards Agile HArdware and Software CO-design for Tensor Computation. In Proceedings of the International Symposium on Computer Architecture (ISCA).
    [62]
    Qizhe Xie, Guokun Lai, Zihang Dai, and E. Hovy. 2018. Large-scale Cloze Test Dataset Created by Teachers. In Proceedings of Conference on Empirical Methods in Natural Language Processing.
    [63]
    Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, R. Salakhutdinov, R. Zemel, and Yoshua Bengio. 2015. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In Proceedings of International Conference on Machine Learning (ICML).
    [64]
    Tzu-Hsien Yang, Hsiang-Yun Cheng, Chia-Lin Yang, I-Ching Tseng, Han-Wen Hu, Hung-Sheng Chang, and Hsiang-Pang Li. 2019. Sparse reram engine: Joint exploration of activation and weight sparsity in compressed neural networks. In Proceedings of the 46th International Symposium on Computer Architecture (ISCA).
    [65]
    Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. 2020. Big Bird: Transformers for longer sequences. arXiv preprint arXiv:2007.14062 (2020).
    [66]
    Han Zhang, I. Goodfellow, Dimitris N. Metaxas, and Augustus Odena. 2019. Self-Attention Generative Adversarial Networks. In Proceedings of International Conference on Machine Learning (ICML).
    [67]
    Shijin Zhang, Zidong Du, Lei Zhang, Huiying Lan, Shaoli Liu, Ling Li, Qi Guo, Tianshi Chen, and Yunji Chen. 2016. Cambricon-X: An accelerator for sparse neural networks. In Proceedings of the International Symposium on Microarchitecture (MICRO).
    [68]
    Guangxiang Zhao, Junyang Lin, Zhiyuan Zhang, Xuancheng Ren, Qi Su, and X. Sun. 2019. Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection. arXiv preprint arXiv:1912.11637 (2019).
    [69]
    Xuda Zhou, Zidong Du, Qi Guo, Shaoli Liu, Chengsi Liu, Chao Wang, Xuehai Zhou, Ling Li, Tianshi Chen, and Yunji Chen. 2018. Cambricon-S: Addressing Irregularity in Sparse Neural Networks through A Cooperative Software/Hardware Approach. In Proceedings of the International Symposium on Microarchitecture (MICRO).
    [70]
    Maohua Zhu, Tao Zhang, Zhenyu Gu, and Yuan Xie. 2019. Sparse Tensor Core: Algorithm and Hardware Co-Design for Vector-wise Sparse Neural Networks on Modern GPUs. In Proceedings of the 52nd International Symposium on Microarchitecture (MICRO).

    Published In

    MICRO '21: MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture
    October 2021
    1322 pages
    ISBN: 9781450385572
    DOI: 10.1145/3466752

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Transformer
    2. attention
    3. hardware-software co-design
    4. reconfigurable architecture
    5. sparse
    6. systolic array
