Research Article · Open Access

TelaMalloc: Efficient On-Chip Memory Allocation for Production Machine Learning Accelerators

Published: 21 December 2022

ABSTRACT

Memory buffer allocation for on-chip memories is a major challenge in modern machine learning systems that target ML accelerators. In interactive systems such as mobile phones, it is on the critical path of launching ML-enabled applications. In data centers, it is part of complex optimization loops that run many times and are the limiting factor for the quality of compilation results.

In contrast to the traditional memory allocation problem in languages such as C++, where allocation requests arrive dynamically as the application executes, ML systems typically execute a static control flow graph that is known in advance. The task of the memory allocator is to choose buffer locations in device memory such that the total amount of memory in use never exceeds the total memory available on-device. This is a high-dimensional, NP-hard optimization problem that is challenging to solve.
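To make the constraint concrete, here is a minimal sketch in Python (with illustrative names not taken from the paper): each buffer has a fixed live range and size, the allocator assigns it a base offset, and an assignment is valid only if buffers with overlapping lifetimes occupy disjoint address ranges and every buffer fits within the device's memory capacity.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Buffer:
    start: int   # first timestep the buffer is live
    end: int     # last timestep the buffer is live (inclusive)
    size: int    # bytes required
    offset: int  # chosen base address in on-chip memory

def is_valid_allocation(buffers, capacity):
    """True iff every buffer fits on-chip and no two buffers that are live
    at the same time overlap in address space."""
    for b in buffers:
        if b.offset < 0 or b.offset + b.size > capacity:
            return False
    for i, a in enumerate(buffers):
        for b in buffers[i + 1:]:
            lifetimes_overlap = a.start <= b.end and b.start <= a.end
            addresses_overlap = (a.offset < b.offset + b.size
                                 and b.offset < a.offset + a.size)
            if lifetimes_overlap and addresses_overlap:
                return False
    return True
```

Choosing offsets that satisfy these constraints amounts to a two-dimensional packing problem over time and address space, which is where the hardness comes from.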

Today, ML frameworks approach this problem either with ad hoc heuristics or with solver-based methods. Heuristic solutions work for simple cases but fail for more complex instances of this problem. Solver-based solutions can handle these more complex instances, but are expensive and impractical in scenarios where memory allocation is on the critical path, such as on mobile devices that compile models on the fly. We encountered this problem in the development of Google's Pixel 6 phone, where some important models took prohibitively long to compile.
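As an illustration of the solver-based route, the packing problem above can be posed to a constraint solver such as Google's OR-Tools CP-SAT. The sketch below is an assumed, simplified formulation rather than the production pipeline's actual one, and the helper name solve_with_cpsat is hypothetical.

```python
from ortools.sat.python import cp_model

def solve_with_cpsat(requests, capacity):
    """requests: list of (start, end, size) triples with fixed lifetimes.
    Returns a list of base offsets, or None if no feasible packing exists."""
    model = cp_model.CpModel()
    time_intervals, space_intervals, offsets = [], [], []
    for i, (start, end, size) in enumerate(requests):
        off = model.NewIntVar(0, capacity - size, f'offset_{i}')
        off_end = model.NewIntVar(size, capacity, f'offset_end_{i}')
        model.Add(off_end == off + size)
        offsets.append(off)
        # The lifetime is fixed; only the spatial placement is a decision variable.
        time_intervals.append(model.NewIntervalVar(start, end - start, end, f'time_{i}'))
        space_intervals.append(model.NewIntervalVar(off, size, off_end, f'space_{i}'))
    # Two buffers may share addresses only if their lifetimes do not overlap.
    model.AddNoOverlap2D(time_intervals, space_intervals)
    solver = cp_model.CpSolver()
    status = solver.Solve(model)
    if status in (cp_model.OPTIMAL, cp_model.FEASIBLE):
        return [solver.Value(o) for o in offsets]
    return None
```

Formulations like this are complete but can take a long time to solve on large graphs, which is what makes them problematic on the compilation critical path.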

We introduce an approach that addresses this challenge by combining constraint optimization with domain-specific knowledge to achieve the best properties of both. We combine a heuristic-based search with a solver that guides its decision making. Our approach matches heuristics for simple inputs while being significantly faster than the best Integer Linear Program (ILP) solver-based approach for complex inputs. We also show how ML can be used to continuously improve the search for the long tail of workloads. Our approach is shipping in two production systems: Google's Pixel 6 phone and TPUv4. It achieves up to two orders of magnitude speed-up in allocation time on real ML workloads compared to the highly tuned production ILP approach that it replaces, and enables important real-world models that could not otherwise be supported.
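The abstract does not spell out the algorithm, so the sketch below only illustrates the general shape of pairing a heuristic with a solver: a heuristic orders buffers and proposes candidate offsets, while a solver-style feasibility oracle prunes partial placements that cannot be completed. The names propose_offsets and still_feasible are hypothetical placeholders, not interfaces from the paper.

```python
def allocate(requests, capacity, propose_offsets, still_feasible):
    """Heuristic search guided by a solver-style feasibility check.

    requests: list of (start, end, size) lifetime/size triples.
    propose_offsets: heuristic yielding candidate base addresses for a request.
    still_feasible: oracle (e.g. a constraint-solver call) that reports whether
        the remaining requests can still be placed given the partial assignment.
    Returns a list of offsets aligned with `requests`, or None on failure.
    """
    # Heuristic ordering: place large, long-lived buffers first.
    order = sorted(range(len(requests)),
                   key=lambda i: (requests[i][2], requests[i][1] - requests[i][0]),
                   reverse=True)
    offsets = [None] * len(requests)

    def search(k):
        if k == len(order):
            return True
        i = order[k]
        for off in propose_offsets(requests[i], requests, offsets, capacity):
            offsets[i] = off
            # Consult the solver before committing to this branch.
            if still_feasible(requests, offsets, capacity) and search(k + 1):
                return True
            offsets[i] = None
        return False

    return offsets if search(0) else None
```

The appeal of this structure is that cheap heuristic decisions handle the bulk of the search while the solver is consulted only to rule out dead ends, which is consistent with the speed-ups reported above.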

