ABSTRACT
Memory buffer allocation for on-chip memories is a major challenge in modern machine learning systems that target ML accelerators. In interactive systems such as mobile phones, it is on the critical path of launching ML-enabled applications. In data centers, it is part of complex optimization loops that run many times and are the limiting factor for the quality of compilation results.
In contrast to the traditional memory allocation problem in languages such as C++, where allocation requests arrive dynamically as the application executes, ML systems typically execute a static control flow graph that is known in advance. The task of the memory allocator is to choose buffer locations in device memory such that the total amount of memory in use never exceeds the total memory available on-device. This is a high-dimensional, NP-hard optimization problem that is challenging to solve.
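To make the problem concrete, the following is a minimal sketch (not the paper's implementation; all names are illustrative) of the constraint the allocator must satisfy: each buffer has a live range over the program's execution steps and a chosen address offset, and any two buffers whose live ranges overlap in time must occupy disjoint address ranges, with every buffer fitting below the device memory limit.

```python
from itertools import combinations

def lifetimes_overlap(a, b):
    # Two buffers conflict if their live ranges [start, end) intersect in time.
    return a["start"] < b["end"] and b["start"] < a["end"]

def is_valid_allocation(buffers, memory_size):
    """Check that all offsets fit in device memory and that no two
    simultaneously-live buffers overlap in address space."""
    for buf in buffers:
        if buf["offset"] + buf["size"] > memory_size:
            return False
    for a, b in combinations(buffers, 2):
        if lifetimes_overlap(a, b):
            # Address ranges [offset, offset + size) must be disjoint.
            if (a["offset"] < b["offset"] + b["size"]
                    and b["offset"] < a["offset"] + a["size"]):
                return False
    return True

buffers = [
    {"start": 0, "end": 3, "size": 64, "offset": 0},   # live during steps 0-2
    {"start": 1, "end": 4, "size": 32, "offset": 64},  # overlaps the first in time
    {"start": 3, "end": 5, "size": 64, "offset": 0},   # reuses the first buffer's space
]
print(is_valid_allocation(buffers, memory_size=96))  # True
```

The search problem is choosing the `offset` values: checking a candidate is cheap, but the space of offset assignments is exponential, which is why heuristics and solvers are both brought to bear.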
Today, ML frameworks approach this problem either with ad hoc heuristics or with solver-based methods. Heuristic solutions work for simple cases but fail on more complex instances of the problem. Solver-based solutions can handle these more complex instances, but are expensive and impractical in scenarios where memory allocation is on the critical path, such as on mobile devices that compile models on the fly. We encountered this problem in the development of Google's Pixel 6 phone, where some important models took prohibitively long to compile.
We introduce an approach that solves this challenge by combining constraint optimization with domain-specific knowledge to achieve the best properties of both: a heuristic-based search paired with a solver that guides its decision making. Our approach matches heuristics for simple inputs while being significantly faster than the best Integer Linear Program (ILP) solver-based approach for complex inputs. We also show how ML can be used to continuously improve the search for the long tail of workloads. Our approach is shipping in two production systems: Google's Pixel 6 phone and TPUv4. It achieves up to two orders of magnitude speed-up in allocation time on real ML workloads compared to the highly-tuned production ILP approach that it replaces, and it enables important real-world models that could not otherwise be supported.
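As a rough illustration of the heuristic half of such a hybrid (a generic first-fit placement sketch, not the paper's actual algorithm; all names are assumptions), a fast allocator can place buffers greedily at the lowest free offset, and the point where the greedy pass fails is exactly where a solver-guided search would step in to backtrack or propagate constraints:

```python
def first_fit_offset(size, live_conflicts):
    """Greedy heuristic: place a buffer at the lowest offset that does not
    collide with any already-placed buffer whose lifetime overlaps it."""
    taken = sorted((b["offset"], b["offset"] + b["size"]) for b in live_conflicts)
    offset = 0
    for lo, hi in taken:
        if offset + size <= lo:
            break  # the gap before this occupied range is big enough
        offset = max(offset, hi)
    return offset

def allocate_greedy(buffers, memory_size):
    """Place buffers in order of first use. Returns the placed buffers,
    or None when the heuristic fails and a solver-guided search (with
    backtracking) would be needed instead."""
    placed = []
    for buf in sorted(buffers, key=lambda b: b["start"]):
        conflicts = [p for p in placed
                     if p["start"] < buf["end"] and buf["start"] < p["end"]]
        buf["offset"] = first_fit_offset(buf["size"], conflicts)
        if buf["offset"] + buf["size"] > memory_size:
            return None  # out of memory: hand off to the solver here
        placed.append(buf)
    return placed
```

The appeal of this structure is that simple inputs never pay the solver's cost, while hard inputs still get solved; the abstract's reported speed-ups come from keeping the solver off the common path.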
TelaMalloc: Efficient On-Chip Memory Allocation for Production Machine Learning Accelerators