Research Article

Optimizing Tensor Contractions for Embedded Devices with Racetrack and DRAM Memories

Published: 29 September 2020

Abstract

Tensor contraction is a fundamental operation in many algorithms, with applications ranging from quantum chemistry and fluid dynamics to image processing and machine learning. The performance of tensor computations critically depends on the efficient utilization of on-chip and off-chip memories. On low-power embedded devices, efficient management of the memory space becomes even more crucial in order to meet energy constraints. This work investigates strategies for performance- and energy-efficient tensor contractions on embedded systems, using racetrack memory (RTM)-based scratch-pad memory (SPM) and DRAM-based off-chip memory. Compiler optimizations such as loop access order and data layout transformations, paired with architectural optimizations such as prefetching and preshifting, are employed to reduce the shifting overhead in RTMs. Optimizations for off-chip memory, such as the memory access order, the data mapping, and the choice of a suitable memory access granularity, are employed to reduce contention. Experimental results demonstrate that the proposed optimizations improve SPM performance and energy consumption by 32% and 73%, respectively, compared to an iso-capacity SRAM. The memory optimizations reduce the overall DRAM dynamic energy consumption by 80%.
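
To make the core operation and the role of the loop access order concrete, the following minimal C sketch (an illustrative example, not code from the paper; the dimension N and the name contract_ikj are assumptions chosen here) expresses the contraction C[i][j] = sum_k A[i][k] * B[k][j] as a loop nest. Moving the k loop outside the j loop turns the accesses to B and C into unit-stride traversals; on an RTM-based SPM, such reorderings shorten the distance the racetrack must shift between consecutive accesses, which is the kind of overhead the compiler optimizations above target.

#include <stdio.h>

#define N 64  /* illustrative tensor dimension, chosen arbitrarily */

/* C[i][j] += A[i][k] * B[k][j]: the classic i-j-k order walks B
 * column-wise (stride N), whereas the i-k-j order below walks both
 * B and C row-wise (stride 1), a more sequential access pattern. */
static void contract_ikj(double A[N][N], double B[N][N], double C[N][N])
{
    for (int i = 0; i < N; ++i)
        for (int k = 0; k < N; ++k) {
            double a = A[i][k];         /* reused across the whole j loop */
            for (int j = 0; j < N; ++j)
                C[i][j] += a * B[k][j]; /* unit-stride accesses to B and C */
        }
}

int main(void)
{
    static double A[N][N], B[N][N], C[N][N]; /* static: zero-initialized */
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            A[i][j] = 1.0;
            B[i][j] = 2.0;
        }
    contract_ikj(A, B, C);
    printf("C[0][0] = %.1f\n", C[0][0]); /* N * 1.0 * 2.0 = 128.0 */
    return 0;
}

The same idea generalizes to higher-order tensors, where the loop order and the data layout together determine the stride pattern seen by the SPM and the off-chip DRAM.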

