A sparse iteration space transformation framework for sparse tensor algebra

Abstract
We address the problem of optimizing sparse tensor algebra in a compiler and show how to define standard loop transformations---split, collapse, and reorder---on sparse iteration spaces. The key idea is to track the transformation functions that map the original iteration space to derived iteration spaces. These functions are needed by the code generator to emit code that maps coordinates between iteration spaces at runtime, since the coordinates in the sparse data structures remain in the original iteration space. We further demonstrate that derived iteration spaces can tile both the universe of coordinates and the subset of nonzero coordinates: the former is analogous to tiling dense iteration spaces, while the latter tiles sparse iteration spaces into statically load-balanced blocks of nonzeros. Tiling the space of nonzeros lets the generated code efficiently exploit heterogeneous compute resources such as threads, vector units, and GPUs.
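The coordinate-mapping idea above can be illustrated with a minimal sketch (the helper names are hypothetical, not TACO's API): split and collapse are modeled as functions that map points in a derived iteration space back to the original one, which is what generated code must do at runtime when probing sparse data structures whose coordinates remain in the original space.

```python
# Model the transformation functions that map coordinates in a derived
# iteration space back to the original iteration space.

def split_map(size):
    """split i -> (i0, i1): a derived point maps back as i = i0*size + i1."""
    return lambda i0, i1: i0 * size + i1

def collapse_map(n_inner):
    """collapse (i, j) -> f: a derived point maps back as (f // n, f % n)."""
    return lambda f: (f // n_inner, f % n_inner)

# Splitting a 10-iteration loop by 4: derived point (2, 1) is original i = 9.
i = split_map(4)(2, 1)    # -> 9

# Collapsing a 3x5 loop nest: derived point f = 13 is original (2, 3).
ij = collapse_map(5)(13)  # -> (2, 3)
```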
We implement these concepts by extending the sparse iteration theory implementation in the TACO system. The associated scheduling API can be used by performance engineers, or it can be the target of an automatic scheduling system. We outline one heuristic autoscheduling system, but other systems are possible. Using the scheduling API, we show how to optimize mixed sparse-dense tensor algebra expressions on CPUs and GPUs. Our results show that the sparse transformations are sufficient to generate code with performance competitive with hand-optimized implementations from the literature, while generalizing to all of tensor algebra.
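The load-balancing claim can be made concrete with a small sketch (a simplified illustration under assumed CSR storage, not TACO's generated code): the positions of a CSR matrix's nonzeros are cut into equal-size tiles, so each thread receives the same number of nonzeros regardless of row skew, and a position is mapped back to its row coordinate in the original iteration space by a binary search over the row pointers.

```python
import bisect

def nonzero_tiles(rowptr, n_tiles):
    """Partition CSR nonzero positions [0, nnz) into n_tiles equal blocks.
    Returns (lo, hi) position ranges; each tile owns ~nnz/n_tiles entries."""
    nnz = rowptr[-1]
    step = -(-nnz // n_tiles)  # ceiling division
    return [(t * step, min((t + 1) * step, nnz)) for t in range(n_tiles)]

def row_of(rowptr, p):
    """Map a nonzero position p back to its row coordinate in the
    original iteration space (binary search over the row pointers)."""
    return bisect.bisect_right(rowptr, p) - 1

# A skewed 4-row matrix: row 0 holds 6 of the 8 nonzeros, yet each of
# the two tiles gets exactly 4 nonzeros.
rowptr = [0, 6, 7, 7, 8]
tiles = nonzero_tiles(rowptr, 2)  # -> [(0, 4), (4, 8)]
row = row_of(rowptr, 6)           # position 6 lives in row 1
```

Tiling rows instead of positions would assign 6 nonzeros to whichever thread owns row 0; tiling positions keeps the per-thread work statically balanced.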