ABSTRACT
Partitioning large matrices is an important problem in distributed linear algebra, with applications in machine learning among other areas. Briefly, the goal is to perform a sequence of matrix algebra operations on large matrices in a distributed manner. However, not every partitioning scheme works well with every matrix algebra operation and its implementation (algorithm). Choosing a suitable partitioning is a type of data tiling problem. In this paper we study a data tiling problem formulated using hypergraphs. We prove hardness results and give a theoretical characterization of its complexity on random instances. Additionally, we develop a greedy algorithm and experimentally show its efficacy.
Index Terms
Distributed Matrix Tiling Using A Hypergraph Labeling Formulation