Abstract
How do we develop programs that are easy to express, easy to reason about, and able to achieve high performance on massively parallel machines? To address this problem, we introduce GraphStep, a domain-specific compute model that captures algorithms that act on static, irregular, sparse graphs. In GraphStep, algorithms are expressed directly, without requiring the programmer to explicitly manage parallel synchronization, operation ordering, placement, or scheduling details. Problems in the sparse graph domain are usually highly concurrent and communicate along graph edges. Exposing this concurrency and communication structure allows scheduling of parallel operations and management of communication that is necessary for performance on a spatial computer. We introduce a language syntax for GraphStep applications and study the performance of a semantic network application, a shortest-path application, and a max-flow/min-cut application. The total speedup over sequential versions of the applications studied ranges from a factor of 19 to a factor of 15,000. Spatially aware graph optimizations (e.g., node decomposition, placement and route scheduling) delivered speedups of 3 to 30 times over a spatially oblivious mapping.
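To make the style of computation concrete, the following is a minimal sketch of a step-synchronized, message-passing shortest-path computation on a static graph. It is an illustration only and does not use the GraphStep language syntax, which the abstract does not show. The function name `graph_step_shortest_path`, the adjacency-list input format, and the inbox/outbox barrier scheme are all assumptions made for this example, and edge weights are assumed to be non-negative so that the loop terminates.

```python
import math
from collections import defaultdict


def graph_step_shortest_path(edges, source):
    """Hypothetical sketch: single-source shortest paths in synchronized steps.

    edges maps each node to a list of (neighbor, weight) pairs.
    Weights are assumed non-negative. Returns (distances, step_count).
    """
    # Collect every node that appears as a source or a destination.
    nodes = set(edges)
    for outs in edges.values():
        for dst, _ in outs:
            nodes.add(dst)

    dist = {n: math.inf for n in nodes}

    # Messages waiting to be consumed in the current step.
    inbox = {source: [0]}
    steps = 0

    while inbox:
        outbox = defaultdict(list)
        # Each node with pending messages fires independently. A node reads
        # only its own state and inbox, and writes only to its own state and
        # to outgoing edges, so the iteration order here does not matter.
        for node, msgs in inbox.items():
            best = min(msgs)
            if best < dist[node]:
                dist[node] = best
                # Communicate only along graph edges.
                for dst, weight in edges.get(node, ()):
                    outbox[dst].append(best + weight)
        # Barrier: messages sent in this step are delivered in the next one.
        inbox = outbox
        steps += 1

    return dist, steps
```

For a graph with edges a→b (weight 4), a→c (1), c→b (2), and b→d (5), calling the function with source `"a"` yields distances a=0, c=1, b=3, d=8 after four steps. Because every node update within a step is independent, the inner loop is the part that a parallel or spatial implementation could distribute across processing elements.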
Supplemental Material
Available for Download
Supplemental movie, appendix, image, and software files for "Spatial Hardware Implementation for Sparse Graph Algorithms in GraphStep"