Spatial hardware implementation for sparse graph algorithms in GraphStep

Published: 29 September 2011

Abstract

How do we develop programs that are easy to express, easy to reason about, and able to achieve high performance on massively parallel machines? To address this problem, we introduce GraphStep, a domain-specific compute model that captures algorithms that act on static, irregular, sparse graphs. In GraphStep, algorithms are expressed directly without requiring the programmer to explicitly manage parallel synchronization, operation ordering, placement, or scheduling details. Problems in the sparse graph domain are usually highly concurrent and communicate along graph edges. Exposing concurrency and communication structure allows scheduling of parallel operations and management of communication that is necessary for performance on a spatial computer. We study the performance of a semantic network application, a shortest-path application, and a max-flow/min-cut application. We introduce a language syntax for GraphStep applications. The total speedup over sequential versions of the applications studied ranges from a factor of 19 to a factor of 15,000. Spatially-aware graph optimizations (e.g., node decomposition, placement and route scheduling) delivered speedups from 3 to 30 times over a spatially-oblivious mapping.
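
The abstract does not show GraphStep's language syntax, so the following is only a minimal Python sketch of the general idea: a computation on a static sparse graph proceeds in synchronized steps, active nodes send messages along their outgoing edges, and each receiving node reduces its incoming messages before updating its state. The function name `graph_step_shortest_paths`, the adjacency-dictionary format, and the choice of single-source shortest path as the example are assumptions made for illustration; they are not the authors' API or implementation.

```python
import math
from collections import defaultdict


def graph_step_shortest_paths(edges, source):
    """Step-synchronous single-source shortest paths (illustrative sketch).

    edges:  dict mapping node -> list of (neighbor, weight) pairs.
            Assumes the graph has no negative-weight cycles.
    source: the start node.
    Returns (dist, steps): final distances and the number of global steps run.
    """
    # Collect every node, including those that only appear as edge targets.
    nodes = set(edges)
    for outs in edges.values():
        for dst, _ in outs:
            nodes.add(dst)

    dist = {n: math.inf for n in nodes}
    dist[source] = 0
    active = {source}
    steps = 0

    while active:
        # Send phase: each active node sends a candidate distance along
        # each outgoing edge. Messages to the same node are reduced with min.
        inbox = defaultdict(lambda: math.inf)
        for n in active:
            for dst, weight in edges.get(n, []):
                inbox[dst] = min(inbox[dst], dist[n] + weight)

        # Update phase: a node that improves its distance becomes active
        # in the next step; all other nodes stay idle.
        active = set()
        for n, candidate in inbox.items():
            if candidate < dist[n]:
                dist[n] = candidate
                active.add(n)

        steps += 1  # implicit barrier between steps

    return dist, steps
```

In this sketch the programmer writes only the per-edge send and the per-node reduce and update; the loop that plays the role of the barrier, and the order in which nodes are visited within a step, are left to the runtime. The sketch runs sequentially and says nothing about placement or routing on spatial hardware.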

Published in

ACM Transactions on Autonomous and Adaptive Systems, Volume 6, Issue 3
September 2011
150 pages
ISSN: 1556-4665
EISSN: 1556-4703
DOI: 10.1145/2019583

Copyright © 2011 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 29 September 2011
• Accepted: 1 June 2010
• Revised: 1 April 2010
• Received: 1 August 2009

Qualifiers

• research-article
• Research
• Refereed