Abstract
State-of-the-art cache-oblivious parallel algorithms for dynamic programming (DP) problems usually guarantee asymptotically optimal cache performance without any tuning of cache parameters, but they often fail to exploit the theoretically best parallelism at the same time. While these algorithms achieve cache-optimality through the use of a recursive divide-and-conquer (DAC) strategy, scheduling tasks at the granularity of task dependency introduces artificial dependencies in addition to those arising from the defining recurrence equations. We removed the artificial dependency by scheduling tasks ready for execution as soon as all its real dependency constraints are satisfied, while preserving the cache-optimality by inheriting the DAC strategy. We applied our approach to a set of widely known dynamic programming problems, such as Floyd-Warshall's All-Pairs Shortest Paths, Stencil, and LCS. Theoretical analyses show that our techniques improve the span of 2-way DAC-based Floyd Warshall's algorithm on an $n$ node graph from $Thn^2n$ to $Thn$, stencil computations on a $d$-dimensional hypercubic grid of width $w$ for $h$ time steps from $Th(d^2 h) w^ (d+2) - 1$ to $Thh$, and LCS on two sequences of length $n$ each from $Thn^_2 3$ to $Thn$. In each case, the total work and cache complexity remain asymptotically optimal. Experimental measurements exhibit a $3$ - $5$ times improvement in absolute running time, $10$ - $20$ times improvement in burdened span by Cilkview, and approximately the same L1/L2 cache misses by PAPI.
- U. A. Acar, G. E. Blelloch, and R. D. Blumofe. The data locality of work stealing. In Proc. of the 12th ACM Annual Symp. on Parallel Algorithms and Architectures (SPAA 2000), pages 1–12, 2000. Google Scholar
Digital Library
- K. Agrawal, C. E. Leiserson, and J. Sukha. Executing task graphs using work stealing. In IPDPS, pages 1–12. IEEE, April 2010.Google Scholar
- A. V. Aho, J. E. Hopcroft, and J. D. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley Publishing Company, 1974. Google Scholar
Digital Library
- R. Bellman. Dynamic Programming. Princeton University Press, 1957. Google Scholar
Digital Library
- G. Bikshandi, J. Guo, D. Hoeflinger, G. Almasi, B. B. Fraguela, M. J. Garzarán, D. Padua, and C. von Praun. Programming for parallelism and locality with hierarchically tiled arrays. In Proceedings of the Eleventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’06, pages 48–57, New York, NY, USA, 2006. ACM. Google Scholar
Digital Library
- R. Bleck, C. Rooth, D. Hu, and L. T. Smith. Salinity-driven thermocline transients in a wind- and thermohaline-forced isopycnic coordinate model of the North Atlantic. Journal of Physical Oceanography, 22(12):1486–1505, 1992. ISSN 0022-3670.Google Scholar
Cross Ref
- G. E. Blelloch and P. B. Gibbons. Effectively sharing a cache among threads. In Proceedings of the sixteenth annual ACM symposium on Parallelism in algorithms and architectures, pages 235–244. ACM, 2004. Google Scholar
Digital Library
- G. E. Blelloch, J. T. Fineman, P. B. Gibbons, and H. V. Simhadri. Scheduling irregular parallel computations on hierarchical caches. In Proceedings of the Twenty-third Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’11, pages 355–366, New York, NY, USA, 2011. ACM. Google Scholar
Digital Library
- R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 207–216, Santa Barbara, California, July 1995. Google Scholar
Digital Library
- S. Browne, J. Dongarra, N. Garner, K. London, and P. Mucci. A scalable cross-platform infrastructure for application performance tuning using hardware counters. SC Conference, 0:42, 2000. Google Scholar
Digital Library
- R. Chowdhury. Cache-efficient Algorithms and Data Structures: Theory and Experimental Evaluation. PhD thesis, Department of Computer Sciences, The University of Texas at Austin, Austin, Texas, 2007. Google Scholar
Digital Library
- R. Chowdhury and V. Ramachandran. Cache-efficient Dynamic Programming Algorithms for Multicores. In Proceedings of ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 207–216, 2008. Google Scholar
Digital Library
- R. Chowdhury and V. Ramachandran. The cache-oblivious Gaussian elimination paradigm: Theoretical framework, parallelization and experimental evaluation. Theory of Computing Systems, 47(4):878–919, 2010. Google Scholar
Digital Library
- R. Chowdhury, F. Silvestri, B. Blakeley, and V. Ramachandran. Oblivious algorithms for multicores and network of processors. Journal of Parallel and Distributed Computing (Special issue on best papers from IPDPS 2010, 2011 and 2012), 73(7):911–925, 2013. A preliminary version appeared as {17}.Google Scholar
- R. A. Chowdhury and V. Ramachandran. Cache-oblivious dynamic programming. In In Proc. of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 06, pages 591–600, 2006. Google Scholar
Digital Library
- R. A. Chowdhury, H.-S. Le, and V. Ramachandran. Cache-oblivious dynamic programming for bioinformatics. TCBB, 7(3):495–510, July-Sept. 2010. Google Scholar
Digital Library
- R. A. Chowdhury, F. Silvestri, B. Blakeley, and V. Ramachandran. Oblivious algorithms for multicores and network of processors. In Proceedings of the 24th IEEE International Parallel & Distributed Processing Symposium, pages 1––12, April 2010..Google Scholar
Cross Ref
- T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. The MIT Press, third edition, 2009. Google Scholar
Digital Library
- K. Datta, M. Murphy, V. Volkov, S. Williams, J. Carter, L. Oliker, D. Patterson, J. Shalf, and K. Yelick. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures. In SC, pages 4:1–4:12, Austin, TX, Nov. 15–18 2008. Google Scholar
Digital Library
- R. Durbin, S. R. Eddy, A. Krogh, and G. Mitchison. Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge university press, 1998.Google Scholar
Cross Ref
- H. Dursun, K.-i. Nomura, W. Wang, M. Kunaseth, L. Peng, R. Seymour, R. K. Kalia, A. Nakano, and P. Vashishta. In-core optimization of high-order stencil computations. In PDPTA, pages 533–538, Las Vegas, NV, July13–16 2009.Google Scholar
- R. Floyd. Algorithm 97 (SHORTEST PATH). Commun. ACM, 5(6): 345, 1962. Google Scholar
Digital Library
- M. Frigo and V. Strumpen. Cache oblivious stencil computations. In ICS, pages 361–366, Cambridge, MA, June 20–22 2005. Google Scholar
Digital Library
- M. Frigo and V. Strumpen. The cache complexity of multithreaded cache oblivious algorithms. In SPAA, pages 271–280, 2006. Google Scholar
Digital Library
- M. Frigo and V. Strumpen. The cache complexity of multithreaded cache oblivious algorithms. Theory of Computing Systems, 45(2):203– 233, 2009. Google Scholar
Digital Library
- M. Frigo, C. E. Leiserson, and K. H. Randall. The implementation of the Cilk-5 multithreaded language. In PLDI ’98, pages 212–223, 1998. Google Scholar
Digital Library
- M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran. Cacheoblivious algorithms. In FOCS, pages 285–297, New York, NY, Oct. 17–19 1999. Google Scholar
Digital Library
- Z. Galil and R. Giancarlo. Speeding up dynamic programming with applications to molecular biology. Theoretical Computer Science, 64: 107–118, 1989. Google Scholar
Digital Library
- Z. Galil and K. Park. Parallel algorithms for dynamic programming recurrences with more than O(1) dependency. Journal of Parallel and Distributed Computing, 21:213–222, 1994. Google Scholar
Digital Library
- R. Giegerich, C. Meyer, and P. Steffen. A discipline of dynamic programming over sequence data. Science of Computer Programming, 51(3):215–263, 2004. Google Scholar
Digital Library
- O. Gotoh. An improved algorithm for matching biological sequences. Journal of Molecular Biology, 162:705–708, 1982.Google Scholar
- J. Guo, G. Biksh, B. B. Fraguela, M. J. Garzarn, and D. Padua. Programming with tiles. In In PPoPP 08: Proceedings of the Thirteenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2008. Google Scholar
Digital Library
- D. Gusfield. Algorithms on strings, trees and sequences: computer science and computational biology. Cambridge University Press, 1997. Google Scholar
Digital Library
- Y. He, C. E. Leiserson, and W. M. Leiserson. The Cilkview scalability analyzer. In SPAA, pages 145–156, Santorini, Greece, June 13–15 2010. Google Scholar
Digital Library
- Intel Corporation. The Intel Many Integrated Core Architecture. http://www.intel.com/content/www/us/en/ architecture-and-technology/many-integrated-core/ intel-many-integrated-core-architecture.html, 2011.Google Scholar
- S. Kamil, P. Husbands, L. Oliker, J. Shalf, and K. Yelick. Impact of modern memory subsystems on cache optimizations for stencil computations. In MSP, pages 36–43, Chicago, IL, June 12 2005. Google Scholar
Digital Library
- S. Kamil, K. Datta, S. Williams, L. Oliker, J. Shalf, and K. Yelick. Implicit and explicit optimizations for stencil computations. In MSPC, pages 51–60, San Jose, CA, 2006. ISBN 1-59593-578-9.. URL http://doi.acm.org/10.1145/1178597.1178605. Google Scholar
Digital Library
- J. O. S. Kennedy. Applications of dynamic programming to agriculture, forestry and fisheries: Review and prognosis. Review of Marketing and Agricultural Economics, 49(03), 1981.Google Scholar
- S. Krishnamoorthy, M. Baskaran, U. Bondhugula, J. Ramanujam, A. Rountev, and P. Sadayappan. Effective automatic parallelization of stencil computations. In PLDI, San Diego, CA, June 10–13 2007. Google Scholar
Digital Library
- S. Maleki, M. Musuvathi, and T. Mytkowicz. Parallelizing dynamic programming through rank convergence. In Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP’14, pages 219–232, New York, NY, USA, 2014. Google Scholar
Digital Library
- ACM. ISBN 978-1-4503-2656-8.Google Scholar
- J. Meng, J. W. Sheaffer, and K. Skadron. Exploiting inter-thread temporal locality for chip multithreading. In 24th IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2010, Atlanta, Georgia, USA, 19-23 April 2010, pages 1–12, 2010.Google Scholar
Cross Ref
- A. Nakano, R. Kalia, and P. Vashishta. Multiresolution molecular dynamics algorithm for realistic materials modeling on parallel computers. Computer Physics Communications, 83(2-3):197–214, 1994. ISSN 0010-4655.Google Scholar
Cross Ref
- A. Nitsure. Implementation and optimization of a cache oblivious lattice Boltzmann algorithm. Master’s thesis, Institut für Informatic, Friedrich-Alexander-Universität Erlangen-Nürnberg, July 2006.Google Scholar
- L. Peng, R. Seymour, K.-i. Nomura, R. K. Kalia, A. Nakano, P. Vashishta, A. Loddoch, M. Netzband, W. R. Volz, and C. C. Wong. High-order stencil computations on multicore clusters. In IPDPS, pages 1–11, Rome, Italy, May 23–29 2009. Google Scholar
Digital Library
- A. A. Robichek, E. J. Elton, and M. J. Gruber. Dynamic programming applications in finance. The Journal of Finance, 26(2):473–506, 1971.Google Scholar
Cross Ref
- D. Romer. It’s fourth down and what does the bellman equation say? a dynamic programming analysis of football strategy. Technical report, National Bureau of Economic Research, 2002.Google Scholar
Cross Ref
- J. Rust. Numerical dynamic programming in economics. Handbook of computational economics, 1:619–729, 1996.Google Scholar
- J. Shun, G. E. Blelloch, J. T. Fineman, and P. B. Gibbons. Reducing contention through priority updates. In SPAA, pages 152–163, 2013. Google Scholar
Digital Library
- H. V. Simhadri, G. E. Blelloch, J. T. Fineman, P. B. Gibbons, and A. Kyrola. Experimental analysis of space-bounded schedulers. In Proceedings of the 26th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’14, pages 30–41, New York, NY, USA, 2014. ACM. Google Scholar
Digital Library
- D. K. Smith. Dynamic programming and board games: A survey. European Journal of Operational Research, 176(3):1299–1318, 2007.Google Scholar
Cross Ref
- A. Taflove and S. Hagness. Computational Electrodynamics: The Finite-Difference Time-Domain Method. Artech House, Norwood, MA, 2000. ISBN 1580530761.Google Scholar
- Y. Tang, R. A. Chowdhury, B. C. Kuszmaul, C.-K. Luk, and C. E. Leiserson. The Pochoir stencil compiler. In SPAA, San Jose, CA, USA, 2011. Google Scholar
Digital Library
- Y. Tang, R. A. Chowdhury, C.-K. Luk, and C. E. Leiserson. Coding stencil computation using the Pochoir stencil-specification language. In HotPar’11, Berkeley, CA, USA, May 2011.Google Scholar
- G. Tzenakis, A. Papatriantafyllou, H. Vandierendonck, P. Pratikakis, and D. S. Nikolopoulos. BDDT: block-level dynamic dependence analysis for task-based parallelism. In Advanced Parallel Processing Technologies - 10th International Symposium, APPT 2013, Stockholm, Sweden, August 27-28, 2013, Revised Selected Papers, pages 17–31, 2013.Google Scholar
- S. Warshall. A theorem on boolean matrices. J. ACM, 9(1):11–12, 1962. Google Scholar
Digital Library
- M. Waterman. Introduction to Computational Biology. Chapman & Hall, London, UK, 1995.Google Scholar
Cross Ref
- S. Williams, J. Carter, L. Oliker, J. Shalf, and K. Yelick. Lattice Boltzmann simulation optimization on leading multicore platforms. In IPDPS, pages 1–14, Miami, FL, Apr. 2008.Google Scholar
Cross Ref
Index Terms
Cache-oblivious wavefront: improving parallelism of recursive dynamic programming algorithms without losing cache-efficiency
Recommendations
Provably Efficient Scheduling of Cache-oblivious Wavefront Algorithms
SPAA '17: Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and ArchitecturesIterative wavefront algorithms for evaluating dynamic programming recurrences exploit optimal parallelism but show poor cache performance. Tiled-iterative wavefront algorithms achieve optimal cache complexity and high parallelism but are cache-aware and ...
Cache-oblivious wavefront: improving parallelism of recursive dynamic programming algorithms without losing cache-efficiency
PPoPP 2015: Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel ProgrammingState-of-the-art cache-oblivious parallel algorithms for dynamic programming (DP) problems usually guarantee asymptotically optimal cache performance without any tuning of cache parameters, but they often fail to exploit the theoretically best ...
Improving Parallelism of Recursive Stencil Computations without Sacrificing Cache Performance
WOSC '14: Proceedings of the Second Workshop on Optimizing Stencil ComputationsThe state-of-the-art "trapezoidal decomposition algorithm" for stencil computations on modern multicore machines use recursive divide-and-conquer (DAC) to achieve asymptotically optimal cache complexity cache-obliviously. But the same DAC approach ...






Comments