Abstract
Exploiting parallelism to accelerate a computation typically involves dividing it into many small tasks that can be assigned to different processing elements. An efficient execution schedule for these tasks can be difficult or impossible to determine in advance, however, if there is uncertainty as to when each task's input data will be available. Ideally, each task would run in direct response to the arrival of its input data, thus allowing the computation to proceed in a fine-grained event-driven manner. Realizing this ideal is difficult in practice, and typically requires sacrificing flexibility for performance.
In Anton 2, a massively parallel special-purpose supercomputer for molecular dynamics simulations, we addressed this challenge by including a hardware block, called the dispatch unit, that provides flexible and efficient support for fine-grained event-driven computation. Its novel features include a many-to-many mapping from input data to a set of synchronization counters, and the ability to prioritize tasks based on their type. To solve the additional problem of using a fixed set of synchronization counters to track input data for a potentially large number of tasks, we created a software library that allows programmers to treat Anton 2 as an idealized machine with infinitely many synchronization counters. The dispatch unit, together with this library, made it possible to simplify our molecular dynamics software by expressing it as a collection of independent tasks, and the resulting fine-grained execution schedule improved overall performance by up to 16% relative to a coarse-grained schedule for precisely the same computation.
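The counter-driven style of execution described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not a description of the Anton 2 dispatch unit: the class `ToyDispatcher`, its methods, the data tags, and the task names are all hypothetical. It shows three ideas from the abstract: data arrivals that increment several counters (a many-to-many mapping), tasks that become runnable when their counter reaches a target, and ready tasks that are served in priority order by type. It does not model the library that virtualizes a fixed set of counters.

```python
import heapq
from collections import defaultdict


class ToyDispatcher:
    """Hypothetical toy model: arrivals bump counters, full counters release tasks."""

    def __init__(self):
        self.counts = defaultdict(int)   # counter id -> arrivals seen so far
        self.routes = defaultdict(set)   # data tag -> counter ids it increments
        self.waiting = {}                # counter id -> (target, priority, task name)
        self.ready = []                  # heap of (priority, sequence, task name)
        self.seq = 0                     # keeps FIFO order within one priority

    def route(self, tag, counter):
        """Declare that data with this tag increments this counter."""
        self.routes[tag].add(counter)

    def add_task(self, counter, target, priority, name):
        """Register a task that runs once `counter` reaches `target`."""
        self.waiting[counter] = (target, priority, name)

    def arrive(self, tag):
        """Simulate arrival of one piece of input data."""
        for counter in sorted(self.routes[tag]):
            self.counts[counter] += 1
            task = self.waiting.get(counter)
            if task is not None and self.counts[counter] == task[0]:
                heapq.heappush(self.ready, (task[1], self.seq, task[2]))
                self.seq += 1
                del self.waiting[counter]

    def next_task(self):
        """Return the highest-priority ready task (lowest number), or None."""
        return heapq.heappop(self.ready)[2] if self.ready else None


def demo():
    d = ToyDispatcher()
    # "positions" feeds two counters; "c_force" is fed by two tags.
    d.route("positions", "c_force")
    d.route("positions", "c_bond")
    d.route("charges", "c_force")
    d.add_task("c_force", target=2, priority=0, name="force")  # urgent task type
    d.add_task("c_bond", target=1, priority=1, name="bond")    # less urgent type
    d.arrive("positions")  # "bond" becomes ready; "force" still waits for charges
    d.arrive("charges")    # "force" becomes ready
    order = []
    while (task := d.next_task()) is not None:
        order.append(task)
    return order


if __name__ == "__main__":
    print(demo())
```

In this run the "bond" task becomes ready first, but the "force" task is dispatched ahead of it because its type has the more urgent priority. No task runs before all of the data it depends on has arrived.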
Hardware support for fine-grained event-driven computation in Anton 2
ASPLOS '13: Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems