An Out-of-Order Load-Store Queue for Spatial Computing

Published: 27 September 2017

Abstract

The efficiency of spatial computing depends on the ability to achieve maximal parallelism. This necessitates memory interfaces that can correctly handle memory accesses that arrive in arbitrary order while still respecting data dependencies and ensuring appropriate ordering for semantic correctness. However, a typical memory interface for out-of-order processors (i.e., a load-store queue) cannot immediately meet these requirements: a different allocation policy is needed to achieve out-of-order execution in spatial systems, which naturally omit the notion of sequential program order, a fundamental piece of information for correct execution. We show a novel and practical way to organize the allocation for an out-of-order load-store queue for spatial computing. The main idea is to dynamically allocate groups of memory accesses (depending on the dynamic behavior of the application), where the access order within each group is statically predetermined (for instance, by a high-level synthesis tool). We detail the construction of our load-store queue and demonstrate its advantages over standard accelerator-memory interfaces on a few practical cases.
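
The group-allocation idea can be illustrated with a small software model. The sketch below is not the construction detailed in the paper; it is a minimal illustration under stated assumptions. The table `GROUP_ORDER`, the class `GroupAllocatedLSQ`, and all method names are hypothetical, and the issue rule shown (a load waits until every older, unfinished store has a known, different address) is a deliberately conservative simplification with no store-to-load forwarding.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical static table: for each group (for example, a basic block),
# the order of its memory accesses as a compile-time tool might fix it.
GROUP_ORDER = {
    "bb_loop": ["load", "store"],
    "bb_exit": ["load"],
}


@dataclass
class Entry:
    kind: str                   # "load" or "store"
    addr: Optional[int] = None  # unknown until the address arrives
    done: bool = False          # set once the access has completed


class GroupAllocatedLSQ:
    """Toy queue: groups are allocated dynamically, entries within a
    group follow the static order, so the list reflects program order."""

    def __init__(self) -> None:
        self.entries: List[Entry] = []  # oldest entry first

    def allocate_group(self, group: str) -> List[int]:
        """Reserve one entry per access of the group, in its static order,
        and return the indices of the new entries."""
        start = len(self.entries)
        self.entries.extend(Entry(kind) for kind in GROUP_ORDER[group])
        return list(range(start, len(self.entries)))

    def set_address(self, index: int, addr: int) -> None:
        """Record an address; addresses may arrive in any order."""
        self.entries[index].addr = addr

    def can_issue_load(self, index: int) -> bool:
        """Conservative check: every older, unfinished store must have a
        known address that differs from the load's address."""
        entry = self.entries[index]
        if entry.kind != "load" or entry.addr is None or entry.done:
            return False
        for older in self.entries[:index]:
            if older.kind == "store" and not older.done:
                if older.addr is None or older.addr == entry.addr:
                    return False
        return True


# Two loop iterations enter the same group; the younger load's address
# arrives before the older store's address.
lsq = GroupAllocatedLSQ()
first = lsq.allocate_group("bb_loop")
second = lsq.allocate_group("bb_loop")
lsq.set_address(second[0], 8)
blocked = lsq.can_issue_load(second[0])  # older store address still unknown
lsq.set_address(first[1], 4)
ready = lsq.can_issue_load(second[0])    # older store resolved, no conflict
```

In this model, the only runtime decision is which group to allocate next; the position of each access inside its group comes from the static table, which is how the queue recovers an ordering that the spatial datapath itself does not carry.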
