Abstract
The efficiency of spatial computing depends on the ability to achieve maximal parallelism. This necessitates memory interfaces that can correctly handle memory accesses arriving in arbitrary order while still respecting data dependencies and ensuring appropriate ordering for semantic correctness. However, a typical memory interface for out-of-order processors (i.e., a load-store queue) cannot immediately meet these requirements: spatial systems naturally omit the notion of sequential program order, a fundamental piece of information for correct execution, so a different allocation policy is needed to achieve out-of-order execution. We show a novel and practical way to organize the allocation for an out-of-order load-store queue for spatial computing. The main idea is to dynamically allocate groups of memory accesses (depending on the dynamic behavior of the application), where the access order within each group is statically predetermined (for instance, by a high-level synthesis tool). We detail the construction of our load-store queue and demonstrate its advantages over standard accelerator-memory interfaces on a few practical cases.
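The group-allocation idea can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware construction the paper details: the class `GroupAllocatedLSQ`, its methods, the group names in `GROUP_ORDER`, and the conservative disambiguation policy (a load waits on any earlier, uncommitted store whose address is unknown or equal; stores commit in queue order) are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical static table: for each group (e.g. a basic block), the order of
# its memory accesses as it might be fixed at compile time by an HLS tool.
GROUP_ORDER = {"bb_loop": ["load", "store"], "bb_exit": ["load"]}

@dataclass
class Entry:
    kind: str                    # "load" or "store"
    addr: Optional[int] = None   # unknown until the datapath supplies it
    value: Optional[int] = None
    done: bool = False

class GroupAllocatedLSQ:
    def __init__(self, memory):
        self.memory = memory
        self.entries = []        # list position = reconstructed program order

    def allocate_group(self, group):
        """Dynamically allocate one entry per access of a group, in static order."""
        first = len(self.entries)
        self.entries.extend(Entry(kind) for kind in GROUP_ORDER[group])
        return list(range(first, len(self.entries)))

    def provide(self, idx, addr, value=None):
        """Arguments of an access may arrive in any order."""
        self.entries[idx].addr = addr
        self.entries[idx].value = value

    def step(self):
        """Execute every access that is currently safe; return completed loads."""
        loads = []
        for i, e in enumerate(self.entries):
            if e.done or e.addr is None:
                continue
            earlier = self.entries[:i]
            if e.kind == "load":
                # Assumed policy: blocked by an earlier pending store whose
                # address is unknown or matches this load's address.
                blocked = any(s.kind == "store" and not s.done
                              and (s.addr is None or s.addr == e.addr)
                              for s in earlier)
                if not blocked:
                    e.value, e.done = self.memory.get(e.addr, 0), True
                    loads.append((i, e.value))
            elif e.value is not None and all(p.done for p in earlier):
                # Assumed policy: stores commit only once everything older is done.
                self.memory[e.addr], e.done = e.value, True
        return loads

mem = {0: 10, 4: 20}
lsq = GroupAllocatedLSQ(mem)
it1 = lsq.allocate_group("bb_loop")   # entries 0 (load), 1 (store)
it2 = lsq.allocate_group("bb_loop")   # entries 2 (load), 3 (store)

lsq.provide(it2[0], addr=4)           # younger load arrives first
r1 = lsq.step()                       # blocked: older store address unknown
lsq.provide(it1[1], addr=0, value=99)
r2 = lsq.step()                       # younger load runs ahead of older accesses
lsq.provide(it1[0], addr=0)
r3 = lsq.step()                       # older load reads before the store commits
```

In this model the only ordering information comes from the sequence of `allocate_group` calls plus the static per-group table, which is what lets the second iteration's load complete before the first iteration's accesses without violating the dependency through address 0.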
An Out-of-Order Load-Store Queue for Spatial Computing