research-article

Throughput-memory footprint trade-off in synthesis of streaming software on embedded multiprocessors

Published: 24 December 2013

Abstract

We study the trade-off between throughput and memory footprint of embedded software that is synthesized from acyclic static dataflow (task graph) specifications targeting distributed memory multiprocessors. We identify iteration overlapping as a knob in the synthesis process by which one can trade application throughput for its memory requirement. Given an initial processor assignment and non-overlapped task schedule, we formally present underlying properties of the problem, such as constraints on a valid iteration overlapping, maximum possible throughput, and minimum memory footprint. Moreover, we develop an effective algorithm for generation of a rich set of design points that provide a range of trade-off options. Experimental results on a number of applications and architectures validate the effectiveness of our approach.
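As a rough illustration of the trade-off described above, the following sketch uses a toy model that is not taken from the paper: a linear pipeline with one task per processor, where `overlap` is the number of iterations allowed in flight at once. Under that assumption the steady-state period is the larger of the bottleneck stage time and the non-overlapped makespan divided by `overlap`, and the footprint is approximated as one set of buffers per in-flight iteration. The function name `design_points` and its parameters are hypothetical.

```python
from fractions import Fraction


def design_points(stage_times, bytes_per_iteration):
    """Enumerate (overlap, period, footprint) for a toy linear pipeline.

    Assumptions (illustrative only, not the paper's algorithm):
      - one task per processor, executed in a fixed chain;
      - `overlap` iterations may be in flight simultaneously;
      - each in-flight iteration needs `bytes_per_iteration` of buffers.
    """
    makespan = sum(stage_times)      # period of the non-overlapped schedule
    bottleneck = max(stage_times)    # lower bound on any achievable period
    points = []
    overlap = 1
    while True:
        # Period is limited by the slowest stage or by the overlap degree.
        period = max(Fraction(bottleneck), Fraction(makespan, overlap))
        points.append((overlap, period, overlap * bytes_per_iteration))
        if period == bottleneck:
            break                    # more overlap only costs memory
        overlap += 1
    return points


if __name__ == "__main__":
    for overlap, period, footprint in design_points([3, 2, 4, 3], 1024):
        print(f"overlap={overlap} period={float(period)} footprint={footprint} B")
```

In this toy model each extra overlapped iteration shortens the period (raising throughput) at the cost of one more set of buffers, until the bottleneck stage dominates. The paper's actual constraints and algorithm are more general than this sketch.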

