Abstract
We study the trade-off between throughput and memory footprint of embedded software that is synthesized from acyclic static dataflow (task graph) specifications targeting distributed-memory multiprocessors. We identify iteration overlapping as a knob in the synthesis process by which one can trade application throughput against memory requirement. Given an initial processor assignment and a non-overlapped task schedule, we formally present the underlying properties of the problem, such as the constraints on a valid iteration overlapping, the maximum possible throughput, and the minimum memory footprint. Moreover, we develop an effective algorithm for generating a rich set of design points that provide a range of trade-off options. Experimental results on a number of applications and architectures validate the effectiveness of our approach.
Index Terms
Throughput-memory footprint trade-off in synthesis of streaming software on embedded multiprocessors