Abstract
The stream-processing model is a natural fit for multicore systems because it exposes the inherent locality and concurrency of a program and highlights its separable tasks for efficient parallel implementation. We present flexible filters, a load-balancing optimization technique for stream programs. Flexible filters use the programmability of the cores to improve the data-processing throughput of individual bottleneck tasks by “borrowing” resources from neighbors in the stream. Our technique is distributed and scalable because all runtime load-balancing decisions are based on point-to-point handshake signals exchanged between neighboring cores. Load balancing with flexible filters increases the system-level processing throughput of stream applications, particularly those with large dynamic variations in the computational load of their tasks. We empirically evaluate flexible filters in a homogeneous multicore environment over a suite of five real-world stream programs.
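The borrowing idea can be illustrated with a toy model. The sketch below is not the authors' implementation; it is a minimal tick-based simulation built on assumptions that go beyond the abstract: a two-stage pipeline (a cheap filter A feeding a bottleneck filter B through a bounded buffer), fixed per-item costs, and a rule that a full buffer is the only "handshake" signal core A sees. The function `simulate` and all of its parameters are hypothetical names introduced for illustration.

```python
def simulate(num_items, cost_a, cost_b, buffer_cap, flexible):
    """Return the number of ticks needed to push num_items through A -> B.

    Hypothetical toy model: core A runs filter A, core B runs filter B.
    With flexible=True, core A runs filter B on its own item whenever the
    buffer toward core B is full (backpressure) instead of stalling.
    """
    remaining = num_items      # items not yet started by filter A
    queued = 0                 # items waiting in the A -> B buffer
    done = 0                   # items that have completed filter B
    a_mode, a_left = None, 0   # what core A is running, and ticks left
    b_left = 0                 # ticks left on core B's current item
    ticks = 0

    while done < num_items:
        ticks += 1

        # Core B: take an item from the buffer when idle, then work one tick.
        if b_left == 0 and queued > 0:
            queued -= 1
            b_left = cost_b
        if b_left > 0:
            b_left -= 1
            if b_left == 0:
                done += 1

        # Core A: start a new item when idle, then work one tick.
        if a_mode is None and remaining > 0:
            remaining -= 1
            a_mode, a_left = "A", cost_a
        if a_mode is not None and a_left > 0:
            a_left -= 1

        if a_mode == "B" and a_left == 0:
            # Core A finished borrowed filter-B work; the item bypasses core B.
            done += 1
            a_mode = None
        elif a_mode == "A" and a_left == 0:
            if queued < buffer_cap:
                queued += 1            # normal hand-off to core B
                a_mode = None
            elif flexible:
                # Backpressure: buffer is full, so core A does B's work itself.
                a_mode, a_left = "B", cost_b
            # Otherwise core A stays blocked and retries on the next tick.

    return ticks


if __name__ == "__main__":
    rigid = simulate(100, cost_a=1, cost_b=4, buffer_cap=2, flexible=False)
    flex = simulate(100, cost_a=1, cost_b=4, buffer_cap=2, flexible=True)
    print("ticks without flexible filter:", rigid)
    print("ticks with flexible filter:   ", flex)
```

In this model the decision to borrow work is purely local: core A looks only at the state of its own output buffer, which mirrors the abstract's point that load balancing relies on signals between neighboring cores rather than on a global scheduler.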