research-article

Flexible filters in stream programs

Published: 24 December 2013

Abstract

The stream-processing model is a natural fit for multicore systems because it exposes the inherent locality and concurrency of a program and highlights its separable tasks for efficient parallel implementation. We present flexible filters, a load-balancing optimization technique for stream programs. Flexible filters use the programmability of the cores to improve the data-processing throughput of individual bottleneck tasks by “borrowing” resources from neighbors in the stream. Our technique is distributed and scalable because all runtime load-balancing decisions are based on point-to-point handshake signals exchanged between neighboring cores. Load balancing with flexible filters increases the system-level processing throughput of stream applications, particularly those with large dynamic variations in the computational load of their tasks. We empirically evaluate flexible filters in a homogeneous multicore environment over a suite of five real-world stream programs.
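
The borrowing idea described above can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the authors' implementation: it models a two-stage pipeline in which core A runs a cheap filter and core B runs a bottleneck filter, connected by a bounded buffer. The handshake signal is approximated by a simple "buffer full" check, and when core A sees that backpressure in flexible mode, it applies the bottleneck filter to its own item instead of stalling. The function name `simulate`, its parameters, and the tick-based cost model are all hypothetical.

```python
from collections import deque


def simulate(n_items, cost_a, cost_b, capacity, flexible):
    """Ticks needed to push n_items through filter A, then filter B.

    Assumptions (illustrative only): costs are positive integer ticks,
    one bounded buffer links core A to core B, and "buffer full" stands
    in for the backpressure handshake between neighboring cores.
    """
    queue = deque()      # bounded buffer between core A and core B
    produced = 0         # items core A has started
    done = 0             # items that have passed through filter B
    a_task = None        # what core A is running: None, "A", or "B"
    a_left = 0           # ticks left for core A's current task
    b_busy = False       # whether core B is working on an item
    b_left = 0           # ticks left for core B's current item
    ticks = 0

    while done < n_items:
        ticks += 1

        # Core B: take an item from the buffer and run filter B on it.
        if not b_busy and queue:
            queue.popleft()
            b_busy, b_left = True, cost_b
        if b_busy:
            b_left -= 1
            if b_left == 0:
                b_busy = False
                done += 1

        # Core A: start filter A on a new item when idle.
        if a_task is None and produced < n_items:
            a_task, a_left = "A", cost_a
            produced += 1
        if a_task is not None and a_left > 0:
            a_left -= 1

        if a_task == "A" and a_left == 0:
            if len(queue) < capacity:
                queue.append(produced)   # hand the item to core B
                a_task = None
            elif flexible:
                # Backpressure: lend core A to the bottleneck filter.
                a_task, a_left = "B", cost_b
            # Otherwise core A stalls and retries on the next tick.
        elif a_task == "B" and a_left == 0:
            done += 1                    # core A finished filter B itself
            a_task = None

    return ticks


if __name__ == "__main__":
    rigid = simulate(100, cost_a=1, cost_b=4, capacity=2, flexible=False)
    flex = simulate(100, cost_a=1, cost_b=4, capacity=2, flexible=True)
    print("rigid ticks:", rigid, "flexible ticks:", flex)
```

In this toy model the flexible run finishes sooner whenever filter B is the bottleneck, because core A's otherwise stalled ticks are spent on filter B's work. When filter A is the slower stage the buffer never fills, so both modes behave identically.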
