Abstract
Big data is revolutionizing how all sectors of our economy do business, including telecommunications, transportation, medicine, and finance. Big data comes in two flavors: data at rest and data in motion. Processing data in motion is called stream processing. Stream processing for big data analytics often requires a scale that only a distributed system can deliver, by exploiting parallelism across many hosts and many cores. One such distributed stream processing system is IBM Streams. Early customer experience with IBM Streams revealed that extensibility is another core requirement, since customers want to build high-performance domain-specific operators for use in their streaming applications. Based on these two core requirements of distribution and extensibility, we designed and implemented the Streams Processing Language (SPL). This article describes SPL with an emphasis on the language design, distributed runtime, and extensibility mechanism. SPL is now the gateway for the IBM Streams platform, used by our customers for stream processing in a broad range of application domains.
SPL: An Extensible Language for Distributed Stream Processing