ABSTRACT
Storm has long served as the main platform for real-time analytics at Twitter. However, as the scale of data being processed in real-time at Twitter has increased, along with an increase in the diversity and the number of use cases, many limitations of Storm have become apparent. We need a system that scales better, has better debug-ability, has better performance, and is easier to manage -- all while working in a shared cluster infrastructure. We considered various alternatives to meet these needs, and in the end concluded that we needed to build a new real-time stream data processing system. This paper presents the design and implementation of this new system, called Heron. Heron is now the de facto stream data processing engine inside Twitter, and in this paper we also share our experiences from running Heron in production. In this paper, we also provide empirical evidence demonstrating the efficiency and scalability of Heron.
References
- Apache Aurora. http://aurora.incubator.apache.orgGoogle Scholar
- Apache Samza. http://samza.incubator.apache.orgGoogle Scholar
- Tyler Akidau, Alex Balikov, Kaya Bekiroglu, Slava Chernyak, Josh Haberman, Reuven Lax, Sam McVeety, Daniel Mills, Paul Nordstrom, Sam Whittle: MillWheel: Fault-Tolerant Stream Processing at Internet Scale. PVLDB 6(11): 1033--1044 (2013) Google Scholar
Digital Library
- Mohamed H. Ali, Badrish Chandramouli, Jonathan Goldstein, Roman Schindlauer: The extensibility framework in Microsoft StreamInsight. ICDE 2011: 1242--1253 Google Scholar
Digital Library
- Rajagopal Ananthanarayanan, Venkatesh Basker, Sumit Das, Ashish Gupta, Haifeng Jiang, Tianhao Qiu, Alexey Reznichenko, Deomid Ryabkov, Manpreet Singh, Shivakumar Venkataraman: Photon: fault-tolerant and scalable joining of continuous data streams. SIGMOD 2013: 577--588 Google Scholar
Digital Library
- Arvind Arasu, Brian Babcock, Shivnath Babu, Mayur Datar, Keith Ito, Rajeev Motwani, Itaru Nishizawa, Utkarsh Srivastava, Dilys Thomas, Rohit Varma, Jennifer Widom: STREAM: The Stanford Stream Data Manager. IEEE Data Eng. Bull. 26(1): 19--26 (2003)Google Scholar
- Hari Balakrishnan, Magdalena Balazinska, Donald Carney, Ugur Çetintemel, Mitch Cherniack, Christian Convey, Eduardo F. Galvez, Jon Salz, Michael Stonebraker, Nesime Tatbul, Richard Tibbetts, Stanley B. Zdonik: Retrospective on Aurora. VLDB J. 13(4): 370--383 (2004) Google Scholar
Digital Library
- P. Oscar Boykin, Sam Ritchie, Ian O'Connell, Jimmy Lin: Summingbird: A Framework for Integrating Batch and Online MapReduce Computations. PVLDB 7(13): 1441--1451 (2014) Google Scholar
Digital Library
- DataTorrent. https://www.datatorrent.comGoogle Scholar
- Minos N. Garofalakis, Johannes Gehrke: Querying and Mining Data Streams: You Only Get One Look. VLDB 2002Google Scholar
- Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy H. Katz, Scott Shenker, Ion Stoica: Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. NSDI 2011 Google Scholar
Digital Library
- IBM Infosphere Streams. http://www-03.ibm.com/software/products/en/infosphere-streams/Google Scholar
- Kestrel: A simple, distributed message queue system. http://robey.github.com/kestrelGoogle Scholar
- Jay Kreps, Neha Narkhede, and Jun Rao. Kafka: a distributed messaging system for log processing. SIGMOD Workshop on Networking Meets Databases, 2011.Google Scholar
- Simon Loesing, Martin Hentschel, Tim Kraska, Donald Kossmann: Stormy: an elastic and highly available streaming service in the cloud. EDBT/ICDT Workshops 2012: 55--60 Google Scholar
Digital Library
- Nathan Marz: (Storm) Tutorial. https://github.com/nathanmarz/storm/wiki/TutorialGoogle Scholar
- S4 Distributed stream computing platform. http://incubator.apache.org/s4/Google Scholar
- Spark Streaming. https://spark.apache.org/streaming/Google Scholar
- Sankar Subramanian, Srikanth Bellamkonda, Hua-Gang Li, Vince Liang, Lei Sheng, Wayne Smith, James Terry, Tsae-Feng Yu, Andrew Witkowski: Continuous Queries in Oracle. VLDB 2007: 1173--1184 Google Scholar
Digital Library
- Ankit Toshniwal, Siddarth Taneja, Amit Shukla, Karthikeyan Ramasamy, Jignesh M. Patel, Sanjeev Kulkarni, Jason Jackson, Krishna Gade, Maosong Fu, Jake Donham, Nikunj Bhagat, Sailesh Mittal, Dmitriy V. Ryaboy: [email protected] SIGMOD 2014: 147--156 Google Scholar
Digital Library
- Trident: https://github.com/nathanmarz/storm/wikiGoogle Scholar
- Vinod Kumar Vavilapalli, Arun C. Murthy, Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves, Jason Lowe, Hitesh Shah, Siddharth Seth, Bikas Saha, Carlo Curino, Owen O'Malley, Sanjay Radia, Benjamin Reed, Eric Baldeschwieler: Apache Hadoop YARN: yet another resource negotiator. SoCC 2013: 5Google Scholar
- ZeroMQ: http://zeromq.org. Retrieved December 1, 2014.Google Scholar
Index Terms
Twitter Heron: Stream Processing at Scale





Comments