Abstract
Streaming analytics require real-time aggregation and processing of geographically distributed data streams continuously over time. The typical analytics infrastructure for processing such streams follows a hub-and-spoke model, comprising multiple edges connected to a center by a wide-area network (WAN). The aggregation of such streams often requires that the results be available at the center within a certain acceptable delay bound. Further, the WAN bandwidth available between the edges and the center is often scarce or expensive, requiring that the traffic between the edges and the center be minimized. We propose a novel Time-to-Live (TTL)-based mechanism for real-time aggregation that provably optimizes both delay and traffic, providing a theoretical basis for understanding the delay-traffic tradeoff that is fundamental to streaming analytics. Our TTL-based optimization model provides analytical answers to how much aggregation should be performed at the edge versus the center, how much delay can be incurred at the edges, and how the edge-to-center bandwidth must be apportioned across applications with different delay requirements. To evaluate our approach, we implement our TTL-based aggregation mechanism in Apache Flink, a popular stream analytics framework. We deploy our Flink implementation in a hub-and-spoke architecture on geo-distributed Amazon EC2 data centers and a WAN-emulated local testbed, and run aggregation tasks for realistic workloads derived from extensive Akamai and Twitter traces. The delay-traffic tradeoff achieved by our Flink implementation agrees closely with the theoretical predictions of our model. We show that by deriving the optimal TTLs using our model, our system can achieve a "sweet spot" where both delay and traffic are minimized, in comparison to traditional aggregation schemes such as batching and streaming.
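To make the delay-traffic tradeoff concrete, the following is a minimal Python sketch of TTL-based aggregation at a single edge. It is not the Flink implementation described above. The class `TTLEdgeAggregator`, the helper `simulate`, the choice of a sum aggregate, a single TTL shared by all keys, and the rule that only the first arrival of a key starts its timer are all assumptions made for illustration.

```python
class TTLEdgeAggregator:
    """Toy edge node: keeps a running sum per key and flushes a key to the
    center once its TTL has expired (hypothetical model, see lead-in)."""

    def __init__(self, ttl):
        self.ttl = ttl        # how long a key may be held at the edge
        self.sums = {}        # key -> running sum
        self.expiry = {}      # key -> time at which the key must be flushed

    def flush_expired(self, now):
        """Return [(key, sum)] for every key whose timer has run out by `now`."""
        due = [k for k, t in self.expiry.items() if t <= now]
        out = []
        for k in due:
            out.append((k, self.sums.pop(k)))
            del self.expiry[k]
        return out

    def on_event(self, key, value, now):
        """Ingest one record at time `now`; return the aggregates sent to the center."""
        # Flush first so a late record is not merged into an already expired aggregate.
        sent = self.flush_expired(now)
        if key not in self.sums:
            # Assumption: the first arrival starts the timer; later arrivals do not reset it.
            self.sums[key] = 0
            self.expiry[key] = now + self.ttl
        self.sums[key] += value
        # With ttl == 0 the record expires immediately, which mimics pure streaming.
        sent += self.flush_expired(now)
        return sent


def simulate(events, ttl):
    """Count edge-to-center messages for a time-ordered list of (time, key, value)."""
    edge = TTLEdgeAggregator(ttl)
    messages = 0
    for t, key, value in events:
        messages += len(edge.on_event(key, value, t))
    # Drain whatever is still buffered at the end of the trace.
    messages += len(edge.flush_expired(float("inf")))
    return messages


events = [(0.0, "a", 1), (0.5, "a", 1), (1.0, "b", 1), (1.2, "a", 1), (3.0, "a", 1)]
for ttl in (0, 2, 10):
    print(ttl, simulate(events, ttl))
```

On this small hand-made trace, a TTL of 0 sends 5 messages (one per record, as in streaming), a TTL of 2 sends 3, and a TTL of 10 sends 2 (one per key, as in batching), while a record may wait at the edge for up to the TTL. The sketch checks timers lazily when the next record arrives; a real deployment would fire timers independently of arrivals.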
A TTL-based Approach for Data Aggregation in Geo-distributed Streaming Analytics