Abstract
We present the design, implementation, and evaluation of INSTalytics, a co-designed stack of a cluster file system and the compute layer, for efficient big-data analytics in large-scale data centers. INSTalytics amplifies the well-known benefits of data partitioning in analytics systems; instead of traditional partitioning on one dimension, INSTalytics enables data to be simultaneously partitioned on four different dimensions at the same storage cost, enabling a larger fraction of queries to benefit from partition filtering and joins without network shuffle.
To achieve this, INSTalytics uses compute-awareness to customize the three-way replication that the cluster file system employs for availability. A new heterogeneous replication layout enables INSTalytics to preserve the same recovery cost and availability as traditional replication. INSTalytics also uses compute-awareness to expose a new sliced-read API that improves performance of joins by enabling multiple compute nodes to read slices of a data block efficiently via co-ordinated request scheduling and selective caching at the storage nodes.
We have built a prototype implementation of INSTalytics in a production analytics stack and show that its recovery performance and availability are similar to those of physical replication while query performance improves significantly, suggesting a new approach to designing cloud-scale big-data analytics systems.
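To make the partition-filtering idea concrete, the following is a minimal sketch, not the INSTalytics implementation. It assumes each of three replicas is partitioned on a single, different column with a simple modulo function on integer keys; the names `build_replica` and `query_equals`, the column names, and the partition count are all hypothetical. The actual system's layout (four dimensions over three replicas) and its recovery mechanism are not modeled here.

```python
from collections import defaultdict


def bucket(value, num_partitions):
    # Simplifying assumption: integer keys, modulo partitioning.
    return value % num_partitions


def build_replica(rows, column, num_partitions):
    """Build one logical replica of `rows`, partitioned on `column`."""
    parts = defaultdict(list)
    for row in rows:
        parts[bucket(row[column], num_partitions)].append(row)
    return {"column": column, "n": num_partitions, "parts": dict(parts)}


def query_equals(replicas, column, value):
    """Answer `column == value`; return (matching rows, rows read).

    If some replica is partitioned on `column`, only one of its
    partitions is read. Otherwise every partition of a replica is scanned.
    """
    for rep in replicas:
        if rep["column"] == column:
            candidates = rep["parts"].get(bucket(value, rep["n"]), [])
            break
    else:
        # No replica is partitioned on this column: full scan.
        candidates = [r for p in replicas[0]["parts"].values() for r in p]
    matches = [r for r in candidates if r[column] == value]
    return matches, len(candidates)


# Hypothetical table with four integer columns.
rows = [
    {"user": u, "region": u % 5, "day": u % 7, "score": u % 11}
    for u in range(100)
]

# Three replicas hold the same rows, each partitioned on a different column.
replicas = [build_replica(rows, col, 8) for col in ("user", "region", "day")]

filtered, read_filtered = query_equals(replicas, "region", 3)
scanned, read_scanned = query_equals(replicas, "score", 2)
print(len(filtered), read_filtered)  # filter on a partitioned column
print(len(scanned), read_scanned)    # filter on an unpartitioned column
```

In this toy setup, a filter on `region` touches one partition of the replica partitioned on `region`, while a filter on `score` falls back to a full scan because no replica is partitioned on that column. With a single partitioning dimension, only one of these columns would benefit.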
INSTalytics: Cluster Filesystem Co-design for Big-data Analytics