INSTalytics: Cluster Filesystem Co-design for Big-data Analytics

Published: 16 January 2020
Abstract

We present the design, implementation, and evaluation of INSTalytics, a co-designed stack of a cluster file system and the compute layer for efficient big-data analytics in large-scale data centers. INSTalytics amplifies the well-known benefits of data partitioning in analytics systems: instead of traditional partitioning on a single dimension, INSTalytics allows data to be simultaneously partitioned on four different dimensions at the same storage cost, enabling a larger fraction of queries to benefit from partition filtering and from joins without a network shuffle.
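The layout idea can be illustrated with a toy sketch. Everything below is our own illustration, not INSTalytics' actual code or API: three logically identical replicas are each partitioned on a different dimension, and the records inside each block are bucketed on a fourth dimension, so an equality predicate on any of the four columns reads only a fraction of one replica, at the same 3x storage cost as plain three-way replication.

```python
from collections import defaultdict

N_BUCKETS = 4  # fan-out of the intra-block (fourth-dimension) bucketing

def build_layout(records, part_dims, bucket_dim):
    """Build logically identical replicas of `records`, one per dimension
    in `part_dims`, each partitioned on its own dimension, with records
    inside every partition bucketed on `bucket_dim`.
    Returns {dim: {partition_value: {bucket_id: [records]}}}."""
    layout = {}
    for dim in part_dims:
        replica = defaultdict(lambda: defaultdict(list))
        for rec in records:
            bucket = hash(rec[bucket_dim]) % N_BUCKETS
            replica[rec[dim]][bucket].append(rec)
        layout[dim] = replica
    return layout

def scan(layout, part_dims, bucket_dim, dim, value):
    """Serve an equality predicate on any of the four dimensions while
    reading only the relevant partitions or buckets."""
    if dim in part_dims:
        # Pick the replica partitioned on `dim`: one partition suffices.
        buckets = layout[dim].get(value, {})
        return [r for recs in buckets.values() for r in recs]
    assert dim == bucket_dim
    # Fourth dimension: read a single bucket per partition of one
    # replica; a residual filter handles hash-bucket collisions.
    bucket = hash(value) % N_BUCKETS
    replica = layout[part_dims[0]]
    return [r for bkts in replica.values()
            for r in bkts.get(bucket, [])
            if r[dim] == value]
```

A predicate on any of the three partitioning dimensions touches exactly one partition of one replica; a predicate on the bucketed dimension touches one bucket per block of a single replica.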

To achieve this, INSTalytics uses compute-awareness to customize the three-way replication that the cluster file system employs for availability. A new heterogeneous replication layout enables INSTalytics to preserve the same recovery cost and availability as traditional replication. INSTalytics also uses compute-awareness to expose a new sliced-read API that improves join performance by enabling multiple compute nodes to efficiently read slices of a data block via coordinated request scheduling and selective caching at the storage nodes.
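The sliced-read idea can likewise be sketched in a few lines. The class and method names here are our own invention for illustration, not the paper's interface: when several compute nodes (e.g., join workers) each want a different slice of the same block, the storage node reads the block from disk once, caches it, and serves every slice from the cached copy instead of issuing one random read per requester.

```python
class StorageNode:
    """Toy storage node serving slice requests against its blocks."""

    def __init__(self, blocks):
        self.blocks = blocks   # block_id -> list of records ("on disk")
        self.cache = {}        # selectively cached blocks
        self.disk_reads = 0    # how many times we touched "disk"

    def _load(self, block_id):
        # Coordinated scheduling collapses concurrent slice requests
        # into one sequential read; later slices hit the cache.
        if block_id not in self.cache:
            self.disk_reads += 1
            self.cache[block_id] = self.blocks[block_id]
        return self.cache[block_id]

    def sliced_read(self, block_id, n_slices, slice_id, key):
        """Return only the records of `block_id` whose join key hashes
        to `slice_id`; each of the n_slices readers gets its share."""
        data = self._load(block_id)
        return [r for r in data if hash(r[key]) % n_slices == slice_id]
```

With four readers each requesting their slice of one block, only a single disk read is issued, and the four slices together cover the whole block.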

We have built a prototype implementation of INSTalytics in a production analytics stack, and we show that its recovery performance and availability are similar to those of physical replication while it provides significant improvements in query performance, suggesting a new approach to designing cloud-scale big-data analytics systems.



Published in

ACM Transactions on Storage, Volume 15, Issue 4: USENIX FAST 2019 Special Section and Regular Papers. November 2019. 228 pages.
ISSN: 1553-3077. EISSN: 1553-3093. DOI: 10.1145/3373756
Editor: Sam H. Noh

Copyright © 2020 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 1 July 2019
• Revised: 1 October 2019
• Accepted: 1 October 2019
• Published: 16 January 2020

