skip to main content
research-article

Cybertron: pushing the limit on I/O reduction in data-parallel programs

Authors Info & Claims
Published:15 October 2014Publication History
Skip Abstract Section

Abstract

I/O reduction has been a major focus in optimizing data-parallel programs for big-data processing. While the current state-of-the-art techniques use static program analysis to reduce I/O, Cybertron proposes a new direction that incorporates runtime mechanisms to push the limit further on I/O reduction. In particular, Cybertron tracks how data is used in the computation accurately at runtime to filter unused data at finer granularity dynamically, beyond what current static-analysis based mechanisms are capable of, and to facilitate a new mechanism called constraint based encoding for more efficient encoding. Cybertron has been implemented and applied to production data-parallel programs; our extensive evaluations on real programs and real data have shown its effectiveness on I/O reduction over the existing mechanisms at reasonable CPU cost, and its improvement on end-to-end performance in various network environments.

References

  1. D. Abadi, S. Madden, and M. Ferreira. Integrating compression and execution in column-oriented database systems. In SIGMOD, pages 671--682, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Apache. Hadoop. http://lucene.apache.org/hadoop/.Google ScholarGoogle Scholar
  3. D. Brumley, J. Newsome, D. X. Song, H. Wang, and S. Jha. Towards automatic generation of vulnerability-based signatures. In IEEE Symposium on Security and Privacy, pages 2--16, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. C. Cadar, D. Dunbar, and D. R. Engler. KLEE: Unassisted and automatic generation of high-coverage tests for complex systems programs. In OSDI, pages 209--224, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. M. Castro, M. Costa, and J.-P. Martin. Better bug reporting with better privacy. In ASPLOS, pages 319--328, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. C. Chambers, A. Raniwala, F. Perry, S. Adams, R. R. Henry, R. Bradshaw, and N. Weizenbaum. FlumeJava: easy, efficient data-parallel pipelines. In PLDI, pages 363--375, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. ACM Trans. Comput. Syst., pages 4:1--4:26, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. J. Cleary and I. Witten. Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications, 32(4):396--402, 1984.Google ScholarGoogle ScholarCross RefCross Ref
  9. M. Costa, M. Castro, L. Zhou, L. Zhang, and M. Peinado. Bouncer: securing software by blocking bad input. In SOSP, pages 117--130, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. L. de Moura and N. Bjørner. Z3: An efficient SMT solver. In TACAS, pages 337--340, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, pages 137--150, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. P. Deutsch. DEFLATE compressed data format specification version 1.3. http://www.ietf.org/rfc/rfc1951.txt. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. C. Gkantsidis, D. Vytiniotis, O. Hodson, D. Narayanan, F. Dinu, and A. Rowstron. Rhea: automatic filtering for unstructured cloud storage. In NSDI, pages 343--356, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Z. Guo, X. Fan, R. Chen, J. Zhang, H. Zhou, S. McDirmid, C. Liu, W. Lin, J. Zhou, and L. Zhou. Spotting code optimizations in data-parallel pipelines through PeriSCOPE. In OSDI, pages 121--133, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. ILSpy. The open-source .NET assembly browser and decompiler. http://ilspy.net/.Google ScholarGoogle Scholar
  16. M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: distributed data-parallel programs from sequential building blocks. In EuroSys, pages 59--72, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. E. Jahani, M. J. Cafarella, and C. Ré. Automatic optimization for MapReduce programs. PVLDB, 4(6):385--396, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. J. C. King. Symbolic execution and program testing. Commun. ACM, 19(7):385--394, 1976. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis. Dremel: Interactive analysis of web-scale datasets. In VLDB, pages 330--339, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Microsoft. PEX. http://research.microsoft.com/en-us/projects/pex/.Google ScholarGoogle Scholar
  21. C. Olston, B. Reed, A. Silberstein, and U. Srivastava. Automatic optimization of parallel dataflow programs. In USENIX ATC, pages 267--273, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A not-so-foreign language for data processing. In SIGMOD, pages 1099--1110, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. I. Pavlov. 7-Zip. http://www.7-zip.org/, 2013.Google ScholarGoogle Scholar
  24. M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. O'Neil, P. O'Neil, A. Rasin, N. Tran, and S. Zdonik. C-Store: a column-oriented DBMS. In VLDB, pages 553--564, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive: A warehousing solution over a Map-Reduce framework. PVLDB, 2(2):1626--1629, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. J. Tucek, S. Lu, C. Huang, S. Xanthos, and Y. Zhou. Triage: diagnosing production run failures at the user's site. In SOSP, pages 131--144, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Y. Yu, M. Isard, D. Fetterly, M. Budiu, Ú. Erlingsson, P. K. Gunda, and J. Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In OSDI, pages 1--14, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Y. Yu, P. K. Gunda, and M. Isard. Distributed aggregation for data-parallel computing: interfaces and implementations. In SOSP, pages 247--260, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. D. Yuan, H. Mai, W. Xiong, L. Tan, Y. Zhou, and S. Pasupathy. SherLog: error diagnosis by connecting clues from run-time logs. In ASPLOS, pages 143--154, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. J. Zhang, H. Zhou, R. Chen, X. Fan, Z. Guo, H. Lin, J. Y. Li, W. Lin, J. Zhou, and L. Zhou. Optimizing data shuffling in data-parallel computation by understanding user-defined functions. In NSDI, pages 295--308, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. J. Zhou, P.-Å. Larson, and R. Chaiken. Incorporating partitioning and parallel plans into the SCOPE optimizer. In ICDE, pages 1060--1071, 2010.Google ScholarGoogle ScholarCross RefCross Ref
  32. J. Zhou, N. Bruno, M. chuan Wu, P.-Å. Larson, R. Chaiken, and D. Shakib. SCOPE: parallel databases meet MapReduce. In The VLDB Journal, volume 21, pages 611--636, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Cybertron: pushing the limit on I/O reduction in data-parallel programs

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in

      Full Access

      • Published in

        cover image ACM SIGPLAN Notices
        ACM SIGPLAN Notices  Volume 49, Issue 10
        OOPSLA '14
        October 2014
        907 pages
        ISSN:0362-1340
        EISSN:1558-1160
        DOI:10.1145/2714064
        • Editor:
        • Andy Gill
        Issue’s Table of Contents
        • cover image ACM Conferences
          OOPSLA '14: Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages & Applications
          October 2014
          946 pages
          ISBN:9781450325851
          DOI:10.1145/2660193

        Copyright © 2014 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 15 October 2014

        Check for updates

        Qualifiers

        • research-article

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader
      About Cookies On This Site

      We use cookies to ensure that we give you the best experience on our website.

      Learn more

      Got it!