Abstract
I/O reduction has been a major focus in optimizing data-parallel programs for big-data processing. While current state-of-the-art techniques use static program analysis to reduce I/O, Cybertron proposes a new direction that incorporates runtime mechanisms to push the limit on I/O reduction further. In particular, Cybertron accurately tracks how data is used in the computation at runtime, both to filter unused data dynamically at a finer granularity than current static-analysis-based mechanisms can achieve and to facilitate a new mechanism, called constraint-based encoding, for more efficient encoding. Cybertron has been implemented and applied to production data-parallel programs; our extensive evaluation on real programs and real data shows its effectiveness in reducing I/O over existing mechanisms at reasonable CPU cost, as well as its improvement in end-to-end performance in various network environments.
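To make the idea of runtime usage tracking concrete, here is a minimal sketch. It is not Cybertron's implementation, which the abstract does not detail. The `TrackedRecord` wrapper, the `udf` function, and the field names `lang`, `url`, and `body` are all hypothetical. The sketch records which fields a computation reads for each individual record and keeps only those. A static analysis would have to assume that `url` may be read for every record, whereas the per-record trace shows it is needed only when `lang` matches.

```python
class TrackedRecord:
    """Wraps a dict and remembers which fields a computation actually reads."""

    def __init__(self, fields):
        self._fields = fields
        self.used = set()

    def __getitem__(self, name):
        # Record the access, then return the underlying value.
        self.used.add(name)
        return self._fields[name]


def udf(rec):
    # Hypothetical user-defined function: 'url' is read only when 'lang'
    # matches, and 'body' is never read at all.
    if rec["lang"] == "en":
        return len(rec["url"])
    return None


def filter_unused(records, fn):
    """Run fn on each record and keep only the fields it read for that record."""
    reduced = []
    for fields in records:
        tracked = TrackedRecord(fields)
        fn(tracked)
        reduced.append({k: v for k, v in fields.items() if k in tracked.used})
    return reduced


records = [
    {"lang": "en", "url": "http://a.example", "body": "x" * 100},
    {"lang": "fr", "url": "http://b.example", "body": "y" * 100},
]
reduced = filter_unused(records, udf)
```

In this toy run, the first record keeps `lang` and `url`, while the second keeps only `lang`, because the branch that reads `url` is never taken for it. The sketch covers only the filtering half of the abstract; constraint-based encoding is not illustrated here.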
Cybertron: pushing the limit on I/O reduction in data-parallel programs
OOPSLA '14: Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages & Applications