Abstract
We propose and evaluate a framework for creating and running approximation-enabled MapReduce programs. Specifically, we propose approximation mechanisms that fit naturally into the MapReduce paradigm, including input data sampling, task dropping, and user-defined approximation, in which the framework accepts and runs both a precise version and a user-defined approximate version of the MapReduce code. We then show how to leverage statistical theories to compute error bounds for popular classes of MapReduce programs when approximating with input data sampling and/or task dropping. We implement the proposed mechanisms and error bound estimations in a prototype system called ApproxHadoop. Our evaluation uses MapReduce applications from different domains, including data analytics, scientific computing, video encoding, and machine learning. Our results show that ApproxHadoop can significantly reduce application execution time and/or energy consumption when the user is willing to tolerate small errors. For example, ApproxHadoop can reduce runtimes by up to 32x when the user can tolerate an error of 1% with 95% confidence. We conclude that our framework and system can make approximation easily accessible to many application domains using the MapReduce model.
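To make the idea of an error bound under input data sampling concrete, the following is a minimal sketch. It assumes simple random sampling and a normal-approximation confidence interval. The abstract does not specify which estimator ApproxHadoop uses, so this should not be read as the system's actual method. The function name `estimate_sum` and its parameters are hypothetical and chosen only for illustration.

```python
import math
import random


def estimate_sum(values, sampling_ratio, z=1.96, seed=0):
    """Estimate sum(values) from a simple random sample.

    Returns (estimate, margin), where margin is the half-width of an
    approximate 95% confidence interval (z = 1.96, normal approximation).
    Assumes len(values) >= 2. Illustrative only, not ApproxHadoop's estimator.
    """
    rng = random.Random(seed)
    population_size = len(values)

    # Draw the sample without replacement; keep at least two items
    # so that the sample variance is defined.
    sample_size = min(population_size,
                      max(2, round(population_size * sampling_ratio)))
    sample = rng.sample(values, sample_size)

    mean = sum(sample) / sample_size
    variance = sum((x - mean) ** 2 for x in sample) / (sample_size - 1)

    # Scale the sample mean up to the whole population.
    estimate = population_size * mean

    # The finite population correction shrinks the bound as the sample
    # approaches the full input, reaching zero when everything is read.
    fpc = 1 - sample_size / population_size
    margin = z * population_size * math.sqrt(fpc * variance / sample_size)
    return estimate, margin


if __name__ == "__main__":
    data = list(range(1000))
    estimate, margin = estimate_sum(data, sampling_ratio=0.1)
    print(f"estimate = {estimate:.0f} +/- {margin:.0f} (exact = {sum(data)})")
```

In this sketch a smaller sampling ratio reads less input but widens the margin, which is the trade-off between runtime and tolerated error that the abstract describes.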
ApproxHadoop: Bringing Approximations to MapReduce Frameworks
ASPLOS '15: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems