ABSTRACT
Experience from an operational Map-Reduce cluster reveals that outliers significantly prolong job completion. The causes for outliers include run-time contention for processor, memory and other resources, disk failures, varying bandwidth and congestion along network paths, and imbalance in task workload. We present Mantri, a system that monitors tasks and culls outliers using cause- and resource-aware techniques. Mantri's strategies include restarting outliers, network-aware placement of tasks and protecting outputs of valuable tasks. Using real-time progress reports, Mantri detects and acts on outliers early in their lifetime. Early action frees up resources that can be used by subsequent tasks and expedites the job overall. Acting based on the causes and the resource and opportunity cost of actions lets Mantri improve over prior work that only duplicates the laggards. Deployment in Bing's production clusters and trace-driven simulations show that Mantri improves job completion times by 32%.
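As a rough illustration of detecting outliers from progress reports and weighing the cost of a restart, the following minimal sketch may help. It is not Mantri's actual algorithm: the function names, the linear extrapolation of remaining time, the `slack` threshold, and the restart rule are all assumptions made here for illustration.

```python
from statistics import median


def estimate_remaining(elapsed_s, progress):
    """Linearly extrapolate remaining seconds from the fraction of work done.

    Assumption: progress is a fraction in (0, 1] and work proceeds at a
    steady rate; a task with no progress yet gets an infinite estimate.
    """
    if progress <= 0:
        return float("inf")
    return elapsed_s * (1.0 - progress) / progress


def find_outliers(reports, slack=1.5):
    """Return ids of tasks whose projected total runtime exceeds
    slack * median projected runtime (hypothetical threshold rule).

    reports maps task_id -> (elapsed_seconds, progress_fraction).
    """
    totals = {
        tid: elapsed + estimate_remaining(elapsed, progress)
        for tid, (elapsed, progress) in reports.items()
    }
    finite = [t for t in totals.values() if t != float("inf")]
    if not finite:
        return []
    cutoff = slack * median(finite)
    return sorted(tid for tid, total in totals.items() if total > cutoff)


def should_restart(remaining_s, fresh_estimate_s):
    """Hypothetical cost check: restart only when a fresh copy is expected
    to finish sooner than the running task would."""
    return remaining_s > fresh_estimate_s


if __name__ == "__main__":
    reports = {"a": (10, 0.5), "b": (10, 0.5), "c": (10, 0.1)}
    for tid in find_outliers(reports):
        elapsed, progress = reports[tid]
        remaining = estimate_remaining(elapsed, progress)
        print(tid, should_restart(remaining, fresh_estimate_s=20))
```

The separate `should_restart` check reflects the abstract's point that acting on an outlier has a resource cost, so flagging a slow task and deciding to restart it are kept as two distinct steps.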
Reining in the outliers in map-reduce clusters using Mantri