ABSTRACT
In this paper we study how to efficiently perform set-similarity joins in parallel using the popular MapReduce framework. We propose a 3-stage approach for end-to-end set-similarity joins. We take as input a set of records and output a set of joined records based on a set-similarity condition. We efficiently partition the data across nodes in order to balance the workload and minimize the need for replication. We study both self-join and R-S join cases, and show how to carefully control the amount of data kept in main memory on each node. We also propose solutions for the case where, even if we use the most fine-grained partitioning, the data still does not fit in the main memory of a node. We report results from extensive experiments on real datasets, synthetically increased in size, to evaluate the speedup and scaleup properties of the proposed algorithms using Hadoop.
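To make the problem statement concrete, the following is a minimal single-process sketch of a set-similarity self-join, written in a map/reduce style. It is an illustration only, not the three-stage algorithm proposed in the paper. The Jaccard similarity function, the 0.5 threshold, the per-token grouping, and the names `jaccard` and `self_join` are all assumptions introduced here. The sketch assumes a threshold greater than zero and non-empty records.

```python
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two non-empty sets: |a & b| / |a | b|."""
    return len(a & b) / len(a | b)


def self_join(records, threshold):
    """Return sorted (id1, id2, similarity) triples with similarity >= threshold.

    records: dict mapping a record id to a non-empty set of tokens.
    threshold: assumed to be > 0, so every qualifying pair shares a token.
    """
    # "Map" step (illustrative): route each record id to one group per token.
    # A record with k tokens is replicated k times; this naive scheme shows
    # the replication cost that a careful partitioning tries to reduce.
    groups = defaultdict(list)
    for rid, tokens in records.items():
        for token in tokens:
            groups[token].append(rid)

    # "Reduce" step (illustrative): records sharing a token become candidates.
    candidates = set()
    for rids in groups.values():
        for r1, r2 in combinations(sorted(rids), 2):
            candidates.add((r1, r2))

    # Verification: compute the exact similarity of each candidate pair once.
    results = []
    for r1, r2 in sorted(candidates):
        sim = jaccard(records[r1], records[r2])
        if sim >= threshold:
            results.append((r1, r2, sim))
    return results


# Hypothetical toy input: record ids mapped to token sets.
records = {
    1: {"parallel", "set", "similarity", "join"},
    2: {"parallel", "set", "similarity", "joins"},
    3: {"spatial", "join", "optimizer"},
}
print(self_join(records, threshold=0.5))  # [(1, 2, 0.6)]
```

Records 1 and 2 share three of five distinct tokens, so their similarity is 0.6 and the pair is reported. Records 1 and 3 share only one of six tokens and are filtered out during verification. Records 2 and 3 share no token and are never compared at all.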
Index Terms
Efficient parallel set-similarity joins using MapReduce