DOI: 10.1145/1807167.1807222
research-article
Results Reproduced / v1.1

Efficient parallel set-similarity joins using MapReduce

Published: 06 June 2010

ABSTRACT

In this paper we study how to efficiently perform set-similarity joins in parallel using the popular MapReduce framework. We propose a 3-stage approach for end-to-end set-similarity joins. We take as input a set of records and output a set of joined records based on a set-similarity condition. We efficiently partition the data across nodes in order to balance the workload and minimize the need for replication. We study both self-join and R-S join cases, and show how to carefully control the amount of data kept in main memory on each node. We also propose solutions for the case where, even if we use the most fine-grained partitioning, the data still does not fit in the main memory of a node. We report results from extensive experiments on real datasets, synthetically increased in size, to evaluate the speedup and scaleup properties of the proposed algorithms using Hadoop.
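The abstract does not spell out the individual stages, so the following is only a minimal single-process sketch of the general idea: partition records by token in a map step, then verify candidate pairs in a reduce step. It assumes Jaccard similarity, a self-join, and the standard prefix-filtering bound; the function names (`map_phase`, `reduce_phase`, `self_join`) are hypothetical and are not taken from the paper or from Hadoop.

```python
from collections import defaultdict
from itertools import combinations
from math import ceil


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def prefix_length(size, threshold):
    # Standard prefix-filter bound for Jaccard: size - ceil(t * size) + 1.
    # The small epsilon guards against float round-up; a longer prefix
    # only adds candidates, it never drops a true match.
    return size - ceil(threshold * size - 1e-9) + 1


def map_phase(records, freq, threshold):
    """Emit (token, rid) for every token in a record's prefix."""
    for rid, tokens in records.items():
        # Global token order: rarest first, ties broken by the token itself.
        ordered = sorted(tokens, key=lambda t: (freq[t], t))
        for tok in ordered[:prefix_length(len(ordered), threshold)]:
            yield tok, rid


def reduce_phase(groups, records, threshold):
    """Verify every pair of records that share a prefix token."""
    joined = {}
    for rids in groups.values():
        for r1, r2 in combinations(sorted(rids), 2):
            if (r1, r2) in joined:
                continue  # pair already accepted via another token
            sim = jaccard(records[r1], records[r2])
            if sim >= threshold:
                joined[(r1, r2)] = sim
    return joined


def self_join(records, threshold):
    """records: dict mapping record id -> set of tokens."""
    # Count token frequencies to define the global token order.
    freq = defaultdict(int)
    for tokens in records.values():
        for tok in tokens:
            freq[tok] += 1
    # Simulated shuffle: group record ids by emitted token.
    groups = defaultdict(list)
    for tok, rid in map_phase(records, freq, threshold):
        groups[tok].append(rid)
    return reduce_phase(groups, records, threshold)
```

For example, `self_join({1: {"a", "b", "c", "d"}, 2: {"a", "b", "c", "e"}, 3: {"x", "y", "z"}}, 0.5)` returns the single pair `(1, 2)` with similarity 0.6. In a real MapReduce job each token group would be handled by a separate reducer, which is where the workload balancing and replication concerns mentioned in the abstract arise.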

Published in

SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
June 2010
1286 pages
ISBN: 9781450300322
DOI: 10.1145/1807167

Copyright © 2010 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 785 of 4,003 submissions, 20%
