research-article

Dedoop: efficient deduplication with Hadoop

Online: 01 August 2012

Abstract

We demonstrate a powerful and easy-to-use tool called Dedoop (<u>De</u>duplication with Ha<u>doop</u>) for MapReduce-based entity resolution (ER) of large datasets. Dedoop supports browser-based specification of complex ER workflows, including blocking and matching steps as well as the optional use of machine learning to generate match classifiers automatically. Specified workflows are automatically translated into MapReduce jobs for parallel execution on different Hadoop clusters. To achieve high performance, Dedoop supports several advanced load balancing strategies.
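
To make the blocking and matching steps concrete, the following is a minimal single-process sketch of how such a workflow could map onto a map phase and a reduce phase. It is not Dedoop's implementation: the sample records, the three-letter prefix blocking key, the `difflib`-based similarity, and the 0.7 threshold are all assumptions chosen for illustration, and the `shuffle` function merely stands in for the grouping a MapReduce framework would perform.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy records as (id, title); not taken from the paper.
RECORDS = [
    (1, "Canon EOS 5D Mark II"),
    (2, "canon eos 5d mark ii camera"),
    (3, "Nikon D700"),
    (4, "Nikon D-700"),
    (5, "Sony Alpha 900"),
]


def blocking_key(title):
    """Assumed blocking function: lowercase three-letter prefix of the title."""
    return title.lower()[:3]


def map_phase(records):
    """Map step: emit (blocking key, record) pairs."""
    for record in records:
        yield blocking_key(record[1]), record


def shuffle(pairs):
    """Group map output by key, standing in for the framework's shuffle."""
    blocks = defaultdict(list)
    for key, record in pairs:
        blocks[key].append(record)
    return blocks


def similarity(a, b):
    """Assumed matcher: case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def reduce_phase(blocks, threshold=0.7):
    """Reduce step: compare every pair inside a block, keep likely matches."""
    matches = []
    for key in sorted(blocks):
        for (id_a, title_a), (id_b, title_b) in combinations(blocks[key], 2):
            if similarity(title_a, title_b) >= threshold:
                matches.append((id_a, id_b))
    return matches


def resolve(records, threshold=0.7):
    """Run the map, shuffle, and reduce steps in sequence."""
    return reduce_phase(shuffle(map_phase(records)), threshold)


if __name__ == "__main__":
    print(resolve(RECORDS))
```

In this layout each block is compared by a single reduce call, so one very large block would dominate the running time. That kind of skew is presumably what load balancing strategies for MapReduce-based ER are meant to address.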

Published in: Proceedings of the VLDB Endowment, Volume 5, Issue 12, August 2012, 340 pages

Publisher: VLDB Endowment