DOI: 10.1145/2245276.2245363
research-article

Entity matching for semistructured data in the Cloud

Online: 26 March 2012

ABSTRACT

The amount of available information, on the Web or inside companies, is expanding rapidly. With Cloud infrastructure maturing (including tools for parallel data processing, text analytics, clustering, etc.), there is growing interest in integrating data to produce higher-value content. The new challenges notably include entity matching over large volumes of heterogeneous data.

In this paper, we describe an approach for entity matching over large amounts of semistructured data in the Cloud. The approach combines ChuQL [4], a recently proposed extension of XQuery with MapReduce, and a blocking technique for entity matching that can be executed efficiently on top of MapReduce. We illustrate the proposed approach by applying it to automatically extract and enrich references in Wikipedia, and we report on an experimental evaluation of the approach.
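The general idea of blocking on top of MapReduce can be sketched as follows. This is a minimal illustration in plain Python, not the ChuQL formulation used in the paper: the map step assigns each record a blocking key, the framework groups records by key, and the reduce step compares only records that share a block. The blocking key (first three characters of the title), the token-based Jaccard similarity, the 0.5 threshold, and all function and field names are assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical key: first three characters of the normalized title.
    return record["title"].lower().strip()[:3]


def map_phase(records):
    # Map: emit (blocking key, record) pairs.
    for record in records:
        yield blocking_key(record), record


def shuffle(pairs):
    # Group records by key, standing in for the framework's shuffle step.
    blocks = defaultdict(list)
    for key, record in pairs:
        blocks[key].append(record)
    return blocks


def jaccard(a, b):
    # Jaccard similarity over lowercased whitespace-separated tokens.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def reduce_phase(block, threshold=0.5):
    # Reduce: pairwise comparison restricted to one block.
    for r1, r2 in combinations(block, 2):
        if jaccard(r1["title"], r2["title"]) >= threshold:
            yield r1["id"], r2["id"]


def match_entities(records, threshold=0.5):
    # Run map, shuffle, and reduce in sequence on a single machine.
    matches = []
    for block in shuffle(map_phase(records)).values():
        matches.extend(reduce_phase(block, threshold))
    return matches


if __name__ == "__main__":
    sample = [
        {"id": 1, "title": "Real-world data is dirty"},
        {"id": 2, "title": "Real-World Data is Dirty"},
        {"id": 3, "title": "Autonomous citation matching"},
    ]
    print(match_entities(sample))  # [(1, 2)]
```

Blocking trades recall for cost: records that land in different blocks are never compared, so the choice of key determines both the number of comparisons and the matches that can be missed.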

References

  1. R. Baxter, P. Christen, and T. Churches. A comparison of fast blocking methods for record linkage. In SIGKDD '03 Workshop on Data Cleaning, Record Linkage, and Object Consolidation, pages 25--27, 2003.
  2. P. Christen and T. Churches. Febrl: Freely extensible biomedical record linkage. Manual, release 0.2, 2003.
  3. M. A. Hernández and S. J. Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. Data Min. Knowl. Discov., 2: 9--37, 1998.
  4. S. Khatchadourian, M. Consens, and J. Simeon. Having a ChuQL at XML on the Cloud. In AMW, 2011.
  5. S. Khatchadourian, M. Consens, and J. Simeon. ChuQL: Processing XML with XQuery using Hadoop. In Cascon, 2011.
  6. T. Kirsten, L. Kolb, M. Hartung, A. Gross, H. Köpcke, and E. Rahm. Data Partitioning for Parallel Entity Matching. CoRR, 2010.
  7. L. Kolb, H. Köpcke, A. Thor, and E. Rahm. Learning-based Entity Resolution with MapReduce. In CloudDb 2011, 2011.
  8. L. Kolb, A. Thor, and E. Rahm. Parallel Sorted Neighborhood Blocking with MapReduce. In BTW, volume 180, pages 45--64, 2011.
  9. S. Lawrence, C. L. Giles, and K. D. Bollacker. Autonomous citation matching. In AGENTS '99, pages 392--393, 1999.
  10. A. McCallum, K. Nigam, and L. H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In KDD '00, pages 169--178, 2000.
  11. H. Pasula, B. Marthi, B. Milch, S. Russell, and I. Shpitser. Identity uncertainty and citation matching. In NIPS, pages 1425--1432, 2003.
  12. R. Vernica, M. J. Carey, and C. Li. Efficient parallel set-similarity joins using MapReduce. In SIGMOD Conference, pages 495--506, 2010.

  • Published in

    SAC '12: Proceedings of the 27th Annual ACM Symposium on Applied Computing
    March 2012
    2179 pages
    ISBN: 9781450308571
    DOI: 10.1145/2245276
    Conference Chairs: Sascha Ossowski, Paola Lecca

    Copyright © 2012 ACM

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Acceptance Rates

    Overall Acceptance Rate 1,328 of 4,517 submissions, 29%
