ABSTRACT
The amount of available information, on the Web or inside companies, is expanding rapidly. As Cloud infrastructure matures (including tools for parallel data processing, text analytics, clustering, etc.), there is growing interest in integrating data to produce higher-value content. This raises new challenges, notably entity matching over large volumes of heterogeneous data.
In this paper, we describe an approach for entity matching over large amounts of semistructured data in the Cloud. The approach combines ChuQL [4], a recently proposed extension of XQuery with MapReduce, and a blocking technique for entity matching that can be efficiently executed on top of MapReduce. We illustrate the proposed approach by applying it to automatically extract and enrich references in Wikipedia, and we report on an experimental evaluation of the approach.
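As a rough illustration of how blocking fits the MapReduce model, the following minimal sketch simulates the map, shuffle, and reduce steps in plain Python. It is not the method of the paper: the blocking key (the first three letters of a lowercased title), the record fields `id` and `title`, and all function names are assumptions made for this example, and the actual approach is expressed in ChuQL rather than Python.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical blocking key: first three letters of the lowercased title.
    return record["title"].lower()[:3]


def map_phase(records):
    # Map: emit (blocking key, record) so similar records share a key.
    for record in records:
        yield blocking_key(record), record


def shuffle(pairs):
    # Shuffle: group emitted values by key, as the MapReduce runtime would.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    # Reduce: compare records only within a block, never across blocks.
    for block in groups.values():
        for a, b in combinations(block, 2):
            yield a["id"], b["id"]


def candidate_pairs(records):
    return sorted(reduce_phase(shuffle(map_phase(records))))


records = [
    {"id": 1, "title": "Autonomous citation matching"},
    {"id": 2, "title": "Autonomous Citation Matching."},
    {"id": 3, "title": "Efficient parallel set-similarity joins"},
]
print(candidate_pairs(records))  # [(1, 2)]
```

Only the two records that share a blocking key are paired, so the expensive similarity comparison would run on one candidate pair instead of all three possible pairs.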
References
- R. Baxter, P. Christen, and T. Churches. A comparison of fast blocking methods for record linkage. In SIGKDD '03 Workshop on Data Cleaning, Record Linkage, and Object Consolidation, pages 25--27, 2003.
- P. Christen and T. Churches. Febrl: Freely extensible biomedical record linkage. Manual, release 0.2, 2003.
- M. A. Hernández and S. J. Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. Data Min. Knowl. Discov., 2:9--37, 1998.
- S. Khatchadourian, M. Consens, and J. Simeon. Having a ChuQL at XML on the Cloud. In AMW, 2011.
- S. Khatchadourian, M. Consens, and J. Simeon. ChuQL: Processing XML with XQuery using Hadoop. In Cascon, 2011.
- T. Kirsten, L. Kolb, M. Hartung, A. Gross, H. Köpcke, and E. Rahm. Data Partitioning for Parallel Entity Matching. CoRR, 2010.
- L. Kolb, H. Köpcke, A. Thor, and E. Rahm. Learning-based Entity Resolution with MapReduce. In CloudDb 2011, 2011.
- L. Kolb, A. Thor, and E. Rahm. Parallel Sorted Neighborhood Blocking with MapReduce. In BTW, volume 180, pages 45--64, 2011.
- S. Lawrence, C. L. Giles, and K. D. Bollacker. Autonomous citation matching. In AGENTS '99, pages 392--393, 1999.
- A. McCallum, K. Nigam, and L. H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In KDD '00, pages 169--178, 2000.
- H. Pasula, B. Marthi, B. Milch, S. Russell, and I. Shpitser. Identity uncertainty and citation matching. In NIPS, pages 1425--1432, 2003.
- R. Vernica, M. J. Carey, and C. Li. Efficient parallel set-similarity joins using mapreduce. In SIGMOD Conference, pages 495--506, 2010.



