Abstract
One of the main challenges in data matching and data cleaning in highly integrated systems is duplicate detection. While the literature abounds with approaches for detecting duplicates that correspond to the same real-world entity, most of these approaches eliminate the duplicates (wrong information) from the sources, a process known as data repair. In this article, we propose a framework that automatically detects duplicates at query time and effectively identifies the consistent version of the data, while keeping inconsistent data in the sources. Our framework uses matching dependencies (MDs) to detect duplicates through the concept of data reconciliation rules (DRRs), and conditional functional dependencies (CFDs) to assess the quality of different attribute values. We also build a duplicate reconciliation index (DRI), based on clusters of duplicates detected by a set of DRRs, to speed up the online data reconciliation process. Our experiments on a real-world data collection show the efficiency and effectiveness of our framework.
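To make the pipeline in the abstract concrete, the following is a minimal Python sketch of the general idea, not the authors' implementation. It assumes a toy schema (`name`, `zip`, `city`), a single MD-like rule (similar names and equal zip codes indicate the same entity), a single CFD-like rule (a zip code determines a city), and a plain dictionary standing in for the duplicate reconciliation index. All function names, the similarity threshold, and the sample data are hypothetical.

```python
from difflib import SequenceMatcher

# Toy source data; the schema and values are hypothetical.
RECORDS = [
    {"id": 1, "name": "John Smith", "zip": "10001", "city": "New York"},
    {"id": 2, "name": "Jon Smith", "zip": "10001", "city": "Newark"},
    {"id": 3, "name": "Mary Jones", "zip": "60601", "city": "Chicago"},
]

# CFD-like rule expressed as a pattern table: zip -> expected city (assumed).
CFD_ZIP_CITY = {"10001": "New York", "60601": "Chicago"}


def similar(a, b, threshold=0.8):
    """Approximate string match; the 0.8 threshold is an arbitrary assumption."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def md_match(r, s):
    """MD-like rule: similar names and equal zip codes suggest one entity."""
    return similar(r["name"], s["name"]) and r["zip"] == s["zip"]


def build_index(records):
    """Offline step: cluster duplicates with union-find.

    Returns a dict mapping a cluster representative id to its records,
    a simplified stand-in for a duplicate reconciliation index.
    """
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, r in enumerate(records):
        for s in records[i + 1:]:
            if md_match(r, s):
                parent[find(s["id"])] = find(r["id"])

    clusters = {}
    for r in records:
        clusters.setdefault(find(r["id"]), []).append(r)
    return clusters


def violations(record):
    """Count CFD violations for one record (0 or 1 with a single rule)."""
    expected = CFD_ZIP_CITY.get(record["zip"])
    return int(expected is not None and record["city"] != expected)


def reconcile(cluster):
    """Pick the record with the fewest CFD violations as the consistent version."""
    return min(cluster, key=violations)


def query(clusters):
    """Query-time step: return one reconciled record per duplicate cluster."""
    return [reconcile(c) for c in clusters.values()]


if __name__ == "__main__":
    index = build_index(RECORDS)
    for row in query(index):
        print(row)
```

Note that `RECORDS` is never modified: the inconsistent duplicate stays in the source, and only the query result is reconciled, which mirrors the distinction the abstract draws between reconciliation and data repair.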
Index Terms
Quality-Based Online Data Reconciliation