
Quality-Based Online Data Reconciliation

Published: 01 February 2016

Abstract

One of the main challenges in data matching and data cleaning in highly integrated systems is duplicate detection. While the literature abounds with approaches for detecting duplicates that correspond to the same real-world entity, most of these approaches eliminate duplicates (wrong information) from the sources, leading to what is called data repair. In this article, we propose a framework that automatically detects duplicates at query time and effectively identifies the consistent version of the data, while keeping inconsistent data in the sources. Our framework uses matching dependencies (MDs) to detect duplicates through the concept of data reconciliation rules (DRRs), and conditional functional dependencies (CFDs) to assess the quality of different attribute values. We also build a duplicate reconciliation index (DRI), based on clusters of duplicates detected by a set of DRRs, to speed up the online data reconciliation process. Our experiments on a real-world data collection show the efficiency and effectiveness of our framework.
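
The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes a toy table, a hypothetical matching rule (equal phone and similar name) standing in for an MD-derived DRR, a single hypothetical constant rule (cc = "44" implies country = "UK") standing in for a CFD, and a plain dictionary from record id to cluster id standing in for the DRI. All field names, function names, and the 0.8 threshold are illustrative assumptions.

```python
from difflib import SequenceMatcher

# Hypothetical source table; records 1 and 2 describe the same person.
RECORDS = [
    {"id": 1, "name": "John Smith", "phone": "555-0100", "cc": "44", "country": "UK"},
    {"id": 2, "name": "Jon Smith", "phone": "555-0100", "cc": "44", "country": "FR"},
    {"id": 3, "name": "Mary Jones", "phone": "555-0199", "cc": "33", "country": "FR"},
]


def similar(a, b, threshold=0.8):
    """Approximate string match; the threshold is an assumption."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def same_entity(r, s):
    """MD-style rule (assumed): equal phone and similar name => same entity."""
    return r["phone"] == s["phone"] and similar(r["name"], s["name"])


def violations(rec):
    """CFD-style rule (assumed): cc = "44" implies country = "UK"."""
    return 1 if rec["cc"] == "44" and rec["country"] != "UK" else 0


def build_index(records):
    """Cluster duplicates with union-find; returns record id -> cluster id."""
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, r in enumerate(records):
        for s in records[i + 1:]:
            if same_entity(r, s):
                parent[find(s["id"])] = find(r["id"])
    return {rid: find(rid) for rid in parent}


def reconciled_answer(records, index, predicate):
    """At query time, return one record per matching cluster: the one with
    the fewest rule violations. The source records are left untouched."""
    clusters = {}
    for r in records:
        clusters.setdefault(index[r["id"]], []).append(r)
    hit = {index[r["id"]] for r in records if predicate(r)}
    return [min(clusters[c], key=violations) for c in sorted(hit)]


index = build_index(RECORDS)  # built once, ahead of queries
answer = reconciled_answer(RECORDS, index, lambda r: r["phone"] == "555-0100")
print([r["id"] for r in answer])
```

In this sketch the index is built once and reused, so each query needs only a cluster lookup plus a quality comparison inside the matching clusters. The inconsistent record stays in `RECORDS`, mirroring the abstract's point that the sources are not repaired.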


Published in

ACM Transactions on Internet Technology, Volume 16, Issue 1
February 2016
129 pages
ISSN: 1533-5399
EISSN: 1557-6051
DOI: 10.1145/2869768
Editor: Munindar P. Singh

Copyright © 2016 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 1 February 2016
• Accepted: 1 July 2015
• Revised: 1 April 2015
• Received: 1 November 2014


Qualifiers

• research-article
• Research
• Refereed
