research-article

Technical Perspective: Toward Building Entity Matching Management Systems

Published: 10 September 2018

Abstract

Entity matching (EM) has been a long-standing challenge in data management. Most current EM work focuses only on developing matching algorithms. We argue that far more effort should be devoted to building EM systems. We discuss the limitations of current EM systems, then describe Magellan, a new kind of EM system. Magellan is novel in four important aspects. (1) It provides how-to guides that tell users what to do in each EM scenario, step by step. (2) It provides tools to help users execute these steps; the tools seek to cover the entire EM pipeline, not just blocking and matching as current EM systems do. (3) The tools are built into the Python open-source data science ecosystem, allowing Magellan to borrow a rich set of capabilities in data cleaning, information extraction (IE), visualization, learning, etc. (4) Magellan provides a powerful scripting environment to facilitate interactive experimentation and quick "patching" of the system. We describe research challenges and present extensive experiments that show the promise of the Magellan approach.
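
For readers unfamiliar with the two pipeline steps the abstract names, blocking and matching, the following is a minimal sketch in plain Python. It is not Magellan's API: the function names (`block`, `match`, `jaccard`), the toy tables, the equality-on-`city` blocking rule, and the 0.5 token-Jaccard threshold are all assumptions chosen for illustration only.

```python
from itertools import product


def tokens(text):
    """Return the set of lowercase whitespace-separated tokens in a string."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity between the token sets of two strings."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def block(table_a, table_b, key):
    """Blocking: keep only the pairs that agree exactly on a cheap attribute."""
    return [(a, b) for a, b in product(table_a, table_b) if a[key] == b[key]]


def match(candidates, attr, threshold=0.5):
    """Matching: keep candidate pairs whose similarity on attr meets the threshold."""
    return [
        (a["id"], b["id"])
        for a, b in candidates
        if jaccard(a[attr], b[attr]) >= threshold
    ]


# Hypothetical toy tables, invented for this sketch.
table_a = [
    {"id": "a1", "name": "Dave Smith", "city": "Madison"},
    {"id": "a2", "name": "Joe Wilson", "city": "San Jose"},
]
table_b = [
    {"id": "b1", "name": "David Smith", "city": "Madison"},
    {"id": "b2", "name": "Dave M Smith", "city": "Madison"},
    {"id": "b3", "name": "Joe Wilson", "city": "San Jose"},
]

# Blocking prunes the 6 possible pairs down to those sharing a city.
candidates = block(table_a, table_b, key="city")
# Matching then applies the more expensive similarity test to the survivors.
matches = match(candidates, attr="name")
print(len(candidates), matches)
```

Blocking here serves only to shrink the cross product before the costlier comparison runs; a real EM workflow would typically add the surrounding steps the abstract alludes to, such as sampling, labeling, and debugging, which this sketch leaves out.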

