Abstract
Entity matching (EM) has been a long-standing challenge in data management. Most current EM work focuses only on developing matching algorithms. We argue that far more effort should be devoted to building EM systems. We discuss the limitations of current EM systems, then describe Magellan, a new kind of EM system. Magellan is novel in four important aspects. (1) It provides how-to guides that tell users what to do in each EM scenario, step by step. (2) It provides tools to help users execute these steps; the tools seek to cover the entire EM pipeline, not just blocking and matching as current EM systems do. (3) The tools are built into the open-source Python data science ecosystem, allowing Magellan to borrow a rich set of capabilities in data cleaning, information extraction (IE), visualization, learning, etc. (4) Magellan provides a powerful scripting environment to facilitate interactive experimentation and quick "patching" of the system. We describe research challenges and present extensive experiments that show the promise of the Magellan approach.
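For readers unfamiliar with the two pipeline steps the abstract names, blocking and matching, the following is a minimal sketch in plain Python. It is not Magellan's API: the helper names (`tokens`, `jaccard`, `block`, `match`), the toy tables, the token-overlap blocking rule, and the 0.5 Jaccard threshold are all assumptions chosen for illustration.

```python
from itertools import product


def tokens(s):
    """Return the lowercased whitespace-separated tokens of s as a set."""
    return set(s.lower().split())


def jaccard(a, b):
    """Jaccard similarity between the token sets of two strings."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def block(table_a, table_b):
    """Blocking (hypothetical rule): keep only pairs sharing a name token.

    This prunes obviously non-matching pairs before the costlier matcher runs.
    """
    return [
        (a, b)
        for a, b in product(table_a, table_b)
        if tokens(a["name"]) & tokens(b["name"])
    ]


def match(candidates, threshold=0.5):
    """Matching (hypothetical rule): accept pairs whose Jaccard score
    on the name attribute reaches the threshold."""
    return [
        (a["id"], b["id"])
        for a, b in candidates
        if jaccard(a["name"], b["name"]) >= threshold
    ]


# Toy input tables, invented for this sketch.
table_a = [
    {"id": "a1", "name": "Dave Smith"},
    {"id": "a2", "name": "Joe Wilson"},
]
table_b = [
    {"id": "b1", "name": "David Smith"},
    {"id": "b2", "name": "Joe Wilson Jr"},
    {"id": "b3", "name": "Ann Lee"},
]

candidates = block(table_a, table_b)  # pairs that survive blocking
matches = match(candidates)           # pairs predicted to match
print(len(candidates), matches)
```

On this toy data, blocking keeps two of the six possible pairs, and the fixed-threshold matcher accepts only one of them: "Dave Smith" and "David Smith" share a single token out of three and fall below the threshold. A hand-written rule like this is easy to get wrong, which is the kind of situation where step-by-step guidance and debugging tools around the matcher become useful.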