ABSTRACT
We make the case for developing a web of concepts by starting with the current view of the web (composed of hyperlinked pages, or documents, each seen as a bag of words), extracting concept-centric metadata, and stitching it together to create a semantically rich aggregate view of all the information available on the web for each concept instance. Building and maintaining such a web of concepts presents many challenges, but it also promises to enable many powerful applications, including novel search and information discovery paradigms. We present the goal, motivate it with example usage scenarios and some analysis of Yahoo! logs, and discuss the challenges in building and leveraging such a web of concepts. We place this ambitious research agenda in the context of the state of the art in the literature, and describe various related ongoing efforts at Yahoo! Research.