Abstract
Tables on the Web contain a vast amount of knowledge in a structured form. To tap into this valuable resource, we address the problem of table retrieval: answering an information need with a ranked list of tables. We investigate this problem in two different variants, based on how the information need is expressed: as a keyword query or as an existing table (“query-by-table”). The main novel contribution of this work is a semantic table retrieval framework for matching information needs (keyword or table queries) against tables. Specifically, we (i) represent queries and tables in multiple semantic spaces (both discrete sparse and continuous dense vector representations) and (ii) introduce various similarity measures for matching those semantic representations. We consider all possible combinations of semantic representations and similarity measures and use these as features in a supervised learning model. Using two purpose-built test collections based on Wikipedia tables, we demonstrate significant and substantial improvements over state-of-the-art baselines.
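The feature construction described above can be sketched in a few lines. This is a minimal illustration only, not the authors' implementation. The abstract does not name the similarity measures or the semantic spaces, so the measures here (cosine between centroids, and the maximum and average of pairwise cosines), the space names (`words`, `entities`), and the function names are all assumptions made for the example.

```python
import math
from itertools import product


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def similarity_features(query_vecs, table_vecs):
    """Hypothetical similarity measures between two sets of vectors."""
    pairwise = [cosine(q, t) for q, t in product(query_vecs, table_vecs)]
    return {
        # Compare the two sets after collapsing each into a single vector.
        "centroid_cos": cosine(centroid(query_vecs), centroid(table_vecs)),
        # Compare every query vector with every table vector, then aggregate.
        "pair_max": max(pairwise),
        "pair_avg": sum(pairwise) / len(pairwise),
    }


def feature_vector(query_repr, table_repr):
    """One feature per (semantic space, similarity measure) combination."""
    features = {}
    for space in sorted(query_repr):
        scores = similarity_features(query_repr[space], table_repr[space])
        for name, value in scores.items():
            features[f"{space}:{name}"] = value
    return features


# Toy vectors standing in for real semantic representations (assumed data).
query = {"words": [[1.0, 0.0], [0.0, 1.0]], "entities": [[1.0, 0.0, 0.0]]}
table = {"words": [[1.0, 0.0]], "entities": [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]}
features = feature_vector(query, table)
```

With two spaces and three measures, `features` holds six values. In a setup like the one the abstract describes, such a vector would be computed for each query-table pair and passed to a supervised ranking model.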
Semantic Table Retrieval Using Keyword and Table Queries