Abstract
Domain terminologies are a basic resource for various natural language processing tasks. To automatically discover terminologies for a domain of interest, most traditional approaches rely on a domain-specific corpus given in advance; their performance can therefore be guaranteed only when a high-quality domain-specific corpus has been collected, which requires extensive human involvement and domain expertise. In this article, we propose a novel approach that automatically mines domain terminologies from a search engine's query log, a type of domain-independent corpus with higher availability, coverage, and timeliness than a manually collected domain-specific corpus. In particular, we represent the query log as a heterogeneous network and formulate the task of mining domain terminology as transductive learning on that network. In the proposed approach, the manifold structure of domain-specificity inherent in the query log is captured by a novel network embedding algorithm and further exploited to reduce the manual annotation effort needed for domain terminology classification. We select Agriculture and Healthcare as the target domains and experiment with a real query log from a commercial search engine. Experimental results show that the proposed approach outperforms several state-of-the-art approaches.
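To make the idea of transductive learning on a query-log network concrete, the following is a minimal sketch only. The abstract does not specify the network's node types or the embedding algorithm, so the sketch assumes a two-type network of queries and clicked URLs and substitutes plain clamped label propagation for the proposed embedding-based method. The log entries, the seed labels, and the function names `build_click_graph` and `propagate_labels` are all hypothetical.

```python
from collections import defaultdict


def build_click_graph(log):
    """Build an undirected two-type graph from (query, clicked_url) pairs.

    Nodes are tagged tuples such as ("query", text) or ("url", address),
    which is an assumed schema, not the one used in the article.
    """
    graph = defaultdict(set)
    for query, url in log:
        q, u = ("query", query), ("url", url)
        graph[q].add(u)
        graph[u].add(q)
    return graph


def propagate_labels(graph, seeds, iterations=50):
    """Clamped label propagation (a stand-in for the article's method).

    Seed nodes keep their given score; every other node repeatedly takes
    the mean score of its neighbours, starting from a neutral 0.5.
    """
    scores = {node: seeds.get(node, 0.5) for node in graph}
    for _ in range(iterations):
        updated = {}
        for node, neighbours in graph.items():
            if node in seeds:
                updated[node] = seeds[node]
            else:
                updated[node] = sum(scores[n] for n in neighbours) / len(neighbours)
        scores = updated
    return scores


# Hypothetical toy query log: (query, clicked URL).
log = [
    ("wheat rust treatment", "agri.example/wheat"),
    ("wheat stripe rust", "agri.example/wheat"),
    ("wheat stripe rust", "agri.example/fungicide"),
    ("cheap flights", "travel.example/deals"),
    ("flight deals", "travel.example/deals"),
]

# Two manually labelled seeds: 1.0 = in the target domain, 0.0 = outside it.
seeds = {
    ("query", "wheat rust treatment"): 1.0,
    ("query", "cheap flights"): 0.0,
}

graph = build_click_graph(log)
scores = propagate_labels(graph, seeds)

# Queries whose propagated score ends above the neutral midpoint.
domain_queries = sorted(
    name for (kind, name), score in scores.items()
    if kind == "query" and score > 0.5
)
print(domain_queries)
```

With only two labelled queries, the unlabelled query that shares a clicked URL with the in-domain seed is pulled toward the domain label, which illustrates how network structure can reduce the amount of manual annotation required.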
Index Terms
Mining Domain Terminologies Using Search Engine's Query Log