
Mining Domain Terminologies Using Search Engine's Query Log

Published: 12 August 2021

Abstract

Domain terminologies are a basic resource for various natural language processing tasks. To automatically discover terminologies for a domain of interest, most traditional approaches rely on a domain-specific corpus given in advance; their performance can therefore be guaranteed only when a high-quality domain-specific corpus has been collected, which requires extensive human involvement and domain expertise. In this article, we propose a novel approach that automatically mines domain terminologies from a search engine's query log, a domain-independent corpus with higher availability, coverage, and timeliness than a manually collected domain-specific corpus. In particular, we represent the query log as a heterogeneous network and formulate the task of mining domain terminologies as transductive learning on that network. In the proposed approach, the manifold structure of domain-specificity inherent in the query log is captured by a novel network embedding algorithm and further exploited to reduce the manual annotation effort needed for domain terminology classification. We select Agriculture and Healthcare as the target domains and experiment on a real query log from a commercial search engine. Experimental results show that the proposed approach outperforms several state-of-the-art approaches.
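To make the idea of transductive learning on a query-log network more concrete, here is a minimal sketch. It is not the embedding algorithm proposed in the article, whose details the abstract does not give. It swaps in plain label spreading, which iterates F <- alpha * S F + (1 - alpha) * Y over a normalized adjacency matrix S. The toy log, the three node types (query, term, clicked URL), the seed labels, and the function names `build_graph` and `spread_labels` are all assumptions made for illustration.

```python
from collections import defaultdict
from math import sqrt

# Toy query log: (query text, clicked URL). Entirely made up for illustration.
LOG = [
    ("wheat rust treatment", "agri.example/wheat"),
    ("wheat fertilizer", "agri.example/wheat"),
    ("rust fungicide", "agri.example/fungus"),
    ("fertilizer price", "agri.example/market"),
    ("movie tickets", "fun.example/cinema"),
    ("movie trailer", "fun.example/cinema"),
    ("trailer music", "fun.example/music"),
]


def build_graph(log):
    """Link each query node to its term nodes and to its clicked-URL node.

    The node schema (query / term / url) is an assumption, not the paper's.
    """
    adj = defaultdict(set)
    for query, url in log:
        q = ("query", query)
        u = ("url", url)
        adj[q].add(u)
        adj[u].add(q)
        for tok in query.split():
            t = ("term", tok)
            adj[q].add(t)
            adj[t].add(q)
    return adj


def spread_labels(adj, seeds, alpha=0.8, iters=50):
    """Iterate F <- alpha * S F + (1 - alpha) * Y, with S = D^-1/2 W D^-1/2."""
    deg = {n: len(nb) for n, nb in adj.items()}
    # Y holds the seed labels: +1 in-domain, -1 out-of-domain, 0 unlabeled.
    y = {n: float(seeds.get(n, 0.0)) for n in adj}
    f = dict(y)
    for _ in range(iters):
        f = {
            n: alpha * sum(f[m] / sqrt(deg[n] * deg[m]) for m in adj[n])
            + (1 - alpha) * y[n]
            for n in adj
        }
    return f


graph = build_graph(LOG)
# Two hand-labeled seed terms stand in for the manual annotation.
seeds = {("term", "wheat"): 1.0, ("term", "movie"): -1.0}
scores = spread_labels(graph, seeds)
term_scores = {name: s for (kind, name), s in scores.items() if kind == "term"}
```

With only two seed terms, unlabeled terms such as "fertilizer" and "trailer" take the sign of the seed they are connected to through shared queries and clicked URLs. This is the sense in which network structure can reduce annotation effort. A real query log would be far larger and noisier than this toy example.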



Published in

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 20, Issue 6
November 2021
439 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3476127

Copyright © 2021 Association for Computing Machinery.

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 12 August 2021
• Accepted: 1 April 2021
• Revised: 1 December 2020
• Received: 1 February 2020


Qualifiers

• research-article
• Refereed