Inside Importance Factors of Graph-Based Keyword Extraction on Chinese Short Text

Published: 21 June 2020

Abstract

Keywords are the important words in a text and provide a concise representation of it. With the surge of unlabeled short text on the Internet, automatic keyword extraction has proven useful in other information processing applications. Graph-based approaches are prevalent unsupervised models for this task. However, most of these methods emphasize the relations between words without considering other importance factors. Furthermore, when measuring the importance of a word in a text, they set the damping factor to 0.85, following PageRank. To the best of our knowledge, no existing work investigates the impact of the damping factor on the keyword extraction task. In addition, few labeled Chinese short-text datasets are publicly available for this task. In this article, we investigate the importance factors of words in a given document and propose an improved graph-based method for keyword extraction from short documents. Moreover, we analyze the impact of these importance factors on performance. We also provide annotated long and short Chinese datasets for this task. We evaluate the model on Chinese and English datasets, and the results show that it improves on previous unsupervised models on short documents. Comparative experiments show that the damping factor is related to text length, a relationship that traditional methods neglect.
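
As a rough illustration of the kind of graph-based ranking the abstract refers to, the sketch below builds a word co-occurrence graph and scores words with a PageRank-style iteration in which the damping factor is an explicit parameter. It is a generic TextRank-style sketch, not the authors' improved method: the function names, the window size, the iteration count, and the sample tokens are all assumptions made for illustration. For Chinese text, the token list would presumably come from a word segmenter, which is not shown here.

```python
from collections import defaultdict


def build_cooccurrence_graph(tokens, window=2):
    """Link two distinct words if they occur within `window` positions of each other.

    Returns an undirected graph as {word: set_of_neighbours}.
    (Hypothetical helper; the window size is an assumed setting.)
    """
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            other = tokens[j]
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph


def rank_words(graph, damping=0.85, iterations=50):
    """Score words with a PageRank-style iteration on an undirected graph.

    `damping` is the damping factor; 0.85 is the conventional PageRank value.
    Every node in `graph` has at least one neighbour by construction.
    """
    n = len(graph)
    scores = {word: 1.0 / n for word in graph}
    for _ in range(iterations):
        updated = {}
        for word in graph:
            # Each neighbour shares its score equally among its own neighbours.
            incoming = sum(scores[v] / len(graph[v]) for v in graph[word])
            updated[word] = (1.0 - damping) / n + damping * incoming
        scores = updated
    return scores


# Assumed sample input: an already tokenized short text.
tokens = ["keyword", "graph", "keyword", "ranking", "keyword", "damping"]
graph = build_cooccurrence_graph(tokens, window=1)

# Compare the ranking under the conventional value and a smaller damping factor.
for d in (0.85, 0.5):
    scores = rank_words(graph, damping=d)
    ordered = sorted(scores, key=scores.get, reverse=True)
    print(d, ordered)
```

Exposing `damping` as a parameter makes it straightforward to repeat the ranking with different values on texts of different lengths, which is the kind of comparison the abstract describes.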

    • Published in

      ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 19, Issue 5
      September 2020
      278 pages
      ISSN: 2375-4699
      EISSN: 2375-4702
      DOI: 10.1145/3403646

      Copyright © 2020 ACM

      Publisher

      Association for Computing Machinery
      New York, NY, United States

      Publication History

      • Published: 21 June 2020
      • Online AM: 7 May 2020
      • Revised: 1 March 2020
      • Accepted: 1 March 2020
      • Received: 1 October 2017

      Qualifiers

      • research-article
      • Research
      • Refereed
