Abstract
Keywords are the important words in a text and provide a concise representation of it. With the surge of unlabeled short text on the Internet, automatic keyword extraction has proven useful in other information processing applications. Graph-based approaches are prevalent unsupervised models for this task. However, most of these methods emphasize the relations between words without considering other factors that contribute to a word's importance. Furthermore, when measuring the importance of a word in a text, the damping factor is conventionally set to 0.85, following PageRank. To the best of our knowledge, no existing work investigates the impact of the damping factor on the keyword extraction task. In addition, few labeled Chinese short-text datasets are publicly available for this task. In this article, we investigate the factors that make up a word's importance in a given document and propose an improved graph-based method for keyword extraction from short documents. Moreover, we analyze the impact of these importance factors on performance. We also provide annotated long and short Chinese datasets for this task. The model is evaluated on Chinese and English datasets, and the results show that it improves on previous unsupervised models on short documents. Comparative experiments show that the damping factor is related to text length, a relationship that traditional methods neglect.
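To make the role of the damping factor concrete, the following is a minimal sketch of a generic TextRank-style ranker: PageRank run on a word co-occurrence graph. It is not the improved model proposed in the article. The function names, the window size, and the toy sentence are illustrative assumptions, and the input is assumed to be already tokenized (Chinese text would first need word segmentation).

```python
from collections import defaultdict


def build_cooccurrence_graph(tokens, window=2):
    """Undirected graph linking words that occur within `window` tokens
    of each other (window=2 links adjacent words only)."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph


def rank_words(graph, damping=0.85, iterations=50):
    """Unweighted PageRank by power iteration.

    `damping` is the probability of following an edge; with probability
    1 - damping the random walk jumps to a uniformly chosen word.
    """
    n = len(graph)
    scores = {w: 1.0 / n for w in graph}
    for _ in range(iterations):
        scores = {
            w: (1.0 - damping) / n
            + damping * sum(scores[v] / len(graph[v]) for v in graph[w])
            for w in graph
        }
    return scores


# Toy example (assumed input): compare two damping values on one short text.
tokens = ("graph based keyword extraction ranks keyword "
          "candidates in a word graph").split()
graph = build_cooccurrence_graph(tokens)
for d in (0.85, 0.5):
    scores = rank_words(graph, damping=d)
    top = sorted(scores, key=scores.get, reverse=True)[:3]
    print(d, top)
```

With `damping=0` every word receives the same score, so the graph structure is ignored entirely. Larger values let the co-occurrence links dominate the ranking, which is why the choice of this parameter can plausibly interact with text length.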
Index Terms
Inside Importance Factors of Graph-Based Keyword Extraction on Chinese Short Text