Abstract
Automatically summarizing a group of short texts that mainly share one topic is a fundamental task in many applications, e.g., summarizing the main symptoms of a disease from a group of medical texts that are usually short, i.e., tens of words each. Conventional unsupervised short text summarization techniques tend to select the most representative short text document as the summary. However, doing so may cause privacy issues, e.g., personal information in the medical texts may be exposed. Moreover, compared with a complete short text, which may contain unimportant words, a summary consisting of only a few keywords is preferable to users because of its clear and concise form. For these reasons, in this article we aim to solve the problem of unsupervised derivation of a keyword summary for short texts. Existing keyword extraction methods such as Latent Dirichlet Allocation cannot be applied to this problem, since (1) they ignore the ordering relations among the extracted keywords, which makes it hard for readers to capture the main idea of the event, and (2) short texts contain limited context, which makes it hard to find the optimal words for semantic coverage. Hence, we propose a simple yet effective method named Frequent Closed Wordsets Ranking (FCWRank) to derive the keyword summary from a short text cluster. FCWRank is an unsupervised method that builds on the idea of frequent closed itemset mining in transaction databases. FCWRank first mines all frequent closed wordsets from a cluster of short texts and then selects the most important wordset based on an importance model that simultaneously considers the similarity between closed wordsets and the relation between each closed wordset and the short text documents. To make the keywords within the wordset more understandable, FCWRank further unfolds the semantics behind them by sorting them. Experiments on real-world short text collections show that FCWRank outperforms the state-of-the-art baselines in terms of ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation, Longest Common Subsequence) F1, precision, and recall scores.
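The mining step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes whitespace tokenization, an absolute support threshold (the hypothetical `min_support` parameter), and brute-force enumeration instead of an efficient closed itemset miner. The importance model and the keyword ordering step are omitted because the abstract does not specify them.

```python
from itertools import combinations


def frequent_closed_wordsets(texts, min_support=2):
    """Return {wordset: support} for all frequent closed wordsets.

    A wordset is frequent if at least `min_support` texts contain all of
    its words, and closed if no proper superset has the same support.
    Brute force, for illustration only (exponential in vocabulary size).
    """
    # Assumption: each short text is treated as a set of lowercase tokens.
    docs = [frozenset(t.lower().split()) for t in texts]
    vocab = sorted(set().union(*docs))

    support = {}
    for size in range(1, len(vocab) + 1):
        found = False
        for combo in combinations(vocab, size):
            wordset = frozenset(combo)
            count = sum(1 for d in docs if wordset <= d)
            if count >= min_support:
                support[wordset] = count
                found = True
        if not found:
            # Anti-monotonicity: if no wordset of this size is frequent,
            # no larger wordset can be frequent either.
            break

    # Keep only wordsets with no frequent proper superset of equal support.
    return {
        ws: c
        for ws, c in support.items()
        if not any(ws < other and c == oc for other, oc in support.items())
    }


# Hypothetical toy cluster of short medical texts.
texts = [
    "fever cough headache",
    "fever cough fatigue",
    "fever cough",
    "headache nausea",
]
closed = frequent_closed_wordsets(texts, min_support=2)
for wordset, count in sorted(closed.items(), key=lambda kv: -kv[1]):
    print(sorted(wordset), count)
```

In this toy cluster, `{cough}` and `{fever}` are frequent but not closed, because the superset `{cough, fever}` occurs in the same three texts. Only `{cough, fever}` and `{headache}` remain as candidates for a later ranking step.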