
Unsupervised Derivation of Keyword Summary for Short Texts

Published: 02 June 2021

Abstract

Automatically summarizing a group of short texts that mainly share one topic is a fundamental task in many applications, e.g., summarizing the main symptoms of a disease from a group of medical texts that are usually short (tens of words). Conventional unsupervised short text summarization techniques tend to select the most representative short text document. However, doing so may raise privacy issues, e.g., personal information in the medical texts may be exposed. Moreover, compared with a complete short text, which may contain unimportant words, a summary consisting of only a few keywords is preferable to users because of its clear and concise form. For these reasons, in this article we aim to solve the problem of unsupervised derivation of a keyword summary for short texts. Existing keyword extraction methods such as Latent Dirichlet Allocation cannot be applied to this problem, since (1) the ordering relations among the extracted keywords are ignored, which makes it difficult for readers to capture the main idea of the event, and (2) short texts contain limited context, which makes it hard to find the optimal words for semantic coverage. Hence, we propose a simple yet effective method named Frequent Closed Wordsets Ranking (FCWRank) to derive the keyword summary from a short text cluster. FCWRank is an unsupervised method that builds on the idea of frequent closed itemset mining in transaction databases. FCWRank first mines all frequent closed wordsets from a cluster of short texts and then selects the most important wordset based on an importance model that simultaneously considers the similarity between closed wordsets and the relation between the closed wordset and the short text documents. To make the keywords within the wordset more understandable, FCWRank further unfolds the semantics behind them by sorting them. Experiments on real-world short text collections show that FCWRank outperforms the state-of-the-art baselines in terms of ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation, Longest Common Subsequence) F1, precision, and recall scores.
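
To make the notion of a frequent closed wordset concrete, the following is a minimal sketch of the mining step only. It is not the authors' implementation: the function names, the whitespace tokenization, the `min_support` threshold, and the toy texts are all assumptions made for illustration, and the enumeration is brute force, so it is practical only for a tiny vocabulary. The importance model and the keyword ordering step are not specified in the abstract and are left out. A wordset is treated as frequent if at least `min_support` texts contain all of its words, and as closed if no proper superset has the same support.

```python
from itertools import combinations


def support(wordset, docs):
    """Count the documents that contain every word in `wordset`."""
    return sum(1 for doc in docs if wordset <= doc)


def frequent_closed_wordsets(texts, min_support=2):
    """Brute-force mining of frequent closed wordsets (toy scale only).

    Hypothetical helper for illustration; each text is treated as a
    transaction whose items are its lowercase, whitespace-separated words.
    """
    docs = [frozenset(t.lower().split()) for t in texts]
    vocab = sorted(set().union(*docs))

    # Step 1: collect every frequent wordset, growing the size level by level.
    frequent = {}
    for size in range(1, len(vocab) + 1):
        found_any = False
        for combo in combinations(vocab, size):
            wordset = frozenset(combo)
            count = support(wordset, docs)
            if count >= min_support:
                frequent[wordset] = count
                found_any = True
        if not found_any:
            # Support is anti-monotone: if no wordset of this size is
            # frequent, no larger wordset can be frequent either.
            break

    # Step 2: keep only closed wordsets, i.e. those with no proper
    # superset that has exactly the same support.
    return {
        wordset: count
        for wordset, count in frequent.items()
        if not any(
            wordset < other and count == other_count
            for other, other_count in frequent.items()
        )
    }


if __name__ == "__main__":
    cluster = [
        "fever cough headache",
        "fever cough fatigue",
        "fever cough",
        "headache nausea",
    ]
    for wordset, count in frequent_closed_wordsets(cluster).items():
        print(sorted(wordset), count)
```

In this toy cluster, "fever" and "cough" are each frequent but not closed, because the pair containing both appears in exactly the same three texts; only the pair and the single word "headache" survive. A production setting would replace the brute-force enumeration with a dedicated closed itemset mining algorithm.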

Published in

ACM Transactions on Internet Technology, Volume 21, Issue 2
June 2021
599 pages
ISSN: 1533-5399
EISSN: 1557-6051
DOI: 10.1145/3453144
Editor: Ling Liu

Copyright © 2021 Association for Computing Machinery.

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 2 June 2021
• Online AM: 7 May 2020
• Revised: 1 April 2020
• Accepted: 1 April 2020
• Received: 1 February 2020

      Qualifiers

      • research-article
      • Refereed
