short-paper

Impact of Similarity Measures in Graph-based Automatic Text Summarization of Konkani Texts

Published: 21 February 2023

Abstract

Automatic text summarization is a popular research area in Natural Language Processing and Machine Learning. In this work, we adopt a graph-based text summarization approach, using the PageRank algorithm, to automatically summarize Konkani text documents. Konkani is an Indo-Aryan language spoken primarily in the state of Goa, on the west coast of India. It is a low-resource language with limited language processing tools, whereas such tools are readily available for languages more commonly chosen for automatic text summarization, such as English. The Konkani dataset used in this work is based on Konkani folktales. We examine the impact of various language-independent and language-dependent similarity measures on the construction of the graph. The language-dependent similarity measures use pre-trained fastText word embeddings. A fully connected undirected graph is constructed for each document, with the sentences represented as the graph's vertices. The edges between vertices reflect how strongly the corresponding sentences are related to one another. The PageRank algorithm is then used to score and rank the vertices, and the top-ranking sentences are used to generate the summary. The ROUGE toolkit was used to evaluate the quality of these system-generated summaries; performance was evaluated against human-generated “gold-standard” abstracts and compared with baselines and benchmark systems. The experimental results show that language-independent similarity measures performed well compared to language-dependent similarity measures, despite not using language-specific tools such as stop-word lists, stemming, and word embeddings.
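
The pipeline described in the abstract (sentences as vertices, similarity-weighted edges, PageRank scoring, top-ranked sentences as the summary) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes Jaccard word overlap as a stand-in for the language-independent similarity measures, whitespace tokenization, a damping factor of 0.85, and hypothetical helper names (`tokenize`, `jaccard`, `build_graph`, `pagerank`, `summarize`). The English sentences in the usage note are placeholders, not data from the Konkani folktale dataset.

```python
import string


def tokenize(sentence):
    """Lowercase, split on whitespace, and strip ASCII punctuation (an assumed tokenizer)."""
    words = (w.strip(string.punctuation) for w in sentence.lower().split())
    return [w for w in words if w]


def jaccard(tokens_a, tokens_b):
    """Jaccard similarity between two token lists; needs no language-specific resources."""
    a, b = set(tokens_a), set(tokens_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_graph(sentences):
    """Return a symmetric weight matrix: one vertex per sentence, one edge per sentence pair."""
    tokens = [tokenize(s) for s in sentences]
    n = len(sentences)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            weights[i][j] = weights[j][i] = jaccard(tokens[i], tokens[j])
    return weights


def pagerank(weights, damping=0.85, tol=1e-8, max_iter=200):
    """Weighted PageRank by power iteration; scores sum to 1."""
    n = len(weights)
    if n == 0:
        return []
    out = [sum(row) for row in weights]  # total edge weight per vertex
    scores = [1.0 / n] * n
    for _ in range(max_iter):
        # Vertices with no weighted edges spread their score uniformly.
        dangling = sum(scores[j] for j in range(n) if out[j] == 0)
        new_scores = []
        for i in range(n):
            rank = sum(
                weights[j][i] / out[j] * scores[j] for j in range(n) if out[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * (rank + dangling / n))
        delta = sum(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if delta < tol:
            break
    return scores


def summarize(sentences, k=2):
    """Pick the k highest-scoring sentences and return them in document order."""
    scores = pagerank(build_graph(sentences))
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

Calling `summarize(["The fox lived in the forest.", "The fox tricked the crow in the forest.", "The crow lost the cheese.", "Rain fell."], k=2)` returns the two sentences that share the most vocabulary with the rest of the text. Replacing `jaccard` with an embedding-based function would give a rough analogue of the language-dependent setting, although which measures the paper actually compares is not specified beyond the abstract.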

Published in

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 2
February 2023
624 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3572719

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 21 February 2023
• Online AM: 13 September 2022
• Accepted: 21 July 2022
• Received: 22 March 2022
