Abstract
Automatic text summarization is an active research area in Natural Language Processing and Machine Learning. In this work, we adopt a graph-based approach that uses the PageRank algorithm to automatically summarize Konkani text documents. Konkani is an Indo-Aryan language spoken primarily in Goa, a state on the west coast of India. It is a low-resource language with few language processing tools, whereas such tools are readily available for languages more commonly targeted by automatic text summarization, such as English. The Konkani dataset used for this purpose is based on Konkani folktales. We examine how various language-independent and language-dependent similarity measures affect the construction of the graph. The language-dependent similarity measures use pre-trained fastText word embeddings. For each document, a fully connected undirected graph is constructed in which the sentences are the vertices, and the vertices are connected according to how strongly the corresponding sentences are related. The PageRank algorithm is then used to score and rank the vertices, and the top-ranking sentences form the summary. The ROUGE toolkit was used to evaluate the quality of the system-generated summaries against human-generated “gold-standard” abstracts, and the results were also compared with baseline and benchmark systems. The experimental results show that language-independent similarity measures performed well compared to language-dependent similarity measures, despite not using language-specific resources such as stop-word lists, stemming, and word embeddings.
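The pipeline described in the abstract (sentence graph, similarity-weighted edges, PageRank, top-ranked sentences) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`jaccard`, `build_graph`, `pagerank`, `summarize`), the whitespace tokenization, the damping factor of 0.85, and the choice of Jaccard token overlap as the similarity measure are all assumptions made for illustration. Jaccard overlap is used here only as one example of a language-independent measure; a language-dependent measure based on word embeddings could be passed in through the `similarity` parameter instead.

```python
from itertools import combinations


def jaccard(a, b):
    """Token-overlap similarity between two sentences (illustrative choice)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def build_graph(sentences, similarity=jaccard):
    """Symmetric weight matrix of a fully connected undirected sentence graph."""
    n = len(sentences)
    w = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        w[i][j] = w[j][i] = similarity(sentences[i], sentences[j])
    return w


def pagerank(w, damping=0.85, tol=1e-8, max_iter=200):
    """Weighted PageRank by power iteration; the scores sum to 1."""
    n = len(w)
    if n == 0:
        return []
    out = [sum(row) for row in w]  # total edge weight at each vertex
    scores = [1.0 / n] * n
    for _ in range(max_iter):
        # Vertices with no weighted edges spread their score uniformly.
        dangling = sum(scores[j] for j in range(n) if out[j] == 0.0)
        new = []
        for i in range(n):
            rank = sum(
                w[j][i] / out[j] * scores[j] for j in range(n) if out[j] > 0.0
            )
            new.append((1.0 - damping) / n + damping * (rank + dangling / n))
        delta = sum(abs(a - b) for a, b in zip(new, scores))
        scores = new
        if delta < tol:
            break
    return scores


def summarize(sentences, k=2):
    """Return the k top-ranked sentences in their original document order."""
    scores = pagerank(build_graph(sentences))
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

Calling `summarize(sentences, k)` on a list of sentence strings returns `k` sentences in document order. A sentence that shares no tokens with any other sentence receives only the baseline score and is therefore ranked last, which is the behaviour one would expect from a centrality-based extractive summarizer.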
Impact of Similarity Measures in Graph-based Automatic Text Summarization of Konkani Texts