Abstract
The primary objective of this work is to classify Hindi and Telugu stories into three genres: fable, folk-tale, and legend. We propose a framework for story classification (SC) using keyword and part-of-speech (POS) features. To improve the performance of the SC system, we explore feature reduction techniques and combinations of various POS tags. We further investigate SC performance when a story is divided into parts. Stories are (i) manually divided by semantic structure into introduction, main, and climax parts; and (ii) automatically divided by sentence count into equal initial, middle, and end parts. We also examine a sentence increment model, which selects sentences of a story incrementally to determine the optimum number of sentences required to identify its genre. Experiments are conducted on Hindi and Telugu story corpora consisting of 300 and 150 short stories, respectively. The SC system is evaluated using different combinations of keyword and POS-based features with three well-established machine learning classifiers: (i) Naive Bayes (NB), (ii) k-Nearest Neighbour (KNN), and (iii) Support Vector Machine (SVM). Each classifier is evaluated using 10-fold cross-validation, and its effectiveness is measured using precision, recall, and F-measure. The classification results show that adding linguistic information boosts the performance of story classification. With respect to story structure, the main and initial parts of the story show comparatively better performance than the other parts. The results from the sentence increment model indicate that the first nine sentences of Hindi stories and the first seven sentences of Telugu stories are sufficient for good classification performance. In most of the experiments, SVM models outperformed the other models in classification accuracy.
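The automatic division into equal parts and the sentence increment model described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names are hypothetical, the story is assumed to be already split into sentences, the handling of leftover sentences when the count is not divisible by three is an assumption, and the F-measure is assumed to be the balanced (F1) form.

```python
from typing import List, Tuple


def split_equal_parts(sentences: List[str]) -> Tuple[List[str], List[str], List[str]]:
    """Divide a story into initial, middle, and end parts by sentence count."""
    base, extra = divmod(len(sentences), 3)
    # Assumption: leftover sentences are given to the earlier parts.
    first = base + (1 if extra > 0 else 0)
    second = base + (1 if extra > 1 else 0)
    return (
        sentences[:first],
        sentences[first:first + second],
        sentences[first + second:],
    )


def sentence_increments(sentences: List[str]) -> List[List[str]]:
    """Return growing prefixes of the story: first 1 sentence, first 2, and so on."""
    return [sentences[:k] for k in range(1, len(sentences) + 1)]


def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    story = ["s1", "s2", "s3", "s4", "s5", "s6", "s7"]  # placeholder sentences
    initial, middle, end = split_equal_parts(story)
    print(len(initial), len(middle), len(end))
    print(sentence_increments(story)[2])
```

In a sentence increment experiment, each prefix returned by `sentence_increments` would be turned into features and classified, and the smallest prefix length at which performance stops improving would be taken as the optimum number of sentences.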
Children’s Story Classification in Indian Languages Using Linguistic and Keyword-based Features