research-article

Children’s Story Classification in Indian Languages Using Linguistic and Keyword-based Features

Published: 26 November 2019

Abstract

The primary objective of this work is to classify Hindi and Telugu stories into three genres: fable, folk-tale, and legend. We propose a framework for story classification (SC) using keyword and part-of-speech (POS) features. To improve the performance of the SC system, we explore feature reduction techniques and combinations of various POS tags. We further investigate SC performance when a story is divided into parts according to its semantic structure. Stories are (i) manually divided into parts based on their semantics as introduction, main, and climax; and (ii) automatically divided into equal parts based on the number of sentences in a story as initial, middle, and end. We also examine a sentence increment model, which aims to determine the optimum number of sentences required to identify the story genre by incrementally selecting sentences from a story. Experiments are conducted on Hindi and Telugu story corpora consisting of 300 and 150 short stories, respectively. The SC system is evaluated using different combinations of keyword and POS-based features with three well-established machine learning classifiers: (i) Naive Bayes (NB), (ii) k-Nearest Neighbour (KNN), and (iii) Support Vector Machine (SVM). Classifier performance is evaluated using 10-fold cross-validation, and effectiveness is measured using precision, recall, and F-measure. The classification results show that adding linguistic information boosts the performance of story classification. With respect to story structure, the main and initial parts of the story show comparatively better performance. The results from the sentence increment model indicate that the first nine sentences of Hindi stories and the first seven sentences of Telugu stories are sufficient for better classification. In most of the experiments, SVM models outperformed the other models in classification accuracy.
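The two automatic segmentation ideas in the abstract, equal division by sentence count and incremental sentence selection, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names `split_equal_parts` and `sentence_prefixes`, the rule for distributing leftover sentences, and the placeholder story are all assumptions made for illustration; the abstract does not specify them.

```python
from typing import List


def split_equal_parts(sentences: List[str], n_parts: int = 3) -> List[List[str]]:
    """Split a story's sentences into n_parts contiguous, near-equal chunks."""
    base, extra = divmod(len(sentences), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        # Assumption: leftover sentences go to the earliest parts, one each.
        size = base + (1 if i < extra else 0)
        parts.append(sentences[start:start + size])
        start += size
    return parts


def sentence_prefixes(sentences: List[str], max_sentences: int) -> List[List[str]]:
    """Return growing prefixes of the story: first 1 sentence, first 2, and so on."""
    limit = min(max_sentences, len(sentences))
    return [sentences[:k] for k in range(1, limit + 1)]


# Placeholder 10-sentence story (hypothetical data, not from the corpora).
story = [f"s{i}" for i in range(1, 11)]

# Equal division into initial, middle, and end parts.
initial, middle, end = split_equal_parts(story)

# Incremental selection: each prefix would be classified in turn to find
# the smallest number of sentences that still identifies the genre.
prefixes = sentence_prefixes(story, max_sentences=9)
```

In such a setup, each part or prefix would be converted to keyword and POS features and passed to a classifier; that step is omitted here because the abstract does not describe the feature extraction in detail.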
