
Understanding Document Semantics from Summaries: A Case Study on Hindi Texts

Published: 18 November 2016

Abstract

The summary of a document contains the words that actually contribute to the semantics of the document. Latent Semantic Analysis (LSA) is a mathematical model used to understand document semantics by deriving a semantic structure from patterns of word correlations in the document. When LSA is used to capture semantics from summaries, it performs quite well despite being completely independent of any external source of semantics. However, LSA can be remodeled to enhance its capability to analyze correlations within texts. Taking advantage of the model's language independence, this article presents two stages of LSA remodeling to understand document semantics in the Indian context, specifically from Hindi text summaries. The first stage of remodeling provides supplementary information, such as document category and domain information. The second stage uses a supervised term weighting measure in the process. The remodeled LSA's performance is empirically evaluated in a document classification application by comparing its classification accuracies with those of plain LSA. The remodel improves on the plain model by 4.7% to 6.2%. The results suggest that summaries of documents efficiently capture the semantic structure of documents and are an alternative to full-length documents for understanding document semantics.
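To make the two ideas in the abstract concrete, the sketch below shows plain LSA as a truncated singular value decomposition of a term-document matrix, followed by one possible supervised term weight. It is an illustration under stated assumptions, not the article's method: the abstract does not say which supervised measure is used, so an inverse-category-frequency style weight is swapped in here, and the toy matrix, the labels, and the function names are all hypothetical.

```python
import numpy as np

def lsa_document_vectors(term_doc, k):
    """Project documents into a k-dimensional latent space via truncated SVD."""
    # term_doc has shape (n_terms, n_docs); term_doc = U @ diag(s) @ Vt
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    # Keep the k largest singular values; each returned row is one document.
    return (np.diag(s[:k]) @ Vt[:k, :]).T

def icf_weights(term_doc, labels):
    """Assumed supervised weight: log(C / categories containing the term) + 1."""
    categories = sorted(set(labels))
    labels = np.asarray(labels)
    # present[i, j] is True when term i occurs in some document of category j
    present = np.stack(
        [term_doc[:, labels == c].sum(axis=1) > 0 for c in categories], axis=1
    )
    cf = np.maximum(present.sum(axis=1), 1)  # guard against terms seen nowhere
    return np.log(len(categories) / cf) + 1.0

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy data: 5 terms (rows) by 4 summaries (columns).
# Rows 0-1 occur only in category "a", rows 2-3 only in "b", row 4 in both.
term_doc = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
    [1, 1, 1, 1],
], dtype=float)
labels = ["a", "a", "b", "b"]

# Plain LSA: similarity between a summary from "a" and one from "b".
plain = lsa_document_vectors(term_doc, k=2)
plain_cross = cosine(plain[0], plain[2])

# Weighted LSA: scale each term row by its category-based weight first.
weights = icf_weights(term_doc, labels)
weighted = lsa_document_vectors(term_doc * weights[:, None], k=2)
weighted_cross = cosine(weighted[0], weighted[2])

print(plain_cross, weighted_cross)
```

On this toy input the weighting lowers the similarity between summaries from different categories, because the term shared by both categories receives the smallest weight. That is the kind of effect a supervised weight is meant to have before classification, though the gains reported in the abstract come from the article's own measure and Hindi corpus, not from this sketch.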

