Abstract
A summary of a document contains the words that actually contribute to the semantics of the document. Latent Semantic Analysis (LSA) is a mathematical model used to understand document semantics by deriving a semantic structure from patterns of word correlations in the document. When LSA is used to capture semantics from summaries, it performs quite well despite being completely independent of any external sources of semantics. However, LSA can be remodeled to enhance its capability to analyze correlations within texts. Taking advantage of the model's language independence, this article presents two stages of LSA remodeling to understand document semantics in the Indian context, specifically from Hindi text summaries. The first stage provides supplementary information, such as document category and domain information. The second stage uses a supervised term weighting measure in the process. The remodeled LSA's performance is empirically evaluated in a document classification application by comparing its classification accuracies with those of plain LSA. The remodeled LSA improves on the performance of the plain model by 4.7% to 6.2%. The results suggest that summaries of documents efficiently capture the semantic structure of documents and are an alternative to full-length documents for understanding document semantics.
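As a rough illustration of the plain LSA baseline described above (not the remodeled variant, and not the authors' implementation), the sketch below builds a term-document count matrix, truncates its singular value decomposition, and compares documents in the reduced space. The toy English documents, the whitespace tokenization, the raw-count weighting, the choice `k=2`, and the function names `lsa_document_vectors` and `cosine` are all assumptions made for illustration.

```python
import numpy as np


def lsa_document_vectors(docs, k=2):
    """Return k-dimensional LSA vectors, one row per document (illustrative helper)."""
    # Assumption: whitespace tokenization; a real Hindi pipeline would need more.
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: i for i, w in enumerate(vocab)}

    # Term-document matrix of raw counts (rows: terms, columns: documents).
    A = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.split():
            A[index[w], j] += 1.0

    # Thin SVD: A = U @ diag(s) @ Vt, singular values in descending order.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    # Keep the k largest singular values and scale document coordinates by them.
    return (Vt[:k, :] * s[:k, None]).T


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Hypothetical toy corpus: two sports-like and two politics-like "summaries".
docs = [
    "cricket match score team",
    "team wins cricket match",
    "election vote party minister",
    "minister party wins election",
]
vecs = lsa_document_vectors(docs, k=2)

same_topic = cosine(vecs[0], vecs[1])
cross_topic = cosine(vecs[0], vecs[2])
print(f"same topic: {same_topic:.3f}, cross topic: {cross_topic:.3f}")
```

In this sketch, a supervised term weighting measure of the kind the abstract mentions would replace the raw counts in `A` before the decomposition; the specific measure is not given here, so it is left out.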
Understanding Document Semantics from Summaries: A Case Study on Hindi Texts