Abstract
In this article, we propose a system called “UTTAM” for correcting spelling errors in Hindi text using supervised learning. Compared with many other languages, Hindi has a large character set, inflected words, complex (conjunct) characters, and sets of phonetically similar characters. This complexity increases the chance of confusion and occasionally leads a writer to enter the wrong character in a word. Spelling errors in text significantly reduce the accuracy of tools that depend on it, such as search engines and text editors. The proposed work is the first approach to correct non-word (out-of-vocabulary) errors and real-word errors simultaneously in a Hindi sentence. The method first studies human behavior, namely the types and frequencies of spelling errors that people make in Hindi text. Based on these types and frequencies, heterogeneous data is collected in matrices, which are then used to generate suitable candidate words for each input word. Once the candidates are generated, the Viterbi algorithm finds the best sequence of candidate words and uses it to correct the input sentence. For Hindi, this work is the first attempt at real-word error correction. For non-word errors, experiments show that “UTTAM” performs better than the existing systems SpellGuru and Saksham.
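The decoding step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the scoring model, so the sketch assumes each candidate carries an error-model log-score (standing in for the scores derived from the error matrices) and that word sequences are scored with a bigram language model. The function `viterbi_correct`, the toy probabilities, and the English example sentence are all hypothetical and chosen only to keep the example easy to follow.

```python
import math
from typing import Callable, Dict, List, Tuple


def viterbi_correct(
    candidates: List[Dict[str, float]],
    bigram_logp: Callable[[str, str], float],
) -> List[str]:
    """Return the highest-scoring sequence of candidate words.

    candidates[i] maps every candidate for position i to an assumed
    error-model log-score, log P(typed word | candidate).
    bigram_logp(prev, cur) returns an assumed log P(cur | prev).
    """
    if not candidates:
        return []
    # best[w] = (score of the best path ending in w, the path itself)
    best: Dict[str, Tuple[float, List[str]]] = {
        w: (bigram_logp("<s>", w) + err, [w])
        for w, err in candidates[0].items()
    }
    for column in candidates[1:]:
        nxt: Dict[str, Tuple[float, List[str]]] = {}
        for w, err in column.items():
            # Pick the best predecessor for candidate w.
            score, path = max(
                ((s + bigram_logp(p, w), pth) for p, (s, pth) in best.items()),
                key=lambda item: item[0],
            )
            nxt[w] = (score + err, path + [w])
        best = nxt
    return max(best.values(), key=lambda item: item[0])[1]


# Toy data (hypothetical): the typed sentence "the see is calm" contains a
# real-word error, because "see" is a valid word but wrong in this context.
TOY_BIGRAMS = {
    ("the", "sea"): 0.2,
    ("the", "see"): 0.001,
    ("sea", "is"): 0.3,
    ("see", "is"): 0.01,
}


def toy_bigram_logp(prev: str, cur: str) -> float:
    # Unseen bigrams get a small floor probability (an assumption).
    return math.log(TOY_BIGRAMS.get((prev, cur), 1e-4))


columns = [
    {"the": 0.0},
    {"see": math.log(0.9), "sea": math.log(0.1)},  # typed word is favoured
    {"is": 0.0},
    {"calm": 0.0},
]
result = viterbi_correct(columns, toy_bigram_logp)
print(" ".join(result))
```

In this toy setting the error model prefers keeping the typed word, but the context scores outweigh it, so the decoder replaces it. A non-word error is handled the same way: the typed word is simply left out of its own candidate column.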
“UTTAM”: An Efficient Spelling Correction System for Hindi Language Based on Supervised Learning