research-article

“UTTAM”: An Efficient Spelling Correction System for Hindi Language Based on Supervised Learning

Published: 19 November 2018

Abstract

In this article, we propose a system called “UTTAM” for correcting spelling errors in Hindi text using supervised learning. Unlike many other languages, Hindi has a large character set, inflected words, complex characters, and sets of phonetically similar characters. This complexity increases the possibility of confusion and occasionally leads a writer to enter a wrong character in a word. Spelling errors in text significantly decrease the accuracy of resources such as search engines and text editors. The proposed work is the first approach to correct non-word (out-of-vocabulary) errors and real-word errors simultaneously in a Hindi sentence. The method studies human behavior, i.e., the types and frequencies of spelling errors that people make in Hindi text, and collects this heterogeneous data in matrices. The matrices are used to generate suitable candidate words for each input word. The Viterbi algorithm is then applied to find the best sequence of candidate words, which gives the corrected sentence. For Hindi, this work is the first attempt at real-word error correction. For non-word errors, experiments show that “UTTAM” performs better than the existing systems SpellGuru and Saksham.
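
The candidate-selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how UTTAM builds its matrices or scores candidates, so everything here is an assumption for illustration: the function name `viterbi_correct`, the romanized placeholder tokens, the candidate lists (standing in for what the error matrices would produce), and the invented bigram and channel probabilities. The code only shows the general idea of running Viterbi over a lattice of candidate words.

```python
import math

def viterbi_correct(observed, candidates, bigram_prob, start="<s>"):
    """Return the most probable candidate sequence for the observed words.

    candidates: dict mapping an observed word to a list of
        (candidate, channel_prob) pairs, where channel_prob plays the
        role of P(observed | candidate). Hypothetical stand-in for the
        output of the error matrices.
    bigram_prob: function (prev_word, word) -> P(word | prev_word) > 0.
    """
    # best[c] = (log-score of the best path ending in candidate c, path)
    best = {start: (0.0, [])}
    for word in observed:
        # A word with no candidate list is kept unchanged.
        options = candidates.get(word, [(word, 1.0)])
        new_best = {}
        for cand, channel in options:
            # Choose the best predecessor for this candidate.
            score, path = max(
                (prev_score + math.log(bigram_prob(prev, cand)), prev_path)
                for prev, (prev_score, prev_path) in best.items()
            )
            new_best[cand] = (score + math.log(channel), path + [cand])
        best = new_best
    # The highest-scoring final state carries the corrected sentence.
    return max(best.values())[1]

# Toy data with invented probabilities (romanized placeholder tokens).
BIGRAMS = {
    ("<s>", "main"): 0.5,
    ("main", "ghar"): 0.3,
    ("ghar", "ja"): 0.4,
    ("ja", "raha"): 0.5,
    ("raha", "hoon"): 0.6,
    ("raha", "hain"): 0.01,
}

def bigram_prob(prev, word):
    # Small floor so unseen bigrams never give log(0).
    return BIGRAMS.get((prev, word), 1e-4)

observed = ["main", "ghr", "ja", "raha", "hain"]
candidates = {
    "ghr": [("ghar", 0.6), ("gahra", 0.1)],   # non-word error
    "hain": [("hain", 0.9), ("hoon", 0.1)],   # possible real-word error
}

corrected = viterbi_correct(observed, candidates, bigram_prob)
print(" ".join(corrected))
```

In this toy run the non-word `ghr` and the real-word `hain` are handled in the same pass: each observed word contributes a set of states, and the decoder picks the path whose combined context and error scores are highest. Keeping the observed word itself among the candidates is what allows a correctly spelled word to stay unchanged.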

Published in

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 18, Issue 1
March 2019
196 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3292011

Copyright © 2018 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 19 November 2018
• Accepted: 1 July 2018
• Revised: 1 May 2018
• Received: 1 May 2017

Qualifiers

• research-article
• Research
• Refereed
