Lipi Gnani: A Versatile OCR for Documents in any Language Printed in Kannada Script

Published: 18 May 2020

Abstract

A Kannada OCR called Lipi Gnani has been designed and developed from scratch, with the goal of converting printed text or poetry in Kannada script without any restriction on vocabulary. The training and test sets were collected from more than 35 books published from 1970 to 2002, including books written in Halegannada and pages containing Sanskrit slokas written in Kannada script. The coverage of the OCR is nearly complete in the sense that it recognizes all punctuation marks, special symbols, and Indo-Arabic and Kannada numerals. Several minor and major original contributions have been made at different processing stages of this OCR, such as binarization, character segmentation, recognition, and Unicode mapping. The results show that the resulting Kannada OCR performs as well as, and in some cases better than, Google’s Tesseract OCR. To the best of our knowledge, this is the first report of a complete Kannada OCR that handles all the issues involved. Currently, there is no dictionary-based postprocessing, and the obtained results are due solely to the recognition process. Four benchmark test databases have been created, containing scanned pages from books in the Kannada, Sanskrit, Konkani, and Tulu languages, all printed in Kannada script. The word-level recognition accuracy of Lipi Gnani is 5.3% higher than that of Google’s Tesseract OCR on the Kannada dataset, 8.5% higher on the Sanskrit dataset, and 23.4% higher on the Konkani and Tulu datasets.

Published in

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 19, Issue 4
July 2020
291 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3391538

Copyright © 2020 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 18 May 2020
• Online AM: 7 May 2020
• Accepted: 1 March 2020
• Revised: 1 February 2020
• Received: 1 December 2018
