Abstract
Lipi Gnani, a Kannada OCR, has been designed and developed from scratch to convert printed text or poetry in Kannada script without any restriction on vocabulary. The training and test sets have been collected from more than 35 books published between 1970 and 2002, including books written in Halegannada and pages containing Sanskrit slokas written in Kannada script. The coverage of the OCR is nearly complete in the sense that it recognizes all punctuation marks, special symbols, and Indo-Arabic and Kannada numerals. Several original contributions, both minor and major, have been made at different processing stages of this OCR, such as binarization, character segmentation, recognition, and Unicode mapping. As the results show, the resulting Kannada OCR performs as well as, and in some cases better than, Google’s Tesseract OCR. To the best of our knowledge, this is the first report of a complete Kannada OCR that handles all the issues involved. Currently, there is no dictionary-based postprocessing, and the obtained results are due solely to the recognition process. Four benchmark test databases have been created, containing scanned pages from books in the Kannada, Sanskrit, Konkani, and Tulu languages, all printed in Kannada script. The word-level recognition accuracy of Lipi Gnani is higher than that of Google’s Tesseract OCR by 5.3% on the Kannada dataset, 8.5% on the Sanskrit dataset, and 23.4% on the Konkani and Tulu datasets.
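The abstract reports word-level recognition accuracy but does not define how it is computed. One common way to score OCR output against ground truth is to align the two word sequences with an edit distance and count one error per substituted, inserted, or deleted word. The sketch below is written under that assumption: the function names `word_edit_distance` and `word_accuracy`, whitespace tokenization, and normalization by the reference length are illustrative choices, not the evaluation procedure confirmed by the authors.

```python
def word_edit_distance(ref_words, hyp_words):
    """Edit distance between two word lists (dynamic programming, unit costs)."""
    # prev[j] holds the distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from OCR output
            insertion = curr[j - 1] + 1  # extra word in OCR output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def word_accuracy(reference, hypothesis):
    """Assumed metric: 1 - (word errors / reference word count), floored at 0."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        return 1.0 if not hyp_words else 0.0
    errors = word_edit_distance(ref_words, hyp_words)
    return max(0.0, 1.0 - errors / len(ref_words))


if __name__ == "__main__":
    ground_truth = "alpha beta gamma delta"
    ocr_output = "alpha bta gamma delta"
    print(word_accuracy(ground_truth, ocr_output))
```

Because the comparison is on whole words, a single misrecognized character makes the entire word count as an error, which is why word-level figures are typically lower than character-level ones for the same output.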
Lipi Gnani: A Versatile OCR for Documents in any Language Printed in Kannada Script