Abstract
The work reported in this article deals with a ground truth generation scheme for online handwritten Bangla documents at the text-line, word, and stroke levels. The aim of the proposed scheme is twofold: first, to build a document-level database that future researchers can use for work in this field; second, to provide ground truth information that will help other researchers evaluate the performance of their algorithms for text-line extraction, word extraction, word segmentation, stroke recognition, and word recognition. The reported scheme starts with text-line extraction from the online handwritten Bangla documents, then extracts words from the text-lines, and finally segments those words into basic strokes. After word segmentation, the basic strokes are assigned appropriate class labels using a modified distance-based feature extraction procedure and an MLP (Multi-layer Perceptron) classifier. The Unicode for each word is then generated from its sequence of stroke labels. XML files are used to store the stroke-, word-, and text-line-level ground truth information for the corresponding documents. The proposed system is semi-automatic, and each step (text-line extraction, word extraction, word segmentation, and stroke recognition) has been implemented using different algorithms. The proposed ground truth generation procedure thus substantially reduces manual intervention by cutting the number of mouse clicks required to extract text-lines and words from the document and to segment the words into basic strokes. The integrated stroke recognition module also helps to minimize the manual labor needed to assign appropriate stroke labels. The system is freely available and can be accessed at https://byanjon.herokuapp.com/.
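The three-level ground truth described above (text-lines containing words, words containing labeled strokes) maps naturally onto a nested XML tree. The following is only a minimal sketch of that idea, not the schema actually used by the system: the element names (`document`, `textline`, `word`, `stroke`), the attribute names, the stroke labels `S01` and `S02`, and the helper `build_ground_truth` are all hypothetical and chosen purely for illustration.

```python
import xml.etree.ElementTree as ET


def build_ground_truth(doc_id, text_lines):
    """Build a nested XML tree: document -> textline -> word -> stroke.

    text_lines is a list of text-lines; each text-line is a list of words;
    each word is a dict with the keys 'unicode' (str) and 'strokes'
    (a list of (label, points) pairs, where points is a list of (x, y)).
    All element and attribute names here are assumptions, not the paper's.
    """
    root = ET.Element("document", id=doc_id)
    for line_idx, words in enumerate(text_lines):
        line_el = ET.SubElement(root, "textline", index=str(line_idx))
        for word_idx, word in enumerate(words):
            word_el = ET.SubElement(
                line_el, "word", index=str(word_idx), unicode=word["unicode"]
            )
            for stroke_idx, (label, points) in enumerate(word["strokes"]):
                stroke_el = ET.SubElement(
                    word_el, "stroke", index=str(stroke_idx), label=label
                )
                # Store the pen trajectory as space-separated "x,y" pairs.
                stroke_el.text = " ".join(f"{x},{y}" for x, y in points)
    return root


# Hypothetical document: one text-line holding one two-stroke word.
sample_lines = [
    [
        {
            "unicode": "কল",
            "strokes": [
                ("S01", [(0, 0), (1, 2)]),
                ("S02", [(3, 0), (4, 2)]),
            ],
        }
    ]
]

root = build_ground_truth("doc_001", sample_lines)
xml_text = ET.tostring(root, encoding="unicode")
```

Keeping the stroke labels and the word-level Unicode in the same tree means an evaluation script can score text-line extraction, word extraction, and stroke recognition from a single file per document.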
Index Terms
BYANJON: A Ground Truth Preparation System for Online Handwritten Bangla Documents