Abstract
Segmenting text lines and words in an unconstrained handwritten document or a degraded machine-printed document is a challenging document analysis problem because of the heterogeneity in document structure. Lines are often unevenly skewed, and words are often broken. This article contributes a method for segmenting a document page image into lines and words. We propose an unsupervised, robust, and simple statistical method for segmenting a document image that is either handwritten or machine-printed (degraded or otherwise). The method treats segmentation as a two-class classification problem, in which the classification is based on the distribution of gap sizes (between lines and between words) in a binary page image. The method is easy to implement, requires no pre-processing other than binarization of the input image, and does not need high computational resources. It is unsupervised in the sense that no annotated document page images are necessary, so the issue of a training database does not arise: given a document page image, the parameters needed to segment text lines and words are learned in an unsupervised manner. We applied the method to several popular publicly available handwritten and machine-printed datasets (ISIDDI, IAM-Hist, IAM, PBOK) covering Indian and other languages and a variety of fonts, and we present experimental results that show its effectiveness and robustness. On the ICDAR-2013 handwriting segmentation contest dataset, our method outperforms the winning method. In addition, we suggest a quantitative measure of the level of degradation of a document page image.
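The abstract frames segmentation as a two-class decision over gap sizes but does not give the classifier itself. The following is a minimal illustrative sketch of that idea for word gaps within a single text line, not the authors' algorithm. It assumes a column-wise ink-count profile of a binarized line as input and swaps in an Otsu-style between-class variance criterion to split gaps into "within-word" and "between-word" classes. The function names `gap_lengths` and `two_class_threshold` are hypothetical.

```python
def gap_lengths(profile):
    """Return the lengths of zero runs that lie between inked columns.

    `profile` is assumed to hold the number of ink pixels per column
    (or per row, for line gaps) of a binarized image region.
    Leading and trailing blank margins are ignored.
    """
    gaps = []
    run = 0
    seen_ink = False
    for value in profile:
        if value > 0:
            if seen_ink and run > 0:
                gaps.append(run)
            seen_ink = True
            run = 0
        elif seen_ink:
            run += 1
    return gaps


def two_class_threshold(gaps):
    """Split gap sizes into two classes with an Otsu-style criterion.

    Assumption: this stands in for the paper's classifier. It returns the
    midpoint between the two adjacent distinct gap sizes whose split
    maximizes the between-class variance, or None if all gaps are equal.
    """
    values = sorted(set(gaps))
    total = len(gaps)
    best_threshold, best_score = None, -1.0
    for low, high in zip(values, values[1:]):
        small = [g for g in gaps if g <= low]
        large = [g for g in gaps if g > low]
        weight_small = len(small) / total
        weight_large = len(large) / total
        mean_small = sum(small) / len(small)
        mean_large = sum(large) / len(large)
        score = weight_small * weight_large * (mean_small - mean_large) ** 2
        if score > best_score:
            best_score = score
            best_threshold = (low + high) / 2
    return best_threshold


# Toy column profile of one text line: zeros are blank columns.
profile = [3, 2, 0, 4, 1, 0, 0, 0, 0, 0, 5, 2, 0, 0, 3,
           0, 0, 0, 0, 0, 0, 2, 1]
gaps = gap_lengths(profile)
threshold = two_class_threshold(gaps)
word_gaps = [g for g in gaps if g > threshold]
print(gaps, threshold, word_gaps)
```

In this toy profile the gaps of width 5 and 6 fall in the larger class and would be treated as word boundaries, while the gaps of width 1 and 2 would be treated as breaks inside a word. The same two functions could be applied to a row-wise profile to separate text lines, under the same assumptions.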
An Unsupervised and Robust Line and Word Segmentation Method for Handwritten and Degraded Printed Document