An Unsupervised and Robust Line and Word Segmentation Method for Handwritten and Degraded Printed Document

Published: 31 October 2021

Abstract

Segmenting text lines and words in an unconstrained handwritten or degraded machine-printed document is a challenging document analysis problem because of the heterogeneity of document structure: the skew often varies from line to line, and words may be broken. In this article, we propose an unsupervised, robust, and simple statistical method for segmenting a document page image, whether handwritten or machine-printed (degraded or otherwise), into lines and words. The method treats segmentation as a two-class classification problem, in which classification is based on the distribution of gap sizes (between lines and between words) in a binary page image. The method is easy to implement, requires no pre-processing other than binarization of the input image, and does not need high computational resources. It is unsupervised in the sense that no annotated document page images are necessary, so the issue of a training database does not arise; given a document page image, the parameters needed for segmenting text lines and words are learned in an unsupervised manner. We applied the method to several popular publicly available handwritten and machine-printed datasets (ISIDDI, IAM-Hist, IAM, PBOK) covering Indian and other languages and a variety of fonts, and we present experimental results that show its effectiveness and robustness. On the ICDAR-2013 handwriting segmentation contest dataset, our method outperforms the winning method. In addition, we suggest a quantitative measure of the level of degradation of a document page image.
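
To make the idea of classifying gaps into two classes concrete, the following is a minimal sketch and not the authors' algorithm. The abstract does not say which statistical classifier is applied to the gap-size distribution, so this sketch assumes an Otsu-style threshold that maximizes the between-class variance of the gap lengths. It also assumes a single, already isolated binary text line in which 1 marks ink. The function names `blank_runs`, `two_class_threshold`, and `split_words` are hypothetical.

```python
from typing import List, Tuple


def blank_runs(profile: List[int]) -> List[Tuple[int, int]]:
    """Return (start, length) for every run of zeros lying between two ink positions.

    Leading and trailing blank runs (the margins) are ignored.
    """
    runs = []
    start = None
    seen_ink = False
    for i, value in enumerate(profile):
        if value == 0:
            if seen_ink and start is None:
                start = i
        else:
            if start is not None:
                runs.append((start, i - start))
                start = None
            seen_ink = True
    return runs


def two_class_threshold(sizes: List[int]) -> float:
    """Pick the cut that maximizes between-class variance (ASSUMED classifier)."""
    values = sorted(sizes)
    n = len(values)
    if n < 2 or values[0] == values[-1]:
        raise ValueError("need at least two distinct gap sizes")
    total = sum(values)
    best_t, best_score = None, -1.0
    left_sum = 0
    for k in range(1, n):
        left_sum += values[k - 1]
        if values[k] == values[k - 1]:
            continue  # a cut is only possible between distinct values
        w0, w1 = k / n, (n - k) / n
        m0, m1 = left_sum / k, (total - left_sum) / (n - k)
        score = w0 * w1 * (m0 - m1) ** 2
        if score > best_score:
            best_score = score
            best_t = (values[k - 1] + values[k]) / 2
    return best_t


def split_words(line_rows: List[List[int]]) -> List[Tuple[int, int]]:
    """Split one binary text line (1 = ink) into word column spans [start, end)."""
    width = len(line_rows[0])
    # Vertical projection profile: number of ink pixels in each column.
    profile = [sum(row[c] for row in line_rows) for c in range(width)]
    gaps = blank_runs(profile)
    threshold = two_class_threshold([length for _, length in gaps])
    first_ink = next(c for c, v in enumerate(profile) if v)
    last_ink = max(c for c, v in enumerate(profile) if v)
    words, start = [], first_ink
    for gap_start, length in gaps:
        if length > threshold:  # large gap: treated as a word boundary
            words.append((start, gap_start))
            start = gap_start + length
    words.append((start, last_ink + 1))
    return words
```

For example, calling `split_words` on a one-row line such as `[1,1,0,1,1,0,0,0,0,1,1,1,0,1,0,0,0,0,0,1,1]` treats the gaps of length 4 and 5 as word boundaries and the gaps of length 1 as breaks inside a word. The sketch raises `ValueError` when a line has fewer than two distinct gap sizes, a case that a real system would need to handle separately.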

Published in

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 21, Issue 2
March 2022
413 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3494070

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 31 October 2021
• Accepted: 1 July 2021
• Revised: 1 April 2021
• Received: 1 February 2020

Qualifiers

• research-article
• Refereed