AB-LSTM: Attention-based Bidirectional LSTM Model for Scene Text Detection

Published: 16 December 2019

Abstract

Detecting scene text of arbitrary shape is a challenging task in computer vision. Most existing scene text detection methods use rectangular or quadrangular bounding boxes to denote the detected text, which cannot accurately fit text with arbitrary shapes, such as curved text. In addition, recent progress in scene text detection has benefited from Fully Convolutional Networks. The text cues contained in multi-level convolutional features are complementary for detecting scene text objects, but how to exploit these multi-level features is still an open problem. To tackle these issues, we propose an Attention-based Bidirectional Long Short-Term Memory (AB-LSTM) model for scene text detection. First, word stroke regions (WSRs) and text center blocks (TCBs) are extracted by two AB-LSTM models, respectively. Then, the union of the WSRs and TCBs is used to represent text objects. To verify the effectiveness of the proposed method, we perform experiments on four public benchmarks (CTW1500, Total-text, ICDAR2013, and MSRA-TD500) and compare it with existing state-of-the-art methods. Experimental results demonstrate that the proposed method achieves competitive results and handles scene text objects of arbitrary shapes (i.e., curved, oriented, and horizontal forms) well.
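
The final step described in the abstract, taking the union of WSRs and TCBs to represent text objects, can be illustrated with a minimal sketch. It assumes that both outputs are available as binary masks of the same size and that each connected region of the union is treated as one text object; the abstract specifies neither, and the function names `union_masks` and `label_components` are hypothetical. This is not the authors' implementation.

```python
from collections import deque

import numpy as np


def union_masks(wsr_mask, tcb_mask):
    """Pixel-wise union of two binary masks (assumed to share one shape)."""
    wsr = np.asarray(wsr_mask, dtype=bool)
    tcb = np.asarray(tcb_mask, dtype=bool)
    if wsr.shape != tcb.shape:
        raise ValueError("masks must have the same shape")
    return wsr | tcb


def label_components(mask):
    """Label 4-connected foreground regions with a breadth-first search.

    Returns (labels, count); background pixels keep label 0.
    Grouping by connectivity is an assumption of this sketch.
    """
    mask = np.asarray(mask, dtype=bool)
    height, width = mask.shape
    labels = np.zeros(mask.shape, dtype=np.int32)
    count = 0
    for r in range(height):
        for c in range(width):
            if not mask[r, c] or labels[r, c]:
                continue
            # Unvisited foreground pixel: start a new component here.
            count += 1
            labels[r, c] = count
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and mask[ny, nx] and not labels[ny, nx]:
                        labels[ny, nx] = count
                        queue.append((ny, nx))
    return labels, count


# Toy masks standing in for the outputs of the two models.
wsr = np.array([[1, 1, 0, 0, 0, 0],
                [0, 0, 0, 0, 0, 0],
                [0, 0, 0, 0, 1, 1]])
tcb = np.array([[0, 1, 1, 0, 0, 0],
                [0, 0, 0, 0, 0, 0],
                [0, 0, 0, 0, 0, 1]])

text_mask = union_masks(wsr, tcb)
labels, num_objects = label_components(text_mask)
```

In this toy input the union covers five pixels that fall into two separate regions, so the sketch reports two text objects. Because the grouping works on pixel regions instead of boxes, it places no constraint on the shape of each object.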
