Abstract
Detecting scene text of arbitrary shape is a challenging task in computer vision. Most existing scene text detection methods use rectangular or quadrangular bounding boxes to denote detected text, which cannot accurately fit text of arbitrary shape, such as curved text. In addition, recent progress in scene text detection has benefited from fully convolutional networks (FCNs). The text cues contained in multi-level convolutional features are complementary for detecting scene text objects, but how best to exploit these multi-level features remains an open problem. To tackle these issues, we propose an Attention-based Bidirectional Long Short-Term Memory (AB-LSTM) model for scene text detection. First, word stroke regions (WSRs) and text center blocks (TCBs) are extracted by two AB-LSTM models, respectively. Then, the union of WSRs and TCBs is used to represent text objects. To verify the effectiveness of the proposed method, we perform experiments on four public benchmarks (CTW1500, Total-text, ICDAR2013, and MSRA-TD500) and compare it with existing state-of-the-art methods. Experimental results demonstrate that the proposed method achieves competitive results and handles scene text objects of arbitrary shape (i.e., curved, oriented, and horizontal forms) well.
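As a minimal illustration of the union step, suppose the WSRs and TCBs are available as binary masks of equal size. The mask representation, the function names, and the 4-connected grouping below are assumptions made for illustration only; the abstract does not state how the union is computed or how text objects are separated. The sketch shows how a center block can join strokes that would otherwise form separate regions.

```python
from collections import deque


def union_mask(wsr_mask, tcb_mask):
    """Pixel-wise union of two equally sized binary masks (lists of rows)."""
    return [
        [bool(a) or bool(b) for a, b in zip(wsr_row, tcb_row)]
        for wsr_row, tcb_row in zip(wsr_mask, tcb_mask)
    ]


def connected_components(mask):
    """Label 4-connected foreground regions; returns (labels, region_count)."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    labels = [[0] * width for _ in range(height)]
    count = 0
    for r in range(height):
        for c in range(width):
            if not mask[r][c] or labels[r][c]:
                continue
            # Start a new region and flood-fill it breadth-first.
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] and not labels[ny][nx]):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count


# Hypothetical toy input: two separate strokes on row 1 and one isolated stroke.
wsr = [[False] * 7 for _ in range(5)]
wsr[1][1] = wsr[1][3] = True
wsr[4][6] = True

# Hypothetical center block spanning the gap between the two strokes on row 1.
tcb = [[False] * 7 for _ in range(5)]
for col in range(1, 4):
    tcb[1][col] = True

text_mask = union_mask(wsr, tcb)
labels, num_objects = connected_components(text_mask)
_, num_stroke_regions = connected_components(wsr)
```

In this toy input the strokes alone form three separate regions, while the union with the center block groups the two strokes on the same row into one candidate text object.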
Index Terms
AB-LSTM: Attention-based Bidirectional LSTM Model for Scene Text Detection