Abstract
Recently, many scene text detection algorithms have achieved impressive performance by using convolutional neural networks. However, most of them do not fully exploit the context among hierarchical multi-level features. In this article, we present an efficient multi-level feature enhanced cumulative framework, based on instance segmentation, for scene text detection. First, we adopt a Multi-Level Features Enhanced Cumulative (MFEC) module to capture features whose representational ability is cumulatively enhanced. Then, a Multi-Level Features Fusion (MFF) module is designed to fully integrate high-level and low-level MFEC features, adaptively encoding scene text information. To verify the effectiveness of the proposed method, we perform experiments on six public datasets (CTW1500, Total-Text, MSRA-TD500, ICDAR2013, ICDAR2015, and MLT2017) and compare with other state-of-the-art methods. Experimental results demonstrate that the proposed Multi-Level Features Enhanced Cumulative Network (MFECN) detector handles scene text instances of diverse shapes (curved, oriented, and horizontal) well and achieves better or comparable results.
Index Terms
MFECN: Multi-level Feature Enhanced Cumulative Network for Scene Text Detection