Abstract
Recognizing irregular text in natural scene images is challenging because the appearance of text is unconstrained: it may be curved, arbitrarily oriented, or distorted. Recent recognition networks treat this task as a text sequence labeling problem, and most of them capture the sequence from a single-granularity visual representation only, which limits recognition performance to some extent. In this article, we propose a hierarchical attention network that captures multi-granularity deep local representations for recognizing irregular scene text. It consists of several hierarchical attention blocks, each of which contains a Local Visual Representation Module (LVRM) and a Decoder Module (DM). Based on the hierarchical attention network, we propose a scene text recognition network. Extensive experiments show that the proposed network achieves state-of-the-art performance on several benchmark datasets, including IIIT-5K, SVT, CUTE, SVT-Perspective, and the ICDAR datasets, with shorter training time.
Multi-granularity Deep Local Representations for Irregular Scene Text Recognition