Abstract
In this article, we present Convolutional Attention Networks (CAN) for unconstrained scene text recognition. Recent dominant approaches to scene text recognition are mainly based on Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN), where the CNN encodes images and the RNN generates character sequences. Unlike these methods, CAN is built entirely on CNNs and includes an attention mechanism. The distinctive characteristics of our method are that (i) CAN follows an encoder-decoder architecture, in which the encoder is a deep two-dimensional CNN and the decoder is a one-dimensional CNN; (ii) the attention mechanism is applied in every convolutional layer of the decoder, and we propose a novel spatial attention method using average pooling; and (iii) position embeddings are incorporated into both the spatial encoder and the sequence decoder to give the network a sense of location. We conduct experiments on standard datasets for scene text recognition, including Street View Text, IIIT5K, and the ICDAR datasets. The experimental results validate the effectiveness of the individual components and show that our convolution-based method achieves state-of-the-art or competitive performance compared with prior work, even without the use of an RNN.
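The abstract names the ingredients (a 2D feature map from the encoder, a decoder state that attends over it, average pooling inside the spatial attention, and position embeddings) but does not give the formulas. The following is only a rough sketch of how such pieces could fit together, not the method of the article. It assumes dot-product scoring, assumes that the average pooling smooths the score map with a k x k mean filter before the softmax, and assumes additive row and column embeddings. All function names and the toy values are hypothetical.

```python
import math

def add_position_embeddings(features, row_emb, col_emb):
    """Add a row and a column embedding to every cell of an H x W x C grid (assumed additive form)."""
    return [[[f + r + c for f, r, c in zip(cell, row_emb[i], col_emb[j])]
             for j, cell in enumerate(row)]
            for i, row in enumerate(features)]

def average_pool_scores(scores, k=3):
    """Smooth an H x W score map with a k x k mean filter (stride 1, window clipped at borders)."""
    h, w, r = len(scores), len(scores[0]), k // 2
    pooled = []
    for i in range(h):
        out_row = []
        for j in range(w):
            window = [scores[y][x]
                      for y in range(max(0, i - r), min(h, i + r + 1))
                      for x in range(max(0, j - r), min(w, j + r + 1))]
            out_row.append(sum(window) / len(window))
        pooled.append(out_row)
    return pooled

def spatial_attention(query, features, k=3):
    """Return (H x W attention weights, length-C context vector) for one decoder state."""
    # Dot-product score between the decoder state and each spatial location.
    scores = [[sum(q * f for q, f in zip(query, cell)) for cell in row] for row in features]
    # Assumed role of average pooling: smooth the scores over neighbouring locations.
    scores = average_pool_scores(scores, k)
    # Numerically stable softmax over all H * W locations.
    peak = max(max(row) for row in scores)
    exps = [[math.exp(s - peak) for s in row] for row in scores]
    total = sum(sum(row) for row in exps)
    weights = [[e / total for e in row] for row in exps]
    # Context vector: attention-weighted sum of the feature vectors.
    channels = len(features[0][0])
    context = [sum(weights[i][j] * features[i][j][c]
                   for i in range(len(features)) for j in range(len(features[0])))
               for c in range(channels)]
    return weights, context

# Toy 2 x 3 feature map with 2 channels (made-up values, for illustration only).
features = [[[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]],
            [[0.0, 0.0], [2.0, 0.0], [0.0, 0.0]]]
row_emb = [[0.0, 0.0], [0.1, 0.1]]
col_emb = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
encoded = add_position_embeddings(features, row_emb, col_emb)
weights, context = spatial_attention([1.0, 0.0], encoded)
```

In a decoder that applies attention in every convolutional layer, a step like `spatial_attention` would be evaluated once per layer and per output position, with `query` taken from that layer's state. How the article combines the resulting context with the decoder features is not stated in the abstract.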