
Convolutional Attention Networks for Scene Text Recognition

Published: 24 January 2019

Abstract

In this article, we present Convolutional Attention Networks (CAN) for unconstrained scene text recognition. Recent dominant approaches to scene text recognition are mainly based on Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN), where the CNN encodes images and the RNN generates character sequences. In contrast, CAN is built entirely on CNNs and includes an attention mechanism. The distinctive characteristics of our method are as follows: (i) CAN follows an encoder-decoder architecture, in which the encoder is a deep two-dimensional CNN and the decoder is a one-dimensional CNN; (ii) the attention mechanism is applied in every convolutional layer of the decoder, and we propose a novel spatial attention method using average pooling; and (iii) position embeddings are used in both the spatial encoder and the sequence decoder to give our networks a sense of location. We conduct experiments on standard datasets for scene text recognition, including the Street View Text, IIIT5K, and ICDAR datasets. The experimental results validate the effectiveness of the individual components and show that our convolution-based method achieves state-of-the-art or competitive performance compared with prior work, even without the use of RNNs.
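As a rough illustration of item (ii), the sketch below shows one way a single decoder state could attend over a two-dimensional encoder feature map. The abstract does not say how average pooling enters the attention computation, so this sketch assumes it smooths the encoder features that serve as attention keys, with dot-product scoring and a softmax over all spatial positions. The function names `avg_pool2d` and `spatial_attention`, the 3 x 3 window, and the toy feature map are hypothetical and are not taken from the article.

```python
import math

def avg_pool2d(fmap, k=3):
    """Average-pool an H x W x C feature map (nested lists) with a k x k
    window and stride 1; windows are clipped at the image border."""
    H, W, C = len(fmap), len(fmap[0]), len(fmap[0][0])
    r = k // 2
    out = []
    for i in range(H):
        row = []
        for j in range(W):
            cells = [fmap[y][x]
                     for y in range(max(0, i - r), min(H, i + r + 1))
                     for x in range(max(0, j - r), min(W, j + r + 1))]
            row.append([sum(c[ch] for c in cells) / len(cells) for ch in range(C)])
        out.append(row)
    return out

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def spatial_attention(query, fmap, k=3):
    """Attend from one decoder state (length C) over an H x W x C encoder map.

    Assumption: keys are the average-pooled features, values are the raw ones.
    Returns the H*W attention weights (row-major) and the length-C context.
    """
    pooled = avg_pool2d(fmap, k)
    H, W, C = len(fmap), len(fmap[0]), len(query)
    keys = [pooled[i][j] for i in range(H) for j in range(W)]
    values = [fmap[i][j] for i in range(H) for j in range(W)]
    scores = [sum(q * kk for q, kk in zip(query, key)) for key in keys]
    weights = softmax(scores)
    context = [sum(w * v[ch] for w, v in zip(weights, values)) for ch in range(C)]
    return weights, context

# Toy 2 x 3 feature map with 2 channels and one decoder state.
fmap = [[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
        [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]]
weights, context = spatial_attention([0.5, -0.5], fmap)
```

In a full model of this kind, a step like this would presumably run once per decoder layer and per output position, with position embeddings added to both the encoder features and the decoder inputs beforehand; those parts are omitted here.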

Published in

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 15, Issue 1s
Special Section on Deep Learning for Intelligent Multimedia Analytics and Special Section on Multi-Modal Understanding of Social, Affective and Subjective Attributes of Data
January 2019
265 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3309769

Copyright © 2019 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 24 January 2019
• Accepted: 1 June 2018
• Revised: 1 April 2018
• Received: 1 October 2017

      Qualifiers

      • research-article
      • Research
      • Refereed
