research-article

Learning Pixel Affinity Pyramid for Arbitrary-Shaped Text Detection

Published: 03 February 2023

Abstract

Arbitrary-shaped text detection in natural images is challenging because backgrounds are complex and text properties are diverse. The difficulty lies in two aspects: accurately separating adjacent texts and representing text features sufficiently. To handle these problems, we treat text detection as instance segmentation and propose a novel text detection framework that jointly learns semantic segmentation and a pixel affinity pyramid in a unified fully convolutional network. Specifically, the pixel affinity pyramid encodes multi-scale instance affiliation relationships between pixels; it is robust to varying text shapes and also provides an accurate boundary description for separating closely located texts. In the inference phase, a simple but effective post-processing step reconstructs text instances from the semantic segmentation results under the guidance of the learned pixel affinity pyramid, achieving good accuracy and efficiency. Furthermore, to enhance the representation of text features in the network, we propose two modules: the Region Enhancement Module (REM) and the Attentional Fusion Module (AFM). The REM models the semantic correlations of regional features to enhance the features from the text area, which effectively suppresses false-positive detections. The AFM adaptively fuses multi-scale textual information through an attention mechanism to obtain rich text semantic features, which benefits the detection of text at multiple sizes. Extensive ablation experiments demonstrate the effectiveness of the REM and AFM. Evaluation results on standard benchmarks, including Total-Text, ICDAR2015, SCUT-CTW1500, and MSRA-TD500, show that our method surpasses most existing text detectors and achieves state-of-the-art performance, demonstrating its strong capability in detecting arbitrary-shaped texts.
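
The abstract does not spell out the post-processing step, so the following is only a minimal single-scale sketch of the general idea of affinity-guided grouping, not the authors' algorithm. It assumes a binary text mask from semantic segmentation and two hypothetical affinity maps (`affinity_right`, `affinity_down`) giving the probability that a pixel belongs to the same instance as its right or lower neighbour. The function name, the 0.5 threshold, and the use of union-find are all illustrative assumptions.

```python
import numpy as np


def group_pixels(text_mask, affinity_right, affinity_down, threshold=0.5):
    """Label text pixels, linking neighbours only where affinity is high.

    Hypothetical single-scale illustration; the paper's pyramid is multi-scale.
    """
    h, w = text_mask.shape
    parent = list(range(h * w))  # union-find forest over flattened pixel indices

    def find(i):
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for y in range(h):
        for x in range(w):
            if not text_mask[y, x]:
                continue
            idx = y * w + x
            # Merge with the right neighbour if both are text and affinity is high.
            if x + 1 < w and text_mask[y, x + 1] and affinity_right[y, x] > threshold:
                union(idx, idx + 1)
            # Merge with the lower neighbour under the same condition.
            if y + 1 < h and text_mask[y + 1, x] and affinity_down[y, x] > threshold:
                union(idx, idx + w)

    # Assign consecutive instance ids (1, 2, ...) to roots; background stays 0.
    labels = np.zeros((h, w), dtype=np.int32)
    root_to_label = {}
    for y in range(h):
        for x in range(w):
            if text_mask[y, x]:
                root = find(y * w + x)
                labels[y, x] = root_to_label.setdefault(root, len(root_to_label) + 1)
    return labels


# Two touching words in one row: the low affinity between columns 1 and 2
# splits what plain connected-component labelling would merge.
mask = np.array([[True, True, True, True]])
aff_right = np.array([[0.9, 0.1, 0.9, 0.0]])
aff_down = np.zeros((1, 4))
print(group_pixels(mask, aff_right, aff_down))  # [[1 1 2 2]]
```

The toy input shows why affinity helps with adjacent texts: the segmentation mask alone is one connected region, while the low affinity at the boundary separates it into two instances.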

      • Published in

        ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 1s
        February 2023
        504 pages
        ISSN: 1551-6857
        EISSN: 1551-6865
        DOI: 10.1145/3572859
        • Editor: Abdulmotaleb El Saddik


        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 3 February 2023
        • Online AM: 4 July 2022
        • Accepted: 8 March 2022
        • Revised: 21 February 2022
        • Received: 1 September 2020

        Qualifiers

        • research-article
        • Refereed