Abstract
Arbitrary-shaped text detection in natural images is challenging because backgrounds are complex and text properties are diverse. The difficulty lies in two aspects: accurately separating adjacent texts and representing text features sufficiently. To address these problems, we treat text detection as instance segmentation and propose a novel text detection framework that jointly learns semantic segmentation and a pixel affinity pyramid in a unified fully convolutional network. Specifically, the pixel affinity pyramid encodes multi-scale instance affiliation relationships between pixels; it is robust to varying text shapes and provides an accurate boundary description for separating closely located texts. In the inference phase, a simple but effective post-processing step reconstructs text instances from the semantic segmentation results under the guidance of the learned pixel affinity pyramid, achieving good accuracy and efficiency. Furthermore, to enhance the representation of text features in the network, we propose two modules: the Region Enhancement Module (REM) and the Attentional Fusion Module (AFM). The REM models the semantic correlations of regional features to enhance features from text areas, which effectively suppresses false-positive detections. The AFM adaptively fuses multi-scale textual information through an attention mechanism to obtain rich text semantic features, which benefits the detection of text of varying sizes. Extensive ablation experiments demonstrate the effectiveness of the REM and AFM. Evaluation results on standard benchmarks, including Total-Text, ICDAR2015, SCUT-CTW1500, and MSRA-TD500, show that our method surpasses most existing text detectors and achieves state-of-the-art performance, demonstrating its superior capability in detecting arbitrary-shaped texts.
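To make the idea of affinity-guided instance reconstruction concrete, the following is a minimal sketch and not the authors' post-processing. It assumes a single scale, two neighbor offsets (right and down), and a fixed threshold of 0.5, none of which are specified in the abstract. The function name `group_text_pixels` and its parameters are hypothetical. Text pixels from the segmentation mask are merged with union-find whenever the predicted affinity to a neighboring text pixel exceeds the threshold, so two touching text regions stay separate if the affinity across their shared boundary is low.

```python
import numpy as np


def group_text_pixels(text_mask, aff_right, aff_down, thresh=0.5):
    """Group text pixels into instances using pairwise affinities.

    Hypothetical single-scale illustration (assumed inputs):
      text_mask : (H, W) bool array, True where segmentation predicts text.
      aff_right : (H, W) float array, affinity between (y, x) and (y, x + 1).
      aff_down  : (H, W) float array, affinity between (y, x) and (y + 1, x).
    Returns an (H, W) int array: 0 is background, 1..K are instance ids.
    """
    h, w = text_mask.shape
    parent = list(range(h * w))  # union-find forest over flattened pixels

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Merge neighboring text pixels whose affinity exceeds the threshold.
    for y in range(h):
        for x in range(w):
            if not text_mask[y, x]:
                continue
            if x + 1 < w and text_mask[y, x + 1] and aff_right[y, x] > thresh:
                union(y * w + x, y * w + x + 1)
            if y + 1 < h and text_mask[y + 1, x] and aff_down[y, x] > thresh:
                union(y * w + x, (y + 1) * w + x)

    # Assign consecutive instance ids in raster-scan order.
    labels = np.zeros((h, w), dtype=np.int32)
    ids = {}
    for y in range(h):
        for x in range(w):
            if text_mask[y, x]:
                root = find(y * w + x)
                labels[y, x] = ids.setdefault(root, len(ids) + 1)
    return labels
```

For example, on a 2×4 mask that is entirely text, setting the rightward affinity of column 1 to a low value splits the mask into a left and a right instance even though the pixels are spatially connected. A pyramid version would presumably repeat this kind of merging across several resolutions, but the abstract does not say how the scales are combined.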
Learning Pixel Affinity Pyramid for Arbitrary-Shaped Text Detection