Abstract
The performance of instance search depends heavily on the ability to locate and describe a wide variety of object instances in a video or image collection. Because a proper mechanism for locating instances and deriving their feature representations has been lacking, instance search is generally effective only when the instances come from known object categories. This article presents a simple but effective instance-level feature representation approach. Unlike existing approaches, it addresses both class-agnostic instance localization and distinctive feature representation. Localization is achieved by detecting salient instance regions in an image through a layer-wise back-propagation process. The back-propagation starts from the last convolutional layer of a pre-trained CNN originally trained for classification and proceeds layer by layer until it reaches the input layer. This allows salient instance regions in the input image to be activated, whether they belong to known or unknown categories. Each activated salient region covers the whole or, more usually, a major part of an instance. The distinctive feature representation is produced by average-pooling the feature map of a given layer within the detected instance region. Experiments show that this feature representation performs considerably better than most existing approaches.
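The final step described in the abstract, average-pooling a layer's feature map within a detected instance region, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `region_average_pool` and `l2_normalize`, the (C, H, W) array layout, the assumption that the region mask has already been resized to the feature map's spatial resolution, and the L2 normalization are all hypothetical choices made for illustration.

```python
import numpy as np


def region_average_pool(feature_map, region_mask):
    """Average a (C, H, W) feature map over the cells where region_mask is True.

    region_mask is an (H, W) boolean array, assumed to be already resized
    to the spatial resolution of the feature map (an assumption of this sketch).
    """
    feature_map = np.asarray(feature_map, dtype=np.float64)
    region_mask = np.asarray(region_mask, dtype=bool)
    if feature_map.shape[1:] != region_mask.shape:
        raise ValueError("mask must match the spatial size of the feature map")
    if not region_mask.any():
        raise ValueError("region mask is empty")
    # Boolean indexing over the two spatial axes yields shape (C, n_cells).
    return feature_map[:, region_mask].mean(axis=1)


def l2_normalize(vec, eps=1e-12):
    """Scale a descriptor to unit length (assumed here, not stated in the abstract)."""
    return vec / max(np.linalg.norm(vec), eps)


# Toy example: 2 channels on a 3x3 grid, region = top-left 2x2 block.
features = np.arange(2 * 3 * 3, dtype=np.float64).reshape(2, 3, 3)
mask = np.zeros((3, 3), dtype=bool)
mask[0:2, 0:2] = True

descriptor = region_average_pool(features, mask)
unit_descriptor = l2_normalize(descriptor)
```

Each detected region yields one C-dimensional descriptor, so an image with several salient regions would contribute several descriptors to the search index. How the region mask is obtained from the back-propagated activations is outside the scope of this sketch.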
Deeply Activated Salient Region for Instance Search