Abstract
Person re-identification is the task of recognizing an individual across heterogeneous, non-overlapping camera views. It has become a crucial capability for many applications in public-space video surveillance. However, it remains challenging because of the subtle inter-class similarity and large intra-class variation found in person images. Current CNN-based approaches have focused on traditional identification or verification frameworks. Such approaches typically use the whole input image, including the background, and fail to attend to specific body parts, which steers feature representation learning away from the informative parts. In this article, we introduce a self-attention mechanism coupled with cross-resolution processing to improve feature representation learning for the person re-identification task. The proposed self-attention module reinforces the most informative parts of a high-resolution image using its internal representation at low resolution. In particular, the model is fed a pair of images at different scales and consists of two branches. The upper branch processes the high-resolution image and learns a high-dimensional feature representation, while the lower branch processes the low-resolution image and learns a filtering attention heatmap. The feature maps on the lower branch are weighted with a softmax operation to reflect the importance of each patch of the input image, whereas on the upper branch a max pooling operation downsamples the high-resolution feature map before it is multiplied element-wise with the attention heatmap. Our attention module helps the network learn the most discriminative visual features from multiple regions of the image and is specifically optimized to attend to and reinforce feature representations at different scales. Extensive experiments on three large-scale datasets show that network architectures augmented with our self-attention module consistently improve in accuracy and outperform various state-of-the-art models by a large margin.
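The two-branch weighting described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes the lower branch has already been reduced to a single-channel score map, that the upper-branch feature map is exactly an integer factor larger in both spatial dimensions, and that the softmax is taken over all spatial positions. The function names (`spatial_softmax`, `max_pool2d`, `attend`) and the tensor sizes are hypothetical, chosen only for illustration.

```python
import numpy as np


def spatial_softmax(scores):
    """Turn an (H, W) score map into attention weights that sum to 1."""
    flat = scores.reshape(-1)
    flat = flat - flat.max()          # subtract the max for numerical stability
    weights = np.exp(flat)
    weights = weights / weights.sum()
    return weights.reshape(scores.shape)


def max_pool2d(features, k):
    """Non-overlapping k x k max pooling over a (C, H, W) feature map."""
    c, h, w = features.shape
    assert h % k == 0 and w % k == 0, "spatial size must be divisible by k"
    return features.reshape(c, h // k, k, w // k, k).max(axis=(2, 4))


def attend(high_res_features, low_res_scores):
    """Downsample the upper branch, then weight it with the lower-branch heatmap."""
    k = high_res_features.shape[1] // low_res_scores.shape[0]
    # Assumption: the same integer scale factor applies to height and width.
    assert high_res_features.shape[2] == k * low_res_scores.shape[1]
    pooled = max_pool2d(high_res_features, k)      # (C, H/k, W/k)
    heatmap = spatial_softmax(low_res_scores)      # (H/k, W/k)
    return pooled * heatmap[None, :, :], heatmap   # broadcast over channels


rng = np.random.default_rng(0)
high = rng.standard_normal((8, 16, 8))   # upper branch: (C, H, W), hypothetical sizes
low = rng.standard_normal((8, 4))        # lower branch: (H/2, W/2) score map
attended, heatmap = attend(high, low)
print(attended.shape, heatmap.sum())
```

Because the heatmap is normalized over all positions, each channel of the pooled feature map is rescaled by the same spatial weights, so regions the lower branch scores highly dominate the resulting representation.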
Index Terms
Enforcing Affinity Feature Learning through Self-attention for Person Re-identification