Abstract
Crowd counting is a widely studied problem with many applications. Currently, its biggest challenge is the large variation in object scale. In this article, we address this challenge by proposing a novel Attentive Encoder-Decoder Network (AEDN), which is supervised at multiple feature scales and performs crowd counting via density estimation. This work makes three main contributions. First, we augment the traditional encoder-decoder architecture with our proposed residual attention blocks, which extend the decoded features with attentive features in addition to the skip-connected encoded features. As a result, AEDN is better at establishing long-range dependencies between the encoder and decoder, which promotes more effective fusion of multi-scale features for handling scale variations. Second, we design a new KL-divergence-based distribution loss to supervise the scale-aware structural differences between two density maps; it complements the pixel-isolated MSE loss and better optimizes AEDN to generate high-quality density maps. Third, we adopt a multi-scale supervision scheme in which KL-divergence and MSE losses are deployed at all decoding stages, providing more thorough supervision for different feature scales. Extensive experimental results on four public datasets, ShanghaiTech Part A, ShanghaiTech Part B, UCF-CC-50, and UCF-QNRF, demonstrate the efficacy of the proposed method, which outperforms most state-of-the-art competitors.
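The combination of a pixel-wise MSE loss with a KL-divergence-based distribution loss, applied at several decoding stages, can be illustrated with a small NumPy sketch. It is not the authors' implementation: the abstract does not give the exact formulation, so the normalisation of each density map to a probability distribution, the direction of the divergence (ground truth relative to prediction), the smoothing constant `EPS`, the weight `kl_weight`, and the use of sum pooling to build a ground-truth map per scale are all assumptions made for illustration, and every function name below is hypothetical.

```python
import numpy as np

EPS = 1e-8  # assumed smoothing constant to avoid log(0) and division by zero


def mse_loss(pred, gt):
    """Pixel-wise mean squared error between two density maps."""
    return float(np.mean((pred - gt) ** 2))


def kl_distribution_loss(pred, gt):
    """KL(gt || pred) after normalising each map to sum to 1 (assumed form)."""
    p = gt.clip(min=0) + EPS
    q = pred.clip(min=0) + EPS
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log(p / q)))


def sum_pool(density, factor):
    """Downsample by summing factor x factor blocks, preserving the total count."""
    h, w = density.shape
    assert h % factor == 0 and w % factor == 0, "shape must be divisible by factor"
    blocks = density.reshape(h // factor, factor, w // factor, factor)
    return blocks.sum(axis=(1, 3))


def multi_scale_loss(preds, gt, kl_weight=1.0):
    """Sum MSE + kl_weight * KL over the predictions of all decoding stages.

    `preds` holds one predicted density map per stage (coarse to fine);
    the ground truth is sum-pooled to match each stage's resolution.
    """
    total = 0.0
    for pred in preds:
        factor = gt.shape[0] // pred.shape[0]
        gt_s = sum_pool(gt, factor) if factor > 1 else gt
        total += mse_loss(pred, gt_s) + kl_weight * kl_distribution_loss(pred, gt_s)
    return total


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = rng.random((8, 8))                      # toy ground-truth density map
    preds = [rng.random((2, 2)), rng.random((4, 4)), rng.random((8, 8))]
    print(multi_scale_loss(preds, gt, kl_weight=0.5))
```

In this sketch the MSE term compares maps pixel by pixel, while the KL term compares the normalised spatial distributions as a whole, which is one way to read the "pixel-isolated" versus "structural" distinction drawn in the abstract.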
Index Terms
Multi-scale Supervised Attentive Encoder-Decoder Network for Crowd Counting