Saliency Prediction in the Deep Learning Era: Successes and Limitations

Published: 01 February 2021

Abstract

Visual saliency models have enjoyed a big leap in performance in recent years, thanks to advances in deep learning and large-scale annotated data. Despite enormous effort and major breakthroughs, however, models still fall short of human-level accuracy. In this work, I explore the landscape of the field, with an emphasis on new deep saliency models, benchmarks, and datasets. A large number of image and video saliency models are reviewed and compared over two image benchmarks and two large-scale video datasets. Further, I identify factors that contribute to the gap between models and humans and discuss the remaining issues that need to be addressed to build the next generation of more powerful saliency models. Specific questions addressed include: in what ways current models fail, how to remedy them, what can be learned from cognitive studies of attention, how explicit saliency judgments relate to fixations, how to conduct fair model comparison, and what the emerging applications of saliency models are.
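As a brief illustration of the kind of model comparison the abstract refers to, the sketch below computes two metrics that are standard in saliency benchmarking: Normalized Scanpath Saliency (NSS) and the linear correlation coefficient (CC). This is a minimal sketch, not code from the paper; the array shapes and the random inputs are hypothetical placeholders standing in for a model's predicted saliency map and human fixation data.

```python
# Minimal sketch of two common saliency evaluation metrics (NSS and CC),
# assuming NumPy arrays for the predicted map and the ground-truth fixation data.
import numpy as np

def nss(saliency_map, fixation_map):
    """Normalized Scanpath Saliency: mean z-scored saliency at fixated locations."""
    s = (saliency_map - saliency_map.mean()) / (saliency_map.std() + 1e-8)
    return s[fixation_map.astype(bool)].mean()

def cc(saliency_map, fixation_density):
    """Pearson correlation between the predicted map and a fixation density map."""
    s = (saliency_map - saliency_map.mean()) / (saliency_map.std() + 1e-8)
    d = (fixation_density - fixation_density.mean()) / (fixation_density.std() + 1e-8)
    return (s * d).mean()

# Toy usage with placeholder data (hypothetical, for illustration only).
pred = np.random.rand(480, 640)                 # stand-in for a model's saliency map
fixations = np.zeros((480, 640))
fixations[240, 320] = 1                         # a single fixated pixel
density = np.random.rand(480, 640)              # stand-in for a blurred fixation map
print(nss(pred, fixations), cc(pred, density))
```

Benchmarks typically report such distribution- and location-based scores side by side with AUC variants, since each metric rewards different properties of a saliency map; the small epsilon simply guards against division by zero for constant maps.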



• Published in

  IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 43, Issue 2, February 2021, 376 pages
  ISSN: 0162-8828
  © 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

• Publisher

  IEEE Computer Society, United States

• Publication History

  Published: 1 February 2021

• Qualifiers

  research-article