Research Article

Exploring Uncertainty Measures for Image-caption Embedding-and-retrieval Task

Published: 21 April 2021

Abstract

With the significant development of black-box machine learning algorithms, particularly deep neural networks, the practical demand for reliability assessment is rapidly increasing. Based on the idea that "Bayesian deep learning knows what it does not know," the uncertainty of deep neural network outputs has been investigated as a reliability measure for classification and regression tasks. By treating an embedding task as a regression task, several existing studies have quantified the uncertainty of embedded features and improved the retrieval performance of cutting-edge models by model averaging. However, in image-caption embedding-and-retrieval tasks, well-known samples are not always easy to retrieve. This study shows that the existing measure performs poorly at reliability assessment and investigates another aspect of image-caption embedding-and-retrieval tasks. We propose posterior uncertainty, which treats the retrieval task as a classification task and can accurately assess the reliability of retrieval results. The two uncertainty measures behave consistently across different datasets (MS-COCO and Flickr30k), different deep-learning architectures (dropout and batch normalization), and different similarity functions. To the best of our knowledge, this is the first study to perform a reliability assessment on image-caption embedding-and-retrieval tasks.
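To make the two notions of uncertainty in the abstract concrete, the sketch below contrasts a regression-style measure (variance of embeddings sampled via MC dropout or stochastic batch normalization) with a classification-style posterior measure (entropy of the averaged softmax over query-gallery similarities). This is an illustrative reconstruction under stated assumptions, not the paper's exact formulation: `embed_fn` is a hypothetical stochastic embedding function standing in for a network run with dropout active, and cosine similarity is one possible choice of similarity function.

```python
import numpy as np

def mc_dropout_embeddings(embed_fn, x, n_samples=8):
    """Collect stochastic embeddings of one input by running `embed_fn`
    n_samples times (e.g., a network with dropout kept active at test time).
    `embed_fn` is a hypothetical placeholder. Returns an array (T, D)."""
    return np.stack([embed_fn(x) for _ in range(n_samples)])

def embedding_uncertainty(samples):
    """Regression-style measure: mean per-dimension variance of the
    sampled embedded features (higher = less stable embedding)."""
    return float(samples.var(axis=0).mean())

def posterior_uncertainty(samples, gallery):
    """Classification-style measure: treat retrieval as classifying the
    query into one of the gallery items. Average the softmax over cosine
    similarities across the MC samples, then take its entropy."""
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    probs = []
    for z in samples:
        z = z / np.linalg.norm(z)
        sims = g @ z                      # cosine similarity to each gallery item
        e = np.exp(sims - sims.max())     # numerically stable softmax
        probs.append(e / e.sum())
    p = np.mean(probs, axis=0)            # model-averaged posterior
    return float(-(p * np.log(p + 1e-12)).sum())
```

A low posterior entropy means the stochastic forward passes consistently rank the same gallery item first, which is the sense in which this measure assesses the reliability of a retrieval result rather than the stability of the raw embedding.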

