Learning Transferable Perturbations for Image Captioning


Abstract

Recent studies have shown that state-of-the-art deep learning models can be fooled by small but well-designed perturbations. Existing attack algorithms for the image captioning task are time-consuming, and the adversarial examples they generate transfer poorly to other models. To generate adversarial examples faster and with stronger transferability, we propose to learn the perturbations with a generative model governed by three novel loss functions. In the image domain, an image feature distortion loss maximizes the distance between the encoded features of an original image and those of its adversarial counterpart. Across the image and caption domains, a local-global mismatching loss separates, as far as possible, the embeddings of the adversarial image and the ground-truth captions in the common semantic space, from both a local and a global perspective. In the language domain, a language diversity loss makes the captions generated for the adversarial examples as different as possible from the correct captions. Extensive experiments show that our generative model efficiently produces adversarial examples that generalize to attack image captioning models trained on unseen large-scale datasets or with different architectures, and even a commercial image captioning service.
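To make the three objectives concrete, below is a minimal PyTorch-style sketch of the combined attack loss, written by us as an illustration rather than taken from the paper: the module names (generator, image_encoder, caption_encoder, captioner), the tanh-based perturbation bound, and the reduction of the mismatching loss to its global cosine-similarity term are all assumptions.

# Hypothetical sketch of the three-loss objective described above; all
# module names and the eps budget are illustrative assumptions.
import torch
import torch.nn.functional as F

def attack_losses(x, gt_tokens, generator, image_encoder,
                  caption_encoder, captioner, eps=8 / 255):
    # The generator outputs a perturbation, bounded so ||delta||_inf <= eps.
    delta = eps * torch.tanh(generator(x))
    x_adv = torch.clamp(x + delta, 0.0, 1.0)

    # (1) Image feature distortion loss: maximize the distance between the
    # encoded features of the clean and adversarial images (negated MSE).
    l_feat = -F.mse_loss(image_encoder(x_adv), image_encoder(x))

    # (2) Local-global mismatching loss (global term only in this sketch):
    # push the adversarial image embedding away from the ground-truth
    # caption embedding in the common semantic space (cosine similarity).
    img_emb = F.normalize(image_encoder(x_adv).mean(dim=(-2, -1)), dim=-1)
    cap_emb = F.normalize(caption_encoder(gt_tokens), dim=-1)
    l_match = (img_emb * cap_emb).sum(dim=-1).mean()

    # (3) Language diversity loss: lower the captioner's likelihood of the
    # ground-truth caption, so its output drifts away from the correct one.
    logits = captioner(x_adv, gt_tokens[:, :-1])  # (batch, steps, vocab)
    l_lang = -F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                              gt_tokens[:, 1:].reshape(-1))

    # Minimizing the sum drives all three objectives simultaneously.
    return l_feat + l_match + l_lang

Because these losses train the generator's parameters rather than optimizing each image individually, a single forward pass can produce a perturbation at test time, which is the source of the speed and transferability gains claimed in the abstract.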


    • Published in

      ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 18, Issue 2
      May 2022
      494 pages
      ISSN: 1551-6857
      EISSN: 1551-6865
      DOI: 10.1145/3505207


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 16 February 2022
      • Accepted: 1 July 2021
      • Revised: 1 June 2021
      • Received: 1 January 2021
      Published in TOMM Volume 18, Issue 2


      Qualifiers

      • research-article
      • Refereed
