Abstract
Recent studies have shown that state-of-the-art deep learning models can be fooled by small but well-designed perturbations. Existing attack algorithms for the image captioning task are time-consuming, and the adversarial examples they generate transfer poorly to other models. To generate adversarial examples faster and with stronger transferability, we propose to learn the perturbations with a generative model governed by three novel loss functions. In the image domain, an image feature distortion loss maximizes the distance between the encoded features of the original image and those of its adversarial counterpart. Across the image and caption domains, a local-global mismatching loss pushes the representations of the adversarial image and the ground-truth caption as far apart as possible in the common semantic space, from both a local and a global perspective. In the language domain, a language diversity loss makes the captions generated from the adversarial examples as different as possible from the correct captions. Extensive experiments show that the proposed generative model efficiently produces adversarial examples that generalize to attack image captioning models trained on unseen large-scale datasets or with different architectures, and even a commercial image captioning service.
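The paper gives the exact formulations of these objectives; purely as an illustration, the two distance-based ideas (maximize feature distortion in the image domain, push the image and caption embeddings apart in the shared semantic space) can be sketched as follows. This is a minimal NumPy sketch under assumed interfaces — all function names, feature shapes, and embeddings here are hypothetical, not the authors' implementation.

```python
import numpy as np

def feature_distortion_loss(feat_clean, feat_adv):
    """Image-domain term: negative L2 distance between the encoded
    features of the clean image and its adversarial example.
    Minimizing this loss therefore maximizes the feature distortion."""
    return -float(np.linalg.norm(feat_clean - feat_adv))

def mismatching_loss(img_emb, cap_emb):
    """Cross-domain term: cosine similarity between the adversarial
    image embedding and the ground-truth caption embedding in the
    common semantic space; minimizing it pushes the two apart."""
    return float(np.dot(img_emb, cap_emb) /
                 (np.linalg.norm(img_emb) * np.linalg.norm(cap_emb)))

# Toy example: a perturbation that rotates the image feature away from
# the caption embedding lowers both loss terms.
feat_clean = np.array([1.0, 0.0])
feat_adv = np.array([0.0, 1.0])
cap_emb = np.array([1.0, 0.0])

print(feature_distortion_loss(feat_clean, feat_adv))  # negative: features pushed apart
print(mismatching_loss(feat_adv, cap_emb))            # 0.0: orthogonal to the caption
```

In the full method, a generator would produce the perturbation and these terms (plus the language diversity loss) would be minimized jointly over a training set, rather than evaluated on fixed vectors as above.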
Learning Transferable Perturbations for Image Captioning