Abstract
Image caption editing, which aims to correct inaccurate descriptions of images, is an interdisciplinary task spanning computer vision and natural language processing. Because the task requires encoding an image and its inaccurate caption simultaneously, then decoding to generate an accurate caption, the encoder-decoder framework is widely adopted for image caption editing. However, existing methods mostly focus on the decoder and overlook a major challenge on the encoder side: the semantic inconsistency between the image and the caption. To this end, we propose a novel Adaptive Text Denoising Network (ATD-Net) that filters out noise at the word level and improves the model's robustness at the sentence level. Specifically, at the word level, we design a cross-attention mechanism, called the Textual Attention Mechanism (TAM), to differentiate misdescriptive words; TAM encodes the inaccurate caption word by word, conditioned on the content of both the image and the caption. At the sentence level, to minimize the influence of misdescriptive words on the semantics of the entire caption, we introduce a Bidirectional Encoder that extracts a correct semantic representation from the raw caption; by modeling the global semantics of the raw caption, it enhances the robustness of the framework. We extensively evaluate our approach on the MS-COCO image captioning dataset and demonstrate its effectiveness compared with state-of-the-art methods.
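To make the two encoder-side components concrete, the following is a minimal PyTorch sketch of a word-level cross-attention module and a sentence-level bidirectional encoder of the kind the abstract describes. It is an illustration under stated assumptions, not the authors' released implementation: the module names (`TextualAttention`, `BiSentenceEncoder`), the feature dimensions, and the mean-pooling of the BiLSTM states are all assumptions made here for clarity.

```python
# Illustrative sketch only; names, dimensions, and pooling choices are
# assumptions, not taken from the paper's implementation.
import torch
import torch.nn as nn


class TextualAttention(nn.Module):
    """Word-level cross-attention: each caption word queries the image's
    region features, so a misdescriptive word receives a visual context
    that disagrees with its own embedding."""

    def __init__(self, d_word=512, d_img=2048, d_att=512):
        super().__init__()
        self.q = nn.Linear(d_word, d_att)  # query from each caption word
        self.k = nn.Linear(d_img, d_att)   # key from each image region
        self.v = nn.Linear(d_img, d_att)   # value from each image region

    def forward(self, words, regions):
        # words:   (B, T, d_word) embeddings of the raw caption
        # regions: (B, R, d_img)  detector features of the image
        q, k, v = self.q(words), self.k(regions), self.v(regions)
        scores = q @ k.transpose(1, 2) / q.size(-1) ** 0.5  # (B, T, R)
        att = torch.softmax(scores, dim=-1)
        return att @ v  # (B, T, d_att) per-word visual context


class BiSentenceEncoder(nn.Module):
    """Sentence-level bidirectional encoder: a BiLSTM over the raw
    caption yields a global representation that is less sensitive to
    any single misdescriptive word."""

    def __init__(self, d_word=512, d_hid=512):
        super().__init__()
        self.rnn = nn.LSTM(d_word, d_hid, batch_first=True,
                           bidirectional=True)

    def forward(self, words):
        states, _ = self.rnn(words)  # (B, T, 2 * d_hid)
        return states.mean(dim=1)    # pooled global semantics


# Toy usage with random tensors standing in for real features.
words = torch.randn(2, 12, 512)     # batch of 2 captions, 12 words each
regions = torch.randn(2, 36, 2048)  # 36 detected regions per image
ctx = TextualAttention()(words, regions)
sent = BiSentenceEncoder()(words)
print(ctx.shape, sent.shape)  # (2, 12, 512) and (2, 1024)
```

In such a design, the per-word visual context from the cross-attention can be contrasted with each word embedding to down-weight misdescriptive words, while the pooled bidirectional states supply the decoder with a sentence-level signal that a few noisy words cannot easily corrupt.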