
Adaptive Text Denoising Network for Image Caption Editing

Published: 3 February 2023

Abstract

Image caption editing, which aims to correct inaccurate descriptions of images, is an interdisciplinary task spanning computer vision and natural language processing. Because the task requires encoding both the image and its inaccurate caption and then decoding an accurate caption, the encoder-decoder framework is widely adopted for image caption editing. However, existing methods mostly focus on the decoder and overlook a major challenge on the encoder side: the semantic inconsistency between the image and its caption. To this end, we propose a novel Adaptive Text Denoising Network (ATD-Net) that filters out noise at the word level and improves the model’s robustness at the sentence level. At the word level, we design a cross-attention mechanism, the Textual Attention Mechanism (TAM), to identify misdescriptive words; TAM encodes the inaccurate caption word by word, conditioned on the content of both the image and the caption. At the sentence level, to minimize the influence of misdescriptive words on the semantics of the entire caption, we introduce a Bidirectional Encoder that extracts a correct semantic representation from the raw caption; by modeling the caption’s global semantics, it enhances the robustness of the framework. We extensively evaluate our approach on the MS-COCO image captioning dataset and demonstrate its effectiveness compared with state-of-the-art methods.
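To make the two components described above concrete, here is a minimal, self-contained sketch in PyTorch. Everything in it — the module names, tensor shapes, the residual connection, and the mean-pooled sentence representation — is an illustrative assumption based only on the abstract's description, not the paper's actual ATD-Net implementation.

```python
# Illustrative sketch (assumptions only): word-level cross-attention from
# caption words to image regions, plus a bidirectional encoder that models
# the raw caption's global semantics. Not the authors' actual architecture.
import torch
import torch.nn as nn


class TextualAttention(nn.Module):
    """Cross-attention: each caption word attends to image region features,
    yielding a word representation conditioned on both modalities."""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads,
                                                batch_first=True)

    def forward(self, word_feats, image_feats):
        # word_feats: (B, T, d) embeddings of the (possibly inaccurate) caption
        # image_feats: (B, R, d) region features from a visual backbone
        attended, _ = self.cross_attn(query=word_feats,
                                      key=image_feats,
                                      value=image_feats)
        # Residual connection (an assumption): keep the original textual
        # signal alongside the visually grounded one, so words that disagree
        # with the image content can be contrasted downstream.
        return word_feats + attended


class BidirectionalCaptionEncoder(nn.Module):
    """Bidirectional encoder summarizing the raw caption globally, limiting
    the influence of any single misdescriptive word."""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.bigru = nn.GRU(d_model, d_model // 2,
                            bidirectional=True, batch_first=True)

    def forward(self, token_ids):
        # token_ids: (B, T) indices of the raw caption
        x = self.embed(token_ids)
        outputs, _ = self.bigru(x)           # (B, T, d) contextual states
        sentence_repr = outputs.mean(dim=1)  # (B, d) global semantics
        return outputs, sentence_repr


# Usage sketch with random tensors (hypothetical sizes):
B, T, R, d, V = 2, 12, 36, 512, 10000
tam = TextualAttention(d)
enc = BidirectionalCaptionEncoder(V, d)
grounded_words = tam(torch.randn(B, T, d), torch.randn(B, R, d))  # (B, T, d)
ctx, sent = enc(torch.randint(0, V, (B, T)))                       # (B, T, d), (B, d)
```

In such a design, the word-level outputs would feed the decoder's per-word editing decisions, while the pooled sentence representation supplies a noise-robust global context; how the two are combined in ATD-Net is not specified by the abstract.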



Published in

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 1s (February 2023), 504 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3572859
Editor: Abdulmotaleb El Saddik


Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 3 February 2023
• Online AM: 14 July 2022
• Accepted: 17 April 2022
• Revised: 1 April 2022
• Received: 5 January 2022


      Qualifiers

      • research-article
      • Refereed
