
Contrastive Adversarial Training for Multi-Modal Machine Translation

Published: 16 June 2023

Abstract

The multi-modal machine translation task aims to improve translation quality with the help of additional visual input, which is expected to disambiguate or complement the semantics of sentences containing ambiguous words or incomplete expressions. Existing methods have tried many ways to fuse visual information into text representations. However, only a minority of sentences need extra visual information as a complement, and without guidance, models tend to learn text-only translation from the majority of well-aligned translation pairs. In this article, we propose a contrastive adversarial training approach that enhances visual participation in semantic representation learning. By contrasting the multi-modal input with adversarial samples, the model learns to identify the most informative sample: the one coupled with a congruent image and several visual objects extracted from it. This prevents the visual information from being ignored and further fuses cross-modal information. We evaluate our method on three multi-modal language pairs. Experimental results show that our model improves translation accuracy, and further analysis shows that it is more sensitive to visual information.
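The full text is not reproduced on this page, but the abstract already outlines the core mechanism: a contrastive objective in which the sentence fused with its congruent image (and extracted visual objects) must be ranked above adversarially mismatched pairings. The sketch below illustrates that idea under stated assumptions only; the encoders, tensor shapes, and every name in it (contrastive_adversarial_loss, temperature, the index-0 convention for the congruent pair) are hypothetical and an InfoNCE-style loss is assumed, so this is not the authors' implementation.

```python
# A minimal sketch of a contrastive adversarial objective, NOT the
# article's code: the loss form, shapes, and names are assumptions.
import torch
import torch.nn.functional as F

def contrastive_adversarial_loss(text_repr, fused_reprs, temperature=0.1):
    """Rank the congruent multi-modal pairing above adversarial ones.

    text_repr:   (B, D)      sentence representations.
    fused_reprs: (B, K+1, D) fused sentence-image representations; index 0
                 pairs each sentence with its congruent image and extracted
                 objects, indices 1..K with adversarially mismatched images.
    """
    text = F.normalize(text_repr, dim=-1).unsqueeze(1)   # (B, 1, D)
    cands = F.normalize(fused_reprs, dim=-1)             # (B, K+1, D)
    # Cosine similarity between each sentence and its K+1 candidate fusions.
    logits = (text * cands).sum(dim=-1) / temperature    # (B, K+1)
    # The congruent pairing sits at index 0 by convention, so the model is
    # trained to identify it as the most informative sample.
    targets = torch.zeros(logits.size(0), dtype=torch.long,
                          device=logits.device)
    return F.cross_entropy(logits, targets)

# Toy usage with random features: 4 sentences, 3 adversarial images each.
if __name__ == "__main__":
    loss = contrastive_adversarial_loss(torch.randn(4, 256),
                                        torch.randn(4, 4, 256))
    print(loss.item())
```

In this reading, the adversarial samples act as hard negatives: the loss is low only when the fused representation genuinely draws on the visual input, which matches the abstract's claim that the approach prevents the visual information from being ignored.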


• Published in

  ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 6
  June 2023, 635 pages
  ISSN: 2375-4699
  EISSN: 2375-4702
  DOI: 10.1145/3604597


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 16 June 2023
      • Online AM: 14 March 2023
      • Accepted: 2 March 2023
      • Revised: 7 January 2023
      • Received: 20 September 2022
