Contrastive Adversarial Training for Multi-Modal Machine Translation

Abstract
Multi-modal machine translation aims to improve translation quality with the help of additional visual input, which is expected to disambiguate or complement the semantics when sentences contain ambiguous words or incomplete expressions. Existing methods have explored many ways to fuse visual information into text representations. However, only a minority of sentences need extra visual information as a complement, and without guidance, models tend to learn text-only translation from the majority of well-aligned translation pairs. In this article, we propose a contrastive adversarial training approach that enhances visual participation in semantic representation learning. By contrasting the multi-modal input with adversarial samples, the model learns to identify the most informative sample, i.e., the one coupled with a congruent image and several visual objects extracted from it. This approach prevents visual information from being ignored and further fuses cross-modal information. We evaluate our method on three multi-modal language pairs. Experimental results show that our model improves translation accuracy, and further analysis shows that it is more sensitive to visual information.
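The contrastive objective described above — pulling a representation toward the sample coupled with its congruent image while pushing it away from adversarial (mismatched) samples — can be sketched with a standard InfoNCE-style loss. This is a minimal illustration, not the paper's implementation; the function name, vector dimensions, and the way adversarial samples are generated here are all assumptions for the sake of a runnable example.

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss: the anchor representation should
    score higher with its congruent (positive) sample than with any of
    the adversarial (negative) samples."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    # Similarity logits, with the positive sample at index 0.
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / temperature
    # Softmax cross-entropy against the positive at index 0.
    logits -= logits.max()
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])

# Toy data: a text representation, a congruent image representation
# (well aligned with the text), and mismatched adversarial images.
rng = np.random.default_rng(0)
text = rng.normal(size=8)
congruent_image = text + 0.1 * rng.normal(size=8)
adversarial = [rng.normal(size=8) for _ in range(4)]
loss = info_nce(text, congruent_image, adversarial)
```

Minimizing this loss drives the model to distinguish the congruent image-text pair from the adversarial ones, so the visual modality cannot simply be ignored during training.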