When Pairs Meet Triplets: Improving Low-Resource Captioning via Multi-Objective Optimization

Abstract
Image captioning for low-resource languages has attracted much attention recently. Researchers have proposed augmenting the low-resource caption dataset into (image, rich-resource caption, low-resource caption) triplets and developing a dual attention mechanism that exploits these triplets during training to improve performance. However, datasets in triplet form are usually small because they are costly to collect. On the other hand, many large-scale datasets already exist that contain one pair from the triplet, such as caption datasets in the rich-resource language and translation datasets from the rich-resource language to the low-resource language. In this article, we revisit the caption-translation pipeline of the translation-based approach so that training can exploit not only the triplet dataset but also the large-scale paired datasets. The pipeline consists of two models: a captioning model for the rich-resource language and a translation model from the rich-resource language to the low-resource language. Unfortunately, it is not trivial to fully benefit from incorporating both the triplet dataset and the paired datasets into the pipeline, owing to the gap between the training and testing phases and the instability of the training process. We propose to jointly optimize the two models of the pipeline in an end-to-end manner to bridge the training-testing gap, and we introduce two auxiliary training objectives to stabilize training. Experimental results show that the proposed method improves significantly over state-of-the-art methods.
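The abstract describes the approach only at a high level. As a rough illustration (not the authors' implementation), the PyTorch-style sketch below shows one way a rich-resource captioner and a rich-to-low-resource translator could be coupled and trained jointly end to end, with two auxiliary supervised objectives on the large paired datasets. The module names, the Gumbel-softmax relaxation used for the differentiable hand-off between the two models, and the loss weights are all assumptions introduced for illustration.

```python
# Illustrative sketch of a caption-then-translate pipeline trained end to end.
# CaptionTranslationPipeline, training_step, and the Gumbel-softmax hand-off
# are hypothetical and not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CaptionTranslationPipeline(nn.Module):
    def __init__(self, captioner: nn.Module, translator: nn.Module, tau: float = 1.0):
        super().__init__()
        self.captioner = captioner    # image -> rich-resource caption logits
        self.translator = translator  # (soft) rich-resource tokens -> low-resource logits
        self.tau = tau                # Gumbel-softmax temperature

    def forward(self, images, low_res_targets):
        # Rich-resource caption logits: (batch, seq_len, vocab_rich).
        rich_logits = self.captioner(images)
        # Differentiable hand-off: soft one-hot tokens instead of argmax decoding,
        # so the translation loss can back-propagate into the captioner.
        # The translator is assumed to accept soft token distributions
        # (e.g., by multiplying them with its embedding matrix).
        soft_tokens = F.gumbel_softmax(rich_logits, tau=self.tau, hard=False)
        low_logits = self.translator(soft_tokens)
        # End-to-end objective on the (small) triplet dataset.
        return F.cross_entropy(low_logits.flatten(0, 1), low_res_targets.flatten())


def training_step(pipeline, captioner_loss_fn, translator_loss_fn,
                  triplet_batch, caption_batch, translation_batch,
                  lambda_cap=1.0, lambda_trans=1.0):
    """One joint update: the end-to-end loss on the triplet data plus two
    auxiliary losses on the large paired datasets (rich-resource captions and
    rich-to-low translation pairs), which stabilize training."""
    images, low_res_targets = triplet_batch
    loss = pipeline(images, low_res_targets)
    loss = loss + lambda_cap * captioner_loss_fn(pipeline.captioner, caption_batch)
    loss = loss + lambda_trans * translator_loss_fn(pipeline.translator, translation_batch)
    return loss
```

In this reading, the end-to-end term on the triplet data bridges the training-testing gap, since the translator is trained on the captioner's own outputs rather than ground-truth rich-resource captions, while the two auxiliary terms anchor each model to its large paired dataset and keep the joint optimization stable.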