When Pairs Meet Triplets: Improving Low-Resource Captioning via Multi-Objective Optimization


Abstract

Image captioning for low-resource languages has attracted much attention recently. Researchers have proposed augmenting the low-resource caption dataset into (image, rich-resource-language caption, low-resource-language caption) triplets and developing a dual attention mechanism that exploits these triplets during training to improve performance. However, datasets in triplet form are usually small because they are costly to collect. On the other hand, many large-scale datasets already exist that contain one pair from the triplet, such as caption datasets in the rich-resource language and translation datasets from the rich-resource language to the low-resource language. In this article, we revisit the caption-translation pipeline of the translation-based approach so that training can exploit not only the triplet dataset but also the large-scale paired datasets. The pipeline is composed of two models: a captioning model for the rich-resource language and a translation model from the rich-resource language to the low-resource language. Unfortunately, it is not trivial to fully benefit from incorporating both the triplet dataset and the paired datasets into the pipeline, owing to the gap between the training and testing phases and to instability in the training process. We propose to jointly optimize the two models of the pipeline in an end-to-end manner to bridge the training-testing gap, and we introduce two auxiliary training objectives to stabilize training. Experimental results show that the proposed method significantly outperforms state-of-the-art methods.
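The abstract gives no implementation details, so the following is only a minimal sketch of how such a two-stage caption-translation pipeline might be trained end-to-end with auxiliary objectives. Everything here is an assumption for illustration, not the authors' actual method: the module names (RichCaptioner, Translator, joint_step), the toy architectures and shapes, the loss weights w_cap/w_trans, and the choice of a Gumbel-softmax relaxation as the differentiable bridge across the discrete intermediate caption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical toy modules standing in for the two pipeline stages; the paper's
# actual architectures (e.g., attention-based captioner/translator) are not
# specified in the abstract.

class RichCaptioner(nn.Module):
    """Image features -> token logits in the rich-resource language."""
    def __init__(self, feat_dim, vocab_rich, max_len=20):
        super().__init__()
        self.max_len = max_len
        self.proj = nn.Linear(feat_dim, vocab_rich)

    def forward(self, img_feats):
        # img_feats: (batch, feat_dim); toy behavior: same logits at every step
        logits = self.proj(img_feats)                             # (batch, vocab_rich)
        return logits.unsqueeze(1).expand(-1, self.max_len, -1)   # (batch, T, vocab_rich)

class Translator(nn.Module):
    """Rich-language token distributions -> low-resource token logits."""
    def __init__(self, vocab_rich, vocab_low, hidden=256):
        super().__init__()
        # A Linear over the vocabulary accepts *soft* one-hots, which is what
        # makes the end-to-end gradient path possible in this sketch.
        self.embed = nn.Linear(vocab_rich, hidden)
        self.out = nn.Linear(hidden, vocab_low)

    def forward(self, soft_tokens):
        # soft_tokens: (batch, T, vocab_rich) -- soft distributions, not hard ids
        h = torch.tanh(self.embed(soft_tokens)).mean(dim=1)  # toy pooling over time
        return self.out(h)  # (batch, vocab_low) -- toy: single-token output

def joint_step(captioner, translator, batch, optimizer, tau=1.0, w_cap=0.5, w_trans=0.5):
    """One multi-objective update. Gumbel-softmax bridges the discrete caption;
    the abstract does not name the relaxation, so this choice is an assumption."""
    img, rich_ref, low_ref = batch  # image feats, rich caption ids (B, T), low-resource id (B,)

    logits_rich = captioner(img)  # (batch, T, vocab_rich)
    # Differentiable 'decoding': soft one-hot samples instead of argmax, so
    # gradients from the low-resource loss can reach the captioner.
    soft_caption = F.gumbel_softmax(logits_rich, tau=tau, hard=False)

    logits_low = translator(soft_caption)
    loss_end2end = F.cross_entropy(logits_low, low_ref)

    # Auxiliary objectives keep each stage anchored to its own supervision and
    # stabilize training (per the abstract); in practice they would be computed
    # on the large paired datasets rather than on the same triplet batch.
    loss_cap = F.cross_entropy(logits_rich.flatten(0, 1), rich_ref.flatten())
    one_hot_rich = F.one_hot(rich_ref, logits_rich.size(-1)).float()
    loss_trans = F.cross_entropy(translator(one_hot_rich), low_ref)

    loss = loss_end2end + w_cap * loss_cap + w_trans * loss_trans
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with random data (purely illustrative):
captioner = RichCaptioner(feat_dim=512, vocab_rich=1000)
translator = Translator(vocab_rich=1000, vocab_low=800)
opt = torch.optim.Adam(list(captioner.parameters()) + list(translator.parameters()), lr=1e-4)
batch = (torch.randn(4, 512), torch.randint(0, 1000, (4, 20)), torch.randint(0, 800, (4,)))
print(joint_step(captioner, translator, batch, opt))
```

In a real system the translator would be sequence-to-sequence and the captioner autoregressive; the sketch collapses both to single-shot modules only to keep the multi-objective gradient flow visible.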



Published in

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 18, Issue 3
August 2022
478 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3505208

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 4 March 2022
      • Accepted: 1 October 2021
      • Revised: 1 July 2021
      • Received: 1 October 2020
Published in TOMM Volume 18, Issue 3


      Qualifiers

      • research-article
      • Refereed
