
NumCap: A Number-controlled Multi-caption Image Captioning Network

Published: 27 February 2023

Abstract

Image captioning is a promising task that has attracted considerable research attention in recent years. Existing image captioning models are primarily trained to generate one caption per image. However, an image may contain rich content that a single caption cannot fully describe. A better solution is to describe an image with multiple captions, each focusing on a specific aspect of the image. To this end, we introduce a number-controlled image captioning model that describes an image with multiple sentences. Since each image is annotated with multiple ground-truth captions, we assign an external number to each caption to mark its order. Given an image-number pair as input, the model produces different captions for the same image under different numbers. First, the number is attached to the image features to form an image-number vector (INV). Then, this vector and the corresponding caption are embedded using the order-embedding approach. Finally, the INV’s embedding is fed to a language model to generate the caption. To demonstrate the effectiveness of this number-incorporation strategy, we conduct extensive experiments on the MS-COCO, Flickr30K, and Flickr8K datasets. The proposed model attains 24.1 METEOR on MS-COCO. The results show that our method is competitive with a range of state-of-the-art models and confirm its ability to produce different descriptions under different given numbers.
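The pipeline sketched in the abstract can be made concrete with a small example. The following PyTorch code is a minimal, hypothetical illustration of the number-conditioning idea only: a caption-order number is embedded, attached to the image features to form an INV, and fed as the first input to an LSTM decoder. All module names, dimensions, and the decoder wiring are our own assumptions for exposition, not the authors' implementation, and the order-embedding training objective is omitted.

```python
# Minimal sketch (illustrative only): number-conditioned caption decoding.
# Dimensions, module names, and wiring are assumptions, not the paper's code.
import torch
import torch.nn as nn

class NumberConditionedCaptioner(nn.Module):
    def __init__(self, vocab_size, num_captions=5,
                 img_dim=2048, embed_dim=512, hidden_dim=512):
        super().__init__()
        # Embed the caption-order number (0 .. num_captions-1) as a learned vector.
        self.num_embed = nn.Embedding(num_captions, embed_dim)
        # Project concatenated image features + number embedding into the joint
        # space, forming the image-number vector (INV).
        self.inv_proj = nn.Linear(img_dim + embed_dim, embed_dim)
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feats, number, captions):
        # img_feats: (B, img_dim); number: (B,); captions: (B, T) token ids.
        inv = self.inv_proj(
            torch.cat([img_feats, self.num_embed(number)], dim=-1))
        # Feed the INV as the first "word", then the caption tokens
        # (teacher forcing during training).
        words = self.word_embed(captions[:, :-1])
        inputs = torch.cat([inv.unsqueeze(1), words], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)  # (B, T, vocab_size) next-token logits

# Usage: the same image features with different numbers give different
# conditioning vectors, hence different caption distributions.
model = NumberConditionedCaptioner(vocab_size=10000)
feats = torch.randn(2, 2048)
nums = torch.tensor([0, 3])              # two different caption indices
caps = torch.randint(0, 10000, (2, 12))
logits = model(feats, nums, caps)        # (2, 12, 10000)
```

Changing only the number input alters the decoder's conditioning vector, which is the mechanism that lets one image yield multiple distinct captions.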

REFERENCES

  1. [1] Al-Qatf Majjed, Wang Xingfu, Hawbani Ammar, Abdusallam Amr, and Alsamhi Saeed Hammod. 2022. Image captioning with novel topics guidance and retrieval-based topics re-weighting. IEEE Transactions on Multimedia (2022), 116. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  2. [2] Amirian Soheyla, Rasheed Khaled, Taha Thiab R., and Arabnia Hamid R.. 2020. Automatic image and video caption generation with deep learning: A concise review and algorithmic overlap. IEEE Access 8 (2020), 218386218400. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  3. [3] Anderson Peter, He Xiaodong, Buehler Chris, Teney Damien, Johnson Mark, Gould Stephen, and Zhang Lei. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 60776086. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  4. [4] Bae Ju-Won, Lee Soo-Hwan, Kim Won-Yeol, Seong Ju-Hyeon, and Seo Dong-Hoan. 2022. Image captioning model using part-of-speech guidance module for description with diverse vocabulary. IEEE Access 10 (2022), 4521945229. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  5. [5] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In Proceeding of the European Conference on Computer Vision. Springer, 740–755.Google ScholarGoogle Scholar
  6. [6] Cui Chaoran, Shen Jialie, Ma Jun, and Lian Tao. 2017. Social tag relevance learning via ranking-oriented neighbor voting. Multimedia Tools and Applications 76, 6 (2017), 88318857.Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. [7] Dai Bo, Fidler Sanja, Urtasun Raquel, and Lin Dahua. 2017. Towards diverse and natural image descriptions via a conditional GAN. In Proceedings of the 2017 IEEE International Conference on Computer Vision. 29892998. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  8. [8] Dash Sandeep Kumar, Acharya Shantanu, Pakray Partha, Das Ranjita, and Gelbukh Alexander. 2020. Topic-based image caption generation. Arabian Journal for Science and Engineering 45, 4 (2020), 30253034.Google ScholarGoogle ScholarCross RefCross Ref
  9. [9] Dutta Titir and Biswas Soma. 2019. Generalized zero-shot cross-modal retrieval. IEEE Transactions on Image Processing 28, 12 (2019), 59535962.Google ScholarGoogle ScholarCross RefCross Ref
  10. [10] Fu Jianlong and Rui Yong. 2017. Advances in deep learning approaches for image tagging. APSIPA Transactions on Signal and Information Processing 6 (2017), e11. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  11. [11] Gao Lianli, Li Xiangpeng, Song Jingkuan, and Shen Heng Tao. 2019. Hierarchical LSTMs with adaptive attention for visual captioning. IEEE Transactions on Pattern Analysis and Machine Intelligence 42, 5 (2019), 11121131.Google ScholarGoogle Scholar
  12. [12] Guo Longteng, Liu Jing, Lu Shichen, and Lu Hanqing. 2019. Show, tell, and polish: Ruminant decoding for image captioning. IEEE Transactions on Multimedia 22, 8 (2019), 21492162.Google ScholarGoogle ScholarCross RefCross Ref
  13. [13] Guo Rui, Ma Shubo, and Han Yahong. 2019. Image captioning: From structural tetrad to translated sentences. Multimedia Tools and Applications 78, 17 (2019), 2432124346.Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. [14] Han Zhimeng, Jian Muwei, and Wang Gai-Ge. 2022. ConvUNeXt: An efficient convolution neural network for medical image segmentation. Knowledge-Based Systems 253 (2022), 109512. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. [15] Huang Yiqing, Chen Jiansheng, Ouyang Wanli, Wan Weitao, and Xue Youze. 2020. Image captioning with end-to-end attribute detection and subsequent attributes prediction. IEEE Transactions on Image processing 29 (2020), 40134026. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. [16] Ji Junzhong, Xu Cheng, Zhang Xiaodan, Wang Boyue, and Song Xinhang. 2020. Spatio-temporal memory attention for image captioning. IEEE Transactions on Image Processing 29 (2020), 76157628. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. [17] Jian Muwei, Wang Jing, Yu Hui, Wang Guodong, Meng Xianjing, Yang Lu, Dong Junyu, and Yin Yilong. 2021. Visual saliency detection by integrating spatial position prior of object with background cues. Expert Systems with Applications 168 (2021), 114219. Google ScholarGoogle ScholarCross RefCross Ref
  18. [18] Jian Muwei, Wang Jiaojin, Yu Hui, and Wang Gai-Ge. 2021. Integrating object proposal with attention networks for video saliency detection. Information Sciences 576 (2021), 819830. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. [19] Jiang Weitao, Wang Weixuan, and Hu Haifeng. 2021. Bi-directional co-attention network for image captioning. 17, 4 (2021), 20 pages. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. [20] Karpathy Andrej and Fei-Fei Li. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 31283137.Google ScholarGoogle ScholarCross RefCross Ref
  21. [21] Kiros Ryan, Salakhutdinov Ruslan, and Zemel Richard S.. 2014. Unifying visual-semantic embeddings with multimodal neural language models. arXiv:1411.2539. Retrieved from https://arxiv.org/abs/1411.2539.Google ScholarGoogle Scholar
  22. [22] Li Linghui, Tang Sheng, Zhang Yongdong, Deng Lixi, and Tian Qi. 2017. GLA: Global–local attention for image description. IEEE Transactions on Multimedia 20, 3 (2017), 726737.Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. [23] Li Xuelong, Yuan Aihong, and Lu Xiaoqiang. 2021. Vision-to-language tasks based on attributes and attention mechanism. IEEE Transactions on Cybernetics 51, 2 (2021), 913926. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  24. [24] Liu Han, Zhang Shifeng, Lin Ke, Wen Jing, Li Jianmin, and Hu Xiaolin. 2021. Vocabulary-wide credit assignment for training image captioning models. IEEE Transactions on Image Processing 30 (2021), 24502460. Google ScholarGoogle ScholarCross RefCross Ref
  25. [25] Liu Maofu, Hu Huijun, Li Lingjun, Yu Yan, and Guan Weili. 2020. Chinese image caption generation via visual attention and topic modeling. IEEE Transactions on Cybernetics 52, 2 (2020), 1247–1257. Google ScholarGoogle ScholarCross RefCross Ref
  26. [26] Lu Huimin, Yang Rui, Deng Zhenrong, Zhang Yonglin, Gao Guangwei, and Lan Rushi. 2021. Chinese image captioning via fuzzy attention-based DenseNet-BiLSTM. ACM Transactions on Multimedia Computing, Communications, and Applications 17, 1s (2021), 18 pages. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. [27] Mao Yuzhao, Zhou Chang, Wang Xiaojie, and Li Ruifan. 2018. Show and tell more: Topic-oriented multi-sentence image captioning. In Proceedings of the 27th International Joint Conference on Artificial Intelligence.AAAI, 42584264.Google ScholarGoogle ScholarCross RefCross Ref
  28. [28] Pan Yingwei, Yao Ting, Li Yehao, and Mei Tao. 2020. X-linear attention networks for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1097110980.Google ScholarGoogle ScholarCross RefCross Ref
  29. [29] Qi Guo-Jun and Luo Jiebo. 2022. Small data challenges in big data era: A survey of recent progress on unsupervised and semi-supervised methods. IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 4 (2022), 21682187. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  30. [30] Shen Jialie, Wang Meng, and Chua Tat-Seng. 2016. Accurate online video tagging via probabilistic hybrid modeling. Multimedia Systems 22, 1 (2016), 99113.Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. [31] Shen Jialie, Wang Meng, Yan Shuicheng, and Hua Xian-Sheng. 2011. Multimedia tagging: Past, present and future. In Proceedings of the 19th ACM International Conference on Multimedia.Association for Computing Machinery, New York, NY, 639640. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. [32] Tang Jinhui, Chen Qiang, Wang Meng, Yan Shuicheng, Chua Tat-Seng, and Jain Ramesh. 2013. Towards optimizing human labeling for interactive image tagging. ACM Transactions on Multimedia Computing, Communications, and Applications 9, 4 (2013), 18 pages. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. [33] Ivan Vendrov, Ryan Kiros, Sanja Fidler, and Raquel Urtasun. 2016. Order-embeddings of images and language. In Proceedings of the 4th International Conference on Learning Representations. 1–12.Google ScholarGoogle Scholar
  34. [34] Vinyals Oriol, Toshev Alexander, Bengio Samy, and Erhan Dumitru. 2015. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 31563164.Google ScholarGoogle ScholarCross RefCross Ref
  35. [35] Wang Cheng, Yang Haojin, and Meinel Christoph. 2018. Image captioning with deep bidirectional LSTMs and multi-task learning. ACM Transactions on Multimedia Computing, Communications, and Applications 14, 2s (2018), 120.Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. [36] Wang Qingzhong and Chan Antoni B.. 2018. Cnn+ cnn: Convolutional decoders for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1–9.Google ScholarGoogle Scholar
  37. [37] Wang Shuo, Guo Dan, Xu Xin, Zhuo Li, and Wang Meng. 2019. Cross-modality retrieval by joint correlation learning. ACM Transactions on Multimedia Computing, Communications, and Applications 15, 2s (2019), 116.Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. [38] Wang Xin-Jing, Zhang Lei, Liu Ming, Li Yi, and Ma Wei-Ying. 2010. ARISTA - image search to annotation on billions of web photos. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 29872994. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  39. [39] Wei Haiyang, Li Zhixin, Huang Feicheng, Zhang Canlong, Ma Huifang, and Shi Zhongzhi. 2021. Integrating scene semantic knowledge into image captioning. ACM Transactions on Multimedia Computing, Communications, and Applications 17, 2 (2021), 122.Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. [40] Wu Hanjie, Liu Yongtuo, Cai Hongmin, and He Shengfeng. 2022. Learning transferable perturbations for image captioning. ACM Transactions on Multimedia Computing, Communications, and Applications 18, 2 (2022), 18 pages. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. [41] Wu Jie, Chen Tianshui, Wu Hefeng, Yang Zhi, Luo Guangchun, and Lin Liang. 2020. Fine-grained image captioning with global-local discriminative objective. IEEE Transactions on Multimedia 23 (2020), 24132427. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. [42] Wu Lingxiang, Xu Min, Wang Jinqiao, and Perry Stuart. 2020. Recall what you see continually using GridLSTM in image captioning. IEEE Transactions on Multimedia 22, 3 (2020), 808818. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  43. [43] Wu Qi, Shen Chunhua, Wang Peng, Dick Anthony, and Hengel Anton Van Den. 2017. Image captioning and visual question answering based on attributes and external knowledge. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 6 (2017), 13671381.Google ScholarGoogle ScholarCross RefCross Ref
  44. [44] Xiao Xinyu, Wang Lingfeng, Ding Kun, Xiang Shiming, and Pan Chunhong. 2019. Deep hierarchical encoder–decoder network for image captioning. IEEE Transactions on Multimedia 21, 11 (2019), 29422956.Google ScholarGoogle Scholar
  45. [45] Xu Kelvin, Ba Jimmy, Kiros Ryan, Cho Kyunghyun, Courville Aaron, Salakhudinov Ruslan, Zemel Rich, and Bengio Yoshua. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning. PMLR, 20482057.Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. [46] Xu Ning, Zhang Hanwang, Liu An-An, Nie Weizhi, Su Yuting, Nie Jie, and Zhang Yongdong. 2019. Multi-level policy and reward-based deep reinforcement learning framework for image captioning. IEEE Transactions on Multimedia 22, 5 (2019), 13721383.Google ScholarGoogle ScholarCross RefCross Ref
  47. [47] Yan Chenggang, Hao Yiming, Li Liang, Yin Jian, Liu Anan, Mao Zhendong, Chen Zhenyu, and Gao Xingyu. 2022. Task-adaptive attention for image captioning. IEEE Transactions on Circuits and Systems for Video Technology 32, 1 (2022), 4351. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. [48] Yang Liang, Hu Haifeng, Xing Songlong, and Lu Xinlong. 2020. Constrained LSTM and residual attention for image captioning. ACM Transactions on Multimedia Computing, Communications, and Applications 16, 3 (2020), 18 pages. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. [49] Yang Min, Liu Junhao, Shen Ying, Zhao Zhou, Chen Xiaojun, Wu Qingyao, and Li Chengming. 2020. An ensemble of generation-and retrieval-based image captioning with dual generator generative adversarial network. IEEE Transactions on Image Processing 29 (2020), 96279640. Google ScholarGoogle ScholarCross RefCross Ref
  50. [50] You Quanzeng, Jin Hailin, Wang Zhaowen, Fang Chen, and Luo Jiebo. 2016. Image captioning with semantic attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 46514659.Google ScholarGoogle ScholarCross RefCross Ref
  51. [51] Young Peter, Lai Alice, Hodosh Micah, and Hockenmaier Julia. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics 2, 02 (2014), 6778. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  52. [52] Yu Litao, Zhang Jian, and Wu Qiang. 2022. Dual attention on pyramid feature maps for image captioning. IEEE Transactions on Multimedia 24 (2022), 17751786. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  53. [53] Yu Niange, Hu Xiaolin, Song Binheng, Yang Jian, and Zhang Jianwei. 2019. Topic-oriented image captioning based on order-embedding. IEEE Transactions on Image Processing 28, 6 (2019), 27432754. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  54. [54] Yuan Jin, Zhang Lei, Guo Songrui, Xiao Yi, and Li Zhiyong. 2020. Image captioning with a joint attention mechanism by visual concept samples. ACM Transactions on Multimedia Computing, Communications, and Applications 16, 3 (2020), 22 pages. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  55. [55] Zhang Junxuan and Hu Haifeng. 2019. Deep captioning with attention-based visual concept transfer mechanism for enriching description. Neural Processing Letters 50, 2 (2019), 18911905.Google ScholarGoogle ScholarDigital LibraryDigital Library
  56. [56] Zhang Ji, Mei Kuizhi, Zheng Yu, and Fan Jianping. 2021. Integrating part of speech guidance for image captioning. IEEE Transactions on Multimedia 23 (2021), 92104. Google ScholarGoogle ScholarCross RefCross Ref
  57. [57] Micah Hodosh, Peter Young, and Julia Hockenmaier. 2013. Framing image description as a ranking task: data, models and evaluation metrics. Journal of Artificial Intelligence Research 47 (2013), 853–899. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  58. [58] Zhang Mingxing, Yang Yang, Zhang Hanwang, Ji Yanli, Shen Heng Tao, and Chua Tat-Seng. 2018. More is better: Precise and detailed image captioning using online positive recall and missing concepts mining. IEEE Transactions on Image Processing 28, 1 (2018), 3244.Google ScholarGoogle ScholarDigital LibraryDigital Library
  59. [59] Zhao Wentian, Wu Xinxiao, and Luo Jiebo. 2021. Cross-domain image captioning via cross-modal retrieval and model adaptation. IEEE Transactions on Image Processing 30 (2021), 11801192. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  60. [60] Zhou Lian, Zhang Yuejie, Jiang Yu-Gang, Zhang Tao, and Fan Weiguo. 2019. Re-caption: Saliency-enhanced image captioning through two-phase learning. IEEE Transactions on Image Processing 29 (2019), 694709. Google ScholarGoogle ScholarDigital LibraryDigital Library
  61. [61] Zia Usman, Riaz M. Mohsin, Ghafoor Abdul, and Ali Syed Sohaib. 2020. Topic sensitive image descriptions. Neural Computing and Applications 32, 14 (2020), 1047110479.Google ScholarGoogle ScholarDigital LibraryDigital Library


Published in

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 4 (July 2023), 263 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3582888
Editor: Abdulmotaleb El Saddik


Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 27 February 2023
• Online AM: 16 December 2022
• Accepted: 7 December 2022
• Revised: 17 October 2022
• Received: 18 April 2022
