Abstract
Image captioning is a promising task that has attracted considerable research interest in recent years. Existing image captioning models are primarily trained to generate a single caption per image. However, an image may contain rich content that one caption cannot fully express. A better solution is to describe an image with multiple captions, each focusing on a specific aspect of the image. To this end, we introduce a new number-based image captioning model that describes an image with multiple sentences. Since an image is annotated with multiple ground-truth captions, we assign an external number to each caption to distinguish its order. Given an image-number pair as input, the model can produce different captions for the same image under different numbers. First, a number is attached to the image features to form an image-number vector (INV). This vector and the corresponding caption are then embedded using the order-embedding approach. Afterward, the INV's embedding is fed to a language model to generate the caption. To show the effectiveness of the number-incorporation strategy, we conduct extensive experiments on the MS-COCO, Flickr30K, and Flickr8K datasets. The proposed model attains a METEOR score of 24.1 on MS-COCO. The results demonstrate that our method is competitive with a range of state-of-the-art models and validate its ability to produce different descriptions under different given numbers.
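The INV construction described above can be sketched in a few lines. The following is a minimal, hypothetical illustration only: the feature and embedding dimensions, the lookup-table embedding, and the function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Illustrative sketch of the image-number vector (INV): a caption-index
# "number" is embedded and attached to the image feature vector, so that
# different numbers condition the decoder to produce different captions
# for the same image. All dimensions below are assumptions.

FEATURE_DIM = 2048   # e.g. CNN image feature size (assumption)
NUM_EMBED_DIM = 16   # size of the number embedding (assumption)
MAX_CAPTIONS = 5     # MS-COCO provides five ground-truth captions per image

rng = np.random.default_rng(0)
# A small lookup table standing in for a learned number embedding.
number_embedding = rng.normal(size=(MAX_CAPTIONS, NUM_EMBED_DIM))

def build_inv(image_features: np.ndarray, number: int) -> np.ndarray:
    """Concatenate the embedding of `number` onto the image features."""
    return np.concatenate([image_features, number_embedding[number]])

image_features = rng.normal(size=FEATURE_DIM)
inv_0 = build_inv(image_features, 0)
inv_1 = build_inv(image_features, 1)

print(inv_0.shape)                    # (2064,)
print(np.array_equal(inv_0, inv_1))   # False: different numbers yield different INVs
```

The image part of the two vectors is identical; only the attached number embedding differs, which is what lets the downstream language model generate a distinct caption per number.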
NumCap: A Number-controlled Multi-caption Image Captioning Network