Abstract
Dense image captioning is the task of generating localized natural-language captions for multiple regions of an image. It draws on both computer vision, for recognizing regions in an image, and natural language processing, for generating captions. Numerous works on dense image captioning exist for resource-rich languages such as English; resource-poor languages such as Hindi, however, remain largely unexplored. Hindi is one of India's official languages and the third most spoken language in the world. This article proposes a dense image captioning model that describes different segments of an image by generating more than one caption in Hindi. For localized region recognition and language modeling, we employ Faster R-CNN and Long Short-Term Memory (LSTM), respectively. In addition, we conduct experiments with gated recurrent units (GRUs) and an attention mechanism. A dataset for dense image captioning in Hindi has been created by manually translating the well-known Visual Genome dataset from English to Hindi. Experiments on this newly constructed dataset demonstrate the efficacy of the proposed method over state-of-the-art methods.
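The pipeline the abstract describes, with region features from a detector such as Faster R-CNN feeding a recurrent caption decoder, can be sketched as a toy NumPy LSTM. Everything below (dimensions, random weights, the greedy decoding loop, the `decode_region` helper) is an illustrative assumption, not the authors' implementation; a real system would use trained detector and decoder weights and a Hindi vocabulary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step; the four gates are stacked as [input, forget, output, cell]."""
    z = W @ x + U @ h + b
    H = h.size
    i, f, o = sigmoid(z[:H]), sigmoid(z[H:2 * H]), sigmoid(z[2 * H:3 * H])
    g = np.tanh(z[3 * H:])
    c = f * c + i * g          # update the cell state
    h = o * np.tanh(c)         # expose the gated hidden state
    return h, c

def decode_region(feat, embed, W, U, b, W_out, bos=0, eos=1, max_len=5):
    """Greedily decode one caption (a list of token ids) for one region feature."""
    H = U.shape[1]
    h = np.tanh(feat[:H])      # initialize the state from the region feature
    c = np.zeros(H)
    tok, out = bos, []
    for _ in range(max_len):
        h, c = lstm_step(embed[tok], h, c, W, U, b)
        tok = int(np.argmax(W_out @ h))   # greedy choice over the vocabulary
        if tok == eos:
            break
        out.append(tok)
    return out

# Hypothetical sizes: vocab V, embedding E, hidden H, region-feature dim D.
rng = np.random.default_rng(0)
V, E, H, D = 20, 8, 16, 32
embed = rng.standard_normal((V, E)) * 0.1
W = rng.standard_normal((4 * H, E)) * 0.1
U = rng.standard_normal((4 * H, H)) * 0.1
b = np.zeros(4 * H)
W_out = rng.standard_normal((V, H)) * 0.1
feat = rng.standard_normal(D)  # stand-in for one Faster R-CNN region feature
caption_ids = decode_region(feat, embed, W, U, b, W_out)
```

Dense captioning repeats this decoding once per detected region, so an image yields several independent captions; swapping `lstm_step` for a GRU cell or adding attention over region features changes only the decoder internals.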
An Object Localization-based Dense Image Captioning Framework in Hindi