Abstract
In sequence-to-sequence modeling tasks such as image captioning, machine translation, and visual question answering, encoder-decoder architectures are the state of the art. In image captioning, an encoder, typically a convolutional neural network (CNN), encodes the input image into a fixed-dimensional vector representation, while a decoder, a recurrent neural network, performs language modeling and generates the target description. Conventional CNNs apply the same operation to every pixel, yet not all image pixels are equally important. To address this, the proposed method uses a dynamic convolution-based encoder for image encoding and feature extraction, a Long Short-Term Memory (LSTM) network as a decoder for language modeling, and X-Linear attention to make the system robust. Since the encoder, the attention mechanism, and the decoder are all central to image captioning, we experiment with several variants of each. Most existing work on image captioning targets the English language; we propose a novel approach for generating captions from images in Hindi. Hindi, widely spoken in South Asia, is the fourth most-spoken language globally and the official language of India. The proposed method applies dynamic convolution on the encoder side to obtain a better-quality image encoding. The Hindi image captioning dataset is created manually by translating the popular MSCOCO dataset from English to Hindi. The performance of the proposed method is compared with several baselines in terms of BLEU scores, and the results show that it outperforms them. Manual human assessment of the adequacy and fluency of the generated captions further confirms the efficacy of the proposed method in producing good-quality captions.
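To make the encoder idea concrete: one common formulation of dynamic convolution mixes several candidate kernels with attention weights computed from the input itself, so the effective filter adapts per input rather than being fixed. The following is a minimal pure-Python 1-D sketch of that kernel-mixing idea, not the paper's actual encoder; `gate_weights` is a hypothetical gating parameter that a real model would learn, and the global summary used for attention (the input mean) is likewise an illustrative simplification.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def conv1d(signal, kernel):
    """Valid (no-padding) 1-D convolution."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def dynamic_conv1d(signal, kernels, gate_weights):
    """Mix K candidate kernels with input-dependent attention, then convolve.

    `gate_weights` (one scalar per candidate kernel) is a stand-in for
    learned gating parameters; the attention conditions on a global
    summary of the input (here, simply its mean).
    """
    summary = sum(signal) / len(signal)
    alphas = softmax([w * summary for w in gate_weights])
    k_len = len(kernels[0])
    # Input-conditioned kernel: a convex combination of the candidates.
    mixed = [sum(a * kern[j] for a, kern in zip(alphas, kernels))
             for j in range(k_len)]
    return conv1d(signal, mixed)
```

With strongly separated gate weights, the attention collapses onto one candidate kernel, so `dynamic_conv1d([1, 2, 3, 4, 5], [[1, 0, 0], [0, 0, 1]], [5.0, -5.0])` effectively applies the first kernel and returns approximately `[1, 2, 3]`.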
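The evaluation relies on BLEU, which scores a candidate caption by its modified n-gram precision against a reference, scaled by a brevity penalty. As a reminder of what the metric measures, here is a minimal single-reference, sentence-level sketch; the original corpus-level BLEU of Papineni et al. aggregates counts over a whole test set, and the tiny-epsilon smoothing of zero precisions used here is a simplification of the standard smoothing variants.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU against a single reference (simplified smoothing)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        # Modified precision: clip candidate counts by reference counts.
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: punish candidates shorter than the reference.
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / max(len(candidate), 1))
    return bp * geo_mean
```

An exact match scores 1.0, and any n-gram mismatch or length shortfall pulls the score below 1.0, which is why BLEU is reported alongside human adequacy/fluency judgments rather than in place of them.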
Dynamic Convolution-based Encoder-Decoder Framework for Image Captioning in Hindi